Data-centric AI – It’s all about Data Reliability

Oz Levi, EVP, Technology & Innovation, matrixDnA

A look behind the scenes of how AI models are trained reveals a complex and expensive process, in which the data that models are trained on is perhaps the biggest challenge in developing GenAI-based corporate applications. In this article, we take a deep dive into the model training process, explore the hidden danger of invisible data disruptions, look at the rise of the Data Reliability Engineer profession, and tell a true story about one particular model that went mad!

Our day-to-day experience of generative artificial intelligence is nice and simple. I, for example, use ChatGPT and Gemini to create presentations, analyze and write code, draft articles, and even write short stories for my children. It’s easy and requires no effort on my part. However, the simplicity of using AI as a service is deceptive, because it leads managers to think that AI-based processes can be implemented and used in large organizations – which must be accountable to the regulator or their customers – with the same ease. But the truth is that, behind the scenes of these friendly apps, lie very complex processes.

 

Behind the scenes of model training

Large language models, and in particular Assistant models (like ChatGPT), are not created out of thin air. Their construction and training depend on a long, complex and particularly expensive process that makes use of basic building blocks called ‘tokens’. In general, the training process for a complex model, like the one that drives ChatGPT, includes four main stages:

Pre-Training – at this stage, training is carried out on a huge data set to create a base model, which is still very far from the model behind ChatGPT. The training process is complex and long, and requires enormous computing power. At the beginning of 2023, details were leaked about the training process of Meta’s LLaMA model – the largest version, with about 65 billion parameters, was trained over about 21 days on 2,048 A100 GPUs, at a cost of about 5 million dollars. And this is still only a base model. For comparison, GPT-3.5 (OpenAI’s older model) contains 175 billion parameters.
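As a rough sanity check on those leaked figures, the implied price per GPU-hour can be estimated; the calculation below is a back-of-the-envelope sketch, and the per-hour rate it derives is an inference, not a number from the leak:

```python
# Back-of-the-envelope check of the reported LLaMA training figures
gpus = 2048                  # A100 GPUs reportedly used
days = 21                    # reported training duration
total_cost_usd = 5_000_000   # reported total cost

gpu_hours = gpus * days * 24
implied_rate = total_cost_usd / gpu_hours
print(f"{gpu_hours:,} GPU-hours -> ~${implied_rate:.2f} per GPU-hour")
# ~1,032,192 GPU-hours -> roughly $4.8 per GPU-hour, in the ballpark of typical A100 cloud pricing
```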

This is where the magic happens. The model learns general representations that help it capture the structure of the language and the relationships between different words. It is important to note that a language model does not really know what words are; it has no concept of them. In fact, the model is trained on tokens, which are numerical representations (integers) of pieces of text – characters, sub-words or whole words. Base models perform only one task: predicting the statistically most probable next token.
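To make the idea of tokens concrete, here is a minimal sketch using the open-source tiktoken library; the specific encoding name is an assumption, and different models use different tokenizers:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI models; other models use other tokenizers
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Data reliability is the heart of GenAI.")
print(tokens)              # a list of integers – the only thing the model actually "sees"
print(enc.decode(tokens))  # mapping the integers back recovers the original text
```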

 

Going up a level: an Assistant model that knows how to answer questions

Supervised Fine-Tuning – at this stage, the base model undergoes additional training, with the aim of creating an Assistant model that is skilled at answering questions, not just completing documents. Although a base model can be asked to behave like an Assistant model through prompting, it is still not ‘smart’ enough to act as one without this Supervised Fine-Tuning step. Here, the model’s capabilities can also be focused on a particular domain (such as finance). This is essentially the stage at which the model is given thousands of examples that help it understand what is expected of it, and what the best answer it can give looks like, through a technique that breaks each answer and its various parts down into elements.
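For illustration only, a fine-tuning dataset is often just a file of prompt–response pairs; the fields and examples below are hypothetical and do not reflect the exact format used by any specific provider:

```python
import json

# Hypothetical supervised fine-tuning records: each pairs a user prompt with the ideal answer
sft_examples = [
    {"prompt": "Summarize the attached quarterly report in three bullet points.",
     "response": "- Revenue grew 12% year over year\n- Operating costs fell 3%\n- Guidance for Q4 was raised"},
    {"prompt": "Explain what a loss function is to a new analyst.",
     "response": "A loss function measures how far the model's predictions are from the true values..."},
]

# Such examples are typically stored as JSONL, one training example per line
with open("sft_examples.jsonl", "w", encoding="utf-8") as f:
    for example in sft_examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```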

 

The next step: artificial intelligence experiments on humans (not really, but it turns out that we are rather good at improving model performance)

Reward Modeling – at this stage, humans are brought into the picture to improve the model’s performance. As it happens, we are very good at choosing the best or most correct answer from several candidates. These human preferences make it possible to create a loss function that is used in the next step of model training. At its core, a loss function measures how well the results and predictions of a Machine Learning model match actual reality – a quantification of how far the model’s predictions are from the true values. Or, put more simply, ‘the difference between what is and what should be’.
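A common way to turn “humans preferred answer A over answer B” into a loss is a pairwise formulation; the sketch below is a simplified illustration in numpy, not the exact loss used by any particular provider:

```python
import numpy as np

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss is small when the reward model scores the human-preferred answer higher."""
    # -log(sigmoid(chosen - rejected)): the classic pairwise preference loss
    return -np.log(1.0 / (1.0 + np.exp(-(score_chosen - score_rejected))))

print(pairwise_reward_loss(2.0, -1.0))  # preferred answer scored higher -> low loss (~0.05)
print(pairwise_reward_loss(-1.0, 2.0))  # preferred answer scored lower  -> high loss (~3.05)
```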

Reinforcement Learning – at this stage, positive behaviors (or correct tokens) are reinforced according to the loss function that was created, while the likelihood of unwanted behaviors (which are not reinforced at all) is reduced. In this way, the model is adjusted to produce optimal results against the loss function definition.
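Production systems typically use algorithms such as PPO for this step; the toy REINFORCE-style sketch below only illustrates the core idea of nudging the model toward rewarded outputs, and the two-token “policy” is entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)      # toy "policy": a preference over two candidate tokens
learning_rate = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(logits)
    token = rng.choice(2, p=probs)        # sample a token from the current policy
    reward = 1.0 if token == 1 else -1.0  # stand-in for the learned reward model's score
    grad_log_p = -probs
    grad_log_p[token] += 1.0              # gradient of log p(token) with respect to the logits
    logits += learning_rate * reward * grad_log_p  # reinforce rewarded tokens, suppress the rest

print(softmax(logits))  # probability mass has shifted toward the rewarded token
```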

Now that we understand what is required to develop a language model, we can also understand that not every organization is able to do it. In-depth, specialized knowledge is required, as well as huge financial resources. Therefore, most large language models are used through an API, and most cloud providers and several startups already offer services that enable this. At the same time, the open-source world has not remained indifferent, and today there are many very high-quality open-source models that can be used to implement AI-based solutions.
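Consuming a model through an API usually boils down to a few lines of code; this sketch assumes the OpenAI Python SDK, the model name is a placeholder, and the same pattern applies to other providers:

```python
# pip install openai   (the API key is read from the OPENAI_API_KEY environment variable)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder – substitute whichever model your provider offers
    messages=[
        {"role": "system", "content": "You are a helpful financial-domain assistant."},
        {"role": "user", "content": "Summarize the main risks in our Q3 report."},
    ],
)
print(response.choices[0].message.content)
```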

 

Under the RAG – The data-centric model – data takes center stage

Copernicus spent his final days trying to convince the world that the Earth was not at the center of the solar system. Nowadays, everyone is talking about AI, and it seems as though everything revolves around the models. But it is becoming increasingly clear that AI models, however complex and smart they may be, are not the core – it is the data on which they are trained, together with the data used to answer questions, that lies at the heart of the matter.

Organizations that want to use AI models on top of their own data are likely familiar with the term RAG (Retrieval-Augmented Generation) – a technique that “hooks” organizational data up to a model so that it can respond based on that specific data. But they must also address other organizational and technological questions: Who is responsible for data quality, and how is it measured? How do you verify the data’s reliability? And how do you measure the correctness of the answers that generative models give to users – a domain commonly referred to as AI Governance?
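At its simplest, RAG means retrieving the most relevant internal documents for a question and placing them in the prompt. The sketch below uses TF-IDF retrieval from scikit-learn purely for illustration – real systems typically use vector embeddings and a vector database – and the documents are hypothetical:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical internal documents the model was never trained on
documents = [
    "Refund policy: customers may return products within 30 days of purchase.",
    "Q3 report: revenue grew 12% while churn rose slightly in the SMB segment.",
    "Security policy: API keys must be rotated every 90 days.",
]
question = "How long do customers have to return a product?"

# Retrieval step: rank documents by similarity to the question
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
best_doc = documents[scores.argmax()]

# Augmentation step: the retrieved context is placed in the prompt sent to the model
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)
```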

Data Products – keeping a watchful eye on your data

One of the most important concepts in the world of data today is that of data products – applications or tools that are based on data and made available to the organization as a product. Behind the scenes of any such data product – be it an AI model or even a corporate dashboard – many infrastructures and processes work together. A significant malfunction can manifest itself as information that does not arrive on time, dissatisfied customers and loss of income. But the really complex faults are not necessarily the ones that bring processes down; rather, they are the faults, sometimes difficult to locate, that quietly introduce changes and inaccuracies into the data. By the time this data meets our AI model – whether in training or in production – it is already too late. This is what we call data downtime.

In order to measure data quality and be aware of changes (drift) in the information, organizations need to develop capabilities and work processes, and even make use of dedicated products from the world of data quality. These measurement tools and processes fall under an emerging category called data observability, with relevant platforms including MonteCarloData, BigEye, Elementary Data and others. But using monitoring tools and infrastructure by itself is not enough; a trusted professional is required to handle such issues and respond when there are data failures.
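In practice, observability starts with simple automated checks on volume, completeness and validity; the pandas sketch below shows a few such checks, with hypothetical thresholds, column names and data:

```python
import pandas as pd

def basic_data_checks(df: pd.DataFrame, min_rows: int = 3, max_null_rate: float = 0.01) -> list[str]:
    """Return a list of data-quality issues; thresholds and column names are illustrative."""
    issues = []
    if len(df) < min_rows:                                   # volume check: did today's load arrive in full?
        issues.append(f"row count {len(df)} is below the expected minimum of {min_rows}")
    for column, rate in df.isna().mean().items():            # completeness check per column
        if rate > max_null_rate:
            issues.append(f"column '{column}' has {rate:.1%} null values")
    if "amount" in df.columns and (df["amount"] < 0).any():  # simple validity check on a business field
        issues.append("negative values found in 'amount'")
    return issues

# Hypothetical slice of a data product's daily load
daily_load = pd.DataFrame({"customer_id": [1, 2, None], "amount": [120.0, -5.0, 300.0]})
for issue in basic_data_checks(daily_load):
    print("DATA ALERT:", issue)  # in a real pipeline this would notify the on-call engineer
```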

 

Is it a plane? Is it a bird? No, it’s a Data Reliability Engineer!

A new field that is taking its first steps in the world, and already exists in organizations such as Uber, Disney and others, is that of Data Reliability Engineering. DREs are data engineers who specialize in the reliability of the organization’s data and infrastructure. They work with data observability systems and are responsible for four main areas:

  • Data quality
  • Pipeline monitoring
  • Performance optimization
  • Incident response

With the help of DREs, organizations can significantly reduce the time taken to detect a fault (time to detection), get to the root cause of the fault (root cause analysis), and handle faults within a measured SLA (time to resolution).
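These reliability metrics are straightforward to compute once incidents are logged; the sketch below derives mean time to detection and mean time to resolution from a hypothetical incident log:

```python
from datetime import datetime

# Hypothetical incident log: when the fault started, when it was detected, when it was resolved
incidents = [
    {"started": datetime(2024, 3, 1, 2, 0), "detected": datetime(2024, 3, 1, 8, 30), "resolved": datetime(2024, 3, 1, 11, 0)},
    {"started": datetime(2024, 3, 7, 14, 0), "detected": datetime(2024, 3, 7, 14, 20), "resolved": datetime(2024, 3, 7, 16, 0)},
]

hours = lambda delta: delta.total_seconds() / 3600
mttd = sum(hours(i["detected"] - i["started"]) for i in incidents) / len(incidents)
mttr = sum(hours(i["resolved"] - i["detected"]) for i in incidents) / len(incidents)
print(f"Mean time to detection: {mttd:.1f}h, mean time to resolution: {mttr:.1f}h")
```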

These days, many startups are working on ways to make the corporate AI experience accessible as a service, and internally-developed enterprise applications are growing as well. All this is happening while knowledge about working with language models is still evolving rapidly. New roles, or new areas of responsibility within existing professions, are expected to emerge – from prompt engineering, to implementing governance processes over model answers, to managing the data that the models produce.

In the last year, studies have been published almost daily that shed new light on model training processes and GenAI technology. Important examples include one that examines behavioral changes in ChatGPT and describes how the quality of its outputs has decreased by about 35% in recent months; and a second that examines a phenomenon called Model Autophagy Disorder (MAD for short), which occurs when a GenAI model is trained on synthetic data. These studies emphasize the importance of basing models and applications on clean, high-quality data.

The application of GenAI models is not a quick fix, but rather a long-haul journey. There is still much to learn and discover, and the field is developing every day at a dizzying pace. Organizations that want to integrate GenAI capabilities should be aware of this, and make sure they do so with the guidance of professionals who have in-depth knowledge and can catapult the organization forward in a safe and responsible manner.
