
Validating GenAI Models

Using model risk management to see through the ‘magic trick’

All GenAI applications are sources of model risk. For Financial Services firms, this means regulations require such applications to be subject to Model Risk Management (MRM) controls, a key one being validation, prior to approval and use.

Many CROs (accountable for model risk frameworks) and Heads of MRM are working out how best to validate GenAI models. This article frames the problem and offers solutions based on our (Deloitte Model Risk Management) market-leading approaches and recent hands-on experience.

Audience: CROs, Heads of AI programmes, AI Leads, Model Developers, Model Validators, and Model Risk Management professionals.


At a glance

Validation of AI models1 is necessary, both to effectively manage model risk and to meet regulatory requirements.

Validation of Generative AI models (of which Large Language Models (LLMs) are the most prominent example) is particularly challenging for many reasons including the structure and complexity of the underlying core models.

A necessary but significant change in validation mind-set is emerging, with less focus on validating underlying methodologies and increased attention on validating uses of models.

As developers innovate and models become more sophisticated, validators must keep pace. Leveraging some well-established and refined principles (see: Adapting model validation in the age of AI), validators are already increasing their hands-on experience to develop new techniques that improve the Board, Executive Management and model users’ understanding of the risks of using LLMs.
 

LLMs do not give perfect outputs, they are models, and are sources of model risk
 

“All models have some degree of uncertainty and inaccuracy.”

"[Model risk] is the potential for adverse consequence from decisions based on incorrect or misused model outputs and reports.“

"Because model risk is ultimately borne by the bank as a whole, the bank should objectively assess model risk and the associated costs and benefits [of models] using a sound model-validation process.”

– US Regulators
 

Introduction

Regulatory expectations around the use of Artificial Intelligence (AI) models continue to evolve. The EU AI Act came into force at the start of August 2024, and imposes clear constraints and expectations on companies—including financial institutions—developing and using AI models. The UK regulator’s SS1/23 and the accompanying Policy Statement (PS6/23) are clear that AI uses are captured as models and are therefore subject to the Model Risk Management (MRM) expectations the PRA has published2.

Large Language Models (LLMs) are increasingly being assessed for—or are already used in—a range of use cases in banks, including: chatbots for customer service, customer and market research, topic summaries, fraud detection, anti-money laundering processing, and even more esoteric areas such as trading strategies. This wide range of potential LLM use cases, and ensuring validators have the “requisite technical expertise and sufficient familiarity with the line of business using the model” (SS1/23), pose key challenges for banks, with some already carrying out skills gap analysis (against the demands of the inventory), rolling out training programmes and hiring in an increasingly competitive market.

A further fundamental challenge for model risk managers is how to adapt and enhance existing MRM frameworks. The framework needs to capture additional risks both in the aggregate and at the model-use level; these are often the outputs of validation. In the aggregate, there are both requirements to incorporate LLMs into existing model reporting, and typically a demand to provide the Board and Executive Management with information about:

  • How and where LLMs are used;
  • The decisions LLMs support; and
  • The risks LLMs present to the bank.

At the use level—where we have carried out LLM validations for several global banks—we see greater emphasis being placed on validators to help model users better understand the capabilities and limitations of the model: in what circumstances we expect the LLM to work well, and when it is likely to produce less complete, reliable and accurate outputs.

This presents several challenges when compared against “traditional” model validation approaches, not least the potentially challenging mind-set shift from validating the inner workings of a model to validating the specific use case of a model.

The validator’s challenges include (but are not limited to)
 

“A person that memorises all the cookbooks in the world, but has never been in a kitchen, is still not a chef”
 

Inconsistent Responses: LLMs do not “understand” queries; they process and respond to them based on context, word order and the LLM’s own internal processes (e.g., ‘tokenisation’, which governs the way LLMs break text down into smaller units or ‘tokens’), continuous training, and settings. As a result, they can produce inconsistent outputs: different answers when the same query is worded differently, and even different answers when the same question is asked repeatedly. This presents a challenge for traditional validation processes, a core expectation of most traditional models being that the same inputs should return the same outputs.
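
As an illustration of how that inconsistency can be measured, here is a minimal sketch of a repeat-query consistency check, assuming access to an OpenAI-compatible chat API via the `openai` Python package; the model name, prompt and run count are illustrative placeholders, not recommendations.

```python
# Minimal repeat-query consistency check (a sketch, not a full validation test).
from collections import Counter

from openai import OpenAI  # assumes the `openai` package and an API key are available

client = OpenAI()

PROMPT = "In one sentence, what is the main driver of model risk in banks?"
N_RUNS = 20

responses = []
for _ in range(N_RUNS):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.7,       # non-zero temperature, so sampling is stochastic
    )
    responses.append(completion.choices[0].message.content.strip().lower())

# Exact-match consistency: the share of runs agreeing with the most common answer.
modal_answer, count = Counter(responses).most_common(1)[0]
print(f"Modal answer: {modal_answer!r}, consistency: {count / N_RUNS:.0%}")
```

In practice a validator would repeat this across a representative set of queries and paraphrases, and report the spread of answers rather than a single number.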

Generative AI (GenAI), of which Large Language Models (LLMs) are the most prominent example, is a sophisticated form of ML which synthesises new content based on instructions from a user, for example, generating summaries of news stories to a similar standard as a human analyst.

LLMs are trained on massive datasets of text and code, encompassing books, articles, websites, and more. This data is fed into complex algorithms, which identify patterns and relationships within the text, effectively teaching the LLM grammar, vocabulary, and even different writing styles, all at once. The learning process involves predicting the next word in a sequence, given preceding words. Through countless iterations of this exercise, the model refines its processing of language and its nuances. However, it is important to remember that, like any form of AI, LLMs do not "understand" the information they process. They rely solely on patterns they extract from the data. LLMs can generate human-like text, translate languages and answer questions in an apparently informative way, but they are not intelligent per se.
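
To make the “predict the next word” objective concrete, below is a minimal sketch using the openly available GPT-2 model from the Hugging Face `transformers` library; production LLMs are far larger, but the mechanism is the same.

```python
# Minimal next-token-prediction sketch using GPT-2 (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The bank reported a rise in quarterly"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, sequence_length, vocab_size)

# The model assigns a probability to every token in its vocabulary for the next
# position; the highest-probability tokens are its "predictions".
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```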

Defining "Good" Performance: LLMs can generate highly coherent and fluent text, but defining what constitutes "good" can be difficult, especially where technical and/or topic-specific knowledge is needed. Metrics like “BLEU (BiLingual Evaluation Understudy) scores”, which measure the similarity between generated text and human-written text, can help, but can also be misleading: an output can score well on BLEU yet still fail to answer the query.
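
A minimal sketch of that caveat, using the `sacrebleu` package with invented texts: a near-copy of the reference that gets a key fact wrong can out-score a correct answer that is merely worded differently.

```python
# BLEU caveat sketch: surface overlap is not the same as answering correctly.
import sacrebleu

reference = ["Net income increased by 5% in Q3, driven by trading revenue."]

# Candidate A answers correctly but shares few exact n-grams with the reference.
candidate_a = "Yes, third-quarter net income rose 5% on stronger trading revenue."
# Candidate B copies the reference almost verbatim but gets the quarter wrong.
candidate_b = "Net income increased by 5% in Q2, driven by trading revenue."

for label, cand in [("A (correct, reworded)", candidate_a),
                    ("B (near-copy, wrong quarter)", candidate_b)]:
    print(f"Candidate {label}: BLEU = {sacrebleu.sentence_bleu(cand, reference).score:.1f}")

# Candidate B scores far higher despite being factually wrong, which is why
# BLEU alone can be misleading for validation.
```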

Factual Accuracy and Grounding: LLMs can generate plausible-sounding, but factually incorrect information. LLMs produce responses based on patterns and relationships between words and phrases encountered in their training data; they do not ground their responses in real-world experience. A person that memorises all the cookbooks in the world, but has never been in a kitchen, is still not a chef. LLMs similarly lack real-world context, experience and understanding in their outputs. And, if a topic is poorly covered or misrepresented in the training data, this will likely be reflected in the outputs. As a result, a feature of LLMs is that they can “hallucinate”, or more precisely, make seemingly illogical connections and produce inaccurate outputs. Validating the factual accuracy of LLM outputs is crucial, especially in use cases where returning correct responses is vital.

An example of inaccuracy is that if you ask an LLM “how many L’s are there in the word HALLUCINATE”, it will often answer that there is only one. The LLM does not look at the word HALLUCINATE and count the instances of the letter L. It breaks the question down into parts, analyses the parts, and then returns the response most likely to be correct based on its parsing of the query and its tokenisation and prediction process. Validating the uses of LLMs helps users understand these sorts of model weaknesses.
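
The tokenisation step can be made visible with the `tiktoken` package, as in this minimal sketch; the exact splits vary by model, but the point is that the model works over sub-word chunks, not letters.

```python
# Tokenisation sketch: the model sees sub-word tokens, not individual letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("HALLUCINATE")
print([enc.decode([tid]) for tid in token_ids])
# The output is a handful of sub-word chunks rather than eleven letters, so
# "counting the L's" is not an operation the model ever performs directly.
```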

Interpretability and Explainability: LLMs are considered "black boxes" due to their complex internal workings and model architecture. Understanding why an LLM generates a particular output is difficult, making it challenging to identify the root cause of some errors or biases.

Data Requirements and Computational Costs: Training and evaluating LLMs require massive, high-quality datasets and significant computational resources, and commercial access to LLMs carries a cost, usually based on the volume of queries processed. This can limit the accessibility of LLM development and validation to organisations with substantial resources.

Bias and Fairness: LLMs are trained on massive datasets, which can contain bias in the underlying data3. This can lead to models exhibiting biases in their output, potentially perpetuating harmful stereotypes or discrimination. Ensuring fairness and avoiding bias is not a new problem for model risk managers, but with LLMs, identifying the underlying source of bias, and then mitigating it, is likely to be more difficult.
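
One simple probe, sketched below under the assumption of an OpenAI-compatible chat API, is a counterfactual prompt pair in which only a protected attribute is changed; the template and name pairs are illustrative and would need to reflect the bank’s own fairness criteria.

```python
# Counterfactual prompt-pair bias probe (sketch; names and template are illustrative).
from openai import OpenAI

client = OpenAI()

TEMPLATE = "In one sentence, describe the leadership qualities of {name}, a newly appointed CEO."
NAME_PAIRS = [("James", "Sarah"), ("Mohammed", "Emily")]  # only the name changes

for name_a, name_b in NAME_PAIRS:
    for name in (name_a, name_b):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model name
            temperature=0,
            messages=[{"role": "user", "content": TEMPLATE.format(name=name)}],
        ).choices[0].message.content
        print(f"{name}: {reply}")

# A reviewer (or a scoring model) then checks whether the descriptions differ
# systematically when only the name is changed.
```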

Validation Tests: Most traditional tests and techniques do not fully capture an LLM's performance:

  • Data: “Ground truths” (correct and accurate data, required for validation) may not exist, and where they do, they are not necessarily easy to compare against the model’s output (e.g., if the response is worded in a different manner).
  • ‘Task’ level, not data level testing: For testing LLMs, there is a limited concept of ‘in-sample’ data used for development and ‘out-of-sample’ data used for validation. Rather, there are ‘in-sample tasks’ that the model has been trained on, or is well-suited for, and ‘out-of-sample tasks’ it has not been trained on. However, as most banks’ applications use pre-trained, commercially available LLMs that have been trained on a wide range of tasks, determining ‘out-of-sample tasks’ is challenging.


The validator’s solutions include (but are not limited to)
 

Resolving these challenges is crucial to ensure the risks inherent in the use of LLMs are properly understood and can be managed. New approaches and techniques are needed for validating LLMs, and while these are evolving at pace, regulatory comfort (where needed) may take time to earn.

For LLMs, the overall validation objective moves from a traditional “is the model correct/accurate?” view to an approach that reports to the business (model owner and users), for their consideration, the level of model risk posed by using the model compared to the potential benefits the model provides. This is not only more appropriate for financial models in an SS1/23 MRM environment, but also for GenAI models, for which ‘right’ or ‘wrong’ is not as easy to quantify. This is not to say that the accuracy or correctness of LLM outputs is not important—in many banking use cases it may be fundamental—but the approach to determining whether the model is ‘fit for purpose’ needs to put greater emphasis on the use risk, and on the importance of users understanding the uncertainty in the outputs when using them to make business decisions.

Techniques are evolving fast, and unlike LLMs, human validators do have real-world experience. Milking our analogy, validators are like chefs: the more time spent in the kitchen, the more problems we need to solve and the better we become at finding solutions. Some of the more popular techniques include:

  • BLEU/ROUGE/METEOR: measure the surface similarity between machine-generated and reference text, with each metric placing emphasis on different aspects of the text.
  • BERTScore: measures the similarity between machine-generated and reference text. It uses text embeddings from BERT to better capture semantic similarity (a sketch contrasting ROUGE-L and BERTScore follows this list).
  • QAG (Question-Answer Generation): uses an LLM to generate questions on one text and answer them with another (e.g., before and after summarisation). Captures the transfer/loss of key relevant facts.
  • G-Eval: prompts an LLM to use Chain-of-Thought (CoT) reasoning to evaluate an output, planning and reflecting on its reasoning process.
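
As flagged in the list above, here is a minimal sketch contrasting a surface-overlap metric (ROUGE-L, via the `rouge-score` package) with an embedding-based one (BERTScore, via the `bert-score` package); the texts are invented placeholders.

```python
# Surface overlap vs semantic similarity (sketch with invented texts).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The committee raised the base rate by 25 basis points to curb inflation."
candidate = "Rates went up a quarter point as policymakers moved to tame inflation."

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)
print(f"ROUGE-L F1:   {rouge['rougeL'].fmeasure:.2f}")  # low: little exact word overlap

precision, recall, f1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {f1.item():.2f}")                 # higher: the meaning is similar
```

Neither number says whether the candidate is factually correct; task-specific checks such as QAG are still needed alongside these metrics.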

LLM validation approaches will be as varied as the use cases to which the models are put, and each instance will need careful design and execution. Based on some of our recent engagements supporting banks in validating GenAI models, here are some of the latest suggestions around the design of a validation regime for LLMs:

1. Design and incorporate validation into processes from the ground up

Validating GenAI models is a new field, and consensus around best practice is still emerging. Our experience points both to the importance of being flexible and of ensuring that, where possible, validation ‘tollgates’ are considered throughout the development process. Pragmatism and better communication between the lines of defence—while ensuring validators maintain sufficient independence to offer an objective, unbiased and critical opinion—will enable easier and better validation and encourage more efficient model lifecycles.

2. Design and develop appropriate, specific measures for LLM assessment and validation

Given the breadth of uses to which LLMs are being put, including complex and critical applications, it is important to develop robust and specific measures for their assessment and validation. Generic metrics can fail to capture the nuances of LLM performance in some tasks. Appropriate measures, tailored to the intended use case, are essential for ensuring LLMs behave as expected, produce reliable outputs, and avoid unintended consequences. While this field is evolving rapidly, establishing clear standards and approaches remains an ongoing effort.

In some cases, this will involve using an LLM to validate an LLM. It is possible to task one LLM with running a separate set of checks that a target LLM is performing the core task correctly4. For all but the simplest models, this option will be necessary to achieve the scale and coverage required for a robust validation. Using LLMs to evaluate the performance of other AI/ML models is promising, but not without its challenges. These “evaluator” model uses need to be carefully tailored to specific situations and may require significant computing resources. It should be noted that, when completing the model inventory, the use of an LLM process to validate another LLM process would have to be captured as a model, which would itself require proportionate validation. In addition, LLM-based evaluation methods require a degree of “human-in/over-the-loop”. However, overall, this approach demands significantly less manual involvement than the “fully human” approaches described below.
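
Below is a minimal sketch of such an evaluator, assuming an OpenAI-compatible chat API; the prompt wording, model name and 1-5 scale are illustrative assumptions rather than a prescribed rubric.

```python
# LLM-as-evaluator sketch: one model scores another model's summary.
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are validating another model's output.

Source document:
{source}

Model-generated summary:
{summary}

Score the summary from 1 (unusable) to 5 (faithful and complete) and list any
statements in the summary that are not supported by the source.
Respond as JSON with keys "score" and "unsupported_statements"."""

def judge_summary(source: str, summary: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",                           # illustrative "evaluator" model
        temperature=0,                            # reduce scoring variability
        response_format={"type": "json_object"},  # ask for machine-readable output
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
    )
    return json.loads(response.choices[0].message.content)
```

Scores from such a judge would typically be aggregated across a test set and sampled for human review, rather than relied on query by query.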

Another aspect of the validation process for GenAI usage is a human to LLM check. Having a suitably qualified person undertake a spot/sample check on the outputs generated by the LLM is one way of ensuring that outputs are relevant. The extent to which this enables the firm to meet the broad regulatory expectation of a “human-in-the-loop” will depend on the importance or purpose to which the model is put. Hence, the more important the purpose, the greater the likelihood of needing a more complete human review of the LLM’s outputs.
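
As a simple illustration of such a spot/sample check, the sketch below draws a reproducible review sample; the sample size and the `flagged_for_review` field are illustrative assumptions.

```python
# Drawing a reproducible spot-check sample of LLM outputs for human review (sketch).
import random

def draw_review_sample(outputs: list[dict], sample_size: int = 25, seed: int = 2024) -> list[dict]:
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    # Items already flagged (e.g. translations, low-confidence outputs) are always reviewed.
    flagged = [o for o in outputs if o.get("flagged_for_review")]
    remainder = [o for o in outputs if not o.get("flagged_for_review")]
    return flagged + rng.sample(remainder, min(sample_size, len(remainder)))
```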

Where possible, a further validation check assesses if the LLM has been trained on the right data. Training cut-offs in the data may deliver inaccurate results (e.g., using a model whose data period cuts off before a new piece of regulation came into force). Validation teams should review the training data, ensuring it is relevant and complete and that the training of the LLM is being kept up-to-date with evolving expectations of the underlying process. Given most users of LLMs will use a third-party model, it will be crucial to ensure that model validation is considered up front when assessing LLM providers.

3. Recognise the importance of Ongoing Performance Monitoring (OPM)

Continuous performance monitoring is important for managing risks associated with LLMs, especially when testing data is limited. This involves regular checks and adjustments to maintain accuracy, potentially using other models in combination with human oversight for evaluation. Designing the model performance monitoring during model development provides the best chance of alignment with appropriate performance metrics.
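
As one possible shape for such monitoring, the sketch below maintains a rolling average of evaluation scores against an alert threshold; the window size and threshold are illustrative and would be calibrated to the use case during development.

```python
# Rolling ongoing-performance-monitoring check (sketch; thresholds are illustrative).
from collections import deque

class RollingMonitor:
    def __init__(self, window: int = 200, alert_threshold: float = 0.80):
        self.scores = deque(maxlen=window)     # most recent evaluation scores (0-1)
        self.alert_threshold = alert_threshold

    def record(self, score: float) -> None:
        """Add the latest score, e.g. a judge or similarity score rescaled to 0-1."""
        self.scores.append(score)

    def breached(self) -> bool:
        """True once a full window's rolling average falls below the agreed threshold."""
        if len(self.scores) < self.scores.maxlen:
            return False                       # wait for a full window before alerting
        return sum(self.scores) / len(self.scores) < self.alert_threshold
```

A breach would then trigger escalation to the model owner under the bank’s MRM framework, alongside the human oversight described above.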

4. Consider the model’s users

A key element of ensuring appropriate usage is training users. The consistency and quality of inputs is an important driver of the consistency and quality of outputs (often referred to as GIGO, or “garbage in, garbage out”). Users should be trained and continuously reminded of the potential for uncertainty in LLM outputs, particularly in response to variations in their inputs. The training of users could involve responding to system-generated prompts, such as flagging where LLM outputs may need further review. LLMs are capable of a wide variety of transformations of data, for example finding information from foreign-language sources and translating it. Where this happens, the output should be flagged for extra attention. Validation processes should take these flags into account and check that they do not give rise to materially different or less reliable responses.

For LLMs in particular, validation reports should be written with the model users in mind. To reduce the potential risk of “adverse consequence from decisions based on incorrect or misused model outputs”, validators must ensure that the LLM’s weaknesses and limitations, assessments of uncertainty and any implications for the user are presented in clear and concise language.

5. Understand the cost/benefit implications

The cost of running LLM queries can be substantial. When developing a project using LLMs, it is important to consider the cost of running validation queries along with the cost of running the model itself early in the design phase. This needs to take into account the risk level of the model and the effectiveness of differing validation approaches. Understanding validation options at the design stage will allow for informed cost/benefit decisions to deliver the appropriate level of comfort with the model at the right cost.
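
As a simple illustration of sizing that cost early, the sketch below estimates the token cost of a validation run using the `tiktoken` package; the per-token prices and volumes are placeholders, not current vendor pricing.

```python
# Rough validation-run cost estimate (sketch; prices and volumes are placeholders).
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.005    # hypothetical USD rate
PRICE_PER_1K_OUTPUT_TOKENS = 0.015   # hypothetical USD rate

enc = tiktoken.get_encoding("cl100k_base")

def estimate_cost(prompt: str, expected_output_tokens: int, n_queries: int) -> float:
    input_tokens = len(enc.encode(prompt))
    per_query = (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
        + (expected_output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    return per_query * n_queries

# e.g. 10,000 evaluator queries over a ~1,500-word prompt with ~500-token answers
example_prompt = "word " * 1500
print(f"Estimated validation run cost: ${estimate_cost(example_prompt, 500, 10_000):,.2f}")
```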
 

Final Thoughts
 

It’s over a year since our previous blog on LLM validation and a lot has changed. Validators’ knowledge and expertise, built through hands-on, real-world experience, continue to grow. As the “chefs spend more time in the kitchen”, so do their libraries of validation techniques. One constant we see in all banks, though, is the need for a robust model risk framework, overseen by the Board, Executive Management, and an appropriately authorised and competent model risk management function, one that includes adequate resources for model validation.

Arthur C. Clarke wrote that “Any sufficiently advanced technology is indistinguishable from magic.” The world of AI, and particularly GenAI/LLMs, is an example of this. The challenge for model risk managers is to design validation approaches and processes that will give the Board and Execs the right information to allow them to pierce the magician’s veil.

References:
 

1 AI is an umbrella term for a range of technologies and techniques that enable computers to perform complex tasks in ways that mimic human reasoning and intelligence.
Machine Learning (ML), a subset of AI, takes a data-driven approach. Instead of pre-programmed rules, we feed the ML model a vast dataset. The model then analyses the data and “learns” the patterns, correlations, and relationships in the data. When presented with new data, the model can then make predictions based on the analysis it has undertaken.

2 While SS1/23 is primarily relevant for banks approved to use ‘Internal Models’, the Bank of England expects its principles to be considered more broadly (including by insurers), as noted in FS2/23 and its letter to the Department for Science, Innovation and Technology.

3 For example, the majority of CEOs of large companies are men, so an LLM could infer, incorrectly, that a characteristic of good leaders is that they are men.

4 This is an area where traditional validation experts will likely be particularly uncomfortable, as there is a valid case to say that the validation of the LLM by the LLM is subject to all the same challenges listed above. In addition, the first principle of validation is that it must be independent, and an LLM “checking its own homework” may not meet that expectation.