Adapting Model Validation in the Age of AI

Refining Principles and Developing Tools for Effective Validation amid Rising Concerns

The rapid advance of artificial intelligence (AI) in general, and Large Language Models (LLMs) in particular, is making quality assurance and validation of models a hot topic. Validation is becoming particularly important for LLMs as problems such as memorisation of training data, encoding of bias and generation of inappropriate content cause both regulators and the public to press for controls on AI. This blog looks at the key issues and indicates possible ways forward for AI model validation by refining principles and developing tools.

Faster learning in a faster world

One hundred million active users. It took Instagram two and a half years and TikTok nine months, but ChatGPT reached that landmark in only two months. In the process, OpenAI’s ChatGPT has become the best-known, but by no means the only, LLM. Beyond ChatGPT, the wider adoption of LLMs such as Google’s Bard and StabilityAI’s StableLM appears to indicate a tipping point in the ubiquity of AI and natural language processing. With their ability to reveal new insights, LLMs seem to be affecting every area of our lives at breakneck speed. But what do the capabilities, and the speed of adoption, of this technology imply for finance employers and employees? To make the best possible decisions in an increasingly complex world, it is vital to ensure continuous evolution of the approaches and methods used to validate and quality-assure LLMs in particular, and AI more widely. In this blog we focus on a single issue, one which many of our clients tell us is critical: whether LLMs have kickstarted an AI revolution in finance in the vital domain of model validation.

If AI (and LLMs in particular) are becoming so smart through learning, why do we need to be concerned about validation? The most obvious answer is that the vast amount of data required to train such models is being scraped from the internet with often little or no attribution or permission. AI software and applications use state-of-the-art machine learning models and techniques (sometimes public but increasingly private) to consume such large-scale data to identify features and train models to provide a stream of novel insights through new capabilities. Therefore, the approach to performing quality-assured validation for AI models more widely assumes critical importance as a research topic from both an academic and commercial perspective.

Validation of pricing and risk management models in finance has a well-understood methodology and generally accepted best practices. In contrast, LLMs have already attracted considerable media coverage on issues of bias (e.g., race, gender, age, geography) and toxicity (e.g., abuse and offence), making it challenging to validate such models in the face of an increasingly rapid and ever-widening spread of AI. From a model validation perspective, the first challenge is that the volume of data required to train and test AI models such as LLMs is frequently several orders of magnitude greater than that required by traditional models. To give a sense of scale, OpenAI has not disclosed the size of GPT-4’s training corpus, but LLMs of its class are trained on trillions of tokens, orders of magnitude more data than will typically be used to develop and test traditional risk models. To put this in context, by some estimates LLMs are trained on thousands of times more text than a human reads in a lifetime. A consequence of this greater size and dimensionality is that classical asymptotic theory can often fail to provide useful predictions and standard statistical techniques can break down, making validation by traditional means infeasible.

Second, the high degree of complexity in the algorithms underlying AI models, and LLMs in particular, requires new validation approaches and techniques capable of providing insight into how and why new results are produced. This leads to the third and perhaps most challenging validation issue: results are emerging from LLMs that are proving difficult to explain with current methods. Emergent behaviour occurs when quantitative changes in a system result in qualitative changes in behaviour, and is increasingly being observed when an ability absent in smaller LLMs is present in larger ones. LLMs have been scaled along three dimensions, namely, amount of computation, number of model parameters and size of the training dataset. Emergence leads to unpredictability and appears to increase with scaling, making it difficult for researchers to anticipate the consequences of widespread LLM use. Interestingly, in about 5% of tasks, researchers have found what they call “breakthroughs”: rapid, dramatic jumps in performance at some threshold scale, with the threshold varying by task and model. In other words, scaling laws do not appear to work as predictors of emergent abilities, making it difficult to design model validation approaches.

Proprietary LLMs like OpenAI’s GPT-4 are, to date, the most prominent examples of AI in the public imagination. However, improvements in model training like low-rank adaptation (LoRA) are making comparable performance accessible to open-source alternatives like Meta’s LLaMA (Large Language Model Meta AI). Whether proprietary or open-source frameworks will come to dominate so-called ‘foundation’ AI models, or whether they will coexist, is an open question.

Mökander et al (2023)1 helpfully describe a three-layered approach to auditing and validating LLMs: governance of the model provider, and performance at the model and application-specific levels. The layer closest to the end-user, the application-specific layer, essentially involves mapping the range of possible outputs to the expected range of inputs. These might be checked for accuracy, bias, and compliance with sector-specific regulations. The complexity of the models means that small changes to the inputs may lead to qualitatively different outputs, creating a vast possible response space from a single query. With a closed model, potentially without the access and control required to ensure reproducibility, the complexity of inputs and applications that can practically be validated is limited.
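The application-specific layer lends itself to automated checking. The following is a minimal sketch in Python of mapping representative inputs to constraints their outputs must satisfy; `query_model` is a hypothetical stand-in for a call to the deployed LLM, and the checks shown are illustrative, not a definitive compliance suite.

```python
import re

def query_model(prompt: str) -> str:
    # Hypothetical placeholder: in practice this would call the deployed LLM's API.
    return "The projected rate is 4.5% per annum."

# Each check pairs a representative input with a constraint on the output.
checks = [
    ("What is the projected rate?",
     lambda out: re.search(r"\d+(\.\d+)?%", out) is not None),  # must quote a rate
    ("What is the projected rate?",
     lambda out: "guaranteed" not in out.lower()),              # compliance: no guarantees
]

failures = [prompt for prompt, check in checks if not check(query_model(prompt))]
print(f"{len(checks) - len(failures)}/{len(checks)} checks passed")
```

In practice each prompt would be sampled many times at non-zero temperature, since the vast response space described above means a single query cannot cover the output distribution.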

Model-level audit or validation is a more general exercise assessing the robustness, performance, truthfulness, and information security of the underlying LLM, unrestricted to a specific application. With an open-source model, the transparency of this process is assured. Open-source models also lack a profit motive to overstate published performance, and benefit from the ‘wisdom of crowds’ in identifying and fixing bugs. However, open access provides equal opportunity for good and bad actors to find flaws, which may be an important consideration for use in finance. Fragmentation is also a concern with open models: different versions may exist simultaneously, and validation may need to consider variability and backward compatibility.

The challenges outlined above make it clear that whilst AI models have validation requirements that share common features with those of traditional models, it is not sufficient simply to limit validation of AI models to traditional model validation tools. Instead, extra scrutiny and effort must be applied to validate the entire AI model life-cycle, from data preparation for training and testing through to infrastructural activities around model building (such as feature engineering and parameter optimisation). Figure 1 shows the five additional planks of a validation and quality assurance strategy required for AI models, as helpfully identified by Tao et al2.

Figure 1: Five key elements of validation and quality assurance for AI models

Whilst high-level languages like R and Python have become the standard for AI model development, care needs to be taken to ensure that internally consistent, correct, non-contaminating validation testing is carried out, a task which can be complex and time-consuming. Validation schemes such as hold-out sets and cross-validation are vital if models are to be thoroughly and reliably quality-assured. From a practical perspective, such validation techniques need to be modular, so that what happens inside procedures such as cross-validation can be applied consistently, minimising the likelihood of problems such as contamination and bias. One of the best approaches for achieving this is to apply validation methods that support multiple levels of validation, so that they can be nested one level inside another, thereby minimising potential weaknesses in the validation process. A further important function of validation techniques like cross-validation is to prevent overfitting, where a machine learning model gives accurate predictions for one dataset but not for another.
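The nesting described above can be sketched with scikit-learn: an inner cross-validation loop tunes hyperparameters, while an outer loop estimates performance on data the tuning never saw, guarding against contamination. The model and parameter grid below are illustrative choices on synthetic data, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for a training dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # performance assessment

tuned_model = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
)

# Each outer fold refits the entire tuning procedure, so the reported
# scores are not contaminated by the hyperparameter selection.
scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {scores.mean():.3f}")
```

The modularity is visible here: the inner tuning loop is wrapped as a single estimator, so the outer loop can treat it as a black box and the two levels compose cleanly.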

Develop Tools

The good news from the validation perspective is that extensive public and private resources have been devoted to testing LLMs. One example is the BIG-bench3 AI validation tool, which currently consists of 204 tasks contributed by 450 authors across 132 institutions. Unfortunately, despite the impressive statistics, tests have generally been written and added in an ad-hoc fashion, whereby test users add tasks by extending the existing BIG-bench code or supplying template parameters for that code. In such an approach, it is up to the user to convert the expected LLM behaviour into a sequence of test vectors that can be executed, making it difficult to maintain, modify or extend test functionality.
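To illustrate the template-driven style described above, the sketch below shows the general shape of such a task definition: a named collection of input/target examples that a harness feeds to the model and scores. The task name and examples are invented for illustration, and this is the general pattern rather than the verbatim BIG-bench schema.

```python
import json

# Illustrative task definition: the user enumerates expected behaviour
# as explicit input/target pairs for the harness to execute.
task = {
    "name": "currency_conversion_sanity",  # hypothetical task name
    "description": "Check basic arithmetic on currency amounts.",
    "examples": [
        {"input": "Convert 100 USD to cents.", "target": "10000"},
        {"input": "Convert 2.50 USD to cents.", "target": "250"},
    ],
}

print(json.dumps(task, indent=2))
```

The maintenance burden is apparent: every behaviour to be tested must be spelled out as a concrete example, so coverage grows only as fast as contributors can enumerate cases.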

An alternative approach, by Kuchnik et al4, is a novel LLM validation tool called ReLM (a Regular expression engine for Language Models). The approach uses regular expressions: sequences of characters that specify match patterns in a longer series of text. ReLM is the first queryable test interface for LLMs, enabling commonly occurring patterns (in the form of sets of strings) to be identified with standard regular expressions. It is also the first system capable of expressing a query as a complete set of test patterns, which enables both developers and validators to explain and measure LLM behaviour directly over datasets that would otherwise be too large to enumerate.
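ReLM’s actual query engine works directly against the model’s token probabilities and is considerably more powerful, but the core idea, testing whether model outputs fall inside a regular language, can be sketched with Python’s standard `re` module. The sampled outputs below are illustrative.

```python
import re

# Pattern describing the only acceptable date format for this application.
acceptable = re.compile(r"\d{4}-\d{2}-\d{2}")

# Illustrative strings sampled from an LLM's responses.
sampled_outputs = ["2023-07-14", "14/07/2023"]

# Anything outside the regular language is flagged for the validator.
violations = [s for s in sampled_outputs if not acceptable.fullmatch(s)]
print(violations)
```

The advantage of the full ReLM approach is that the pattern constrains or enumerates the model’s output space directly, rather than filtering samples after the fact as this sketch does.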

Where Next?

The reasonably good news for teams validating AI models is that computing power continues to increase almost in line with Moore’s law5, which helps with data transformation, training, testing, model optimisation and validation. Further good news is the increased research output in the key domains of mathematics and algorithm design, much of which is targeted towards improving the efficiency, accuracy and robustness of AI models. Finally, the emergence and ongoing improvement of testing tools such as BIG-bench and ReLM provides vital infrastructure to enable validators to begin to modify their approaches to keep pace with the rapid development of AI.

However, increased focus needs to be brought to bear on further development of validation principles, to ensure that the conceptual framework that continues to emerge remains fit for purpose. A key element here will be to ensure that the crowd-sourced data used to train learning-based AI software does not become contaminated by leaked training and testing data, or by synthetically generated data. Arguably, a key reason validation is difficult is that most AI models are black boxes: as we may not know what happens inside them, it is hard to validate whether each computational step is fully understood and under control. This suggests that an alternative direction for improving ease of validation is to develop more interpretable models that provide explanations for their responses. Then, in addition to defining test sets to see whether responses meet expectations, it becomes possible to validate whether a response is produced following reasonable logic.

A final thought concerns whether employers and employees, developers and validators alike, will be able to keep pace with the increasingly rapid development and deployment of ever-more sophisticated AI models and applications. A recent example of this speed and spread of change in AI occurred when Meta launched Threads, its challenger to Twitter, which gained 10 million users in just seven hours.

1Mökander, J., Schuett, J., Kirk, H.R. et al. Auditing large language models: a three-layered approach. AI Ethics (2023). https://doi.org/10.1007/s43681-023-00289-2

2C. Tao, J. Gao and T. Wang, "Testing and Quality Validation for AI Software–Perspectives, Issues, and Practices," in IEEE Access, vol. 7, pp. 120164-120175, 2019, doi: 10.1109/ACCESS.2019.2937107.

3Srivastava, A. et al. “Beyond the Imitation Game: quantifying and extrapolating the capabilities of language models”, arXiv preprint (2022). https://arxiv.org/abs/2206.04615

4Kuchnik, M., Smith, V. and Amvrosiadis, G. “Validating large language models with ReLM”, arXiv preprint (2023). https://arxiv.org/abs/2211.15458

5Moore GE (1965) Cramming more components onto integrated circuits. Electronics 38(6). ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf
