Discover how benchmarking strengthens VaR oversight, improves model governance, and detects weaknesses earlier
For many institutions, VaR remains a key tool for measuring portfolio market risk. Its enduring appeal lies in its simplicity: it condenses a low-probability loss into a single metric. However, VaR is still a model and therefore subject to model risk, meaning it requires robust governance. While regulators generally expect these models to be validated and demonstrated to be fit for purpose, the ongoing monitoring of their performance is usually limited to a simple, barebones back testing exercise.
Back testing is designed to assess whether the model's forecasts remain consistent with its stated statistical properties. But when misalignments appear, back testing by itself gives risk managers only limited insight into the underlying causes.
Introducing other models into the analysis can help. This practice, commonly known as "benchmarking," compares a reference model with credible alternatives, helping identify which modeling choices are most likely driving any deterioration in performance. It allows risk managers to spot potential issues earlier and take remedial action more promptly and with greater confidence.
In that sense, model benchmarking is more than just a validation tool. It can strengthen model governance and improve model risk oversight, and should therefore be part of every risk manager's toolkit.
In a traditional back testing program, the reference model is evaluated based on the statistical properties of its observed overshootings, that is, the days on which losses exceed the previously forecasted VaR. In many institutions, these evaluations are conducted over relatively brief time horizons (typically one year), which can lead to conclusions that lack robustness.
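To make these mechanics concrete, the Python sketch below counts overshootings over a one-year window and applies a simple binomial test of the exception count against the nominal coverage, one common way to formalize the exercise. The simulated data, the 99% confidence level and the constant parametric VaR are assumptions made purely for illustration, not a prescription.

```python
import numpy as np
from scipy import stats

def count_overshootings(pnl, var_forecast):
    """Count days on which the realized loss exceeds the forecasted VaR.

    pnl          : daily P&L (losses are negative)
    var_forecast : one-day VaR forecasts, expressed as positive numbers
    """
    pnl = np.asarray(pnl)
    var_forecast = np.asarray(var_forecast)
    return int(np.sum(pnl < -var_forecast))

def exception_rate_pvalue(n_exceptions, n_days, coverage=0.99):
    """Binomial test of the observed exception count against the nominal rate."""
    return stats.binomtest(n_exceptions, n_days, 1.0 - coverage).pvalue

# Illustrative run on simulated data: 250 trading days, 99% one-day VaR.
rng = np.random.default_rng(0)
pnl = rng.normal(0.0, 1.0, 250)
var_99 = np.full(250, 2.33)  # constant parametric VaR, for illustration only
n_exc = count_overshootings(pnl, var_99)
print(n_exc, exception_rate_pvalue(n_exc, 250))
```

A test of this kind answers only whether the number of exceptions is plausible; it says little about why the model is missing, which is where benchmarking adds value.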
On the other hand, comparing reference model performance against carefully chosen alternative models allows risk managers to link observed shortcomings with specific model features. To do this effectively, the models need to cover the spectrum of modeling approaches (e.g., parametric, historical) and assumptions (e.g., normality of risk-factor returns or their dynamics). Risk managers may also consider operational and general modeling limitations, such as process complexity or lack of interpretability.
These performance comparisons require a loss function tailored to VaR forecasts, so that performance can be evaluated consistently across models. Bootstrap-based inference can then be used to determine whether apparent outperformance is statistically meaningful.
Figure 1: The probability of an alternative model outperforming the reference model is computed, and a conclusion is drawn when that probability is sufficiently high (or low) to be unlikely to have stemmed from chance.
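As an illustration of how such a comparison might be implemented, the sketch below uses the quantile ("tick") loss, a standard loss function for quantile forecasts such as VaR, and a simple i.i.d. bootstrap of the mean loss differential to estimate the probability that an alternative model outperforms the reference. The function names and the i.i.d. resampling scheme are assumptions for the sketch; in practice a block bootstrap would better respect serial dependence in daily P&L.

```python
import numpy as np

def tick_loss(pnl, var_forecast, alpha=0.01):
    """Quantile ('tick') loss of a VaR forecast at tail probability alpha.

    VaR forecasts are positive numbers, so -var_forecast is the predicted
    alpha-quantile of the P&L distribution. Lower average loss is better.
    """
    y = np.asarray(pnl)
    q = -np.asarray(var_forecast)
    return (alpha - (y < q)) * (y - q)

def outperformance_probability(loss_ref, loss_alt, n_boot=5000, seed=0):
    """Bootstrap the mean loss differential between the reference model and an
    alternative, returning the share of resamples in which the alternative
    has the lower average loss."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(loss_ref) - np.asarray(loss_alt)  # > 0: alternative better
    n = len(diff)
    boot_means = np.array([diff[rng.integers(0, n, n)].mean()
                           for _ in range(n_boot)])
    return float(np.mean(boot_means > 0.0))
```

A probability that stays close to one (or zero) across evaluation windows would then support the conclusion that the performance gap did not stem from chance, mirroring the logic summarized in Figure 1.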
This type of benchmarking is best used as an early-warning system. If certain models begin to outperform the reference model in a statistically significant way, the exercise helps identify which reference model features are becoming less suitable, and which alternatives should be favored. This, in turn, informs decisions on whether to adjust parameters or make a more fundamental change in methodology, for example switching from a parametric approach to a historical one.
One-on-one comparisons have an important limitation: their conclusions do not scale well when choosing the best model from many candidates. As the number of comparisons increases, the risk of distorted inference grows. Some models may appear to outperform others purely due to chance, which can ultimately lead to misleading conclusions.
This is why practitioners use a more holistic framework for model selection: the Model Confidence Set (MCS) approach, introduced by Hansen, Lunde, and Nason in 2011.
Rather than evaluating each alternative against a reference model, MCS identifies the subset of models that cannot be statistically distinguished from the best-performing model at the chosen confidence level.
This methodology is most useful when a genuine model selection decision is needed, for example, when a new framework is being introduced or when the current model has become clearly unfit for purpose. In practice, MCS is applied across representative portfolios and candidate models, iteratively eliminating inferior performers until only the statistically defensible set remains.
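The following Python sketch illustrates the elimination loop under simplifying assumptions: it measures each model's loss relative to the survivor average and uses an i.i.d. bootstrap of a max t-statistic, whereas the original Hansen, Lunde, and Nason procedure specifies block bootstrapping and particular test statistics. It is a stylized outline, not a reference implementation.

```python
import numpy as np

def model_confidence_set(losses, alpha=0.10, n_boot=2000, seed=0):
    """Simplified Model Confidence Set elimination loop.

    losses : dict mapping model name -> array of per-period losses
    Returns the set of model names surviving at confidence level 1 - alpha.
    """
    rng = np.random.default_rng(seed)
    survivors = dict(losses)
    while len(survivors) > 1:
        names = list(survivors)
        L = np.column_stack([survivors[m] for m in names])   # T x k loss matrix
        d = L - L.mean(axis=1, keepdims=True)                 # loss vs. survivor average
        d_bar = d.mean(axis=0)
        T = len(d)
        se = d.std(axis=0, ddof=1) / np.sqrt(T)
        t_stat = np.max(d_bar / se)
        # Bootstrap the null distribution of the max standardized relative loss
        boot_max = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, T, T)
            boot_max[b] = np.max((d[idx].mean(axis=0) - d_bar) / se)
        p_value = np.mean(boot_max >= t_stat)
        if p_value >= alpha:
            break                                             # equal predictive ability not rejected
        worst = names[int(np.argmax(d_bar))]                  # eliminate the worst performer
        survivors.pop(worst)
    return set(survivors)
```

In this stylized version, the per-model loss series could be the tick losses computed earlier, evaluated on the same portfolio and window for every candidate.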
The resulting set may still include multiple models. At that point the risk manager can choose a preferred methodology based on considerations other than performance, such as model complexity or the familiarity of the risk management function with the model assumptions and theoretical foundations.
The MCS approach supports model selection in a way that avoids biases, provided a broad enough model universe is selected. As a result, it typically reduces the need for frequent model changes.
Figure 2: The Model Confidence Set algorithm removes models from the starting universe one at a time until a final set is reached.
A third benefit of benchmarking is that it can enhance traditional back testing.
A key part of the regulatory back testing of VaR models used in Undertakings for Collective Investment in Transferable Securities (UCITS) funds is analyzing the source of overshootings and reporting them to senior management. In practice, these explanations are often focused on market narratives, without clear links to model features (e.g., a sharp change in intra-portfolio correlations). While market moves explain part of the result, risk managers should not ignore potential weaknesses in the model itself. They should analyze the pattern and characteristics of overshootings carefully to extract as much insight as possible about the model's behavior.
A more informative approach is to measure the dispersion of VaR forecasts across the models in the MCS, for example through a coefficient of variation. This disagreement metric creates context for the analysis of overshootings. If an overshooting occurs when all strong models give similar VaR forecasts, it reflects an unusually severe but broadly recognized market move. If it occurs when models disagree widely, modeling uncertainty becomes a more credible explanation.
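A minimal sketch of such a disagreement metric follows, assuming the MCS models' daily VaR forecasts are stacked into a single array; the 0.15 threshold used to separate "market move" from "model uncertainty" cases is purely illustrative, not a recommended calibration.

```python
import numpy as np

def var_disagreement(var_forecasts):
    """Coefficient of variation of VaR forecasts across the MCS models.

    var_forecasts : array of shape (n_days, n_models), positive VaR numbers.
    Returns one disagreement value per day.
    """
    v = np.asarray(var_forecasts, dtype=float)
    return v.std(axis=1, ddof=1) / v.mean(axis=1)

def classify_overshootings(pnl, var_ref, disagreement, threshold=0.15):
    """Tag each reference-model overshooting as driven mainly by a severe but
    broadly recognized market move (models agree) or by modeling uncertainty
    (models disagree widely). The threshold is an illustrative assumption."""
    exceptions = np.asarray(pnl) < -np.asarray(var_ref)
    labels = np.where(np.asarray(disagreement) > threshold,
                      "model uncertainty", "market move")
    return [(int(day), labels[day]) for day in np.flatnonzero(exceptions)]
```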
This distinction is valuable for both oversight and action. It helps risk managers avoid status quo bias and detect the formation of structural model weaknesses sooner. It also helps identify cases where a deeper review of the model may be warranted, complementing the regular back testing exercise.
Figure 3: Disagreement among the models in the Model Confidence Set, with the overshootings of the reference model overlaid. The color of each marker reflects the proportion of alternative models that also record an overshooting on that day.
Traditional VaR back testing does not, on its own, provide full model oversight. It answers a narrow question: is the model failing in absolute terms? It says far less about whether the model is gradually weakening, whether alternative models may become more suitable, and what should be done in response.
Model benchmarking fills that gap.
With techniques such as one-on-one comparisons or model disagreement measures, risk managers can adequately monitor the performance of their models and attribute any degradation to specific model features or systemic shifts in market variables.
When a model needs to be retired, a risk manager who benchmarks regularly will already have a clearer view of which alternatives are credible replacement candidates and will be able to confirm it via an MCS exercise, yielding a choice in which they can be confident.
For model oversight, benchmarking represents a real opportunity to be proactive rather than reactive and to detect shortcomings as they manifest instead of waiting for them to turn into modeling failures.