
Optimal clustering for PD calibration in the presence of noise


One question we are often asked is how many grades a probability of default (PD) calibration should employ under discrete estimation. Consequences of too few grades include loss of risk sensitivity (including, potentially, under-estimation in key cohorts such as accounts in arrears), whereas too many grades can lead to volatile PDs that fail to back-test. Under both scenarios, the underlying cyclicality of the risk ranking model can become obscured.

This article describes a challenger algorithm for determining grade boundaries in time series data in the presence of noise. In a general sense, noise refers to modifications to an underlying variable that result in the “true” underlying value being unobservable. In credit modelling, the principal drivers of noise in observations are data quality (e.g. measurement errors introduced via transformations of source data, or outright missing values) and quantisation error associated with discrete variables in logistic regression scorecards.

We found that the number of PD calibration bins is highly sensitive to noise in risk ranking scores, leading us to surmise that the optimal number of bins may vary significantly between banks, depending on their data quality and choice of risk ranking inputs.


The CRR permits firms to choose between grade-level and direct estimation of PDs. Grade-level estimation refers to selecting homogeneous groups of obligors and setting the regulatory PD to the observed average of one-year default rates. The approach is illustrated in the PRA’s stylised example in the Appendix to PS13/17. Direct estimation establishes a continuous mapping between risk ranking score and calibrated PD, which may then be mapped to a master scale. In this article, we discuss binning algorithms for grade-level estimation.
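To make the grade-level calculation concrete, the following is a minimal sketch of the "observed average of one-year default rates" for a single grade. The function name and the obligor and default counts are hypothetical and chosen purely for illustration. The obligor-weighted (pooled) rate is shown alongside only to highlight that the two averages generally differ.

```python
from statistics import mean


def grade_level_pd(obligors_by_year, defaults_by_year):
    """Simple mean of one-year default rates for one grade (hypothetical helper)."""
    rates = [d / n for n, d in zip(obligors_by_year, defaults_by_year) if n > 0]
    return mean(rates)


# Hypothetical grade observed over three annual snapshots.
obligors = [1000, 1200, 800]
defaults = [20, 12, 24]

pd_grade = grade_level_pd(obligors, defaults)  # mean of 2%, 1% and 3%
pooled = sum(defaults) / sum(obligors)         # obligor-weighted alternative
```

Because each year carries equal weight in the simple mean, a sparsely populated year influences the grade PD as much as a well-populated one.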

The calibration curve can be conceptualised as some monotonic function g() that transforms the probability mass function (PMF) of uncalibrated risk ranking scores f(X) to the PMF of calibrated PDs g(f(X)). The figure below provides an idealised graphical representation:

With prior knowledge of the functional form of g(), its parameters can be estimated from one more data point than its polynomial order. For example, a cubic curve may be estimated from four data points, and an intercept adjustment can be estimated from a single data point (typically the pair formed by the mean uncalibrated PD and the calibration target, an approach that is relatively common in corporate credit modelling). However, in the presence of noise and/or without prior knowledge of the functional form of g(), it is appropriate to sample the curve.
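The two noise-free cases can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed method: the intercept adjustment is assumed to be applied on the log-odds scale, the cubic is recovered by Lagrange interpolation, and all function names and data points are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def intercept_shift(mean_uncalibrated_pd, target_pd):
    """One (mean PD, target) pair fixes a single log-odds shift (assumed scale)."""
    return logit(target_pd) - logit(mean_uncalibrated_pd)


def lagrange_fit(points):
    """Return the unique polynomial of order len(points) - 1 through the points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


# Intercept adjustment: map a 2% mean uncalibrated PD onto a 3% target.
shift = intercept_shift(0.02, 0.03)
calibrated = sigmoid(logit(0.02) + shift)

# Cubic from four exact samples of x**3 + x + 1.
cubic = lagrange_fit([(0, 1), (1, 3), (2, 11), (3, 31)])
```

With exact data, the four points pin down the cubic completely, so `cubic(4)` reproduces the underlying curve. Once the points are noisy this is no longer true, which is the motivation for sampling the curve instead.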

Sampling Procedure

The figure below illustrates a sampled approximation of g():

Within credit modelling, there exist well-established clustering procedures that are effective for scorecard building. However, for calibrated PDs (a Group Means Estimator, GME, or “average of averages”), there is currently little consensus on clustering approaches or optimisation criteria.

We designed a procedure to minimise variance in the least-populated (time slice, bin) tuple. Our procedure traverses ranking scores and, for the bin under construction, tracks the number of defaults accumulated in each time slice. Once the number of defaults in the sparsest time slice for that bin reaches a chosen threshold, a bin boundary is inserted and the accumulation of defaults resets. (The procedure could also step backwards, or start at the centre of the score range and simultaneously step forwards and backwards under a “middle out” implementation.) The final bin is then merged back into the penultimate bin if the number of defaults in its sparsest time slice is below the threshold.
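A minimal sketch of the forward-stepping variant described above, assuming defaults are supplied as a nested list indexed by time slice and then by rank-ordered score. The function name, input layout, and toy data are assumptions made for illustration.

```python
def find_bin_boundaries(defaults, threshold):
    """defaults[t][s]: defaults in time slice t at the s-th ranked score.

    Returns a list of (start, end) score-index ranges, end exclusive.
    """
    n_slices, n_scores = len(defaults), len(defaults[0])
    bins, start = [], 0
    running = [0] * n_slices  # defaults accumulated per time slice in the open bin
    for s in range(n_scores):
        for t in range(n_slices):
            running[t] += defaults[t][s]
        # Close the bin once the sparsest time slice reaches the threshold.
        if min(running) >= threshold:
            bins.append((start, s + 1))
            start = s + 1
            running = [0] * n_slices
    if start < n_scores:
        # Leftover scores never reached the threshold, so merge them back.
        if bins:
            last_start, _ = bins.pop()
            bins.append((last_start, n_scores))
        else:
            bins.append((0, n_scores))
    return bins


# Toy input: two time slices, six rank-ordered scores.
toy_defaults = [
    [3, 2, 4, 1, 5, 1],
    [1, 4, 2, 3, 2, 1],
]
toy_bins = find_bin_boundaries(toy_defaults, threshold=5)
```

In the toy data the last two scores accumulate only three defaults in their sparsest time slice, so they are merged into the preceding bin.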

The choice of threshold is subjective. Standard rules-of-thumb for linear models suggest between 20 and 50 defaults per free parameter. With zero noise, the threshold may be reduced, whereas in the presence of noise the threshold should be increased. To select a threshold, we assume that rank order reversals in default rate should never occur over time (i.e. all retail borrowers in a portfolio experience the same credit cycle, so any rank order reversals over time must be spurious), and increase the threshold until no rank order reversals are observed.
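The threshold search can be sketched as below. It assumes a rank order reversal means that, within a single time slice, a better-ranked bin shows a strictly higher default rate than the adjacent worse-ranked bin; other definitions are possible. The function names and the toy lookup of binned default rates are hypothetical.

```python
def count_reversals(rates):
    """rates[t][b]: default rate in time slice t for bin b, ordered best to worst.

    Counts adjacent bin pairs that are out of order within a time slice
    (an assumed definition of a rank order reversal).
    """
    return sum(
        1
        for row in rates
        for better, worse in zip(row, row[1:])
        if better > worse
    )


def smallest_clean_threshold(candidates, rates_for_threshold):
    """Return the smallest candidate threshold whose binned rates show no reversals."""
    for threshold in sorted(candidates):
        if count_reversals(rates_for_threshold(threshold)) == 0:
            return threshold
    return None


# Hypothetical binned default rates for two candidate thresholds.
toy_rates = {
    10: [[0.01, 0.03, 0.02], [0.01, 0.02, 0.04]],  # reversal in the first slice
    20: [[0.01, 0.04], [0.02, 0.05]],
}
best = smallest_clean_threshold(toy_rates, toy_rates.get)
```

In practice `rates_for_threshold` would rerun the binning at each candidate threshold and recompute default rates per (time slice, bin) tuple.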


We constructed a hypothetical portfolio of 150,000 borrowers observed between 2001 and 2020, with 201 unique risk ranking scores and a gradual improvement in asset quality over time. We then added random noise to both the obligor counts and the default outcomes, to simulate the inevitable impact of data quality and the real-world behaviour of risk ranking scorecards. The table below summarises the smallest threshold values that achieve no rank order reversals, and the resulting number of bins, under three noise scenarios:
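For readers who wish to run a similar experiment, the following is a rough sketch of how such a portfolio might be simulated. It is not the generator used for the results in this article. The PD curve, the annual drift, the lognormal noise on obligor counts, and the Gaussian noise on log-odds are all assumptions, as are the function and parameter names.

```python
import math
import random


def simulate_portfolio(n_borrowers=150_000, years=range(2001, 2021),
                       n_scores=201, noise=0.1, seed=0):
    """Return (obligors, defaults), each indexed [year_index][score_index]."""
    rng = random.Random(seed)
    base = n_borrowers / n_scores  # even spread of obligors across scores
    obligors, defaults = [], []
    for k, _year in enumerate(years):
        drift = -0.02 * k  # assumed gradual improvement in log-odds per year
        row_n, row_d = [], []
        for s in range(n_scores):
            # Noisy obligor count: lognormal multiplier around the even spread.
            n = max(0, round(base * math.exp(rng.gauss(0.0, noise))))
            # Assumed PD curve: log-odds rise linearly from -7 to -1 across scores,
            # with additive noise standing in for noisy default outcomes.
            log_odds = -7.0 + 6.0 * s / (n_scores - 1) + drift + rng.gauss(0.0, noise)
            p = 1.0 / (1.0 + math.exp(-log_odds))
            # Binomial draw as a sum of Bernoulli trials.
            d = sum(rng.random() < p for _ in range(n))
            row_n.append(n)
            row_d.append(d)
        obligors.append(row_n)
        defaults.append(row_d)
    return obligors, defaults


# Small run for illustration; the defaults mirror the portfolio size in the text.
sim_obligors, sim_defaults = simulate_portfolio(n_borrowers=2_000, noise=0.3)
```

Raising `noise` widens the spread in both obligor counts and default rates by score, which is the kind of effect that pushes the required threshold upwards.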

Noise scenario                   Threshold value
Low (“laboratory conditions”)    21
Medium (“realistic”)
High (“poor data quality”)       98

The figures below illustrate the estimated default rate by bin, for:

  • Low noise scenario, with low noise threshold value of 21;
  • High noise scenario, with low noise threshold value of 21 – the volatility in default rates as well as rank order reversals are clearly visible; and
  • High noise scenario, with high noise threshold value of 98 – the volatility in default rates is reduced to the extent that no rank order reversals occur.

We also observed a tipping point, as the threshold value is varied, at which the minimum number of defaults (over time) in the worst bin increases by an order of magnitude. This is consistent with encountering a highly predictive risk ranking score, e.g. the dummy for “is the account in arrears”. This can be overcome by setting a higher threshold to, in effect, rebalance the defaults across the bins, at the cost of fewer bins and a larger concentration in the best bin. The table below illustrates the effect for the high-noise scenario. A threshold value of 200 may be preferable to 98 for the sake of distributing variance more evenly across the bins, although two bins are lost as a result.

Threshold value    Rank order reversals    Min defaults in worst bin


The results seem to suggest that variability between firms’ numbers of PD calibration grades can reasonably be expected. The appropriate number of discrete grades for PD calibration is strongly influenced by the amount of noise in risk ranking scores and default observations. The main sources of noise in credit risk data are likely to be data quality issues as well as quantisation error in logistic regression scorecards. Issues that seemingly affect a small proportion of records can easily lead to significant spikes in the distribution of obligors or defaults by risk ranking score. The optimisation procedure cannot interpolate across these discontinuities, leading to higher threshold values and fewer grades overall. Firms with concerns about their number of PD calibration grades could consider running a challenger approach to clustering, on internal data, to help inform the potential weaknesses and trade-offs in their current clustering choices.