The “Analytics Olympics”
Based on what 50,000 customers did with a mobile phone company, can you predict the actions of another 50,000 customers?
Will they switch to another provider, buy another product, upgrade or do nothing at all? That was the challenge in the KDD Cup for 2009. KDD (Knowledge Discovery and Data Mining) is a special interest group of ACM (Association for Computing Machinery), and is one of the premier international forums in the field of data mining and predictive analytics. For the last 13 years KDD has run a contest where the challenge is to be the most accurate predictor of an outcome. This year a Deloitte team with staff from the Auckland and Sydney offices entered against teams from industry and academia around the world.
The Challenge
The challenge data set was supplied by the French telecommunications company Orange. Having started out as Orange’s real data on customer history and usage, it was heavily disguised before being made public for the contest: text values were encrypted and variable names masked. This effectively removed a significant element of what we would normally do in a modelling engagement: exploring, augmenting and profiling the data from a business perspective. We were also fortunate that the data arrived already structured, as another big piece of work in a typical modelling assignment is getting the data into the right shape. Instead, the contest became one of minds, methods and machinery – how accurate a team could be in the time available.
We apply a number of tools to a problem such as this, depending on the data and the nature of the problem. In this case the number of records was no issue (50,000 customer records is a modestly sized problem these days). The number of variables, however, was 15,000, which is considerable but not unrealistic. A number of tools quite simply cannot cope with this much data, but one tool in the bag (The SAS System) handled it readily and without complaint, enabling us to get stuck into the prediction problem.
Modelling Process
Those 50,000 customer records tell us what customers did in the past. The 15,000 variables contain all the information known about those customers – such as who they are and where they live, what products they have, the services they subscribe to, call usage, unused call minutes and so on. In addition there are flags to indicate a subsequent outcome: whether they have churned, bought or upgraded.
From that history, we want to be able to predict the future: the likely outcome for a different set of customer records. To a company such as Orange this is vital knowledge. For example, if they know who is likely to switch to a rival company (known as churn), they can take preventative action such as making a customer care call or offering a free add-on. Making such an offer to the entire customer base might be uneconomic, so a predictive model can be useful to identify those customers most at risk. In this contest Orange provided a second data set of another 50,000 customer records where the outcome is known only to them, not to us.
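To make the economics concrete, here is a minimal sketch of how a retention budget might be spent on the highest-risk customers only. The customer IDs, risk scores, budget and cost per offer are all hypothetical values invented for illustration; none of them come from the contest data.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    churn_risk: float  # model's estimated probability of churn, 0..1 (hypothetical)


def select_for_retention(customers, budget, cost_per_offer):
    """Return the highest-risk customers that the budget can cover."""
    max_offers = int(budget // cost_per_offer)
    ranked = sorted(customers, key=lambda c: c.churn_risk, reverse=True)
    return ranked[:max_offers]


# Illustrative scores only; a real model would supply these.
customers = [
    Customer("A", 0.05),
    Customer("B", 0.72),
    Customer("C", 0.31),
    Customer("D", 0.90),
    Customer("E", 0.12),
]

# A budget of 20 at a cost of 10 per offer covers two customers.
targets = select_for_retention(customers, budget=20.0, cost_per_offer=10.0)
print([c.customer_id for c in targets])
```

With these made-up numbers the budget covers two offers, so only the two customers with the highest estimated risk are contacted rather than all five.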
When we make a predictive model we use a modelling tool to identify the relationships between variables and the relative importance of those variables. Usually more than one tool will be used, as each has its strengths and weaknesses, but the real test is how well they predict the result. Having created a model we score the second data set customer by customer – essentially making our prediction of what that customer would do based on their data and our model. This scored set becomes the basis for comparing how accurate the competing teams are, and the overall score for the contest is how accurate those scored sets are.
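The fit-then-score workflow can be sketched in a few lines. This is an illustration only, not the method we used in the contest: it assumes a simple logistic regression trained by gradient descent, and the two variables (fraction of call minutes unused, number of support calls) and all data values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, epochs=500, lr=0.1):
    """Fit a logistic regression by stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log-loss with respect to z
            bias -= lr * error
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def score(rows, weights, bias):
    """Return a predicted probability of the outcome for each record."""
    return [sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) for x in rows]


# Historical records: [fraction of minutes unused, support calls] (hypothetical).
history = [[0.9, 4], [0.8, 3], [0.7, 5], [0.2, 0], [0.1, 1], [0.3, 0]]
churned = [1, 1, 1, 0, 0, 0]  # known outcome flags for the historical set

weights, bias = fit_logistic(history, churned)

# Second data set: outcomes unknown to us, so we score each record.
new_customers = [[0.8, 4], [0.25, 0]]
scores = score(new_customers, weights, bias)
for record, s in zip(new_customers, scores):
    print(record, round(s, 3))
```

The first set, with known outcomes, trains the model; the second set is only ever scored. In a contest setting those scores would be submitted and compared against the outcomes held back by the organiser.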
Results
The competition has just closed and, based on overall scores, we managed to get into the top 10% of the worldwide field of industry practitioners and academics. We’ve enjoyed sharpening up our skills on this reasonably meaty analysis.
However, what is compelling for me is just how accurate a modelling prediction can be. Despite using disguised data, these predictive modelling techniques can be astoundingly accurate. Being able to predict which customers are going to leave with 73% accuracy, who’s going to become a customer with 84% accuracy, and who’s going to be “up-sold” and buy another product with 89% accuracy has to be a compelling proposition indeed for any marketer.
Applications
This contest was a marketing application and gave us very real practice in ways to increase our prediction accuracy. However, the techniques apply to any situation where there is historical data and an outcome to predict. Examples include credit scoring, collections, human resources, operations, fraud and so on. Key factors to consider in deciding whether predictive modelling might solve a business problem are:
Economics: Is there a pressing need to resolve this problem – is it costing money or is there money to be made?
Data: Is there sufficient data available or able to be acquired, and is it relevant to the problem at hand?
Actionable: Are the insights from the model able to be acted upon? Is it able to be used to influence an outcome?
