An Empirical Analysis of the Training and Feature Set Size in Text Categorization for eDiscovery
One size does not necessarily fit all
There is some confusion, along with a number of misconceptions, about how many documents should be included in the training set of a predictive coding model. This paper approaches the question from a statistical perspective and looks at the various ways in which the size of the training set is determined. Some practitioners have used a fixed number of documents, while others have suggested statistical sampling as a means of arriving at this number. While statistical measures such as confidence level and confidence interval are helpful in calculating the size of the validation set, they do not provide a good means of approximating the required size of the training set (i.e., the number of training documents needed for the classification model to approach its peak performance).
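To make the distinction concrete, the following is a minimal sketch of how a validation sample size is commonly derived from a confidence level and a margin of error (half the width of the confidence interval). It assumes the standard normal-approximation formula for a proportion with an optional finite population correction; the function name and parameters are illustrative and are not taken from the paper.

```python
import math
from statistics import NormalDist


def validation_sample_size(confidence_level, margin_of_error,
                           population=None, proportion=0.5):
    """Estimate how many documents to sample for a validation set.

    Assumption: normal approximation for a proportion, with the
    conservative default proportion of 0.5 (the worst case for variance).
    """
    # Two-sided z-score for the requested confidence level (0.95 -> about 1.96).
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Sample size for an effectively infinite population.
    n = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        # Finite population correction: shrinks n slightly for small collections.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for total_docs in (100_000, 1_000_000, None):
        print(total_docs, validation_sample_size(0.95, 0.02, total_docs))
```

Note that nothing in this calculation refers to the classifier or to the difficulty of the categorization problem, and the result changes very little as the population grows. That is why it suits the validation set but says little about how many training documents a model needs.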
Research conducted by Ph.D.s from the Deloitte Analytics Institute suggests that the number of documents needed for the model to reach its peak performance can be quite different from one categorization problem to another. In other words, when it comes to training sets, one size does not fit all. Instead, it appears that the size of the training set is less dependent on the total size of the document population and much more dependent on the complexity of the categorization problem at hand. Further, the complexity of the categorization problem itself can be approximated by the number of features, or in this case significant terms (e.g., words), required for the classification model to approach its peak performance.
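One way to picture "approaching peak performance" is to read it off a learning curve. The sketch below is an illustration only, not the method used in the study: it assumes a list of (training set size, score) pairs has already been measured, and it returns the smallest size whose score falls within a chosen tolerance of the best observed score. The function name, the tolerance value, and the numbers in the example are all hypothetical.

```python
def size_near_peak(curve, tolerance=0.01):
    """Return the smallest size whose score is within `tolerance` of the peak.

    `curve` is an iterable of (size, score) pairs, for example the number of
    training documents and the F1 score measured at that size.
    """
    points = sorted(curve)
    if not points:
        raise ValueError("curve must contain at least one (size, score) pair")
    peak = max(score for _, score in points)
    for size, score in points:
        if score >= peak - tolerance:
            return size


if __name__ == "__main__":
    # Invented numbers, for illustration only; not results from the paper.
    hypothetical_curve = [(100, 0.60), (200, 0.70), (400, 0.78),
                          (800, 0.80), (1600, 0.805)]
    print(size_near_peak(hypothetical_curve))
```

The same helper could be applied to a curve of score against number of features (significant terms), which is the quantity the paragraph above proposes as a proxy for the complexity of the categorization problem.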
This paper reviews related work in the field, then describes the study that our analytics professionals performed, including the experimental setup and the results obtained on data from four real-life matters, and concludes with a summary of our findings.
As used in this document, “Deloitte” means Deloitte LLP and its subsidiaries. Please see www.deloitte.com/us/about for a detailed description of the legal structure of Deloitte LLP and its subsidiaries. Certain services may not be available to attest clients under the rules and regulations of public accounting.