Join us on a trip into the world of technology-assisted investigations, where we use the latest technology to analyse datasets of millions of documents in a matter of weeks, to distinguish the signal from the noise and to identify the facts that matter.
We use as an example one of our clients, an international company headquartered in Switzerland, which had been accused of anti-competitive behaviour by a foreign authority. This company commissioned us to assist its external legal counsel in responding to the allegations by carrying out an investigation and identifying key facts from the email data of key persons of interest. After receiving a dataset comprising roughly 1 million emails and attachments, Deloitte used natural language processing (NLP) techniques to rapidly identify key evidence within a large volume of user-generated content. With the aid of technology, Deloitte was able to do so swiftly, efficiently and at lower cost than with traditional methods.
Modern investigations typically involve very large volumes of text data. Most text data is nowadays in electronic form and the process of identifying evidence within electronically stored information (ESI) is known as E-Discovery. The biggest challenge E-Discovery practitioners face is to analyse and review large volumes of text data to a satisfactory standard whilst keeping to a reasonable and proportionate time frame and cost.
Technology-assisted investigation approaches based on concepts such as NLP play a crucial role in improving speed and quality and in lowering costs. NLP is a discipline at the intersection of linguistics and computer science. It relates to the large-scale processing and analysis of unstructured text data with the aim of gathering relevant facts and insights in a structured manner. NLP-based techniques are suitable both for targeted investigations, where a clear starting point or allegations are present, and for exploratory investigations, where very little a priori information is available.
The case introduced above was particularly representative of the overarching trends of tight deadlines and large data volumes. The client was given only two weeks to analyse over one million collected documents, the so-called ‘document population’, in order to provide the authority with legally binding information. The document population consisted of emails, documents and Excel spreadsheets.
To meet the challenging deadline, the use of NLP was crucial.
The primary objective of the investigation from the client’s perspective was to thoroughly examine the allegations made by the foreign authority.
The details of the allegation and the persons potentially involved were used to define a procedure for analysing the document population. Before the investigation, the company’s legal counsel was already aware of a selection of 20 relevant communications between persons of interest. This selection, the ‘Sample’, served as a reference point for the investigation of the allegation.
To avoid a lengthy and costly manual review of large numbers of documents, the investigators, together with the lawyers, implemented a sequential three-step technology-supported process to enable the client to make a statement to the authorities in a timely manner. It is important to note that, while the methods were used sequentially in this specific case, the three methods introduced below can be used in conjunction with, but also independently of, each other. Of the three, only the text classification model requires human input or judgement.
Text clustering is the automated aggregation of the document population into subgroups or ’clusters’. The clusters consist of documents with similar meaning and context. The clustering is based on an automated frequency analysis of the words in each document and does not require any assessment or input by the user. A visualisation of these clusters allows the investigator to understand the major concepts in the data more quickly and to review it more efficiently.
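The idea of grouping documents by word frequency can be illustrated with a minimal sketch. It is not the algorithm used by the review platform, which the case description does not name. It assumes plain word counts, cosine similarity and a simple greedy grouping rule; the function names, the threshold of 0.3 and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word occurrences in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose first
    member (the seed) is similar enough, otherwise it starts a new cluster.
    The threshold is an illustrative assumption."""
    seeds, clusters = [], []
    for index, text in enumerate(docs):
        vector = term_frequencies(text)
        for seed, members in zip(seeds, clusters):
            if cosine(vector, seed) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was similar enough.
            seeds.append(vector)
            clusters.append([index])
    return clusters


# Hypothetical sample texts, not taken from the case.
docs = [
    "price agreement with competitor on regional price list",
    "competitor price list and price agreement draft",
    "team lunch on friday at the canteen",
    "friday lunch menu for the team canteen",
]
clusters = cluster_documents(docs)
print(clusters)
```

No reviewer input is needed at any point: the grouping follows from the word counts alone, which is what makes clustering usable before anyone has read a document.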
Using the results of the clustering process, documents conceptually similar to the relevant sample were identified. Next, in a manual review, the investigators reviewed the documents within the review platform and marked these documents as relevant or not relevant to the case. The outcome of this manual review was later used to develop a text classification algorithm (further details in section 3).
The use of keyword search terms during the initial phase of an investigation is a commonly used method to reduce the volume of documents for manual review by the legal counsel. In comparison to the clustering-based approaches, search terms required much more input from investigators. The challenge in using search terms was to define a suitable list of keyword search terms. This list was compiled and refined using an iterative process of trial and error whilst examining the numbers of hits. These refined terms enabled an easier identification of content relevant to the investigation.
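The trial-and-error refinement of a term list relies on knowing how many documents each candidate term hits. A minimal sketch of such a hit report is shown here, assuming case-insensitive whole-word matching; the function name, terms and sample texts are hypothetical and not taken from the case.

```python
import re


def hit_counts(docs, terms):
    """Return, per search term, the number of documents that contain it
    as a whole word or phrase (case-insensitive)."""
    counts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = sum(1 for text in docs if pattern.search(text))
    return counts


# Hypothetical sample texts and candidate terms.
docs = [
    "Please confirm the price list before the meeting.",
    "The meeting with the competitor is on Monday.",
    "Meeting room booked for the quarterly review.",
    "Do not share the price list with anyone.",
]
counts = hit_counts(docs, ["meeting", "price list", "competitor"])

# Terms with very many hits are candidates for narrowing; terms with
# very few hits are candidates for broadening or for dropping.
for term, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{term}: {count} of {len(docs)} documents")
```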
In addition to keyword searching, NLP was able to find words which tended to appear in similar contexts within the document population. This allowed for ‘concept searching’ – a search for a particular word also searched for other words with similar meaning in the document set. For example, a search for ’scared’ also identified documents with the term ’afraid’.
Using a powerful data analytics tool, lists of “similar terms” consisting of variations of terms and other expressions used in similar contexts were generated for an initial keyword. The tool enabled investigators to adjust the emphasis on certain concepts in the search results. The combination of using variations of terms and similar terms with specific emphasis resulted in a higher relevance rate than a pure keyword search.
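One common way to find terms used in similar contexts is to compare the words that surround them. The sketch below illustrates that idea only; the analytics tool used in the case is not named, and its method may differ. The window size, function names and sample sentences are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def context_vectors(docs, window=2):
    """For every word, count the words appearing within `window`
    positions of it anywhere in the document set."""
    vectors = defaultdict(Counter)
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_terms(vectors, keyword, top_n=3):
    """Rank other words by how closely their contexts match the keyword's."""
    query = vectors.get(keyword, Counter())
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word != keyword]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_n] if score > 0]


# Hypothetical sample sentences in which two words share their contexts.
docs = [
    "i am scared of the audit",
    "i am afraid of the audit",
    "she was scared of the regulator",
    "she was afraid of the regulator",
    "the invoice was paid on monday",
]
vectors = context_vectors(docs)
print(similar_terms(vectors, "scared"))
```

Because the ranking depends on shared contexts rather than shared spelling, a code word that consistently replaces a sensitive term would surface in the same way.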
Another benefit of concept searches is that, unlike keyword searches, they can detect when individuals are communicating in code or euphemisms, as those terms will likely show up under “similar terms” in the results.
When search terms or concept searches are selected effectively, they can add value by identifying further highly relevant documents which provide the investigation team with key facts and insights. With this newly acquired knowledge, a text classification model was trained to further advance the analysis, as described below.
In the text classification model, a small set of documents already reviewed and coded by the investigators was submitted to an algorithm which analysed the semantic content of the documents and identified relationships between the semantic content and the review decision (relevant/non-relevant). This process of submitting documents coded as relevant or not relevant by the investigators, in order to identify patterns indicating relevance, is called the ‘training phase’. It can be repeated whenever human reviewers have coded additional documents, to further refine the accuracy of the classification. The outcome of the training phase was a ‘text classification model’.
After identifying patterns present in documents with specific coding decisions, the text classification model used these patterns to automatically classify all the other documents in the entire document population by assigning a probability of relevance to each of the other documents.
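The training and scoring steps can be sketched with a small naive Bayes classifier. This is a stand-in chosen for brevity; the case description does not say which algorithm the platform used. All names and the coded sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(coded_docs):
    """Training phase. coded_docs is a list of (text, is_relevant)
    pairs produced by the manual review."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in coded_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary


def probability_of_relevance(model, text):
    """Score one unreviewed document with a probability of relevance."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Smoothed class prior, then smoothed word likelihoods.
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Turn the two log scores into P(relevant | text); the clamp avoids overflow.
    diff = max(-700.0, min(700.0, log_scores[False] - log_scores[True]))
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical coding decisions from a manual review.
coded = [
    ("agree price increase with competitor", True),
    ("align price list with competitor before tender", True),
    ("lunch menu for friday", False),
    ("holiday schedule and lunch plans", False),
]
model = train(coded)
for text in ["competitor price proposal", "friday lunch", "hello"]:
    print(text, round(probability_of_relevance(model, text), 2))
```

Repeating the training phase amounts to calling `train` again on the enlarged list of coded documents and rescoring the remaining population.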
The investigators focused on the documents assessed as relevant by the model with a high degree of confidence. This represented the quickest way of identifying a large number of documents relevant to the authorities’ request.
This ability of the classification model to evaluate an entire document population was extremely useful. The performance of the model was improved through the continuous consideration of newly reviewed documents. In particular, the effect of repeated training rounds was strong for documents with an initial probability of relevance of around 50%. A probability of relevance of 50% indicates that the model is uncertain which category the document belongs to.
The text classification model was also combined with the results of clustering to make sure that documents from all clusters were reviewed and added to the training set. This ensured that all clusters of concepts were considered for the training phase and reduced the risk of entire classes of documents being excluded from the analysis.
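Since documents scored near 50% are the ones the model is least sure about, one possible way to choose the next review batch is to rank documents by their distance from 0.5. This selection rule is an assumption for illustration; the case description does not state how review batches were chosen. The function name, document identifiers and scores are hypothetical.

```python
def most_uncertain(scores, batch_size):
    """scores maps a document id to the model's probability of relevance.
    Returns the ids closest to 0.5, ties broken by id."""
    ranked = sorted(scores, key=lambda doc_id: (abs(scores[doc_id] - 0.5), doc_id))
    return ranked[:batch_size]


# Hypothetical model scores.
scores = {"doc-1": 0.97, "doc-2": 0.52, "doc-3": 0.08, "doc-4": 0.45, "doc-5": 0.71}
next_batch = most_uncertain(scores, batch_size=2)
print(next_batch)
```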
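Covering every cluster in the training set can be sketched as drawing a few documents from each cluster for review. The sampling scheme, the function name and the cluster labels are assumptions for illustration, not details from the case.

```python
import random


def sample_per_cluster(cluster_of, per_cluster, seed=0):
    """cluster_of maps a document id to its cluster id. Returns, for
    every cluster, up to `per_cluster` randomly chosen document ids."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_cluster = {}
    for doc_id, cluster_id in cluster_of.items():
        by_cluster.setdefault(cluster_id, []).append(doc_id)
    sample = {}
    for cluster_id in sorted(by_cluster):
        doc_ids = sorted(by_cluster[cluster_id])
        sample[cluster_id] = rng.sample(doc_ids, min(per_cluster, len(doc_ids)))
    return sample


# Hypothetical cluster assignments.
cluster_of = {"d1": "pricing", "d2": "pricing", "d3": "pricing",
              "d4": "logistics", "d5": "hr"}
review_sample = sample_per_cluster(cluster_of, per_cluster=2)
print(review_sample)
```

Small clusters contribute all of their documents, so no group of concepts is left out of the training phase merely because it is rare.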
Despite the limited initial information of 20 email communications, the investigators were able to examine a set of approximately 2,400 documents from an initial population of around one million, all within the two-week deadline. Of these, approximately 800 documents were assessed as relevant, which represents a relatively high relevance rate of 33% within the targeted subset provided by the analytics. This was far more effective than a standard process using keyword search terms.
The combination of technology-based methods used enabled the investigators to identify a substantial number of documents which were relevant for the case. Based on these documents, our client’s legal counsel was able to provide a conclusive statement to the foreign authority.
Modern investigations are technology-based and multi-disciplinary. The close cooperation between our client’s legal advisors and our experienced e-discovery team remains one of the key success factors in the defensible conduct of technology-driven investigations.