
Saving Time for Deeper Analysis

Deloitte's form interpreter and extraction tool "Formalyzer" transforms semi-structured forms into tables

Deloitte's Formalyzer tool solves common OCR (Optical Character Recognition) issues by combining multiple Computer Vision and Natural Language Processing (NLP) methods to automate the mundane work of transcribing forms.

The Need

Sound analysis is based on data… often, the more, the better. Organizations have long relied on machines to analyze large volumes of data, and the hunger for data has only increased with the ubiquity of Artificial Intelligence (AI) use cases. For machines to process data, the data must be structured as a table or database that can be queried programmatically. Yet data doesn't always come perfectly formatted. Many documents in digital form are "unstructured" narrative or "semi-structured" forms: available, just not easily used. Other digital documents are mere images, scans without an embedded OCR text layer, from which content cannot even be selected.

The ubiquitous Portable Document Format (PDF) guarantees formatting consistency in a compact file size. It is also notoriously unhelpful to those seeking to extract tabular data from its contents. The difficulty lies in the fundamental design of PDFs: they are meant to be easily read by humans, not machines. Unlike spreadsheets, which store values in cells, rows, and columns, PDFs store tables and text as drawing instructions that place characters and lines at positions on the page, so the table structure itself is not recorded. Formatted templates, with fields distributed across the page and often mixed with images, are no better. Futile copy-paste attempts leave a chaotic mess of concatenated and often out-of-sequence numbers.
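
To make the copy-paste problem concrete, here is a minimal Python sketch. It assumes a page has already been reduced to text fragments with page coordinates; the fragment data, the function names, and the tolerance value are all hypothetical and do not describe how any particular PDF reader or Formalyzer works. It contrasts reading fragments in file order with regrouping them into rows by vertical position.

```python
# Hypothetical text fragments as (text, x, y), in the order they happen to be
# stored in the file. In default PDF coordinates, y grows upwards on the page.
fragments = [
    ("Revenue", 50, 700), ("Costs", 50, 680),
    ("1200", 200, 700), ("900", 200, 680),
    ("2023", 200, 720), ("Item", 50, 720),
]


def naive_text(frags):
    """Concatenate fragments in storage order, as a naive copy-paste might."""
    return " ".join(text for text, _, _ in frags)


def rebuild_rows(frags, y_tolerance=5):
    """Group fragments into rows by similar y, then order each row left to right."""
    # Sort top-to-bottom (descending y), then left-to-right (ascending x).
    ordered = sorted(frags, key=lambda f: (-f[2], f[1]))
    rows = []
    current_y = None
    for text, _x, y in ordered:
        # Start a new row whenever the vertical position jumps.
        if current_y is None or abs(y - current_y) > y_tolerance:
            rows.append([])
            current_y = y
        rows[-1].append(text)
    return rows


print(naive_text(fragments))
for row in rebuild_rows(fragments):
    print(row)
```

The naive reading mixes labels and numbers from different rows, while the position-based grouping recovers a table-like structure. Real forms are harder, since fields are scattered across the page rather than aligned in rows.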

The result: analysts are too often left with no option other than to manually transfer data into editable formats such as spreadsheets. This labor-intensive and error-prone process ties up qualified staff with menial tasks. It is not only a costly productivity drain; it also invites fatigue-related manual errors and leaves analysts less time to do what they were hired to do… analyze.

Our solution: Formalyzer

Deloitte’s table extraction tool Formalyzer addresses this very issue, combining multiple Computer Vision and Natural Language Processing (NLP) methods into a simple solution to this all-too-common problem. Formalyzer learns the layout of a particular form from a small sample of documents: users "train" the tool's neural networks to recognize the locations on the page from which to extract text or numerical values. Just a handful of samples suffice to create the templates that Formalyzer follows. Users may even specify multiple templates (to handle multiple pages, inconsistent formats, or imperfectly scanned forms), and Formalyzer will use the most successful ones to extract the contents.

The templates then equip Formalyzer to process thousands of similar forms, extracting the values distributed across the pages into the fields of a database. The input forms may be either PDFs with embedded text (an OCR layer) or so-called “dirty scans” (images only). The intuitive graphical user interface guides the user through training new templates, uploading documents for processing, viewing individual results on-screen, and exporting results for further processing in other applications.
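
The template idea described above can be illustrated with a small Python sketch. It is not Formalyzer's implementation: where Formalyzer trains neural networks, this sketch swaps in hand-written bounding boxes, and every name in it (Word, Region, extract_best, the sample templates and pages) is hypothetical. It assumes each page is already available as words with normalized page coordinates, applies several templates, keeps the one that fills the most fields, and writes the results as CSV.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal position on the page, 0 (left) to 1 (right)
    y: float  # vertical position on the page, 0 (top) to 1 (bottom)


@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract(words, template):
    """Apply one template (field name -> Region) and return field -> text."""
    row = {}
    for field, region in template.items():
        hits = [w for w in words if region.contains(w)]
        hits.sort(key=lambda w: (round(w.y, 2), w.x))  # rough reading order
        row[field] = " ".join(w.text for w in hits)
    return row


def fill_rate(row):
    """Fraction of fields that received a non-empty value."""
    if not row:
        return 0.0
    return sum(1 for value in row.values() if value) / len(row)


def extract_best(words, templates):
    """Try every template and keep the result with the most filled fields."""
    return max((extract(words, t) for t in templates), key=fill_rate)


def to_csv(rows):
    """Serialize extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Two hypothetical layouts of the same form: the name field sits at a
# different height, the total field is in the same place.
template_a = {"name": Region(0.1, 0.1, 0.5, 0.2), "total": Region(0.6, 0.8, 0.9, 0.9)}
template_b = {"name": Region(0.1, 0.3, 0.5, 0.4), "total": Region(0.6, 0.8, 0.9, 0.9)}
templates = [template_a, template_b]

page_1 = [Word("Acme", 0.15, 0.15), Word("GmbH", 0.25, 0.15), Word("1,250.00", 0.8, 0.85)]
page_2 = [Word("Beta", 0.15, 0.35), Word("AG", 0.22, 0.35), Word("980.50", 0.8, 0.85)]

rows = [extract_best(page, templates) for page in (page_1, page_2)]
print(to_csv(rows))
```

Scoring by the share of filled fields is only one simple way to decide which template was "most successful"; a production tool would likely also validate the extracted values, for example by checking that amounts parse as numbers.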


Advantages/Benefits

  • Analysts can focus on analysis rather than data collection and aggregation
  • Fewer transcription errors
  • Automatically finds variables and their associated values
  • Stores these as a table in CSV or another file format
  • Reads thousands of documents via batch processing
  • Flexible enough to take on new and multiple formats
  • Can be deployed anywhere: on a public or private cloud or on local machines

Example Use Cases

  • Facilitating balance sheet analysis (e.g., for underwriting SMEs and corporates)
  • Reading tax forms or other non-tabular forms
  • Automating data entry (such as from Energy Performance Certificates)
  • Integrating into existing workflows: ingesting scans and passing results to the next process step

Here you can download the Formalyzer fact sheet
