Doc AI — Problems and Solutions for Unmatched Accuracy

Anil Sharma
Published in trillo-platform · 7 min read · Mar 26, 2024

This article sits somewhere between a whitepaper and a cheat sheet. It introduces the concept of applying an LLM to your documents, along with the associated problems and solutions.

Doc AI is surrounded by hype, which creates the false notion that it is simple and easily attainable. In reality, using Doc AI with high accuracy requires solving several problems. Below, we outline some of these challenges and their solutions, along with a pathway to solving the others.

Note: The term “document” means text content that can come from any source: a file, an API, an email, etc.

What is Doc AI?

For this article, Doc AI is defined as a Large Language Model (LLM) such as ChatGPT or Gemini applied to a set of your documents or content.

In general, document repositories contain a trove of information that is not easily accessible to users unless they first know which documents to open and then read through each one. Given the massive size of most repositories, this is not realistic.

Doc AI, using LLMs along with Natural Language Processing (NLP), revolutionizes finding information within document and content repositories, with accuracy in the high-90s percent range.

Doc AI — Search and Query Your Document

Doc AI Usage Patterns

The following Doc AI patterns illustrate how it works.

1. Extract structured data from unstructured data with very high accuracy: This works using traditional NLP parsers (for example, Google’s Document AI processors) and schema-driven parsers based on LLMs. The goal of high accuracy makes this a tough problem; several of the specific challenges are discussed below.

2. Semantic search: This means searching documents using a natural-language phrase; think of it as Google Search on your content. It is more complex than plain indexing: LLMs are very sensitive to chunking and prompts, so chunking (partitioning paragraphs or blocks) while maintaining the right context is very important (see the sketch after this list).

3. Question and answers: This is more than just a keyword or phrase search. Doc AI using an LLM enables you to ask context-sensitive questions, for example, “What is the total value of contracts signed in the last year?”

4. Summarization: This provides a concise overview of the document.

5. Inference: This is a more advanced use case of LLMs. It is about inferring something from the document — for example, an assessment of malpractice from a medical transcription, or the problem that may be causing a symptom such as “I can’t connect to the internet”. The LLM may return a contextually relevant answer such as “Can you check if your router has a blinking red light?”.

6. Hybrid of all: In general, each use case will be a combination of some or all of the above.
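
To make pattern 2 concrete, here is a minimal sketch of semantic search: chunk the document, embed each chunk, and rank chunks by similarity to the query. It assumes the open-source sentence-transformers package and a naive paragraph-based chunker; neither is prescribed by Doc AI itself, and real systems need smarter chunking (discussed later).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_by_paragraph(text: str) -> list[str]:
    # Naive chunker: one chunk per paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def semantic_search(document: str, query: str, top_k: int = 3) -> list[str]:
    chunks = chunk_by_paragraph(document)
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # dot product = cosine (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]  # indices of the top-k chunks
    return [chunks[i] for i in best]
```

Because the vectors are normalized, the dot product equals cosine similarity; the top-ranked chunks are then typically passed to an LLM as context for Q&A (pattern 3).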

Use Cases

The use cases for Doc AI are numerous. Below, we divide the use cases into simple and complex. The complex use cases require a Retrieval-Augmented Generation (RAG) approach or model fine-tuning.

Simple Use Cases

1. Parse paper forms: This use case refers to parsing digitized versions of paper forms such as POs, invoices, medical reports, bank statements, etc. It is a simple use case with one complication: table parsing is still inaccurate. We discuss the problem and its solution in more detail below.

2. Search and Q&A on a customer service portal: Searching the customer service portal of a bank, cable service provider, or manufacturer can be incredibly frustrating. These searches often return thousands of documents, and finding the needed answer in this sea of results is often exasperating and fruitless. Using Doc AI with semantic search and Q&A returns immediate, succinct answers to the user.

3. Multiple use cases for data extraction and search: Several use cases require data extraction and semantic-search indexing (vectorization). These support user interactions based on a query, which requires either a data lookup, a search, or Q&A using LLM prompts. Example queries might include:

  • “What are risk factors in the contract?”
  • “What is the value of the lease?”
  • “How many leases will expire in December 2024?”
  • “Which customers are likely to cancel the service?”

Additional use cases can be extrapolated from the above examples.
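
As a minimal illustration of the data-extraction side, the sketch below asks an LLM to fill a JSON schema from raw document text. The schema, the prompt wording, and the `call_llm` stub are all illustrative assumptions; plug in whatever model client you use.

```python
import json

# Illustrative schema for a lease document; the field names are assumptions.
LEASE_SCHEMA = {
    "lessee": "string",
    "lease_value_usd": "number",
    "expiration_date": "YYYY-MM-DD",
}

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to your LLM client (OpenAI, Gemini, etc.).
    raise NotImplementedError

def extract_fields(document_text: str) -> dict:
    prompt = (
        "Extract the following fields from the document and return only JSON "
        f"matching this schema: {json.dumps(LEASE_SCHEMA)}\n\n"
        f"Document:\n{document_text}"
    )
    return json.loads(call_llm(prompt))
```

Once fields such as the lease value and expiration date are extracted into a database, a query like “How many leases will expire in December 2024?” becomes an ordinary lookup.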

Complex Use Cases

More complex use cases of Doc AI deal with reasoning or matching (pairwise comparison of two documents’ vectors).

1. Document Matching: Matching a document against another set of documents to find similarities and differences. A simple example could be evaluating a newly written document to judge whether it complies with existing corporate policy documents (sketched after this list). Another example might be determining the risk factors of a new contract compared to contracts written in the past.

2. Inference: There are cases when information is inferred based on historical documents. For example, given a set of contracts, infer what risks are missed. Or, given a set of medical record guidelines for ICD/CPT code assignment, infer the ICD/CPT code from a medical transcription.
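
A minimal sketch of document matching, reusing the embedding setup from the earlier search sketch: embed the chunks of a new document and of a reference (e.g., policy) document, then flag new chunks that have no sufficiently similar counterpart. The similarity threshold is an illustrative assumption that needs tuning per corpus.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_unmatched(new_chunks: list[str], ref_chunks: list[str],
                   threshold: float = 0.6) -> list[str]:
    new_vecs = model.encode(new_chunks, normalize_embeddings=True)
    ref_vecs = model.encode(ref_chunks, normalize_embeddings=True)
    sims = new_vecs @ ref_vecs.T   # all pairwise cosine similarities
    # A new chunk is "unmatched" if its best reference match falls below threshold.
    return [c for c, row in zip(new_chunks, sims) if row.max() < threshold]
```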

Potential difficulties when using Doc AI and LLMs

Several major issues must be solved when using Doc AI with LLMs to avoid skepticism and disappointment. Let’s examine them.

Potential Problems

1. OCR or native PDF parsing inaccuracies: Many documents are stored as PDF files (including form-filled and scanned images of existing documents). Existing PDF extraction libraries often generate inaccurate results, including improper ordering of text content, incorrectly interpreted characters, and characters associated with the wrong text line. Starting with bad data leads to faulty and inaccurate results.

2. Table parsing: Table parsing is still a hard problem. It is difficult to identify the boundaries of columns and rows in a PDF. Even when ruling lines are present, embedded tables, merged cells, etc. make it hard to correlate column labels with cell values. Content that overflows across columns introduces yet another dimension.

3. Chunking changes context: One chunking strategy does not fit all documents (even documents of the same type), and a poorly placed chunk boundary can strip away the context the LLM needs.

4. Prompt sensitivity: Doc AI uses prompts behind the scenes to extract data. The quality of data extraction changes with the prompt design.

5. LLM model upgrade may break a working system: While upgrading an LLM model may provide benefits, it may also break features and accuracy that existed with the old model.

6. Processing millions of documents while tracking failures: When running millions of documents through a pipeline, there will be failures that need to be tracked and fixed. This requires a solid foundation platform that can manage a large-scale compute farm.

7. Access control in search, Q&A, and summarization: Without a proper authorization system (such as RBAC) in place, millions of documents may be available to unauthorized users. This could result in confidential and private data being viewed by users who do not have the need or authority to view such information.

8. Integration: Although this is more of a software engineering problem, Doc AI will likely need to be integrated into your application. The following integration points are needed:

  • Data Ingestion: Not all content may come from a file system. Data may come via APIs, SFTP, enterprise applications, and content management systems.
  • Export Results: Not all resultant data may be consumed by a human actor. Exported data may be collected and pushed to other automated systems.
  • Enterprise Workflow: Doc AI may be a part of an enterprise application. For example, an insurance underwriter chooses a policy document and the rest of the Doc AI work happens in the context of the policy.

Problem Solutions

1. OCR or native PDF parsing inaccuracies: Use the character matrix (per-character coordinates) and configurable line-spacing hints to rebuild the content, rather than relying on the PDF extraction library’s text output. A sketch follows.
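
The code below sketches this approach using pdfplumber’s per-character output (its character matrix): characters are grouped into rows by a configurable vertical tolerance and spaces are inserted on visible horizontal gaps, instead of trusting the library’s built-in text extraction. The tolerance and gap values are illustrative hints that must be tuned per document family.

```python
import pdfplumber

def rebuild_lines(pdf_path: str, line_tol: float = 3.0, gap: float = 1.0) -> list[str]:
    lines: list[str] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Group characters into rows by vertical position (configurable tolerance).
            rows: list[tuple[float, list[dict]]] = []
            for ch in sorted(page.chars, key=lambda c: c["top"]):
                if rows and abs(rows[-1][0] - ch["top"]) <= line_tol:
                    rows[-1][1].append(ch)
                else:
                    rows.append((ch["top"], [ch]))
            for _, chars in rows:
                chars.sort(key=lambda c: c["x0"])   # left-to-right within the row
                text, prev = "", None
                for c in chars:
                    # Insert a space when there is a visible horizontal gap.
                    if prev is not None and c["x0"] - prev["x1"] > gap:
                        text += " "
                    text += c["text"]
                    prev = c
                lines.append(text)
    return lines
```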

2. Table parsing: Table parsing can be improved by applying a few techniques (sketched after this list):

  • Extract only the required columns (specify a minimal set).
  • Stretch columns horizontally using the character matrices returned by the library to create space between columns. Using the vicinity of characters, assign overlapping content to the nearest column (on the left or right).
  • Use an LLM to associate column labels with cell values.
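
A sketch of the column-boundary idea from the second bullet: using per-character x-coordinates (e.g., pdfplumber rows grouped as in the previous sketch), find horizontal gaps between characters, merge nearby gap midpoints into candidate column boundaries, and assign each character to a column. The thresholds are illustrative and need per-layout tuning.

```python
def find_column_boundaries(rows: list[list[dict]], min_gap: float = 8.0) -> list[float]:
    """rows: lists of char dicts with 'x0'/'x1', each sorted left-to-right."""
    midpoints: list[float] = []
    for row in rows:
        for left, right in zip(row, row[1:]):
            if right["x0"] - left["x1"] >= min_gap:        # a visible horizontal gap
                midpoints.append((left["x1"] + right["x0"]) / 2)
    # Merge nearby midpoints into a single candidate boundary per column break.
    boundaries: list[float] = []
    for m in sorted(midpoints):
        if not boundaries or m - boundaries[-1] > min_gap:
            boundaries.append(m)
    return boundaries

def assign_columns(row: list[dict], boundaries: list[float]) -> list[str]:
    cells = [""] * (len(boundaries) + 1)
    for ch in row:
        col = sum(ch["x0"] > b for b in boundaries)        # count boundaries to the left
        cells[col] += ch["text"]
    return cells
```

The resulting cell strings can then be handed to an LLM (third bullet) to associate column labels with values, which tolerates residual misalignment better than rigid positional rules.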

3. Chunking changes context: Chunk documents in two or three different ways and use each set for data extraction; pick the result with the best score, as sketched below.
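
A minimal sketch of that strategy with two naive chunkers; `extract` and `score` are placeholders for your LLM extraction step and a quality metric (for example, how many schema fields were filled).

```python
def chunk_fixed(text: str, size: int = 2000) -> list[str]:
    # Fixed-size character windows; simple but can split mid-sentence.
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_paragraphs(text: str) -> list[str]:
    # Paragraph-aligned chunks; preserves local context better.
    return [p for p in text.split("\n\n") if p.strip()]

def best_extraction(text: str, extract, score):
    # Run extraction per chunking strategy and keep the best-scoring result.
    results = [extract(chunker(text)) for chunker in (chunk_fixed, chunk_paragraphs)]
    return max(results, key=score)
```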

4. Prompt sensitivity: Use multiple prompts and few-shot examples, and let the system pick the best result, as sketched below.
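
The same pick-the-best pattern applies across prompts. In this sketch the templates, the few-shot placeholder, and the scoring rule are all illustrative assumptions:

```python
PROMPTS = [
    "Extract {fields} from the text below and return JSON.\n\n{text}",
    # A few-shot variant; the worked example is elided here.
    "You are a careful data-entry clerk. Return JSON with {fields}.\n"
    "Example: ...\n\nText:\n{text}",
]

def best_prompt_result(text: str, fields: str, call_llm, score):
    # Try each prompt template and keep the highest-scoring answer.
    candidates = [call_llm(p.format(fields=fields, text=text)) for p in PROMPTS]
    return max(candidates, key=score)
```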

5. LLM model upgrade may break a working system: Track previously processed documents and, before upgrading, re-run a sample of them (as many as is economically feasible) through the new model, generating a report that highlights differences for QA.
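
A sketch of that regression check: re-run stored samples through the new model and report field-level differences. The record layout and the `run_new_model` callable are illustrative assumptions.

```python
def diff_report(samples: list[dict], run_new_model) -> list[dict]:
    report = []
    for doc in samples:
        old, new = doc["stored_result"], run_new_model(doc["text"])
        # Collect every field whose value changed between model versions.
        changed = {k: (old.get(k), new.get(k))
                   for k in set(old) | set(new) if old.get(k) != new.get(k)}
        if changed:
            report.append({"doc_id": doc["id"], "diffs": changed})
    return report
```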

6. Processing millions of documents while tracking failures: Report failures via a database table and a UI (with drill-down into the logs of each step); a sketch of the bookkeeping follows.
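
A minimal sketch of that bookkeeping, using sqlite3 for illustration (a production pipeline would use a shared database behind the UI):

```python
import sqlite3
import traceback

con = sqlite3.connect("pipeline.db")
con.execute("CREATE TABLE IF NOT EXISTS failures (doc_id TEXT, step TEXT, error TEXT)")

def run_step(doc_id: str, step_name: str, fn, *args):
    # Run one pipeline step; on failure, record the full traceback for drill-down.
    try:
        return fn(*args)
    except Exception:
        con.execute("INSERT INTO failures VALUES (?, ?, ?)",
                    (doc_id, step_name, traceback.format_exc()))
        con.commit()
        return None
```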

7. Access control in natural language queries: This requires two steps:

  • Storing the access control matrix: First, Doc AI needs to know the permissions available to each user. If the number of users is more than a few tens, this may require complex configuration (error-prone if done manually). A good solution is to capture the information from the source system (wherever you are getting the content) or from a set of policy documents.
  • Applying the access control matrix efficiently: This becomes complex and slow if the access control rules are complex and need to be applied to millions of documents. A sketch of query-time filtering follows.
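
A sketch of the query-time half: resolve the user’s groups, then filter results before any content reaches the LLM (in practice this is better pushed down as a metadata pre-filter in the vector store). The group model here is a deliberately simple assumption.

```python
USER_GROUPS = {"alice": {"hr", "finance"}, "bob": {"engineering"}}   # illustrative

def allowed(user: str, doc_meta: dict) -> bool:
    # A document is visible if the user shares at least one group with it.
    return bool(USER_GROUPS.get(user, set()) & set(doc_meta["groups"]))

def filter_results(user: str, results: list[dict]) -> list[dict]:
    return [r for r in results if allowed(user, r["meta"])]
```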

8. Integration: Use a platform that makes it very easy to write RESTful connectors.

9. Use of metadata and an admin console: During a POC, domain experts need to try various prompts, chunk sizes, and schemas (the data to be extracted). An interface should be provided so users can experiment with a limited data set to find the best metadata. Results should be tracked for each run so users can compare them.

10. Continuous Quality Assurance (QA): Collect audit logs and generate reports so a team of experts can QA the system periodically.

Trillo Doc AI — An Out-of-the-Box Solution to the Above Problems

Trillo Doc AI is a solution built on top of Trillo Workbench that solves the above problems out of the box. It can be deployed easily in your GCP or Azure environment (AWS support is coming soon). You can create a POC with a handful of documents in a week or two. If the results meet your expectations, you can promote the POC to a production deployment and start using it.

Trillo provides implementation services to assist you in the adoption of Doc AI technology (directly or partner-assisted).

The foundation platform, Trillo Workbench, provides extreme agility to customize Trillo Doc AI to your needs.

Anil Sharma
Founder and architect of cloud-based flexible UI platform trillo.io.