
AWS IDP vis-à-vis Trillo Doc AI on GCP for RAG

May 28, 2024

Many popular and powerful applications of Generative AI (Gen AI) are built on Retrieval-Augmented Generation (RAG). RAG-based solutions require enterprise content, not as original files but in a form that can be used in Gen AI prompts. Unless there is an enterprise strategy, each team builds its own ad hoc pipelines to acquire content in the right format, a time-consuming and tedious process that stifles innovation. This document discusses and compares two platforms that can accelerate an enterprise content strategy for Gen AI.

RAG in a nutshell: a RAG application is a ChatGPT- or Gemini-like application that uses some of your organization's content in the context of the prompt.
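To make that concrete, here is a minimal sketch of how a RAG prompt is assembled. The retrieve_chunks helper is a hypothetical placeholder for whatever retrieval layer (vector database or search index) your pipeline provides.

    # Minimal RAG prompt assembly (illustrative sketch).
    def retrieve_chunks(question: str, top_k: int = 3) -> list[str]:
        """Hypothetical helper: return the top-k enterprise content chunks
        relevant to the question, e.g. from a vector DB or search index."""
        raise NotImplementedError

    def build_rag_prompt(question: str) -> str:
        # 1. Retrieve relevant chunks of your organization's content.
        context = "\n\n".join(retrieve_chunks(question))
        # 2. Place them in the context section of the prompt and
        #    ask the LLM to answer using only that context.
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\n"
        )

The quality of the answer depends directly on how well the enterprise content behind retrieve_chunks has been processed, which is the subject of the rest of this article.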

AWS Intelligent Document Processing (IDP) and Trillo Doc AI are platforms to make fully baked content readily available for generative AI (Gen AI) applications and research.

AWS IDP is a reference architecture of document processing with sample open source code. Trillo Doc AI is a production grade platform available through the Google Cloud marketplace.

There is a wide gap between what AWS IDP provides and what a production-grade document processing platform requires. The gap comes from the missing pieces: a scalable pipeline that can process hundreds of millions of documents, security, role/rule/policy-based access control, audit logs, business process integration, and so on. Trillo Doc AI is a production-grade platform that bridges this gap.

Whereas AWS IDP consists of sample code in a public Git repo, Trillo Doc AI provides an open metamodel and a serverless functions framework, along with the source code of the serverless functions and metadata. It relies on a core platform called Trillo Workbench for the remaining services: scalability, security, access control, audit logs, and an admin UI (source code available on request). Trillo Workbench source code is not available except for the admin UI, but you will never need it; it is a zero-cost, robust, proven, enterprise-grade workhorse.

The rest of this article covers the following topics:

  1. Why is document processing important for Gen AI?
  2. A conceptual architecture of document processing.
  3. Trillo Doc AI architecture on GCP.
  4. AWS IDP reference architecture.
  5. A comparison checklist for AWS IDP and Trillo Doc AI.

There are other comparable solutions such as AODoc, FileNet, Box, etc. Some of them stop at content management (ingestion plus a UI to manage folder hierarchy and file sharing) and do not provide the further document processing needed for RAG applications. Some do not provide an open architecture the way AWS IDP and Trillo Doc AI do. An open architecture or open source is important for the fast pace of Gen AI innovation. Therefore, in this document we limit ourselves to AWS IDP and Trillo Doc AI.

Disclaimer: Other tools have their merits; it is our opinion that they do not provide an open and convenient programming model for customization. This is a fast-changing area, so this information may not be the latest.

Why Document Processing is important

Every organization is looking to apply Gen AI to business use cases. There are two primary categories of use cases.

  1. Content generation using prompts, such as code generation, image/video generation, and text content such as marketing material.
  2. Generative AI RAG: there are numerous use cases of RAG in an enterprise. RAG solutions provide contexts along with prompts to generate responses, and each context is built from the enterprise content.

RAG-based use cases, innovation, and research require that enterprise content be made readily available in the right format to the teams of innovators and implementers. Without it, it is like asking teams to build an analytics application without a data warehouse: the teams have to acquire content first, which may take months and be a very inefficient process. By the time you finish acquiring content, your competitors will already have won the battle.

To conclude, the first step in spurring Gen AI innovation is to make enterprise content readily available. The most efficient way to get the content ready is to have an enterprise strategy and use the proper tooling.


Generic Architecture of Document Processing

The following diagram shows the architecture of a generic document processing system.

[Diagram: generic document processing architecture]

A generic document AI processing pipeline consists of the steps below. Trillo Doc AI and AWS-IDP pipelines are specific implementations of it using GCP and AWS services respectively.

  1. Ingestion: Content from multiple sources is brought into a cloud storage bucket for processing.
  2. Processing: The document processing pipeline processes the content using a set of NLP services, LLM services, and open source libraries. It runs multiple substeps:
     a. Extraction: Text and images are extracted from each document. The raw text and images are stored in new locations in buckets.
     b. Redaction: Optionally, PHI and PII data is redacted from text and images before storing.
     c. Structured Data Extraction: Text is parsed to extract structured data from documents. This requires a multitude of steps: entity extraction, form field extraction, and extraction of data in tables.
  3. Chunking: Text is split into smaller yet complete chunks, i.e., avoiding splits that break up the central idea of a chunk and dilute its semantic meaning (a minimal chunking sketch follows this list).
  4. The above two steps require the ingestion of domain knowledge into the pipeline.
  5. Storage: Processed data is stored in a variety of databases — relational DBs, vector DBs, search indexing platforms.
  6. RBAC: Sometimes the source document repositories (such as enterprise content management systems) provide access control information. This information is stored and applied in the API gateway, and it can also be enriched or customized.
  7. API Gateway: An API gateway delivers content over HTTPS, or via bulk transfer from bucket to bucket. RAG pipelines and domain-specific applications are consumers of the API gateway.
  8. Admin UI: The document processing platform provides an Admin UI for managing and monitoring the pipeline.
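As referenced in the chunking step, the sketch below shows one simple sentence-aware chunking approach with overlap. The size limit and the naive sentence splitter are illustrative assumptions, not part of either platform.

    import re

    def chunk_text(text: str, max_chars: int = 1200, overlap_sentences: int = 1) -> list[str]:
        """Split text on sentence boundaries so a chunk's central idea
        is not cut mid-sentence. max_chars and overlap_sentences are
        illustrative defaults."""
        # Naive sentence split on '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        chunks, current = [], []
        for sentence in sentences:
            candidate = " ".join(current + [sentence])
            if current and len(candidate) > max_chars:
                chunks.append(" ".join(current))
                # Carry the last sentence(s) over for context continuity.
                current = current[-overlap_sentences:] + [sentence]
            else:
                current.append(sentence)
        if current:
            chunks.append(" ".join(current))
        return chunks

In practice this is also where domain knowledge, such as headings, sections, and table boundaries, feeds into the pipeline, as noted in the list above.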

Trillo Doc AI on GCP for Document Processing

[Diagram: Trillo Doc AI document processing pipeline on GCP]

The Trillo Doc AI processing pipeline is a specific implementation of the generic document processing pipeline using GCP services such as Google Vertex AI (which includes Gen AI and LLM APIs), Google Document AI (NLP-based parsers for forms, tables, purchase orders, invoices, etc.), AlloyDB (vector database), Cloud SQL (managed relational database), and Google Cloud Storage for buckets.
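To illustrate how the extraction substep maps onto these services, here is a minimal sketch that downloads a PDF from a Cloud Storage bucket and extracts its text with a Document AI processor. This is not Trillo Doc AI source code; the project, location, processor ID, and bucket names are hypothetical placeholders.

    from google.cloud import documentai, storage

    # Hypothetical identifiers; replace with your own project, processor, and bucket.
    PROJECT_ID = "my-project"
    LOCATION = "us"
    PROCESSOR_ID = "my-docai-processor-id"

    def extract_text(bucket_name: str, blob_name: str) -> str:
        """Download a PDF from GCS and run it through a Document AI processor."""
        # Download the raw PDF bytes from the ingestion bucket.
        pdf_bytes = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()

        # Send the document to Document AI for OCR / parsing.
        client = documentai.DocumentProcessorServiceClient()
        name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)
        request = documentai.ProcessRequest(
            name=name,
            raw_document=documentai.RawDocument(
                content=pdf_bytes, mime_type="application/pdf"
            ),
        )
        result = client.process_document(request=request)
        return result.document.text

The extracted text would then flow into the redaction, structured data extraction, and chunking steps described earlier.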

AWS IDP for Document Processing

[Diagram: AWS IDP document processing pipeline]

The AWS-IDP processing pipeline is a specific implementation of the generic document processing pipeline using AWS services, except for three missing pieces noted below. The above diagram looks different because it exposes steps that are hidden within the Trillo Doc AI diagram; otherwise, the two are equivalent.

The missing pieces from AWS-IDP are:

  1. RBAC: AWS-IDP does not provide a component for role-based access control, which is critical for enterprise applications.
  2. Folder Hierarchy: There is no mention of a folder hierarchy, and upon visual inspection of the code repository, I did not find one. A logical folder hierarchy is very important for using documents across multiple applications.
  3. Search Engine: For short phrases and keyword queries, it does not integrate with a search engine (such as Solr or Elasticsearch, which rank results using TF-IDF and BM25).

Of these three, the third is easy to add to the pipeline; the first and second would require several months of work, if not years.
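For example, the lexical-search gap can be approximated with an off-the-shelf BM25 implementation. The sketch below assumes the third-party rank_bm25 package and an in-memory toy corpus; a production deployment would use Solr or Elasticsearch as noted above.

    # Requires: pip install rank_bm25
    from rank_bm25 import BM25Okapi

    # Toy corpus of document chunks; in practice these come from the pipeline.
    corpus = [
        "Invoice number INV-1043 issued to Acme Corp",
        "Purchase order for 20 laptops, net 30 payment terms",
        "Quarterly security audit log summary",
    ]

    # BM25 operates on tokenized text; whitespace tokenization keeps the sketch simple.
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    query = "invoice inv-1043".split()
    scores = bm25.get_scores(query)            # one relevance score per document
    best = bm25.get_top_n(query, corpus, n=1)  # the best-matching chunk
    print(scores, best)

Keyword-level ranking like this complements dense embeddings for short queries such as IDs, part numbers, and names.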

Checklist — a comparative analysis of Trillo Doc AI and AWS IDP

[Table: comparison checklist of Trillo Doc AI and AWS IDP]

When to build your own platform?

Document processing as described above is glue logic. Normally you would not spend millions of dollars building it if you can subscribe to a third-party product for thousands.

You may consider building your own document processing pipeline if all of the following criteria are met.

  1. The content processing needs are simple.
  2. There are few document formats (mostly PDF and docx).
  3. There is no data extraction needed, especially from complex tables.
  4. Your use case does not require access control, governance, or compliance.
  5. You will build it once and do not see much need to change, maintain, or enhance it.
  6. The number of documents to be processed is in the thousands.
  7. You will keep using it for more than 5 years.


Written by Anil Sharma

Founder and architect of cloud-based flexible UI platform trillo.io.
