Make Documents Searchable with AI
Extract text from any document and search for specific details — ask for a passport number, get the exact number and the name it belongs to.
The Challenge
Users upload PDFs, photographs, screenshots, and scanned documents. The app needs to extract meaningful, searchable content from all of these — regardless of format or quality.
The Approach
A multi-stage pipeline that processes documents through classification, OCR, entity extraction, and tagging:
- PDF to image conversion using Docling (self-hosted)
- Document classification to identify text documents vs other content
- High-resolution OCR via OpenAI LLMs for text extraction
- Entity extraction to identify specific information — PAN numbers, passports, IDs — and link related data (e.g., passport number with holder’s name)
- AI-generated tags based on extracted content
Implementation
Documents first get converted to images. The system classifies whether the image contains text. Text documents go through OCR using OpenAI’s vision capabilities at high resolution.
Once text is extracted, the LLM identifies entities — identity documents, dates, amounts, names. It also generates tags based on the content, making documents discoverable through search.
The pipeline handles PDFs, photos of documents, screenshots, and scans through the same flow.
Results
Users upload any document and it becomes searchable. Identity documents are automatically recognised and tagged. Content that was previously locked in images is now fully indexed and retrievable.
Search for “passport number” and get the exact number along with the name it belongs to. The extracted entities make precise queries possible — not just keyword matching, but actual structured retrieval.