Government · AI & Automation · 2024

Privacy-First AI for Document Digitization

An on-premise AI system automating the digitization of physical government records containing mixed English and Devanagari scripts — built for strict data sovereignty with zero cloud dependency.

Client

Regional Public Sector Authority (NDA)

Role

AI Lead & Backend Engineer

Timeline

1 week

Team

1 dev

Overview

A regional government authority needed to digitize thousands of physical archival records containing mixed English and Devanagari scripts. The challenge: data privacy laws required all processing to stay entirely on-premise, ruling out cloud OCR or LLM APIs. We built a fully local AI pipeline delivering high-accuracy extraction, automated script filtering, and structured database output.

Process

Designed a multi-stage pipeline: image enhancement via OpenCV → OCR text extraction → local LLM script filtering via Ollama → structured SQL storage → Streamlit dashboard for staff review. Each stage was optimized for accuracy and speed on government-spec hardware.
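The stage chain above can be sketched as a simple composition of functions. This is an illustrative sketch, not the project's actual module layout; the stage names and stub bodies are assumptions:

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(document: Any, stages: list[Stage]) -> Any:
    """Pass a document through each stage in order."""
    for stage in stages:
        document = stage(document)
    return document

# Stub stages standing in for the real OpenCV / OCR / LLM / SQL steps
def enhance_image(doc): return {**doc, "enhanced": True}
def extract_text(doc): return {**doc, "text": "..."}
def classify_script(doc): return {**doc, "script": "english"}
def store_record(doc): return {**doc, "stored": True}

result = run_pipeline({"path": "scan_001.png"},
                      [enhance_image, extract_text, classify_script, store_record])
```

Keeping each stage behind a uniform callable interface makes it easy to swap engines (e.g. a different OCR backend) without touching the orchestration code.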

Key Features

Custom OpenCV image preprocessing (denoising, deskewing, thresholding)
Multi-engine OCR using Tesseract and EasyOCR for maximum accuracy
Local LLM (Llama-3 via Ollama) for Devanagari/English script separation
FastAPI pipeline orchestrating end-to-end document processing
Automated schema mapping from raw OCR to structured SQL database
Streamlit dashboard for upload, review, and verification by staff
Fully air-gapped operation — zero external API calls
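One simple way to combine output from two OCR engines is to keep whichever engine reports the higher mean word confidence per page. This confidence-voting scheme is a sketch of the idea, not necessarily the project's exact method; in the real pipeline the inputs would come from Tesseract and EasyOCR:

```python
from dataclasses import dataclass

@dataclass
class OcrResult:
    engine: str
    text: str
    mean_confidence: float  # 0-100, averaged over recognized words

def pick_best(results: list[OcrResult]) -> OcrResult:
    """Keep the output the producing engine itself trusts most."""
    if not results:
        raise ValueError("no OCR results to choose from")
    return max(results, key=lambda r: r.mean_confidence)

# Stub values; real ones would come from pytesseract.image_to_data
# and easyocr.Reader.readtext on the same page image.
best = pick_best([
    OcrResult("tesseract", "Record No. 1234", 88.5),
    OcrResult("easyocr", "Record No. l234", 79.2),
])
```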

Challenges & Solutions

Built a custom OpenCV pipeline with adaptive thresholding, morphological transformations, and contrast normalization, raising OCR confidence scores by 35%.

Prompt-engineered a local Llama-3 model via Ollama as a linguistic classifier, achieving fast, accurate script separation without external APIs.
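A sketch of how such a classifier could be wired up with the Ollama Python client. The prompt wording, model tag (`llama3`), and label-parsing logic are assumptions for illustration, not the project's actual prompt:

```python
# Hypothetical sketch; requires a local Ollama server with llama3 pulled.
PROMPT = (
    "Classify the script of the following line as exactly one word, "
    "either 'devanagari' or 'english'. Line: {line}"
)

def build_prompt(line: str) -> str:
    return PROMPT.format(line=line)

def parse_label(reply: str) -> str:
    """Map a free-form model reply onto one of the two labels."""
    return "devanagari" if "devanagari" in reply.strip().lower() else "english"

def classify_line(line: str) -> str:
    import ollama  # talks only to the local server: no external API calls
    reply = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": build_prompt(line)}],
    )
    return parse_label(reply["message"]["content"])
```

Keeping prompt construction and reply parsing as pure functions makes them unit-testable without a running model server.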

Architected the entire system to run on-premise using local LLMs and OCR engines, eliminating all cloud dependencies (OpenAI, Google Vision, etc.).

Deployed a Streamlit UI that lets staff upload scanned images, monitor processing, and review extracted database entries in real time, with no technical knowledge required.
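The structured storage the dashboard reads from might look like the following sketch (table and column names are assumptions for illustration; shown with SQLite, whereas the real system would use its own SQL database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        source_file TEXT NOT NULL,
        script TEXT CHECK (script IN ('english', 'devanagari')),
        extracted_text TEXT,
        ocr_confidence REAL,
        reviewed INTEGER DEFAULT 0  -- flipped to 1 once staff verify the entry
    )
""")
conn.execute(
    "INSERT INTO records (source_file, script, extracted_text, ocr_confidence) "
    "VALUES (?, ?, ?, ?)",
    ("scan_001.png", "english", "Record No. 1234", 88.5),
)
# The review dashboard would query for entries awaiting verification:
pending = conn.execute(
    "SELECT source_file, extracted_text FROM records WHERE reviewed = 0"
).fetchall()
```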

Results

  • OCR Confidence: baseline → +35% (after preprocessing)
  • Pilot Documents Digitized: 0% → 100% (searchable database)
  • Manual Language Sorting: 100% manual → automated (eliminated)
  • Cloud Dependency: required → zero (fully on-premise)
  • Staff Onboarding Time: weeks of training → 2 hours (Streamlit UI)
  • Data Privacy Compliance: at risk → 100% (sovereign pipeline)

Goals

  • Create a scalable, secure pipeline for digitizing thousands of physical archives
  • Achieve high OCR accuracy on poor-quality legacy scans
  • Automatically separate Devanagari and English content without manual tagging
  • Maintain full data sovereignty with zero external API exposure

Tech Stack

  • Python
  • OpenCV
  • Ollama
  • FastAPI
  • Streamlit
  • Tesseract
  • EasyOCR
  • SQL

Target Users

  • Government administrative officers
  • Data entry departments
  • Records management staff

Key Learnings

  • Local LLMs are highly capable for niche linguistic tasks like script classification
  • Image preprocessing quality is the single biggest factor in OCR accuracy
  • Privacy-first architecture is achievable without sacrificing AI capability
  • Simple interfaces (Streamlit) are transformative for non-technical government users

Future Plans

  • Add Devanagari-to-English translation module
  • Expand to handwriting recognition (HTR) for older archival records
  • Build batch processing queue for bulk archive digitization
  • Implement searchable full-text index across the entire document database