Building a Document Processing Pipeline With Python in 2026
The Modern PDF Processing Stack
A production document pipeline in 2026 typically has five stages: ingest → extract → enrich → structure → deliver.
Here's how to build each one cleanly.
Stage 1: Ingest
Accept PDFs via REST upload, S3 event trigger, or email parsing. Validate early — check file size, MIME type, and page count before committing to processing.
    from fastapi import HTTPException, UploadFile

    MAX_BYTES = 50 * 1024 * 1024  # 50 MB

    async def ingest(file: UploadFile) -> bytes:
        # content_type is declared by the client, so treat it as a first filter only.
        if file.content_type != "application/pdf":
            raise HTTPException(status_code=400, detail="PDF only")
        data = await file.read()
        if len(data) > MAX_BYTES:
            raise HTTPException(status_code=413, detail="File too large")
        return data
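The handler above checks the declared MIME type and the size, but not the file content or the page count. A minimal sketch of those two extra checks follows, using only the standard library. The names `MAX_PAGES`, `RejectedUpload`, and `check_upload` are hypothetical, and the page count is assumed to be supplied by the caller (for instance from `len(pypdf.PdfReader(...).pages)`).

```python
from typing import Optional

MAX_BYTES = 50 * 1024 * 1024  # 50 MB, same limit as the upload handler
MAX_PAGES = 500               # hypothetical limit; tune for your workload


class RejectedUpload(ValueError):
    """Raised when an upload fails a pre-processing check (hypothetical type)."""


def check_upload(data: bytes, page_count: Optional[int] = None) -> None:
    """Raise RejectedUpload if the bytes should not enter the pipeline."""
    if len(data) > MAX_BYTES:
        raise RejectedUpload("file too large")
    # A PDF carries a "%PDF-" header. Some files have leading junk before it,
    # so search the first 1024 bytes rather than only offset 0.
    if b"%PDF-" not in data[:1024]:
        raise RejectedUpload("missing PDF header")
    # page_count is optional because counting pages needs a PDF parser.
    if page_count is not None and not 1 <= page_count <= MAX_PAGES:
        raise RejectedUpload(f"page count {page_count} outside 1..{MAX_PAGES}")
```

In a FastAPI handler, a `RejectedUpload` would typically be translated into an `HTTPException` with a 4xx status.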
Stage 2: Extract
Use pypdf for text-based PDFs and pytesseract or easyocr for scanned documents.
Decide which path to take by checking whether the extracted text clears a minimum density threshold, such as an average number of words per page.
    import io

    import pypdf

    def extract_text(pdf_bytes: bytes) -> str:
        reader = pypdf.PdfReader(io.BytesIO(pdf_bytes))
        pages = [p.extract_text() or "" for p in reader.pages]
        full_text = "\n\n".join(pages)
        # Fewer than 100 words per page on average suggests a scan: fall back to OCR.
        # ocr_fallback is assumed to be defined elsewhere (e.g. with pytesseract or easyocr).
        if len(full_text.split()) < 100 * len(reader.pages):
            return ocr_fallback(pdf_bytes)
        return full_text
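The check above averages over the whole document, so a file that mixes text pages with a scanned appendix can clear the threshold and skip OCR entirely. One possible refinement, sketched here as an assumption rather than as the pipeline's actual method, is to apply the same threshold per page and OCR only the sparse ones. `pages_needing_ocr` and `MIN_WORDS_PER_PAGE` are hypothetical names.

```python
from typing import List

MIN_WORDS_PER_PAGE = 100  # same figure as the whole-document check


def pages_needing_ocr(page_texts: List[str],
                      min_words: int = MIN_WORDS_PER_PAGE) -> List[int]:
    """Return zero-based indexes of pages whose extracted text is too sparse."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.split()) < min_words
    ]


# Two dense pages, one empty page, and one page with only a caption.
pages = ["word " * 250, "", "Figure 3", "word " * 120]
print(pages_needing_ocr(pages))  # [1, 2]
```

The input would be the per-page list that `extract_text` already builds before joining it.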
Stage 3: Enrich With AI
Pass extracted text to an LLM for summarisation, classification, or entity extraction. Ask for structured output (a single JSON object with fixed keys) so the result can be parsed reliably.
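    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def enrich(text: str, instructions: str) -> dict:
        message = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                # Truncate by characters to bound the prompt size.
                "content": f"{instructions}\n\nDocument:\n{text[:50000]}"
            }]
        )
        # The reply is plain text; parse it so the function returns the dict it promises.
        # Assumes `instructions` asks for a single JSON object and nothing else.
        return json.loads(message.content[0].text)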
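Model replies sometimes wrap the JSON in a code fence or add a sentence of preamble, which a strict parser rejects. A small tolerant parser is one way to handle that. The sketch below uses only the standard library. `parse_json_reply` is a hypothetical helper, and it assumes the reply contains exactly one JSON object and no stray braces in the surrounding prose.

```python
import json


def parse_json_reply(reply: str) -> dict:
    """Extract the JSON object from a model reply that may carry extra text."""
    # Take the span from the first "{" to the last "}", which skips any
    # preamble, trailing remarks, or code-fence lines around the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    # json.JSONDecodeError subclasses ValueError, so callers catch one type.
    return json.loads(reply[start : end + 1])
```

Raising `ValueError` in both failure cases lets the caller decide whether to retry the model call or mark the document as failed.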
Stage 4: Structure the Output
Normalise your output into a consistent schema regardless of input variety. A Pydantic model enforces this at the boundary.
    from pydantic import BaseModel

    class DocumentResult(BaseModel):
        summary: str
        category: str
        key_entities: list[str]
        action_items: list[str]
        confidence: float
        page_count: int
        word_count: int
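To show what enforcing the schema at the boundary can look like, here is a minimal sketch that validates the model's output together with counts measured during extraction. It assumes Pydantic v2 (`model_validate`). The `to_result` helper and the 0 to 1 bound on `confidence` are illustrative assumptions, not part of the schema above.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class DocumentResult(BaseModel):
    summary: str
    category: str
    key_entities: list[str]
    action_items: list[str]
    confidence: float = Field(ge=0.0, le=1.0)  # assumed range, added for this sketch
    page_count: int
    word_count: int


def to_result(enriched: dict, page_count: int,
              word_count: int) -> Optional[DocumentResult]:
    """Merge model output with measured counts; return None if validation fails."""
    try:
        return DocumentResult.model_validate(
            {**enriched, "page_count": page_count, "word_count": word_count}
        )
    except ValidationError:
        # A real pipeline would likely log the error or retry the enrich step.
        return None
```

Taking `page_count` and `word_count` from the extraction stage rather than from the model keeps the measurable fields trustworthy.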
Stage 5: Deliver
Write results to your database, push a webhook, or return the structured JSON directly. For async pipelines, use a task queue (Celery + Redis or ARQ) so large documents don't block the API.
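    from arq import create_pool
    from arq.connections import RedisSettings

    async def queue_document(pdf_bytes: bytes, callback_url: str):
        # RedisSettings() defaults to localhost:6379.
        # In a long-running service, create the pool once at startup and reuse it.
        pool = await create_pool(RedisSettings())
        await pool.enqueue_job("process_document", pdf_bytes, callback_url)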
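The enqueue call names a `process_document` job, which has to exist on the worker side. A rough sketch of that worker follows, under several assumptions: `run_pipeline` is a placeholder for stages 2 to 4, delivery is a plain JSON POST to the callback URL, and `build_webhook_request` and `_post` are hypothetical helpers. The only arq conventions relied on are that a job receives a context dict as its first argument and that the worker reads a `WorkerSettings` class with a `functions` list.

```python
import asyncio
import json
import urllib.request


def run_pipeline(pdf_bytes: bytes) -> dict:
    """Placeholder for the extract, enrich, and structure stages."""
    return {"byte_count": len(pdf_bytes)}


def build_webhook_request(callback_url: str, result: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying the pipeline result."""
    return urllib.request.Request(
        callback_url,
        data=json.dumps(result).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _post(request: urllib.request.Request) -> int:
    # Blocking call; the timeout keeps a dead webhook from stalling the worker.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


async def process_document(ctx: dict, pdf_bytes: bytes, callback_url: str) -> int:
    """arq job; arq passes the context dict as the first argument."""
    # Run the blocking work in threads so the worker's event loop stays free.
    result = await asyncio.to_thread(run_pipeline, pdf_bytes)
    return await asyncio.to_thread(_post, build_webhook_request(callback_url, result))


class WorkerSettings:
    # The worker resolves job names against this list.
    # Start it with: arq <your_module>.WorkerSettings
    functions = [process_document]
```

Passing raw PDF bytes through the queue keeps the sketch short. For large files, storing the document in object storage and enqueueing only its key may be the better trade-off.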
Performance Notes
- Parallelise page extraction — process pages concurrently for large documents
- Cache OCR results — OCR is expensive; cache by file hash
- Stream large responses — for 100+ page summaries, stream the AI response rather than waiting
- Set hard timeouts — a stuck OCR job should fail fast, not block your queue
The Full Pipeline in Production
At SynthPDF, our pipeline handles documents from 1 page to 500+ pages using exactly this architecture. The key insight: invest in stage 2 (extraction quality) — everything downstream depends on it. Garbage extraction produces garbage enrichment, no matter how good your AI model is.