Build a PDF Processing Pipeline in Python

The Modern PDF Processing Stack

A production document pipeline in 2026 typically has five stages: ingest → extract → enrich → structure → deliver.

Here's how to build each one cleanly.

Stage 1: Ingest

Accept PDFs via REST upload, S3 event trigger, or email parsing. Validate early — check file size, MIME type, and page count before committing to processing.

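from fastapi import HTTPException, UploadFile

MAX_BYTES = 50 * 1024 * 1024  # 50 MB

async def ingest(file: UploadFile) -> bytes:
    # content_type is supplied by the client, so treat it as a first filter only
    if file.content_type != "application/pdf":
        raise HTTPException(400, "PDF only")
    # Read at most one byte past the limit, so an oversized upload
    # is rejected without pulling the whole file into memory here
    data = await file.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise HTTPException(413, "File too large")
    return data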
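
Because the MIME type comes from the client, it can help to look at the bytes themselves before going further. The sketch below is one minimal way to do that with the standard library only. The name sniff_pdf and the 1024-byte search windows are assumptions for illustration, not part of the pipeline above, and passing this check does not prove the file will parse.

```python
import re
from typing import Optional

# A PDF starts with a "%PDF-x.y" header and ends with a "%%EOF" marker.
_HEADER = re.compile(rb"%PDF-(\d\.\d)")

def sniff_pdf(data: bytes) -> Optional[str]:
    """Return the declared PDF version (e.g. "1.7"), or None if data does not look like a PDF."""
    # Search a small window rather than requiring offset 0,
    # since some files carry a few junk bytes before the header.
    head = _HEADER.search(data[:1024])
    if head is None:
        return None
    # A missing end-of-file marker usually means a truncated upload.
    if b"%%EOF" not in data[-1024:]:
        return None
    return head.group(1).decode("ascii")
```

A caller could reject the upload with a 400 whenever sniff_pdf(data) returns None.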

Stage 2: Extract

Use pypdf for text-based PDFs and pytesseract or easyocr for scanned documents. Detect which path to take by checking if extracted text is above a minimum density threshold.

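import io

import pypdf

# Words per page below which the text layer is treated as missing.
# Tune this for your documents; sparse layouts need a lower value.
MIN_WORDS_PER_PAGE = 100

def extract_text(pdf_bytes: bytes) -> str:
    reader = pypdf.PdfReader(io.BytesIO(pdf_bytes))
    # Image-only pages yield no text, so normalise them to ""
    pages = [p.extract_text() or "" for p in reader.pages]
    full_text = "\n\n".join(pages)
    # If text density is too low, fall back to OCR.
    # ocr_fallback is your own OCR routine (pytesseract or easyocr); it is not defined here.
    if len(full_text.split()) < MIN_WORDS_PER_PAGE * len(reader.pages):
        return ocr_fallback(pdf_bytes)
    return full_text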
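
A whole-document threshold can misroute mixed files, for example a typed contract with a scanned signature page appended. One possible refinement is to make the density decision per page and send only the thin pages to OCR. This is a sketch under assumptions: pages_needing_ocr is a hypothetical helper, and the default of 20 words is an arbitrary starting point rather than a recommended value.

```python
def pages_needing_ocr(page_texts: list[str], min_words: int = 20) -> list[int]:
    """Return zero-based indexes of pages whose extracted text is below min_words."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.split()) < min_words
    ]

# Example: page 0 has a usable text layer, pages 1 and 2 do not.
sample_pages = ["word " * 50, "", "Page 3 of"]
thin_pages = pages_needing_ocr(sample_pages)
```

The input here is the same per-page list that extraction already builds, so the check adds no extra parsing cost.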

Stage 3: Enrich With AI

Pass extracted text to an LLM for summarisation, classification, or entity extraction. Ask for structured JSON output and parse it, so downstream code receives reliable, machine-readable results rather than free text.

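import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def enrich(text: str, instructions: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        # Ask for JSON explicitly so the reply can be parsed into a dict
        system="Reply with a single JSON object and nothing else.",
        messages=[{
            "role": "user",
            # Truncate by characters to keep the prompt size bounded
            "content": f"{instructions}\n\nDocument:\n{text[:50000]}"
        }]
    )
    # content[0].text is a str; parse it to match the declared return type.
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(message.content[0].text)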
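
Model replies are not always bare JSON: they sometimes arrive wrapped in a Markdown code fence or with a sentence on either side. A tolerant parser is one way to absorb that. The helper below is a hypothetical sketch, not part of any SDK. It assumes the first "{" in the reply opens the JSON object, which fails if the surrounding prose itself contains a brace.

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Parse the first JSON object found in a model reply."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value and ignores any text after it,
    # so trailing prose or a closing code fence does not break parsing.
    parsed, _end = json.JSONDecoder().raw_decode(reply[start:])
    return parsed

example = parse_json_reply('Here is the result: {"category": "invoice"} Anything else?')
```

json.JSONDecodeError is a subclass of ValueError, so callers can catch ValueError for both failure modes and retry or flag the document.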

Stage 4: Structure the Output

Normalise your output into a consistent schema regardless of input variety. A Pydantic model enforces this at the boundary.

from pydantic import BaseModel

class DocumentResult(BaseModel):
    summary: str
    category: str
    key_entities: list[str]
    action_items: list[str]
    confidence: float
    page_count: int
    word_count: int
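
Enrichment output tends to vary: a list field comes back as a single string, a key is missing, a score lands slightly out of range. One approach is to coerce the raw dict into the schema's shape before validation. The sketch below uses only the standard library. The function name normalise, the "unknown" default category, and the clamping of confidence to the 0 to 1 range are all assumptions for illustration.

```python
def normalise(enriched: dict, pages: list[str]) -> dict:
    """Coerce raw enrichment output into the field set used by DocumentResult."""

    def as_list(value) -> list[str]:
        # Accept a missing value, a single string, or any iterable of items
        if value is None:
            return []
        if isinstance(value, str):
            return [value]
        return [str(item) for item in value]

    confidence = float(enriched.get("confidence", 0.0))
    return {
        "summary": str(enriched.get("summary", "")).strip(),
        "category": str(enriched.get("category", "unknown")).strip().lower(),
        "key_entities": as_list(enriched.get("key_entities")),
        "action_items": as_list(enriched.get("action_items")),
        # Assumption: confidence is a 0..1 score, so clamp stray values
        "confidence": min(max(confidence, 0.0), 1.0),
        # Counts come from the extracted pages, not from the model
        "page_count": len(pages),
        "word_count": sum(len(page.split()) for page in pages),
    }
```

The result can then be passed to the model as DocumentResult(**normalise(enriched, pages)), leaving Pydantic to reject anything that still does not fit.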

Stage 5: Deliver

Write results to your database, push a webhook, or return the structured JSON directly. For async pipelines, use a task queue (Celery + Redis or ARQ) so large documents don't block the API.

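from arq import create_pool
from arq.connections import RedisSettings

async def queue_document(pdf_bytes: bytes, callback_url: str):
    # RedisSettings() targets localhost:6379. In a real service,
    # create the pool once at startup and reuse it across requests.
    pool = await create_pool(RedisSettings())
    # "process_document" must match a function registered on the arq worker
    await pool.enqueue_job("process_document", pdf_bytes, callback_url)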
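
For the webhook path, the receiving service will occasionally be down, so delivery usually wants a retry with backoff. Here is a minimal standard-library sketch of that idea. The names post_json and deliver, the three attempts, and the doubling delay are assumptions, not a prescribed policy. The send and sleep parameters are injectable so the retry logic can be exercised without a network.

```python
import json
import time
import urllib.request
from typing import Callable

def post_json(url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

def deliver(
    url: str,
    payload: dict,
    send: Callable[[str, dict], int] = post_json,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver payload, doubling the wait between failed attempts."""
    for attempt in range(attempts):
        try:
            if 200 <= send(url, payload) < 300:
                return True
        except OSError:
            # urllib's URLError and HTTPError, and socket timeouts, are OSError subclasses
            pass
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

A False return is the signal to park the result somewhere durable, such as a dead-letter table, rather than dropping it.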

Performance Notes

Parallelise page extraction: text extraction is CPU-bound, so spread the pages of large documents across worker processes
Cache OCR results: OCR is expensive, so cache by file hash
Stream large responses: when summarising documents of 100+ pages, stream the AI response rather than waiting for the full reply
Set hard timeouts: a stuck OCR job should fail fast, not block your queue
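
Caching by file hash, from the notes above, can be sketched in a few lines. This version keys on the SHA-256 of the raw bytes and uses a plain dict as the store. The name cached_ocr and the dict-based cache are assumptions; a production setup would more likely use Redis or a database table with the same key.

```python
import hashlib
from typing import Callable, MutableMapping

def cached_ocr(
    pdf_bytes: bytes,
    run_ocr: Callable[[bytes], str],
    cache: MutableMapping[str, str],
) -> str:
    """Return OCR text for pdf_bytes, running run_ocr only on a cache miss."""
    # Identical files hash to the same key regardless of filename or uploader
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in cache:
        cache[key] = run_ocr(pdf_bytes)
    return cache[key]
```

Hashing the content rather than the filename means a re-uploaded or renamed copy of the same document still hits the cache.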

The Full Pipeline in Production

At SynthPDF, our pipeline handles documents from 1 page to 500+ pages using exactly this architecture. The key insight: invest in stage 2 (extraction quality) — everything downstream depends on it. Garbage extraction produces garbage enrichment, no matter how good your AI model is.