medical-ocr is a multi-engine OCR pipeline for medical and legal documents. It extracts structured data — ICD codes, CPT codes, medications, timelines, impairment ratings — from PDFs and scanned documents.

GitHub: nometria/medical-ocr
PyPI: medical-ocr

Install

# System dependencies
brew install tesseract poppler          # macOS
apt-get install tesseract-ocr poppler-utils  # Ubuntu

# Install base package
pip install medical-ocr

# With GPU-accelerated OCR (EasyOCR + OpenCV)
pip install "medical-ocr[gpu]"   # quotes keep zsh from globbing the brackets

# With Google Cloud Vision fallback
pip install "medical-ocr[gcp]"   # quotes keep zsh from globbing the brackets

Usage

# Set API key for LLM refinement pass
export OPENAI_API_KEY=sk-proj-...

# Process a medical document — extract all fields
medical-ocr report.pdf --all --format json

# Extract specific fields only
medical-ocr report.pdf --fields icd,cpt,medications

# Batch process a directory
medical-ocr ./patient-records/ --all --format json --output ./extracted/

# Run as REST API
medical-ocr --api
# POST /extract  { "file": "path/to/document.pdf" }
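
Calling the REST API from Python

The POST /extract call above could be issued from Python roughly as follows. This is a minimal sketch using only the standard library. The base URL (host and port), the JSON content type, and the assumption that the response body is JSON are not stated in this README, and the helper names build_extract_request and extract are hypothetical.

```python
import json
import urllib.request

# Assumption: the README does not give a host or port; this is a placeholder.
BASE_URL = "http://localhost:8000"


def build_extract_request(file_path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /extract request whose JSON body names the file to process."""
    body = json.dumps({"file": file_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/extract",
        data=body,
        # Assumption: the server expects a JSON request body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(file_path: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the response (needs `medical-ocr --api` running).

    Assumption: the response body is JSON.
    """
    with urllib.request.urlopen(build_extract_request(file_path, base_url)) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or log before any document path leaves the client.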

Extraction pipeline

Step          | Engine               | What it does
OCR           | Tesseract (primary)  | Extracts raw text from pages
Secondary OCR | EasyOCR (optional)   | Higher accuracy for handwriting
Fallback OCR  | Google Cloud Vision  | For complex layouts
Classify      | Rules + LLM          | Identifies document type
Extract       | LLM refinement       | Pulls structured data from OCR text
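
The primary, secondary, and fallback OCR steps suggest an ordered chain of engines. The sketch below shows one way such a chain might work. It is not the package's actual implementation: the confidence threshold, the OcrResult type, and the stub engines (with made-up text and scores) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    engine: str
    text: str
    confidence: float  # 0.0 to 1.0; how real engines report this is an assumption


# An engine is any callable that takes page image bytes and returns an OcrResult.
Engine = Callable[[bytes], OcrResult]


def run_ocr_chain(page: bytes, engines: Sequence[Engine],
                  min_confidence: float = 0.8) -> OcrResult:
    """Try engines in order and return the first result that clears the threshold.

    If no engine clears it, return the highest-confidence result seen.
    """
    if not engines:
        raise ValueError("at least one OCR engine is required")
    best: Optional[OcrResult] = None
    for engine in engines:
        result = engine(page)
        if result.confidence >= min_confidence:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    return best


# Hypothetical stand-ins for Tesseract, EasyOCR, and Google Cloud Vision.
def fake_tesseract(page: bytes) -> OcrResult:
    return OcrResult("tesseract", "Pt c/o LBP", 0.55)


def fake_easyocr(page: bytes) -> OcrResult:
    return OcrResult("easyocr", "Pt c/o low back pain", 0.91)


def fake_vision(page: bytes) -> OcrResult:
    return OcrResult("gcp-vision", "Pt c/o low back pain", 0.97)


result = run_ocr_chain(b"", [fake_tesseract, fake_easyocr, fake_vision])
print(result.engine)
```

With these stub scores the chain stops at the second engine, so the paid cloud fallback is never called for a page that a local engine already reads well.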

Supported document types

Document type       | Extracted fields
Treatment records   | Diagnoses, procedures, dates, providers
Prescriptions       | Medications, dosages, frequencies, refills
Imaging reports     | Body part, findings, impression, radiologist
IME reports         | Impairment ratings, restrictions, causation opinions
Operative reports   | Procedure codes (CPT), surgeons, facility
Discharge summaries | Admission/discharge dates, diagnoses, follow-up
Lab results         | Test names, values, reference ranges, flags
Bills/EOBs          | CPT codes, charges, allowed amounts, dates

Output format

{
  "document_type": "treatment_record",
  "icd_codes": ["M54.5", "G89.29"],
  "cpt_codes": ["99213", "97110"],
  "medications": [
    { "name": "Ibuprofen", "dosage": "800mg", "frequency": "TID" }
  ],
  "body_parts": ["lumbar spine", "right shoulder"],
  "work_restrictions": ["no lifting > 10 lbs", "limited standing"],
  "dates": {
    "first_visit": "2025-01-10",
    "last_visit": "2025-03-15"
  }
}
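
Working with the output

JSON in the shape shown above can be consumed with the standard library alone. The following is a minimal sketch that condenses one record into a short summary. The summarize helper and its output keys are hypothetical, and it assumes every field may be absent, since a run with --fields presumably returns only a subset.

```python
import json
from datetime import date

# Sample record copied from the output format shown above.
SAMPLE = """
{
  "document_type": "treatment_record",
  "icd_codes": ["M54.5", "G89.29"],
  "cpt_codes": ["99213", "97110"],
  "medications": [
    { "name": "Ibuprofen", "dosage": "800mg", "frequency": "TID" }
  ],
  "body_parts": ["lumbar spine", "right shoulder"],
  "work_restrictions": ["no lifting > 10 lbs", "limited standing"],
  "dates": {
    "first_visit": "2025-01-10",
    "last_visit": "2025-03-15"
  }
}
"""


def summarize(record: dict) -> dict:
    """Condense one extraction record, tolerating missing fields."""
    dates = record.get("dates", {})
    first, last = dates.get("first_visit"), dates.get("last_visit")
    span_days = None
    if first and last:
        # Dates in the sample are ISO 8601 (YYYY-MM-DD).
        span_days = (date.fromisoformat(last) - date.fromisoformat(first)).days
    return {
        "document_type": record.get("document_type"),
        "code_count": len(record.get("icd_codes", [])) + len(record.get("cpt_codes", [])),
        "medications": [
            f'{m["name"]} {m["dosage"]} {m["frequency"]}'
            for m in record.get("medications", [])
        ],
        "treatment_span_days": span_days,
    }


summary = summarize(json.loads(SAMPLE))
print(summary["medications"])          # ['Ibuprofen 800mg TID']
print(summary["treatment_span_days"])  # 64
```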