Medical OCR - Nometria

medical-ocr is a multi-engine OCR pipeline for medical and legal documents. It extracts structured data — ICD codes, CPT codes, medications, timelines, impairment ratings — from PDFs and scanned documents.

GitHub

nometria/medical-ocr

PyPI

medical-ocr on PyPI

Install

# System dependencies
brew install tesseract poppler          # macOS
apt-get install tesseract-ocr poppler-utils  # Ubuntu

# Install base package
pip install medical-ocr

# With GPU-accelerated OCR (EasyOCR + OpenCV)
# (quotes keep shells such as zsh from treating the brackets as a glob)
pip install "medical-ocr[gpu]"

# With Google Cloud Vision fallback
pip install "medical-ocr[gcp]"

Usage

# Set API key for LLM refinement pass
export OPENAI_API_KEY=sk-proj-...

# Process a medical document — extract all fields
medical-ocr report.pdf --all --format json

# Extract specific fields only
medical-ocr report.pdf --fields icd,cpt,medications

# Batch process a directory
medical-ocr ./patient-records/ --all --format json --output ./extracted/

# Run as a REST API server
medical-ocr --api
# Then send: POST /extract  { "file": "path/to/document.pdf" }
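
For scripted use, the CLI calls above can be wrapped from Python. The following is a minimal sketch, not part of medical-ocr itself: build_command and run_extraction are hypothetical helper names, and the code uses only the flags shown above (--all, --fields, --format json, --output). Combining --fields with --format json, and passing --output for a single file, are assumptions, since the examples above do not show those combinations.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    source: str,
    fields: Optional[Sequence[str]] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble a medical-ocr command line (hypothetical helper).

    source may be a single file or a directory, as in the examples above.
    """
    cmd = ["medical-ocr", source]
    if fields:
        # --fields takes a comma-separated list, e.g. icd,cpt,medications
        cmd += ["--fields", ",".join(fields)]
    else:
        # No field list given: ask for everything
        cmd.append("--all")
    # Assumption: --format json can be combined with either mode
    cmd += ["--format", "json"]
    if output_dir:
        cmd += ["--output", output_dir]
    return cmd


def run_extraction(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Calling build_command("./patient-records/", output_dir="./extracted/") reproduces the batch example above as an argument list, which avoids shell quoting issues when paths contain spaces.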

Extraction pipeline

Step	Engine	What it does
OCR	Tesseract (primary)	Extracts raw text from pages
Secondary OCR	EasyOCR (optional)	Higher accuracy for handwriting
Fallback OCR	Google Cloud Vision	For complex layouts
Classify	Rules + LLM	Identifies document type
Extract	LLM refinement	Pulls structured data from OCR text
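
The primary, secondary, and fallback OCR rows above suggest an ordered chain of engines. The sketch below shows one way such a chain could work; it is not the package's actual implementation. The trigger for moving to the next engine (a confidence threshold of 0.8), the OcrResult type, and the ocr_with_fallback function are all assumptions made for illustration, and the engines are plain callables so the example runs without Tesseract, EasyOCR, or Google Cloud Vision installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class OcrResult:
    engine: str
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine wrapper


# An engine is any callable that maps page image bytes to (text, confidence).
Engine = Callable[[bytes], Tuple[str, float]]


def ocr_with_fallback(
    page: bytes,
    engines: Sequence[Tuple[str, Engine]],
    min_confidence: float = 0.8,  # assumed threshold, not from the source
) -> Optional[OcrResult]:
    """Try engines in order; return the first confident, non-empty result.

    If no engine reaches min_confidence, return the highest-confidence
    result seen, or None if every engine failed.
    """
    best: Optional[OcrResult] = None
    for name, engine in engines:
        try:
            text, confidence = engine(page)
        except Exception:
            # Engine not installed or failed on this page: try the next one
            continue
        result = OcrResult(name, text, confidence)
        if best is None or confidence > best.confidence:
            best = result
        if confidence >= min_confidence and text.strip():
            return result
    return best
```

In a real setup the callables would wrap the actual engines in the order given in the table, with the cloud engine last so that it is only called when the local engines fall short.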

Supported document types

Document type	Extracted fields
Treatment records	Diagnoses, procedures, dates, providers
Prescriptions	Medications, dosages, frequencies, refills
Imaging reports	Body part, findings, impression, radiologist
IME reports	Impairment ratings, restrictions, causation opinions
Operative reports	Procedure codes (CPT), surgeons, facility
Discharge summaries	Admission/discharge dates, diagnoses, follow-up
Lab results	Test names, values, reference ranges, flags
Bills/EOBs	CPT codes, charges, allowed amounts, dates

Output format

{
  "document_type": "treatment_record",
  "icd_codes": ["M54.5", "G89.29"],
  "cpt_codes": ["99213", "97110"],
  "medications": [
    { "name": "Ibuprofen", "dosage": "800mg", "frequency": "TID" }
  ],
  "body_parts": ["lumbar spine", "right shoulder"],
  "work_restrictions": ["no lifting > 10 lbs", "limited standing"],
  "dates": {
    "first_visit": "2025-01-10",
    "last_visit": "2025-03-15"
  }
}
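
Because the structured fields come from an LLM pass over OCR text, downstream code may want to sanity-check them before use. The sketch below is one possible check, assuming the key names shown in the sample above; check_output is a hypothetical helper, and the regular expressions only test the general shape of ICD-10-CM and CPT codes, not whether a code exists in the current code sets.

```python
import re
from datetime import date
from typing import List

# Shape only: letter, digit, alphanumeric, then an optional dot and 1-4 characters
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")
# Shape only: five characters, the first four numeric
CPT_SHAPE = re.compile(r"^[0-9]{4}[0-9A-Z]$")


def check_output(doc: dict) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for code in doc.get("icd_codes", []):
        if not ICD10_SHAPE.match(code):
            problems.append(f"unexpected ICD code shape: {code!r}")
    for code in doc.get("cpt_codes", []):
        if not CPT_SHAPE.match(code):
            problems.append(f"unexpected CPT code shape: {code!r}")

    dates = doc.get("dates", {})
    first, last = dates.get("first_visit"), dates.get("last_visit")
    if first and last:
        try:
            if date.fromisoformat(first) > date.fromisoformat(last):
                problems.append("first_visit is after last_visit")
        except ValueError:
            problems.append("dates are not in YYYY-MM-DD format")
    return problems
```

Passing the sample output above through check_output (after json.loads) yields an empty list; a misread code such as "54.5" or a swapped date pair would be reported for manual review.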
