medical-ocr is a multi-engine OCR pipeline for medical and legal documents. It extracts structured data — ICD codes, CPT codes, medications, timelines, impairment ratings — from PDFs and scanned documents.
GitHub: nometria/medical-ocr
PyPI: medical-ocr
Install
The package is listed on PyPI as medical-ocr, so `pip install medical-ocr` should install it.
Usage
Extraction pipeline
| Step | Engine | What it does |
|---|---|---|
| OCR | Tesseract (primary) | Extracts raw text from pages |
| Secondary OCR | EasyOCR (optional) | Higher accuracy for handwriting |
| Fallback OCR | Google Cloud Vision | For complex layouts |
| Classify | Rules + LLM | Identifies document type |
| Extract | LLM refinement | Pulls structured data from OCR text |
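The table lists a primary, a secondary, and a fallback OCR engine but does not say how the pipeline decides when to move from one to the next. The sketch below shows one plausible way to chain engines, assuming each engine reports a mean confidence and that a fixed threshold triggers the fallback. `OcrResult`, `run_ocr_chain`, `stub_engine`, and `min_confidence` are hypothetical names for illustration, not part of medical-ocr's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    engine: str        # name of the engine that produced the text
    text: str          # raw text for one page
    confidence: float  # mean confidence in [0, 1], as reported by the engine


# Hypothetical engine signature: page image bytes in, OcrResult out.
OcrEngine = Callable[[bytes], OcrResult]


def run_ocr_chain(
    page: bytes,
    engines: Sequence[OcrEngine],
    min_confidence: float = 0.80,  # assumed threshold, not taken from the project
) -> OcrResult:
    """Try engines in order and stop at the first result that clears the threshold.

    If no engine clears it, return the highest-confidence result seen.
    """
    if not engines:
        raise ValueError("at least one OCR engine is required")
    best: Optional[OcrResult] = None
    for engine in engines:
        result = engine(page)
        if best is None or result.confidence > best.confidence:
            best = result
        if result.confidence >= min_confidence:
            break  # good enough, skip the slower or paid engines
    return best


def stub_engine(name: str, text: str, confidence: float) -> OcrEngine:
    """Build a fake engine that always returns the same result (for demos and tests)."""
    def engine(page: bytes) -> OcrResult:
        return OcrResult(name, text, confidence)
    return engine


if __name__ == "__main__":
    chain = [
        stub_engine("tesseract", "Pt c/o LBP", 0.55),
        stub_engine("easyocr", "Pt c/o low back pain", 0.91),
        stub_engine("cloud-vision", "Pt c/o low back pain", 0.99),
    ]
    print(run_ocr_chain(b"", chain).engine)
```

In a real deployment the stubs would wrap the actual engines. Ordering them from cheapest to most expensive means the hosted service is called only for pages the local engines handle poorly.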
Supported document types
| Document type | Extracted fields |
|---|---|
| Treatment records | Diagnoses, procedures, dates, providers |
| Prescriptions | Medications, dosages, frequencies, refills |
| Imaging reports | Body part, findings, impression, radiologist |
| IME reports | Impairment ratings, restrictions, causation opinions |
| Operative reports | Procedure codes (CPT), surgeons, facility |
| Discharge summaries | Admission/discharge dates, diagnoses, follow-up |
| Lab results | Test names, values, reference ranges, flags |
| Bills/EOBs | CPT codes, charges, allowed amounts, dates |
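The field lists above can double as a completeness check on an extraction result. The following is a minimal sketch of that idea, assuming results arrive as plain dictionaries. The keys, the `EXPECTED_FIELDS` mapping, and `missing_fields` are illustrative names derived from the table, not the library's actual schema, and the sample values are made up.

```python
from typing import Dict, List, Mapping

# Hypothetical field names, transcribed from the table of supported document types.
EXPECTED_FIELDS: Dict[str, List[str]] = {
    "treatment_record": ["diagnoses", "procedures", "dates", "providers"],
    "prescription": ["medications", "dosages", "frequencies", "refills"],
    "imaging_report": ["body_part", "findings", "impression", "radiologist"],
    "ime_report": ["impairment_ratings", "restrictions", "causation_opinions"],
    "operative_report": ["procedure_codes", "surgeons", "facility"],
    "discharge_summary": ["admission_date", "discharge_date", "diagnoses", "follow_up"],
    "lab_result": ["test_names", "values", "reference_ranges", "flags"],
    "bill_or_eob": ["cpt_codes", "charges", "allowed_amounts", "dates"],
}


def missing_fields(doc_type: str, extracted: Mapping[str, object]) -> List[str]:
    """Return the expected fields that are absent or empty in an extraction result.

    Raises KeyError for a document type that is not in EXPECTED_FIELDS.
    A value of 0 counts as present, so "0 refills" is not flagged.
    """
    expected = EXPECTED_FIELDS[doc_type]
    return [name for name in expected if extracted.get(name) in (None, "", [], {})]


if __name__ == "__main__":
    result = {
        "medications": ["lisinopril"],  # made-up sample data
        "dosages": ["10 mg"],
        "frequencies": [],
        "refills": 0,
    }
    print(missing_fields("prescription", result))
```

A check like this could flag pages for manual review when the OCR text was too noisy for the extraction step to fill every field.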