Eval Methodology
We publish our evaluation methodology because healthcare accuracy claims must be verifiable. Here is exactly how we measure translation accuracy.
Last updated: May 22, 2026
How We Test
Our test suite spans real medical document types across all supported country corridors. Each test case pairs an input document with expected code mappings verified against authoritative sources.
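To make the shape of a test case concrete, here is a minimal sketch of how one could be represented. The class names, field names, and the "DE->US" corridor label are illustrative assumptions, not our internal schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExpectedMapping:
    source_system: str
    source_code: str
    target_system: str
    target_code: str
    verified_against: str  # citation of the official publication consulted


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    corridor: str          # hypothetical label format, e.g. "DE->US"
    document_type: str     # e.g. "prescription"
    input_document: str    # synthetic content only, no real PHI
    expected: Tuple[ExpectedMapping, ...]


def validate_case(case: EvalCase) -> List[str]:
    """Return a list of problems; an empty list means the case is usable."""
    problems = []
    if not case.input_document.strip():
        problems.append("empty input document")
    if not case.expected:
        problems.append("no expected mappings")
    for mapping in case.expected:
        if not mapping.verified_against.strip():
            problems.append(f"{mapping.source_code}: missing verification source")
    return problems
```

Requiring a verification source on every expected mapping keeps the "verified, not AI-generated" rule checkable by a script rather than by convention.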
Synthetic PHI
All test documents use synthetic patient data — realistic medical content with no real protected health information.
Authoritative Expected Output
Expected code mappings are verified against official coding system publications. They are not generated by AI.
Document Types Covered
Dental plans, prescriptions, lab results, discharge summaries, radiology reports, cardiology and oncology documents, referrals, vaccination records, and insurance claims.
Edge Cases Included
The test set includes cases specifically designed to break deterministic mapping: rarely used codes in systems such as NDC and PZN, multilingual input, and cross-system disambiguation (FDI vs BEMA).
What We Measure
Accuracy is a composite metric. We track four dimensions separately to give a complete picture of translation quality.
Code Mapping Accuracy
Primary metric. The fraction of expected code mappings that the pipeline reproduces exactly: each source code must map to the correct target code in the correct target coding system.
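As an illustration, the metric can be computed along these lines. This is a sketch under assumptions: the `Mapping` type and the placeholder codes are invented for the example and are not real clinical mappings.

```python
from typing import Iterable, NamedTuple


class Mapping(NamedTuple):
    source_code: str
    target_system: str
    target_code: str


def code_mapping_accuracy(expected: Iterable[Mapping], produced: Iterable[Mapping]) -> float:
    """Return the fraction of expected mappings found exactly in the output."""
    expected_list = list(expected)
    if not expected_list:
        raise ValueError("expected mappings must not be empty")
    produced_set = set(produced)
    correct = sum(1 for mapping in expected_list if mapping in produced_set)
    return correct / len(expected_list)


# Placeholder codes and systems, not real clinical mappings.
expected = [
    Mapping("SRC-001", "SYSTEM-B", "TGT-101"),
    Mapping("SRC-002", "SYSTEM-B", "TGT-102"),
    Mapping("SRC-003", "SYSTEM-B", "TGT-103"),
    Mapping("SRC-004", "SYSTEM-B", "TGT-104"),
]
produced = [
    Mapping("SRC-001", "SYSTEM-B", "TGT-101"),
    Mapping("SRC-002", "SYSTEM-B", "TGT-102"),
    Mapping("SRC-003", "SYSTEM-C", "TGT-103"),  # right code, wrong system
    Mapping("SRC-004", "SYSTEM-B", "TGT-999"),  # wrong code
]
accuracy = code_mapping_accuracy(expected, produced)
```

In this example two of the four expected mappings match exactly, so a mapping with the right code in the wrong system counts as a miss.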
Safety Alert Recall
Did the pipeline detect drug unavailability, dosage differences, and notation hazards that require a clinician's attention? 100% recall on current test set.
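Recall here means the share of expected alerts that the pipeline actually raised. A minimal sketch follows; the `Alert` fields and the choice to report 1.0 when a test set expects no alerts are assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple


class Alert(NamedTuple):
    case_id: str
    alert_type: str  # "drug_unavailable", "dosage_difference", or "notation_hazard"
    subject: str     # hypothetical identifier of the drug or notation concerned


def safety_alert_recall(expected: Iterable[Alert], raised: Iterable[Alert]) -> float:
    """Return the fraction of expected alerts that were raised."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0  # assumption: nothing to miss counts as full recall
    return len(expected_set & set(raised)) / len(expected_set)


def missed_alerts(expected: Iterable[Alert], raised: Iterable[Alert]) -> List[Alert]:
    """Return expected alerts the pipeline failed to raise, sorted for review."""
    return sorted(set(expected) - set(raised))
```

Extra alerts that were raised but not expected do not lower recall. They would show up in a precision figure, which this sketch does not compute.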
Coverage
Percentage of source codes the knowledge base can map from verified sources, without falling back to AI assistance.
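Coverage reduces to a membership check against the knowledge base. The sketch below assumes the knowledge base is keyed by (coding system, code) pairs, which is an illustrative choice rather than a description of our storage format.

```python
from typing import Dict, Iterable, Tuple

CodeKey = Tuple[str, str]  # (coding system, code)


def knowledge_base_coverage(
    source_codes: Iterable[CodeKey], knowledge_base: Dict[CodeKey, str]
) -> float:
    """Return the percentage of source codes that have a verified mapping."""
    codes = list(source_codes)
    if not codes:
        raise ValueError("source_codes must not be empty")
    mapped = sum(1 for code in codes if code in knowledge_base)
    return 100.0 * mapped / len(codes)
```

Codes that miss this lookup are exactly the ones that would be handed to the AI-assisted fallback.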
Calibration
How well confidence scores track actual accuracy. A 90% confidence badge should be correct ~90% of the time. We continuously improve calibration.
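One common way to quantify this is to bucket predictions by confidence and compare each bucket's average confidence with its observed accuracy, summarized as expected calibration error (ECE). The sketch below shows that general technique. Which calibration measure we report internally is not specified here, so treat the function names and the ten-bin default as assumptions.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[float, bool]  # (confidence in [0, 1], mapping was correct)


def calibration_table(
    predictions: Sequence[Prediction], n_bins: int = 10
) -> List[Tuple[int, float, float]]:
    """Return (count, mean confidence, observed accuracy) per non-empty bin."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for confidence, is_correct in predictions:
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    rows = []
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((len(items), mean_confidence, accuracy))
    return rows


def expected_calibration_error(predictions: Sequence[Prediction], n_bins: int = 10) -> float:
    """Return the count-weighted mean gap between confidence and accuracy."""
    rows = calibration_table(predictions, n_bins)
    total = sum(count for count, _, _ in rows)
    if total == 0:
        raise ValueError("predictions must not be empty")
    return sum(count / total * abs(conf - acc) for count, conf, acc in rows)
```

A well-calibrated system has an ECE near zero. If mappings shown at 90% confidence were right only half the time, that bin alone would contribute a gap of 0.4.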
Our Knowledge Base
The deterministic mapper looks up codes in a structured knowledge base built from authoritative sources. No AI guessing — every mapping traces to a primary source.
Authoritative Sources Only
The knowledge base is built from official coding system publications and regulatory databases, not AI-generated data.
Not AI-Generated
Knowledge base mappings are manually verified against official sources. An AI-derived mapping would introduce hallucination risk into a safety-critical system.
Safety Data from Regulators
Drug availability and safety alerts come from official regulatory databases — the same sources clinicians consult.
Scale
The knowledge base contains verified mappings covering dental codes, ICD-10 variants, drug codes, procedure codes, and notation systems across all supported countries.
Our Approach
We evaluated multiple translation approaches. Our default uses verified knowledge base lookups, which were the most accurate, fastest, and lowest-risk of the approaches we evaluated for medical code mapping.
Why verified mappings are the default
LLM-generated code mappings carry hallucination risk. For a system used in clinical workflows, reproducibility and auditability matter as much as accuracy. Verified knowledge base lookups gave the best accuracy in our evaluation, and because a lookup returns a stored mapping rather than generating one, it cannot hallucinate a code.
AI-Assisted Fallback
The pipeline tries verified mappings first and uses AI assistance only for codes not yet in the knowledge base.
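The lookup order can be sketched as follows. This is an illustration only: `MappingResult`, `map_code`, and the `ai_fallback` callable are hypothetical names, and the fallback is passed in as a plain function so that no particular model API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class MappingResult:
    source_code: str
    target_code: Optional[str]
    provenance: str  # "knowledge_base", "ai_fallback", or "unmapped"


def map_code(
    source_code: str,
    knowledge_base: Dict[str, str],
    ai_fallback: Callable[[str], Optional[str]],
) -> MappingResult:
    """Try the verified lookup first; call the fallback only on a miss."""
    verified = knowledge_base.get(source_code)
    if verified is not None:
        return MappingResult(source_code, verified, "knowledge_base")
    suggestion = ai_fallback(source_code)
    if suggestion is not None:
        return MappingResult(source_code, suggestion, "ai_fallback")
    return MappingResult(source_code, None, "unmapped")
```

Recording provenance on every result lets downstream reviewers tell a verified mapping from an AI suggestion and audit the two separately.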
Continuous Improvement
Accuracy is not a fixed claim. We re-run the full eval on every release and expand the test set when we add new corridors.
Automated Eval Pipeline
Every release triggers a full evaluation across all test cases. Accuracy regressions block deployment.
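A deployment gate of this kind can be as simple as comparing the new release's metrics against the last accepted baseline. The sketch below assumes metrics arrive as name-to-value dictionaries. The metric names, the tolerance parameter, and the exit-code convention are illustrative, not a description of our CI setup.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.0
) -> List[str]:
    """Return names of metrics that dropped below baseline or went missing."""
    regressions = []
    for metric, baseline_value in sorted(baseline.items()):
        current_value = current.get(metric)
        if current_value is None or current_value < baseline_value - tolerance:
            regressions.append(metric)
    return regressions


def gate_exit_code(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.0
) -> int:
    """Return 1 (block deployment) if any metric regressed, else 0."""
    return 1 if find_regressions(baseline, current, tolerance) else 0
```

A release script could pass the result to `sys.exit` so that any regression fails the build. Treating a missing metric as a regression prevents a silently skipped evaluation from passing the gate.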
Test Case Expansion
Our test suite grows continuously. New corridors require new test cases before launch. Edge cases are prioritized — they expose accuracy ceilings.
Knowledge Base Grows with Each Corridor
Adding a new country corridor means researching and verifying mappings from that country's coding authorities — not prompting an AI.
Multi-Axis Quality Review
Beyond automated accuracy scoring, we review translation output across multiple quality dimensions: accuracy, consistency, structure, and readability.
Transparency
Healthcare buyers need to verify accuracy claims. We share our methodology so you can assess whether our evaluation matches your clinical requirements.
"We publish our eval methodology because healthcare accuracy claims must be verifiable. Our 94.5% figure is based on a comprehensive test suite across multiple country corridors — not all possible medical documents. If your use case involves coding systems or corridors not yet in our test set, we will tell you, and we will test it before you go live."
We welcome review from clinical informatics teams, healthcare IT buyers, and researchers.
Per-corridor accuracy breakdowns for all supported country pairs — real eval data, not marketing numbers.