TranslateMD

Eval Methodology

We publish our evaluation methodology because healthcare accuracy claims must be verifiable. Here is exactly how we measure translation accuracy.

Last updated: May 22, 2026

94.5% Overall Accuracy (comprehensive test suite)
11 Countries (supported corridors)
6 Country Corridors (with verified mappings)

How We Test

The test suite covers real-world medical document types across all supported country corridors. Each test case pairs an input document with expected code mappings verified against authoritative sources.
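As a rough illustration of what one test case might look like, here is a minimal sketch. The class names, field names, and all code values are hypothetical placeholders chosen for this example; they are not TranslateMD's actual schema and not real medical codes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExpectedMapping:
    """One verified source-to-target code mapping (hypothetical shape)."""
    source_system: str
    source_code: str
    target_system: str
    target_code: str


@dataclass(frozen=True)
class EvalCase:
    """One test case: an input document plus its expected mappings."""
    case_id: str
    corridor: str          # e.g. "DE→US"
    document_type: str     # e.g. "prescription"
    input_document: str    # synthetic content only, no real PHI
    expected: Tuple[ExpectedMapping, ...]


# Placeholder values only: these are not real codes or real mappings.
example_case = EvalCase(
    case_id="case-0001",
    corridor="DE→US",
    document_type="prescription",
    input_document="<synthetic document text>",
    expected=(
        ExpectedMapping("SRC_SYSTEM", "SRC-001", "TGT_SYSTEM", "TGT-001"),
    ),
)
```

Keeping the expected mappings next to the input document makes each case self-describing, so a reviewer can audit a single case without consulting the rest of the suite.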

DE→US: Germany → United States
IN→US: India → United States
MX→US: Mexico → United States
TR→US: Turkey → United States
TH→US: Thailand → United States
Multi: Cross-corridor and safety alerts

Synthetic PHI

All test documents use synthetic patient data — realistic medical content with no real protected health information.

Authoritative Expected Output

Expected code mappings are verified against official coding system publications, not generated by AI.

Document Types Covered

Dental plans, prescriptions, lab results, discharge summaries, radiology reports, cardiology, oncology, referrals, vaccination records, and insurance claims.

Edge Cases Included

The test set includes cases specifically designed to break deterministic mapping: rare drug codes (NDC, PZN), multilingual input, and cross-system disambiguation (FDI vs. BEMA).


What We Measure

Accuracy is a composite metric. We track four dimensions separately to give a complete picture of translation quality.

Code Mapping Accuracy

The primary metric: the fraction of expected code mappings that the pipeline produces correctly, meaning the source code maps to the correct target code in the correct target system.
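The fraction described above can be sketched in a few lines. This is a minimal illustration, not the production scorer: the function name, the tuple layout, and every code value are assumptions made for this example.

```python
def code_mapping_accuracy(expected, produced):
    """Fraction of expected mappings that the pipeline reproduced exactly.

    Each mapping is a (source_code, target_system, target_code) tuple, so a
    correct target code reported under the wrong system counts as a miss.
    """
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("no expected mappings to score")
    correct = expected_set & set(produced)
    return len(correct) / len(expected_set)


# Placeholder codes, not real mappings.
expected = [
    ("SRC-001", "TGT_SYSTEM_A", "A-100"),
    ("SRC-002", "TGT_SYSTEM_A", "A-200"),
    ("SRC-003", "TGT_SYSTEM_B", "B-300"),
    ("SRC-004", "TGT_SYSTEM_B", "B-400"),
]
produced = [
    ("SRC-001", "TGT_SYSTEM_A", "A-100"),
    ("SRC-002", "TGT_SYSTEM_A", "A-200"),
    ("SRC-003", "TGT_SYSTEM_A", "B-300"),  # right code, wrong system
    ("SRC-004", "TGT_SYSTEM_B", "B-400"),
]

accuracy = code_mapping_accuracy(expected, produced)  # 3 of 4 match
```

Comparing whole tuples rather than target codes alone is what enforces the "right system" requirement.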

Safety Alert Recall

Did the pipeline detect drug unavailability, dosage differences, and notation hazards that require a clinician's attention? 100% recall on current test set.
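Recall here is the share of expected alerts that the pipeline actually raised. The sketch below assumes alerts are identified by a (case id, alert type) pair; that representation, the function name, and the alert type labels are hypothetical.

```python
def safety_alert_recall(expected_alerts, raised_alerts):
    """Share of expected safety alerts that the pipeline raised.

    Alerts are (case_id, alert_type) pairs. Extra alerts that were not
    expected do not lower recall; they would lower precision instead.
    """
    expected_set = set(expected_alerts)
    if not expected_set:
        raise ValueError("no expected alerts to score")
    return len(expected_set & set(raised_alerts)) / len(expected_set)


# Hypothetical alert labels for illustration.
expected_alerts = [
    ("case-0001", "drug_unavailable"),
    ("case-0002", "dosage_difference"),
    ("case-0003", "notation_hazard"),
]
raised_alerts = expected_alerts + [("case-0003", "dosage_difference")]

recall = safety_alert_recall(expected_alerts, raised_alerts)
missed_one = safety_alert_recall(expected_alerts, expected_alerts[:2])
```

In this example `recall` is 1.0 because every expected alert was raised, while `missed_one` falls to two thirds.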

Coverage

Percentage of source codes the knowledge base can map from verified sources, without falling back to AI assistance.
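A minimal sketch of that percentage, assuming the knowledge base can be treated as a dictionary keyed by (coding system, code). The key layout, function name, and code values are illustrative assumptions.

```python
def coverage_percent(source_codes, knowledge_base):
    """Percentage of source codes that have a verified entry in the knowledge base."""
    codes = list(source_codes)
    if not codes:
        raise ValueError("no source codes to check")
    covered = sum(1 for key in codes if key in knowledge_base)
    return 100.0 * covered / len(codes)


# Placeholder entries, not real codes.
knowledge_base = {
    ("SRC_SYSTEM", "SRC-001"): "TGT-001",
    ("SRC_SYSTEM", "SRC-002"): "TGT-002",
    ("SRC_SYSTEM", "SRC-003"): "TGT-003",
    ("SRC_SYSTEM", "SRC-004"): "TGT-004",
}
seen_in_documents = [
    ("SRC_SYSTEM", "SRC-001"),
    ("SRC_SYSTEM", "SRC-002"),
    ("SRC_SYSTEM", "SRC-003"),
    ("SRC_SYSTEM", "SRC-004"),
    ("SRC_SYSTEM", "SRC-999"),  # not in the knowledge base
]

coverage = coverage_percent(seen_in_documents, knowledge_base)  # 4 of 5 covered
```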

Calibration

How well confidence scores track actual accuracy. A 90% confidence badge should be correct ~90% of the time. We continuously improve calibration.
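One common way to check this is to bucket predictions by confidence and compare each bucket's average confidence with its observed accuracy. The sketch below summarizes the gap as expected calibration error (ECE). The page does not say which calibration measure TranslateMD uses, so ECE, the bin count, and all names here are assumptions.

```python
def calibration_table(predictions, n_bins=10):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns rows of (bin_index, count, mean_confidence, accuracy) for
    every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in predictions:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    rows = []
    for index, items in enumerate(bins):
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index, len(items), mean_confidence, accuracy))
    return rows


def expected_calibration_error(predictions, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    rows = calibration_table(predictions, n_bins)
    total = sum(count for _, count, _, _ in rows)
    if total == 0:
        raise ValueError("no predictions to score")
    return sum(
        count / total * abs(mean_confidence - accuracy)
        for _, count, mean_confidence, accuracy in rows
    )


# Ten mappings shown with a 90% badge, nine of them correct: well calibrated.
well_calibrated = [(0.9, True)] * 9 + [(0.9, False)]
# Same badge, only five correct: overconfident by about 0.4.
overconfident = [(0.9, True)] * 5 + [(0.9, False)] * 5

ece_good = expected_calibration_error(well_calibrated)
ece_bad = expected_calibration_error(overconfident)
```

A value near zero means the badges can be read at face value; a large value means they overstate or understate reliability.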


Our Knowledge Base

The deterministic mapper looks up codes in a structured knowledge base built from authoritative sources. No AI guessing — every mapping traces to a primary source.

Authoritative Sources Only

Built from official coding system publications and regulatory databases, not AI-generated data.

Not AI-Generated

Knowledge base mappings are manually verified against official sources. An AI-derived mapping would introduce hallucination risk into a safety-critical system.

Safety Data from Regulators

Drug availability and safety alerts come from official regulatory databases — the same sources clinicians consult.

Scale

Comprehensive verified mappings covering dental codes, ICD-10 variants, drug codes, procedure codes, and notation systems across all supported countries.


Our Approach

We evaluated multiple translation approaches. Our default uses verified knowledge base lookups, which were the most accurate, fastest, and lowest-risk of the approaches we evaluated for medical code mapping.

Why verified mappings are the default

LLM-generated code mappings carry hallucination risk. For a system used in clinical workflows, reproducibility and auditability matter as much as accuracy. For the codes it covers, our verified knowledge base delivers the best accuracy, and because nothing is generated at lookup time, those mappings carry no hallucination risk.

AI-Assisted Fallback

The pipeline tries verified mappings first and uses AI assistance only for codes not yet in the knowledge base.
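The lookup order can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the function and type names, the `origin` labels, the "unmapped" outcome, and the stand-in fallback function are all hypothetical.

```python
from typing import Callable, Dict, NamedTuple, Optional, Tuple

CodeKey = Tuple[str, str]  # (coding system, code)


class MappingResult(NamedTuple):
    target_code: Optional[str]
    origin: str  # "verified", "ai_assisted", or "unmapped"


def map_code(
    key: CodeKey,
    knowledge_base: Dict[CodeKey, str],
    ai_fallback: Callable[[CodeKey], Optional[str]],
) -> MappingResult:
    """Try the verified knowledge base first; call the fallback only on a miss."""
    verified = knowledge_base.get(key)
    if verified is not None:
        return MappingResult(verified, "verified")
    suggestion = ai_fallback(key)
    if suggestion is None:
        return MappingResult(None, "unmapped")
    return MappingResult(suggestion, "ai_assisted")


# Placeholder data and a stand-in for the AI-assisted step.
knowledge_base = {("SRC_SYSTEM", "SRC-001"): "TGT-001"}


def fake_ai_fallback(key: CodeKey) -> Optional[str]:
    return "TGT-999" if key == ("SRC_SYSTEM", "SRC-002") else None


hit = map_code(("SRC_SYSTEM", "SRC-001"), knowledge_base, fake_ai_fallback)
miss = map_code(("SRC_SYSTEM", "SRC-002"), knowledge_base, fake_ai_fallback)
unknown = map_code(("SRC_SYSTEM", "SRC-003"), knowledge_base, fake_ai_fallback)
```

Recording the origin alongside each result keeps AI-assisted mappings distinguishable from verified ones in downstream review and in the coverage metric.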


Continuous Improvement

Accuracy is not a fixed claim. We re-run the full eval on every release and expand the test set when we add new corridors.

Automated Eval Pipeline

Every release triggers a full evaluation across all test cases. Accuracy regressions block deployment.
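A release gate of this kind could be as simple as comparing the new metrics with the last released baseline. The sketch below is an assumption about how such a check might look; the metric names, the tolerance parameter, and the numbers are illustrative, not actual eval results.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return the names of metrics that dropped below baseline minus tolerance.

    A metric that is missing from the current run is treated as a regression.
    """
    regressions = []
    for metric, baseline_value in baseline.items():
        current_value = current.get(metric)
        if current_value is None or current_value < baseline_value - tolerance:
            regressions.append(metric)
    return regressions


# Illustrative numbers only.
baseline = {"code_mapping_accuracy": 0.90, "safety_alert_recall": 1.00}
current = {"code_mapping_accuracy": 0.91, "safety_alert_recall": 0.95}

regressed = find_regressions(baseline, current)
deployment_blocked = bool(regressed)
```

In a CI job, a non-empty result would typically be turned into a non-zero exit status so that the release step does not run.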

Test Case Expansion

Our test suite grows continuously. New corridors require new test cases before launch. Edge cases are prioritized — they expose accuracy ceilings.

Knowledge Base Grows with Each Corridor

Adding a new country corridor means researching and verifying mappings from that country's coding authorities — not prompting an AI.

Multi-Axis Quality Review

Beyond the automated accuracy metrics, we review translation output along four quality dimensions: accuracy, consistency, structure, and readability.


Transparency

Healthcare buyers need to verify accuracy claims. We share our methodology so you can assess whether our evaluation matches your clinical requirements.

"We publish our eval methodology because healthcare accuracy claims must be verifiable. Our 94.5% figure is based on a comprehensive test suite across multiple country corridors — not all possible medical documents. If your use case involves coding systems or corridors not yet in our test set, we will tell you, and we will test it before you go live."

Questions about our methodology?
research@translatemd.io

We welcome review from clinical informatics teams, healthcare IT buyers, and researchers.

Security & Compliance

HIPAA-ready, GDPR-compliant, PHI-safe processing.

Per-Corridor Accuracy

Per-corridor accuracy breakdowns for all supported country pairs — real eval data, not marketing numbers.