Current evaluation of large language models (LLMs) in clinical settings relies predominantly on multiple-choice knowledge benchmarks — MedMCQA, USMLE, MedXpertQA — that measure factual recall under idealized conditions. These benchmarks do not capture the performance dimensions that matter most in deployed AI scribe workflows: note accuracy, hallucination rate, transcription fidelity, instruction adherence, and specialty-specific documentation quality.
This paper proposes a three-tier Clinical Performance Framework (CPF) that reorders evaluation priorities around clinical workflow utility rather than knowledge recall. We apply this framework to compare frontier models — GPT-4o, Claude Sonnet, Gemini Flash, Llama, and DeepSeek — across benchmarks spanning all three tiers. We find that rankings on knowledge recall benchmarks correlate poorly with clinical workflow performance, and that the models best suited for AI scribe deployment are not necessarily those that score highest on standardized exams.
Note: This is a living working paper. Sections marked [pending] represent planned analyses as data collection is completed.
The deployment of LLMs as AI medical scribes represents one of the highest-stakes applications of generative AI in healthcare. A scribe that hallucinates a medication dose, omits a critical allergy, or misrepresents a diagnosis does not merely produce a suboptimal document — it introduces patient safety risk. Yet the standard apparatus for evaluating clinical LLMs remains rooted in multiple-choice question answering benchmarks that test knowledge recall in isolation from any real clinical workflow.
The problem is structural. Benchmarks like USMLE and MedMCQA were designed to measure whether models possess clinical knowledge, not whether they can reliably translate spoken clinical encounters into accurate, structured documentation. A model that scores in the 95th percentile on USMLE Step 2 may still hallucinate drug interactions, misformat a SOAP note for psychiatry, or degrade catastrophically on a 45-minute transcript. None of these failure modes are visible in knowledge recall benchmarks.
This paper argues for a reorientation of evaluation methodology around clinical utility, and presents a tiered framework that places workflow-relevant performance metrics at the center of model comparison.
We propose a three-tier Clinical Performance Framework (CPF) that organizes evaluation metrics by their proximity to real clinical deployment. Higher tiers represent greater clinical relevance and generally greater measurement difficulty.
The table below summarizes the benchmarks included in this study, organized by tier and annotated with relevance to AI scribe deployment.
| Benchmark | Tier | Task type | Scribe relevance |
|---|---|---|---|
| MedMCQA | 1 | Multiple-choice QA | Low — recall only |
| MedXpertQA | 1 | Expert-level MCQA | Low — discriminates top models but not workflow-relevant |
| USMLE Step 1/2/3 | 1 | Multiple-choice QA | Low — widely cited, poor proxy for documentation |
| PubMedQA | 1 | Biomedical QA | Low |
| MCR | 2 | Case reasoning | Medium — tests inference chain |
| ICD-10 / CPT Coding | 2 | Code assignment | High — billing accuracy is a direct workflow output |
| SOAP Fidelity | 2 | Structured generation | High — specialty format adherence |
| Note Accuracy (hallucination) | 3 | Factual verification | Critical — #1 patient safety metric |
| Transcription WER | 3 | ASR / speech | Critical — accuracy on medical vocab (drug names, dosages) |
| Latency (p50/p95) | 3 | System performance | High — affects provider workflow adoption |
| Instruction following rate | 3 | Preference adherence | High — custom templates, note style preferences |
| Context length degradation | 3 | Long-context accuracy | High — 45-min visit vs. 10-min visit performance gap |
| Specialty accuracy | 3 | Domain-specific eval | High — psych ≠ primary care ≠ OB/GYN documentation |
The seven metrics characterized in this section (the six Tier 3 metrics, plus SOAP structure fidelity from Tier 2) represent the core empirical contribution of this work. Unlike recall benchmarks, each requires a defined ground truth corpus, a measurement protocol, and often a deployment environment to test against. What follows is a characterization of each metric — its clinical rationale, how it is measured, and the current state of published evidence.
Note accuracy is the foundational metric for AI scribe evaluation and the one with the most direct patient safety implications. It measures the degree to which an AI-generated clinical note faithfully represents the actual clinical encounter — specifically whether it introduces content that was not present in the visit (hallucination) or omits content that was (omission). Both failure types carry distinct risks: hallucinated medications or dosages create prescribing hazards; omitted findings may cause documentation gaps that affect downstream care decisions or malpractice exposure.
Measurement requires a ground truth corpus of annotated visit pairs: a human-transcribed or EHR-verified record of what actually occurred, matched against the AI-generated note. Annotation categories typically include medication mentions (name, dose, frequency, route), diagnosis statements, exam findings, lab values, and plan elements. Each discrepancy is classified as a hallucination, omission, or paraphrase-equivalent. The resulting metric is usually expressed as an error rate per note segment (e.g., medication hallucination rate per 100 note-medications) rather than a single aggregate score, since different error types carry different clinical weights.
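The per-segment error-rate computation described above can be sketched as follows. This is a minimal illustration, not the study's annotation pipeline; the `Discrepancy` record and its category names are hypothetical stand-ins for the annotation schema.

```python
from dataclasses import dataclass

# Hypothetical annotation record; field names and categories are illustrative
# stand-ins for the annotation schema described in the text.
@dataclass
class Discrepancy:
    element_type: str  # e.g. "medication", "diagnosis", "exam_finding"
    kind: str          # "hallucination", "omission", or "paraphrase_equivalent"

def error_rate_per_100(discrepancies, n_elements, element_type, kind):
    """Errors of one kind per 100 note-elements of one type,
    rather than a single aggregate score across all error types."""
    if n_elements == 0:
        return 0.0
    n_errors = sum(1 for d in discrepancies
                   if d.element_type == element_type and d.kind == kind)
    return 100.0 * n_errors / n_elements

# 2 hallucinated medications among 40 medication mentions -> 5.0 per 100
anns = [Discrepancy("medication", "hallucination"),
        Discrepancy("medication", "hallucination"),
        Discrepancy("medication", "omission")]
rate = error_rate_per_100(anns, 40, "medication", "hallucination")  # 5.0
```

Reporting separate rates per element type and error kind preserves the distinction between, say, a hallucinated dose and an omitted exam finding, which an aggregate score would collapse.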
Published evidence on this metric is sparse and methodologically inconsistent. Most AI scribe vendors report internal accuracy figures without releasing annotation schemas or inter-rater reliability statistics, making cross-vendor comparison impossible. A small number of academic studies have examined LLM hallucination in clinical note generation — notably work by Umapathi et al. (2023) on medical hallucinations in GPT-4 and subsequent evaluations using the MedHallu framework — but these use synthetic or de-identified corpora that may not reflect real encounter complexity. To our knowledge, no published study has evaluated hallucination rate across multiple frontier models on matched, specialty-stratified real-world clinical encounters using a standardized annotation schema. This study aims to address that gap.
The AI scribe pipeline begins before the LLM. For most deployed systems, a spoken clinical encounter is first transcribed by an automatic speech recognition (ASR) model, and only then processed by the language model into a structured note. Errors at the transcription stage propagate forward: a misheard drug name or an incorrect dosage in the transcript will be reproduced — sometimes confidently — in the generated note. Standard Word Error Rate (WER) is an insufficient metric in this context because it weights all words equally. Transcribing "the patient reports fatigue" as "the patient reports failure" counts as one word error; transcribing "carvedilol 25 mg twice daily" as "carvedilol 25 mg daily" also counts as one word error, with a fundamentally different clinical consequence.
Medical WER extends the standard metric by applying domain-specific weights to a curated vocabulary of high-stakes terms: drug names, dosage units, medical procedures, anatomical landmarks, and numerical values associated with clinical thresholds. Some implementations further distinguish between low-stakes paraphrasing errors and semantically consequential substitutions. Character Error Rate (CER) is additionally useful for evaluating transcription of drug names where single-character differences produce entirely different medications (e.g., hydroxyzine vs. hydralazine).
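The weighting scheme can be made concrete with a small sketch. This assumes a hand-curated high-stakes lexicon and a uniform 3.0 error weight; both are illustrative choices, not a published standard.

```python
# Illustrative high-stakes lexicon; a real one would cover drug names,
# dosage units, procedures, and clinically thresholded numeric values.
HIGH_STAKES = {"carvedilol", "25", "mg", "twice", "daily"}

def weighted_wer(ref, hyp, lexicon=HIGH_STAKES, weight=3.0):
    """Levenshtein word alignment where errors on lexicon terms cost `weight`
    and all other errors cost 1.0; normalized by total reference weight."""
    r, h = ref.lower().split(), hyp.lower().split()
    cost = lambda tok: weight if tok in lexicon else 1.0
    dp = [[0.0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = dp[i - 1][0] + cost(r[i - 1])           # delete ref word
    for j in range(1, len(h) + 1):
        dp[0][j] = dp[0][j - 1] + 1.0                      # insert hyp word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0.0 if r[i - 1] == h[j - 1] else cost(r[i - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + sub,         # match/substitute
                           dp[i - 1][j] + cost(r[i - 1]),  # delete ref word
                           dp[i][j - 1] + 1.0)             # insert hyp word
    return dp[-1][-1] / sum(cost(t) for t in r)

ref = "patient reports fatigue and takes carvedilol 25 mg twice daily"
weighted_wer(ref, ref.replace("fatigue", "failure"))  # 0.05 (low-stakes word)
weighted_wer(ref, ref.replace(" twice", ""))          # 0.15 (dropped dose term)
```

Plain WER scores both hypotheses identically (one error in ten words); the weighted variant separates the benign substitution from the dropped dosage term.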
The clinical AI literature has generated a growing body of ASR evaluation work, particularly since the proliferation of ambient documentation systems. Lybarger et al. evaluated ASR performance specifically on clinical notes across specialties; Miner et al. (2020) characterized WER on medical dictation across multiple commercial ASR platforms. However, most published evaluations were conducted before the current generation of large ASR models (Whisper, Azure Cognitive Services, Google Medical Speech) and may not reflect current system performance. Additionally, few studies stratify by speaker accent, which is clinically relevant for international healthcare settings.
End-to-end latency measures the elapsed time from the end of a clinical encounter recording to the moment a complete, reviewable note is available to the provider. In deployed ambulatory settings, this typically must fall under 60–90 seconds to avoid disrupting clinical workflow — a provider seeing 20–25 patients in a day cannot absorb multi-minute delays per encounter without schedule impact. Latency is therefore a hard feasibility constraint as much as a performance dimension.
The metric is best reported at p50 (median) and p95 (95th percentile). Median latency reflects typical performance; the p95 value captures the tail behavior that determines whether the system is reliably usable — a scribe that averages 30 seconds but occasionally takes 4 minutes creates unpredictable provider experience. Latency decomposes into ASR processing time, LLM inference time (including time-to-first-token and token generation rate), and any orchestration overhead. For streaming-capable models, time-to-first-token may matter as much as total generation time, since providers can begin reviewing partial notes before generation completes.
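Percentile reporting is straightforward with the standard library; a minimal sketch, with illustrative sample values:

```python
import statistics

def latency_summary(samples_s):
    """p50/p95 summary of end-to-end latencies in seconds
    (inclusive-quantile convention)."""
    return {"p50": statistics.median(samples_s),
            "p95": statistics.quantiles(samples_s, n=20, method="inclusive")[18]}

# 18 typical encounters plus two pathological tail events (values illustrative)
samples = [30.0] * 18 + [200.0, 240.0]
latency_summary(samples)  # {'p50': 30.0, 'p95': 202.0}
```

The median alone would report this pipeline as comfortably within the 60–90 second budget; the p95 value exposes the tail behavior that actually determines provider experience.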
Published latency benchmarks specific to clinical AI are rare; most vendor disclosures omit p95 statistics and report under controlled rather than production-load conditions. General LLM inference benchmarks (e.g., Artificial Analysis) provide useful comparative data on token throughput and TTFT across model providers, but these measure raw generation speed rather than the end-to-end clinical pipeline. Our evaluation protocol measures latency from recording upload to first rendered note character, under standardized network conditions, across visit lengths of 10, 20, and 45 minutes.
Clinical AI scribes are not used with a fixed prompt. Providers establish note preferences — preferred section order, narrative versus bullet format for HPI, how to handle medication reconciliation, whether to include verbatim patient quotes — that should persist across encounters. Instruction following rate measures the proportion of provider-specified formatting and content preferences that are honored in generated notes, evaluated across repeated generations with the same instruction set. It is a measure of reliability rather than capability: a model that can produce a psychiatry MSE correctly when explicitly instructed, but drops it 30% of the time on subsequent regenerations, is not clinically usable for that workflow.
Measurement requires a defined preference set — a structured list of provider-specified instructions across formatting, content inclusion, and structural conventions — evaluated against generated notes by a combination of rule-based checkers (for deterministic preferences like section presence) and human annotators (for stylistic preferences like narrative tone). General-purpose instruction following benchmarks such as IFEval (Zhou et al., 2023) exist, but they test simple, verifiable constraints in short-context settings that do not reflect the complexity of multi-preference clinical note generation. No published benchmark evaluates instruction following specifically in the context of AI scribe customization.
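The rule-based half of this protocol can be sketched as a table of deterministic checkers evaluated over repeated regenerations. The checkers below are hypothetical examples, not a published preference schema.

```python
# Hypothetical deterministic preference checkers; each returns True when a
# generated note honors one provider preference. Rules are illustrative.
CHECKERS = {
    # psychiatry template must contain a mental status exam section
    "mse_present": lambda note: "mental status exam" in note.lower(),
    # Plan section must be bulleted (naive check on text after "Plan:")
    "plan_bulleted": lambda note: "\n- " in note.split("Plan:")[-1],
    # provider prefers paraphrase over verbatim patient quotes
    "no_verbatim_quotes": lambda note: '"' not in note,
}

def instruction_following_rate(notes):
    """Fraction of (regenerated note, preference) pairs honored."""
    checks = [chk(n) for n in notes for chk in CHECKERS.values()]
    return sum(checks) / len(checks)

note_ok = ("Mental Status Exam: alert, euthymic\n"
           "Plan:\n- continue sertraline\n- follow up in 4 weeks\n")
note_bad = ('Mental Status Exam: patient stated "I am fine"\n'
            "Plan:\n- follow up\n")
instruction_following_rate([note_ok, note_bad])  # 5/6: one quote violation
```

Evaluating across regenerations rather than a single note is what makes this a reliability measure: the same instruction set is scored on every output.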
Most clinical AI evaluations are conducted on short, controlled clinical vignettes that occupy a fraction of a model's nominal context window. Real clinical encounters are substantially longer: a complex new patient visit in pediatric endocrinology may involve 40–50 minutes of recorded conversation generating 8,000–14,000 tokens of transcribed text before the LLM processes it. The "lost in the middle" phenomenon — in which LLM attention degrades for content appearing in the middle of long contexts relative to content near the beginning or end — has been documented across multiple model families (Liu et al., 2023). Its specific impact on clinical note generation has not been systematically characterized.
This metric is evaluated by comparing note accuracy (using the hallucination/omission framework from §4.1) stratified by visit length: short (<15 min), medium (15–30 min), and long (>30 min). Accuracy degradation is expressed as the relative change in error rate from the short to the long visit cohort. A clinically meaningful degradation threshold has not been formally established; we propose that a >20% relative increase in hallucination rate from short to long visits constitutes a clinically significant finding warranting disclosure in deployment documentation.
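The degradation computation reduces to a relative change in stratified error rates; a minimal sketch using the proposed (not formally established) 20% threshold:

```python
def relative_degradation(short_rate, long_rate):
    """Relative change in error rate from the short-visit cohort
    to the long-visit cohort."""
    return (long_rate - short_rate) / short_rate

def clinically_significant(short_rate, long_rate, threshold=0.20):
    """Proposed cutoff from the text: >20% relative increase in
    hallucination rate warrants disclosure in deployment documentation."""
    return relative_degradation(short_rate, long_rate) > threshold

# e.g. 2.0 errors/100 elements on short visits vs. 2.6 on long visits
relative_degradation(2.0, 2.6)     # 0.30 relative increase
clinically_significant(2.0, 2.6)   # True
```

The same computation applies per error type, so a model can be flagged for medication hallucination degradation even if its aggregate rate is stable.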
Clinical documentation conventions differ substantially across specialties, and an aggregate accuracy score obscures performance variation that matters enormously in deployment decisions. A psychiatric progress note requires a structured mental status exam (orientation, affect, thought process, cognition, insight), suicide risk assessment language anchored to specific frameworks (Columbia Suicide Severity Rating Scale), and safety planning documentation — none of which appear in a primary care SOAP note. A pediatric endocrinology visit requires growth parameters (height, weight, BMI percentile, pubertal staging), insulin regimen documentation with basal-bolus ratios, and CGM metrics. An OB/GYN encounter centers on gestational age, fundal height, fetal heart tones, and trimester-specific risk counseling documentation.
General-purpose LLMs trained primarily on broad medical corpora may produce structurally plausible notes for primary care that nonetheless fail the specialty-specific documentation requirements that determine reimbursement, regulatory compliance, and continuity of care. Specialty-stratified evaluation requires separate annotation schemas and ground truth corpora for each specialty of interest. Our initial evaluation targets three specialties available through our clinical data partnership: pediatric endocrinology (n ≈ 150 encounters), psychiatry (n ≈ 120 encounters), and primary care (n ≈ 200 encounters). Each specialty has a distinct rubric developed in consultation with board-certified clinicians in that field.
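A specialty-stratified score can be operationalized as a weighted rubric per specialty. The rubric items and weights below are hypothetical placeholders for the clinician-developed rubrics described above.

```python
# Hypothetical per-specialty rubrics; items and weights are illustrative
# stand-ins for the clinician-developed rubrics described in the text.
RUBRICS = {
    "psychiatry": {"mse_documented": 3, "risk_assessment": 3, "safety_plan": 2},
    "peds_endo": {"growth_parameters": 3, "insulin_regimen": 3, "cgm_metrics": 2},
}

def rubric_score(specialty, passed_items):
    """Weighted fraction of a specialty rubric that a note satisfies."""
    rubric = RUBRICS[specialty]
    earned = sum(w for item, w in rubric.items() if item in passed_items)
    return earned / sum(rubric.values())

# A psychiatry note with an MSE and risk assessment but no safety plan
rubric_score("psychiatry", {"mse_documented", "risk_assessment"})  # 0.75
```

Keeping the scores per-rubric rather than averaging across specialties preserves exactly the stratification the aggregate-accuracy critique calls for.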
Independent of content accuracy, clinical notes must conform to structural conventions that serve documentation, reimbursement, and legal functions. The SOAP format (Subjective, Objective, Assessment, Plan) is the standard framework in ambulatory care, but its implementation varies by specialty, institution, and payer requirements. Structure fidelity evaluates whether generated notes contain required sections, whether content appears in the correct section (a physical exam finding placed in the Subjective section is a structural error regardless of accuracy), and whether section-specific conventions are honored (e.g., Assessment sections should contain diagnostic statements with supporting rationale, not isolated lab values).
This metric is partially automatable: section presence and order can be verified programmatically, while content placement and section-internal conventions require human review or a secondary LLM judge. A related but distinct quality dimension is note length calibration — notes that are excessively verbose reduce provider efficiency and may bury clinically important information; notes that are too brief may fail E/M coding requirements that determine reimbursement level. Published work by Gao et al. (2023) on automated SOAP note evaluation provides a relevant methodological foundation, though it does not include the multi-model comparison and specialty stratification proposed here.
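The automatable slice (section presence and order) can be sketched as follows; the naive `"Section:"` string matching is an illustrative simplification of a production parser.

```python
REQUIRED = ["Subjective", "Objective", "Assessment", "Plan"]

def soap_structure(note):
    """Programmatic part of structure fidelity: which required sections are
    present and whether they appear in canonical order. Content placement
    and section-internal conventions still require human or LLM-judge review."""
    pos = {s: note.find(s + ":") for s in REQUIRED}
    present = [s for s in REQUIRED if pos[s] >= 0]
    in_order = all(pos[a] < pos[b] for a, b in zip(present, present[1:]))
    return {"presence_pct": 100.0 * len(present) / len(REQUIRED),
            "in_order": in_order}

good = ("Subjective: cough x3 days\nObjective: afebrile\n"
        "Assessment: viral URI\nPlan: supportive care\n")
soap_structure(good)  # {'presence_pct': 100.0, 'in_order': True}
```

A note that places an exam finding under Subjective would pass this check and fail only the human or judge-model review stage, which is why the metric is described as only partially automatable.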
| Metric | Unit / scale | Ground truth source | Prior published work | Status |
|---|---|---|---|---|
| Note accuracy (hallucination rate) | Errors per 100 note-elements | EHR encounter records, human annotation | Sparse; no multi-model standardized comparison | In design |
| Transcription WER (medical) | Weighted WER %; CER on drug names | Human-verified transcripts | Lybarger et al.; Miner et al. (2020); pre-Whisper era | In design |
| End-to-end latency | Seconds; p50 and p95 | Instrumented pipeline (Langfuse traces) | Artificial Analysis benchmarks (general); no clinical-specific published data | In progress |
| Instruction following rate | % preferences honored per note | Rule-based + human review against preference set | IFEval (general); no clinical scribe-specific benchmark | In design |
| Context length degradation | Δ error rate (short vs. long visits) | Same as note accuracy, stratified by length | Liu et al. (2023) "lost in the middle"; not evaluated in clinical notes | Planned |
| Specialty-specific accuracy | Per-rubric score; specialty-stratified error rate | Specialty-specific annotation rubrics (endo, psych, primary care) | No published multi-specialty, multi-model comparison | In progress |
| SOAP structure fidelity | Section presence %; placement accuracy; length calibration | Programmatic + human judge | Gao et al. (2023); single-model, no specialty stratification | Planned |
The following table reports Tier 1 benchmark scores across frontier models evaluated to date. These scores are included for external comparability but should not be interpreted as predictive of clinical workflow suitability.
| Model | MedMCQA | MedXpertQA | USMLE avg. | Notes |
|---|---|---|---|---|
| GPT-4o | 87.5 | 74.2 | 90.1 | Strong across all tiers; latency varies |
| Claude Sonnet | 85.3 | 71.8 | 88.4 | Strong instruction following; verbose notes |
| Gemini Flash | 82.1 | 68.5 | 85.7 | Fastest latency; variable specialty accuracy |
| DeepSeek | 79.4 | 65.2 | 82.3 | Strong reasoning; open-weight variant available |
| Llama 3 | 76.8 | 61.3 | 79.9 | Open-weight; fine-tunable for specialty use |
[pending] Tier 3 clinical workflow results — note accuracy, WER, latency, and instruction following rate — will be reported here as evaluations are completed.
The central finding anticipated by this framework is that model rankings on Tier 1 benchmarks will not predict rankings on Tier 3 clinical workflow metrics. This is not a novel hypothesis — it mirrors well-documented gaps in other applied AI domains — but it has not been systematically demonstrated for AI scribe deployments with real EHR ground truth.
The practical implication is significant: healthcare organizations procuring AI scribe technology should treat USMLE or MedMCQA scores as weak signals at best. The metrics that matter for patient safety — hallucination rate on drug names and dosages, ASR accuracy on medical vocabulary, degradation across visit lengths — are not captured by any currently published clinical AI benchmark.
A secondary finding likely to emerge is the importance of specialty stratification. A model well-calibrated for primary care note generation may perform substantially worse on psychiatric documentation, which has distinct structural conventions (mental status exam, risk assessment language, suicide safety planning format). Aggregate accuracy scores obscure this.
[in progress] Full discussion pending completion of Tier 3 analyses.