Research insight

The medical audio benchmark healthcare AI has been missing

Learn how MedMosaic, a clinician-validated benchmark accepted at ICML 2026, exposes the limitations of current healthcare audio AI models and demonstrates why clinically grounded data and rigorous evaluation are essential for reliable clinical reasoning.

Topics

Healthcare AI
Audio AI
AI Evaluation
Multimodal AI
Synthetic Data
Clinical AI

Author(s)

Harshit Rajgarhia
Shuubham Ojha
Asif Shaik
Akhil Pothanapalli
Rachuri Lokesh
Abhishek Mukherji
Prasanna Desikan
Sasakorn Phanitsombat

Much of what happens in medicine is spoken, not written. It’s in the sound of labored breathing, the hesitation in a worried patient’s voice, and the back-and-forth of a doctor and patient working through symptoms in conversation. These are clinically meaningful signals, and they exist only in audio, yet they’re largely missing from the audio benchmarks used to test AI.

MedMosaic, a new benchmark from the Centific AI Research (CAIR) team, was built to address that disconnect. Accepted at ICML 2026, MedMosaic tests audio language models on more than 46,000 realistic medical audio questions that demand reasoning, not just pattern matching. The results expose a problem: even the strongest model we tested answered only about two-thirds of questions (roughly 68%) correctly, and most models fell well below that.

Current healthcare model benchmarks don’t reflect real clinical conditions

Interpreting sound is harder than it looks. Real clinical audio is noisy and multimodal, and it unfolds across long conversations rather than in clean, isolated clips. Benchmarks built on such clips leave long-range temporal reasoning and subtle acoustic markers largely untested.

Centific developed MedMosaic to rigorously characterize and overcome the mismatch between real clinical audio and existing benchmarks. Those benchmarks leave today’s models unreliable on the very signals clinicians depend on. Current benchmarks fall short in three important respects:

  • Clinically meaningful cues, such as labored breathing or hesitation in a patient’s voice, live in the audio itself, yet most benchmarks reduce it to short, single-turn clips that strip them away.

  • Audio-language models remain fragmented across separate domains, with distinct families specialized for environmental sound, speech, or music, so that cross-domain reasoning over speech and physiological sound together is rarely tested.

  • Real clinical recordings are multi-turn conversations under low-resource conditions that demand multi-hop reasoning rather than surface-level pattern matching.

MedMosaic was built to evaluate models under those realistic constraints.

Figure 1

Centific contributes a rigorous, expert-validated benchmark for clinical audio reasoning

MedMosaic was designed to test the capabilities that matter most in clinical audio reasoning but remain largely absent from existing benchmarks. It advances the evaluation of healthcare audio AI in four important ways:

  • Provides a medical audio QA benchmark spanning multiple reasoning regimes: 46,701 questions over heterogeneous medical audio sources, including non-speech physiological sounds and conversational speech, testing short-context inference, long-context temporal reasoning, multi-turn conversational reasoning, and reasoning about physiological sounds, enabling systematic evaluation of both audio-only and multimodal models.

  • Introduces a scalable synthetic audio generation pipeline: a controlled framework that produces synthetic medical speech embedding physiological artifacts such as coughs, emotional and distress markers, and temporally structured information, allowing the dataset to scale while preserving explicit control over reasoning complexity.

  • Defines a reasoning-focused evaluation protocol that goes beyond multiple-choice QA to include open-ended and voice-embedded settings, where the question and answer are carried directly in the audio waveform. Open-ended items demand unconstrained reasoning over extended audio with concise answer generation, while voice-based items require models to handle an abrupt shift from interpreting a clinical conversation to understanding the question being asked.

  • Demonstrates the scalable generation of challenging benchmarks: the pipeline produces all 46,701 question-answer pairs with minimal human oversight, yet the benchmark remains highly difficult, with state-of-the-art models reaching only 68.1% (Gemini-2.5-Pro), 60.5% (Gemini-2.5-Flash), and 42.8% (Qwen-2.5-Omni-7B) weighted average accuracy, establishing synthetic generation as a viable paradigm for constructing rigorous, domain-specific benchmarks.
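The article reports "weighted average accuracy" without spelling out the weighting. The sketch below assumes the most common reading, in which each question category is weighted by its number of questions, and contrasts it with a plain average over categories. The category names and counts are hypothetical and are not MedMosaic results.

```python
from typing import Mapping, Tuple


def weighted_accuracy(per_category: Mapping[str, Tuple[int, int]]) -> float:
    """Accuracy weighted by category size.

    per_category maps a category name to (num_correct, num_questions).
    Weighting by question count is an assumption; the article does not
    define the weights it uses.
    """
    total_correct = sum(correct for correct, _ in per_category.values())
    total_questions = sum(n for _, n in per_category.values())
    if total_questions == 0:
        raise ValueError("no questions to score")
    return total_correct / total_questions


def macro_accuracy(per_category: Mapping[str, Tuple[int, int]]) -> float:
    """Unweighted mean of per-category accuracies, for comparison."""
    accuracies = [correct / n for correct, n in per_category.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical counts for illustration only.
example = {
    "short_context": (80, 100),
    "long_context": (30, 50),
    "multi_turn": (120, 200),
    "physiological_sound": (20, 50),
}

weighted = weighted_accuracy(example)
macro = macro_accuracy(example)
print(f"weighted: {weighted:.3f}, macro: {macro:.3f}")
```

The two numbers differ whenever categories differ in size, which is why a benchmark with uneven category sizes needs to state which one it reports.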

The result is a benchmark that measures capabilities healthcare AI systems will need in production, rather than capabilities they can demonstrate under idealized test conditions.

How Centific generated and validated MedMosaic

None of this works if the synthetic data isn’t trustworthy. The credibility of MedMosaic therefore depends on how the dataset was generated and validated. Its question–answer pairs were generated synthetically with Gemini-3-Flash, over audio that mixes synthetic voices with real clinical conversations. A stratified sample of 145 pairs was then validated by two healthcare professionals acting as subject-matter experts, who scored each pair against a five-dimension rubric. They accepted 72.4% without modification, asked for minor revisions on 22.8%, and rejected 4.8%, a result the authors present as strong clinical validity. The lesson isn’t that clinicians must assemble the data themselves, but that synthetic medical data needs rigorous expert validation to be trustworthy.
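The validation figures can be sanity-checked with simple arithmetic. The sketch below tallies reviewer verdicts into percentages. The counts of 105, 33, and 7 are not reported in the article; they are inferred here as the only whole-number split of 145 pairs that reproduces the stated 72.4%, 22.8%, and 4.8%. The verdict labels are hypothetical, and the article does not say how the two reviewers' scores were combined into one verdict per pair.

```python
from collections import Counter
from typing import Dict, Sequence


def verdict_rates(verdicts: Sequence[str]) -> Dict[str, float]:
    """Return each verdict's share of the sample as a percentage (1 decimal)."""
    if not verdicts:
        raise ValueError("empty validation sample")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Inferred counts (assumption): 105 + 33 + 7 = 145 validated pairs.
verdicts = ["accept"] * 105 + ["minor_revision"] * 33 + ["reject"] * 7
rates = verdict_rates(verdicts)
print(rates)
```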

Explore a sample of the MedMosaic dataset.

Centific tested 13 models on real medical audio. The best model got only 68% right

Across 13 audio and multimodal reasoning models, reasoning over medical audio remained challenging for every system evaluated. The strongest, Gemini-2.5-Pro, achieved only 68.1% accuracy, while many open models performed in the 20%–43% range. Three patterns were consistent:

  • Reasoning remains difficult even for state-of-the-art systems.

  • Models exhibit uneven competence, performing adequately on speech or on sound but rarely on both.

  • Performance varies substantially by question type, indicating unreliable and inconsistent capability rather than reliable clinical understanding.
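One way to surface the uneven competence described in these patterns is to break accuracy down by question type and look at the gap between the best and worst type. The sketch below is a minimal illustration under assumed inputs: the record format, the type names, and the sample data are hypothetical and do not come from the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per question type from (question_type, is_correct) pairs."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question_type, is_correct in records:
        total[question_type] += 1
        correct[question_type] += int(is_correct)
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def accuracy_spread(per_type: Dict[str, float]) -> float:
    """Gap between the best and worst question type; larger means less even."""
    return max(per_type.values()) - min(per_type.values())


# Hypothetical model outputs for illustration only.
records = [
    ("speech", True), ("speech", True), ("speech", True), ("speech", False),
    ("sound", True), ("sound", False), ("sound", False), ("sound", False),
]
per_type = accuracy_by_type(records)
spread = accuracy_spread(per_type)
print(per_type, spread)
```

A single headline accuracy would hide this gap, which is the reason to report per-type results alongside the aggregate.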

These findings are not surprising: they reflect exactly the problems our healthcare data work is built to solve. They reinforce three requirements for building healthcare AI systems that can reason reliably about clinical audio.

Figure 3
  • Models need multimodal clinical grounding. MedMosaic shows that models fail when they can handle speech but not sound, or sound but not speech. That inconsistency comes from inconsistent training data. To reason reliably across clinical audio, a model needs to be trained on data that spans the full picture: clinical conversations, ambient sounds, medical imaging, and health records. All data should be annotated not just for what was said, but for what was clinically important. Breadth in the data is what produces breadth in the model.

  • Context compounds across turns. Multi-turn dialogue helped rather than hurt: 5 of 13 models scored their best on multi-turn questions, since earlier turns supply context that sharpens later reasoning. Even at the top, the smaller Gemini-2.5-Flash beat the larger Gemini-2.5-Pro here, which we tie to Pro favoring shorter-horizon reasoning. Handled well, conversation is signal, not just harder input.

  • General-purpose audio isn't clinical audio. Audio Flamingo 3 arrives with a strong reputation for its unified speech-sound-music framework, yet underperformed on MedMosaic. Its training data leans on read speech, audiobooks, music, and environmental sound, none of which capture real clinical acoustics, and we attribute its weaker results to that domain mismatch.

The implication is that reliable healthcare AI will depend less on advances in model architecture than on building clinically grounded data and evaluation frameworks that reflect how medicine is practiced.

Explore the full MedMosaic benchmark on Centific’s site.

The fix isn’t a bigger model; it’s better data and more difficult tests

MedMosaic shows that current approaches to training and evaluating healthcare AI are insufficient for clinical audio reasoning.

The models tested are already among the most powerful in the world, yet even the best scored only 68.1%. What they lack is not scale. It is training data that reflects the real complexity of clinical audio.

A benchmark like MedMosaic exposes those limitations because it refuses to reward surface-level pattern matching.

Building models that can reason reliably about clinical audio isn’t about training larger models. It requires richer clinical training data and evaluation benchmarks stringent enough to separate genuine clinical understanding from benchmark performance. That’s the standard MedMosaic argues for: better data and more rigorous evaluation.

Centific builds the clinician-validated data and evaluation that healthcare models need

If your organization is developing models for healthcare and hitting a ceiling on clinical audio performance, MedMosaic shows you how Centific can help you move past it.

We provide clinician-validated, multimodal healthcare datasets and safety-first evaluation frameworks that take models from promising benchmark results to dependable clinical tools. Whether the application is ambient documentation, EHR-grounded reasoning, virtual triage, or high-acuity care, we connect research-grade rigor directly to the requirements of production healthcare models.

To learn more about Centific’s healthcare capabilities, visit our healthcare page or contact us to request a demonstration.
