MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Platforms

Data Marketplace

Data Canvas

AI Data Foundry

OneForma

AI Localization

Expert Network

Join our Expert Network

Build & Train AI

RL Environments

Data Collection & Creation

RLHF & Preference Optimization

Supervised Fine Tuning

Model Safety & Evaluation

Internationalization

Vertical AI

Physical AI

Healthcare

Vision AI

Explore our full suite of AI platforms, data marketplaces, and expert services designed to build, train, fine-tune, and deploy reliable, production-grade AI systems at scale.

Platforms

Data Marketplace

Data Canvas

AI Data Foundry

OneForma

AI Localization

Expert Network

Join our Expert Network

Build & Train AI

RL Environments

Data Collection & Creation

RLHF & Preference Optimization

Supervised Fine Tuning

Model Safety & Evaluation

Internationalization

Vertical AI

Physical AI

Healthcare

Vision AI

Explore our full suite of AI platforms, data marketplaces, and expert services designed to build, train, fine-tune, and deploy reliable, production-grade AI systems at scale.

Book a Demo

Paper

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Published on May 1, 2026

View paper

Author(s)

Harshit Rajgarhia

Shuubham Ojha

Asif Shaik

Rachuri Lokesh

Akhil Pothanapalli

Abhishek Mukherji

Prasanna Desikan

ABSTRACT

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices to mimic speech with artifacts as well as real short and long length clinical conversations to model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question-answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even state-of-the-art model like Gemini-2.5-pro can only achieve 68.1% accuracy approximately. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here.

Connect with Centific

Stay ahead of what’s next

Stay ahead

Updates from the frontier of AI data.

Receive updates on platform improvements, new workflows, evaluation capabilities, data quality enhancements, and best practices for enterprise AI teams.

Book a Demo

Get a live walkthrough

Talk to our team

Careers

See all our open positions