ART: Action-based Reasoning Task Benchmarking for Medical AI Agents

Published on Jan 13, 2026

Author(s)

Ananya Mantravadi

Shivali Dalmia

Abhishek Mukherji

ABSTRACT

Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline—scenario identification, task generation, quality audit, and evaluation—produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT-4o-mini and Claude 3.5 Sonnet on 600 tasks shows near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28-64%) and threshold reasoning (32-38%). By exposing failure modes in action-oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden, supporting workforce capacity in high-demand care settings.
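To make the three task categories named in the abstract concrete, here is a minimal Python sketch of a retrieval step, a temporal aggregation step, and a threshold-based conditional step over a toy record. It is not the ART pipeline or its task format. The data, the function names (`retrieve`, `mean_in_window`, `decide`), the action labels, and the threshold values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Toy, made-up observations: (timestamp, lab code, value). Not real patient data.
OBSERVATIONS = [
    (datetime(2025, 1, 1, 8, 0), "potassium", 3.9),
    (datetime(2025, 1, 1, 20, 0), "potassium", 3.4),
    (datetime(2025, 1, 2, 8, 0), "potassium", 3.2),
    (datetime(2025, 1, 2, 8, 0), "glucose", 110.0),
]


def retrieve(observations, code):
    """Retrieval: all (timestamp, value) readings for one lab code, oldest first."""
    return sorted((t, v) for t, c, v in observations if c == code)


def mean_in_window(readings, end, hours):
    """Temporal aggregation: mean of readings in the `hours` up to `end`, inclusive."""
    start = end - timedelta(hours=hours)
    values = [v for t, v in readings if start <= t <= end]
    return mean(values) if values else None


def decide(observations, code, now, hours, threshold):
    """Conditional logic: act only if the windowed mean falls below the threshold.

    The action labels and the threshold are hypothetical, not clinical guidance.
    """
    avg = mean_in_window(retrieve(observations, code), now, hours)
    if avg is None:
        return "no_data"
    return "order_followup" if avg < threshold else "no_action"


if __name__ == "__main__":
    now = datetime(2025, 1, 2, 12, 0)
    print(decide(OBSERVATIONS, "potassium", now, hours=24, threshold=3.5))
```

Each function isolates one of the error categories the abstract lists, so a wrong final action can be traced to a missed reading, a wrong window, or a wrong comparison.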
