Enterprise-Grade Evaluation for Agentic Clinical AI
Benchmarking real-world medical AI agents, not just models. MedART evaluates how AI systems reason, act, and make decisions inside live EHR environments, exposing the capability gaps that matter before deployment.
1,406 adversarial conversations · 11 models tested · 4 policy categories · 13.73 avg turns per simulation
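Conceptually, each of the 1,406 adversarial conversations pits an attacker against the target agent turn by turn until a judge flags a policy breach or a turn budget runs out. A minimal sketch of that loop, using stub stand-ins (`attacker_turn`, `target_turn`, `judge_flags_breach`, `max_turns`) that are illustrative assumptions, not MedART's actual harness:

```python
from dataclasses import dataclass, field

@dataclass
class Simulation:
    """One adversarial conversation against a target model."""
    transcript: list = field(default_factory=list)
    breached: bool = False

def attacker_turn(history):
    # Stand-in for an adversarial attacker model
    # (e.g. a Crescendo-style escalation policy).
    return f"probe-{len(history) // 2 + 1}"

def target_turn(history):
    # Stand-in for the clinical agent under test: refuses early probes,
    # then slips once pressure accumulates.
    return "refusal" if len(history) < 6 else "unsafe-disclosure"

def judge_flags_breach(reply):
    # Stand-in for an automated policy judge.
    return reply == "unsafe-disclosure"

def run_simulation(max_turns=20):
    sim = Simulation()
    for _ in range(max_turns):
        attack = attacker_turn(sim.transcript)
        sim.transcript.append(("attacker", attack))
        reply = target_turn(sim.transcript)
        sim.transcript.append(("target", reply))
        if judge_flags_breach(reply):
            sim.breached = True
            break
    return sim

sim = run_simulation()
print(sim.breached, len(sim.transcript) // 2)  # → True 4
```

Aggregating `breached` across simulations yields attack success rate (ASR), and the transcript lengths yield the average-turns statistic above.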
Problem Statement & Capability Assessment
[Table: benchmark comparison — columns: Benchmark, Domain, Modality, Capability Tested, Gap Addressed]
Benchmark Data
1,406 adversarial conversations · 4 policy categories · 4 attack methods · 13.73 avg turns per simulation
[Chart: Average Attack Length (Turns) by Method — Reinforce Labs Flint, Crescendo, Opposite Day, Acronym]
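Given per-conversation records, the average attack length by method reduces to a grouped mean. A minimal sketch over hypothetical `(method, turns)` records; the numbers are illustrative, not MedART results:

```python
from collections import defaultdict

# Hypothetical per-conversation records: (attack_method, turns_to_resolution).
records = [
    ("Reinforce Labs Flint", 12), ("Reinforce Labs Flint", 16),
    ("Crescendo", 14), ("Crescendo", 10),
    ("Opposite Day", 13), ("Acronym", 15),
]

# Group turn counts by attack method.
turns_by_method = defaultdict(list)
for method, turns in records:
    turns_by_method[method].append(turns)

# Mean turns per method.
avg_turns = {m: sum(t) / len(t) for m, t in turns_by_method.items()}
print(avg_turns["Crescendo"])  # → 12.0
```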
Models Benchmarked
Spanning foundation LLMs, reasoning models, and small language models from leading AI labs.
Most Robust Model
Claude Sonnet 4.5 leads with only 9.2% average ASR — the only model below 10%. Even so, it was still breached in nearly 1 in 10 adversarial conversations, underscoring that no model is fully attack-proof.
Safety Alignment Gap
Qwen3 Next 80B was the most vulnerable at 59.9% average ASR, with Flint alone breaching it 83.3% of the time. Open-weight models showed significantly weaker safety guardrails compared to proprietary counterparts like Claude and GPT.
Adaptive Attack Advantage
Reinforce Labs Flint achieved the highest average ASR (68.6%) across all models — 14 percentage points ahead of the next best method, Crescendo (54.5%). Tactical diversity and mid-conversation pivoting consistently outperform fixed-playbook attack strategies.
Reasoning Models Still Crack
Grok 4 (reasoning-enabled) scored 45.8% ASR while Grok 4 Non-Reasoning scored higher at 56.5%. Reasoning capabilities alone do not guarantee safety robustness — and in some attack vectors like Opposite Day, Grok 4 was actually more vulnerable (50.0% vs 28.2%).
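The ASR figures above are grouped breach rates: breaches divided by attempts, sliced by model or by attack method. A minimal sketch over hypothetical `(model, method, breached)` records; the data is illustrative, not the benchmark's:

```python
from collections import defaultdict

# Hypothetical results: (model, attack_method, breached).
results = [
    ("claude-sonnet-4.5", "Crescendo", False),
    ("claude-sonnet-4.5", "Flint", True),
    ("qwen3-next-80b", "Flint", True),
    ("qwen3-next-80b", "Crescendo", True),
    ("grok-4", "Opposite Day", True),
    ("grok-4", "Acronym", False),
]

def asr(rows, key_index):
    """Attack success rate grouped by model (key_index=0) or method (key_index=1)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[key_index]] += 1
        hits[row[key_index]] += row[2]  # bool counts as 0/1
    return {k: hits[k] / totals[k] for k in totals}

by_model = asr(results, 0)
by_method = asr(results, 1)
print(by_model["claude-sonnet-4.5"])  # → 0.5
```

The same grouping, run over the real 1,406-conversation corpus, produces the per-model and per-method rankings reported above.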
Diagnostic Observations → Training Data Strategies
Every finding translates into actionable training data specifications — the core of the Diagnose → Prescribe → Deliver flywheel.
Diagnose
What We Found
Even Claude Sonnet 4.5, the most robust model, was breached in 9.2% of adversarial conversations. No model achieved 0% ASR, confirming that LLM safety alignment remains an unsolved frontier.
Prescribe
Training Data Strategy
Develop adversarial fine-tuning datasets with expert-annotated policy violation traces across all risk categories, enabling models to recognize and resist novel attack patterns.
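One way such an expert-annotated violation trace might be structured for fine-tuning; every field name here is an illustrative assumption, not a published MedART schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ViolationTrace:
    """One expert-annotated adversarial conversation for fine-tuning.

    Field names are illustrative, not an actual MedART data format.
    """
    conversation_id: str
    policy_category: str     # one of the benchmark's policy categories
    attack_method: str       # e.g. "Crescendo"
    turns: list              # [{"role": ..., "content": ...}, ...]
    violation_turn: int      # index of the first breaching turn
    annotator_rationale: str # expert note on why the turn violates policy
    preferred_response: str  # safe completion the model should learn instead

trace = ViolationTrace(
    conversation_id="sim-0001",
    policy_category="controlled-substance-guidance",
    attack_method="Crescendo",
    turns=[{"role": "attacker", "content": "..."},
           {"role": "target", "content": "..."}],
    violation_turn=1,
    annotator_rationale="Discloses dosing guidance without clinician review.",
    preferred_response="I can't provide that; please consult your clinician.",
)

# Serialize for a JSONL training corpus.
record = json.dumps(asdict(trace))
```

Pairing the breaching turn with a `preferred_response` makes each trace usable for preference-style tuning as well as supervised refusal training.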