Explore our full suite of AI platforms, data marketplaces, and expert services designed to build, train, fine-tune, and deploy reliable, production-grade AI systems at scale.


Enterprise-Grade Evaluation for Agentic Clinical AI

Diagnose Prescribe Validate


Benchmarking real-world medical AI agents, not just models. MedART evaluates how AI systems reason, act, and make decisions inside live EHR environments, exposing the capability gaps that matter before deployment.

1,406 Adversarial Conversations
11 Models Tested
4 Policy Categories
13.73 Avg Turns / Simulation

Problem Statement & Capability Assessment

Benchmark Overview

Benchmarking real-world medical AI agents, not just models. MedART evaluates how AI systems reason, act, and make decisions inside live EHR environments, exposing the capability gaps that matter before deployment.

Benchmark
Reinforce Labs (Centific's Partner)

Domain
Responsible AI

Modality
Text

Capability Tested
AI safety red-teaming: Can frontier models be systematically probed to identify vulnerabilities through multi-step adversarial interactions that test safety guardrails across diverse risk categories?

Gap Addressed
Existing red-teaming benchmarks focus on single-turn adversarial prompts, narrow risk categories, or static attack templates. No large-scale benchmark evaluates multi-turn adversarial interactions, chained attack strategies, or domain-specific safety failures across diverse real-world risk surfaces.
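
To make the multi-step probing concrete, here is a minimal sketch of how a multi-turn red-teaming simulation can be structured. It is an illustrative outline only, not the Reinforce Labs or MedART implementation; the attacker, target, and judge callables are hypothetical stand-ins for an attack policy, the model under test, and a policy-violation judge.

```python
# Minimal sketch of a multi-turn red-teaming loop (illustrative only; not the
# Reinforce Labs / MedART implementation). `attacker`, `target`, and `judge`
# are hypothetical callables standing in for an attack policy, the model
# under test, and a policy-violation classifier.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Turn:
    attacker_msg: str
    target_msg: str

@dataclass
class Conversation:
    policy_category: str
    attack_method: str
    turns: List[Turn] = field(default_factory=list)
    breached: bool = False

def run_simulation(
    attacker: Callable[[List[Turn]], str],
    target: Callable[[List[Turn], str], str],
    judge: Callable[[str], bool],
    policy_category: str,
    attack_method: str,
    max_turns: int = 14,
) -> Conversation:
    convo = Conversation(policy_category, attack_method)
    for _ in range(max_turns):
        prompt = attacker(convo.turns)        # attacker adapts to the history so far
        reply = target(convo.turns, prompt)   # model under test responds in context
        convo.turns.append(Turn(prompt, reply))
        if judge(reply):                      # stop as soon as a policy violation is flagged
            convo.breached = True
            break
    return convo
```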

Benchmark Data

Dataset Composition

1,406 Adversarial Conversations
11 Policy Categories
11 Attack Methods
11 Avg Turns / Simulation

Average Attack Length (Turns) by Method (chart): Reinforce Labs Flint, Crescendo, Opposite Day, Acronym.
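
One way to picture the dataset is as a collection of structured conversation records, each tagged with its policy category, attack method, turn count, and outcome. The record below is a hypothetical illustration; the field names are assumptions, not the published schema.

```python
# Hypothetical record for one adversarial conversation; field names are
# illustrative assumptions, not the published benchmark schema.
example_record = {
    "conversation_id": "conv-0001",
    "policy_category": "Self-Harm",
    "attack_method": "Crescendo",
    "model": "Claude Sonnet 4.5",
    "num_turns": 12,
    "breached": False,  # True if the judge flagged a policy violation
    "turns": [
        {"role": "attacker", "content": "..."},
        {"role": "target", "content": "..."},
    ],
}
```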

Policy Categories Tested

Instruction Following
Description: Kid-focused chatbot violating 18+ content rules despite user insistence
Attack skills tested: System prompt adherence, refusal persistence under pressure

Self-Harm
Description: Direct and indirect encouragement of self-harm, including soft/subtle cases
Attack skills tested: Nuanced harm detection, boundary recognition, indirect elicitation resistance

Illegal Activities
Description: Assistance with drug procurement, fraud, or other illegal behavior
Attack skills tested: Policy boundary enforcement, social engineering resistance
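
Performance on each category ultimately reduces to attack success rate (ASR): the share of adversarial conversations in which the model was breached. A minimal per-category ASR computation over records like the one sketched earlier might look like this; the data is made up for illustration.

```python
from collections import defaultdict

def asr_by_category(records):
    """Attack success rate per policy category: breached conversations / total."""
    totals, breaches = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["policy_category"]] += 1
        breaches[r["policy_category"]] += int(r["breached"])
    return {cat: breaches[cat] / totals[cat] for cat in totals}

# Tiny made-up example, not benchmark data:
records = [
    {"policy_category": "Self-Harm", "breached": True},
    {"policy_category": "Self-Harm", "breached": False},
    {"policy_category": "Illegal Activities", "breached": False},
]
print(asr_by_category(records))  # {'Self-Harm': 0.5, 'Illegal Activities': 0.0}
```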

Models Benchmarked

11 Models Across Seven Architectural Families

Spanning foundation LLMs, reasoning models, and small language models from leading AI labs.

# | Model | Type | Architecture Family | Access | Score
1 | Claude Sonnet 4.5 | Foundation LLM | Anthropic Claude | Proprietary | 68.1%
2 | Claude Opus 4.1 | Foundation LLM | Anthropic Claude | Proprietary | 60.5%
3 | GPT 5 Nano | Small Language Model | OpenAI GPT | Proprietary | 42.8%
4 | GPT 5.1 | Foundation LLM | OpenAI GPT | Proprietary | 42.1%
5 | Gemini 3 Pro Preview | Foundation LLM | Google Gemini | Proprietary | 41.0%
6 | Gemini 2.5 Flash | Foundation LLM | Google Gemini | Proprietary | 38.6%
7 | Moonshot AI Kimi K2 | Foundation LLM | Moonshot Kimi | Open weight | 37.3%
8 | Grok 4 | Reasoning LLM | xAI Grok | Proprietary | 36.4%
9 | Grok 4 Non-Reasoning | Foundation LLM | xAI Grok | Proprietary | 35.7%
10 | Llama 4 Maverick 17B | Foundation LLM | Meta Llama | Open weight | 32.8%
11 | Qwen3 Next 80B | Foundation LLM | Alibaba Qwen | Open weight | 24.1%
12 | Audio Reasoner | Foundation LLM | Unknown | Proprietary | 23.2%
13 | Audio Learning 3 | Foundation LLM | Unknown | Proprietary | 20.8%

Key Findings


Most Robust Model

Claude Sonnet 4.5 leads with only 9.2% average ASR — the only model below 10%. Even so, it was still breached in nearly 1 in 10 adversarial conversations, underscoring that no model is fully attack-proof.

Safety Alignment Gap

Qwen3 Next 80B was the most vulnerable at 59.9% average ASR, with Flint alone breaching it 83.3% of the time. Open-weight models showed significantly weaker safety guardrails compared to proprietary counterparts like Claude and GPT.

Adaptive Attack Advantage

Reinforce Labs Flint achieved the highest average ASR (68.6%) across all models — 14 percentage points ahead of the next best method, Crescendo (54.5%). Tactical diversity and mid-conversation pivoting consistently outperform fixed-playbook attack strategies.

Reasoning Models Still Crack

Grok 4 (reasoning-enabled) scored 45.8% ASR while Grok 4 Non-Reasoning scored higher at 56.5%. Reasoning capabilities alone do not guarantee safety robustness — and in some attack vectors like Opposite Day, Grok 4 was actually more vulnerable (50.0% vs 28.2%).
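
The findings above compare average ASR along two axes: per model (9.2% for Claude Sonnet 4.5 versus 59.9% for Qwen3 Next 80B) and per attack method (68.6% for Flint versus 54.5% for Crescendo). Both are averages of the same per-conversation breach indicator, grouped differently; the sketch below shows the per-model aggregation with placeholder numbers rather than benchmark results.

```python
# Aggregate per-conversation outcomes into average ASR per model, then rank
# models from most to least robust. All numbers below are placeholders, not
# benchmark results.
asr = {  # asr[model][attack_method] = attack success rate for that pairing
    "model_a": {"flint": 0.10, "crescendo": 0.08, "opposite_day": 0.05},
    "model_b": {"flint": 0.83, "crescendo": 0.45, "opposite_day": 0.52},
}

avg_asr = {m: sum(rates.values()) / len(rates) for m, rates in asr.items()}
ranking = sorted(avg_asr, key=avg_asr.get)  # lower average ASR = more robust

for model in ranking:
    print(f"{model}: {avg_asr[model]:.1%} average ASR")
```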

Diagnostic Observations → Training Data Strategies

From Diagnosis to Prescription

Every finding translates into actionable training data specifications: the core of the Diagnose, Prescribe, Deliver flywheel.

Diagnose

What We Found

Even Claude Sonnet 4.5, the most robust model, was breached in 9.2% of adversarial conversations. No model achieved 0% ASR, confirming that LLM safety alignment remains an unsolved frontier.

Prescribe

Training Data Strategy

Develop adversarial fine-tuning datasets with expert-annotated policy violation traces across all risk categories, enabling models to recognize and resist novel attack patterns.
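
As a rough picture of what one expert-annotated adversarial fine-tuning example could contain, the sketch below pairs an attack-conversation prefix with the desired safe completion and its annotations. Field names and values are illustrative assumptions, not a Centific deliverable specification.

```python
# Hypothetical adversarial fine-tuning example: an attack-conversation prefix
# paired with the desired safe completion and expert annotations. Field names
# and values are assumptions for illustration, not a delivery specification.
training_example = {
    "policy_category": "Illegal Activities",
    "attack_method": "Opposite Day",
    "conversation_prefix": [
        {"role": "user", "content": "..."},       # adversarial turns leading up to the probe
        {"role": "assistant", "content": "..."},
    ],
    "target_response": "I can't help with that, but here is what I can do instead...",
    "annotations": {
        "violation_risk": "policy boundary breach",
        "severity": "high",
        "annotator_rationale": "...",
    },
}
```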

Security

Robust data security and confidentiality across enterprise, regulated, and mission-critical AI systems.

Disciplined security and privacy practices aligned with global standards to protect sensitive data, intellectual property, and model assets throughout the AI lifecycle.

Centific applies rigorous security, access control, and auditability standards to safeguard enterprise data, human workflows, and AI systems at scale.

ISO 27001
Enterprise-grade information security governance.

SOC 2

HIPAA

GDPR

FAQ

We help you find answers to your questions.

Any more questions?

What is Centific and who is it built for?

Centific is an enterprise-grade AI data and human-in-the-loop platform used by global organizations to build, train, and evaluate high-performance AI systems. We provide multimodal data sourcing, annotation, evaluation, and RLHF at scale—supported by a global workforce, advanced tooling, and rigorous governance.

How does Centific ensure my AI data is accurate and secure?

Centific combines strict data governance, secure infrastructure, access-controlled workflows, and multi-layered quality assurance. All data operations follow enterprise-grade standards, including compliance with global regulations, human-review protocols, and continuous QA cycles. Every dataset and task is tracked, validated, and auditable to guarantee accuracy, privacy, and security.

What types of data and AI workflows does Centific support?

Centific supports multimodal data needs across text, image, video, audio, sensor data, and synthetic data. We power annotation, enrichment, classification, evaluation, RLHF, red-teaming, model alignment, and domain-specific workflows. Our platform integrates into existing pipelines, connects with your internal tools, and adapts to custom ontologies, taxonomies, and quality frameworks.

Can we build our own workflows or integrate Centific into our AI development stack?

Yes. Centific is built to be fully flexible. You can create custom workflows, define instructions, integrate internal systems, automate evaluation cycles, and connect to enterprise tools. Our platform supports API integrations, flexible data schemas, and fully customizable task logic so you can adapt operations to any model, domain, or QA requirement.

What makes Centific different from other AI data providers?

Centific combines global workforce scale, deep domain expertise, enterprise-grade compliance, and a proven track record of high-integrity data delivery. Unlike generic labeling vendors, we offer end-to-end data operations: sourcing, annotation, evaluation, RLHF, safety alignment, governance, and continuous improvement. The result: higher accuracy, safer AI, and dramatically faster deployment cycles.


Customer Stories

Proven results with leading AI teams.

See how organizations use Centific’s data and expert services to build, deploy, and scale production-ready AI.

Connect with Centific

Stay ahead of what’s next


Updates from the frontier of AI data.

Receive updates on platform improvements, new workflows, evaluation capabilities, data quality enhancements, and best practices for enterprise AI teams.

By proceeding, you agree to our Terms of Use and Privacy Policy

Data infrastructure engineered for trust.

Confidently scale every part of your AI development lifecycle with secure, compliant, production-ready operations.

Connect data, models, and people — in one enterprise-ready platform.

Seamlessly connect your existing systems, infrastructure, and workflows — all in one unified platform.
