Enterprise-Grade Evaluation for Agentic Clinical AI
Benchmarking real-world medical AI agents, not just models. MedART evaluates how AI systems reason, act, and make decisions inside live EHR environments, exposing the capability gaps that matter before deployment.
1,406 adversarial conversations · 11 models tested · 4 policy categories · 13.73 avg turns per simulation
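Conceptually, each of the 1,406 adversarial conversations pits an attacker against the target agent turn by turn until a judge flags a policy breach or a turn budget runs out. A minimal sketch of that loop, using stub stand-ins (`attacker_turn`, `target_turn`, `judge_flags_breach`, `max_turns`) that are illustrative assumptions, not MedART's actual harness:

```python
from dataclasses import dataclass, field

@dataclass
class Simulation:
    """One adversarial conversation against a target model."""
    transcript: list = field(default_factory=list)
    breached: bool = False

def attacker_turn(history):
    # Stand-in for an adversarial attacker model
    # (e.g. a Crescendo-style escalation policy).
    return f"probe-{len(history) // 2 + 1}"

def target_turn(history):
    # Stand-in for the clinical agent under test: refuses early probes,
    # then slips once pressure accumulates.
    return "refusal" if len(history) < 6 else "unsafe-disclosure"

def judge_flags_breach(reply):
    # Stand-in for an automated policy judge.
    return reply == "unsafe-disclosure"

def run_simulation(max_turns=20):
    sim = Simulation()
    for _ in range(max_turns):
        attack = attacker_turn(sim.transcript)
        sim.transcript.append(("attacker", attack))
        reply = target_turn(sim.transcript)
        sim.transcript.append(("target", reply))
        if judge_flags_breach(reply):
            sim.breached = True
            break
    return sim

sim = run_simulation()
print(sim.breached, len(sim.transcript) // 2)  # → True 4
```

Aggregating `breached` across simulations yields attack success rate (ASR), and the transcript lengths yield the average-turns statistic above.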
Problem Statement & Capability Assessment
[Table: benchmark comparison — columns: Benchmark, Domain, Modality, Capability Tested, Gap Addressed]
Benchmark Data
1,406 adversarial conversations · 4 policy categories · 4 attack methods · 13.73 avg turns per simulation
[Chart: Average Attack Length (Turns) by Method — Reinforce Labs Flint, Crescendo, Opposite Day, Acronym]
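Given per-conversation records, the average attack length by method reduces to a grouped mean. A minimal sketch over hypothetical `(method, turns)` records; the numbers are illustrative, not MedART results:

```python
from collections import defaultdict

# Hypothetical per-conversation records: (attack_method, turns_to_resolution).
records = [
    ("Reinforce Labs Flint", 12), ("Reinforce Labs Flint", 16),
    ("Crescendo", 14), ("Crescendo", 10),
    ("Opposite Day", 13), ("Acronym", 15),
]

# Group turn counts by attack method.
turns_by_method = defaultdict(list)
for method, turns in records:
    turns_by_method[method].append(turns)

# Mean turns per method.
avg_turns = {m: sum(t) / len(t) for m, t in turns_by_method.items()}
print(avg_turns["Crescendo"])  # → 12.0
```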
Models Benchmarked
Spanning foundation LLMs, reasoning models, and small language models from leading AI labs.
Most Robust Model
Claude Sonnet 4.5 leads with only 9.2% average ASR — the only model below 10%. Even so, it was still breached in nearly 1 in 10 adversarial conversations, underscoring that no model is fully attack-proof.
Safety Alignment Gap
Qwen3 Next 80B was the most vulnerable at 59.9% average ASR, with Flint alone breaching it 83.3% of the time. Open-weight models showed significantly weaker safety guardrails compared to proprietary counterparts like Claude and GPT.
Adaptive Attack Advantage
Reinforce Labs Flint achieved the highest average ASR (68.6%) across all models — 14 percentage points ahead of the next best method, Crescendo (54.5%). Tactical diversity and mid-conversation pivoting consistently outperform fixed-playbook attack strategies.
Reasoning Models Still Crack
Grok 4 (reasoning-enabled) scored 45.8% ASR while Grok 4 Non-Reasoning scored higher at 56.5%. Reasoning capabilities alone do not guarantee safety robustness — and in some attack vectors like Opposite Day, Grok 4 was actually more vulnerable (50.0% vs 28.2%).
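The ASR figures above are grouped breach rates: breaches divided by attempts, sliced by model or by attack method. A minimal sketch over hypothetical `(model, method, breached)` records; the data is illustrative, not the benchmark's:

```python
from collections import defaultdict

# Hypothetical results: (model, attack_method, breached).
results = [
    ("claude-sonnet-4.5", "Crescendo", False),
    ("claude-sonnet-4.5", "Flint", True),
    ("qwen3-next-80b", "Flint", True),
    ("qwen3-next-80b", "Crescendo", True),
    ("grok-4", "Opposite Day", True),
    ("grok-4", "Acronym", False),
]

def asr(rows, key_index):
    """Attack success rate grouped by model (key_index=0) or method (key_index=1)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[key_index]] += 1
        hits[row[key_index]] += row[2]  # bool counts as 0/1
    return {k: hits[k] / totals[k] for k in totals}

by_model = asr(results, 0)
by_method = asr(results, 1)
print(by_model["claude-sonnet-4.5"])  # → 0.5
```

The same grouping, run over the real 1,406-conversation corpus, produces the per-model and per-method rankings reported above.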
Diagnostic Observations → Training Data Strategies
Every finding translates into actionable training data specifications — the core of the Diagnose → Prescribe → Deliver flywheel.
Diagnose
What We Found
Even Claude Sonnet 4.5, the most robust model, was breached in 9.2% of adversarial conversations. No model achieved 0% ASR, confirming that LLM safety alignment remains an unsolved frontier.
Prescribe
Training Data Strategy
Develop adversarial fine-tuning datasets with expert-annotated policy violation traces across all risk categories, enabling models to recognize and resist novel attack patterns.
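One way such an expert-annotated violation trace might be structured for fine-tuning; every field name here is an illustrative assumption, not a published MedART schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ViolationTrace:
    """One expert-annotated adversarial conversation for fine-tuning.

    Field names are illustrative, not an actual MedART data format.
    """
    conversation_id: str
    policy_category: str     # one of the benchmark's policy categories
    attack_method: str       # e.g. "Crescendo"
    turns: list              # [{"role": ..., "content": ...}, ...]
    violation_turn: int      # index of the first breaching turn
    annotator_rationale: str # expert note on why the turn violates policy
    preferred_response: str  # safe completion the model should learn instead

trace = ViolationTrace(
    conversation_id="sim-0001",
    policy_category="controlled-substance-guidance",
    attack_method="Crescendo",
    turns=[{"role": "attacker", "content": "..."},
           {"role": "target", "content": "..."}],
    violation_turn=1,
    annotator_rationale="Discloses dosing guidance without clinician review.",
    preferred_response="I can't provide that; please consult your clinician.",
)

# Serialize for a JSONL training corpus.
record = json.dumps(asdict(trace))
```

Pairing the breaching turn with a `preferred_response` makes each trace usable for preference-style tuning as well as supervised refusal training.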