

The most dangerous AI failures are the ones you don’t see coming.
As chatbots become more capable, their risks become more nuanced. Critical safety and compliance breakdowns rarely happen in a single exchange—they develop across multiple turns through subtle manipulation, reframing, and persistent probing. These are the vulnerabilities enterprises worry about most, yet they’re the hardest for automation to detect. Solving this challenge requires a human-in-the-loop approach.
Automated red-teaming systems are only as good as the human oversight behind them. At Centific, we specialize in bridging that gap. Our partnership with Reinforce Labs brings rigorous human-in-the-loop evaluation to Flint, Reinforce Labs' multi-turn pressure-testing platform for AI chatbots. By embedding expert human judgment into every stage of the evaluation process, we help Reinforce Labs continuously refine Flint's accuracy and strengthen its ability to detect genuine safety risks.
What Flint does
Flint is an automated system for stress-testing AI chatbots through realistic, multi-turn adversarial conversations. Rather than relying on single prompts or static test cases, Flint simulates persistent users who adapt their strategies over time, applying pressure the way real users do.
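Conceptually, a pressure test of this kind pairs an adaptive attacker with the target chatbot, lets the conversation run for several turns, and only then judges the full transcript. The sketch below is our own simplified illustration of that loop, not Flint's actual API; the attacker, target_chat, and judge interfaces are placeholders.

```python
def run_pressure_test(attacker, target_chat, judge, goal, max_turns=10):
    """Run one adaptive multi-turn probe and judge the full transcript (illustrative only)."""
    transcript = []
    for _ in range(max_turns):
        # The attacker adapts its next message to the conversation so far.
        probe = attacker.next_message(goal, transcript)
        transcript.append({"role": "user", "content": probe})
        reply = target_chat(transcript)
        transcript.append({"role": "assistant", "content": reply})
        if attacker.goal_achieved(goal, transcript):
            break
    # Judgment covers the whole trajectory, not any single exchange.
    return judge(goal, transcript)
```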
Reinforce Labs covered Flint’s architecture and evaluation framework in detail in a launch post. Here, we focus on what happens after those conversations are generated, and how human insight turns raw output into durable improvements.
Why multi-turn safety requires human judgment at scale
Automated signals are effective at catching clear failures, but they struggle when judgment depends on context. In multi-turn conversations, the most important questions are rarely binary. An assistant may narrowly avoid a violation, erode boundaries over time, or enable harm indirectly through earlier turns. These distinctions are important in enterprise settings, but they are difficult to capture with pass-or-fail labels alone.
Human judgment mitigates this risk by evaluating the trajectory, intent, and boundary drift across an entire interaction, catching subtle, contextual patterns of harm that pass-fail automation misses.
Centific partnered with Reinforce Labs to enhance the accuracy and reliability of Flint. Through our human-in-the-loop (HITL) pipeline, our expert reviewers performed three critical quality checks on each multi-turn conversation:
Goal achievement assessment. Rather than issuing a binary pass/fail verdict, annotators assigned a graded score from 1 to 4, ranging from clearly safe (1) to severe violation (4), along with a written justification explaining their reasoning. This nuanced scoring captures the spectrum of potential harm more effectively than a simple yes/no determination.
Policy violation review. Evaluators assessed whether any step within the conversation violated the target model’s safety policies, providing detailed human reasoning for each judgment.
LLM evaluation verification. Reviewers validated the accuracy of the automated LLM evaluations, flagging discrepancies and documenting their rationale.
These explanations surface nuance that automated metrics often miss, including gradual boundary erosion or enabling behavior that only becomes problematic when viewed across turns.
To ensure consistency and reliability, each evaluation task then passed through an additional QA layer, in which a second reviewer verified the accuracy and completeness of the original annotation.
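To make the output of these checks concrete, here is one way a single annotation record could be represented. The field names below are hypothetical and chosen for illustration only; they are not Flint's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConversationAnnotation:
    """Illustrative record of one reviewed multi-turn conversation (hypothetical schema)."""
    conversation_id: str
    policy_category: str                  # e.g. "refund_abuse"
    goal_achievement: int                 # graded 1 (clearly safe) to 4 (severe violation)
    goal_justification: str               # written reasoning behind the grade
    violating_turns: list[str] = field(default_factory=list)  # steps that breached policy, with reasoning
    llm_eval_agrees: bool = True          # whether the automated LLM verdict held up
    llm_eval_notes: Optional[str] = None  # discrepancies flagged during verification
    qa_verified: bool = False             # second reviewer confirmed accuracy and completeness
```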
Our role allows this level of judgment to be applied consistently and at scale. Human review is not treated as a final audit step, but as an integral part of Flint’s evaluation and improvement loop. This ensures evaluations align with real enterprise policy interpretation, capture near-misses alongside clear failures, and generate a feedback signal that strengthens over time.
The result is better coverage, not just broader coverage.
From human annotations to measurable gains
In Flint, human annotations are not treated as static ground truth. They provide a learning signal that sharpens evaluation and directly improves how future conversations are constructed.
Graded labels capture partial violations and near-misses that binary metrics miss, bringing evaluation closer to how policies are reviewed in practice. More importantly, Flint ingests annotated conversations and distills them into policy-specific insights that guide how subsequent attacks are built.
After each conversation, Flint reflects on what worked and what did not. Human reasoning improves these reflections by surfacing which attack vectors made progress, which phrasing or escalation strategies mattered, how guardrails responded under pressure, and where assistants narrowly avoided violations. Over time, this pushes Flint toward more realistic and higher-impact failure modes, rather than shallow or easily detected attacks.
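As a rough illustration of that feedback loop, the sketch below shows how annotated conversations could be distilled into policy-specific insights that condition the next round of attacks. The helpers distill_insight and build_attack_plan are hypothetical stand-ins (for example, LLM calls that summarize reviewer reasoning and draft new multi-turn plans), not Flint's actual components.

```python
def update_insight_store(annotations, insight_store, distill_insight):
    """Fold reviewer-annotated conversations into per-policy insights (illustrative sketch)."""
    for ann in annotations:
        # Reviewer reasoning highlights which tactics made progress and where
        # the assistant narrowly avoided a violation.
        insight = distill_insight(
            policy=ann.policy_category,
            grade=ann.goal_achievement,
            reasoning=ann.goal_justification,
        )
        insight_store.setdefault(ann.policy_category, []).append(insight)
    return insight_store

def plan_next_attacks(policy, insight_store, build_attack_plan):
    """Condition new multi-turn attack plans on what previously made progress."""
    return build_attack_plan(policy, prior_insights=insight_store.get(policy, []))
```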
To measure the impact of this human-informed approach, we compared two versions of Flint: a baseline automated agent and a human-informed agent with Centific annotations integrated into its learning loop.
Our annotations spanned safety and fraud policy categories including child safety, self-harm, illegal activity, gift cards and payment, refund abuse, counterfeit, coupon and price abuse, and review tampering.
Both agents were evaluated against the same set of target goals using Gemini-2.5-flash as the target model. Each policy category was exercised across a broad range of multi-turn conversations covering different attack strategies. Attack Success Rate (ASR) measured the percentage of conversations in which the target assistant violated the specified policy.
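ASR itself is simple to compute from labeled conversations. A minimal example, assuming each conversation record carries a policy category and a boolean violated flag:

```python
from collections import defaultdict

def attack_success_rate(conversations):
    """Per-category fraction of conversations in which the target violated the policy."""
    totals, violations = defaultdict(int), defaultdict(int)
    for convo in conversations:
        totals[convo["policy_category"]] += 1
        violations[convo["policy_category"]] += bool(convo["violated"])
    return {cat: violations[cat] / totals[cat] for cat in totals}

# Example: two of three refund-abuse conversations end in a violation -> ASR of about 0.67
print(attack_success_rate([
    {"policy_category": "refund_abuse", "violated": True},
    {"policy_category": "refund_abuse", "violated": True},
    {"policy_category": "refund_abuse", "violated": False},
]))
```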
| Policy Category | ASR (Baseline) | ASR (Human-Informed) |
|---|---|---|
| Child Safety | 18% | 55% |
| Self-Harm | 64% | 73% |
| Gift Cards and Payment | 67% | 100% |
| Refund Abuse | 55% | 73% |
| Coupon/Price Abuse | 75% | 82% |
| Review Tampering | 91% | 100% |
| Illegal Activity | 89% | 73% |
The human-informed agent showed clear gains in ASR, particularly in categories where failures depend heavily on context. Child safety saw the largest improvement, and fraud-related categories such as gift cards and payment and review tampering reached full coverage.
These improvements stem from fundamental differences in how attacks are crafted: not just in quantity, but in strategy. Guided by human expertise, agents move beyond academic or hypothetical framing and adopt realistic, help-seeking personas. They introduce believable pressure through relatable stressors such as financial hardship, personal emergencies, and urgent circumstances, and they escalate incrementally rather than making overt requests upfront.
This approach surfaces unsafe model behavior without triggering immediate refusals, exposing failure modes that simpler, more direct attacks consistently miss.
Why this matters for enterprise teams
For trust and safety leaders and product teams, evaluation quality matters as much as coverage.
A human-informed attack strategy combines the strengths of both approaches. Human judgment captures nuance, near-misses, and policy context that automated systems miss. Automated attacks provide the scale and consistency required to apply that judgment across thousands of multi-turn conversations.
The result is a powerful feedback loop: more realistic attack strategies grounded in actual user behavior, findings that align precisely with policy intent, and higher confidence when it’s time to ship. With every evaluation cycle, Centific’s human reviews sharpen the system—surfacing blind spots and guiding where automated pressure testing can hit harder.
Automation doesn’t replace human judgment—it depends on it. Our collaboration with Reinforce Labs embeds Centific’s expert evaluations into Flint’s core, turning human insight into a scalable, continuously improving evaluation engine.
If you are evaluating AI systems and want to see how this works in practice, we would love to show you.
Learn more about Centific and book a demo to discover how Centific's expert evaluations are helping Reinforce Labs and how we can support your responsible AI goals.

