Explore our full suite of AI platforms, data marketplaces, and expert services designed to build, train, fine-tune, and deploy reliable, production-grade AI systems at scale.

Paper

Human + AI: Large scale Data Curation For Multilingual Guardrails

Author(s)

Harshit Rajgarhia

Abhishek Mukherji

Fen Yik

Dominika Borek

Nicole Warren

Prithiviraj Pradeep

ABSTRACT

As Large Language Models (LLMs) become increasingly central to real-world applications, the demand for high-quality, instruction-compliant, multilingual training data has surged, particularly in tier-2 languages with limited digital representation. In this work, we introduce an AI-assisted annotation framework designed to optimize the authoring of training data for multilingual guardrails, specifically the detection of personally identifiable information (PII), in Supervised Fine-Tuning (SFT) of LLMs. Targeting 13 locales, most of them underrepresented, we operationalize a suite of AI tools that augment human annotators without replacing them. Our results demonstrate a reduction of more than 40% in average handling time, alongside improved instruction compliance, semantic diversity, and data quality. The key contribution of this work is an exploration of the emerging ‘LLM-as-a-Judge’ paradigm, in which LLMs serve not only as generative tools but also as evaluators of human-authored training data.
