


Leela Krishna

Kriti Banka

Mangesh Damre

Bishwajit Pal
Modern robotics learning, whether imitation learning, reinforcement learning, or vision-language-action models, demands more than raw sensor streams. To train a robot that can reliably grasp a cup, a model needs to understand what is happening in each frame, when key events occur, and how the task succeeded or failed.
A single robot demonstration episode may contain:
Multiple camera feeds capturing the scene from different angles
Joint state and action telemetry, often with 26 or more degrees of freedom, recorded at every timestep
Tactile sensor data from fingertip sensors detecting contact and slip
Language instructions guiding the robot’s behavior
Without structured labels, such as bounding boxes around objects, temporal annotations marking grasp events, and quality assessments of each episode, this rich multimodal data remains difficult to use. Training on unlabeled data leads to models that mimic motion trajectories without capturing the semantics of manipulation.
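To make the shape of such an episode concrete, here is a minimal sketch of an in-memory container for the modalities listed above. Every class and field name is hypothetical and chosen only for illustration; it is not the schema used by Hugging Face datasets or by Data Canvas.

```python
from dataclasses import dataclass, field


@dataclass
class Timestep:
    """One synchronized sample of an episode (hypothetical layout)."""
    t: float                  # seconds since episode start
    joint_state: list[float]  # one value per degree of freedom
    action: list[float]       # commanded values, same width as joint_state
    tactile: list[float] = field(default_factory=list)  # fingertip readings


@dataclass
class Episode:
    instruction: str               # language instruction for the task
    camera_videos: dict[str, str]  # camera name -> path to an MP4 file
    steps: list[Timestep] = field(default_factory=list)

    def dof(self) -> int:
        """Degrees of freedom, read from the first timestep."""
        return len(self.steps[0].joint_state) if self.steps else 0

    def is_consistent(self) -> bool:
        """True when state/action widths match on every step and
        timestamps strictly increase."""
        n = self.dof()
        widths_ok = all(
            len(s.joint_state) == n and len(s.action) == n for s in self.steps
        )
        times = [s.t for s in self.steps]
        ordered = all(a < b for a, b in zip(times, times[1:]))
        return widths_ok and ordered
```

A consistency check like this is a cheap first gate before any labeling starts: an episode whose telemetry widths or timestamps are off cannot be annotated reliably.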
Centific and Hugging Face: the annotation layer for open robotics data
Hugging Face hosts thousands of open robotics datasets, collected from diverse platforms including bipedal humanoids, dexterous arms, mobile manipulators, and more. These datasets arrive in a standardized packaging format: Parquet files for structured sensor data and MP4 video files for camera feeds. Centific has built a production pipeline that ingests datasets in this format directly from Hugging Face, converts them into the .rrd visualization format used by Data Canvas, and puts them in front of expert annotation teams. The result: open robotics datasets on Hugging Face can become fully annotated, training-ready data at enterprise scale.
Step 1: ingest from Hugging Face
Robotics datasets on Hugging Face are packaged as Parquet files for structured sensor data, including joint states, actions, and timestamps, and MP4 video files for camera feeds. The Data Canvas pipeline connects directly to the Hugging Face Hub, downloads episode data for compatible datasets, and decodes video frames using tools such as FFmpeg, with support for modern codecs including AV1. This works across robot platforms and dataset sizes, from small research collections to corpora with hundreds of thousands of episodes.
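The ingest step can be sketched roughly as follows. This is an illustration under stated assumptions, not Centific's pipeline code: the helper names are hypothetical, and the only third-party call is `list_repo_files` from the `huggingface_hub` package, imported lazily so the rest of the sketch runs without it. The FFmpeg command is built but not executed, and whether AV1 decodes depends on how the local FFmpeg was built.

```python
from pathlib import PurePosixPath


def split_repo_files(paths):
    """Separate a dataset repo listing into Parquet tables and MP4 videos."""
    parquet, videos = [], []
    for p in sorted(paths):
        suffix = PurePosixPath(p).suffix.lower()
        if suffix == ".parquet":
            parquet.append(p)
        elif suffix == ".mp4":
            videos.append(p)
    return parquet, videos


def ffmpeg_extract_command(video_path, out_dir):
    """Build (but do not run) an FFmpeg command that writes decoded frames as JPEGs."""
    return [
        "ffmpeg",
        "-i", video_path,       # input video file
        "-q:v", "2",            # JPEG quality scale; lower means higher quality
        "-start_number", "0",   # first output file is frame_000000.jpg
        f"{out_dir}/frame_%06d.jpg",
    ]


def list_dataset_files(repo_id):
    """Fetch the file listing from the Hub (requires the huggingface_hub package)."""
    from huggingface_hub import list_repo_files
    return list_repo_files(repo_id, repo_type="dataset")
```

In use, the listing returned by `list_dataset_files` would be passed to `split_repo_files`, and each video path to `ffmpeg_extract_command` once the file has been downloaded; the resulting list can be handed to `subprocess.run`.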
Step 2: convert to Rerun (.rrd) format
Raw Parquet tables and video files are not designed for human review. Each episode is converted into the Rerun .rrd format, an interactive visualization format built for multimodal robotics data. A single .rrd file packages the following streams on a shared timeline:
Camera imagery: JPEG-encoded frames from every angle, synchronized to a shared timeline
Time-series telemetry: joint states and action commands plotted as interactive charts
Tactile sensor feeds: fingertip contact images aligned frame-by-frame
Language embeddings: task instruction tokens and attention masks
This unified view makes it possible to inspect robot behavior across modalities, align sensor signals with actions, and annotate events with temporal precision.
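The idea of a shared timeline can be illustrated with a small alignment routine. The sketch below is not the Rerun API and not the conversion code itself; it only shows, with hypothetical function names, how samples from a stream recorded at a different rate can be matched to camera frame times by nearest timestamp.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the sample in sorted `timestamps` closest to time `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_frames(frame_times, stream_times, stream_values):
    """For each camera frame time, pick the closest sample from another stream."""
    return [stream_values[nearest_index(stream_times, t)] for t in frame_times]
```

Nearest-timestamp matching is the simplest choice; interpolation may suit smooth joint telemetry better, while discrete signals such as contact flags are usually safer left uninterpolated.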
Step 3: annotate with Data Canvas
This is where the value of the pipeline becomes clear. The .rrd file is uploaded into Data Canvas, where annotation teams, including domain experts trained in robotics workflows, label the data with precision at the frame and event level. Annotators can:
Draw bounding boxes on objects, robot arms, grippers, and fingers across camera frames
Tag object states such as whether the object is in grasp, being manipulated, or falling
Label manipulation steps including approach, gripper alignment, pinch grasp, lift, and handover
Mark sensor events such as tactile contact, slip detection, force spikes, and joint limits
Classify episode outcomes as success, partial success, or failure
Annotate time-series segments to identify grasp tighten events, contact moments, and stable motion phases
These annotations convert raw episodes into training-ready data that captures both motion and manipulation intent.
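As a rough picture of what such labels might look like as data, the sketch below defines a bounding-box label, a temporal segment label, and the three episode outcomes named above, plus a basic validity check. The schema is hypothetical and is not the Data Canvas export format.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


@dataclass(frozen=True)
class BoxLabel:
    frame: int
    camera: str
    label: str   # e.g. "gripper", "cup"
    x: float     # pixels, top-left corner
    y: float
    w: float
    h: float


@dataclass(frozen=True)
class SegmentLabel:
    start_frame: int  # inclusive
    end_frame: int    # inclusive
    label: str        # e.g. "approach", "pinch_grasp", "lift"


def validate(boxes, segments, num_frames):
    """Return a list of problems; an empty list means the labels pass."""
    problems = []
    for b in boxes:
        if not 0 <= b.frame < num_frames:
            problems.append(f"box '{b.label}' is on frame {b.frame}, outside the episode")
        if b.w <= 0 or b.h <= 0:
            problems.append(f"box '{b.label}' has non-positive size")
    for s in segments:
        if not 0 <= s.start_frame <= s.end_frame < num_frames:
            problems.append(f"segment '{s.label}' has an invalid frame range")
    return problems
```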

Figure 1: The Data Canvas labeling view for a LeRobot episode, showing front and wrist camera feeds alongside State (26 joints) and Action (26 joints) telemetry charts synchronized to a shared timeline
Step 4: export training-ready datasets
Once annotated, the labeled data is available for downstream consumption in structured, versioned, and quality-controlled form. Researchers and ML engineers can pull annotated episodes to train object detection models, grasp quality classifiers, manipulation policy models, and failure analysis systems.
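The article does not specify an export format, so the following is only a sketch of how a consumer might select and serialize annotated episodes. It assumes, hypothetically, that each episode is a dictionary with `id`, `outcome`, and `segments` keys, and it uses JSON Lines purely as a convenient example.

```python
import json


def select(episodes, outcome=None, required_label=None):
    """Filter episodes by outcome and/or by presence of a segment label."""
    picked = []
    for ep in episodes:
        if outcome is not None and ep["outcome"] != outcome:
            continue
        labels = {seg["label"] for seg in ep["segments"]}
        if required_label is not None and required_label not in labels:
            continue
        picked.append(ep)
    return picked


def to_jsonl(episodes):
    """Serialize annotated episodes to JSON Lines, one episode per line."""
    return "\n".join(json.dumps(ep, sort_keys=True) for ep in episodes)
```

A grasp quality classifier, for example, might train only on episodes returned by `select(episodes, required_label="pinch_grasp")`.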

Figure 2: End-to-end pipeline from Hugging Face Hub to training-ready annotated data via Data Canvas
Already at scale: 29 complex dexterous datasets
Centific has already processed and annotated 29 complex dexterous manipulation datasets available on Hugging Face, spanning up to 36 degrees of freedom, multiple hand configurations, whole-body control, and long-horizon household tasks. These datasets include deformable object handling, bimanual coordination, multi-finger in-hand manipulation, and two-robot collaboration.
Dex3 multi-finger hand (28-DOF): 3 datasets
The Dex3 hand operates at 28 degrees of freedom, among the highest-DOF manipulation configurations in any open dataset. Tasks covered include precision block stacking, controlled liquid pouring, multi-step food preparation such as toasted bread, camera packaging assembly, geometric object grasping, and precision placement. These tasks demand millimeter-level spatial annotation and frame-accurate event labeling across multiple synchronized cameras.
BrainCo hand (26-DOF): 2 datasets
BrainCo datasets push annotation difficulty further with in-hand object reorientation (Rubik’s Cube, 26-DOF) and deformable or fragile object grasping (Oreo biscuit). Multi-face in-hand manipulation requires tracking fine finger contact states across every frame, a task that exposes the limits of automated labeling and demands expert human annotation.
Dex1 complex tasks (16-DOF): 13 datasets
The largest group covers the full range of real-world manipulation challenges at 16 DOF: deformable bimanual tasks (towel folding, clothes packing), tool use (eraser, cloth wiping), food preparation, multi-object sorting, precision insertion, camera assembly, and, most significantly, two-robot coordination for table cleaning. The dual-robot dataset introduces annotation complexity that single-arm datasets simply do not have: simultaneous action labeling across two independent manipulators sharing a workspace.
Whole-body teleoperation (36-DOF total): 5 datasets
The WBT (Whole-Body Teleoperation) datasets represent the frontier of humanoid robot learning: 36 total degrees of freedom, long-horizon household tasks, and deformable object handling at scale. Loading a dishwasher, making a bed, operating a washing machine, or collecting clothes are tasks that require understanding multi-step intent, object deformation, and interaction with real home appliances. Annotating this data correctly requires domain expertise that goes far beyond standard bounding box labeling.
Z1 bimanual arm (14-DOF): 3 datasets
The Z1 dual-arm datasets cover bimanual pouring, cloth folding, and box stacking, tasks that require tight coordination between two arms and precise temporal alignment of actions across both manipulators. Bimanual annotation is inherently more complex than single-arm annotation: every label must account for the relationship between both arms, not just individual motions.
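One way to reason about the relationship between two arms is to look at where their labeled segments overlap in time. The sketch below is illustrative only; it assumes each arm's labels are given as hypothetical `(start_frame, end_frame, label)` tuples with inclusive frame ranges.

```python
def overlap(a, b):
    """Overlap of two inclusive frame ranges (start, end), or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def coordinated_intervals(left_segments, right_segments):
    """Pair every left-arm segment with each right-arm segment it overlaps.

    Returns a list of ((start, end), left_label, right_label) entries.
    """
    pairs = []
    for l_start, l_end, l_label in left_segments:
        for r_start, r_end, r_label in right_segments:
            span = overlap((l_start, l_end), (r_start, r_end))
            if span is not None:
                pairs.append((span, l_label, r_label))
    return pairs
```

Each returned pair marks a stretch of frames where both arms are doing something labeled at once, which is where cross-arm consistency checks are most useful.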
Taken together, these 29 datasets, with up to 36 DOF, span precision manipulation, deformable objects, tool use, household tasks, and multi-robot coordination.
Why annotation matters more than you think
Annotation introduces the structure required to interpret multimodal robot data. It connects perception, action, and outcome in a way that supports training and evaluation.
Raw demonstrations are not ground truth
A robot arm moving from point A to point B tells you what happened. Annotations explain why it mattered. Was that motion a reach or an approach? Did the grasp succeed or merely appear to? Without labels, a model learns trajectories. With labels, it learns manipulation semantics.
Multimodal data demands multimodal labels
A camera frame alone might show a hand touching an object. But was there actual contact? The tactile sensor says yes, and the force telemetry shows a spike at that exact timestamp. Annotation across modalities creates the cross-modal supervision signals that produce more reliable AI policies.
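The cross-modal check described here can be sketched in a few lines. This is a simplified illustration with hypothetical function names and an arbitrary threshold: a frame counts as confirmed contact only when vision suggests a touch, the tactile signal reports contact on that frame, and a force spike occurs within a small window around it.

```python
def force_spikes(force, threshold):
    """Indices where force rises by more than `threshold` over the previous sample."""
    return [i for i in range(1, len(force)) if force[i] - force[i - 1] > threshold]


def confirmed_contacts(visual_touch_frames, tactile_contact, force, threshold, window=1):
    """Frames where vision, tactile, and force telemetry all agree on contact."""
    spikes = force_spikes(force, threshold)
    confirmed = []
    for f in visual_touch_frames:
        tactile_ok = tactile_contact[f]
        spike_ok = any(abs(f - s) <= window for s in spikes)
        if tactile_ok and spike_ok:
            confirmed.append(f)
    return confirmed
```

Frames where vision suggests a touch but the other signals disagree are useful too: they are natural candidates for human review.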
Demonstration quality varies
Some episodes are clean demonstrations by expert operators; others contain recoveries from near-failures. Episode-level quality labels and frame-level event annotations let researchers curate training sets intelligently: upweighting successful demonstrations, mining hard examples from near-failure recoveries, and filtering out noisy or corrupt episodes.
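A minimal sketch of such curation, assuming hypothetical episode dictionaries with `id`, `outcome`, and optional `corrupt` and `recovered_from_near_failure` flags. The default weights and the 2x boost for recoveries are arbitrary illustrative values, not recommendations.

```python
def sampling_weights(episodes, weights=None):
    """Map episode id -> normalized sampling weight, dropping corrupt episodes."""
    weights = weights or {"success": 1.0, "partial_success": 0.5, "failure": 0.0}
    raw = {}
    for ep in episodes:
        if ep.get("corrupt"):
            continue  # filter out noisy or corrupt episodes entirely
        w = weights.get(ep["outcome"], 0.0)
        if ep.get("recovered_from_near_failure"):
            w *= 2.0  # hard-example boost (arbitrary illustrative factor)
        if w > 0:
            raw[ep["id"]] = w
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()} if total else {}
```

The resulting dictionary can be fed to a weighted sampler so that clean successes and hard recoveries appear in training batches in the proportions a team chooses.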
Annotation enables evaluation
Annotated datasets become benchmarks. They let teams track whether a new policy grasps more reliably, detects slip earlier, or handles deformable objects more consistently across variations. Without annotations, evaluation is guesswork.
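Two of the measurements mentioned above, grasp reliability and slip detection timing, can be computed directly from annotations. The sketch below assumes hypothetical inputs: a list of outcome strings per episode, and per-episode frame indices for the annotated slip onset and for the policy's detection (or `None` when the policy missed the slip).

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of episodes labeled 'success'."""
    return sum(o == "success" for o in outcomes) / len(outcomes) if outcomes else 0.0


def mean_slip_latency(annotated_slip_frames, detected_slip_frames, fps):
    """Mean delay in seconds between annotated slip onset and detection.

    Entries are matched by position; missed detections (None) are skipped.
    """
    delays = [
        (det - ann) / fps
        for ann, det in zip(annotated_slip_frames, detected_slip_frames)
        if det is not None
    ]
    return mean(delays) if delays else None
```

Comparing these numbers between two policy versions on the same annotated episodes gives a like-for-like signal, which is what turns a labeled dataset into a benchmark.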
Centific as a robotics data partner
The partnership between Centific and Hugging Face is about making open robotics data usable for AI training at scale. The track record includes 29 complex datasets already annotated, spanning some of the hardest manipulation challenges in the field. Whether you are a humanoid robotics company publishing datasets on Hugging Face, a research lab that needs expert annotation for collected demonstrations, or an enterprise building Physical AI applications from scratch, the infrastructure to go from raw robotics data to training-ready datasets is here.
The path from raw robotics data to reliable AI runs through high-quality annotation. Any robotics dataset on Hugging Face can be transformed into structured, training-ready data through this pipeline. Centific and Hugging Face are building that infrastructure together, and it is already in use.

