

Closing the sim-to-real gap with neural retargeting

Explore how Centific and N1 Robotics measured the sim-to-real dexterity gap across 1,400+ teleoperated robot episodes, and why neural retargeting may redefine how robotics foundation models are trained.

16 min read time


Topics

Robotics AI
Physical AI
Teleoperation
Neural Retargeting
Humanoid Robotics

Author(s)

Ahmed Abdellah (N1 Robotics)

Alex Cho (N1 Robotics)

Emily Shen (N1 Robotics)

Kriti Banka (Centific)

Leela Krishna (Centific)

Mangesh Damra (Centific)

Every robotics team has heard this story:

  • Months of work go into building a simulation environment — physics engines tuned to perfection, domain randomization cranked up, reward functions carefully shaped.

  • The policy achieves near-perfect task success in sim.

  • Then it’s transferred to real hardware.

  • The robot drops the object on the first try.

We measured this pattern directly at Centific. We built a Dexterous Manipulation Benchmark using simulated multi-fingered hands (16 degrees of freedom, 4 articulated fingers), then compared it against over 1,400 real teleoperated episodes from Unitree’s G1 humanoid robot across 29 task datasets — stacking blocks, pouring liquids, folding towels, cleaning tables, organizing tools, packaging cameras, and more.

1,400+ real episodes: the dataset that exposed the gap

Unitree Robotics, the manufacturer of the G1 humanoid, published a large-scale teleoperation dataset (DiverseManip) on HuggingFace, collected by human operators wearing VR headsets and controlling the G1’s arms and grippers in real time. This dataset spans single-arm and dual-arm tasks across three different end-effectors:

| End-Effector | DOF | Type | Tasks Covered |
| --- | --- | --- | --- |
| Unitree Dex1 | 1-DOF | Simple parallel gripper (open/close) | Towel folding, table cleaning, tool organizing, block stacking |
| Unitree Dex3 | 28-DOF | Multi-fingered dexterous hand | Block stacking, pouring, camera packaging, object placement |
| BrainCo Revo 2 | 6-DOF per hand | Anthropomorphic dexterous hand | Rubik's cube grasping, Oreo pickup, precision tasks |

This dataset gave us the real-world ground truth to benchmark against simulation. But it also revealed a critical limitation: the data was collected using standard VR controllers with basic IK retargeting. This method systematically destroys the most valuable signals in teleoperation data. This limitation is exactly what N1 Robotics’ neural retargeting technology is designed to solve.



Figure 1: Unitree G1 dexterous manipulation — dual-arm robot performing Rubik's Cube grasping and precision object manipulation tasks across the three end-effector configurations used in Centific’s benchmark.



Figure 2: Unitree G1 humanoid robot performing block manipulation tasks — the same robot platform used across all 29 datasets in Centific’s B4 Dexterous Manipulation Benchmark.

The numbers that opened our eyes

We measured four core metrics across simulation and real-world teleoperation. The results were unambiguous:

| Metric | What it measures | Simulation | Real teleoperation | Gap |
| --- | --- | --- | --- | --- |
| Task Success Rate | Did the robot complete the task? | ~95% | ~83% | ~1.1x |
| Manipulation Accuracy | How precise was object placement? | ~99% | ~68% | ~1.5x |
| Grasp Quality | Was the grasp stable and functional? | ~68% | ~47% | ~1.5x |
| Grasp Adaptiveness | Did the robot adjust mid-task? | ~100% | ~2% | ~50x |

Across these metrics, simulation consistently reports near-perfect performance, while real-world results diverge by 1.5x to over 50x on the measures that determine whether a task succeeds in deployment.
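The gap column is just the ratio of the simulation score to the real-world score. A minimal sketch using the approximate figures from the table above:

```python
# Approximate benchmark figures quoted above: (simulation, real teleoperation).
metrics = {
    "task_success_rate":     (0.95, 0.83),
    "manipulation_accuracy": (0.99, 0.68),
    "grasp_quality":         (0.68, 0.47),
    "grasp_adaptiveness":    (1.00, 0.02),
}

def overestimate(sim: float, real: float) -> float:
    """How many times simulation overestimates the real-world score."""
    return sim / real

gaps = {name: round(overestimate(s, r), 1) for name, (s, r) in metrics.items()}
print(gaps)  # grasp_adaptiveness dominates at ~50x
```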

The largest difference appears in grasp adaptiveness. Simulation reports constant adjustment, while real operators adapt roughly 2 out of every 100 frames. That result is not a weakness; it reflects how operators anticipate failure before it occurs, using wrist orientation, approach angle, and timing. A policy trained only in simulation does not learn those anticipatory strategies. Recent work from NVIDIA points in a similar direction. In introducing the CaP-X framework, Jim Fan, NVIDIA Director of AI & Distinguished Scientist, describes API-driven robotic systems in which perception, planning, and control are composed at runtime, allowing robots to solve certain manipulation tasks zero-shot without task-specific policy training. This approach shifts part of the problem from learning behavior in advance to assembling it during execution, reinforcing the limits of training-only approaches.

When we expanded our analysis across all 29 datasets spanning three different end-effectors (Dex1, Dex3, and BrainCo Revo 2 hands), the strongest dataset showed simulation overestimating grasp adaptiveness by over 80x. Across the full benchmark, 26 out of 29 datasets passed all quality gates, confirming the robustness of our methodology.

Why teleoperation teaches what simulation cannot

These results point to a broader issue: simulation captures outcomes, but not how those outcomes are achieved. When a human teleoperates a robot through a manipulation task (picking up a slippery bottle, threading a cable, handing a tool to a colleague), the recording captures details that simulation does not reproduce: the physics of contact as experienced through imperfect sensing and actuation.

Real contact is messy, and that’s the point

In simulation, when a fingertip contacts an object, the force is modeled as a clean vector computed from a friction cone. In reality, the fingertip deforms, surfaces vary at a micro level, and sensors introduce noise, drift, and delay. The operator responds to these conditions by increasing pressure or adjusting wrist orientation to stabilize the grasp.

Our benchmark showed this clearly: simulated grasps scored higher by the simulator’s own metric, but real teleoperated grasps, though “messier,” were perfectly functional. The teleoperator learned to work with imperfections. A sim-trained policy optimized for the wrong objective would fail.

Recovery is the most valuable signal

The most important moments in any manipulation episode are the failures: the fumbled pickup, the object sliding mid-transfer, the awkward re-grasp. In simulation, failures are rare. In real tasks, recovery happens frequently and provides useful training data. In our teleoperation dataset, nearly one in four episodes included at least one grasp-regrasp cycle. Every one of those is a lesson simulation cannot provide.
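Counting grasp-regrasp cycles from logged telemetry can be as simple as thresholding the gripper aperture and counting close-open-close transitions. A sketch under assumed conventions (the aperture signal, 0 = closed to 1 = open, and the threshold are illustrative, not the benchmark's actual definition):

```python
def count_regrasp_cycles(aperture, closed_threshold=0.2):
    """Count re-grasps in a gripper aperture trace (0 = closed, 1 = open).

    A re-grasp is any grasp (aperture below threshold) that follows an
    earlier grasp-and-release within the same episode.
    """
    grasps = 0
    holding = False
    for a in aperture:
        if not holding and a < closed_threshold:   # transition: open -> closed
            grasps += 1
            holding = True
        elif holding and a >= closed_threshold:    # transition: closed -> open
            holding = False
    return max(0, grasps - 1)  # grasps beyond the first are recoveries

# Episode with a fumble: grasp, drop, grasp again -> one recovery cycle.
trace = [1.0, 0.8, 0.1, 0.1, 0.9, 1.0, 0.15, 0.1]
print(count_regrasp_cycles(trace))
```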

Teleoperation is grounded in physical reality

Two of our benchmark task types, in-hand rotation and self-correction via finger gaiting, had zero matching episodes in the real dataset. Why? Not because the robot couldn’t attempt them, but because a simple gripper physically cannot rotate an object in-hand using finger gaiting. Simulation lets you define any task; reality constrains which tasks are possible with a given end-effector. Teleoperation data is honest. It contains only what the hardware can actually do.

These results show that teleoperation data is not a substitute for simulation, but a reference point for it. Simulation approximates task outcomes, but it misses the contact dynamics that determine real-world success, often by an order of magnitude on key metrics.

What Centific built: the most comprehensive dexterity benchmark in the industry

To quantify this mismatch, Centific built a benchmark that compares simulation and real-world performance directly. The benchmark measures how large the difference is, where it appears, and what drives it across tasks, datasets, and end-effectors.

A benchmark engineered from the ground up

Centific’s Robotics & Physical AI team designed and executed the B4 Dexterous Manipulation Benchmark spanning 29 real-world task datasets, over 1,400 teleoperated episodes, three different dexterous end-effectors (Unitree Dex1, Dex3, and BrainCo Revo 2), and a simulated Allegro hand baseline. This experiment covered:

  • Household tasks (folding towels, cleaning tables).

  • Precision tasks (pouring medicine, mounting cameras).

  • Logistics tasks (packing boxes, organizing tools).

  • Bimanual coordination (dual-arm cleaning, dishwasher loading).

No one in the industry had systematically measured the sim-to-real dexterity gap across so many tasks, end-effectors, and episodes. Centific’s benchmark is the first to put hard numbers on what the field has long suspected.

Centific’s technical capabilities

Building this benchmark required deep robotics engineering across multiple disciplines. Here is what the Centific team designed, built, and validated:

| Capability | What Centific built | Why it matters |
| --- | --- | --- |
| 4-Metric Evaluation Framework | TSR, DMS, GQS, GAR — four complementary metrics with pass/fail gates | Industry-first standardized way to measure dexterity quality |
| 29-Dataset Benchmark Pipeline | Ingest HuggingFace datasets, extract joint trajectories, compute metrics, generate reports | Reproducible, scalable evaluation across any teleoperation dataset |
| IK Retargeting Engine | Cross-embodiment retargeting that normalizes Dex1/Dex3/BrainCo data to a common 16-DOF Allegro joint space | Enables fair comparison across different robot hands — revealed the 50x-to-1.6x insight |
| Neural Retargeting Architecture | Production-ready infrastructure for ML-based retargeting (WaldoRT-compatible) | Ready to integrate with N1 Robotics or any neural retargeting model |
| Sim-to-Real Fine-tuning Pipeline | Alpha-annealing data mixer, asymmetric actor-critic (SAC), dual replay buffer with cross-trial recycling | Implements latest RL research for stable real-world policy fine-tuning |
| Multi-Domain Benchmark Framework | 4 benchmark domains: Airport Contact (B1), PCB Assembly (B2), Egocentric Transfer (B3), Dexterous Manipulation (B4) | Extensible architecture — new domains plug in without breaking existing ones |
| NVIDIA Cosmos Reason 2 Integration | Verification and error-correction pipeline powered by Cosmos Reason 2 video understanding | AI-powered quality assurance for robotic manipulation episodes |

The insight only this benchmark could reveal

The most important finding from Centific’s benchmark was not the ~50x gap itself. It was what happened when we applied retargeting: the gap collapsed from ~50x to under 2x. This proved that the massive raw gap was largely a measurement artifact caused by comparing 1-DOF grippers against 16-DOF simulated fingers. The true dexterity gap is much smaller, but it can only be seen through proper retargeting.

This result shows what retargeting changes. When joint spaces are aligned, the observed difference reflects actual behavior rather than differences in embodiment. It also shows that teleoperated data contains more useful information than raw comparisons suggest, because human operators encode stable grasp strategies before failure occurs.

This insight has direct commercial implications: it means the problem is solvable. With the right retargeting technology, like N1 Robotics’ WaldoRT, the remaining gap can be closed with far less data than anyone previously assumed. Centific's benchmark is the evidence base that makes this case quantitatively.



Figure 3: Centific’s data collection pipeline — robot joint state recordings showing State and Action telemetry across manipulation tasks in the B4 benchmark.


Figure 4: Centific benchmark episode — robot arm performing block manipulation task with real-time joint trajectory capture, part of Centific's 29-dataset evaluation framework.


The teleoperation bottleneck: why collecting data is still too slow

If teleoperation data is widely valuable, why isn’t everyone collecting it at scale? Because the current process is painfully slow.

Challenge

Industry Reality

Setup time

2-3 weeks of custom engineering to integrate XR trackers with a new robot

Collection speed

~50 episodes per hour after accounting for resets and operator fatigue

Retargeting

Linear IK mapping that destroys the grasp intent the operator was trying to convey

Cross-embodiment

Every new robot hand requires a custom retargeting pipeline from scratch


The industry standard is linear Inverse Kinematics (IK) retargeting — mapping human joint angles to robot joint angles through geometric correspondence. This sounds reasonable, but it systematically destroys exactly the signals that make teleoperation data valuable.

When a human pinches an object between thumb and index finger, the IK mapper sees two joint angles. It maps them independently to the robot’s two corresponding joints. But the pinch isn’t about individual joint angles; it’s about the coordinated closure that creates force closure on the object. Linear IK strips away this coordination, the grasp intent, and the precision geometry. What remains is a lossy approximation that requires many more demonstrations to learn the same policy. 
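The loss can be made concrete with a toy two-joint example: a diagonal (per-joint) map has no way to express that one finger's closure should depend on the other's, while a coupled map, of the kind a learned retargeter approximates, can. All gains below are made-up numbers for illustration:

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

human_pose = [0.8, 0.9]  # thumb and index flexion angles (radians)

# Per-joint linear IK: each robot joint copies exactly one human joint.
linear_map = [[1.1, 0.0],
              [0.0, 0.95]]

# A coupled map: each robot joint also depends on the *other* finger,
# encoding the coordinated closure that makes a pinch work.
coupled_map = [[1.1, 0.25],
               [0.30, 0.95]]

robot_linear = matvec(linear_map, human_pose)
robot_coupled = matvec(coupled_map, human_pose)
# The off-diagonal terms are exactly the coordination linear IK discards.
```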

Neural retargeting: the breakthrough that changes everything

Here is where the pieces connect. Centific’s benchmark used Unitree’s G1 robot dataset, collected with VR headsets and basic IK retargeting. We found massive gaps, especially in grasp adaptiveness. What if the same Unitree G1 data was collected with a system that preserves grasp intent instead of destroying it?

That system exists. N1 Robotics’ Waldo is built for the same Unitree G1 humanoid that our benchmark is based on. But it replaces the VR + IK retargeting pipeline with neural retargeting that faithfully captures what the human operator intended.

Same robot, better data: Waldo on the Unitree G1

Waldo is a complete dexterous teleoperation platform designed specifically for the Unitree G1 humanoid (29-DOF) with BrainCo Revo 2 hands (6-DOF per hand), the same robot and one of the same end-effectors in our benchmark. It includes finger-tracking gloves with EMF sensors, Vive trackers for 6-DOF wrist tracking, a pre-configured inference PC, and software that handles calibration, recording, and export out of the box.

| Capability | Industry baseline | Waldo |
| --- | --- | --- |
| Setup time | 2-3 weeks | 15 minutes |
| Collection speed | ~50 episodes/hr | ~200 episodes/hr |
| End-to-end latency | Variable, often >200ms | <100ms |
| Retargeting method | Linear IK (lossy) | WaldoRT neural mapping (preserves intent) |

Using Waldo for teleoperation results in a 4x speedup in data collection and a 100x reduction in setup time. For example, Unitree’s DiverseManip dataset, the same dataset Centific benchmarked, took an estimated 20 hours to collect using VR headsets. With Waldo on the same G1 robot, an equivalent dataset can be collected in roughly 5 hours, with higher fidelity per episode due to neural retargeting.
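Those rates compose as simple arithmetic; a hypothetical 1,000-episode collection target makes the tradeoff concrete:

```python
episodes = 1000        # hypothetical collection target (not from the article)
baseline_rate = 50     # episodes/hour with the VR + IK baseline
waldo_rate = 200       # episodes/hour quoted for Waldo

baseline_hours = episodes / baseline_rate  # time with standard teleop
waldo_hours = episodes / waldo_rate        # time with Waldo
speedup = baseline_hours / waldo_hours     # the 4x collection speedup

print(f"{baseline_hours:.0f}h vs {waldo_hours:.0f}h ({speedup:.0f}x faster)")
```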



Figure 5: Waldo vs Meta Quest — side-by-side comparison. Waldo (left): Episode Count 4, Avg 42.86s per episode, 93.3% success rate, 14/15 successful. Meta Quest (right): Episode Count 1, Avg 300s per episode, 18.2% success rate, 2/11 successful. Same robot, same task, same time window.


Neural retargeting in action

Neural retargeting is the mechanism that makes teleoperation data usable for training.


Figure 6: WaldoRT Neural Retargeting in action — human hand pose (left) mapped to robot hand pose (right) using neural network inference at 1 kHz. The model preserves finger coordination and grasp intent that IK mapping destroys. This is the core technology that collapses the 50× gap to under 2×.


This figure shows how neural retargeting maps human hand motion to robot control in real time. The mapping preserves coordination across fingers and maintains grasp intent during execution.

N1 Robotics partnership: what this means for Centific

Centific is partnering with N1 Robotics to bring Waldo into our large-scale data collection pipeline. Key details from our partnership:

  • WaldoRT runs inference at 1 kHz: lightweight, fast, and scales identically from 6-DOF to higher DOF hands with zero latency increase

  • Setup takes under 30 seconds: push a pedal and data collection begins immediately

  • Time between episodes can be as short as 5 seconds, enabling rapid rinse-and-repeat collection loops

  • Inference PC ships pre-configured with an NVIDIA GPU, which is compatible with Centific’s Jetson Thor infrastructure

  • Multi-operator profiles are pre-saved: operators clock in ready to collect without hand retraining

  • Enterprise plan includes 100 hours per month of N1 Robotics managed operator time on Centific’s systems

  • Custom integration builds available for additional end-effectors and robot embodiments beyond humanoids

These capabilities make it possible to collect high-fidelity manipulation data quickly and consistently across operators, tasks, and robot configurations.
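Running retargeting at 1 kHz means every inference step must finish inside a 1 ms budget. A generic fixed-rate control loop of the kind such a system needs can be sketched as follows (the `step` callback is a stand-in, not WaldoRT's API):

```python
import time

def run_control_loop(step, rate_hz=1000, duration_s=0.05):
    """Call `step` at a fixed rate, sleeping off the remainder of each period."""
    period = 1.0 / rate_hz
    ticks = 0
    next_deadline = time.perf_counter()
    while ticks < int(duration_s * rate_hz):
        step(ticks)                 # must complete well inside `period`
        ticks += 1
        next_deadline += period
        sleep_for = next_deadline - time.perf_counter()
        if sleep_for > 0:
            time.sleep(sleep_for)   # overruns skip the sleep instead of drifting
    return ticks

# 50 ms at 1 kHz -> 50 control ticks.
ticks = run_control_loop(step=lambda t: None)
```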

CaP-X: when frontier LLMs become robot controllers

A parallel development is reshaping how we think about robot control entirely. CaP-X (Coding as Policies), available at capgym.github.io, demonstrates that frontier large language models can control robots, without any robot-specific training, simply by writing executable control code from natural language task descriptions.

CaP-Agent0: zero-shot manipulation

CaP-Agent0 is a training-free coding agent evaluated across 100+ manipulation tasks spanning LIBERO-PRO, Robosuite, and BEHAVIOR. The findings:

  • Frontier models achieved over 30% average zero-shot success on manipulation, with no task-specific training

  • On LIBERO-PRO with position and instruction perturbations, state-of-the-art VLAs like OpenVLA and pi0 scored 0% across the board

  • Even the best VLA (pi0.5) reached only 13% average success on perturbed tasks

  • CaP-Agent0 reached 18% (better than the best VLA) without any training

These results show that training alone does not guarantee robustness, and that alternative approaches can outperform learned policies even without task-specific optimization.
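The coding-as-policies pattern itself is simple to sketch: give a model the robot's primitive API and a task instruction, get back a program, and execute it against that API. Everything below (the primitives, the canned program standing in for a frontier LLM) is hypothetical scaffolding, not CaP-X's actual interface:

```python
class RobotAPI:
    """Minimal manipulation primitives exposed to the generated program."""
    def __init__(self):
        self.log = []
    def pick(self, obj):
        self.log.append(("pick", obj))
    def place_on(self, obj, target):
        self.log.append(("place", obj, target))

def fake_llm(instruction: str, api_doc: str) -> str:
    # Stand-in for a frontier model that writes executable policy code.
    return (
        "for obj in ['red_cube', 'blue_cube', 'apple']:\n"
        "    robot.pick(obj)\n"
        "    robot.place_on(obj, 'stack')\n"
    )

def run_policy(instruction: str) -> RobotAPI:
    robot = RobotAPI()
    code = fake_llm(instruction, api_doc="pick(obj), place_on(obj, target)")
    exec(code, {"robot": robot})  # real systems sandbox this step
    return robot

robot = run_policy("Stack everything as high as you can.")
```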



Figure 7: CaP-Agent0 in action: robot recognizes the affordance of stacking order, placing cubes before the round apple. The instruction: “Stack everything as high as you can.” Autonomous execution (4x speed). Source: capgym.github.io



Figure 8: CaP-Agent0 task completed. Cubes stacked with apple on top, demonstrating embodied reasoning about object geometry without any task-specific training. Source: capgym.github.io


CaP-RL: code-based post-training

CaP-RL applies reinforcement learning directly to the coding agent. A 7B-parameter model (Qwen 2.5 Coder) jumped from 20% to 72% average success in simulation after just 50 training iterations. The learned policies then transferred to a real Franka robot (84% on cube lifting, 76% on cube stacking) with minimal sim-to-real gap.

Why CaP-X matters for this pipeline

CaP-X addresses the task generalization layer, allowing robots to handle novel instructions without retraining. When CaP-Agent0 generates task code and WaldoRT-collected episodes provide the fine-grained manipulation signal, they cover complementary failure modes. CaP-X handles novel instructions and task reasoning; WaldoRT handles the physical contact fidelity that code alone cannot capture.

The complete pipeline: from simulation to deployment

Our benchmark, combined with recent advances in data-efficient reinforcement learning, points to a five-phase architecture for closing the sim-to-real dexterity gap:


Figure 9: Five-phase pipeline for closing the sim-to-real dexterity gap — Phase 1: Portable Data Collection → Phase 2: Neural Retargeting (WaldoRT) → Phase 3: Simulation Pre-training → Phase 4: Real-World Fine-tuning → Phase 5: Deployment with Human-in-the-Loop Correction.


Phase 1: portable data collection

Capture human manipulation with wearable motion-capture rigs (finger tracking gloves + wrist trackers). No robot needed during collection. Approaches like DexCap (RSS 2024) achieve 3x the speed of traditional teleoperation at a fraction of the cost. Critically, store all attempts, including failures, because mixed-quality data provides better state-space coverage than expert-only demonstrations.

Phase 2: neural retargeting

Map human hand motions to robot hand commands using learned models like WaldoRT. This preserves grasp intent, force closure, and finger coordination at 1-2ms per frame, compatible across different end-effectors without custom engineering per robot.
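At its core, a learned retargeter is a small network mapping a human hand pose to robot joint targets, cheap enough to evaluate every frame. A toy forward pass with made-up dimensions and random weights (the article does not describe WaldoRT's actual architecture):

```python
import math
import random

random.seed(0)

HUMAN_DIM = 21 * 3   # e.g. 21 hand keypoints, xyz each
ROBOT_DOF = 16       # e.g. an Allegro-style 16-DOF hand
HIDDEN = 32

# Tiny two-layer MLP with random weights, standing in for a trained model.
W1 = [[random.gauss(0, 0.05) for _ in range(HUMAN_DIM)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.05) for _ in range(HIDDEN)] for _ in range(ROBOT_DOF)]

def retarget(keypoints):
    """Map a flat human-hand keypoint vector to robot joint targets."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, keypoints))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

frame = [0.0] * HUMAN_DIM   # one (zeroed) motion-capture frame
joints = retarget(frame)
```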

Phase 3: simulation pre-training

Train policies in massively parallel simulation, with thousands of environments running simultaneously, to learn task structure such as approach trajectories, finger coordination patterns, and basic manipulation sequences. Our benchmark shows that simulation transfers task-level structure well, with an approximately 1.1x gap in success rate, even though contact-level details remain poor.

Phase 4: real-world fine-tuning

Fine-tune sim-pretrained policies on real teleoperation data using stabilized RL techniques: alpha-annealing data mixing (gradually shifting from sim to real data), asymmetric actor-critic updates, and warm-start episodes. Research shows these techniques prevent the catastrophic forgetting that typically destroys sim-pretrained policies during real-world fine-tuning.
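Alpha-annealing can be sketched as a sampling schedule: each batch draws a fraction α of simulation samples, and α decays toward real data over fine-tuning. The linear schedule and constants below are assumptions for illustration; the underlying papers vary:

```python
import random

def alpha_schedule(step, total_steps, alpha_start=0.9, alpha_end=0.1):
    """Linearly anneal the fraction of sim samples per batch."""
    frac = min(step / total_steps, 1.0)
    return alpha_start + (alpha_end - alpha_start) * frac

def sample_batch(sim_data, real_data, step, total_steps, batch_size=8):
    """Mix sim and real samples according to the annealed fraction."""
    alpha = alpha_schedule(step, total_steps)
    n_sim = round(alpha * batch_size)
    batch = [random.choice(sim_data) for _ in range(n_sim)]
    batch += [random.choice(real_data) for _ in range(batch_size - n_sim)]
    return batch, alpha

sim = [("sim", i) for i in range(100)]
real = [("real", i) for i in range(100)]
early, a0 = sample_batch(sim, real, step=0, total_steps=1000)     # mostly sim
late, a1 = sample_batch(sim, real, step=1000, total_steps=1000)   # mostly real
```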

Phase 5: deployment + human-in-the-loop correction

Deploy the fine-tuned policy with live Waldo teleoperation for human-in-the-loop correction. When the policy fails, the operator takes over, generating the hardest and most valuable training data. At 200 episodes per hour, correction data accumulates rapidly. Each cycle closes the gap further.
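The deployment loop is an arbitration between policy and operator, with takeover segments logged as new training data. A schematic sketch in which failure detection and both controllers are stand-ins:

```python
def run_episode(policy_action, operator_action, failure_detected, horizon=10):
    """Arbitrate between policy and operator; log correction segments."""
    corrections = []
    for t in range(horizon):
        if failure_detected(t):
            action = operator_action(t)       # operator takes over
            corrections.append((t, action))   # hardest, most valuable data
        else:
            action = policy_action(t)
        # (send `action` to the robot here)
    return corrections

# Toy episode: the policy fails on steps 4-6, so the operator intervenes there.
corrections = run_episode(
    policy_action=lambda t: ("policy", t),
    operator_action=lambda t: ("operator", t),
    failure_detected=lambda t: 4 <= t <= 6,
)
```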

Centific + neural retargeting: the roadmap

Centific’s benchmark provides evidence of where simulation falls short and how large the gap remains in real-world performance. The five-phase pipeline defines the architecture. Neural retargeting platforms like N1 Robotics’ Waldo provide the tooling. Together, they form a concrete roadmap for building the data infrastructure that next-generation robotics demands.

Why this matters for foundation models

The next frontier in robotics is dexterous foundation models: vision-language-action (VLA) models that can reason about finger placement, contact forces, and in-hand manipulation from natural language instructions. Models like RT-2, Octo, and pi0 have shown this is possible for simple grippers. But extending to multi-fingered dexterity requires millions of teleoperated episodes with rich grasp data, which is the kind of data that Centific’s pipeline is designed to produce and that neural retargeting makes feasible to collect at scale.


Implications for Dexterous Manipulation Training

The gap between simulated and real-world dexterous manipulation is a fundamental measure of how much real-world contact physics matters and how much of that knowledge can only come from human-guided teleoperation.

Simulation gives you task structure. Teleoperation gives you contact truth. Neural retargeting ensures that contact truth is preserved when mapping human motion to robot control.

The implications extend beyond model performance. The constraint is not simulation quality alone, but the data used to train policies. Robotics companies that invest in teleoperation infrastructure and collect high-fidelity manipulation data will have an advantage, because their models are trained on the contact behavior that determines real-world outcomes.

| Dimension | Simulation | Teleoperation | Winner |
| --- | --- | --- | --- |
| Task success | Higher scores | Harder, but grounded in reality | Teleop |
| Manipulation accuracy | Near-perfect | Imperfect but true | Teleop |
| Grasp quality | Optimized for wrong metric | Messier, but functional | Teleop |
| Grasp adaptiveness | Scripted, artificial | Anticipatory, human-intelligent | Teleop |
| Recovery behaviors | Grasps don't fail in sim | 1 in 4 episodes includes recovery | Teleop |
| Data fidelity | Perfect but wrong | Imperfect but true | Teleop |
| Setup cost (with Waldo) | GPU cluster + months | ~$12K hardware + 15 min | Teleop |

 
