
Teaching a robot to pick up a coffee cup sounds like a programming problem: write the code, specify the coordinates, execute the motion. But a leading approach in robotics AI now takes a different path: training robots by showing them first-person video of humans performing tasks, rather than scripting each movement by hand. The method has shown promising results in controlled settings, but it rests on the assumption that what a robot sees in human video can be translated into actions it can perform with its own body.
The core difficulty is what researchers call the embodiment gap, the physical and perceptual mismatch between a human body and a robotic system.
Hendrik Chiche, CEO and Co-Founder of OMGrab Inc., is working to understand and narrow that gap through a combination of deployed robotics, AI models, and new sensing hardware. His work at OMGrab offers a concrete look at where this science stands and what it might take to keep developing it.
What a Robot Sees Is Not What a Robot Can Do
The idea behind egocentric video learning is intuitive. Strap a camera to a person's head, record them performing a task such as folding laundry, stacking shelves, or assembling parts, and feed that footage to a robot as training data; the robot essentially learns from direct demonstration. In theory, this scales far more efficiently than writing task-specific code for every new environment.
Yet in practice, the transfer is far less straightforward. Humans have five flexible fingers capable of adjusting grip pressure in real time. Most industrial and research robots operate with rigid two- or three-fingered grippers that lack that dexterity. The mismatch extends beyond hands. Camera height differs between a human head and a robot's sensor mount. Field of view, joint range of motion, and the physics of contact between a gripper and an object all diverge from what the human demonstrator experienced.
Even high-quality first-person video doesn't automatically encode information a robot can act on, because the body performing an action and the body seeking to replicate it are two very different machines.
Hendrik Chiche's Firsthand Knowledge of the Problem
Hendrik Chiche's path to the embodiment gap problem traces through a series of technical roles that each exposed a different dimension of the challenge. After completing his mechanical engineering degree in France, he worked on MRI super-resolution research at GENCI, where he first applied neural networks to vision-based problems, training models to reconstruct fine-grained detail from limited imaging data.
He then worked on radar-based object detection for autonomous vehicles at Zendar, gaining direct experience with sensor fusion and the difficulty of translating one sensor modality into reliable spatial understanding. Later, at Mainspring Energy, he built time series prediction models to identify machine failures across a fleet of generators, a role that showed him how wide the gap between laboratory AI performance and messy real-world data could be.
This experience spanning different industries gave him the technical intuition that now guides his work at OMGrab: that bridging the embodiment gap requires better data infrastructure, better sensing, and a modular approach to the problem.
How OMGrab Tackles the Embodiment Gap
OMGrab, which Chiche cofounded in late 2025, builds the data collection infrastructure that supports this research: wearable recording hardware and a cloud platform for streaming and storing robotics training data.
OMGrab's transfer method relies on inverse kinematics, a well-established mathematical technique that calculates the joint movements a robot needs to reach a desired position. Rather than feeding video into an end-to-end neural network and hoping the model figures out the rest, Chiche designed the pipeline to be modular, drawing on his background in mechanical engineering and applied ML: extract human motion from video, compute its equivalent robot motion mathematically, and execute.
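To make that modular flow concrete, here is a minimal sketch for a toy planar two-link arm using damped least-squares inverse kinematics. The function names, the arm geometry, and the stand-in hand detector are illustrative assumptions, not OMGrab's actual pipeline:

```python
# Illustrative video-to-robot pipeline: extract a target from a human demo,
# solve IK for the robot's joints, then send the command to the controller.
# All names and the toy 2-link arm are assumptions for this sketch.
import numpy as np

def extract_hand_position(frame):
    """Stand-in for a vision model that estimates the demonstrator's hand
    position from an egocentric video frame (here: a fixed target)."""
    return np.array([0.8, 0.5])  # meters, in the robot's base frame

def forward_kinematics(q, link_lengths=(0.6, 0.4)):
    """End-effector position of a planar 2-link arm with joint angles q."""
    l1, l2 = link_lengths
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q, link_lengths=(0.6, 0.4)):
    """Jacobian of the end-effector position with respect to the joints."""
    l1, l2 = link_lengths
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def inverse_kinematics(target, q=np.zeros(2), damping=0.05, iters=200):
    """Damped least-squares IK: nudge the joints until the end-effector
    reaches the target extracted from the human demonstration."""
    for _ in range(iters):
        error = target - forward_kinematics(q)
        if np.linalg.norm(error) < 1e-4:
            break
        J = jacobian(q)
        # Damped pseudo-inverse keeps the update stable near singularities.
        q = q + J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), error)
    return q

target = extract_hand_position(frame=None)   # step 1: extract human motion
joint_angles = inverse_kinematics(target)    # step 2: compute robot motion
print("joint command (rad):", joint_angles)  # step 3: execute on the robot
```

The appeal of this structure is that each stage can be inspected and swapped independently, rather than trusting a single end-to-end network to bridge the embodiment gap on its own.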
A recent showcase of this method's capabilities took place at a Safeway grocery store, where OMGrab deployed a real robot that learned manipulation tasks from watching egocentric human video. The deployment moved the experiment into a variable, real-world environment where lighting, object placement, and surface conditions weren't controlled.
The result was not a solved problem but a measured proof of concept: human-to-robot transfer via egocentric video is feasible for specific manipulation tasks, with the expected caveat that the conditions under which it succeeds and fails are still being mapped.
Building a World Model
Parallel to the transfer problem, Chiche has been developing what researchers call a world model, a system that predicts what will happen next in a visual scene, effectively giving a robot an internal simulation of physical reality. He architected the model to predict color and depth information at 15 frames per second, training it from scratch on less than one hour of video and in some cases as little as one minute.
The data efficiency of Chiche's approach is significant. Most comparable models require large datasets and computing resources, making them borderline inaccessible for smaller research teams. His goal is to build the first desktop-scale world model for robotics that trains on minimal data and runs in real time, which would change the economics of robotics development, allowing rapid iteration without massive data collection campaigns.
The depth prediction component is also specifically relevant to the embodiment gap: three-dimensional scene understanding helps a robot compensate for the differences in perspective between where a human camera sits and where a robot's sensors are mounted.
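As a rough illustration of the idea, the sketch below trains a tiny next-frame predictor on consecutive RGB-D frames (four channels: color plus depth). The architecture, frame size, and training loop are assumptions chosen for brevity, not Chiche's actual model:

```python
# Minimal next-frame RGB-D world model sketch (illustrative assumptions only).
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Maps the current RGB-D frame to a prediction of the next one,
    giving the robot a one-step internal simulation of the scene."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, frame):
        return self.decoder(self.encoder(frame))

# One training step on consecutive frames from a short clip: the model
# learns to map frame t to frame t+1 for both color and depth channels.
model = TinyWorldModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(16, 4, 64, 64)              # placeholder: 16 RGB-D frames
current, nxt = frames[:-1], frames[1:]
prediction = model(current)
loss = nn.functional.mse_loss(prediction, nxt)  # penalize color + depth error
loss.backward()
optimizer.step()
print("one-step prediction loss:", loss.item())
```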
Leading Academic Research at UC Berkeley
Finally, vision alone, even with depth prediction, misses a key category of information. How hard a hand squeezes, how different surface textures feel, and the micro-adjustments of finger pressure during a grasp are all tactile signals that cameras can't record.
Chiche's expertise in this area recently led UC Berkeley's Fung Institute to select him as an industry capstone lead, a role in which he directs two graduate research teams and sets the technical agenda for their semester-long projects. In this capacity, Chiche defines the research questions, designs the experimental protocols, and mentors graduate students through the process of translating open-ended scientific problems into structured engineering work.
One team, under Chiche's direction, is developing a next-generation tactile sensing glove designed to capture hand pressure, contact area, and grip dynamics during human manipulation tasks, creating a data stream that doesn't yet exist in most robotics training pipelines. A second team, also led by Chiche, is running controlled experiments to quantify the vision embodiment gap directly, measuring how robot performance decreases as the physical differences between the human demonstrator and the robot learner increase.

These research initiatives seek to study the embodiment gap as a measurable scientific variable, one that can be characterized at scale and, in time, reduced with better data and sensing. The tactile glove work also points to a future where robotics data collection combines vision, depth, force, and touch into training datasets that more completely capture what a human hand actually does during a task.
These university research efforts feed directly back into OMGrab's commercial mission. Understanding what kinds of data robots need to imitate humans reliably informs what the data collection hardware must capture.
The embodiment gap is not going away, but Hendrik Chiche's work, spanning deployed robots, lightweight world models, and graduate-level sensing research, represents a measured, technically grounded effort to narrow it.










