Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation

Shuyuan Yang; Zonghe Chua

Robotics 2026-03-18 v2

Authors: Shuyuan Yang, Zonghe Chua

Abstract

Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception: joint encoder readings are typically inaccurate due to kinematic non-idealities in the cable-driven transmissions. Existing vision-based pose estimation approaches are highly effective, but they lack real-time capability or generalizability, or can be hard to train. In this work, we present a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. Using a real robot dataset, we demonstrate the potential of this approach to correct noisy pose estimates and its potential for real-time processing. Our approach reduces hand-eye translation errors in the dataset by more than 50%, matching the performance of an existing optimization-based method, while running four times faster, with near real-time inference at 22 Hz. Zero-shot prediction on an unseen dataset shows good generalization, and the model can be further finetuned for increased performance without human labeling.

Keywords

pose estimation, surgical robot, surgical robotics

Cite

@article{arxiv.2505.08875,
  title  = {Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation},
  author = {Shuyuan Yang and Zonghe Chua},
  journal= {arXiv preprint arXiv:2505.08875},
  year   = {2026}
}
