Yuanhong Yu

I am a Ph.D. student at Zhejiang University, supervised by Sida Peng and Xiaowei Zhou. Prior to this, I received my B.Eng. in Computer Science from Northwestern Polytechnical University, advised by Jiaqi Yang.

My research focuses on Computer Vision, 3D Vision, and Vision-Language Models. Alongside academic research, I actively build open-source tools at the frontier of AI agent infrastructure and developer experience.

Computer Vision · 3D Vision · Vision-Language Models · Embodied AI
ZJU-3DV · Ant Group

Currently working on

Spatial Intelligence: 3D understanding from vision

AI Agent Infrastructure: Developer tools ecosystem

Vision-Language Models: Multimodal understanding

Research

Publications

BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
ICCV 2025

Yuanhong Yu, Xingyi He, Chen Zhao, Junhao Yu, Jiaqi Yang, Ruizhen Hu, Yujun Shen, Xing Zhu, Xiaowei Zhou, Sida Peng

We present BoxDreamer, a novel framework that estimates 6-DoF object poses by predicting 3D bounding box corners from a single RGB image. Our approach achieves strong generalization to unseen object categories without requiring CAD models or category-specific training, enabling real-world deployment across diverse scenarios.

Latest

Preprints

EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
arXiv 2026

Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He, Xiaoyan Wu, Haoran Xu

We propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a frame-by-frame, in-context, audio-driven streaming generation paradigm and uses Sink Frame Window Attention to support variable-length video generation while keeping identity consistent.

Building in Public

Open Source

Beyond research, I build open-source tools that make AI-assisted development safer and more observable. The agent ecosystem below spans security, observability, and workflow automation for Claude Code and beyond.