Category-Level 6D Object Pose Estimation in the Wild: A Semi-Supervised Learning Approach and A New Dataset

University of California San Diego

Our collected Wild6D dataset consists of a large number of object-centric RGBD videos.

Accurate 6D pose estimation results on diverse daily objects without any human annotations.

Abstract

We present the first method capable of estimating the category-level 6D object pose for in-the-wild data without using any annotations of real-world data.

6D object pose estimation is one of the fundamental problems in computer vision and robotics research. Although much recent work has focused on generalizing pose estimation to novel object instances within the same category, namely category-level 6D pose estimation, it remains restricted to constrained environments because annotated data is limited. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We use this data to generalize 6D object pose estimation in the wild with semi-supervised learning. We propose a novel model, called Rendering for Pose estimation network (RePoNet), that is jointly trained with free ground-truth annotations on synthetic data and a self-supervised objective on real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods on previous datasets and our Wild6D test set (with manual annotations for evaluation) by a large margin.

Video

Wild6D

Wild6D is a large-scale RGBD video dataset for 6D object pose estimation in the wild. Each video in the dataset shows multiple views of one or more objects. In total, there are more than 5,000 videos across 5 categories: bottle, can, mug, laptop and camera.

Method

Overview of the proposed method. Given the input image and depth map, RePoNet estimates the object pose, NOCS map and shape simultaneously via the Pose Network and the Shape Network. These two networks are bridged by a differentiable rendering module. By comparing the predicted binary mask with the input foreground mask, RePoNet can effectively leverage real-world data without any annotations.
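The mask comparison described above can be sketched as a simple overlap loss. This is a minimal illustration only: the page does not state which loss RePoNet uses, so the soft IoU form, the function name `soft_iou_loss`, and the `eps` parameter are all assumptions, and the masks are plain nested lists rather than tensors from a differentiable renderer.

```python
from typing import Sequence

Mask = Sequence[Sequence[float]]


def soft_iou_loss(rendered: Mask, foreground: Mask, eps: float = 1e-6) -> float:
    """Return 1 - soft IoU between two masks of the same size.

    `rendered` holds per-pixel values in [0, 1] (hypothetical renderer output);
    `foreground` holds the 0/1 foreground segmentation of the input image.
    The soft IoU form is an assumption, not the loss stated by the source.
    """
    if len(rendered) != len(foreground):
        raise ValueError("masks must have the same number of rows")

    intersection = 0.0
    union = 0.0
    for row_r, row_f in zip(rendered, foreground):
        if len(row_r) != len(row_f):
            raise ValueError("mask rows must have the same length")
        for r, f in zip(row_r, row_f):
            intersection += r * f        # soft overlap of the two masks
            union += r + f - r * f       # soft union (inclusion-exclusion)

    # eps keeps the division defined when both masks are empty.
    return 1.0 - intersection / (union + eps)
```

The loss is close to 0 when the rendered silhouette matches the foreground mask and approaches 1 when they do not overlap, so it needs only a segmentation mask and no pose annotation.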

Semi-supervised Setting

Illustration of the proposed semi-supervised setting. The synthetic data is supervised with all of its annotations, while the real-world data is used in a self-supervised manner by comparing the binary mask generated by the rendering module with the object foreground segmentation.
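One way to read this setting is as a single training objective with two parts. The sketch below is a hedged illustration, not the authors' formulation: the function name `semi_supervised_loss`, the term names in the usage example, and the `real_weight` balancing factor are hypothetical, and the individual loss values are assumed to be computed elsewhere.

```python
from typing import Mapping


def semi_supervised_loss(
    synthetic_terms: Mapping[str, float],
    real_mask_loss: float,
    real_weight: float = 1.0,
) -> float:
    """Combine supervised and self-supervised terms into one scalar.

    `synthetic_terms` maps each annotated quantity on synthetic data
    (names are illustrative) to its supervised loss value.
    `real_mask_loss` is the rendered-vs-foreground mask loss on real data.
    `real_weight` is a hypothetical balancing factor, not from the source.
    """
    supervised = sum(synthetic_terms.values())       # synthetic data: all annotations
    self_supervised = real_weight * real_mask_loss   # real data: mask comparison only
    return supervised + self_supervised


# Usage with made-up loss values for one synthetic batch and one real batch.
loss = semi_supervised_loss(
    {"pose": 0.5, "nocs": 0.25, "shape": 0.25},
    real_mask_loss=0.5,
    real_weight=2.0,
)
```

The point of the split is that the real-world branch never needs a pose, NOCS, or shape label; it contributes only through the mask term.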

Results

Wild6D

We test our method on the Wild6D test set. Green bounding boxes show the predictions and red ones indicate the ground truth.

NOCS

We test our method on the NOCS REAL275 test set. Green bounding boxes show the predictions and red ones indicate the ground truth.

BibTeX

@article{Fu2022Wild6D,
  author    = {Fu, Yang and Wang, Xiaolong},
  title     = {Category-Level 6D Object Pose Estimation in the Wild: A Semi-Supervised Learning Approach and A New Dataset},
  journal   = {arXiv:2206.15436},
  year      = {2022},
}