We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors, which we then distill into a single "student" model. To allow for efficient learning of high-fidelity shape representations, we represent the shapes as point clouds and devise a formulation that allows for their differentiable projection. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows detailed shape models to be predicted.
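As a rough illustration of the reprojection idea, the sketch below renders a point cloud to a soft silhouette and compares it with a target view. It is not the authors' formulation, which the abstract does not spell out: the orthographic camera, the Gaussian splatting, and the names `rotation_y`, `project_points`, and `reprojection_error` are all assumptions made for this example.

```python
import numpy as np

def rotation_y(angle):
    """Rotation about the y axis (hypothetical stand-in for a predicted pose)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_points(points, rotation, size=32, sigma=1.5):
    """Softly render an (N, 3) point cloud in [-1, 1]^3 to a (size, size) image.

    Assumption: orthographic projection along z with one Gaussian splat per
    point, so the image is a smooth function of the point coordinates.
    """
    cam = points @ rotation.T
    # Map x and y from [-1, 1] to pixel coordinates.
    px = (cam[:, 0] + 1.0) * 0.5 * (size - 1)
    py = (cam[:, 1] + 1.0) * 0.5 * (size - 1)
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs[None] - px[:, None, None]) ** 2 + (ys[None] - py[:, None, None]) ** 2
    splats = np.exp(-d2 / (2.0 * sigma ** 2))
    # Soft union of the per-point splats; values lie in [0, 1].
    return 1.0 - np.prod(1.0 - splats, axis=0)

def reprojection_error(points, rotation, target):
    """Mean squared difference between the rendered view and the target view."""
    return float(np.mean((project_points(points, rotation) - target) ** 2))
```

In a training setup, the point cloud and the rotation would come from the network, and the error would be minimized with an automatic differentiation framework rather than plain NumPy.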
Authors
Eldar Insafutdinov