
Learning 3D shape completion under weak supervision. (English) Zbl 1472.68201

Summary: We address the problem of 3D shape completion from sparse and noisy point clouds, a fundamental problem in computer vision and robotics. Recent approaches are either data-driven or learning-based: data-driven approaches rely on a shape model whose parameters are optimized to fit the observations; learning-based approaches, in contrast, avoid the expensive optimization step by learning to directly predict complete shapes from incomplete observations in a fully supervised setting. However, full supervision is often not available in practice. In this work, we propose a weakly-supervised learning-based approach to 3D shape completion that requires neither slow optimization nor direct supervision. While we also learn a shape prior on synthetic data, we amortize, i.e., learn, maximum likelihood fitting using deep neural networks, resulting in efficient shape completion without sacrificing accuracy. On synthetic benchmarks based on ShapeNet [A. X. Chang et al., “ShapeNet: an information-rich 3D model repository”, Preprint, arXiv:1512.03012] and ModelNet [Z. Wu et al., “3D ShapeNets: a deep representation for volumetric shapes”, in: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR’15. Los Alamitos, CA: IEEE Computer Society. 1912–1920 (2015; doi:10.1109/CVPR.2015.7298801)] as well as on real robotics data from KITTI [the second author et al., “Are we ready for autonomous driving? The KITTI vision benchmark suite”, in: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR’12. Los Alamitos, CA: IEEE Computer Society. 3354–3361 (2012; doi:10.1109/CVPR.2012.6248074)] and Kinect [B. Yang et al., “Dense 3D object reconstruction from a single depth view”, IEEE Trans. Pattern Anal. Mach. Intell. 41, No. 12, 2820–2834 (2019; doi:10.1109/TPAMI.2018.2868195)], we demonstrate that the proposed amortized maximum likelihood approach is able to compete with the fully supervised baseline of A. Dai et al. [“Shape completion using 3D-encoder-predictor CNNs and shape synthesis”, in: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR’17. Los Alamitos, CA: IEEE Computer Society. 6545–6554 (2017; doi:10.1109/CVPR.2017.693)] and outperforms the data-driven approach of F. Engelmann et al. [“Joint object pose estimation and shape reconstruction in urban street scenes using 3D shape priors”, Lect. Notes Comput. Sci. 9796, 219–230 (2016; doi:10.1007/978-3-319-45886-1_18)], while requiring less supervision and being significantly faster.
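The contrast between per-instance maximum likelihood fitting and its amortized counterpart can be illustrated with a deliberately tiny numerical sketch. Everything below is an illustrative assumption, not the paper's method: the deep decoder is replaced by a fixed linear map, the amortized encoder by a least-squares fit, and shapes by short vectors. The structural point survives: the data-driven route re-optimizes a latent code for every observation, while the amortized route trains a map from observations to codes once, offline, on synthetic pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned shape prior: latent codes z decode linearly to
# "shapes" (the paper uses a deep decoder over volumetric representations).
D_LATENT, D_SHAPE = 2, 16
W = rng.normal(size=(D_SHAPE, D_LATENT))       # fixed, pre-trained decoder weights

def decode(z):
    return W @ z                               # complete shape from latent code z

# A sparse, noisy observation: only every other entry of the shape is seen.
mask = np.arange(D_SHAPE) % 2 == 0
z_true = rng.normal(size=D_LATENT)
obs = decode(z_true) + 0.01 * rng.normal(size=D_SHAPE)

# (a) Data-driven baseline: per-instance gradient descent on the regularized
#     negative log-likelihood -- accurate but slow, repeated for every input.
def fit_ml(obs, mask, steps=1000, lr=0.02, lam=1e-3):
    z = np.zeros(D_LATENT)
    for _ in range(steps):
        r = mask * (decode(z) - obs)           # residual on observed entries only
        z -= lr * (W.T @ r + lam * z)          # grad of 0.5*||r||^2 + 0.5*lam*||z||^2
    return z

# (b) Amortized maximum likelihood: learn a map obs -> z offline from
#     synthetic (masked observation, latent) pairs; test time is one pass.
Z_train = rng.normal(size=(1000, D_LATENT))
X_train = (Z_train @ W.T) * mask               # masked synthetic observations
E, *_ = np.linalg.lstsq(X_train, Z_train, rcond=None)   # linear "encoder"

err_fit = np.linalg.norm(decode(fit_ml(obs, mask)) - decode(z_true))
err_amortized = np.linalg.norm(decode((mask * obs) @ E) - decode(z_true))
```

Both routes recover the full shape from half its entries up to the noise level, but the amortized encoder replaces a thousand gradient steps per input with a single matrix product, which mirrors the speed advantage the summary claims over per-instance, data-driven fitting.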

MSC:

68T45 Machine vision and scene understanding
68T05 Learning and adaptive systems in artificial intelligence
68T07 Artificial neural networks and deep learning
68T40 Artificial intelligence for robotics

References:

[1] Abramowitz, M., Handbook of mathematical functions, with formulas, graphs, and mathematical tables (1974), New York: Dover Publications, New York
[2] Agarwal, S., Mierle, K., & others (2012). Ceres solver. http://ceres-solver.org.
[3] Aubry, M., Maturana, D., Efros, A., Russell, B., & Sivic, J. (2014). Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
[4] Bao, S., Chandraker, M., Lin, Y., & Savarese, S. (2013). Dense object reconstruction with semantic priors. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[5] Besl, P.; McKay, ND, A method for registration of 3d shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14, 239-256 (1992)
[6] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2016). Variational inference: A review for statisticians. arXiv:1601.00670.
[7] Brock, A., Lim, T., Ritchie, J. M., & Weston, N. (2016). Generative and discriminative voxel modeling with convolutional neural networks. arXiv:1608.04236.
[8] Chang, A. X., Funkhouser, T. A., Guibas, L. J., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). Shapenet: An information-rich 3d model repository. arXiv:1512.03012.
[9] Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., & Urtasun, R. (2016). 3d object proposals using stereo imagery for accurate object class detection. arXiv:1608.07711.
[10] Choy, C. B., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016). 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European conference on computer vision (ECCV).
[11] Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016). 3d u-net: Learning dense volumetric segmentation from sparse annotation. arXiv:1606.06650.
[12] Cignoni, P., Callieri, M., Corsini, M., Dellepiane, M., Ganovelli, F., & Ranzuglia, G. (2008). Meshlab: An open-source mesh processing tool. In Eurographics Italian chapter conference.
[13] Collobert, R., Kavukcuoglu, K., & Farabet, C. (2011). Torch7: A matlab-like environment for machine learning. In Advances in neural information processing systems (NIPS) workshops.
[14] Curless, B., & Levoy, M. (1996). A volumetric method for building complex models from range images. In ACM transaction on graphics (SIGGRAPH).
[15] Dai, A., Qi, C. R., & Nießner, M. (2017). Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[16] Dame, A., Prisacariu, V., Ren, C., & Reid, I. (2013). Dense reconstruction using 3D object shape priors. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[17] Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision (ICCV).
[18] Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems (NIPS).
[19] Engelmann, F., Stückler, J., & Leibe, B. (2016). Joint object pose estimation and shape reconstruction in urban street scenes using 3D shape priors. In Proceedings of the German conference on pattern recognition (GCPR).
[20] Engelmann, F., Stückler, J., & Leibe, B. (2017). SAMP: shape and motion priors for 4d vehicle reconstruction. In Proceedings of the IEEE winter conference on applications of computer vision (WACV), pp. 400-408.
[21] Fan, H., Su, H., & Guibas, L. J. (2017). A point set generation network for 3d object reconstruction from a single image. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[22] Firman, M., Mac Aodha, O., Julier, S., & Brostow, G. J. (2016). Structured prediction of unobserved voxels from a single depth image. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[23] Furukawa, Y.; Hernandez, C., Multi-view stereo: A tutorial, Foundations and Trends in Computer Graphics and Vision, 9, 1-2, 1-148 (2013)
[24] Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[25] Gershman, S., & Goodman, N. D. (2014). Amortized inference in probabilistic reasoning. In Proceedings of the annual meeting of the cognitive science society.
[26] Girdhar, R., Fouhey, D. F., Rodriguez, M., & Gupta, A. (2016). Learning a predictable and generative vector representation for objects. In Proceedings of the European conference on computer vision (ECCV).
[27] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Conference on artificial intelligence and statistics (AISTATS).
[28] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (NIPS).
[29] Güney, F., & Geiger, A., (2015). Displets: Resolving stereo ambiguities using object knowledge. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[30] Gupta, S., Arbeláez, P. A., Girshick, R. B., & Malik, J. (2015). Aligning 3D models to RGB-D images of cluttered scenes. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[31] Gwak, J., Choy, C. B., Garg, A., Chandraker, M., & Savarese, S. (2017). Weakly supervised generative adversarial networks for 3d reconstruction. arXiv:1705.10904.
[32] Haene, C., Savinov, N., & Pollefeys, M. (2014). Class specific 3d object shape priors using surface normals. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[33] Han, X., Li, Z., Huang, H., Kalogerakis, E., & Yu, Y. (2017). High-resolution shape completion using deep neural networks for global structure and local geometry inference. In Proceedings of the IEEE international conference on computer vision (ICCV), pp. 85-93.
[34] Häne, C., Tulsiani, S., & Malik, J. (2017). Hierarchical surface prediction for 3d object reconstruction. arXiv:1704.00710.
[35] Im, D. J., Ahn, S., Memisevic, R., & Bengio, Y. (2017). Denoising criterion for variational auto-encoding framework. In Proceedings of the conference on artificial intelligence (AAAI), pp. 2059-2065.
[36] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the international conference on machine learning (ICML).
[37] Jensen, R. R., Dahl, A. L., Vogiatzis, G., Tola, E., & Aanæs, H. (2014). Large scale multi-view stereopsis evaluation. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[38] Jones, E., Oliphant, T., Peterson, P., et al. (2001). SciPy: Open source scientific tools for Python. http://www.scipy.org/.
[39] Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2015). Category-specific object reconstruction from a single image. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[40] Kato, H., Ushiku, Y., & Harada, T. (2017). Neural 3d mesh renderer. arXiv:1711.07566.
[41] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the international conference on learning representations (ICLR).
[42] Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the international conference on learning representations (ICLR).
[43] Kroemer, O., Amor, H. B., Ewerton, M., & Peters, J. (2012). Point cloud completion using extrusions. In IEEE-RAS international conference on humanoid robots (humanoids).
[44] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In Proceedings of the international conference on 3D vision (3DV).
[45] Law, AJ; Aliaga, DG, Single viewpoint model completion of symmetric objects for digital inspection, Computer Vision and Image Understanding (CVIU), 115, 5, 603-610 (2011)
[46] Leotta, M. J., & Mundy, J. L. (2009). Predicting high resolution image edges with a generic, adaptive, 3-d vehicle model. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
[47] Li, Y., Dai, A., Guibas, L., & Nießner, M. (2015). Database-assisted object retrieval for real-time 3d reconstruction. In Computer graphics forum.
[48] Lin, C., Kong, C., & Lucey, S. (2017). Learning efficient point cloud generation for dense 3d object reconstruction. arXiv:1706.07036.
[49] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Proceedings of the European conference on computer vision (ECCV).
[50] Liu, S., Ororbia, A. G. II, & Giles, C. L. (2017). Learning a hierarchical latent-variable model of voxelized 3d shapes. arXiv:1705.05994.
[51] Lorensen, W. E., & Cline, H. E. (1987). Marching cubes: A high resolution 3d surface construction algorithm. In ACM transaction on graphics (SIGGRAPH).
[52] Ma, L., & Sibley, G. (2014). Unsupervised dense object discovery, detection, tracking and reconstruction. In Proceedings of the European conference on computer vision (ECCV).
[53] Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[54] Nan, L.; Xie, K.; Sharf, A., A search-classify approach for cluttered indoor scene understanding, ACM Transaction on Graphics, 31, 6, 137:1-137:10 (2012)
[55] Nash, C.; Williams, CKI, The shape variational autoencoder: A deep generative model of part-segmented 3d objects, Eurographics Symposium on Geometry Processing (SGP), 36, 5, 1-12 (2017)
[56] Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., & Fitzgibbon, A. (2011). Kinectfusion: Real-time dense surface mapping and tracking. In Proceedings of the international symposium on mixed and augmented reality (ISMAR).
[57] Nguyen, D. T., Hua, B., Tran, M., Pham, Q., & Yeung, S. (2016). A field model for repairing 3d shapes. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[58] Oswald, M. R., Töppe, E., Nieuwenhuis, C., & Cremers, D. (2013). A review of geometry recovery from a single image focusing on curved object reconstruction, pp. 343-378. · Zbl 1314.68323
[59] Pauly, M., Mitra, N. J., Giesen, J., Gross, M. H., & Guibas, L. J. (2005). Example-based 3d scan completion. In: Eurographics symposium on geometry processing (SGP).
[60] Pauly, M.; Mitra, NJ; Wallner, J.; Pottmann, H.; Guibas, LJ, Discovering structural regularity in 3d geometry, ACM Transaction on Graphics, 27, 3, 43:1-43:11 (2008)
[61] Pepik, B., Stark, M., Gehler, P. V., Ritschel, T., & Schiele, B. (2015). 3d object class detection in the wild. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp. 1-10.
[62] Pizlo, Z. (2007). Human perception of 3d shapes. In Proceedings of the international conference on computer analysis of images and patterns (CAIP).
[63] Pizlo, Z., 3D shape: Its unique place in visual perception (2010), Cambridge, MA: MIT Press
[64] Prisacariu, V., Segal, A., & Reid, I. (2013). Simultaneous monocular 2d segmentation, 3d pose recovery and 3d reconstruction. In Proceedings of the Asian conference on computer vision (ACCV).
[65] Prisacariu, V. A., & Reid, I. (2011). Nonlinear shape manifolds as shape priors in level set segmentation and tracking. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[66] Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017a). Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[67] Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems (NIPS).
[68] Rezende, D. J., Eslami, S. M. A., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3d structure from images. arXiv:1607.00662.
[69] Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the international conference on machine learning (ICML).
[70] Riegler, G., Ulusoy, A. O., Bischof, H., & Geiger, A. (2017a). OctNetFusion: Learning depth fusion from data. In Proceedings of the international conference on 3D vision (3DV).
[71] Riegler, G., Ulusoy, A. O., & Geiger, A. (2017b). Octnet: Learning deep 3d representations at high resolutions. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[72] Ritchie, D., Horsfall, P., & Goodman, N. D. (2016). Deep amortized inference for probabilistic programs. arXiv:1610.05735.
[73] Rock, J., Gupta, T., Thorsen, J., Gwak, J., Shin, D., & Hoiem, D. (2015). Completing 3d object shape from one depth image. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[74] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention (MICCAI).
[75] Sandhu, R., Dambreville, S., Yezzi, A. J., & Tannenbaum, A. (2009). Non-rigid 2d-3d pose estimation and 2d image segmentation. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). · Zbl 1223.49048
[76] Sandhu, R.; Dambreville, S.; Yezzi, AJ; Tannenbaum, A., A nonrigid kernel-based framework for 2d-3d pose estimation and 2d image segmentation, IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 33, 6, 1098-1115 (2011)
[77] Sharma, A., Grau, O., & Fritz, M. (2016). Vconv-dae: Deep volumetric shape learning without object labels. arXiv:1604.03755.
[78] Smith, E., & Meger, D. (2017). Improved adversarial systems for 3d object generation and reconstruction. arXiv:1707.09557.
[79] Song, S., & Xiao, J. (2014). Sliding shapes for 3D object detection in depth images. In Proceedings of the European conference on computer vision (ECCV).
[80] Steinbrucker, F., Kerl, C., & Cremers, D. (2013). Large-scale multi-resolution surface reconstruction from rgb-d sequences. In Proceedings of the IEEE international conference on computer vision (ICCV).
[81] Stutz, D., & Geiger, A. (2018). Learning 3d shape completion from laser scan data with weak supervision. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[82] Sung, M.; Kim, VG; Angst, R.; Guibas, LJ, Data-driven structural priors for shape completion, ACM Transaction on Graphics, 34, 6, 175:1-175:11 (2015)
[83] Tatarchenko, M., Dosovitskiy, A., & Brox, T. (2017). Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE international conference on computer vision (ICCV).
[84] Thrun, S., & Wegbreit, B. (2005). Shape from symmetry. In Proceedings of the IEEE international conference on computer vision (ICCV), pp. 1824-1831.
[85] Tulsiani, S., Efros, A. A., & Malik, J. (2018). Multi-view consistency as supervisory signal for learning shape and pose prediction. arXiv:1801.03910.
[86] Tulsiani, S., Zhou, T., Efros, A. A., & Malik, J. (2017). Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[87] van der Maaten, L.; Hinton, G., Visualizing high-dimensional data using t-sne, Journal of Machine Learning Research (JMLR), 9, 2579-2605 (2008) · Zbl 1225.68219
[88] Varley, J., DeChant, C., Richardson, A., Ruales, J., & Allen, P. K. (2017). Shape completion enabled robotic grasping. In Proceedings of IEEE international conference on intelligent robots and systems (IROS).
[89] Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J., Cheverie, J., Fidler, S., & Urtasun, R. (2016). Torontocity: Seeing the world with a million eyes. arXiv:1612.00423.
[90] Wang, W., Huang, Q., You, S., Yang, C., & Neumann, U. (2017). Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV).
[91] Whelan, T., Leutenegger, S., Salas-Moreno, R. F., Glocker, B., & Davison, A. J. (2015). Elasticfusion: Dense SLAM without a pose graph. In Proceedings of robotics: science and systems (RSS).
[92] Wu, J., Xue, T., Lim, J. J., Tian, Y., Tenenbaum, J. B., Torralba, A., & Freeman, W. T. (2016a). Single image 3d interpreter network. In Proceedings of the European conference on computer vision (ECCV).
[93] Wu, J., Zhang, C., Xue, T., Freeman, B., & Tenenbaum, J. (2016b). Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems (NIPS).
[94] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3d shapenets: A deep representation for volumetric shapes. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[95] Xie, J., Kiefel, M., Sun, M. T., & Geiger, A. (2016). Semantic instance annotation of street scenes by 3d-2d label transfer. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
[96] Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016). Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in neural information processing systems (NIPS).
[97] Yang, B., Rosa, S., Markham, A., Trigoni, N., & Wen, H. (2018). 3d object dense reconstruction from a single depth view. arXiv:1802.00411.
[98] Yang, B., Wen, H., Wang, S., Clark, R., Markham, A., & Trigoni, N. (2017). 3d object reconstruction from a single depth view with adversarial learning. arXiv:1708.07969.
[99] Zheng, Q.; Sharf, A.; Wan, G.; Li, Y.; Mitra, NJ; Cohen-Or, D.; Chen, B., Non-local scan consolidation for 3d urban scenes, ACM Transaction on Graphics, 29, 4, 94:1-94:9 (2010)
[100] Zheng, S., Prisacariu, V. A., Averkiou, M., Cheng, M. M., Mitra, N. J., Shotton, J., Torr, P. H. S., & Rother, C. (2015). Object proposal estimation in depth images using compact 3d shape manifolds. In Proceedings of the German conference on pattern recognition (GCPR).
[101] Zia, M.; Stark, M.; Schiele, B.; Schindler, K., Detailed 3D representations for object recognition and modeling, IEEE Transaction on Pattern Analysis and Machine Intelligence (PAMI), 35, 11, 2608-2623 (2013)
[102] Zia, M. Z., Stark, M., & Schindler, K. (2014). Are cars just 3d boxes? Jointly estimating the 3d shape of multiple objects. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp. 3678-3685.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.