
Bayesian distillation of deep learning models. (English. Russian original) Zbl 1491.68180

Autom. Remote Control 82, No. 11, 1846-1856 (2021); translation from Avtom. Telemekh. 2021, No. 11, 16-29 (2021).
The authors present a Bayesian approach to knowledge distillation in teacher-student networks. Knowledge distillation was first proposed by G. Hinton et al. [“Distilling the knowledge in a neural network”, Preprint, arXiv:1503.02531], who suggested training a large teacher network on ground-truth labels and then training a smaller student model on the teacher's softened output probabilities, the “soft targets”.
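A minimal sketch of that soft-target objective, in PyTorch-style code; the temperature T, the weighting alpha and the function name are illustrative choices, not values taken from the paper under review:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation loss: a weighted sum of a soft-target term
    (KL divergence between temperature-softened teacher and student output
    distributions) and the usual cross-entropy on the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps the soft-target gradients comparable in magnitude
    # to the hard-label gradients, as recommended by Hinton et al.
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```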
This work extends that teacher-student framework: the authors argue that the parameters of the student network can be initialized from those of the teacher network.
Since the teacher network is usually larger than the student network, such an initialization is not directly possible; the authors therefore propose to prune the teacher network so that it has the same architecture as the student network.
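The pruning-based initialization can be illustrated as follows. This is only a sketch under the assumption of magnitude-based neuron selection for a single fully connected layer; the paper's own relevance-based pruning criterion may differ, and all names and sizes below are hypothetical:

```python
import torch

def prune_to_width(weight, bias, out_width):
    """Keep the out_width output neurons of a teacher layer with the largest
    weight norm, so the layer shrinks to the width of the corresponding
    student layer. The L2-norm criterion is a common heuristic and stands in
    for the relevance criterion used in the paper."""
    scores = weight.norm(dim=1)                       # one score per output neuron
    keep = scores.topk(out_width).indices.sort().values
    return weight[keep].clone(), bias[keep].clone()

# Hypothetical usage: teacher layer 784 -> 1024, student layer 784 -> 256.
teacher_w, teacher_b = torch.randn(1024, 784), torch.randn(1024)
student_w, student_b = prune_to_width(teacher_w, teacher_b, out_width=256)
# student_w, student_b now have the student's shape and can serve as its initialization.
```

In a multilayer network the columns of the next layer's weight matrix would have to be sliced with the same index set, so that the shapes of consecutive layers remain consistent.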
Under the assumption that the posterior distribution of the teacher network's parameters is Gaussian, the authors prove that the posterior of the pruned teacher network is Gaussian as well.
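The structural fact behind this statement is that the multivariate Gaussian family is closed under marginalization; in notation of my own (not the paper's), partitioning the teacher parameters into a retained and a removed block gives

```latex
\text{If } \theta =
\begin{pmatrix}\theta_{\mathrm{keep}}\\ \theta_{\mathrm{drop}}\end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix}\mu_{\mathrm{keep}}\\ \mu_{\mathrm{drop}}\end{pmatrix},
\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}
\right),
\quad\text{then}\quad
\theta_{\mathrm{keep}} \sim \mathcal{N}(\mu_{\mathrm{keep}}, \Sigma_{11}),
```

so discarding the removed block leaves a Gaussian posterior over exactly the parameters that survive the pruning.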

MSC:

68T07 Artificial neural networks and deep learning
62F15 Bayesian inference

References:

[1] Krizhevsky, A., Sutskever, I., and Hinton, G., ImageNet classification with deep convolutional neural networks, Proc. 25th Int. Conf. Neural Inf. Process. Syst. (2012), vol. 1, pp. 1097-1105.
[2] Simonyan, K. and Zisserman, A., Very deep convolutional networks for large-scale image recognition, Int. Conf. Learn. Representations (San Diego, 2015).
[3] He, K., Ren, S., Sun, J., and Zhang, X., Deep residual learning for image recognition, Proc. IEEE Conf. Comput. Vision Pattern Recognit. (Las Vegas, 2016), pp. 770-778.
[4] Devlin, J., Chang, M., Lee, K., and Toutanova, K., BERT: pre-training of deep bidirectional transformers for language understanding, Proc. 2019 Conf. North Am. Ch. Assoc. Comput. Linguist.: Hum. Lang. Technol. (Minnesota, 2019), vol. 1, pp. 4171-4186.
[5] Vaswani, A., Gomez, A., Jones, L., Kaiser, L., Parmar, N., Polosukhin, I., Shazeer, N., and Uszkoreit, J., Attention is all you need, in Advances in Neural Information Processing Systems, 2017, vol. 5, pp. 6000-6010.
[6] Al-Rfou, R., Barua, A., Constant, N., Kale, M., Raffel, C., Roberts, A., Siddhant, A., and Xue, L., mT5: a massively multilingual pre-trained text-to-text transformer, Proc. 2021 Conf. North Am. Ch. Assoc. Comput. Linguist.: Hum. Lang. Technol. (2021), pp. 483-498.
[7] Brown, T. et al., GPT3: language models are few-shot learners, in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877-1901.
[8] Zheng, T.; Liu, X.; Qin, Z.; Ren, K., Adversarial attacks and defenses in deep learning, Engineering, 6, 346-360 (2020) · doi:10.1016/j.eng.2019.12.012
[9] Hinton, G., Dean, J., and Vinyals, O., Distilling the knowledge in a neural network, NIPS Deep Learn. Representation Learn. Workshop (2015).
[10] Vapnik, V.; Izmailov, R., Learning using privileged information: similarity control and knowledge transfer, J. Mach. Learn. Res., 16, 2023-2049 (2015) · Zbl 1351.68240
[11] Lopez-Paz, D., Bottou, L., Scholkopf, B., and Vapnik, V., Unifying distillation and privileged information, Int. Conf. Learn. Representations (Puerto Rico, 2016).
[12] Burges, C., Cortes, C., and LeCun, Y., The MNIST Dataset of Handwritten Digits, 1998. http://yann.lecun.com/exdb/mnist/index.html.
[13] Huang, Z. and Wang, N., Like What You Like: Knowledge Distill via Neuron Selectivity Transfer, 2019.
[14] Hinton, G., Krizhevsky, A., and Nair, V., CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html.
[15] Deng, J. et al., Imagenet: a large-scale hierarchical image database, Proc. IEEE Conf. Comput. Vision Pattern Recognit. (Miami, 2009), pp. 248-255.
[16] LeCun, Y., Denker, J., and Solla, S., Optimal brain damage, Advances in Neural Information Processing Systems, 1989, vol. 2, pp. 598-605.
[17] Graves, A., Practical variational inference for neural networks, Advances in Neural Information Processing Systems, 2011, vol. 24, pp. 2348-2356.
[18] Grabovoy, A. V.; Bakhteev, O. Y.; Strijov, V. V., Estimation of relevance for neural network parameters, Inf. Appl., 13, 2, 62-70 (2019)
[19] Rasul, K., Vollgraf, R., and Xiao, H., Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv Preprint, 2017.