
What’s in a name? – Gender classification of names with character based machine learning models. (English) Zbl 1473.68194

Summary: Gender information is no longer a mandatory input when registering for an account at many leading Internet companies. However, predicting demographic information such as gender and age remains an important task, especially for intervening in unintentional gender/age bias in recommender systems. It is therefore necessary to infer the gender of users who did not provide this information during registration. We consider the problem of predicting the gender of registered users based on their declared name. By analyzing the first names of 100M+ users, we found that gender can be classified very effectively using the composition of the name strings. We propose a number of character-based machine learning models and demonstrate that they infer the gender of users with much higher accuracy than baseline models. Moreover, we show that using last names in addition to first names further improves classification performance.
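The summary does not spell out the models, so the following is only a minimal sketch of what "using the composition of the name strings" could mean in practice: turning a name into counts of character n-grams, which a downstream classifier could consume. The function name `char_ngrams`, the boundary markers `^` and `$`, and the n-gram range are assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def char_ngrams(name: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count character n-grams of a name (hypothetical helper).

    The name is lower-cased and wrapped in '^' and '$' so that prefixes
    and suffixes (e.g. a final 'a') become distinct features.
    """
    padded = f"^{name.strip().lower()}$"
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded string.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# "Andrea" appears in the reference list as an example of an ambiguous name.
features = char_ngrams("Andrea")
```

Marking the start and end of the string is a design choice: it lets a model distinguish a letter at the end of a name from the same letter in the middle.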

MSC:

68T50 Natural language processing
68T05 Learning and adaptive systems in artificial intelligence

Software:

BERT; Adam; GNMT; word2vec

References:

[1] 3000 most common words in english. https://www.ef.edu/english-resources/english-vocabulary/top-3000-words/ (2020). [Online; accessed March 22, 2020]
[2] SP 500 Companies (2020). https://datahub.io/core/s-and-p-500-companies. [Online; accessed March 22, 2020]
[3] Social Security Administration: National data on the relative frequency of given names in the population of U.S. births where the individual has a social security number (tabulated based on social security records as of March 3, 2019). http://www.ssa.gov/oact/babynames/names.zip
[4] Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In: Sixth International AAAI Conference on Weblogs and Social Media
[5] Ambekar A, Ward C, Mohammed J, Male S, Skiena S (2009) Name-ethnicity classification from open sources. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 49-58. ACM
[6] Beretta V, Maccagnola D, Cribbin T, Messina E (2015) An interactive method for inferring demographic attributes in twitter. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 113-122. ACM
[7] Brown E (2017) Gender inference from character sequences in multinational first names. https://towardsdatascience.com/name2gender-introduction-626d89378fb0#408a
[8] Burger JD, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on twitter. In: Proceedings of the conference on empirical methods in natural language processing, pp. 1301-1309. Association for Computational Linguistics
[9] Chen P, Sun Z, Bing L, Yang W (2017) Recurrent attention network on memory for aspect sentiment analysis. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 452-461
[10] Ciot M, Sonderegger M, Ruths D (2013) Gender inference of twitter users in non-english contexts. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1136-1145
[11] Google Cloud Content Categories (2019). https://cloud.google.com/natural-language/docs/categories
[12] Culotta A, Kumar NR, Cutler J (2015) Predicting the demographics of twitter users from website traffic data. In: AAAI, pp. 72-78
[13] Culotta A, Ravi NK, Cutler J (2016) Predicting twitter user demographics using distant supervision from website traffic data. J Artif Intell Res 55:389-408. doi:10.1613/jair.4935
[14] Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
[15] Grbovic M, Radosavljevic V, Djuric N, Bhamidipati N, Nagarajan A (2015) Gender and interest targeting for sponsored post advertising at tumblr. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1819-1828. ACM, New York, NY, USA. doi:10.1145/2783258.2788616
[16] Han S, Hu Y, Skiena S, Coskun B, Liu M, Qin H, Perez J (2017) Generating look-alike names for security challenges. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pp. 57-67. ACM, New York, NY, USA. doi:10.1145/3128572.3140441
[17] Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735-1780
[18] Karako C, Manggala P (2018) Using image fairness representations in diversity-based re-ranking for recommendations. In: Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 23-28. ACM
[19] Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
[20] Knowles R, Carroll J, Dredze M (2016) Demographer: Extremely simple name demographics. In: Proceedings of the First Workshop on NLP and Computational Social Science, pp. 108-113
[21] Kokkos A, Tzouramanis T (2014) A robust gender inference model for online social networks and its application to linkedin and twitter. First Monday 19(9)
[22] Liu W, Al Zamal F, Ruths D (2012) Using social media to infer gender composition of commuter populations. In: Sixth International AAAI Conference on Weblogs and Social Media
[23] Liu W, Ruths D (2013) What’s in a name? Using first names as features for gender inference in twitter. In: Analyzing Microtext AAAI 2013 Spring Symposium, pp. 10-16. AAAI, Palo Alto, CA, USA
[24] Lu F (2018) The 11 Most Beautiful Chinese Names and What They Mean. https://bit.ly/2yGSNO7
[25] Ludu PS (2014) Inferring gender of a twitter user using celebrities it follows. arXiv preprint arXiv:1405.6667
[26] Merler M, Cao L, Smith JR (2015) You are what you tweet...pic! gender prediction based on semantic analysis of social media images. In: 2015 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6. IEEE
[27] Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR
[28] Mueller J, Stumme G (2016) Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on SocialInformatics, Data Science 2016, p. 47. ACM
[29] Otterbacher J (2010) Inferring gender of movie reviewers: exploiting writing style, content and metadata. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 369-378. ACM
[30] Pennacchiotti M, Popescu AM (2011) A machine learning approach to twitter user classification. In: Fifth International AAAI Conference on Weblogs and Social Media
[31] Rao D, Yarowsky D (2010) Detecting latent user properties in social media. In: Proc. of the NIPS MLSN Workshop, pp. 1-7. Citeseer
[32] Sakaki S, Miura Y, Ma X, Hattori K, Ohkuma T (2014) Twitter user gender inference using combined analysis of text and image processing. In: Proceedings of the Third Workshop on Vision and Language, pp. 54-61
[33] Wang S, Manning CD (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pp. 90-94. Association for Computational Linguistics, Stroudsburg, PA, USA
[34] Wang Y, Huang M, Zhao L, et al. (2016) Attention-based lstm for aspect-level sentiment classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606-615
[35] Wikipedia: Andrea. https://en.wikipedia.org/wiki/Andrea [Online; accessed March 22, 2020]
[36] Wikipedia: Toni. https://en.wikipedia.org/wiki/Toni [Online; accessed March 22, 2020]
[37] Wikipedia: Unisex name. https://en.wikipedia.org/wiki/Unisex_name [Online; accessed March 22, 2020]
[38] Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser Ł, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR arXiv:1609.08144
[39] Yao S, Huang B (2017) Beyond parity: Fairness objectives for collaborative filtering. In: Advances in Neural Information Processing Systems, pp. 2921-2930
[40] Ye J, Han S, Hu Y, Coskun B, Liu M, Qin H, Skiena S (2017) Nationality classification using name embeddings. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pp. 1897-1906. ACM, New York, NY, USA. doi:10.1145/3132847.3133008
[41] Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649-657. MIT Press, Cambridge, MA, USA
[42] Zhou X, Wan X, Xiao J (2016) Attention-based lstm network for cross-lingual sentiment classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 247-256