zbMATH — the first resource for mathematics

Recognizing named entities in specific domain. (English) Zbl 07266206
Summary: The paper presents the results of applying the BERT representation model to the named entity recognition (NER) task for the cybersecurity domain in Russian. We compare several approaches to domain-specific NER that combine BERT fine-tuning on a domain-specific text collection, general labeled data, domain-specific data augmentation, and a domain-specific annotated dataset. We show that a BERT model fine-tuned on a domain text collection and pre-trained on the combination of a general dataset and augmented data achieves the best named entity recognition results. We also study the computational performance of the BERT model in the so-called mixed-precision regime.
68 Computer science
62 Statistics
[1] Afanasyev, I.; Voevodin, V.; Rudyak, V.; Emelyanenko, A., The practice of conducting performance analysis of supercomputer applications, Numer. Methods Program., 20, 346-355 (2019)
[2] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by jointly learning to align and translate,’’ arXiv:1409.0473 (2014).
[3] V. Bocharov, A. Starostin, S. Alexeeva, A. Bodrova, A. Chunchunkov, S. Dzhumaev, I. Efimenko, D. Granovsky, V. Khoroshevsky, I. Krylova, M. Nikolaeva, I. Smurov, and S. Toldova, ‘‘FactRuEval 2016: Evaluation of named entity recognition and fact extraction systems for Russian,’’ in Proceedings of International Conference on Computational Linguistics Dialog-2016 (2016), No. 22, pp. 702-720.
[4] R. Bridges, C. Jones, M. Iannacone, K. Testa, and J. Goodall, ‘‘Automatic labeling for entity extraction in cyber security,’’ arXiv:1308.4941 (2013).
[5] L. Chen, A. Moschitti, G. Castellucci, A. Favalli, and R. Romagnoli, ‘‘Transfer learning for industrial applications of named entity recognition,’’ in Proceedings of the 2nd Workshop on Natural Language for Artificial Intelligence NL4AI 2018 (2018), pp. 129-140.
[6] DeepPavlov Documentation. http://docs.deeppavlov.ai/en/master/. Accessed Dec. 25, 2019.
[7] J. Devlin, M. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training of deep bidirectional transformers for language understanding,’’ arXiv:1810.04805 (2018).
[8] Fellbaum, Ch., WordNet: An Electronic Lexical Database (1998), Boston, MA: MIT · Zbl 0913.68054
[9] H. Gasmi, A. Bouras, and J. Laval, ‘‘LSTM recurrent neural networks for cybersecurity named entity recognition,’’ in Proceedings of the International Conference on Software Engineering Advances ICSEA, 2018, Vol. 11.
[10] J. Howard and S. Ruder, ‘‘Universal language model fine-tuning for text classification,’’ arXiv:1801.06146 (2018).
[11] A. Joshi, R. Lal, T. Finin, and A. Joshi, ‘‘Extracting cybersecurity related linked data from text,’’ in Proceedings of the 2013 IEEE 7th International Conference on Semantic Computing (2013), pp. 252-259.
[12] S. Kobayashi, ‘‘Contextual augmentation: Data augmentation by words with paradigmatic relations,’’ in Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics NAACL-HLT, 2018, pp. 452-457.
[13] Y. Kuratov and M. Arkhipov, ‘‘Adaptation of deep bidirectional multilingual transformers for Russian language,’’ arXiv:1905.07213 (2019).
[14] J. Lafferty, A. McCallum, and F. Pereira, ‘‘Conditional random fields: Probabilistic models for segmenting and labeling sequence data,’’ in Proceedings of the International Conference on Machine Learning ICML-2001 (2001).
[15] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, ‘‘Neural architectures for named entity recognition,’’ arXiv:1603.01360 (2016).
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of word representations in vector space,’’ arXiv:1301.3781 (2013).
[17] V. Mozharova and N. Loukachevitch, ‘‘Combining knowledge and CRF-based approach to named entity recognition in Russian,’’ in Proceedings of the International Conference on Analysis of Images, Social Networks and Texts (Springer, Cham, 2016), pp. 185-195.
[18] V. Mozharova and N. Loukachevitch, ‘‘Recognizing names in Islam-related Russian Twitter,’’ in Proceedings of the Conference on Data Analytics and Management in Data Intensive Domains DAMDID-2017 (2017), pp. 319-324.
[19] J. Piskorski, L. Laskova, M. Marcinczuk, L. Pivovarova, P. Priban, J. Steinberger, and R. Yangarber, ‘‘The second cross-lingual challenge on recognition, normalization, classification, and linking of named entities across Slavic languages,’’ in Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing BSNLP-2019 (2019), pp. 63-74.
[20] E. Sang and F. De Meulder, ‘‘Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,’’ in Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL 2003 (2003), Vol. 4, pp. 142-147.
[21] Sirotina, A.; Loukachevitch, N., Named entity recognition in information security domain for Russian, Proceedings of the Recent Advances in Natural Language Processing, RANLP-2019, 1115-1122 (2019)
[22] K. Shinzato, S. Sekine, N. Yoshinaga, and K. Torisawa, ‘‘Constructing dictionaries for named entity recognition on specific domains from the Web,’’ in Proceedings of the Web Content Mining with Human Language Technologies Workshop on the 5th International Semantic Web (2006).
[23] B. Strauss, B. Toma, A. Ritter, M. de Marneffe, and W. Xu, ‘‘Results of the WNUT16 named entity recognition shared task,’’ in Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (2016), pp. 138-144.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proceedings of the International Conference on Advances in Neural Information Processing Systems (2017), pp. 5998-6008.
[25] J. Wei and K. Zou, ‘‘EDA: Easy data augmentation techniques for boosting performance on text classification tasks,’’ in Proceedings of the Conference on Empirical Methods in Natural Language Processing EMNLP-2019 (2019), pp. 6381-6387.
[26] Y. Wu, M. Schuster, Z. Chen, Q. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., ‘‘Google’s neural machine translation system: Bridging the gap between human and machine translation,’’ arXiv:1609.08144 (2016).
[27] W. Yang Wang and D. Yang, ‘‘That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets,’’ in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015), pp. 2557-2563.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.