
Simple strategies for semi-supervised feature selection. (English) Zbl 1457.68239

Summary: What is the simplest thing you can do to solve a problem? In the context of semi-supervised feature selection, we tackle exactly this question: how much can we gain from two simple, classifier-independent strategies? Given some binary-labelled data and some unlabelled data, we could assume the unlabelled examples are all positive, or assume they are all negative. These minimalist, seemingly naive approaches have not previously been studied in depth. Through theoretical and empirical studies, however, we show that they yield powerful results for feature selection, via hypothesis testing and feature ranking. Combining them with some "soft" prior knowledge of the domain, we derive two novel algorithms (Semi-JMI, Semi-IAMB) that outperform significantly more complex competing methods, showing particularly good performance when the labels are missing-not-at-random. We conclude that simple approaches to this problem can work surprisingly well, and that in many situations we can provably recover the exact feature selection behaviour we would obtain had the entire dataset been labelled.
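To make the first of these strategies concrete, the following is a minimal illustrative sketch, not the paper's Semi-JMI or Semi-IAMB procedure: every unlabelled example is treated as a negative, and discrete features are then ranked by their estimated mutual information with this surrogate label. The variable names, toy data and plug-in mutual information estimator are hypothetical choices made purely for illustration.

    # Illustrative sketch of the "assume all unlabelled are negative" strategy:
    # relabel unlabelled examples as negatives, then rank discrete features by
    # mutual information with the surrogate label. Hypothetical example only,
    # not the Semi-JMI/Semi-IAMB algorithms derived in the paper.
    import numpy as np

    def mutual_information(x, y):
        """Plug-in estimate of I(X;Y) in nats for two discrete 1-D integer arrays."""
        joint = np.zeros((x.max() + 1, y.max() + 1))
        for xi, yi in zip(x, y):
            joint[xi, yi] += 1
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

    def rank_features_assume_negative(X, y):
        """X: (n, d) discrete features; y: labels coded 1, 0, or -1 (unlabelled).
        Returns feature indices sorted by decreasing MI with the surrogate label."""
        y_tilde = np.where(y == 1, 1, 0)   # unlabelled examples treated as negatives
        scores = np.array([mutual_information(X[:, j], y_tilde)
                           for j in range(X.shape[1])])
        return np.argsort(-scores), scores

    # Hypothetical usage: 6 examples, 3 binary features, last two unlabelled.
    X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1],
                  [0, 1, 0], [1, 0, 0], [0, 1, 1]])
    y = np.array([1, 1, 0, 0, -1, -1])
    order, scores = rank_features_assume_negative(X, y)
    print(order, scores)

The symmetric "assume all unlabelled are positive" variant would simply replace the surrogate labelling with np.where(y == 0, 0, 1).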

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

TETRAD
