Confident learning: estimating uncertainty in dataset labels. (English) Zbl 07350692

Summary: Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.


68Txx Artificial intelligence
Full Text: DOI arXiv


In this section, we restate the main theorems for confident learning and provide their proofs.
Lemma 1 (Ideal Thresholds). For a noisy dataset $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta : x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is ideal, then $\forall i \in [m]$, $t_i = \sum_{j \in [m]} p(\tilde{y}{=}i \mid y^*{=}j)\, p(y^*{=}j \mid \tilde{y}{=}i)$.

Proof. We use $t_i$ to denote the thresholds used to partition $X$ into $m$ bins, each estimating one of $X_{y^*}$. By definition,
\[ \forall i \in [m], \quad t_i = \mathbb{E}_{x \in X_{\tilde{y}=i}}\, \hat{p}(\tilde{y}{=}i; x, \theta). \]
For any $t_i$, by Bayes' rule,
\[ t_i = \mathbb{E}_{x \in X_{\tilde{y}=i}} \sum_{j \in [m]} \hat{p}(\tilde{y}{=}i \mid y^*{=}j; x, \theta)\, \hat{p}(y^*{=}j; x, \theta), \]
and because $\hat{p}(\tilde{y}; x, \theta)$ is ideal, this simplifies to $t_i = \sum_{j \in [m]} p(\tilde{y}{=}i \mid y^*{=}j)\, p(y^*{=}j \mid \tilde{y}{=}i)$. The terms with $i = j$ represent the probabilities of correct labeling, whereas when $i \neq j$, the terms give the probabilities of mislabeling, $p(\tilde{y}{=}i \mid y^*{=}j)$, weighted by the probability $p(y^*{=}j \mid \tilde{y}{=}i)$ that the mislabeling is corrected.
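In practice, the thresholds $t_i$ from Lemma 1 are just per-class averages of self-confidence: the mean predicted probability of class $i$ over the examples whose given (noisy) label is $i$. A minimal pure-Python sketch (function and variable names are our own illustration, not code from the paper's release):

```python
# Per-class thresholds t_i: the mean self-confidence p_hat(y_tilde = i; x, theta)
# over all examples x whose given (noisy) label is i.
def class_thresholds(pred_probs, noisy_labels, num_classes):
    """pred_probs: list of length-m probability rows; noisy_labels: list of ints."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for probs, label in zip(pred_probs, noisy_labels):
        sums[label] += probs[label]     # self-confidence in the given label
        counts[label] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy example: 4 examples, 2 classes.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
noisy_labels = [0, 0, 1, 1]
print(class_thresholds(pred_probs, noisy_labels, 2))  # ≈ [0.85, 0.65]
```

Averaging per class (rather than using a single global cutoff) is what makes the counting robust to heterogeneous class confidence, as the later robustness results formalize.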
Theorem 1 (Exact Label Errors). For a noisy dataset $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta : x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is ideal and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row and column, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and, as $n \to \infty$, $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$.

Proof. Alg. 1 defines the construction of the confident joint. We consider Case 1, when there are collisions (trivial by the construction of Alg. 1), and Case 2, when there are no collisions (harder).

Case 1 (collisions): when a collision occurs, $x_k$ is assigned bijectively into the bin $\hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][\arg\max_{i \in [m]} \hat{p}(\tilde{y}{=}i; x_k, \theta)]$. Because $\hat{p}(\tilde{y}; x, \theta)$ is ideal, we can rewrite this as $\hat{X}_{\tilde{y}, y^*}[\tilde{y}_k][\arg\max_{i \in [m]} p(\tilde{y}{=}i \mid y^*{=}y^*_k; x_k)]$, and because each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its column, the $\arg\max$ selects $i = y^*_k$. The noisy labels $\tilde{y}$ are given, thus the confident joint (Eqn. 1) will never place these examples in the wrong bin of $\hat{X}_{\tilde{y}=i, y^*=j}$.

Case 2 (no collisions): it suffices to show two claims: (1) $X_{\tilde{y}=i, y^*=j} \subseteq \hat{X}_{\tilde{y}=i, y^*=j}$, and (2) $X_{\tilde{y}=i, y^* \neq j} \not\subseteq \hat{X}_{\tilde{y}=i, y^*=j}$. Together, claims 1 and 2 suffice to show that $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$.

Proof (Claim 1) of Case 2: inspecting Eqn. (1) and Alg. 1, by the construction of $C_{\tilde{y}, y^*}$ we have that $\forall x \in X_{\tilde{y}=i}$, $\hat{p}(\tilde{y}{=}j \mid y^*{=}j; x, \theta) \geq t_j \implies X_{\tilde{y}=i, y^*=j} \subseteq \hat{X}_{\tilde{y}=i, y^*=j}$: when the left-hand side holds, all examples with noisy label $i$ and hidden, true label $j$ are counted in $\hat{X}_{\tilde{y}=i, y^*=j}$. Thus it suffices to prove $\hat{p}(\tilde{y}{=}j \mid y^*{=}j; x, \theta) \geq t_j$, $\forall x \in X_{\tilde{y}=i}$. Note the change from predicted probability, $\hat{p}$, to an exact probability, $p$: by the ideal condition, the inequality can be written as $p(\tilde{y}{=}j \mid y^*{=}j) \geq t_j$, which holds by Lemma 1 because $t_j$ is a convex combination of the entries $p(\tilde{y}{=}j \mid y^*{=}\cdot)$, each of which is at most $p(\tilde{y}{=}j \mid y^*{=}j)$ since each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row.

Proof (Claim 2) of Case 2: we prove $X_{\tilde{y}=i, y^* \neq j} \not\subseteq \hat{X}_{\tilde{y}=i, y^*=j}$ by contradiction. Assume there exists some example $x_k \in X_{\tilde{y}=i, y^*=z}$ for $z \neq j$ such that $x_k \in \hat{X}_{\tilde{y}=i, y^*=j}$. By Claim 1, $x_k \in \hat{X}_{\tilde{y}=i, y^*=z}$ as well, so $x_k$ collides and the tie is broken with the $\arg\max$; because each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row and column, this $\arg\max$ places $x_k$ in the bin for $y^* = z$, not $y^* = j$, a contradiction.

Therefore $\forall i \in [m], j \in [m]$, $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, i.e., the confident joint exactly counts the partitions $X_{\tilde{y}=i, y^*=j}$ for all pairs $(i, j) \in [m] \times [m]$; thus $C_{\tilde{y}, y^*} \approx n \cdot Q_{\tilde{y}, y^*}$ and $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$. Omitting discretization error, the confident joint $C_{\tilde{y}, y^*}$, when normalized to $\hat{Q}_{\tilde{y}, y^*}$, is an exact estimator for $Q_{\tilde{y}, y^*}$. For example, if the noise rate is $0.39$ but the dataset has only 5 examples in that class, the best possible estimate after removing errors is $2/5 = 0.4 \approx 0.39$.
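The counting rule of the confident joint (Eqn. 1 / Alg. 1) is simple enough to sketch directly: an example with noisy label $i$ is counted in bin $(i, j)$ when its predicted probability for class $j$ clears the threshold $t_j$, with collisions broken by the arg max. A self-contained pure-Python sketch under our own naming (not the paper's released code):

```python
def confident_joint(pred_probs, noisy_labels, thresholds):
    """Count each example into bin (noisy label, confidently inferred true label)."""
    m = len(thresholds)
    C = [[0] * m for _ in range(m)]
    for probs, i in zip(pred_probs, noisy_labels):
        # Classes whose predicted probability meets the per-class threshold.
        above = [j for j in range(m) if probs[j] >= thresholds[j]]
        if not above:
            continue                      # not confidently counted anywhere
        # Collision (more than one class clears its threshold): take the arg max.
        j = max(above, key=lambda k: probs[k])
        C[i][j] += 1
    return C

# Example 2 has given label 0 but is confidently class 1: an off-diagonal
# count, i.e., a candidate label error.
C = confident_joint([[0.9, 0.1], [0.2, 0.8]], [0, 0], [0.85, 0.65])
print(C)  # [[1, 1], [0, 0]]
```

Off-diagonal counts of `C` are exactly the candidate label errors that the pruning and ranking steps act on.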
Corollary 1.0 (Exact Estimation). For a noisy dataset $(x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and $\theta : x \to \hat{p}(\tilde{y})$, if $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, then $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$.

Proof. The result follows directly from Theorem 1. Because the confident joint exactly counts the partitions $X_{\tilde{y}=i, y^*=j}$ for all pairs $(i, j) \in [m] \times [m]$ by Theorem 1, $C_{\tilde{y}, y^*} = n \cdot Q_{\tilde{y}, y^*}$, omitting discretization rounding errors.

In the main text, Theorem 1 includes Corollary 1.0 for brevity. We separate out Corollary 1.0 here to make apparent that the primary contribution of Theorem 1 is to prove $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$, from which the result of Corollary 1.0, namely that $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$, naturally follows, omitting discretization rounding errors.
Corollary 1.1 (Per-Class Robustness). For a noisy dataset $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta : x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is per-class diffracted without label collisions and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$.

Proof. Re-stating the meaning of per-class diffracted, we wish to show that if $\hat{p}(\tilde{y}; x, \theta)$ is diffracted with class-conditional noise such that $\forall j \in [m]$, $\hat{p}(\tilde{y}{=}j; x, \theta) = \epsilon^{(1)}_j \cdot p^*(\tilde{y}{=}j \mid y^*{=}y^*_k) + \epsilon^{(2)}_j$, and each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its row, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$.

First note that combining linear combinations of the real-valued $\epsilon^{(1)}_j$ and $\epsilon^{(2)}_j$ with the probabilities of class $j$ for each example may result in some examples having $\hat{p}_{x, \tilde{y}=j} = \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j > 1$ or $\hat{p}_{x, \tilde{y}=j} = \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j < 0$. The proof makes no assumption about the validity of the model outputs and therefore holds when this occurs. Furthermore, confident learning does not require valid probabilities when finding label errors because it depends on the rank principle, i.e., the rankings of the probabilities, not the values of the probabilities.

When there are no label collisions, the bins created by the confident joint are $\hat{X}_{\tilde{y}=i, y^*=j} = \{ x \in X_{\tilde{y}=i} : \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j \geq t_j \}$, where, for a given $j$, the threshold is
\[ t_j = \mathbb{E}_{x \in X_{\tilde{y}=j}} \big( \epsilon^{(1)}_j p^*_{x, \tilde{y}=j} + \epsilon^{(2)}_j \big) = \epsilon^{(1)}_j t^*_j + \epsilon^{(2)}_j, \]
with $t^*_j$ the ideal threshold. The per-class terms $\epsilon^{(1)}_j$ and $\epsilon^{(2)}_j$ therefore cancel on both sides of the comparison, recovering the construction of $C_{\tilde{y}, y^*}$ for ideal probabilities, which we proved yields exact label errors and consistent estimation of $Q_{\tilde{y}, y^*}$ in Theorem 1; this concludes the proof. Note that we eliminate the need for the assumption that each diagonal entry of $Q_{\tilde{y}|y^*}$ maximizes its column, because this assumption is only used in the proof of Theorem 1 when collisions occur, and here we only consider the case when there are no collisions.
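The rank-principle argument above can be checked numerically: applying a per-class affine map $\epsilon^{(1)}_j p + \epsilon^{(2)}_j$ (with positive scale) to both the probabilities of a class and its threshold leaves the set of examples clearing the threshold unchanged, even when the transformed values fall outside $[0, 1]$. A small self-contained check with our own toy numbers (the distortion values are hypothetical, chosen only to push "probabilities" above 1):

```python
# Verify per-class robustness: an affine distortion applied uniformly within a
# class does not change which examples clear that class's threshold.
ideal_probs = [0.95, 0.80, 0.60, 0.30, 0.10]   # p* for one class j (toy values)
t_ideal = sum(ideal_probs) / len(ideal_probs)  # ideal threshold t*_j ≈ 0.55

eps1, eps2 = 1.7, 0.25                          # hypothetical per-class distortion
diffracted = [eps1 * p + eps2 for p in ideal_probs]   # some values exceed 1.0
t_diff = sum(diffracted) / len(diffracted)            # = eps1 * t*_j + eps2

selected_ideal = [p >= t_ideal for p in ideal_probs]
selected_diff = [p >= t_diff for p in diffracted]
print(selected_ideal == selected_diff)  # True: the same examples are counted
```

Only the rankings relative to the (equally distorted) threshold matter, which is why invalid "probabilities" outside $[0, 1]$ do not break the counting.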
Theorem 2 (Per-Example Robustness). For a noisy dataset $X := (x, \tilde{y})^n \in (\mathbb{R}^d, [m])^n$ and model $\theta : x \to \hat{p}(\tilde{y})$, if $\hat{p}(\tilde{y}; x, \theta)$ is per-example diffracted without label collisions, then $\hat{X}_{\tilde{y}=i, y^*=j} = X_{\tilde{y}=i, y^*=j}$ and, as $n \to \infty$, $\hat{Q}_{\tilde{y}, y^*} \approx Q_{\tilde{y}, y^*}$.

Proof. We consider the nontrivial real-world setting in which a learning model $\theta : x \to \hat{p}(\tilde{y})$ outputs erroneous, non-ideal predicted probabilities, with an error term added for every example across every class: $\forall x \in X, \forall j \in [m]$, $\hat{p}_{x, \tilde{y}=j} = p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j}$. As a notation reminder, $p^*_{x, \tilde{y}=j}$ is shorthand for the ideal probability $p^*(\tilde{y}{=}j \mid y^*{=}y^*_k)$, and $\hat{p}_{x, \tilde{y}=j}$ is shorthand for the predicted probability $\hat{p}(\tilde{y}{=}j; x, \theta)$.

The predicted probability error $\epsilon_{x, \tilde{y}=j}$ is distributed uniformly with no other constraints; in particular, its mean $\epsilon_j$ may be nonzero, as seen from the form of the uniform distribution in Eqn. (4). If we wanted, we could add the constraint $\epsilon_j = 0$, $\forall j \in [m]$, which would simplify the theorem and the proof, but this is less general, and we prove exact label error and joint estimation without this constraint. We re-iterate the form of the error in Eqn. (4) here ($U$ denotes a uniform distribution):
\[ \epsilon_{x, \tilde{y}=j} \sim U\!\big( \epsilon_j + t_j - p^*_{x, \tilde{y}=j},\; \epsilon_j - t_j + p^*_{x, \tilde{y}=j} \big). \tag{4} \]

It suffices to show that the threshold comparison is unaltered by the error: then the subsets created by the confident joint in Eqn. (1) are unaltered, and therefore $\hat{X}^{\epsilon}_{\tilde{y}=i, y^*=j} = \hat{X}_{\tilde{y}=i, y^*=j} \stackrel{\text{Thm. 1}}{=} X_{\tilde{y}=i, y^*=j}$, where $\hat{X}^{\epsilon}_{\tilde{y}=i, y^*=j}$ denotes the confident joint subsets for the $\epsilon$-diffracted predicted probabilities. Now we complete the proof. From the distribution for $\epsilon_{x, \tilde{y}=j}$ (Eqn. 4), the diffracted threshold is $t^{\epsilon}_j = \mathbb{E}_{x \in X_{\tilde{y}=j}} \big( p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \big) = t_j + \epsilon_j$. By the bounds in Eqn. (4),
\[ p^*_{x, \tilde{y}=j} < t_j \implies p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} < t_j + \epsilon_j, \qquad p^*_{x, \tilde{y}=j} \geq t_j \implies p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j. \]
Using the contrapositive of the first implication, we have $p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j \implies p^*_{x, \tilde{y}=j} \geq t_j$. Combining, we have
\[ p^*_{x, \tilde{y}=j} + \epsilon_{x, \tilde{y}=j} \geq t_j + \epsilon_j \iff p^*_{x, \tilde{y}=j} \geq t_j, \]
so the diffracted confident joint counts each example under the same condition ($p^*_{x, \tilde{y}=j} \geq t_j$) as the confident joint under ideal probabilities in Theorem 1, and the result holds under no label collisions. The proof applies for finite datasets because we ignore discretization error; for equality, however, the proof requires the assumption $n \to \infty$, which is given in the statement of the theorem.

Note that while we use a uniform distribution in Eqn. (4), any bounded symmetric distribution with mode $\epsilon_j = \mathbb{E}_{x \in X}\, \epsilon_{x, j}$ is sufficient. Observe that the bounds of the distribution are non-vacuous (they do not collapse to the single value $\epsilon_j$) because $t_j \neq p^*_{x, \tilde{y}=j}$ by Lemma 1.
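This argument is easy to simulate: drawing each per-example error from an interval centered at $\epsilon_j$ with half-width $|p^* - t_j|$ (the shape of the bounds in Eqn. (4)) shifts the threshold by $\epsilon_j$ in expectation and never moves an example across it. A hedged simulation sketch with our own toy values (the draw is parameterized by a factor slightly inside the open interval to avoid boundary effects):

```python
import random

random.seed(0)
t_j, eps_j = 0.55, -0.2                  # ideal threshold; mean error (toy values)
p_star = [0.95, 0.80, 0.60, 0.30, 0.10]  # ideal probabilities p* for class j

flips = 0
for p in p_star:
    for _ in range(1000):
        # Draw eps from the open symmetric interval centered at eps_j with
        # half-width |p - t_j|, matching the bounds of Eqn. (4).
        u = random.uniform(-0.999, 0.999)
        eps = eps_j + u * (p - t_j)
        # Diffracted probability vs. shifted threshold must agree with ideal vs. t_j.
        if (p + eps >= t_j + eps_j) != (p >= t_j):
            flips += 1
print(flips)  # 0: the threshold comparison never flips
```

The simulation mirrors the proof: the mean error only shifts the threshold, while the bounded per-example spread is too small to cross it.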
These quantities are defined in Eqn. 1 and Eqn. 2; for clarity, we provide the equations in algorithm form (see Alg. 1 and Alg. 2). Alg. 1 gives the construction of the confident joint $C_{\tilde{y}, y^*}$. The algorithm takes two inputs: (1) $\hat{P}$, an $n \times m$ matrix of out-of-sample predicted probabilities $\hat{P}[i][j] := \hat{p}(\tilde{y}{=}j; x_i, \theta)$, and (2) the associated array of noisy labels. We typically use cross-validation to compute $\hat{P}$ for train sets, and a model trained on the train set and fine-tuned with cross-validation on the test set to compute $\hat{P}$ for a test set; any method that produces out-of-sample predicted probabilities works. Results in all tables are reproducible via the open-source cleanlab package. Note that Alg. 1 embodies Eqn. 1, and Alg. 2 realizes Eqn. 3.

Alg. 2 estimates the $m \times m$ joint distribution $\hat{Q}_{\tilde{y}, y^*}$ of class-conditional label noise:

input $C_{\tilde{y}, y^*}[i][j]$, the $m \times m$ unnormalized counts
input $\tilde{y}$, an $n \times 1$ array of noisy integer labels
procedure JointEstimation($C$, $\tilde{y}$):
  $\tilde{C}_{\tilde{y}=i, y^*=j} \leftarrow \dfrac{C_{\tilde{y}=i, y^*=j}}{\sum_{j' \in [m]} C_{\tilde{y}=i, y^*=j'}} \cdot |X_{\tilde{y}=i}|$   // calibrate each row to the noisy class count
  $\hat{Q}_{\tilde{y}=i, y^*=j} \leftarrow \dfrac{\tilde{C}_{\tilde{y}=i, y^*=j}}{\sum_{i', j' \in [m]} \tilde{C}_{\tilde{y}=i', y^*=j'}}$   // normalize so all entries sum to 1
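The calibration and normalization of Alg. 2 (Eqn. 3) can be sketched in a few lines of pure Python, under our own naming (not the paper's released code):

```python
def estimate_joint(C, noisy_labels, num_classes):
    """Calibrate confident-joint rows to the noisy class counts, then normalize."""
    class_counts = [0] * num_classes
    for label in noisy_labels:
        class_counts[label] += 1
    # Row calibration: row i of C is rescaled to sum to |X_{y~ = i}|.
    calibrated = []
    for i, row in enumerate(C):
        row_sum = sum(row)
        scale = class_counts[i] / row_sum if row_sum else 0.0
        calibrated.append([c * scale for c in row])
    # Joint normalization: all entries sum to 1.
    total = sum(sum(row) for row in calibrated)
    return [[c / total for c in row] for row in calibrated]

C = [[35, 5], [3, 27]]         # toy confident joint (m = 2)
labels = [0] * 50 + [1] * 50   # 50 noisy-labeled examples per class
Q = estimate_joint(C, labels, 2)
print(Q[0][0])                 # ≈ 0.4375
```

Row calibration corrects for the examples the confident joint did not count (those below every threshold), so the estimated joint reflects the true class frequencies of the noisy labels.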
Fig. S1 shows the absolute difference of the true joint $Q_{\tilde{y}, y^*}$ and the joint distribution estimated using confident learning, $\hat{Q}_{\tilde{y}, y^*}$, on CIFAR-10, for 20%, 40%, and 70% label noise and 0, 0.2, 0.4, and 0.6 sparsity. Observe that in moderate noise regimes, between 20% and 40%, confident learning accurately estimates nearly every entry in the joint distribution of label noise. This figure provides evidence for how confident learning identifies label errors with high accuracy, as shown in Table 2, and supports our theoretical contribution that confident learning exactly estimates the joint distribution of labels under reasonable assumptions. Because we only removed data cleaned by CL from the train set, we may have induced a distributional shift, making the moderate increase in accuracy a more satisfying result.

In Table S1, we estimate $Q_{\tilde{y}, y^*}$ using the confusion-matrix approach $C_{\text{confusion}}$ normalized via Eqn. (3), and compare this with $\hat{Q}_{\tilde{y}, y^*}$ estimated by normalizing the CL confident joint $C_{\tilde{y}, y^*}$. Table S1 shows improvement using $C_{\tilde{y}, y^*}$ over $C_{\text{confusion}}$, low RMSE scores, and robustness to sparsity in moderate-noise regimes.
Figure S1: Absolute difference of the true joint $Q_{\tilde{y}, y^*}$ and the joint distribution estimated using confident learning, $\hat{Q}_{\tilde{y}, y^*}$, on CIFAR-10, for 20%, 40%, and 70% label noise and 0, 0.2, 0.4, and 0.6 sparsity, compared with using the baseline approach $C_{\text{confusion}}$ to estimate $\hat{Q}_{\tilde{y}, y^*}$.

Table S1: RMSE of the estimated joint versus the true joint, for each noise and sparsity setting.

Noise                                        0.2                           0.4                           0.7
Sparsity                            0     0.2   0.4   0.6       0     0.2   0.4   0.6       0     0.2   0.4   0.6
$\|\hat{Q}_{\tilde{y},y^*} - Q_{\tilde{y},y^*}\|_2$          0.004 0.004 0.004 0.004   0.004 0.004 0.004 0.005   0.011 0.010 0.015 0.017
$\|\hat{Q}_{\text{confusion}} - Q_{\tilde{y},y^*}\|_2$       0.006 0.006 0.005 0.005   0.005 0.005 0.005 0.007   0.011 0.011 0.015 0.019
C.1 Benchmarking INCV

We benchmarked INCV on a machine with 4 RTX 2080 Ti GPUs. Due to memory leak issues (as of the February 2020 open-source release², tested on a MacOS laptop with 16 GB RAM and an Ubuntu 18.04 LTS Linux server), training would intermittently halt with errors. For fair comparison, we restarted INCV training until all models completed at least 90 training epochs. For each experiment, Table S2 shows the total time required for training, the epochs completed, and the associated accuracies. As shown in the table, INCV training may take over 20 hours because the approach requires iterative retraining. For comparison, CL takes less than three hours on the same machine: an hour for cross-validation, less than a minute to find errors, and an hour to retrain.

2. https://github.com/chenpf1025/noisy_label_understanding_utilizing

Table S2: INCV benchmark results for various noise and sparsity settings.

Noise                       0.2                             0.4                             0.7
Sparsity           0      0.2    0.4    0.6       0      0.2    0.4    0.6       0      0.2    0.4    0.6
Accuracy         0.878  0.886  0.896  0.892     0.844  0.766  0.854  0.736     0.283  0.253  0.348  0.297
Time (hours)     9.120 11.350 10.420  7.220     7.580 11.720 20.420  6.180    16.230 17.250 16.880 18.300
Epochs trained      91     91    200    157        91    200    200    139        92     92    118    200
In this section, we include additional figures that support the main manuscript. Fig. S2 explores the benchmark accuracy of the individual confident learning approaches to support Fig. 5 and Fig. 4 in the main text. The noise matrices shown in Fig. S3 were used to generate the synthetic noisy labels for the results in Tables 4 and 2.

Fig. S2 shows the top-1 accuracy on the ILSVRC validation set when removing label errors estimated by CL methods versus removing random examples. For each CL method, we plot the accuracy of training with 20%, 40%, ..., 100% of estimated label errors removed, omitting points beyond 200k examples removed.
Figure S2: Increased ResNet validation accuracy using CL methods on ImageNet with original labels (no synthetic noise added). Each point on the line for each method, from left to right, depicts the accuracy of training with 20%, 40%, ..., 100% of estimated label errors removed. Error bars are estimated with Clopper-Pearson 95% confidence intervals. The dash-dotted baseline captures when examples are removed uniformly at random. The black dotted line depicts accuracy when training with all examples.
Figure S3: The CIFAR-10 noise transition matrices used to create the synthetic label errors. In the cleanlab code base, $s$ is used in place of $\tilde{y}$ to notate the noisy labels, and $y$ is used in place of $y^*$ to notate the latent, uncorrupted labels.