Sample size determination for classifiers based on single-nucleotide polymorphisms. (English) Zbl 1437.62538

Summary: Single-nucleotide polymorphisms (SNPs), believed to determine human differences, are widely used to predict risk of diseases. Typically, clinical samples are limited and/or the sampling cost is high. Thus, it is essential to determine an adequate sample size needed to build a classifier based on SNPs. Such a classifier would facilitate correct classifications, while keeping the sample size to a minimum, thereby making the studies cost-effective. For coded SNP data from 2 classes, an optimal classifier and an approximation to its probability of correct classification (PCC) are derived. A linear classifier is constructed and an approximation to its PCC is also derived. These approximations are validated through a variety of Monte Carlo simulations. A sample size determination algorithm based on the criterion, which ensures that the difference between the 2 approximate PCCs is below a threshold, is given and its effectiveness is illustrated via simulations. For the HapMap data on Chinese and Japanese populations, a linear classifier is built using 51 independent SNPs, and the required total sample sizes are determined using our algorithm, as the threshold varies. For example, when the threshold value is 0.05, our algorithm determines a total sample size of 166 (83 for Chinese and 83 for Japanese) that satisfies the criterion.
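The stopping criterion described in the summary can be illustrated with a small simulation. The sketch below is not the paper's algorithm: the paper works with analytic approximations to the two PCCs, whereas this example substitutes Monte Carlo estimates. It assumes independent SNPs coded 0/1/2 under Hardy-Weinberg proportions, equal class priors, and a plug-in linear rule built from estimated allele frequencies. All function names and the allele frequencies are hypothetical.

```python
import numpy as np

def simulate_genotypes(rng, freqs, n):
    """Draw n individuals; each SNP count is Binomial(2, allele frequency)."""
    return rng.binomial(2, freqs, size=(n, len(freqs)))

def llr_weights(f1, f2):
    """Weights and intercept of the log-likelihood-ratio rule.

    For independent Binomial(2, f) SNPs the log-likelihood ratio is linear
    in the genotype vector, so the optimal rule is itself linear.
    """
    log_minor = np.log((1 - f1) / (1 - f2))
    w = np.log(f1 / f2) - log_minor
    b = 2.0 * np.sum(log_minor)
    return w, b

def pcc(w, b, test1, test2):
    """Empirical probability of correct classification with equal priors."""
    correct1 = np.mean(test1 @ w + b > 0)
    correct2 = np.mean(test2 @ w + b <= 0)
    return 0.5 * (correct1 + correct2)

def estimate_freqs(x):
    """Allele-frequency estimates, smoothed to stay strictly inside (0, 1)."""
    return (x.sum(axis=0) + 0.5) / (2 * x.shape[0] + 1.0)

def required_sample_size(f1, f2, threshold, n_grid,
                         n_rep=50, n_test=2000, seed=0):
    """Smallest per-class n in n_grid whose mean PCC is within `threshold`
    of the PCC of the rule built from the true frequencies.

    Returns (n, pcc_optimal, pcc_at_n); n and pcc_at_n are None if no
    value in n_grid satisfies the criterion.
    """
    rng = np.random.default_rng(seed)
    test1 = simulate_genotypes(rng, f1, n_test)
    test2 = simulate_genotypes(rng, f2, n_test)
    pcc_opt = pcc(*llr_weights(f1, f2), test1, test2)
    for n in n_grid:
        pccs = []
        for _ in range(n_rep):
            tr1 = simulate_genotypes(rng, f1, n)
            tr2 = simulate_genotypes(rng, f2, n)
            w, b = llr_weights(estimate_freqs(tr1), estimate_freqs(tr2))
            pccs.append(pcc(w, b, test1, test2))
        pcc_n = float(np.mean(pccs))
        if pcc_opt - pcc_n <= threshold:
            return n, pcc_opt, pcc_n
    return None, pcc_opt, None

if __name__ == "__main__":
    # Hypothetical allele frequencies for 51 SNPs in two populations.
    demo_rng = np.random.default_rng(1)
    f1 = demo_rng.uniform(0.2, 0.8, size=51)
    f2 = np.clip(f1 + demo_rng.normal(0.0, 0.08, size=51), 0.05, 0.95)
    print(required_sample_size(f1, f2, threshold=0.05,
                               n_grid=range(10, 201, 10)))
```

The returned sample size is per class, so the total is twice that value under the balanced design assumed here. The answer depends on the simulated frequencies, the grid, and the Monte Carlo settings, so it should not be expected to reproduce the figures quoted in the summary.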

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
