A case study competition among methods for analyzing large spatial data.

*(English)* Zbl 1426.62345

Summary: The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low-rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics.

##### MSC:

62P12 | Applications of statistics to environmental and related topics |

62M30 | Inference from spatial processes |

62-08 | Computational methods for problems pertaining to statistics |


*M. J. Heaton* et al., J. Agric. Biol. Environ. Stat. 24, No. 3, 398–425 (2019; Zbl 1426.62345)

