zbMATH — the first resource for mathematics

Correspondence analysis of textual data involving contextual information: CA-GALT on principal components. (English) Zbl 1414.62169
Summary: Correspondence analysis on an aggregated lexical table is a typical practice in textual analysis in which a contextual categorical variable is used to aggregate documents, depending on the categories to which they belong. This work generalises this approach and considers several quantitative, categorical or mixed contextual variables. The result is a new method that we have called ‘correspondence analysis on a generalised aggregated lexical table’. A favoured application derives from surveys by questionnaire, including both open-ended and closed questions. The free-text answers are encoded into a respondents \(\times\) words frequency table called a lexical table. The closed questions, either quantitative or categorical, form the contextual variables. The primary objective is to establish a typology of the variables and a typology of the words from their mutual relationships as grasped from jointly analysing the textual and contextual tables. Validation tests are offered, particularly in the form of confidence ellipses. The comprehensive and numerous properties of the method, similar to correspondence analysis properties, are detailed. Promising results are obtained as indicated by an application to a marketing survey conducted among 1,000 respondents.
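Illustration: the classical starting point mentioned in the summary (a single categorical contextual variable used to aggregate a respondents \(\times\) words lexical table, followed by ordinary correspondence analysis) can be sketched as below. This is not the CA-GALT method of the paper, which extends the idea to several quantitative, categorical or mixed variables. The data are invented toy counts, and the names `aggregate_lexical_table`, `correspondence_analysis` and `age_group` are hypothetical.

```python
import numpy as np

# Toy lexical table (invented): 6 respondents x 4 words, entries are word counts.
lexical = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 1, 1],
    [1, 1, 0, 2],
    [0, 1, 0, 3],
], dtype=float)

# One categorical contextual variable (hypothetical age groups), one value per respondent.
age_group = np.array(["young", "young", "middle", "middle", "senior", "senior"])


def aggregate_lexical_table(lexical, categories):
    """Sum respondent rows within each category: returns a categories x words table."""
    labels, codes = np.unique(categories, return_inverse=True)
    indicator = np.zeros((len(categories), len(labels)))
    indicator[np.arange(len(categories)), codes] = 1.0
    return labels, indicator.T @ lexical


def correspondence_analysis(table):
    """Ordinary CA via the SVD of the standardised residuals of a frequency table."""
    P = table / table.sum()          # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)        # independence model
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = U * sv / np.sqrt(r)[:, None]     # principal coordinates of rows
    col_coords = Vt.T * sv / np.sqrt(c)[:, None]  # principal coordinates of columns
    return sv ** 2, row_coords, col_coords


labels, aggregated = aggregate_lexical_table(lexical, age_group)
eigenvalues, category_coords, word_coords = correspondence_analysis(aggregated)

print(labels)
print(aggregated)
print(eigenvalues)
```

The eigenvalues sum to the total inertia of the aggregated table (its chi-square statistic divided by the grand total), and the category and word coordinates can be plotted on the first two axes to read the two typologies jointly.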
MSC:
62H05 Characterization and structure theory for multivariate probability distributions; copulas
62H17 Contingency tables
62H25 Factor analysis and principal components; correspondence analysis
Software:
Canoco; sedaR
References:
[1] Balbi S, Giordano G (2001) A factorial technique for analyzing textual data with external information. In: Borra S, Rocci R, Vichi M, Schader M (eds) Advances in classification and data analysis. Springer, Heidelberg, pp 169-176
[2] Balbi, S.; Misuraca, M.; Skiadas, CH (ed.), A doubly projected analysis for lexical tables, 13-19, (2010), Boston
[3] Bécue-Bertaut, M.; Pagès, J., A principal axes method for comparing multiple contingency tables: MFACT, Comput Stat Data Anal, 45, 481-503, (2004) · Zbl 1429.62199
[4] Bécue-Bertaut, M.; Pagès, J., Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data, Comput Stat Data Anal, 52, 3255-3268, (2008) · Zbl 05564701
[5] Benzécri JP (1973) L’Analyse des Données, Tome I: L’Analyse des Correspondances. Dunod, Paris
[6] Benzécri JP (1981) Pratique de l’Analyse des Données, Tome III: Linguistique & Lexicologie. Dunod, Paris
[7] Brandimarte P (2011) Quantitative methods: an introduction for business management. Wiley Online Library
[8] D’Ambra, L.; Lauro, NC, Analisi in componenti principali in rapporto ad un sottospazio di riferimento, Riv Stat Appl, 15, 51-67, (1982)
[9] Efron, B., Bootstrap methods: another look at the jackknife, Ann Stat, 7, 1-26, (1979) · Zbl 0406.62024
[10] Escofier B, Pagès J (2008) Analyses factorielles simples et multiples, 4th edn. Dunod, Paris
[11] Esposito Vinzi, V., Exploratory methods for comparative analysis, Chemometr Intell Lab, 58, 275-286, (2001)
[12] Härdle K, Simar L (2012) Applied Multivariate Statistical Analysis. Springer, London · Zbl 1266.62032
[13] Jolliffe, IT, A note on the use of principal components in regression, Appl Stat, 31, 300-303, (1982)
[14] Lauro, NC; D’Ambra, L.; Diday, E. (ed.); Jambu, M. (ed.); Lebart, L. (ed.); Pagès, J. (ed.); Tomassone, R. (ed.), L’analyse non symétrique des correspondances, No. III, (1984), Amsterdam
[15] Lebart L, Salem A, Berry L (1998) Exploring textual data. Kluwer, Dordrecht
[16] Lebart L, Piron M, Morineau A (2006) Statistique exploratoire multidimensionnelle. Visualisation et inférence en fouilles de données, 4th edn. Dunod, Paris
[17] Lebreton, JD; Chessel, D.; Prodon, R.; Yoccoz, N., L’analyse des relations espèces-milieu par l’analyse canonique des correspondances. I. Variables de milieu quantitatives, Acta Oecol, 9, 53-67, (1988)
[18] Legendre P, Legendre L (1998) Numerical ecology, 2nd edn. Elsevier Science, Amsterdam · Zbl 1033.92036
[19] Massy, WF, Principal components regression in exploratory statistical research, J Am Stat Assoc, 60, 234-256, (1965)
[20] Murtagh F (2005) Correspondence analysis and data coding with Java and R. Chapman & Hall, Boca Raton · Zbl 1079.62003
[21] Ni, L., Principal component regression revisited, Stat Sin, 21, 741-747, (2011) · Zbl 1214.62069
[22] Preda C, Saporta G. (2005) PLS regression on a stochastic process. Comput Stat Data Anal 48:149-158 · Zbl 1429.62224
[23] Saporta G (2011) Probabilités, analyse des données et statistiques, 3rd edn. Technip, Paris
[24] Spano M, Triunfo N (2012) La relazione sulla gestione delle società italiane quotate sul mercato regolamentato. In: Dister A, Longrée D, Purnelle G (eds) Actes de 11ème Journées d’analyse de données textuelles. http://lexicometrica.univ-paris3.fr/jadt/jadt2012/tocJADT2012.htm
[25] Takane Y (1997) CPCA: a comprehensive theory. In: Proceedings of the 1997 IEEE international conference on systems, man and cybernetics (SMC). IEEE, Orlando, pp 35-40
[26] Takane, Y.; Yanai, H.; Mayekawa, S., Relationships among several methods of linearly constrained correspondence analysis, Psychometrika, 56, 667-684, (1991) · Zbl 0760.62057
[27] ter Braak, CJF, Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis, Ecology, 67, 1167-1179, (1986)
[28] ter Braak CJF (1987) Canoco—a FORTRAN program for canonical community ordination by [partial] [detrended] [canonical] correspondence analysis (version 2.1), ITI-TNO Institute of Applied Computer Sciences, Wageningen
[29] Tufféry S (2005) Data mining et statistique décisionnelle. Technip, Paris
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.