zbMATH — the first resource for mathematics

Investigation on LP-residual representations for speaker identification. (English) Zbl 1182.68192
Summary: Feature extraction is an essential step for speaker recognition systems. In this paper, we propose to improve these systems by exploiting both conventional features, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), and non-conventional ones. The method exploits information present in the Linear Predictive (LP) residual signal. The features extracted from the LP residual are then combined with the MFCC or the LPCC. We investigate two approaches, termed the temporal and the frequential representations. The first consists of an Auto-Regressive (AR) modelling of the residual signal followed by a cepstral transformation, in a similar way to the LPC-to-LPCC transformation. In order to take into account the non-linear nature of speech signals, we use two estimation methods based on second- and third-order statistics. They are termed, respectively, R-SOS-LPCC (residual plus second-order-statistics-based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (the higher-order-statistics counterpart).
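The temporal representation described above can be sketched as follows: LP analysis of a frame, inverse filtering to obtain the residual, and the standard LPC-to-cepstrum recursion applied to the residual's AR model. The sketch below covers only the second-order (R-SOS-LPCC-style) path; the frame, model order, and plain autocorrelation estimation are illustrative assumptions, and the paper's third-order (HOS) estimator is not reproduced here.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a (a[0] = 1)
    from an autocorrelation sequence r via the Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i + 1):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= 1.0 - k * k
    return a

def lp_residual(frame, order):
    """LP-analyse a frame and return (AR coefficients, residual):
    the residual is the frame filtered by the inverse filter A(z)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = levinson_durbin(r[:order + 1], order)
    residual = np.convolve(frame, a)[:len(frame)]
    return a, residual

def lpc_to_cepstrum(a, n_cep):
    """Standard recursion converting AR coefficients into LP cepstral
    coefficients (the same LPC -> LPCC step, here applied to the
    residual's AR model as in the R-SOS-LPCC scheme)."""
    p = len(a) - 1
    c = np.zeros(n_cep + 1)
    for n in range(1, n_cep + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a decaying exponential frame `0.5 ** np.arange(N)` (impulse response of a first-order AR model), a first-order analysis recovers `a = [1, -0.5]` and an almost-zero residual after the first sample, and the cepstral recursion reproduces the known values `c_n = 0.5**n / n`.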
Concerning the frequential approach, we exploit a filter-bank method called the Power Difference of Spectra in Sub-band (PDSS), which measures the spectral flatness over the sub-bands. The resulting features are named R-PDSS. The analysis of these proposed schemes is carried out on a speaker identification problem with two different databases. The first is the Gaudi database, which contains 49 speakers; its main interest lies in the controlled acquisition conditions: mismatch between the microphones and the intervals between sessions. The second database is the well-known NTIMIT corpus, with 630 speakers. The performance of the features is confirmed on this larger corpus. In addition, we propose to compare traditional features and residual ones through the fusion of recognizers (feature extractor + classifier). The results show that residual features carry speaker-dependent information, and their combination with the LPCC or the MFCC yields global improvements in robustness under different mismatches. A comparison between the residual features within the opinion-fusion framework gives useful information about the potential of both the temporal and the frequential representations.

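The fusion of recognizers evaluated in the paper operates at the opinion (score) level. A minimal sketch of such a combination is a weighted sum rule over per-speaker scores; the min-max normalisation and equal default weights below are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def fuse_scores(score_sets, weights=None):
    """Opinion (score-level) fusion of several recognizers via a
    weighted sum rule. Each element of score_sets holds one
    recognizer's per-speaker scores; each set is min-max normalised
    before combining (an illustrative choice). Returns the index of
    the identified speaker and the fused score vector."""
    fused = np.zeros_like(np.asarray(score_sets[0], dtype=float))
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    for w, s in zip(weights, score_sets):
        s = np.asarray(s, dtype=float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)  # min-max normalise
        fused += w * s
    return int(np.argmax(fused)), fused
```

When two recognizers rank the same speaker highest, the fused decision agrees with both; when they disagree, the normalised weighted sum arbitrates between them.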
68T10 Pattern recognition, speech recognition
Full Text: DOI
[1] Jang, G.J.; Lee, T.W.; Oh, Y.H., Learning statistically efficient features for speaker recognition, Neurocomputing, 49, 329-348, (2002)
[2] R.E. Slyh, E.G. Hansen, T.R. Anderson, Glottal modeling and closed-phase analysis for speaker recognition, in: Proceedings of the ISCA Tutorial and Research Workshop on Speaker and Language Recognition (Odyssey’04), 2004, pp. 315-322.
[3] L. Mary, K. Sri Rama Murty, S.R. Mahadeva Prasanna, B. Yegnanarayana, Features for speaker and language identification, in: Proceedings of the ISCA Tutorial and Research Workshop on Speaker and Language Recognition (Odyssey’04), 2004, pp. 323-328.
[4] J. Ortega, et al., Ahumada: a large speech corpus in Spanish for speaker identification and verification, in: Proceedings of the IEEE ICASSP’98, vol. 2, 1998, pp. 773-775.
[5] Atal, B.S.; Hanauer, S.L., Speech analysis and synthesis by linear prediction of the speech wave, J. acoust. soc. am., 50, 637-655, (1971)
[6] Faundez-Zanuy, M.; Kubin, G.; Kleijn, W.B.; Maragos, P.; McLaughlin, S.; Esposito, A.; Hussain, A.; Schoentgen, J., Nonlinear speech processing: overview and applications, Control intelligent syst., 30, 1, 1-10, (2002)
[7] G. Kubin, Nonlinear processing of speech, in: W.B. Kleijn, K.K. Paliwal (Eds.), Speech Coding and Synthesis, 1995, pp. 557-610.
[8] Thevenaz, P.; Hügli, H., Usefulness of the LPC-residue in text-independent speaker verification, Speech commun., 17, 1-2, 145-157, (1995)
[9] Faundez, M.; Rodriguez, D., Speaker recognition using residual signal of linear and nonlinear prediction models, ICSLP, 2, 121-124, (1998)
[10] B. Yegnanarayana, K.S. Reddy, S.P. Kishore, Source and system features for speaker recognition using AANN models, in: Proceedings of the IEEE ICASSP, 2001, pp. 409-412.
[11] Mahadeva Prasanna, S.R.; Gupta, C.S.; Yegnanarayana, B., Extraction of speaker-specific excitation from linear prediction residual of speech, Speech commun., 48, 1243-1261, (2006)
[12] N. Zheng, T. Lee, P.C. Ching, Integration of complementary acoustic features for speaker recognition, IEEE Signal Process. Lett., 2006.
[13] A. Esposito, M. Marinaro, Some notes on nonlinearities of speech, in: G. Chollet, et al. (Eds.), Nonlinear Speech Modeling, Lecture Notes in Artificial Intelligence, vol. 3445, 2005, pp. 1-4.
[14] S. McLaughlin, S. Hovell, A. Lowry, Identification of nonlinearities in vowel generation, in: Proceedings of the EUSIPCO, 1988, pp. 1133-1136.
[15] H. Teager, S. Teager, Evidence for nonlinear sound production mechanisms in the vocal tract, in: Proceedings of the NATO ASI on Speech Production and Speech Modeling, vol. II, 1989, pp. 241-261.
[16] Gazor, S.; Zhang, W., Speech probability distribution, IEEE signal process. lett., 10, 7, 204-207, (2003)
[17] G. Chollet, A. Esposito, M. Faundez-Zanuy, M. Marinaro, Nonlinear speech modeling and applications, in: Lecture Notes in Artificial Intelligence, vol. 3445, 2005.
[18] M. Faundez, D. Rodriguez, Speaker recognition by means of a combination of linear and nonlinear predictive models, in: Proceedings of the IEEE ICASSP’99, 1999.
[19] M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, A new nonlinear speaker parameterization algorithm for speaker identification, in: Proceedings of the ISCA Tutorial and Research Workshop on Speaker and Language Recognition (Odyssey’04), 2004, pp. 309-314. · Zbl 1182.68192
[20] E. Rank, G. Kubin, Nonlinear synthesis of vowels in the LP residual domain with a regularized RBF network, in: Proceedings of the IWANN, vol. 2085(II), 2001, pp. 746-753. · Zbl 0982.68872
[21] J. Thyssen, H. Nielsen, S.D. Hansen, Non-linear short-term prediction in speech coding, in: Proceedings of the IEEE ICASSP’94, vol. 1, 1994, pp. 185-188.
[22] Tao, C.; Mu, J.; Xu, X.; Du, G., Chaotic characteristics of speech signal and its LPC residual, Acoust. sci. technol., 25, 1, 50-53, (2004)
[23] S.H. Chen, H.C. Wang, Improvement of speaker recognition by combining residual and prosodic features with acoustic features, in: Proceedings of the IEEE ICASSP’04, vol. 1, 2004, pp. 93-96.
[24] K.K. Paliwal, M.M. Sondhi, Recognition of noisy speech using cumulant-based linear prediction analysis, in: Proceedings of the IEEE ICASSP’91, vol. 1, 1991, pp. 429-432.
[25] S. Hayakawa, K. Takeda, F. Itakura, Speaker identification using harmonic structure of LP-residual spectrum, in: Audio- and Video-based Biometric Person Authentication, Lecture Notes in Computer Science, vol. 1206, Springer, Berlin, 1997, pp. 253-260.
[26] J. He, L. Liu, G. Palm, On the use of residual cepstrum in speech recognition, in: Proceedings of the IEEE ICASSP’96, vol. 1, 1996, pp. 5-8.
[27] A. Satue-Villar, M. Faundez-Zanuy, On the relevance of language in speaker recognition, in: Proceedings of the EUROSPEECH’99, vol. 3, 1999, pp. 1231-1234.
[28] C. Jankowski, A. Kalyanswamy, S. Basson, J. Spitz, NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database, in: Proceedings of the IEEE ICASSP, vol. 1, 1990, pp. 109-112.
[29] Bimbot, F.; Magrin-Chagnolleau, I.; Mathan, L., Second-order statistical measures for text-independent speaker identification, Speech commun., 17, 177-192, (1995)
[30] Reynolds, D.A., Speaker identification and verification using Gaussian mixture speaker models, Speech commun., 17, 91-108, (1995)
[31] Besacier, L.; Bonastre, J.F., Subband architecture for automatic speaker recognition, Signal process., 80, 1245-1259, (2000) · Zbl 1034.94512
[32] F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in: Proceedings of the EUROSPEECH’91, 1991, pp. 169-172.
[33] Kittler, J.; Hatef, M.; Duin, R.P.W.; Matas, J., On combining classifiers, IEEE trans. pattern anal. Mach. intell., 20, 3, 226-239, (1998)
[34] Faundez-Zanuy, M., Data fusion in biometrics, IEEE aerosp. electron. syst. mag., 20, 1, 34-38, (2005)
[35] C. Sanderson, Information fusion and person verification using speech and face information, IDIAP Research Report 02-33, 1-37, September 2002.
[36] M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, Non-linear speech feature extraction for phoneme classification and speaker recognition, in: G. Chollet et al. (Eds.), Nonlinear Speech Modeling, Lecture Notes in Artificial Intelligence, vol. 3445, 2005, pp. 344-350. · Zbl 1182.68192
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.