zbMATH — the first resource for mathematics

Hierarchical linear dynamical systems for unsupervised musical note recognition. (English) Zbl 1395.94026
Summary: In this paper we develop a new framework for time series segmentation based on a hierarchical linear dynamical system (HLDS) and test its performance on monophonic and polyphonic musical note recognition. The centerpiece of our approach is to impose constraints on the filter topology rather than on the cost function, as is normally done in machine learning. Simply by slowing down the dynamics of the top layer of an augmented (multilayer) state model, which remains compatible with the recursive update equations originally proposed by Kalman, the system learns all the musical notes directly from unlabeled data, effectively yielding a time series clustering algorithm that does not require segmentation. We analyze the properties of the HLDS and show that it achieves better classification accuracy than current state-of-the-art approaches.
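The mechanism described in the summary (a multilayer state filtered with the standard Kalman recursion, with a slowly varying top layer) can be illustrated with a toy sketch in Python. This is not the authors' HLDS. The two-state model, all coefficient values, the piecewise-constant test signal, and the names `kalman_step` and `run_demo` are assumptions made only for illustration. The lower state `x` follows the observation quickly, while the top state `z` is a random walk with small process noise that drives `x`.

```python
import random


def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def kalman_step(s, P, y, F, H, Q, r):
    """One Kalman predict/update cycle for a scalar observation y.

    s: state estimate (list), P: state covariance (symmetric matrix),
    F: transition matrix, H: observation row vector (list),
    Q: process noise covariance (symmetric), r: observation noise variance.
    """
    n = len(s)
    # Predict the state and its covariance.
    s_pred = [sum(F[i][j] * s[j] for j in range(n)) for i in range(n)]
    P_pred = mat_mul(mat_mul(F, P), transpose(F))
    P_pred = [[P_pred[i][j] + Q[i][j] for j in range(n)] for i in range(n)]
    # Innovation and its variance (scalar, so no matrix inverse is needed).
    innov = y - sum(H[j] * s_pred[j] for j in range(n))
    PHt = [sum(P_pred[i][j] * H[j] for j in range(n)) for i in range(n)]
    S = sum(H[i] * PHt[i] for i in range(n)) + r
    K = [PHt[i] / S for i in range(n)]
    # Update. P_pred is symmetric, so H @ P_pred equals PHt as a row.
    s_new = [s_pred[i] + K[i] * innov for i in range(n)]
    P_new = [[P_pred[i][j] - K[i] * PHt[j] for j in range(n)]
             for i in range(n)]
    return s_new, P_new


def run_demo(seed=0):
    """Filter a two-level noisy signal; return the top-layer estimates."""
    rng = random.Random(seed)
    a = 0.5                          # assumed fast lower-layer coefficient
    F = [[a, 1.0], [0.0, 1.0]]       # x <- a*x + z,  z <- z (random walk)
    H = [1.0, 0.0]                   # only the lower layer is observed
    Q = [[1e-2, 0.0], [0.0, 1e-3]]   # top layer has the smaller process noise
    r = 1e-2
    levels = [2.0] * 200 + [-1.0] * 200   # two synthetic "notes"
    s, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
    top = []
    for level in levels:
        y = level + rng.gauss(0.0, 0.1)
        s, P = kalman_step(s, P, y, F, H, Q, r)
        top.append(s[1])
    return top


if __name__ == "__main__":
    top = run_demo()
    print("top-layer mean, segment 1:", sum(top[150:200]) / 50)
    print("top-layer mean, segment 2:", sum(top[350:400]) / 50)
```

In this toy model a constant input level L has the fixed point x = L, z = (1 - a) L, so the top-layer estimate should settle near a distinct value for each segment. That gives a crude, label-free indicator of which segment is active, which is the flavor of clustering the summary describes.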
MSC:
94A08 Image processing (compression, reconstruction, etc.) in information and communication theory
68T10 Pattern recognition, speech recognition
92C20 Neural biology
92B20 Neural networks for/in biological studies, artificial life and related topics
68T05 Learning and adaptive systems in artificial intelligence
Software:
LSTM
References:
[1] Handel, S., Listening: an introduction to the perception of auditory events, (1993), MIT Press Cambridge, MA
[2] Barrington, L.; Chan, A. B.; Lanckriet, G., Modeling music as a dynamic texture, IEEE Trans. Audio Speech Lang. Process., 18, 3, 602-612, (2010)
[3] Dayan, P.; Hinton, G. E.; Neal, R. M.; Zemel, R. S., The Helmholtz machine, Neural Comput., 7, 5, 889-904, (1995)
[4] Friston, K., A theory of cortical responses, Philos. Trans. R. Soc. B: Biol. Sci., 360, 1456, 815-836, (2005)
[5] Rao, R. P.; Ballard, D. H., Dynamic model of visual recognition predicts neural response properties in the visual cortex, Neural Comput., 9, 4, 721-763, (1997)
[6] Chan, A. B.; Vasconcelos, N., Modeling, clustering, and segmenting video with mixtures of dynamic textures, IEEE Trans. Pattern Anal. Mach. Intell., 30, 5, 909-926, (2008)
[7] Hyvärinen, A.; Hurri, J.; Hoyer, P. O., Natural Image Statistics, 39, (2009), Springer · Zbl 1178.68622
[8] Coviello, E.; Chan, A. B.; Lanckriet, G., Time series models for semantic music annotation, IEEE Trans. Audio Speech Lang. Process., 19, 5, 1343-1359, (2011)
[9] Revow, M.; Williams, C. K.; Hinton, G. E., Using generative models for handwritten digit recognition, IEEE Trans. Pattern Anal. Mach. Intell., 18, 6, 592-606, (1996)
[10] Chan, A. B.; Vasconcelos, N., Layered dynamic textures, IEEE Trans. Pattern Anal. Mach. Intell., 31, 10, 1862-1879, (2009)
[11] Vaizman, Y.; Granot, R. Y.; Lanckriet, G., Modeling dynamic patterns for emotional content in music, Proceedings of the International Society for Music Information Retrieval, 747-752, (2011)
[12] Hopfield, J. J., Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci., 79, 8, 2554-2558, (1982) · Zbl 1369.92007
[13] Tank, D.; Hopfield, J., Neural computation by concentrating information in time, Proc. Natl. Acad. Sci., 84, 7, 1896-1900, (1987)
[14] Unnikrishnan, K.; Hopfield, J. J.; Tank, D. W., Connected-digit speaker-dependent speech recognition using a neural network with time-delayed connections, IEEE Trans. Signal Process., 39, 3, 698-713, (1991)
[15] Warren Liao, T., Clustering of time series data: a survey, Pattern Recogn., 38, 11, 1857-1874, (2005) · Zbl 1077.68803
[16] Cinar, G. T., Self-organized computational perception in the time frequency domain, (2015), University of Florida, (Ph.D. thesis)
[17] Cinar, G. T.; Principe, J. C., Clustering of time series using a hierarchical linear dynamical system, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6741-6745, (2014), IEEE
[18] Cinar, G. T.; Loza, C. A.; Principe, J. C., Hierarchical linear dynamical systems: a new model for clustering of time series, Proceedings of the 2014 IEEE International Joint Conference on Neural Networks (IJCNN), 2464-2470, (2014), IEEE
[19] Kalman, R. E., A new approach to linear filtering and prediction problems, J. Basic Eng., 82, 1, 35-45, (1960)
[20] Nelson, A., Nonlinear estimation and modeling of noisy time-series by dual Kalman filtering methods, (2000), Oregon Graduate Institute of Science and Technology, (Ph.D. thesis)
[21] Panuska, V., A new form of the extended Kalman filter for parameter estimation in linear systems with correlated noise, IEEE Trans. Autom. Control, 25, 2, 229-235, (1980) · Zbl 0465.93080
[22] Bryson, A. E.; Ho, Y.-C., Applied Optimal Control: Optimization, Estimation, and Control, (1975), Taylor & Francis Group
[23] Xing, E. P.; Ng, A. Y.; Jordan, M. I.; Russell, S., Distance metric learning with application to clustering with side-information, Adv. Neural Inf. Process. Syst., 15, 505-512, (2003)
[24] Cinar, G. T.; Principe, J. C., A study of musical pitch distance using a self-organized hierarchical linear dynamical system on acoustic signals, Comput. Music J., 40, 3, (2016)
[25] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: analysis and an algorithm, Adv. Neural Inf. Process. Syst., 2, 849-856, (2002)
[26] Kohonen, T., The self-organizing map, Proc. IEEE, 78, 9, 1464-1480, (1990)
[27] Yang, K.-F.; Li, C.-Y.; Li, Y.-J., Multi-feature based surround inhibition improves contour detection in natural images, IEEE Trans. Image Process., 23, 12, 5020-5032, (2014) · Zbl 1374.94426
[28] Smith, E.; Lewicki, M. S., Learning efficient auditory codes using spikes predicts cochlear filters, Adv. Neural Inf. Process. Syst., 17, 1289-1296, (2005)
[29] Smith, E. C.; Lewicki, M. S., Efficient auditory coding, Nature, 439, 7079, 978-982, (2006)
[30] Glasberg, B. R.; Moore, B. C.J., Derivation of auditory filter shapes from notched-noise data, Hear. Res., 47, 1-2, 103-138, (1990)
[31] Moore, B. C.J.; Glasberg, B. R., Suggested formulae for calculating auditory-filter bandwidths and excitation patterns, J. Acoust. Soc. Am., 74, 3, 750-753, (1983)
[32] University of Iowa Electronic Music Studios, Musical instrument samples, (1997), http://theremin.music.uiowa.edu/ (accessed 24-04-12)
[33] De Cheveigné, A.; Kawahara, H., YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., 111, 1917, (2002)
[34] Camacho, A., SWIPE: a sawtooth waveform inspired pitch estimator for speech and music, (2007), University of Florida, (Ph.D. thesis)
[35] Vincent, E.; Bertin, N.; Badeau, R., Adaptive harmonic spectral decomposition for multiple pitch estimation, IEEE Trans. Audio Speech Lang. Process., 18, 3, 528-537, (2010)
[36] Tolonen, T.; Karjalainen, M., A computationally efficient multipitch analysis model, IEEE Trans. Speech Audio Process., 8, 6, 708-716, (2000)
[37] Pertusa, A.; Inesta, J. M., Multiple fundamental frequency estimation using Gaussian smoothness, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, 105-108, (2008), IEEE
[38] Klapuri, A., Multiple fundamental frequency estimation by summing harmonic amplitudes, Proceedings of the International Society for Music Information Retrieval, 216-221, (2006)
[39] Bello, J. P.; Daudet, L.; Sandler, M. B., Automatic piano transcription using frequency and time-domain information, IEEE Trans. Audio Speech Lang. Process., 14, 6, 2242-2251, (2006)
[40] Davy, M.; Godsill, S.; Idier, J., Bayesian analysis of polyphonic western tonal music, J. Acoust. Soc. Am., 119, 4, 2498-2517, (2006)
[41] Marolt, M., A connectionist approach to automatic transcription of polyphonic piano music, IEEE Trans. Multimed., 6, 3, 439-449, (2004)
[42] Poliner, G. E.; Ellis, D. P., A discriminative model for polyphonic piano transcription, EURASIP J. Appl. Signal Process., 2007, 1, 154, (2007) · Zbl 1168.68532
[43] Shin, H.-W.; Kang, S. Y.; Hallett, M.; Sohn, Y. H., Reduced surround inhibition in musicians, Exp. Brain Res., 219, 3, 403-408, (2012)
[44] Wang, Y.; Shanbhag, S. J.; Fischer, B. J.; Peña, J. L., Population-wide bias of surround suppression in auditory spatial receptive fields of the owl’s midbrain, J. Neurosci., 32, 31, 10470-10478, (2012)
[45] Livingstone, M. S.; Hubel, D. H., Specificity of intrinsic connections in primate primary visual cortex, J. Neurosci., 4, 11, 2830-2835, (1984)
[46] Petkov, N.; Subramanian, E., Motion detection, noise reduction, texture suppression, and contour enhancement by spatiotemporal Gabor filters with surround inhibition, Biol. Cybern., 97, 5-6, 423-439, (2007) · Zbl 1248.94018
[47] Shamma, S. A., Speech processing in the auditory system II: lateral inhibition and the central processing of speech evoked activity in the auditory nerve, J. Acoust. Soc. Am., 78, 5, 1622-1632, (1985)
[48] Chalasani, R.; Principe, J. C., Deep predictive coding networks, arXiv preprint arXiv:1301.3541, (2013)
[49] Jesion, G.; Gierczak, C. A.; Puskorius, G. V.; Feldkamp, L. A.; Butler, J. W., The application of dynamic neural networks to the estimation of feedgas vehicle emissions, Proceedings of the 1998 IEEE International Joint Conference on Neural Networks World Congress on Computational Intelligence, vol. 1, 69-73, (1998), IEEE
[50] Zimmermann, H.-G.; Grothmann, R.; Schäfer, A. M.; Tietz, C.; Georg, H., Modeling large dynamical systems with dynamical consistent neural networks, New Directions in Statistical Signal Processing, 203, (2007)
[51] Schmidhuber, J.; Gers, F.; Eck, D., Learning nonregular languages: a comparison of simple recurrent networks and LSTM, Neural Comput., 14, 9, 2039-2041, (2002) · Zbl 1010.68857
[52] Yang, J.; Yu, K.; Gong, Y.; Huang, T., Linear spatial pyramid matching using sparse coding for image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2009, 1794-1801, (2009), IEEE
[53] Arbib, M. A., The Handbook of Brain Theory and Neural Networks, (2003), MIT Press · Zbl 1106.92011
[54] Földiák, P.; Young, M. P., Sparse coding in the primate cortex, Handb. Brain Theory Neural Netw., 1, 1064-1068, (1995)
[55] Barak, O.; Rigotti, M.; Fusi, S., The sparseness of mixed selectivity neurons controls the generalization-discrimination trade-off, J. Neurosci., 33, 9, 3844-3856, (2013)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.