zbMATH — the first resource for mathematics

Applied harmonic analysis and data processing. Abstracts from the workshop held March 25–31, 2018. (English) Zbl 1409.00083
Summary: Massive data sets have their own architecture. Each data source has an inherent structure, which we should attempt to detect in order to utilize it for applications, such as denoising, clustering, anomaly detection, knowledge extraction, or classification. Harmonic analysis revolves around creating new structures for decomposition, rearrangement and reconstruction of operators and functions – in other words inventing and exploring new architectures for information and inference. Two previous very successful workshops on applied harmonic analysis and sparse approximation have taken place in 2012 and in 2015. This workshop was the an evolution and continuation of these workshops and intended to bring together world leading experts in applied harmonic analysis, data analysis, optimization, statistics, and machine learning to report on recent developments, and to foster new developments and collaborations.
00B05 Collections of abstracts of lectures
00B25 Proceedings of conferences of miscellaneous specific interest
65Txx Numerical methods in Fourier analysis
15B52 Random matrices (algebraic aspects)
42-06 Proceedings, conferences, collections, etc. pertaining to harmonic analysis on Euclidean spaces
42Cxx Nontrigonometric harmonic analysis
94A20 Sampling theory in information and communication theory
Full Text: DOI
[1] A. Aldroubi and K. Gr¨ochenig. Nonuniform sampling and reconstruction in shift-invariant spaces. SIAM Rev., 43(4):585–620, 2001.
[2] K. Gr¨ochenig. The mystery of Gabor frames. J. Fourier Anal. Appl., 20(4):865–895, 2014.
[3] K. Gr¨ochenig, J.-L. Romero and J. St¨ockler. Sampling Theorems for Shift-invariant Spaces, Gabor Frames, and Totally Positive Functions Invent. Math. 211 (3), 1119 - 1148.
[4] K. Gr¨ochenig,J.-L. Romero and J. St¨ockler. Sharp results on sampling with derivativesinshift-invariantspacesandmulti-windowGaborFrames.ArXiv https://arxiv.org/pdf/1712.07899.pdf Synthesis of Logical Clifford Operators via Symplectic Geometry Robert Calderbank (joint work with Narayanan Rengaswamy, Swanand Kadhe, Henry Pfister) Quantum error-correcting codes can be used to protect qubits involved in quantum computation. This requires that logical operators acting on protected qubits be translated to physical operators (circuits) acting on physical quantum states. We propose a mathematical framework for synthesizing physical circuits that implement logical Clifford operators for stabilizer codes. Circuit synthesis is enabled by representing the desired physical Clifford operator in CN ×Nas a partial 2m× 2m binary symplectic matrix, where N = 2m. We state and prove two theorems that use symplectic transvections to efficiently enumerate all binary symplectic matrices that satisfy a system of linear equations. As an important corollary of these 732Oberwolfach Report 14/2018 results, we prove that for an [[m, m− k]] stabilizer code every logical Clifford operator has 2k(k+1)/2symplectic solutions. The desired physical circuits are then obtained by decomposing each solution as a product of elementary symplectic matrices, each corresponding to an elementary circuit. Our assembly of the possible physical realizations enables optimization over the ensemble with respect to a suitable metric. Furthermore, we show that any circuit that normalizes the stabilizer of the code can be transformed into a circuit that centralizes the stabilizer, while realizing the same logical operation. However, the optimal circuit for a given metric may not correspond to a centralizing solution. Our method of circuit synthesis can be applied to any stabilizer code, and this paper provides a proof of concept synthesis of universal Clifford gates for the [[6, 4, 2]] CSS code. We conclude with a classical coding-theoretic perspective for constructing logical Pauli operators for CSS codes. Since our circuit synthesis algorithm builds on the logical Pauli operators for the code, this paper provides a complete framework for constructing all logical Clifford operators for CSS codes. Programs implementing the algorithms in this paper, which includes routines to solve for binary symplectic solutions of general linear systems and our overall circuit synthesis algorithm, can be found at https://github.com/nrenga/symplectic-arxiv18a. References
[5] N. Rengaswamy, R. Calderbank, S. Kadhe, and H. Pfister, “Synthesis of logical clifford operators via symplectic geometry,” in Proc. IEEE Int. Symp. Inform. Theory, 2018.
[6] N. Rengaswamy, R. Calderbank, S. Kadhe, and H. D. Pfister, “Synthesis of Logical Clifford Operators via Symplectic Geometry,” arXiv preprint arXiv:1803.06987, 2018.
[7] A. R. Calderbank and P. W. Shor, “Good quantum error-correcting codes exist,” Phys. Rev. A, vol. 54, pp. 1098–1105, Aug 1996.
[8] A. M. Steane, “Simple quantum error-correcting codes,” Phys. Rev. A, vol. 54, no. 6, pp. 4741–4751, 1996.
[9] D. Gottesman, “A Theory of Fault-Tolerant Quantum Computation,” arXiv preprint arXiv:quant-ph/9702029,1997.[Online].Available:http://arxiv.org/pdf/quantph/9702029.pdf.
[10] A. Calderbank, E. Rains, P. Shor, and N. Sloane, “Quantum error correction via codes over GF(4),” IEEE Trans. Inform. Theory, vol. 44, pp. 1369–1387, Jul 1998.
[11] M. M. Wilde, “Logical operators of quantum codes,” Phys. Rev. A, vol. 79, no. 6, p. 062322, 2009.
[12] D. Gottesman, “An Introduction to Quantum Error Correction and Fault-Tolerant Quantum Computation,” arXiv preprint arXiv:0904.2557, 2009. [Online]. Available: http://arxiv.org/pdf/0904.2557.pdf.
[13] M. Grassl and M. Roetteler, “Leveraging automorphisms of quantum codes for fault-tolerant quantum computation,” in Proc. IEEE Int. Symp. Inform. Theory, pp. 534–538, IEEE, Jul 2013.
[14] R. Chao and B. W. Reichardt, “Fault-tolerant quantum computation with few qubits,” arXiv preprint arXiv:1705.05365, 2017. [Online]. Available: http://arxiv.org/pdf/1705.05365.pdf. Applied Harmonic Analysis and Data Processing733 Learning via nonconvex optimization: ReLUs, neural nets, and submodular maximization Mahdi Soltanolkotabi Many problems of contemporary interest in signal processing and machine learning involve highly non-convex optimization problems. While nonconvex problems are known to be intractable in general, simple local search heuristics such as (stochastic) gradient descent are often surprisingly effective at finding global/high quality optima on real or randomly generated data. In this note we summarize some recent results explaining the success of these heuristics focusing on two problems: (1) learning the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU), (2) learning over-parameterized neural networks with a single hidden layer. In the talk we also discussed a third problem of maximizing submodular functions (we omit this description here due to space limitation and refer to [3] for detail on this problem). This summary is based on our papers [1, 2, 3]. We refer to these papers for a comprehensive discussion on related work in these areas. 1. Problem I: Learning ReLUs Nonlinear data-fitting problems are fundamental to many supervised learning tasks in signal processing and machine learning. Given training data consisting of n pairs of input features xi∈ Rdand desired outputs yi∈ R we wish to infer a function that best explains the training data. One form of nonlinearity which is of particular interest in modern learning is that of fitting Rectified Linear Units (ReLUs) to the data which are functions φw: Rd→ R of the form φw(x) = max (0,hw, xi) . A natural approach to fitting ReLUs is via nonlinear least-squares of the form Xn 1 (1)minL(w) :=(max (0,hw, xii) − yi)2subject toR(w) ≤ R, w∈Rdn i=1 withR : Rd→ R a regularization function that encodes prior information on the weight vector. A simple heuristic for optimizing (1) is to use projected gradient descent like updates. A-priori it is completely unclear why such local search heuristics should converge for problems of the form (1), as not only the regularization function maybe nonconvex but also the loss function! Our result aims to explain why gradient descent is effective in this setting. Theorem 1. Let w∗∈ Rdbe an arbitrary weight vector andR : Rd→ R be a proper function (convex or nonconvex). Suppose the feature vectors xi∈ Rdare i.i.d. Gaussian random vectors distributed asN (0, I) with the corresponding labels given by yi= max (0,hxi, w∗i) . 734Oberwolfach Report 14/2018 To estimate w∗, we start from the initial point w0= 0 and apply the Projected Gradient (PGD) updates of the form (2)wτ +1=PK(wτ− µτ∇L(wτ)) , withK := w ∈ Rd:R(w) ≤ R(w∗). Also set the learning parameter sequence µ0= 2 and µτ= 1 for all τ = 1, 2, . . .. Also assume (3)n > cn0, holds for a fixed numerical constant c. Here, n0is a lower bound on the minimum number of samples required using any algorithm (see [1] for a precise definition). Then there is an event of probability at least 1− 9e−γnsuch that on this event the updates (2) obey τ (4)kwτ− w∗kℓ2≤12kw∗kℓ2. Here γ is a fixed numerical constant. Despite the nonconvexity of both the objective and regularizer, the theorem above shows that with a near minimal number of data samples, projected gradient descent provably learns the original weight vector w∗without getting trapped in any local optima. 2. Problem II: Learning over-parameterized shallow neural nets Neural network architectures (a.k.a. deep learning) have recently emerged as powerful tools for automatic knowledge extraction from raw data. These learning architectures have lead to major breakthroughs in many applications. Despite their wide empirical use the mathematical success of these architectures remains a mystery. The main challenge is that training neural networks correspond to extremely high-dimensional and nonconvex optimization problems and it is not clear how to provably solve them to global optimality. These networks are trained successfully in practice via local search heuristics on real or randomly generated data. In particular, over-parameterized neural networks-where the number of parameters exceed the number of data samples-can be optimized to global optimality using local search heuristics such as gradient or stochastic gradient methods. In our paper [2] we provide theoretical insights into this phenomenon by developing a better understanding of optimization landscape of such over-parameterized shallow neural networks. We discuss the main results in [2]. The results we present here focuses on understanding the global landscape of neural network optimization with one hidden layer with quadratic activation functions. The paper [2] also contains results on the local convergence of gradient descent that applies to a broad set of activation functions. We omit these results do to space limitations. Applied Harmonic Analysis and Data Processing735 Theorem 2. Assume we have an arbitrary data set of input/label pairs xi∈ Rd and yifor i = 1, 2, . . . , n. Consider a neural network of the form x7→ vTφ(W x), with φ(z) = z2a quadratic activation and W∈ Rk×d, v∈ Rkdenoting the weights connecting input to hidden and hidden to output layers. We assume k≥ 2d and set the weights of the output layer v so as to have at least d positive entries and at least d negative entries. Then, the training loss as a function of the weights W of the hidden layer n 1X2 2nyi− vTφ(W xi), i=1 obeys the following two properties. • There are no spurious local minima, i.e. all local minima are global. • All saddle points have a direction of strictly negative curvature. That is, at a saddle point Wsthere is a direction U∈ Rk×dsuch that vect(U )T∇2L(Ws)vect(U ) < 0. Furthermore, for almost every data inputsxini=1, as long as d≤ n ≤ cd2, the global optimum ofL(W ) is zero. Here, c > 0 is a fixed numerical constant. The above result states that given an arbitrary data set, the optimization landscape of fitting neural networks have favorable properties that facilitate finding globally optimal models. In particular, by setting the weights of the last layer to have diverse signs all local minima are global minima and all saddles have a direction of negative curvature. This in turn implies that gradient descent on the input-to-hidden weights, when initialized at random, converges to a global optima. All of this holds as long as the neural network is sufficiently wide in the sense that the number of hidden units exceed the dimension of the inputs by a factor of two (k≥ 2d). References
[15] M. Soltanolkotabi, Learning ReLUs via gradient descent, Neural Information Processing Systems (NIPS 2017).
[16] Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Under revision in IEEE Trans. on Info. Theory
[17] H. Hassani, M. Soltanolkotabi and A. Karbasi., Gradient methods for submodular maximization, Neural Information Processing Systems (NIPS 2017). 736Oberwolfach Report 14/2018 SUNLayer: stable denoising with generative networks Soledad Villar (joint work with Dustin G. Mixon) Exploiting the structure of signals is a fundamental idea in signal processing. For instance, natural images are sparse on wavelets basis [4], and sparsity allows the recovery of signals from few measurements (‘a la compressed sensing [3]). The current trend, that takes advantage of the empirical success of deep learning, is to learn the structure of the signal first, and then exploit it. One way to represent the structure of signals is through a generative model. Informally, a generative model can be thought as a form of parametrization of the data G : Rn→ RNn≪ N such that G(Rn) is a proxy for the probability density of the data of interest. Very impressive generative models have been produced (see for instance [1]) using autoencoders and generative adversarial networks [6]. However, there does not seem to currently exist a provable way to produce generative models successfully and even when the generative model produced is useful for application purposes it is not clear whether a reasonable data distribution is actually learned. Producing generative models that are useful for applications is a very active research area within the machine learning community. 1. Inverse problems with generative models If we have a good generative model we can do amazing things with it. For instance, recent work by Bora, Jalal, Price and Dimakis [2] empirically shows that a generative model obtained from a generative adversarial network can be used to solve the compressed sensing problem with 10 times fewer measurements than classical compressed sensing requires. The key idea is to replace the sparse signal assumption by assuming the signal is close to the range of the generative model G. Their theoretical result shows that under mild hypothesis, if y = Ax∗+ η (η is the noise), then (1)z∗= arg minkAG(z) − yk2 z satisfies G(z∗)≈ x∗(see Theorem 1.1 of [2]). However, it is not obvious that one can efficiently solve the optimization problem (1) since its landscape may a priori have many local minima. Recent work by Hand and Voroninski [7] shows the optimization problem (1) can be solved using local methods provided that G = (ρ◦ Gℓ)◦ . . . ◦ (ρ ◦ G1) where ρ(t) = ReLU(t) = max0, t and Giare matrices with i.i.d. Gaussian entries properly scaled. There is no learning in this setting. Applied Harmonic Analysis and Data Processing737 Figure 1.Denoising with generative priors (First line) Digits from the MNIST test set ([8]). (Second line) random noise is added to the digits. (Third line) Denoising of images by shrinkage in wavelet domain ([5]). (Fourth line) Denoising by minimizing total variation ([11]). (Fifth line) We train a GAN using the training set of MNIST to obtain a generative model G. We denoise by finding the closest element in the image of G using stochastic grading descent. 2. SUNLayer and denoising with generative models Motivated by both works [2, 7], in our paper [9] we study the simpler inverse problem of signal denoising with generative networks. The aim of [9] is to explain the phenomenon illustrated in Figure 1, i.e. given y = G(x∗) + η a noisy signal then one can denoise by finding (2)z∗= arg minkG(z) − yk2 z and the optimization problem (2) can be solved by local methods like gradient descent (i.e. (2) has no spurious critical points). We consider a simpler model for a generative model inspired by neural networks. One layer of the SUNLayer (spherical uniform neural layer) is defined as Ln: Sn→ L2(Sn) x7→ ρ(hx, ·i), 738Oberwolfach Report 14/2018 where ρ is an arbitrary activation function. We aim to answer what properties of activation functions allow denoising with local methods in this simplified model. Consider the decomposition of ρ in the Gegenbauer polynomials (choosing theP correct normalization will be important). If ρ(t) =P∞k=0akϕk,n(t), we define gρ(t) =∞k=0a2kϕk,n(t). Then our main result shows that the critical points of (2) are close to±x∗if inft∈[−1,1]|g′ρ(t)| is not too small in comparison with kηk. 3. Spherical harmonics Squaring the coefficients in the Gegenbauer decomposition may look a priori mysterious. However, it shows up naturally due to the nice properties of the spherical harmonics. Consider the simplified setting when there is no noise. We have arg minkLn(z)− L(x∗)k2= arg minkLn(z)k2+kLn(x∗)k2− 2hLn(z), Ln(x∗)i zz R andkLn(x)k2=Snρ(hx, yi)2dy = cρ,nindependent of x. Now we use the relationship between the Gegenbauer polynomials and the spherical harmonics (see chapter 2 of [10]). Decompose L2(Sn) =⊕∞k=0Hkn(Sn) whereHkn(Sn) are the spherical harmonics (homogeneous polynomials in n + 1 variables, of degree k, with Laplacian 0, restricted to the sphere). In particularHkn(Sn) is a finite dimensional vector space. LetPY1, . . . Yr a basis of Hkn(Sn), then define the bilinear r form Fk(σ, τ ) =s=1Ys(σ)Ys(τ ) for σ, τ∈ Sn. It turns out Fkis a reproducing kernel:hFk(x,·), Fk(y,·)i = Fk(x, y) and it also satisfies that Fk(x, y) only depends on the inner product t :=hx, yi and in fact Fk(x, y) = ϕk,n(t). Then hLn(z), Ln(x∗)i = hρ(hz, ·i), ρ(hx∗,·i)i X∞X∞ =hakϕk,n(hz, ·i),akϕk,n(hx∗,·i)i k=0k=0 X∞X∞ =hakFk(z,·),akFk(hx∗,·)i k=0k=0 X∞ =a2khFk(z,·), Fk(hx∗,·)i k=0 X∞ =a2kϕn,k(hz, x∗i). k=0 Therefore X∞ zzzkϕn,k(hz, x∗i) k=0 and a simple computation shows that the only critical points are z =±x∗if g′ρ(t) > 0 for all t∈ [−1, 1]. The analysis for the noisy case we do in [9] is still simple but more interesting since it involves considering tight frames inHnk(Sn). Applied Harmonic Analysis and Data Processing739 References
[18] Berthelot, David, Tom Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017).
[19] Bora, Ashish, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208 (2017).
[20] Candes, Emmanuel J., Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics 59, no. 8 (2006): 1207-1223.
[21] Daubechies, Ingrid. Ten lectures on wavelets. Vol. 61. Siam, 1992.
[22] Donoho, David L., and Jain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, no. 3 (1994): 425-455.
[23] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, (2014): 2672-2680.
[24] Hand, Paul, and Vladislav Voroninski. Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk. arXiv preprint arXiv:1705.07576 (2017).
[25] LeCun,Yann.The MNIST database of handwritten digits. http://yann.lecun. com/exdb/mnist/ (1998).
[26] Mixon, Dustin G., and Soledad Villar. SUNLayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319 (2018).
[27] Morimoto, Mitsuo. Analytic functionals on the sphere and their Fourier-Borel transformations. Complex Analysis Banach Center Publications 11 (1983).
[28] Rudin, Leonid I., Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60, no. 1-4 (1992): 259-268. Random Initialization in Nonconvex Phase Retrieval Yuxin Chen (joint work with Cong Ma, Yuejie Chi, Jianqing Fan) Suppose we are interested in learning an unknown object x♮∈ Rn, but only have access to a few quadratic equations of the form 2 (1)yi= a⊤ix♮,1≤ i ≤ m, where yiis the sample we collect and aiis the design vector known a priori. Is it feasible to reconstruct x♮in an accurate and efficient manner? The problem of solving systems of quadratic equations (1) spans multiple domains including physical sciences and machine learning. A natural strategy for inverting the system of quadratic equations (1) is to solve the following nonconvex least squares estimation problem 1Xh2i2 4mix− yi. i=1 i.i.d. Under Gaussian designs where ai∼ N (0, In), the solution to (2) is known to be exact — up to some global sign — with high probability, as soon as the number m of equations (samples) exceeds the order of the number n of unknowns. However, the loss function in (2) is highly nonconvex, thus resulting in severe computational challenges. Fortunately, in spite of nonconvexity, a variety of optimization-based 740Oberwolfach Report 14/2018 methods are shown to be effective in the presence of proper statistical models. Arguably, one of the simplest algorithms for solving (2) is vanilla gradient descent (GD), which attempts recovery via the update rule  (3)xt+1= xt− ηt∇f xt,t = 0, 1,· · · with ηtbeing the stepsize / learning rate. The above iterative procedure is also dubbed Wirtinger flow for phase retrieval, which can accommodate the complexvalued case as well. This simple algorithm is remarkably efficient under Gaussian designs: in conjunction with carefully-designed initialization and stepsize rules, GD provably converges to the truth x♮at a linear rate1, provided that the ratio m/n of the number of equations to the number of unknowns exceeds some logarithmic factor. One crucial element in prior convergence analysis is initialization. In order to guarantee linear convergence, prior works typically recommend spectral initialization or its variants. Two important features are worth emphasizing: • x0falls within a local ℓ2-ball surrounding x♮with a reasonably small radius, where f (·) enjoys strong convexity; • x0is incoherent with all the design vectorsai — in the sense that |a⊤ix0| is reasonably small for all 1≤ i ≤ m — and hence x0falls within a region where f (·) enjoys desired smoothness conditions. These two properties taken collectively allow gradient descent to converge rapidly from the very beginning. The enormous success of spectral initialization gives rise to a curious question: is carefully-designed initialization necessary for achieving fast convergence? A strategy that practitioners often like to employ is to initialize GD randomly. The advantage is clear: compared with spectral methods, random initialization is model-agnostic and is usually more robust vis-a-vis model mismatch. Despite its wide use in practice, however, GD with random initialization is poorly understood in theory. In this work, we prove that under Gaussian designs (i.e. aii.i.d.∼ N (0, In)), gradient descent — when randomly initialized — yields an ǫ-accurate solution in O log n + log(1/ǫ)iterations given nearly minimal samples (up to some logarithmic factor), thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data. 1 An iterative algorithm is said to enjoy linear convergence if the iterates xt converge geometrically fast to the minimizer x♮. Applied Harmonic Analysis and Data Processing741 Optimal weighted least squares approximations Albert Cohen (joint work with Benjamin Arras, Markus Bachmayr and Giovanni Migliorati) We consider the problem of approximating an unknown function u∈ L2(X, ρ) from its evaluation at given sampling points x1, . . . , xn∈ X, where X ⊂ Rdis a general domain and ρ a probability measure. The approximation ˜u is picked in a linear space Vmwhere m = dim(Vm). We measure accuracy in the Hilbertian norm Z1/2 kvk =|v(x)|2dρ=kvkL2(X,ρ), X where ρ is a probability measure over X. The error of best approximation is defined by em(u) := minku − vk, v∈Vm The method is said to be near-optimal (or instance optimal with constant C) if the comparison ku − ˜uk ≤ Cem(u), holds for all u, where C > 1 is some fixed constant. For a given probability measure ρ and approximation space Vmof interest, a relevant question is whether instance optimality can be achieved with sample size n that is moderate, ideally linear in m. Recent results of [3, 5] for polynomial spaces and [2] in a general approximation setting, show that this objective can be achieved by certain random sampling schemes in the general framework of weighted least squares methods. The approximation ˜u is defined as the solution to n 1X minw(xi)|yi− v(xi)|2, v∈Vmn i=1 where w is a positive function and the xiare independently drawn according to a probability measure µ, that satisfy the constraint w dµ = dρ. The case w = 1 and µ = ρ corresponds to the standard unweighted least squares method. We denote byk · knthe discrete Euclidean norm defined by 1Xn kvk2n:=w(xi)|v(xi)|2, n i=1 and byh·, ·inthe associated inner product. The solution ˜u may be thought of as an orthogonal projection of u onto Vmfor this norm. Expanding it into Xm u =˜cjϕj, j=1 742Oberwolfach Report 14/2018 in a basisϕ1, . . . , ϕm of Vm, the coefficient vector c = (c1, . . . , cm)Tis solution to the linear system Gc = d, where G is the Gramian matrix for the inner producth·, ·inwith entries n 1X Gj,k:=hϕj, ϕkin=w(xi)ϕj(xi)ϕk(xi), n i=1 Pn and the vector d has entries dk=n1i=1yiϕk(xi). The solution c always exists and is unique when G is invertible. Whenϕ1, . . . , ϕm is an L2(X, ρ)-orthonormal basis of Vm, one has E(G) = I. The stability and accuracy analysis of the weighted least squares method is related to the amount of deviation between G and its expectation I measured in the spectral norm. This deviation also describe the closeness of the normsk · k and k · knover the space Vm, since one has kG − Ik ≤ δ⇐⇒(1− δ)kvk2≤ kvk2n≤ (1 + δ)kvk2,v∈ Vm. The choice of a sampling measure µ that differs from the error norm measure ρ appears to be critical in order to obtain stable and accurate approximations with an optimal sampling budget. The optimal sampling measure and weights are given by kmm µm=dρand wm=, mkm where kmis the so-called Christoffel function defined by Xm km(x) =|ϕj(x)|2, j=1 withϕ1, . . . , ϕm any L2(X, ρ)-orthonormal basis of Vm. With such choices, the following result can be established, see [2, 1]. Theorem 1. With the above choice µmof sampling measure, for any 0 < ε < 1, the condition n≥ cm(ln(2m) − ln(ε)),c := γ−1=2, 1− ln 2 implies the following stability and instance optimality properties: 1 PrkG − Ik ≥≤ ε. 2 and E(ku − ˜uk2)≤1 +ce ln(2m)− ln(ε)m(u)2+ εkuk2. Applied Harmonic Analysis and Data Processing743 In summary, when using the optimal sampling measure µm, stability and instance optimality can be achieved in the near linear regime n = n(m) = nε(m) :=⌈c m (ln(2m) − ln ε)⌉, where ε controls the probability of failure. In various practical applications, the space Vmis picked within a family (Vm)m≥1 that has the nestedness property V1⊂ V2⊂ · · · and accuracy is improved by raising the dimension m. The sequence (Vm)m≥1may either be a priori defined, or adaptively generated, which means that the way Vm is refined into Vm+1may depend on the result of the least squares computation. In this setting, we are facing the difficulty that the optimal measure µmvaries with m. In order to maintain an optimal sampling budget, one should avoid the option of drawing a new sample Sm=x1m, . . . , xnm of increasing size n = n(m) at each step m. For this purpose, we observe that the optimal measure µmenjoys the mixture property 11 µm+1=1−µm+σm+1,wheredσm:=|ϕm|2dρ. m + 1m + 1 As noticed in [4], this leads naturally to sequencial sampling strategies where the sample Smis recycled for generating Sm+1. Here is one instance of such a strategy that was studied in [1]. Algorithm 1 Sequential sampling input: sample Smfrom µm output: sample Sm+1from µm+1 for i = 1, . . . , n(m) do draw aiuniformly distributed in1, . . . , m + 1 if ai= m + 1 then draw xim+1from σm+1 else set xim+1:= xim end if end for for i = n(m) + 1, . . . , n(m + 1) do draw xim+1from µm+1. end for The interest of this sequencial sampling strategy is that the total number of sample Cmwhich has been generated after m steps remains within the same order as the near optimal budget n(m). More precisely, the following result is established in [1]. 744Oberwolfach Report 14/2018 Theorem 2. For Algorithm 1, one has E(Cm)≤ n(m) + n(m − 1) + 1, and for any τ∈ [0, 1],  Pr Cm≥ n(m) + (1 + τ)(n(m − 1) + 1)≤ Mτe−τ 26n(m−1) 2cτ 2 with Mτ:= e3. It should be noted that these results are completely independent of the choice of the spaces (Vm)m≥1, as well as of the spatial dimension d of the domain X. One natural perspective is to develop adaptive least square methods in various context (wavelet refinements, high dimensional sparse polynomials) based on such sequencial sampling strategies. References
[29] B. Arras, M. Bachmayr and A. Cohen, Sequential sampling for optimal weighted least squares approximations in hierarchical spaces, preprint (2018).
[30] A. Cohen and G. Migliorati, Optimal weighted least squares methods, SMAI Journal of Computational Mathematics 3 (2017), 181–203.
[31] A. Doostan and J. Hampton, Coherence motivated sampling and convergence analysis of least squares polynomial Chaos regression, Computer Methods in Applied Mechanics and Engineering 290 (2015), 73-97.
[32] A. Doostan and J. Hampton, Basis Adaptive Sample Efficient Polynomial Chaos (BASEPC), preprint (2017), arXiv:1702.01185.
[33] J.D. Jakeman, A. Narayan, and T. Zhou, A Christoffel function weighted least squares algorithm for collocation approximations, preprint (2016), arXiv:1412.4305. On the Expressive Power of Deep Learning: A Tensor Analysis Nadav Cohen (joint work with Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, Amnon Shashua) It is widely accepted that the driving force behind convolutional networks, and deep learning in general, is the expressive power that comes with depth, i.e. the ability to compactly represent rich and effective spaces of functions through compositionality. Despite the vast empirical evidence supporting this belief, formal arguments to date are scarce. In particular, the machine learning community lacks satisfactory analyses of depth efficiency, a concept which refers to a situation where a deep network of polynomial size realizes a function that cannot be realized (or approximated) by a shallow network unless the latter has super-polynomial size. Moreover, even with a concrete understanding of depth efficiency, the mystery behind the expressive power of deep learning would still remain. Deep networks of polynomial size realize a small fraction of all possible functions, thus even if depth efficiency holds almost always, meaning the space of functions efficiently realizable by deep networks is much larger than that efficiently realizable by shallow networks, it still does not explain why deep networks are effective in practice. To Applied Harmonic Analysis and Data Processing745 address this question one must consider the inductive bias, i.e. the assumptions regarding functions required for real-world tasks that are implicitly encoded into deep networks. In the series of papers described hereafter, we derive an equivalence between convolutional networks and tensor decompositions, and use it to analyze, for the first time, the depth efficiency and inductive bias of convolutional networks. We begin in [3] by constructing a universal hypotheses space over instances defined as tuples of vectors, which in the context of images, corresponds to representation via local patches. The hypotheses space is constructed as a tensor product of finite-dimensional function spaces over the local structures. A general hypothesis may thus be expressed as a linear combination over an exponentially large basis of product functions, where the coefficients of the linear combination are naturally viewed as a high-order tensor (every mode in the tensor corresponds to a patch in the input). Naive computation of hypotheses is intractable, but by applying hierarchical tensor decompositions to coefficient tensors, efficient computation becomes possible. Moreover, circuits realizing the computations form a special case of convolutional networks. Namely, they are convolutional networks with linear activation and product pooling, and we accordingly refer to them as convolutional arithmetic circuits. The key observation is that there is a one-to-one correspondence between the type of decomposition applied to a coefficient tensor, and the structure of the convolutional arithmetic circuit computing the hypothesis (number of hidden layers, number of channels in each hidden layer, sizes and shapes of pooling windows etc.). This facilitates the study of networks through analysis of their corresponding tensor decompositions, bringing forth a plurality of mathematical tools from domains such as matrix algebra and measure theory. We show that classic CANDECOMP/PARAFAC (CP) decomposition corresponds to a shallow network with global pooling in its single hidden layer. The recently introduced Hierarchical Tucker (HT) decomposition corresponds to a deep network with multiple hidden layers, where the sizes of pooling windows (and the resulting network depth) depend on the structure of the mode tree underlying the decomposition. By analyzing tensors generated by CP and HT decompositions in terms of their ranks when subject to canonical matrix arrangements, we show that besides a set of Lebesgue measure zero, all weight settings for a deep network lead to depth efficient functions. That is to say, besides a negligible set, all functions realizable by a deep network of polynomial size cannot be realized (or approximated) by a shallow network unless the latter has super-polynomial size. Such result, which we refer to as complete depth efficiency, has never before been established for any deep learning architecture, convolutional networks in particular. Convolutional arithmetic circuits comprise the fundamental ingredients of convolutional networks – locality, weight sharing and pooling. We have implemented and evaluated such circuits (a.k.a. SimNets), showing that they deliver promising results on various visual recognition benchmarks [1, 2]. Nonetheless, they have yet 746Oberwolfach Report 14/2018 to reach widespread use, especially compared to convolutional rectifier networks – the most successful variant of convolutional networks to date. Convolutional rectifier networks are characterized by ReLU activation and max or average pooling. They do not posses the algebraic nature of convolutional arithmetic circuits, thus it is unclear to what extent our results from [3] apply to such networks. To facilitate an analysis of convolutional rectifier networks, we head on in [4] and define generalized tensor decompositions as constructs that are obtained by replacing the multiplication operator in standard tensor decompositions with a general associative and commutative operator g : R× R → R. Given convolutional networks with activation σ (e.g. σ(z) = maxz, 0 for ReLU) and pooling P (e.g. Pcii= maxciifor max), we define the activation-pooling operator gσ/P(a, b) := Pσ(a), σ(b). Apparently, if gσ/Pis associative and commutative, the generalized tensor decompositions it gives rise to are equivalent to convolutional networks with activation σ and pooling P , where again, there is a one-to-one correspondence between the type of a decomposition and the structure of its respective network. With convolutional rectifier networks the activation-pooling operator gσ/Pis indeed associative and commutative, thus the equivalence holds. We make use of it to analyze the expressive properties of such networks, and, surprisingly, find that in contrast to convolutional arithmetic circuits, with convolutional rectifier networks depth efficiency is not complete. There are still functions efficiently realizable by deep networks and not by shallow ones, but these are not as common – the set of functions in a deep network’s hypotheses space which can be realized (or approximated) by polynomially-sized shallow networks is nonnegligible (has positive Lebesgue measure in the deep network’s weight space). We interpret this result as indicating that in terms of expressiveness, the popular convolutional rectifier networks are inferior to the recently introduced convolutional arithmetic circuits. Of course, to take advantage of a machine learning model, it is not enough for it to be expressive, we have to be able to effectively train it as well. Over the years, a huge body of empirical research has been devoted to training convolutional rectifier networks. We conjecture that directing similar efforts into training convolutional arithmetic circuits, thereby fulfilling their expressive potential, may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks yet has so far been overlooked by practitioners. As discussed in the beginning of this section, depth efficiency alone does not unravel the mystery behind the expressive power of deep convolutional networks. For a complete understanding of the latter, one must consider the inductive bias, i.e. the properties of functions realized by polynomially-sized deep networks, and their suitability for real-world tasks. This is the purpose of the work in [5], where we study the ability of convolutional arithmetic circuits to model correlations among regions of their input. Correlations are formalized through the notion of separation rank, which for a given input partition, measures how far a function is from being separable. We show that a polynomially-sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network’s pooling geometry Applied Harmonic Analysis and Data Processing747 effectively determines which input partitions are favored, thus serves as a means for controlling the inductive bias. Contiguous pooling windows as commonly employed in practice favor interleaved (entangled) partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. Other pooling geometries lead to different preferences, and this allows tailoring convolutional networks for new types of data that depart from the usual domain of natural imagery. We validate this empirically with both convolutional arithmetic circuits and convolutional rectifier networks, showing that for image processing tasks of a local nature, such as characterization of shape continuity, standard contiguous pooling is optimal. On the other hand, for tasks such as symmetry detection, where modeling correlations between distinct input regions is important, scattered pooling geometries lead to better performance. The prescription for tailoring a network to model correlations needed for a given task, is an exemplar of how our theory, developed to address fundamental questions regarding the expressiveness of convolutional networks, also brings forth new capabilities to their application in practice. We take this further in the next section, where I discuss two works leveraging our theory for designing new types of networks with novel capabilities and improved performance. Practical Applications. Convolutional arithmetic circuits, born by our construction of a universal hypotheses space equipped with hierarchical tensor decompositions, are in fact closely related to probabilistic generative models. Namely, if the weights of each filter are constrained to lie on the simplex (non-negative and sum to one), the computation carried out by a convolutional arithmetic circuit produces the likelihood of the input under a universal high-dimensional generative model. As opposed to other generative methods recently considered in the literature (e.g. generative adversarial networks or variational models), our model admits tractable inference (computation of likelihood), and more importantly, tractable marginalization. This allows for previously infeasible capabilities such as classification under missing data where the missingness distribution at test time is unknown. We demonstrate this on image recognition benchmarks in [6]. The equivalence we established between convolutional networks and tensor decompositions applies in particular to dilated convolutional networks – a newly introduced variant that provides state of the art performance in audio and text processing tasks. With dilated convolutional networks, the choice of dilations throughout a network corresponds to determination of the mode tree underlying the respective decomposition. We utilize this in [7], and introduce the notion of a mixed tensor decomposition, blending together multiple mode trees. Mixed tensor decompositions correspond to mixed dilated convolutional networks, formed by interconnecting hidden layers of networks with different dilations. We show that mixing decompositions allows representation of tensors much more efficiently than what would have been possible without the mixture, and by this prove that interconnecting dilated convolutional networks brings forth a boost to their expressiveness. Empirical evaluations demonstrate that this translates to significant gains in accuracy. 748Oberwolfach Report 14/2018 Future Work. There are various promising avenues for future research. One direction we are investigating is an extension of the equivalence with tensor decompositions beyond convolutional networks. Specifically, we have recently learned that a decomposition named Tensor Train (TT) can be viewed as equivalent to recurrent neural networks, opening the door to analyzing the expressive properties of the latter, as well as comparing them to convolutional networks. An additional path we are exploring is the relation between our theory and that of tensor networks in quantum mechanics. The latter was developed before hierarchical tensor decompositions were introduced, leading us to believe that it may be possible to suggest more efficient algorithms than those used today, and perhaps even shed some light on physical principles. Finally, in the longer term, I am interested in leveraging the equivalence between convolutional networks and tensor decompositions for addressing the fundamental theoretical questions beyond expressiveness – optimization and generalization. References
[34] N. Cohen and A. Shashua, SimNets: A Generalization of Convolutional Networks, Advances in Neural Information Processing Systems (NIPS), Workshop on Deep Learning and Representation Learning, 2014.
[35] N. Cohen, O. Sharir and A. Shashua, Deep SimNets, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[36] N. Cohen, O. Sharir and A. Shashua, On the Expressive Power of Deep Learning: A Tensor Analysis, Conference On Learning Theory (COLT), 2016.
[37] N. Cohen and A. Shashua, Convolutional Rectifier Networks as Generalized Tensor Decompositions, International Conference on Machine Learning (ICML), 2016.
[38] N. Cohen and A. Shashua, Inductive Bias of Deep Convolutional Networks through Pooling Geometry, International Conference on Learning Representations (ICLR), 2017.
[39] O. Sharir, R. Tamari, N. Cohen and A. Shashua, Tensorial Mixture Models, arXiv preprint, 2017.
[40] N. Cohen, R. Tamari and A. Shashua, Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions, International Conference on Learning Representations (ICLR), 2018. Robust one-bit compressed sensing with non-Gaussian matrices Sjoerd Dirksen (joint work with Hans Christian Jung, Shahar Mendelson, Holger Rauhut) The theory of compressed sensing predicts that one can reconstruct signals from a small number of linear measurements using efficient algorithms, by exploiting the empirical fact that many real-world signals possess a sparse representation. In the traditional compressed sensing literature, it is typically assumed that one can reconstruct a signal based on its analog linear measurements. In a realistic sensing scenario, measurements need to be quantized to a finite number of bits before they can be transmitted, stored, and processed. Formally, this means that one needs to reconstruct a sparse signal x based on non-linear observations of the Applied Harmonic Analysis and Data Processing749 form y = Q(Ax), where Q : Rm→ Amis a quantizer andA denotes a finite quantization alphabet. We consider the one-bit compressed sensing model, which was first studied by Boufounos and Baraniuk [3]. In this model one observes (1)y = sign(Ax + τ ), where A∈ Rm×N, m≪ N, sign denotes the sign function applied entry-wise and τ∈ Rmis a vector consisting of thresholds. Especially interesting is the memoryless one-bit quantization model, in which every linear measurement is quantized independently of the other measurements. The memoryless one-bit quantizer is attractive from a practical point of view, as it can be implemented using an energyefficient comparator to a fixed voltage level (if the thresholds τiare all equal to a fixed constant) combined with dithering (if τ is random). In the original work [3] all thresholds were taken equal to zero. In this scenario, the energy of the original signal is lost during quantization and one can only hope to recover its direction. There is by now a rich theory available for one-bit compressed sensing with standard Gaussian matrices. For instance, it is known that with high probability one can accurately reconstruct the direction of any (approximately) sparse signal via a tractable convex program, even if a fraction of the bits is corrupted at the quantizer in an adversarial manner [8]. This result is valid if the number of onebit measurements m scales in terms of the signal sparsity s as m≥ Cs log(n/s), which is the optimal scaling known from ‘unquantized’ compressed sensing. More recently, it has been shown to be possible to efficiently reconstruct the complete signal by using Gaussian thresholds, provided that one knows an a-priori bound on the energy of the signal [2, 7]. Although these results are very interesting from a mathematical perspective, their practical value is limited by the fact that Gaussian matrices cannot be realized in a real-world measurement setup. It is therefore of substantial interest to extend the known theory to non-Gaussian matrices. This is a non-trivial task, as there exist measurement matrices that perform optimally in unquantized compressed sensing, but may fail if one-bit quantization is used. For instance, as was pointed out in [1], if A is a Bernoulli matrix and τ = 0, then there already exist 2-sparse vectors that cannot be reconstructed based on their one-bit measurements, regardless of how many measurements we take. Still, [1] established a positive recovery result for subgaussian measurement matrices which shows that one can reconstruct a sparse signal x up to accuracy (at most)kxk1/4∞. Informally, this means that one can still hope to recover the signal if it is sparse, but not too sparse. In joint work with H.C. Jung and H. Rauhut [4], we establish the first rigorous reconstruction guarantees for memoryless one-bit compressed sensing with a structured random matrix. We investigate a randomly subsampled Gaussian circulant matrix, a measurement model that is relevant for several applications such as SAR radar imaging, Fourier optical imaging and channel estimation (see e.g.
[41] and the references therein). In contrast to [1], the main results of [4] impose a small sparsity assumption. Under this assumption, we establish guarantees for 750Oberwolfach Report 14/2018 the accurate recovery of the direction of any s-sparse or effectively sparse vector using a single hard thresholding step or a linear program, respectively, in the case that the threshold vector τ is zero. Our analysis relies on work of S. Foucart
[42] , who observed that recovery results for these two reconstruction methods can be obtained by showing that the matrix A satisfies an ℓ1/ℓ2-restricted isometry property. By taking τ to be an appropriately scaled Gaussian vector, one can fully recover effectively sparse signals via a second order cone program, provided that an upper bound on their energy is known. The works [1, 4] give the impression that in a non-Gaussian context one needs additional restrictions on the signal in order to accurately recover from one-bit measurements. In very recent joint work with S. Mendelson [5], we show that this impression is misleading. We prove that if one chooses the random threshold vector τ appropriately, then one can accurately reconstruct signals from general low-complexity sets based on their subgaussian, or even heavy-tailed, one-bit measurements. In the special case of sparse signals we additionally prove recovery results for randomly subsampled circulant matrices generated by a subgaussian vector, without any restriction on the sparsity level. In addition, our recovery results strongly improve over [1, 4] in terms of robustness: recovery is stable in the presence of adversarial bit corruptions in the quantization process, as well as heavy-tailed noise on the analog measurements. In the case of subgaussian and randomly subsampled subgaussian circulant matrices, robust recovery can be achieved via a convex (and, for many signal sets, tractable) recovery program. References
[43] A. Ai, A. Lapanowski, Y. Plan, and R. Vershynin. One-bit compressed sensing with nonGaussian measurements. Linear Algebra Appl., 441:222–239, 2014.
[44] R. G. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
[45] P. T. Boufounos and R. G. Baraniuk. 1-bit compressive sensing. In 2008 42nd Annual Conference on Information Sciences and Systems, pages 16–21. IEEE, 2008.
[46] S. Dirksen, H. C. Jung, and H. Rauhut. One-bit compressed sensing with Gaussian circulant matrices. ArXiv:1710.03287, 2017.
[47] S. Dirksen and S. Mendelson. Robust one-bit compressed sensing with non-Gaussian measurements. in preparation, 2018.
[48] S. Foucart. Flavors of Compressive Sensing, pages 61–104. Springer International Publishing, Cham, 2017.
[49] K. Knudson, R. Saab, and R. Ward. One-bit compressive sensing with norm estimation. IEEE Trans. Inform. Theory, 62(5):2748–2758, 2016.
[50] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Trans. Inform. Theory, 59(1):482–494, 2013.
[51] J. Romberg. Compressive sensing by random convolution. SIAM J. Imaging Sci., 2(4):1098– 1128, 2009. Applied Harmonic Analysis and Data Processing751 Transports on manifolds in data analysis Ronen Talmon (joint work with Or Yair, Mirela Ben-Chen) noN1noN2 Consider two subsetsx(1)i(t)andx(2)i(t), where x(k)i(t)∈ RD, of N1 i=1i=1 and N2high-dimensional time series, respectively. Assume that each subset is obtained from the same acquisition system in a particular session, deployment, and set of environmental conditions. In our notation, the superscript denotes the index of the subset, the subscript i denotes the index of the time-series within each subset, and t represents the time axis of each time-series. Our exposition focuses only on two subsets, but generalization to any number of subsets is straightforward. Additionally, we consider here time-series, but our derivation does not take the temporal order into account, and therefore, extension to other types of data, e.g., images, is immediate, where t could be just an index of a sample. Analyzing such data typically raises many challenges. For example, one notable problem is how to efficiently compare between high-dimensional point clouds, and particularly, time-series. When the data are real measured signals, sample comparisons become even more challenging, since such high-dimensional measured data usually contain high levels of noise. In particular, in our setting, we face an additional challenge, since the data is given in two separate subsets. Comparing time-series from the same subset is a difficult task by itself, even more so is comparing time-series from two different subsets. Our goal in this work is to find a new joint representation of the two subsets in an unsupervised manner. Broadly, we aim to devise a low-dimensional representation in a Euclidean space that facilitates efficient and meaningful comparisons. As in many unsupervised tasks, this general description of the goal is not well-defined. To make our objective more concrete, we associate each time-series x(1)i(t) with a (1)(2)(2) label yiand xi(t) with a label yi, and define “meaningful” comparisons with respect to these labels. Namely, we design the new representation such that the Euclidean distance between the new representations of any two time-series with similar corresponding labels is small, independently of the time-series respective trial, and particularly, subset. To construct such a representation, we propose an approach that computes covariance matrices as data features, and then employs parallel transport on the manifold of symmetric positive-definite matrices [1, 2, 3]. Based on our new representation, we devise efficient and accurate solutions for transfer learning and domain adaptation, which are long-standing problems in noN1 data analysis. Specifically, given a subsetx(1)i(t)with corresponding labels i=1 n(1)oN1 yi, we train a classifier on the new derived representation of the subset. i=1 noN1 Then, when another unlabeled subsetx(2)i(t)becomes available, we apply i=1 the trained classifier to the derived (joint) representation. 752Oberwolfach Report 14/2018 To put the problem setting and our proposed solution in context, we will use an illustrativeexample,takenfromarecentcompetition (http://www.bbci.de/competition/iv/). Consider data from a brain computer interface (BCI) experiment of motor imagery comprising D Electroencephalography (EEG) recordings. In this experiment, several subjects were asked to repeatedly perform one out of four motor imagery tasks (to raise their right hand, left noN1 hand, or feet, or to move their tongue), Letx(1)i(t)be a subset of recordings i=1 acquired from a single subject, indexed (1), where the time-series x(1)i(t) consists of the signals, recorded simultaneously from the D EEG channels during the ith trial. Each time series x(1)i(t) is attached with a label yi(1), denoting the imagery task performed at the the ith trial. Common practice is to train a classifier based on noN1 x(1)i(t)and yi(1), so that the imagery task could be identified from new EEG i=1 recordings. This capability could then be the basis for devising brain computer interfaces, for example, to control prosthetics. noN1 Suppose that a new subsetx(2)i(t)of recordings acquired from another i=1 subject, indexed (2), becomes available. Applying the classifier, trained based on data from subject (1), to the new subset of recordings from subject (2) typically yields poor results, as we demonstrate in our study. Indeed, to the best of our knowledge, all methods addressing this particular dataset, e.g., [4], as well as other related problems, exclusively analyze data from each individual subject separately. noN1noN1 By constructing a joint representation for bothx(1)i(t)andx(2)i(t), i=1i=1 which is oblivious to the specific subject, we are able to build a classifier that is trained on data from one subject and applied to data from another subject without any calibration, i.e., without any labeled data from the new (test) subject. References
[52] X. Pennec, P. Fillard, and N. Ayache, A riemannian framework for tensor computing, International Journal of computer vision, vol. 66, no. 1, pp. 41–66, 2006.
[53] R. Bhatia, Positive definite matrices.Princeton university press, 2009.
[54] S. Sra and R. Hosseini, Conic geometric optimization on the manifold of positive definite matrices, SIAM Journal on Optimization, vol. 25, no. 1, pp. 713–739, 2015.
[55] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, Classification of covariance matrices using a riemannian-based kernel for bci applications, Neurocomputing, vol. 112, pp. 172–178, 2013. Applied Harmonic Analysis and Data Processing753 Deep networks: engineered, trained, or randomized? R´emi Gribonval Many of the data analysis and processing pipelines that have been carefully engineered by generations of mathematicians and practitioners can in fact be implemented as deep networks. Allowing the parameters of these networks to be automatically trained allows to revisit certain constructions. The talk first describes an empirical approach to approximate a given matrix by a fast linear transform through numerical optimization [1]. The main idea is to write fast linear transforms as products of few sparse factors, and to iteratively optimize over the factors. This corresponds to training a linear multilayer neural network with sparse connections. Algorithms exploiting iterative hard-thresholding projections have been shown to perform well in practice. Yet, developing a solid understanding of their conditions of success remains an open mathematical question. In a second part, the talk outlines the main features of a recent framework for large-scale learning called compressive statistical learning [2]. Inspired by compressive sensing, the framework allows drastic volume and dimension reduction to learn from large/distributed/streamed data collections. Its principle is to compute a low-dimensional (nonlinear) sketch (a vector of random empirical generalized moments), in essentially one pass on the training collection. For certain learning problems such as clustering [3], small sketches have been shown to capture the information relevant to the considered learning task, and empirical learning algorithms have been proposed to learn from such sketches. As a proof of concept, more than a thousands hours of speech recordings can be distilled to a sketch of only a few kilo-bytes, while capturing enough information estimate a Gaussian Mixture Model for speaker verification [4]. The framework, which is endowed with statistical guarantees in terms of learning error, is illustrated on sketched clustering, and sketched PCA, using empirical algorithms inspired by sparse recovery algorithms used in compressive sensing. The promises of the framework in terms of privacy-aware learning are discussed, as well as its connections with information preservation along pooling layers of certain convolutional neural networks with random weights. To conclude the talk, we describe ongoing work [5] providing definitions and some characterizations of the approximation spaces [6] of deep networks and their relations with classical function spaces. Of particular interest is the role of the so-called activation function, and that of the depth of the considered networks. References
[56] L. Le Magoarou, R. Gribonval, Flexible Multi-layer Sparse Approximations of Matrices and Applications, IEEE Journal of Selected Topics in Signal Processing 10 (4) (2016).
[57] R. Gribonval, G. Blanchard, N. Keriven, Y. Traonmilin, Compressive Statistical Learning with Random Feature Moments, preprint, 2017.
[58] N. Keriven, N. Tremblay, Y. Traonmilin, R. Gribonval, Compressive K-means, International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). 754Oberwolfach Report 14/2018
[59] N. Keriven, A. Bourrier, R. Gribonval, P. Perez, Sketching for Large-Scale Learning of Mixture Models, Information and Inference, 2017
[60] R. Gribonval, G. Kutyniok, M. Nielsen, F. Voigl¨ander, Approximation spaces of deep neural networks, working draft.
[61] R. A. DeVore, G. G. Lorentz, Constructive approximation, 1993, Springer-Verlag. Approximation Properties of Deep ReLU Networks Felix Voigtlaender (joint work with Philipp Petersen) In the area of machine learning, deep learning methods have dramatically improved the state-of-the-art in many classification problems like visual object recognition
[62] . The general goal of machine learning is to find a good approximation f∗to an unknown ground-truth (classifier) function f that can only be observed through known samples (xi, f (xi))i=1,...,N. In the case of deep learning this is achieved by insisting that f∗is implemented by a neural network Φ = Φawhich is parametrized by its weights a∈ RK. To determine these weights, one applies a form of stochastic gradient descent in order to minimize a loss function L that is defined in terms of the samples (xi, f (xi))i=1,...,N. For details we refer to [5, 9]. Despite their incredible performance in applications, a theoretical explanation for this success of deep learning methods is still missing. In this abstract, we present recent results concerning the expressive power of neural networks. In particular, our results partially explain why deep networks tend to perform better than shallow ones, as is observed in practice [5]. We emphasize that we are only interested in the existence of a network Φfε approximating a given ground-truth classifier function f up to error ε. We do not address the practically important question of how one can find such a network, much less if one is only given samples of f . 1. Classical results A neural network Φ : RN0→ RNLcomputes its output by alternatingly applying affine-linear maps and a non-linear activation function ̺ : R→ R; thus Φ(x) = TL(̺(TL−1(. . . ̺(T1(x)) . . . )))for x∈ RN0, where L∈ N denotes the depth of the network, and where each Tℓ: RNℓ−1→ RNℓ is affine-linear, say Tℓ= Aℓ• + bℓ. Note that ̺ is applied componentwise. If we want to emphasize the choice of ̺, we say that Φ is a ̺-network. Observe that a 1-layer network is simply an affine-linear map, while a 2-layerP network is a linear combination of ridge functions, i.e., Φ(x) =Ki=1̺(hx, aii+bi). The number of neurons and the number of weights of Φ are, respectively, XLXL N (Φ) =Nℓand W (Φ) =(kAℓkℓ0+kbℓkℓ0) . ℓ=0ℓ=1 Applied Harmonic Analysis and Data Processing755 For later use, we recall the notion of sigmoidal activation functions: A (measurable, locally bounded) function ̺ : R→ R is sigmoidal of order k ∈ N0if limx→∞̺(x)xk= 1 and limx→−∞̺(x)xk= 0. The so-called ReLU (rectified linear unit) activation function ̺0(x) := x+is sigmoidal of order 1. The expressiveness of neural networks is a well-studied subject [1, 2, 3, 4, 6, 7]. In particular, the following universal approximation theorem seems to settle— at first glance—the question of the expressiveness of neural networks: Theorem (cf. [6, Theorem 1]) If ̺ is continuous but not a polynomial and K⊂ Rd is compact, then the family of 2-layer neural networks is dense in C(K). Note, however, that the theorem does not yield any bounds on the complexity of the networks Φfεthat are used to approximate f up to error ε. Deriving such bounds for ReLU networks under suitable assumptions on f is our main goal. For certain other types of activation functions, such bounds are classical; for example: 1) In [7] it is shown that if ̺ is sigmoidal of order k≥ 2 and if f ∈ Cs([0, 1]d), thenkf −ΦnkL∞.n−s/dfor a ̺-network with N (Φn) = n and L(Φn) = L(d, s, k). 2) In [2] it is shown for order zero sigmoidal activation functions ̺ that if one assumes that the Fourier transform of f : Rd→ C has a finite first moment, then kf − Φnk2L2(µ).n−1for a 2-layer ̺-network Φnwith N (Φn) = n. Here, µ is a fixed (but arbitrary) probability measure on Rdwith compact support. 2. Results for ReLU networks Most classical results about the approximation properties of neural networks do not apply to ReLU networks. Since in practice the ReLU is the most widely used activation function [5], these networks received much attention in recent years. Yarotsky [13] showed for Sobolev functions f∈ Wk,∞([0, 1]d) that there are ReLU networks Φfεof size N (Φfε) . W (Φfε) . ε−d/kand depth L(Φfε) . ln(1/ε) satisfyingkf − ΦfεkL∞.ε. Thus, the depth of the networks Φfεtends to infinity if the approximation accuracy gets better. As far as we know, if one insists on L∞approximation with the same “network size to approximation error” relation as above, it is unknown whether one can avoid this growth of the depth. But for Lpapproximation with p <∞ the situation is different: Theorem ([8, Theorem A.9]) There is some c > 0 such that for any ε, p, β∈ (0, ∞) and f∈ Cβ([0, 1]d), we havekf − ΦfεkLp([0,1]d)≤ ε for a suitable ReLU network Φfεsatisfying N (Φfε) . W (Φfε) . ε−d/βand L(Φfε)≤ c · (1 + β/d) · log2(2 + β). The ground-truth classifier function f , however, usually has a discrete range; e.g. in a digit classification problem, we could have f : RN ×N→ −1, 0, . . . , 9, where−1 stands for “not a number”. Since such a function cannot be smooth, weP consider a different “toy-model” for classifiers f , namely f =Mi=1ai1Ki, where the sets Ki⊂ Rdare assumed to have a smooth boundary, say ∂Ki∈ Cβ. By using that locally—after a change of variables—the functions 1Kiare similar to 756Oberwolfach Report 14/2018 jumps along straight lines, and that ReLU networks can approximate such jumps well, e.g. by ε−1· (̺0(xi)− ̺0(xi− ε)), we proved the following result: PM Theorem ([8, Theorem 3.5]) There is c > 0 such that for any f =i=1ai1Kiwith ∂Ki∈ Cβand any ε > 0 there is a ReLU network Φfεwithkf − ΦfεkL2([0,1]d)≤ ε and N (Φfε) . W (Φfε) . ε−2(d−1)/βas well as L(Φfε)≤ c · log2(2 + β)· (1 + β/d). Using entropy arguments, we showed that the bound W (Φfε) . ε−2(d−1)/βis optimal, assuming that each of the weights of the network can be encoded with .ε−tbits for a fixed t > 0. We refer to [8, Section 4.1] for the details. Telgarsky [12, 11] observed for ReLU networks Φ : Rd→ R that the onedimensional restrictions R→ R, t 7→ Φ(ta + b) (with a, b ∈ Rd) are piecewise affine-linear with at most N (Φ)L(Φ)pieces, and used this to show that there are deep neural networks of size n that can only be approximated by shallow networks whose size is exponential in n. Utilizing this observation, we were able to prove the following lower bound for the approximation of non-linear smooth functions: Theorem ([8, Theorem 4.5], see [10, Theorem 4] for the case p = 2) Let ∅6= Ω ⊂ Rdbe open and connected and let f∈ C3(Ω) not affine-linear. Then there is a constant Cf> 0 such that for every p∈ [1, ∞],  kf − ΦkLp(Ω)≥ Cf· max(N (Φ)− 1)−2L(Φ),(W (Φ) + d)−2L(Φ). Overall, our results show that smoother functions allow for better approximation rates by ReLU networks; but to achieve these rates, deep networks are needed! References
[63] M. Anthony and P.L. Bartlett, Neural network learning: Theoretical foundations, Cambridge University Press, Cambridge, 1999.
[64] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory 39(3), (1993), 930–945.
[65] G. Cybenko, Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems 2(4) (1989), 303–314.
[66] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4(2) (1991), 251–257.
[67] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, (2015), 436–444.
[68] M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6, (1993), 861–867.
[69] H. N. Mhaskar, Approximation properties of a multilayered feedforward artificial neural network, Advances in Computational Mathematics 1(1), (1993), 61–80.
[70] P. Petersen and F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, arXiv preprints, arxiv.org/abs/1709.05289, 2017.
[71] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Learning representations by backpropagating errors, Nature 323, (1986), 533–536.
[72] I. Safran and O. Shamir, Depth-width tradeoffs in approximating natural functions with neural networks, Proceedings of Machine Learning Research, 70, (2017), 2979–2987. Applied Harmonic Analysis and Data Processing757
[73] M. Telgarsky, Representation benefits of deep feedforward networks, arXiv preprints, https://arxiv.org/abs/1509.08101, 2015.
[74] M. Telgarsky, Benefits of depth in neural networks, Conference on Learning Theory (COLT), (2016), 1517–1539.
[75] D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks 94 (2017), 103–114. Time evolving data, diffusion geometry, and randomized matrix decomposition Nicholas F. Marshall (joint work with Matthew J. Hirn [5] ) We describe how the geometry of time evolving data can be efficiently summarized using diffusion operators and randomized matrix decomposition. Suppose than an n× d × m tensor corresponding to n points in Rdmeasured over m times is given. For each n× d temporal slice Xiof the tensor we construct a diffusion operator Pi, following the diffusion maps framework [1], and study the product operator P(m):= PmPm−1· · · P1. We prove that this product operator approximates heat flow in a precise sense when a manifold with a time dependent metric is assumed to underlie the data. Furthermore, we generalize the notion of diffusion distance and diffusion maps to this time evolving setting. We observe that the singular value decomposition of the product operator P(m)can be efficiently computed by implementing each Pi as a sparse matrix, applying each matrix successively to a collection of random vectors, and then using the algorithm of Martinsson, Rokhlin, and Tygert [6]. This decomposition can in turn be used to compute a generalized diffusion map, which we call a time coupled diffusion map, that summarizes the geometry of the data tensor. We remark that other recent works in the diffusion geometry literature also consider embeddings defined via products of diffusion kernels, see for example Lederman and Talmon [3], or Lindenbaum, Yeredor, Salhov, and Averbuch [4]. Main result. Our main result establishes a connection between the product operator P(m)and heat flow on an assumed underlying manifold with a time dependent metric. The existence and uniqueness of the heat kernel H0ton a manifold with a time dependent Riemannian metric was established by Guenther [2]. In order to prove a convergence result, we introduce a dependence on a bandwidth parameter ε for our product operator and write P(m)= Pε(m). Recall that n is the number of spatial samples of the manifoldM, and that m is the number of temporal measurements. We assume that the underlying time interval [0, T ] is divided into m intervals [τi−1, τi) each of length ε where τ0= 0 and τm= T . For simplicity, we assume that our m measurements are taken at (τ1, . . . , τm). Our main result is that in the limit of large data, both spatially and temporally, the product operator Pε(⌈t/ε⌉)converges to the heat kernel: Pε(⌈t/ε⌉)→ H0tas n→ ∞ and ε → 0. 758Oberwolfach Report 14/2018 More precisely: Theorem. Suppose the isometric embeddingMτ⊂ Rdof a time dependent manifold (M, g(τ)) is measured at a common set X = xjnj=1⊂ M of n points at ε spaced units of time over a time interval [0, T ], so that, in particular, we have time samples (τi)mi=1⊂ [0, T ] with τi= i· ε and m = T/ε. Then, for any sufficiently smooth function f :M → R and t ≤ T , the heat kernel H0tcan be approximated by the operator Pε(⌈t/ε⌉):  Pε(⌈t/ε⌉)f (xj) = H0tf (xj) +O1, ε,x n1/2εd/4+1/2j∈ X. Time coupled diffusion distance. Let δjdenote a Dirac distribution centered at xj. We compare the points xjand xkby comparing the posterior distributions of δj⊤and δ⊤kunder the Markov operator P(m). More specifically, following [1] we define a diffusion based distance as the L2distance between these posterior distributions weighted by the reciprocal of the stationary distribution of the Markov chain. That is, we define the distance D(m)by D(m)(xj, xk) =kδ⊤jP(m)− δ⊤kP(m)kL2(1/π(m)), where π(m)is the stationary distribution of P(m), i.e., π(m)⊤= π(m)⊤P(m), and k · kL2(1/π(m))is the weighted L2norm: v uuXn kfkL2(1/π(m)):=tf (xj)2π1. j=1(m)(xj) Time coupled diffusion map. The product operator P(m)is not, in general, similar to a symmetric matrix (as in the standard diffusion maps framework [1]) so our definition of a diffusion map necessarily differs. First, we define the operator A(m)by A(m)= Π1/2(m)P(m)Π−1/2(m), where Π(m)denotes the matrix with the stationary distribution π(m)of P(m) along the diagonal and zeros elsewhere. Next, we compute the singular value decomposition (SVD) of A(m): A(m)= U(m)Σ(m)V(m)⊤, where U(m)is an orthogonal matrix of left singular vectors, Σ(m)is a diagonal matrix of corresponding singular values, and V(m)is an orthogonal matrix of right singular vectors. Define Ψ(m):= Π−1/2(m)U(m)Σ(m). Then it is easy to check (see [5]) that the embedding xj7→ δj⊤Ψ(m) Applied Harmonic Analysis and Data Processing759 of the data X into Euclidean space preserves the time coupled diffusion distance. That is to say, D(m)(xj, xk) =kδj⊤Ψ(m)− δk⊤Ψ(m)kL2. We refer to the embedding xj7→ δj⊤Ψ(m)as the time coupled diffusion map. References
[76] R. R. Coifman, S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (1) (2006) 5 -30.
[77] C. M. Guenther, The fundamental solution on manifolds with time dependent metrics, The Journal of Geometric Analysis 12 (3) (2002) 425–436.
[78] R. R. Lederman, R. Talmon, Common manifold learning using alternating-diffusion, Tech. rep., Yale (2014).
[79] O. Lindenbaum, A. Yeredor, M. Salhov, A. Averbuch, Multiview diffusion maps, arXiv:1508.05550 (2015).
[80] N. F. Marshall and M. J. Hirn. Time coupled diffusion maps. Applied and Computational Harmonic Analysis (2017).
[81] P. G. Martinsson, V. Rokhlin, M. Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis (2011). Morphing of Manifold-Valued Images Gabriele Steidl (joint work with Sebastian Neumayer, Johannes Persch) Smooth image transition, also known as image morphing, is a frequently addressed task in image processing and computer vision, and there are various approaches to tackle the problem. For example, in feature based morphing only specific features are mapped to each other and the whole deformation is then calculated by interpolation. This paper is related to a special kind of image morphing, the so-called metamorphosis introduced by Miller, Trouv´e and Younes [4, 5]. The metamorphosis model can be considered as an extension of the flow of diffeomorphism model and its large deformation diffeomorphic metric mapping framework in which each image pixel is transported along a trajectory determined by a diffeomorphism path. As an extension the metamorphosis model allows the variation of image intensities along trajectories of the pixels. This paper builds up on a time discrete geodesic paths model by Berkels, Effland and Rumpf [1], but considers images in L2(Ω,H), where Ω ⊂ Rn, n≥ 2, is an open, bounded connected domain with Lipschitz boundary andH a finite dimensional Hadamard manifold. Hadamard manifolds are simply connected, complete Riemannian manifolds with non-positive sectional curvature. Typical examples are hyperbolic spaces and symmetric positive definite matrices with the affine invariant metric. As an important fact we will use that the distance in Hadamard spaces is jointly convex which will imply weak lower semicontinuity of certain functionals involving the distance function. 760Oberwolfach Report 14/2018 K−1 We aim in finding a minimizing sequence I = (I1, . . . , IK−1)∈ L2(Ω, R) of the discrete path energy XKZZ J(I) :=infW (Dϕk(x)) + γ|Dmϕk(x)|2dx +1d ϕk∈AǫΩδΩ2(Ik◦ ϕk, Ik−1)2dx, k=1 (1)subject to I0= T, IK= R, where δ, γ > 0, d2denotes the distance in L2(Ω,H), n Aǫ:=ϕ ∈ Wm,2(Ω): det(Dϕ)≥ ǫ, ϕ(x) = x for x ∈ ∂Ω, m > 1 +n2 is an admissible set of deformations and the function W has to satisfy certain properties. A particular choice of W is given by the linearized elastic potential. An illustration is given in Fig. 1. We prove that a minimizer of (1) exists. Figure 1.Illustration of the time discrete morphing path. Dealing with digital images we have to introduce a space discrete model. We establish a finite difference model on a staggered grid together with a multiscale strategy. We have used this discretization already for gray-value images in [3]. For finding a minimizer, we also propose an alternating algorithm fixing either the deformation or the image sequence: i) For a fixed image sequence, we have to solve certain registration problems for manifold-valued images in parallel to get a sequence (ϕ1, . . . , ϕK) of deformations. Necessary interpolations were performed via Karcher means computation. ii) For a fixed deformation sequence, we need to find a minimizing image sequence (I1, . . . , IK−1) of K X d22(Ik◦ ϕk, Ik−1)subject toI0= T, IK= R k=1 where d2denotes the distance in L2(Ω,H). Fig. 2 shows a path obtained by our model for images with 3×3 positive definite matrices as entries. For more information we refer to [2]. Applied Harmonic Analysis and Data Processing761 Figure 2.Morphing path between a part of the YZ-slices 49 and 51 of the Camino dataset with SPD(3) matrices. References
[82] B. Berkels, A. Effland, and M. Rumpf. Time discrete geodesic paths in the space of images. SIAM Journal on Imaging Sciences 8(3) (2015), 1457–1488.
[83] J. Persch, F. Pierre, and G. Steidl. Exemplar-based face colorization using image morphing. Journal of Imaging 3(4) (2017) :ArtNum 48. 762Oberwolfach Report 14/2018
[84] J. Persch, F. Pierre, and G. Steidl. Morphing of manifold-valued images inspired by discrete geodesics in image spaces. SIAM Journal on Imaging Sciences, submitted.
[85] A. Trouv´e and L. Younes. Local geometry of deformable templates. SIAM Journal of Mathematical Analysis 37(2) (2005), 17–59.
[86] A. Trouv´e and L. Younes. Metamorphoses through Lie group action. Foundations in Computational Mathematics 5(2) (2005), 173–198. Solving linear Kolmogorov Equations by Means of Deep Learning Philipp Grohs 1. The Mathematical Learning Problem According to [1], the mathematical learning problem can be cast into the following form. Definition 1 (The Mathematical Learning Problem). Let K⊆ Rd, let (Ω,G, P) be a probability space and let X : Ω→ K and Y : Ω → Rnbe random vectors. For a Borel measurable function F : K→ Rndefine the least squares error of F with respect to (w.r.t.) X and Y by Z (1)E(X,Y )(F ) =kF (X) − Y k2RndP = EkF (X) − Y k2Rn∈ [0, ∞]. Ω The Mathematical Learning Problem asks for a function F which minimizes E(X,Y )(F ). This definition can be interpreted as the problem of finding the best functional relation between two random vectors X, Y where X may take the role of a data point and Y that of a label. Since the minimization ofE(X,Y )amounts to a quadratic minimization problem, the solution to the Mathematical Learning Problem of Definition 1 can be easily seen to be the conditional expectation F (x) = E(Yˆ|X = x). In practice, one does not know the distributions of (X, Y ) but one only has access to i.i.d. samples (xi, yi)mi=1∼ (X, Y ) from which one needs to estimate ˆF . A popular method to achieve this goal is Empirical Risk Minimization (ERM), which minimizes the empirical risk 1Xm (2)E(xi,yi)m(F ) :=(F (xi)− yi)2 i=1m i=1 over a hypothesis classH ⊂ C(Rd, Rn). The minimizer (which may be noni=1,H. Classical statistical learning theory, as for example presented in [1] provides an estimate on the error 2 E(X,Y )( ˆF(xi,yi)mi=1,H)− E(X,Y )( ˆF ) = ˆF(xi,yi)mi=1,H− ˆF. L2(K,dPX) Applied Harmonic Analysis and Data Processing763 2 These estimates involve bounds on the approximation error ˆFH− ˆF, L2(K,dPX) whereFˆH∈ argminF ∈HkF − ˆFk2L2(K,dPand the generalization error X) E(X,Y )( ˆF(xi,yi)mi=1,H)− E(X,Y )( ˆFH). The approximation error measures how well the hypothesis classH approximates the regression function ˆF and its estimation implicitly requires the knowledge of regularity properties of (X, Y ) which is in general not available for realistic learning problems. 2. Neural Network Hypothesis Classes In recent years spectacular successes have been achieved by using deep (artificial feedforward) neural networks as hypothesis classes [2]. These can be defined as follows. Definition 2. Let L, N0, . . . , NL∈ N. A neural network (NN) Φ with L layers is a finite sequence of matrix-vector tuples Φ := ((A1, b1), (A2, b2), . . . , (AL, bL))∈ ×Ll=1RNl×Nl−1× RNl. We refer to the sequence arch(Φ) := (N0, N1, . . . , NL) as the architecture of Φ and denote its input and output dimension by din(Φ) := N0 and dout(Φ) := NL, respectively. Suppose σ∈ C(R, R), then we define the realization of Φ with activation function σ as the map Rσ(Φ)∈ C(RN0, RNL) with Rσ(Φ)(x) = xL, where xLis given by the following scheme: x0:= x,xl:= σ(Alxl−1+ bl), for l∈ 1, . . . , L − 1,xL:= ALxL−1+ bL. Here, σ is understood component wise, i.e., σ(y) = (σ(y1), . . . , σ(ym)). Finally, we define size(Φ) :=PLj=1(kAjkℓ0+kbjkℓ0) and set Hσ(N0,...,NL):=Rσ(Φ) : arch(φ) = (N0, . . . , NL) ⊂ C(RN0, RNL). Examples of activation functions include the rectified linear unit ReLU(t) := (t)+or the sigmoidal function sig(t) = tanh(t/2). The resulting ERM problem (2) becomes non-linear and non-convex and it can typically only be solved by stochastic first order optimization methods whose convergence properties are not yet understood. 3. Linear Kolmogorov Equations as Learning Problem Consider linear Kolmogorov equations which are defined as follows for a function u : R+× Rd→ R and initial value ϕ : Rd→ R: (3) ∂u1 ∂t2Trace(σ(x)σT(x)Hessxu(t, x)) + µ(x)· ∇xu(t, x), (t, x)∈ [0, T ] × Rd, u(0, x) = ϕ(x). The problem of approximating the function x7→ u(T, x), T > 0, given the initial condition u(0, x) = ϕ(x) arises in a wide area of applications, for example nonlinear filtering or computational finance. Special cases include diffusion equations, 764Oberwolfach Report 14/2018 the Black-Scholes-Equation or the Heston model which is used day after day in the financial engineering industry. In these applications it is especially relevant to develop efficient numerical schemes for high-dimensional problems with d≥ 100. Due to the curse of dimensionality which states that the complexity of classical methods (Finite Element Methods, Finite Difference Methods, Sparse Tensor Product Methods, Spectral Methods, ...) scales exponentially in the dimension d, such methods are not applicable in this regime. In [3] use the Feynman-Kac formula u(T, x) = E(ϕ(ZxT)), where Zxtis the process defined as Zxt= x =R0tµ(Zxs)ds +R0tσ(Zx2)dWsto observe that u(T, x)|[a,b]d equals the solution to the learning problem in the precise sense of Definition 1 associated with X =U[a,b]d(the uniform distribution on [a, b]d) and Y = ϕ(ZXt). Using this reformulation we can simulate training data (xi, yi)mi=1distributed according to (X, Y ) and solve the resulting ERM problem with a NN hypothesis classHReLU(Nwhich results in a numerical approximation ˆF 0,...,NL)(xi,yi)mi=1,HReLU(N0,...,NL) of ˆF = u(T,·). Numerical simulations carried out in [3] suggest that the resulting algorithm does not suffer from the curse of dimensionality.P In ongoing work we show that in many cases of interest both the sizeLl=1(Nl× Nl−1+ Nl) of the NN hypothesis class as well as the number m of required training samples scale only polynomially in the dimension d. Theorem 1 (informal and simplified version [G-Jentzen-von Wurstemberger] ). Suppose that µ, Σ are affine functions (this includes diffusion equations or the Black-Scholes equation) and suppose that the initial condition ϕ can be very wellP approximated by ReLU NNs (this includes ϕ(x) = maxdi=1xi− Ki, 0 (basket option) or ϕ(x) = maxx1− K1, . . . , xd− Kd, 0 (max option)). Then there exists a polynomial p such that for every ǫ > 0 there exist L, N1, . . . , NL, m∈ N with PL (i)l=1(Nl× Nl−1+ Nl)≤ |p(d)|ǫ−2 (ii) m≤ |p(d)|ǫ−4 R1/2 (iii)(b−a)1d[a,b]d|u(T, x) − ˆF(xi,yi)m,HReLU(x)|2≤ ǫ. i=1(N0,...,NL) In other words, the method does not suffer from the curse of dimensionality. References
[87] F. Cucker, S. Smale. On the Mathematical Foundations of Learning, Bulletin of the AMS 39/1 (2001), 1–49.
[88] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, MIT press (2016).
[89] C. Beck, S. Becker, P. Grohs, N. Jaafari, A. Jentzen. Solving stochastic differential equations and Kolmogorov equations by means of deep learning. Preprint. Applied Harmonic Analysis and Data Processing765 A new paradigm for function approximation with deep networks Hrushikesh N. Mhaskar A central problem in machine learning is the following. LetD = (xi, yi)Mi=1⊂ X× R for some non-empty set X. Find a model P : X → R such that P (xi)≈ yi, 1≤ i ≤ M. The traditional machine learning paradigm as described in [2, 3] is to consider D as an i.i.d. sample from an unknown probability distribution µ. The goal is toZ estimate the generalization error given by|y − P (x)|2dµ(x, y). Writing X×R f (x) = Eµ(y|x), and denoting by µ∗the marginal distribution of µ for x, theZ generalization error is the sum of the variance, defined by|y−f(x)|2dµ(x, y) ZX×R and the bias, defined by|f(x) − P (x)|2dµ∗(x). In theoretical analysis, one X considers a sequence of model classes V0⊂ V1⊂ · · · . The minimum bias for P∈ Vnis obtained for some P∗= arg minP ∈Vnkf − P kµ∗,2, and is called the approximation errorkf − P∗k2µ∗,2. In the traditional paradigm, the actual construction of P∗is of no interest; only an estimate of this minimum bias is studied in order to get some insight on what space Vnto choose the model from. The actual model P#is computed based only onD typically using some empirical risk minimization process that assumes f to belong, for example, to a reproducing kernel Hilbert space (prior). Since the approximation error decreases as n↑ ∞, while the complexity of the process of finding P#increases with n, there is an built-in trade off between the two estimates. In the analysis of function approximation by deep networks, this paradigm does not work. As pointed out in [10], the main reason for deep networks to have an advantage over shallow networks is their ability to utilize a compositional structure so as to mitigate the curse of dimensionality with the blessing of compositionality. For example, the number of parameters in approximating a function F of 4 variables up to accuracy ǫ isO(ǫ−r/4), where r measures the smoothness of F . However, if F has a compositional structure F (x1,· · · , x4) = f (f1(x1, x2), f2(x3, x4)), then a deep network with binary tree architecture of the same form, P (x1,· · · , x4) = P∗(P1(x1, x2), P2(x3, x4)) can provide the same approximation with onlyO(ǫ−r/2) parameters, since only functions of two variables are approximated at each stage. If the functions f, f1, f2are Lipschitz, one can easily obtain a rate of approximation using triangle inequality (good propagation of errors) (see [10] for details). This requires an approximation of f (f1, f2) by P∗(P1, P2) as bivariate functions. The inputs to f are thus different from those to its approximation P∗, and it is not possible to define a measure with respect to which to take the L2 norm so that the measure is commensurate with the compositionality structure. We propose an alternative, equivalent way to look at the problem that avoids the trade off between approximation error and process complexity, and utilizes the knowledge from approximation theory that can lead simultaneously to a good approximation error as well as an explicit construction for the desired model. Our 766Oberwolfach Report 14/2018 viewpoint is to treat the question of machine learning as the problem of finding a good approximation to an unknown function f that captures the essential functional relationship in the data set, given the values yi= f (xi)+ǫi, where ǫiare i.i.d. samples drawn from an unknown distribution with mean 0 and independent of xi. Although this is an equivalent formulation of the original problem, it is more natural from the point of view of approximation theory to do pointwise estimates (alternately uniform estimates involving some weight functions) than those in L2, and more importantly to look for constructive methods that yield a good (rather than best) approximation. In particular, the generalization error is now defined as the pointwise error|f(x) − P#(x)|. The author has developed constructive methods to accomplish this goal in many contexts (e.g. [5, 8, 9, 6, 7]). It is clear that these methods cannot yield an overall better accuracy than the L2projection methods when the error is measured in the global L2norm. In most applications though, the target function f is smooth on its domain X except for a small set of “singularities”. It is well known in approximation theory that the error in L2projections are very sensitive to these singularities. In contrast, our methods produce errors according to the local smoothness of the target function at each point, analogous to those given in classical wavelet analysis [4, Chapter 9], in spite of the fact that they are defined using global data, with no a priori assumptions made on the smoothness of the target function whether globally or locally. In our talk, available at https://www.mathc.rwth-aachen.de/owncloud/index. php/s/GataT6XimZCWTwl, we illustrated the local approximation properties of our methods in the context of function approximation on the Euclidean (hyper-)sphere. It is noted that approximation by ReLU networks on the Euclidean space can be reduced to an equivalent problem on the sphere. We also gave pointwise and uniform error estimates for shallow and deep networks to explain the phenomenon that it is possible to drive the training error to zero and yet keep the test error under control [1, 12, 11]. References
[90] M. Belkin, S. Ma, and S. Mandal, To understand deep learning we need to understand kernel learning, arXiv preprint arXiv:1802.01396, 2018.
[91] F. Cucker and S. Smale, On the mathematical foundations of learning, Bulletin of the American Mathematical Society, 39 (2002), 1–49.
[92] F. Cucker and D. X. Zhou, Learning theory: an approximation theory viewpoint, 24 (2007), Cambridge University Press.
[93] I. Daubechies, Ten lectures on wavelets, 61 (1990), SIAM.
[94] Q. T. Le Gia and H. N. Mhaskar, Localized linear polynomial operators and quadrature formulas on the sphere, SIAM Journal on Numerical Analysis, 47(1)(2008), 440–466.
[95] H. N. Mhaskar, When is approximation by Gaussian networks necessarily a linear process? Neural Networks, 17(7) (2004), 989–1001.
[96] H. N. Mhaskar, Weighted quadrature formulas and approximation by zonal function networks on the sphere, Journal of Complexity, 22(3) (2006), 348–370.
[97] H. N. Mhaskar, Eignets for function approximation on manifolds, Applied and Computational Harmonic Analysis, 29(1) (2010), 63–87. Applied Harmonic Analysis and Data Processing767
[98] H. N. Mhaskar, Function approximation with relu-like zonal function networks, arXiv preprint arXiv:1709.08174, 2017.
[99] H. N. Mhaskar and T. Poggio, Deep vs. shallow networks: An approximation theory perspective, Analysis and Applications, 14(6) (2016), 829–848.
[100] H. N. Mhaskar and T. Poggio, An analysis of training and generalization errors in shallow and deep networks, arXiv preprint arXiv:1802.06266, 2018.
[101] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. N. Mhaskar, Theory of deep learning iii: explaining the non-overfitting puzzle, arXiv preprint arXiv:1801.00173, 2017. On the gap between local recovery guarantees in structured compressed sensing and oracle estimates Claire Boyer This is an on-going work with Ben Adcock and Simone Brugiapaglia (Simon Fraser University, Burnaby). Compressed sensing theory provides guarantees to reconstruct sparse signals from a few linear of measurements. However, in order to circumvent combinatorial issue, the ”good” theoretical sensing matrices are often random, typically (i) Gaussian matrices with i.i.d. Gaussian entries, (ii) matrices obtained by stacking rows drawn from a finite-dimensional isometry, for instance randomly selected Fourier atoms. However in practice, the acquisition is very structured due to the physics of acquisition, and measurements can be performed by groups or blocks, standing for admissible sampling patterns. Therefore, in this work, we consider a compressed sensing (CS) theory more compatible with real-life applications: we derive guarantees to ensure reconstruction of a structured sparse signal of interest while imposing structure in the acquisition. We actually extend the setting of [1]. Once this setting established, one can study oracle-type bound: one can show that if we know the support of the signal to reconstruct, to ensure robust recovery, the required number of measurements can read as follows (1)m≥ c · Λ(S, F ) ln(n), where c is a numerical constant, S is the support of the signal, F is the distribution describing how to choose the blocks of measurements. In this bound, Λ(S, F ) is controlling the largest singular value of the sensing matrix restricted to the support of interest. For a fixed signal x to be recovered, assuming that x has a random sign structure, we derive robust reconstruction guarantee with a required number of measurements: (2)m≥ c · Θ(S, F ) ln2(n). We study how far those CS results are from oracle-type guarantees: (i) Θ(S, F ) is an upper bound of Λ(S, F ), (ii) there is an extra log factor. Actually by making an extra assumption, one can derive an oracle-type result in terms of number of 768Oberwolfach Report 14/2018 measurements as in (1). We also show that this additional assumption can be satisfied by very structured sampling matrices: for instance when sampling isolated measurements from a Fourier-wavelet transform (which can model Magnetic Resonance Imaging). These results give an insight to design new optimal sampling strategies when realistic physical constraints are imposed in the acquisition. Indeed, one can minimize Θ(S, F ) or Λ(S, F ) with respect to F given some prior on the support S. For instance in the case of sampling isolated measurements from an isometry, the state-of-the-art results [2, 3] consist in variable density sampling according to the probability distribution π, such that πk∝ kakk2∞, where the (ak)’s are the measurement vectors. The new results show that one should instead sample according to the following probability distribution πk∝ kakk∞kak,Sk1. This new strategy emphasizes that one should sample not only where the transform is coherent but also if there is information in the signal that can be captured (by some prior knowledge on the support S). References
[102] C. Boyer, J. Bigot, P. Weiss Compressed sensing with structured sparsity and structured acquisition, ACHA (2017).
[103] N. Chauffert, P. Ciuciu, J. Kahn, P. Weiss Variable density sampling with continous sampling trajectories, SIAM Journal on Imaging Science (2014).
[104] F. Krahmer, R. Ward Stable and robust sampling strategies for compressive imaging, Image Processing, IEEE Transactions on, 23(2):612–622 (2014). Stable Phase Retrieval in Infinite Dimensions Rima Alaifari (joint work with Ingrid Daubechies, Philipp Grohs, Rujie Yin) The problem of phase retrieval originated from X-ray crystallography, in which an electron density distribution of a crystal or crystallized molecule is sought to be reconstructed from only the magnitude of its Fourier transform. In the more recent technique of coherent diffraction imaging for imaging of non–crystalline nanoscale structures, one way to retrieve lost phase information is by adding redundancy in the measurements. This is typically realized by a pinhole sliding over the object support. Such measurements can be modelled as the intensities of a short–time Fourier transform (STFT) of the underlying density. Another instance of a phase retrieval problem comes from the phase vocoder in audio processing. A phase vocoder is a device that realizes modifications of audio signals, such as time– stretching or pitch–shifting. One way to implement such modifications is through fitting an audio signal to a modified spectrogram, i.e. to STFT magnitudes. Applied Harmonic Analysis and Data Processing769 In our work, we consider the problem of phase retrieval in an infinite–dimensional Hilbert space setting. More precisely, given a Hilbert spaceH and a frame ψλλ∈Λ⊂ H for some index set Λ, we ask when a signal f ∈ H can be uniquely and stably determined from|hf, ψλi|λ∈Λup to a global phase factor τ∈ S1. By stability we mean the existence of uniform constants c1, c2> 0 s.t. for all f, g∈ H, c1dist(f, g)≤ k|hf, ψλi|λ∈Λ− |hg, ψλi|λ∈ΛkL2(Λ,µ)≤ c2dist(f, g), where dist(f, g) := infτ ∈S1kf − τgkH. Clearly, if the frame is not sufficiently redundant, phase retrieval is not uniquely solvable, because too much information has been lost. On the other hand, in certain examples, one can show that sufficient oversampling of a frame can restore the unique solvability of phase retrieval. For example, the reconstruction of real– valued signals f∈ L2(R) from|hf, ψλi|λ∈Λis possible whenψλλ∈Λis a Meyer wavelet frame obtained from oversampling a Meyer wavelet orthonormal basis by a factor of at least 16/3 [1]. A natural question that arises is whether oversampling can also be a tool for restoring the stability of phase retrieval. In [4], it has been proven that whenH is infinite–dimensional and Λ is a discrete index set, phase retrieval can never be uniformly stable. We show in [3] that this is also the case when Λ is allowed to be a continuous index set. Thus, oversampling cannot improve the stability properties of phase recovery. More precisely, we prove a conjecture formulated in [5] for the finite–dimensional case, stating that the so–called strong complement property (SCP) is a necessary condition for stability of phase retrieval. Furthermore, we demonstrate that the SCP can never hold whenH is infinite–dimensional. Consequently, a function f∈ L2(R) cannot even be stably recovered from the magnitudes of its continuous STFT or of its continuous wavelet transform, i.e. even in cases where the signal transform is not sampled at all. While this result on the stability of the problem is negative, we have noticed that in practice, the instabilities that occur are all of a certain kind. Whenever the signal transform is concentrated on at least two disjoint regions of the time–frequency/time–scale domain, and small outside of these regions, phase retrieval up to one global phase factor is no longer possible in practice. On the positive side, we have made the observation that for audio signals that have such STFTs or wavelet transforms, multiplying the signal transform by a phase factor on only one of these regions results in an audio signal that is audibly identical to the original one (although they are no longer equal up to a global phase factor). More precisely, suppose that for example the STFT F of f is concentrated on two disjoint regions D1, D2⊂ C, so that F = F1+ F2and Fiis small outside of Di, i = 1, 2. Then, the audio signal ef for which the STFT is equal to eF = F1+τ F2, for τ∈ S1, is audibly indistinguishable from f . This observation has led us to formulate a new paradigm for stable phase retrieval in audio processing applications [2]. We propose to consider so–called atoll functions, i.e. functions concentrated on disjoint atolls and so that they are small 770Oberwolfach Report 14/2018 outside of these atolls. In the example from above, F is an atoll function concentrated on the two atolls D1and D2. Then, if one aims to reconstruct up to a global phase factor on each atoll separately, stability can be restored. For this, the requirement on the atoll function is that it is bounded below on the atolls (here, one can allow small lagoons with possibly smaller values inside the atoll, where the sizes and the number of the lagoons will enter in the stability constant). Our result holds for signal transforms that are holomorphic up to a weight function, i.e. for the STFT with Gaussian window and the continuous wavelet transform with Cauchy wavelet. An open question we believe to be interesting is that of extending this result to more general window classes and wavelets. References
[105] R. Alaifari, I. Daubechies, P. Grohs, and G. Thakur, Reconstructing real–valued functions from unsigned coefficients with respect to wavelet and other frames, Journal of Fourier Analysis and Applications 23.6 (2017): 1480-1494.
[106] R. Alaifari, I. Daubechies, P. Grohs, and R. Yin, Stable phase retrieval in infinite dimensions, arXiv preprint arXiv:1609.00034 (2016).
[107] R. Alaifari, and P. Grohs, Phase retrieval in the general setting of continuous frames for Banach spaces, SIAM Journal on Mathematical Analysis 49.3 (2017): 1895-1911.
[108] J. Cahill, P. Casazza and I. Daubechies, Phase retrieval in infinite–dimensional Hilbert spaces, Transactions of the American Mathematical Society, Series B 3, (3) (2016), 63–76.
[109] A.S. Bandeira, J. Cahill, D.G. Mixon, and A.A. Nelson, Saving phase: Injectivity and stability for phase retrieval, Applied and Computational Harmonic Analysis 37.1 (2014): 106-125. Low-rank Recovery from Group Orbits David Gross (joint work with Richard Kueng, Markus Grassl, Huangjun Zhu) We are concerned with the problem of recovering an unknown d×d matrix X from m noisy rank-one measurements of the form yi= trXaia∗i+ ǫi,i = 1, . . . , m where ai∈ Cdare measurement vectors, and ǫirepresents noise. We assume that aiare sampled from a group orbit. More precisely, we fix some finite group G⊂ U(Cd) and a “fiducial vector” a∈ Cd. The orbit is thenO = ga | g ∈ G. We assume that a (and hence all elements in the orbit) are normalized in that kak2= 1. The a1, . . . , amare assumed the be sampled independently fromO. The basic insight of Ref. [1] is that in this setting, recovery guarantees can sometimes be proven using just representation-theoretic data about G. Our approach works like this: The basis are the results of Ref. [2] that establish low-rank recovery guarantees using just information about the fourth moments (1)M = E[(aia∗i)⊗4] Applied Harmonic Analysis and Data Processing771 of the rank-1 measurement matrices aia∗i. Roughly speaking, the methods of Ref. [2] require that the matrix element (2)(b⊗4)∗M b⊗4 be “small” for all normalized vectors b∈ Cd. If the random vector aiis sampled from a G-orbit, it has the same distribution as gaifor any g∈ G. It follows that g⊗4M (g−1)⊗4= M or, equivalently, [M, g⊗4] = 0. We can thus apply Schur’s Lemma. Assume for simplicity that all irreducible representations (irreps) of G that appear in the representation g7→ g⊗4are nondegenerate. In this case, Schur’s Lemma says that X M =αiPi, i where i labels irreps, Piprojects on the i-th irrep, and the αiare suitable coefficients. From Eq. (1), one finds that trM = 1 and that M is positive semi-definite. Hence the coefficients fulfill 0≤ αi≤ 1/trPi. Therefore, a sufficient condition for (2) to be small is that the dimensions of all irreps occuring in g⊗4is large. In fact, one can easily verify that one can restrict attention to irreps that are contained in the totally symmetric subspace Sym4(Cd)⊂ (Cd)⊗4. By the polarization identity, this space is equivalent to the space of degree-4 polynomials in d complex variables. In this way, just using the existing techniques in [2], we get stable uniform recovery guarantees for rank-1 measurements that are sampled from any orbit of any matrix group whose action on order-4 polynomials does not contain small irreps. In Refs. [1, 3], we use slightly strengthened arguments to show that a certain Clifford group satisfies these criteria. The Clifford group plays a central role in quantum information theory, and has long been studied e.g. in classical coding theory. Refinements of the representation-theoretica analysis and further applications appear in Refs. [4, 5]. References
[110] H. Zhu,R. Kueng,D. Gross,Low rank matrix recovery from Clifford orbits, arXiv:1610.08070.
[111] R. Kueng, H. Rauhut, and U. Terstiege, Low rank matrix recovery from rank one measurements, Appl. Comput. Harmonic Anal., 2015.
[112] H. Zhu, R. Kueng, M. Grassl, D. Gross, The Clifford group fails gracefully to be a unitary 4-design, arXiv:1609.08172.
[113] R. Kueng, H. Zhu, D. Gross, Distinguishing quantum states using Clifford orbits, arXiv:1609.08595.
[114] D. Gross, S. Nezami, M. Walter, Schur-Weyl Duality for the Clifford Group with Applications, arXiv:1712.08628. 772Oberwolfach Report 14/2018 On unlimited sampling Felix Krahmer (joint work with Ayush Bhandari, Ramesh Raskar) For the conversion of an analog (bandlimited) signal to a digital representation, so called analog–to–digital converters (ADCs) are of key importance. The role of such devices is to extract from an analog signal its values on a discrete grid. Provided these samples are taken at a high enough rate, they then allow for the recovery of the signal via Shannon’s sampling theorem. Unlike the sampling method assumed in Shannon’s sampling theorem, practical ADCs are limited in dynamic range. Whenever a signal exceeds some preset threshold, the ADC saturates, resulting in aliasing due to clipping. Recent developments in ADC design, allow for an alternative ADC construction, namely ADCs that reset rather than to saturate, thus producing modulo samples. Depending on the community, the resulting ADC constructions are known as folding-ADC (cf. [1] and references therein) or the self-reset -ADC, recently proposed by Rhee and Joo [2] in context of CMOS imagers. More precisely, when reaching the upper or lower saturation threshold±λ, these ADCs would reset to the respective other threshold, i.e.,∓λ, in this way allowing to capture subsequent changes even beyond the saturation limit. Mathematically, this is represented by a memoryless, non-linear mapping of the form t11 (1)Mλ: t7→ 2λ+−. 2λ22 These constructions give rise to the following mathematical problem. Given such modulo samples of a bandlimited function as well as the dynamic range of the ADC, how can the original signal be recovered and what are the sufficient conditions that guarantee perfect recovery? The following theorem provides such a sufficiency condition. Theorem 1 (Unlimited Sampling Theorem [3]). Let g(t) be π-bandlimited and consider, for k∈ Z, the modulo samples yk=Mλ(g (kT )) of g(t) with sampling rate T . Then a sufficient condition for recovery of g(t) from theykkup to additive multiples of 2λ is that 1 (2)T≤. 2πe At the core of Theorem 1 is a constructive recovery method, as summarized in Algorithm . While some estimate of the signal norm needs to be available when recovering, the underlying circuit architecture is not limited to certain amplitude ranges: the same architecture allows for the recovery of arbitrary large amplitudes. That is why we refer to our approach as unlimited sampling. The underlying observation of our recovery algorithm is that for significant oversampling, the size of the n-th order finite difference scales like the n-th power Applied Harmonic Analysis and Data Processing773 Algorithm 1 Recovery from Modulo Folded Samples Data:yk=Mλ(g(kT )), N∈ N, and 2λZ ∋ βg≥ kgk∞. Result:eg ≈ g. (1) Compute y = ∆Ny. (2) Compute εγ=Mλ(y)− y. Set s(1)= εγ. (3) for n = 1 : N− 1 Compute κ(n)in (4). s(n+1)= Ss(n)− 2λκ(n). end (4) eγ = Ss(N ). (5) Compute eg from eγ via low-pass filter. of the oversampling rate and hence becomes small. In addition, the finite difference operator and the modulo operationMλsatisfy the following commutativity relation. Proposition 1. For any sequence a it holds that (3)Mλ(∆Na) =Mλ(∆N(Mλ(a)). Combining these observations allows for the recovery of the finite differences. Namely, the right hand side of (3) can be computed from the modulo samples, which hence provides access to the left hand side. As the argument of the modulo operation on the left hand side is small, the operation has no effect, so one has computed the true finite difference. To invert the finite difference operation, we consider the difference between true samples and modulo samples, which will always lie on a grid of spacing 2λ. For this reason, the inversion will be considerably more stable than for arbitrary real inputs. In particular, the integration constant that introduces ambiguity in each of the inversion steps will also lie on a grid. As a consequence, choosing the wrong constant will cause the output function in the subsequent step to exhibit a very strong growth, which in turn can be detected when using enough samples. In this estimate, the a priori bound of the amplitude of the signal 2λZ∋ βg≥ kgk∞ plays an important role. Namely, for J =6βλg, the appropriate inverse of the n-th finite difference operator is given by the sequence of partial sums (this operation is denoted by S) adjusted by a constant of 2λκn, where (S2∆nε (4)κ(n)=γ)1− (S2∆nεγ)J+1+1. 8βg2 When the bandlimited signals under consideration have additional structure, a corresponding approach can sometimes allow the recovery from just finitely many modulo samples (e.g., in the context of superresolution [4] or for sums of sinusoids [5]). In more general scenarios without smoothness assumptions, such as 774Oberwolfach Report 14/2018 redundant representations in RN, it is not clear under which conditions comparable recovery guarantees can be obtained. We consider this to be an interesting follow-up problem. References
[115] W. Kester, ADC architectures VI: Folding ADCs (MT-025 tutorial) Analog Devices, Tech. Rep., 2009.
[116] J. Rhee and Y. Joo, Wide dynamic range CMOS image sensor with pixel level ADC, Electron. Lett., 39 (4), 360, 2003.
[117] A. Bhandari, F. Krahmer, R. Raskar, On Unlimited Sampling, Intl. Conf. Sampling Theory Appl. (SampTA), 2017
[118] A. Bhandari, F. Krahmer and R. Raskar, Unlimited Sampling of Sparse Signals, to appear in IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[119] A. Bhandari, F. Krahmer and R. Raskar, Unlimited Sampling of Sparse Sinusoidal Mixtures, to appear in IEEE Intl. Symp. on Information Theory (ISIT), 2018. Twisted X-rays – mathematical design of radiation for high-resolution X-ray diffraction imaging Dominik Juestel (joint work with Gero Friesecke, Richard D. James) Conventional methods for X-ray diffraction imaging use plane waves to illuminate a sample. The incoming electromagnetic radiation induces oscillations of the electron density of the sample. Consequently, the moving charges produce an outgoing field that can be recorded by detectors. The inverse problem to infer the structure of the sample form its diffraction patterns can be formulated as a phase retrieval problem: only the absolute values of the complex numbers that are needed for reconstruction can directly be obtained from the measured diffraction intensities. More precisely, the information about the electron density ρ that is contained in the measurements is essentially the modulus of its Fourier transform|bρ|. Atomic resolution X-ray diffraction imaging can today only be achieved by Xray crystallography, where the periodic arrangement of molecules leads to a high amount of constructive and destructive interference, resulting in highly structured peak patterns. Mathematically, this effect is explained by the Poisson summation formula. Let Γ := AZ3, A∈ GL(3, R), be a periodic lattice in R3, then a crystal’s electron density can be modeled as an infinite periodic function ρ = δPΓ∗ ϕ, where δΓ:=x∈Γδxis the Dirac comb of the lattice Γ, and ϕ is a model for the electron density in a unit cell. Combining the Poisson summation formula with the convolution theorem, we get|bρ| =(2π)3δ det(A)Γ′· | bϕ|, where Γ′:= 2πA−TZ3is the corresponding reciprocal lattice. This calculation shows, that the intensity measured at a diffraction peak is essentially the modulus of a Fourier coefficient of the electron density in a unit cell of the crystal. The main drawback of X-ray crystallography is the need to crystallize the structures under consideration. Since proteins, for example, often do not form crystals, Applied Harmonic Analysis and Data Processing775 but aggregate in other highly symmetric assemblies, like rods, sheets, or icosahedral structures, our approach is to design forms of electromagnetic radiation that reflect the symmetry of the considered class of structures in the same way as plane waves reflect the symmetry of crystals. This way, we can profit from the inteference effects while the structures keep their natural form. Mathematically, this is achieved by finding the eigenfunctions of the group action of the symmetry group in the space of monochromatic solutions to Maxwell’s equations in vacuum. More general versions of the Poisson summation formula then imply highly structured interference similar to the classic case (see [2]). Unlike in classic X-ray crystallography, where the vectorial nature of the electromagnetic field can be neglected in most calculations, it needs to be taken into account for more general radiation than plane waves. When translating a vector field, the orientation of the field vectors doesn’t change. Instead, when rotating or reflecting a vector field, the field vectors need to be rotated or reflected accordingly. This simple and intuitive fact has implications for the reconstruction problem from non-plane wave diffraction patterns. The classic scalar phase retrieval problem is generalized to a vectorial phase problem: the information contained in the intensity measurements is the length of complex vectors that are needed for reconstruction of the electron density. In the special case of helical structures like nanotubes or helical viruses – these are structures that have a discrete symmetry of rotations, translations, and screw displacements with respect to a fixed axis – the radiation design problem can be fully solved. We call the resulting electromagnetic fields twisted X-rays, as they are waves that propagate along helices. The solution spaces are finite dimensional, with a parametrization that can be interpreted as the polarization of the radiation (see [3]). A diffraction experiment of twisted X-rays illuminating an aligned helical structure is conjectured to produce a highly structured diffraction pattern, with the outgoing field in axial direction forming an exact double peak pattern when viewed as a function of certain radiation parameters. Moreover, the structure of the sample can be recovered by solving a variation of the classic phase retrieval problem. In fact, the above mentioned vectorial phase retrieval problem reduces to a scalar phase retrieval problem, with the Fourier transform replaced by a Fourier-Hankel transform. For general symmetries, the analysis of the solution spaces of the design equations gets more involved, and can in general not be made as explicit as in the case of plane or twisted waves. Viewing the design problem from the standpoint of representation theory, the space of monochromatic solutions to Maxwell’s equations in vacuum is decomposed into irreducible components with respect to the action of the symmetry group. These components need not be finite-dimensional, as is the case for the translation group or the helical symmetry group. When trying to calculate the diffraction patterns that result from the illumination with radiation from an irreducible component with respect to a general symmetry group, one arrives at the boundary of mathematical research. While 776Oberwolfach Report 14/2018 for abelian and compact groups, there are generalizations of the classic theory, for non-abelian non-compact groups, there is no suitable generalization of the Poisson summation formula (see [4] for the mathematical background). Recent advances in X-ray technology suggest that the proposed experiment, to illuminate helical structures with twisted X-rays, might be realizable in the near future. Several groups managed to produce so-called beams carrying orbital angular momentum, which are closely related to twisted X-rays (see, e.g., [1]). They use helical undulators to force an electron beam from a synchrotrone onto a helical trajectory. The X-rays that are emitted interfere to form the helical waveform. While they do not yet reach the energy that is necessary for atomic resolution, a proof of concepts is within reach. The proposed method has the potential to allow for structure analysis of previously unaccessible molecules, while the theory is a nice example for the usefullness of abstract mathematics in applications. References
[120] J. Bahrdt, K. Holldack, P. Kuske, R. M¨uller, M. Scheer, P. Schmid, First observation of photons carrying orbital angular momentum in undulator radiation, Phys. Rev. Lett. 111 (2013), 034801.
[121] G. Friesecke, R. D. James, D. Juestel, Twisted X-rays: incoming waveforms yielding discrete diffraction patterns for helical structures, SIAM J. Appl. Math. 76-3 (2016), 1191–1218.
[122] D. Juestel, G. Friesecke, R. D. James, Bragg-Von Laue diffraction generalized to twisted X-rays, Acta Cryst. A 72 (2016), 190–196.
[123] D. Juestel, The Zak transform on strongly proper G-spaces and its applications, J. London Math. Soc. 97 (2) (2018), 47–76.Calderbank Synchronization Problems: Geometry Meets Learning Tingran Gao (joint work with Ingrid Daubechies, Sayan Mukherjee, Doug Boyer, Jacek Brodzki, Qixing Huang, Chandrajit Bajaj) Acquiring complex, massive, and often high-dimensional data sets has become a common practice in many fields of science. Bridging recent developments applying differential geometry and topology in probability and statistical sciences, the problem of synchronization arise in a variety of fields in computer vision, signal processing, combinatorial optimization, and natural sciences (e.g. cryoelectron microscopy and geometric morphometrics [3, 4]). The data given in a synchronization problem include a connected graph that encodes similarity relations within a collection of objects, and pairwise correspondences—often realized as elements of a transformation group G—characterizing the nature of the similarity between a pair of objects linked directly by an edge in the relation graph. The goal is to adjust the pairwise correspondences, which often suffer from noisy or incomplete measurements, to obtain a globally consistent characterization of the pairwise relations for the entire dataset, in the sense that unveiling the transformation between Applied Harmonic Analysis and Data Processing777 a pair of objects far-apart in the relation graph can be done by composing transformations along consecutive edges on a path connecting the two objects, and the resulting composed transformation is independent of the choice of the path. We develop a geometric framework in [1] that characterizes the nature of synchronization based on the classical theory of fibre bundles. We first establish the correspondence between synchronization problems in a topological group G over a connected graph Γ and the moduli space of flat principal G-bundles over Γ, and develop a discrete analogy of the renowned theorem of classifying flat principal bundles with fixed base and structural group using the representation variety. In particular, we show that prescribing an edge potential on a graph is equivalent to specifying an equivalence class of flat principal bundles, of which the triviality of holonomy dictates the synchronizability of the edge potential. Based on the fibre bundle interpretation of synchronization problems, we develop in [1] a twisted cohomology theory for associated vector bundles of the flat principal bundle arising from an edge potential, which is a discrete version of the twisted cohomology in differential geometry. This leads to a twisted Hodge theory, which is a fibre bundle analog of the discrete Hodge theory on graphs. The lowest-degree Hodge Laplacian of this twisted Hodge theory recovers a geometric realization of the graph connection Laplacian (GCL), a group-valued graph operator studied extensively in synchronization problems. Similar intuitions have led to an extended diffusion geometry framework for datasets with an underlying fibre bundle structure, referred to as Horizontal Diffusion Maps [2], which models a dataset with pairwise structural correspondences as a fibre bundle equipped with a connection; the role of random walk in standard diffusion maps is replaced with a horizontal random walk on the fibre bundle driven by a random walk on the base space. This novel diffusion geometry framework demonstrates its advantage of leveraging more detailed structural information to improve clustering accuracy in automated geometric morphometrics [5]. The geometric framework established in [1] also motivated us to study the problem of learning group actions—partitioning a collection of objects based on the local synchronizability of pairwise correspondence relations. A dual interpretation is to learn finitely generated subgroups of an ambient transformation group from noisy observed group elements. An iterative two-step synchronization residual spectral clustering algorithm is proposed in [1]. More concretely, assuming the underlying graph consists of multiple clusters, and the transformation groups within each cluster is more consistent than between clusters, the algorithm performs a synchronization procedure over the entire graph, followed by evaluating the discrepancy (“edgewise frustration”) between the synchronized and the original edge potentials, and then performs a spectral clustering for the graph with the edgewise frustration as weights; after that, the algorithm runs synchronization within each cluster, patches the local synchronization solutions together, and repeat the steps starting from another global synchronization. This simple algorithm demonstrates its efficacy on both simulated and real datasets. When the group is 778Oberwolfach Report 14/2018 a permutation group, we established in [6] exact recovery conditions for this iterative synchronization-residual based clustering algorithm under a stochastic block model nested with inhomogeneous random corruption. Many exciting problems are still open in this line of research. Notably, a proper analogy of the Cheeger inequality seems natural but is missing so far in the principal bundle framework. Much more about the horizontal diffusion geometry is unknown either, for instance, whether the eigenfunctions of the horizontal diffusion operator can be manipulated to obtain an embedding of either the total space or the base space of the fibre bundle. It is also of great interest to generalize the techniques developed in [6] to establish similar results for groups other than the permutations group, such as orthogonal or special orthogonal groups, which are commonly encountered in shape alignment and analysis. Last but not the least, we are excited about the connection between differential geometry and learning theory implied by our geometric framework, which seems to suggest that the interchanging of ideas from either field can substantially benefit the other. References
[124] T. Gao, J. Brodzki, S. Mukherjee, The Geometry of Synchronization Problems and Learning Group Actions, submitted. arXiv:1610.09051, (2016)
[125] T. Gao, The Diffusion Geometry of Fibre Bundles, submitted. arXiv:1602.02330, (2016).
[126] T. Gao, G.S. Yapuncich, I. Daubechies, S. Mukherjee, and D.M. Boyer, Development and Assessment of Fully Automated and Globally Transitive Geometric Morphometric Methods, with Application to a Biological Comparative Dataset with High Interspecific Variation, The Anatomical Record. to appear. (2017).
[127] N.S. Vitek, C.L. Manz, T. Gao, J.I. Bloch, S.G. Strait, and D.M. Boyer, Semi-Supervised Determination of Pseudocryptic Morphotypes Using Observer-Free Characterizations of Anatomical Alignment and Shape, Methods in Ecology and Evolution, 7 (2017) 5041-5055.
[128] T. Gao, Hypoelliptic Diffusion Maps and Their Applications in Automated Geometric Morphometrics, PhD thesis, Duke University. (2015)
[129] C. Bajaj, T. Gao, Z. He, Q. Huang, and Z. Liang, SMAC: Simultaneous Mapping and Clustering Using Spectral Decompositions, submitted to 2018 International Conference on Machine Learning. (2018) Solving nonlinear equations using convex programming Justin Romberg (joint work with Sohail Bahmani) We consider the general problem of recovering an unknown vector x⋆∈ RNthat (approximately) satisfies a system of equations y1= f1(x⋆) + ǫ1 y2= f2(x⋆) + ǫ2 ... yM= fM(x⋆) + ǫM, where the fmare known, convex functions and the ǫmare unknown perturbations. Applied Harmonic Analysis and Data Processing779 We attack this problem as follows. Setting the perturbations ǫm= 0 for now to ease the explanation, each equation above gives us a different convex feasibility region for x⋆, namely the sublevel setx : fm(x)≤ ym. It is clear x⋆must lie in the intersection of these sublevel sets: \M x⋆∈ K, K =x : fm(x)≤ ym. m=1 In fact, x⋆must be an extreme point ofK. As such, we can attempt to recover x⋆by maximizing a linear functional overK. For a given a0, we solve (1)maximizehx, a0i subject to x ∈ K. x It is clear that if a0= x⋆, then x⋆is the solution (or at least one of the solutions) to the program above. Our work develops conditions under which there are many a0such that x⋆is the unique solution to (1). In particular, we assume that we have a vector a0that is only roughly correlated with x⋆: hx⋆, a0i≥ δ > 0, (2) kx⋆k2ka0k2 for some constant δ. We call such a a0an anchor vector. Simply writing down the optimality (KKT) conditions for (1) shows us that if indeed ym= fm(x⋆), then x⋆is a solution to (1) if and only if (3)a0∈ cone (∇fm(x⋆), m = 1, . . . , M) . This tells us that whether or not a0will be effective is purely a function of the behavior of the gradients of the fmat the solution. Qualitatively, we can see that the more diverse these gradients are, the larger the cone that they generate is going to be, and the easier it is to find a suitable a0. Our main result gives a guarantee on the number of equations we need for (3) to hold with high probability when the functions are drawn independently and identically distributed according to some probability law. Under this probability law, we let Σ⋆be the correlation matrix of the gradients at the solution Σ⋆= E[∇f(x⋆)∇f(x⋆)T] and define the quantities kΣ⋆k, τ⋆=infE[h∇f(x⋆), hi+],ν⋆= khk2=1τ⋆ whereh·, ·i+takes the positive part of the inner product. Then if we have an a0 that obeys (2), x⋆will be the solution to (1) with high probability when M≥ Const · ν⋆3· N. When the∇f(x⋆) is approximately “isotropic”, the quantity ν⋆will be a constant, and the system of equations ym= fm(x⋆) can be solved for M∼ N. We can also encourage certain types of structure in the solution by adding a regularizer to (1). For example, it is now well-known that sparsity in the solution 780Oberwolfach Report 14/2018 to a system of equations can be encouraged by penalizing with an ℓ1norm. In general, if Ω(x) is a convex regularizer, then we can solve (4)maximizehx, a0i − Ω(x) subject to x ∈ K. x In this case, if y = fm(x⋆), then x⋆is the solution to this optimization program if and only if a0∈ cone (∇fm(x⋆), m = 1, . . . , M) + ∂Ω(x⋆), where ∂Ω(x⋆) is the subgradient of Ω(·) at x⋆. Notice that if x⋆lies on a “corner” of the sublevel setx : Ω(x) ≤ Ω(x⋆), then ∂Ω(x⋆) can be large, allowing for a more liberal choice of anchor vector a0. For random fm, the number of equations we need for x⋆to be the solution to (4) again depends on a measure of statistical complexity of the gradients∇f(x⋆), but now relative to the ascent cone for the functional in (4). An exact statement of this result can be found in [2]. References
[130] S. Bahmani and J. Romberg, A flexible convex relaxation for phase retrieval, Electronic Journal of Statistics, vol. 11, no. 2 (2017), 5254–5281.
[131] S. Bahmani and J. Romberg, Solving equations of random convex functions via anchored regression, Preprint, September 2017, arxiv:1702.05327. Fast Point Cloud Distances and Multi-Sample Testing Alex Cloninger (joint work with Xiuyuan Cheng and Ronald R. Coifman) We consider the question of estimating the total variation between two distributions in high dimensional space, given only a finite number of samples drawn iid from each distribution. This talk introduces a new anisotropic kernel-based Maximum Mean Discrepancy (MMD) statistic for estimating such a distance [1], which builds upon the Reproducing Kernel Hilbert Space MMD proposed by Gretton, et al [2]. The new anisotropic kernel scales linearly in the number of points by establishing landmarks throughout the space that approximate the local geometry of the union of the two datasets by constructing principle components of local covariance matrices. These landmarks can be interpreted as points that, under the action of a heat kernel on the data, diffuse to the entire space as quickly as possible. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency and power of the test is proved, under mild assumptions of the kernel, as long askp − qk ∼ O(n−1/2+δ) for any δ > 0, based on a result of convergence in distribution of the test statistic. We also establish error bounds on the approximation by landmark points. Applied Harmonic Analysis and Data Processing781 We also consider the k-sample setting in which we measure pairwise distances between the k different point clouds. This test has complexity by O(k2|R| + kN|R|d) for N points per cloud with |R| landmarks in d dimensions. This is opposed to complexity O(k2N2d) of the naive algorithm of directly computing the MMD between any two distributions. Applications to flow cytometry detection of AML and diffusion MRI datasets are demonstrated, which motivate the proposed approach to compare distributions. References
[132] X. Cheng, A. Cloninger, & R. R. Coifman, Two-sample Statistics Based on Anisotropic Kernels, arXiv preprint arXiv:1709.05006 (2017).
[133] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schlkopf, & A. J. Smola, A kernel two-sample test, Journal of Machine Learning Research 13 (2012), 723-773. The Mismatch Principle: An Ignorant Approach to Non-Linear Compressed Sensing? Martin Genzel (joint work with Gitta Kutyniok, Peter Jung) In many real-world problems, one is given a finite collection of samples (a1, y1), . . . , (am, ym)∈ Rn× R which are drawn independently from a joint random pair (a, y) in Rn× R of unknown probability distribution. For example, y∈ R could play the role of an output variable that one would like to predict from certain input data a∈ Rn. Very generally speaking, the problem issue is now as follows: What can we learn from the sample set about the relationship between the input and the output variables? Although we do not impose any specific restrictions on the model, it is useful to think of some (unknown) parameters that determine the underlying observation rule. Let us consider two prototypical scenarios: • Single-index models. Let x0∈ Rnbe a structured vector (e.g., sparse) and assume that y = f (ha, x0i) where f : R→ R can be unknown, non-linear, and noisy. The goal is to estimate the unknown index vector x0. • Variable selection. Let S = j1, . . . , js ⊂ [n] and assume that yi= F (aj1, . . . , ajs) where F : Rs→ R can be again unknown, non-linear, and noisy. The goal is to identify the set of active variables S in a = (a1, . . . , an). 782Oberwolfach Report 14/2018 We would like to investigate whether a standard estimator — which does not require any prior knowledge — can solve those types of estimation problems. One of the most popular algorithmic approaches is the generalized Lasso, Xm x∈Rn2mi− hai, xi)2subject tox∈ K, i=1 where K⊂ Rnis a convex hypothesis set that enforces certain structural constraints on the solution, such as sparsity. The results of [1, 2, 3, 5] show that the Lasso — although originally designed for linear regression — is surprisingly robust against non-linear distortions and can in fact deal with much more complicated observation schemes. Let us state a simplified recovery guarantee: Theorem 1 (informal, cf. [3, Thm. 6.4]) Using the above notation, assume that ais an isotropic, mean-zero sub-Gaussian random vector in Rnand y is also sub-Gaussian. Fix an arbitrary target vector x♮∈ K ⊂ Rn. Then, with high probability, any minimizer ˆxof (PK) satisfies the following error bound: w∧(K, x♮) (1)kˆx− x♮k2.√+ ρ(x♮), m where w∧(K, x♮) denotes the conic Gaussian width of K at x♮and ρ(x♮) :=E[(ha, x♮i − y)a] 2 is called the mismatch covariance. Remarkably, the above statement holds true for every choice of x♮and there are no specific assumptions on the output variable y. But in order to turn (1) into a meaningful error bound, one clearly needs to ensure that the offset term ρ(x♮) is sufficiently small, since it does not decay with m. If the target vector x♮can be chosen in this way, Theorem 1 states that the Lasso (PK) indeed constitutes an almost consistent estimator of x♮. With regard to our initial problem issue, we can now formulate a simplified version of the mismatch principle: Determine a target vector x♮∈ K that captures the “parametric” structure of the observation rule and minimizes the mismatch covariance ρ(x♮) at the same time. For example, we would have to specify a target vector in spanx0 ∩ K for singleindex models and inx | supp(x) ⊆ S ∩ K for variable selection, respectively. It is in fact not hard to see that in either case (if a is standard Gaussian) there exists an appropriate choice of x♮such that ρ(x♮) = 0. In general, the mismatch principle provides a recipe to prove theoretical error bounds for the Lasso under non-linear observations. Combined with Theorem 1, it particularly indicates when one can expect reasonable outcomes and when not. A crucial role is obviously played by the mismatch covariance because it measures the compatibility between the linear fit of (PK) and the true (parametric) model. Applied Harmonic Analysis and Data Processing783 Apart from that, the mismatch principle even applies to more complicated situations, e.g., if the components of a are strongly correlated [4] or if the square loss in (PK) is replaced by a different convex loss function [1]. References
[134] M. Genzel, High-Dimensional Estimation of Structured Signals From Non-Linear Observations With General Convex Loss Functions, IEEE Trans. Inf. Theory 63.3 (2017), 1601– 1619.
[135] M. Genzel and P. Jung, Blind Sparse Recovery From Superimposed Non-Linear Sensor Measurements, In Proceedings of the 12th International Conference on Sampling Theory and Applications (SampTA), 2017.
[136] M. Genzel and P. Jung, Recovering Structured Data From Superimposed Non-Linear Measurements, Preprint arXiv:1708.07451, 2017.
[137] M. Genzel and G. Kutyniok, A Mathematical Framework for Feature Selection from RealWorld Data with Non-Linear Observations, Preprint arXiv:1608.08852, 2016.
[138] Y. Plan and R. Vershynin, The generalized Lasso with non-linear observations, IEEE Trans. Inf. Theory 62.3 (2016), 1528–1537. Dictionary learning - from local towards global and adaptive Karin Schnass The goal of dictionary learning is to decompose a data matrix Y = (y1, . . . , yN), where yn∈ Rd, into a dictionary matrix Φ = (ϕ1, . . . , ϕK), where each column also referred to as atom is normalised,kϕkk2= 1 and a sparse coefficient matrix X = (x1, . . . , xN), (1)Y≈ ΦX One way to concretise that the coefficient matrix should be sparse is to choose a sparsity level S and ask that every coefficient vector xnshould have at most S non zero entries. DefiningDKto be the set of all dictionaries with K atoms and XSthe set of all columnwise S-sparse coefficient matrices the dictionary learning problem can be formulated as optimisation programme (2)minkY − ΨXk2F. Ψ∈DK,X∈XS This problem is highly non-convex and as such difficult to solve. However, randomly initialised alternating projection algorithms, which iterate between finding the best dictionary Ψ based on coefficients X and (trying) to find the best coefficients X based on a dictionary Ψ, such as K-SVD (K Singular Value Decompositions), [2], and ITKrM (Iterative thresholding and K residual means), [5], tend to be very successful on synthetic data - usually recovering 90 to 100% of all atoms - and provide useful dictionaries on image data. The drawback of these algorithms is that assuming that the data Y is synthesized from a generating dictionary Φ and randomly drawn S- sparse coefficients X is that they have no (K-SVD) or very weak (ITKrM) recovery guarantees. This is in sharp contrast to more involved algorithms, which have gobal recovery guarantees but due to their computational complexity can only be used in toy examples in 784Oberwolfach Report 14/2018 small dimensions, [3, 1, 4]. One interesting exception is an algorithm developed by Sun, Qu and Wright, [7, 8], which is gradient based with a Newton trust region method to escape saddle points and proven to recover the generating dictionary if it is a basis. This result together with several results in machine learning which prove that non-convex problems can be well behaved, meaning all local minima are global minima, gives rise to hope that a similar result can be proven for learning overcomplete dictionaries. In this talk and the accompanying paper, [6], we show that ITKrM has a much larger contraction radius, when assuming that the current estimate of the generating dictionary Ψ is incoherent and well conditioned. Assuming additionally that (after potential rearrangement and sign flip of the atoms) the cross-Gram matrix Ψ⋆Φ is diagonally dominant we further show that ITKrM is a contraction on as soon as p (3)maxkϕk− ψkk22≤ 2 − 2kΦk2,2log KS/K, k which is relatively close to the worst case distance maxkkϕk− ψkk22= 2. We also destroy all hope of proving global convergence by sketching the existence of stable fixed points, which are not equivalent to the generating dictionary. However, based on a characterisation of the fixed points and an analysis of the residuals at these fixed points we consider a replacement procedure for coherent atoms and develop a strategy for finding good replacement candidates. Decoupling the replacement strategy into independent pruning of coherent (as well as unused) atoms and adding of promising candidates finally leads to an algorithm for dictionary learning that adaptively chooses the dictionary size K and the sparsity level S. References
[139] A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In COLT 2014 (arXiv:1309.1952), 2014.
[140] M. Aharon, M. Elad, and A.M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing., 54(11):4311– 4322, November 2006.
[141] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT 2014 (arXiv:1308.6273), 2014.
[142] B. Barak, J.A. Kelner, and D. Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In STOC 2015 (arXiv:1407.1543), 2015.
[143] K. Schnass. Convergence radius and sample complexity of ITKM algorithms for dictionary learning. Applied and Computational Harmonic Analysis, online, 2016.
[144] K. Schnass. Dictionary learning - from local towards global and adaptive. arXiv:1804.07101, 2018.
[145] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.
[146] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–915, 2017. Applied Harmonic Analysis and Data Processing785 Monte Carlo approximation certificates for k-means clustering Dustin G. Mixon (joint work with Soledad Villar) Geometric clustering is a fundamental problem in data science. At its core, clustering is an optimization problem, and it is natural to analyze the performance of various clustering routines with popular random data models. The last decade of research in applied harmonic analysis has demonstrated a fruitful interaction between optimization and randomness, suggesting that geometric clustering presents an opportunity for these techniques to contribute to the study of fundamental algorithms in data science. This talk discusses an instance of this opportunity and sheds light on additional examples that warrant further attention. Given a finite sequence of data pointsxii∈Tin Rmand a complexity parameter k, the k-means problem seeks a partition C1⊔ · · · ⊔ Ck= T that minimizes the k-means objective: (T -IP) 1XX1X2 |T |t∈[k]i∈Cxi−|Ct|xjsubject toC1⊔ · · · ⊔ Ck= T. tj∈Ct While this optimization problem is NP-hard to solve (even in m = 2 dimensions [6]), real-world instances of this problem are frequently solved using Lloyd’s algorithm, which alternates between computing centroids and reassigning points to the nearest centroid. A data scientist can perform Lloyd’s algorithm with random initializations to produce locally optimal clusterings, but when should he stop looking for a better clustering? One popular initialization for Lloyd’s algorithm is k-means++, which randomly selects k initial “centroids” fromxii∈Tin a way that encourages different centroids to be far apart. Letting W denote the random value of the k-means++ initialization, then the main result of [1] gives that 1 8(log k + 2)· EW. While this gives an approximation guarantee for the optimal clustering, it appears to be quite loose. For example, running several trials of k-means++ on the MNIST training set of 60,000 handwritten digits [4] with k = 10 produces a clustering with value about 39.22, whereas estimating EW leads to a lower bound of about 2.15. In pursuit of a better lower bound, one may consider the Peng–Wei SDP relaxation: (T -SDP) 1 2|T |tr(DX) subject totr(X) = k, X1 = 1, X≥ 0, X  0. Here, D denotes the T× T matrix whose (i, j)th entry is kxi− xjk2, whereas X≥ 0 ensures that X is entrywise nonnegative and X  0 ensures that X is symmetric and positive semidefinite. For any clustering C1⊔ · · · ⊔ Ck= T , the 786Oberwolfach Report 14/2018 P matrix X =t∈[k]|C11 t|Ct1⊤Ctis feasible in (T -SDP) with SDP value equal to its IP value, and so val(T -SDP) is a lower bound for val(T -IP), as desired. Moreover, there has been a lot of work recently to establish how good this lower bound is for various random data models [2, 3, 8, 5]. However, SDPs are notoriously slow for large problem instances, and so it is infeasible to compute this lower bound for the MNIST training set (say). To decrease the complexity of the problem, we can pass to a subset of the data. For example, pick s≤ |T | and then draw S uniformly from all subsets of T of size s. Then letting C1∗⊔ · · · ⊔ Ck∗= T denote the (T -IP)-optimal clustering, we have  Eval(S -SDP)≤ E val(S -IP) ≤ E1XX1X2 sxi−|Ct∗∩ S|xj t∈[k]i∈Ct∗∩Sj∈Ct∗∩S  ≤ E1sXXxi−|C1∗Xx2 t∈[k]i∈Ct∗∩St|j∈Ct∗j = val(T -IP). As such, we can produce a lower bound on val(T -IP) by estimating E val(S -SDP), which is computationally feasible when s is small. For example, if we select s = 200, then E val(S -SDP)≈ 35 for the MNIST training set, meaning the clustering from k-means++ is within 15 percent of optimal. In [7], we prove that this approach leads to a 99%-confidence 3-approximation certificate for any mixture of two spherical Guassians, and furthermore, this certificate can be computed in sub-linear time. For follow-on work, it would be helpful to understand the distribution of val(S -SDP), as this would allow us to use even better test statistics for our certificate. Also, how does our approximation ratio scale with k? Can we extrapolate a near-optimal clustering for T from the SDP of S? Do our techniques transfer to other SDPs to provide sub-linear bounds? We are very interested to pursue these directions further. References
[147] D. Arthur, S. Vassilvitskii, k-means++: The advantages of careful seeding, SODA (2017) 1027–1035.
[148] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward, Relax, no need to round: Integrality of clustering formulations, ITCS (2015) 191–200.
[149] T. Iguchi, D. G. Mixon, J. Peterson, S. Villar, Probably certifiably correct k-means clustering, Math. Program. 165 (2017) 605–642.
[150] Y. LeCun, C. Cortes, C. J. C. Burges, The MNIST database of handwritten digits, http: //yann.lecun.com/exdb/mnist/
[151] X. Li, Y. Li, S. Ling, T. Strohmer, K. Wei, When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming, arXiv:1710.06008 Applied Harmonic Analysis and Data Processing787
[152] M. Mahajan, P. Nimbhorkar, K. Varadarajan, The planar k-means problem is NP-hard, Theor. Comput. Sci. 442 (2012) 13–21.
[153] D. G. Mixon, S. Villar, Monte Carlo approximation certificates for k-means clustering, arXiv:1710.00956
[154] D. G. Mixon, S. Villar, R. Ward, Clustering subgaussian mixtures by semidefinite programming, Inform. Inferenc 6 (2017) 389–415.
[155] J. Peng
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.