Interior-point methods offer a highly versatile framework for convex optimization that is effective in theory and practice. A key notion in their theory is that of a self-concordant barrier. We give a suitable generalization of self-concordance to Riemannian manifolds and show that it gives the same structural results and guarantees as in the Euclidean setting, in particular local quadratic convergence of Newton's method. We analyze a path-following method for optimizing compatible objectives over a convex domain for which one has a self-concordant barrier, and obtain the standard complexity guarantees as in the Euclidean setting. We provide general constructions of barriers, and show that on the space of positive-definite matrices and other symmetric spaces, the squared distance to a point is self-concordant. To demonstrate the versatility of our framework, we give algorithms with state-of-the-art complexity guarantees for the general class of scaling and non-commutative optimization problems, which have been of much recent interest, and we provide the first algorithms for efficiently finding high-precision solutions for computing minimal enclosing balls and geometric medians in nonpositive curvature.
This paper investigates the asymptotics of the maximal throughput of communication over AWGN channels by $n$ channel uses under a covert constraint in terms of an upper bound $\delta$ of Kullback-Leibler divergence (KL divergence). It is shown that the first and second order asymptotics of the maximal throughput are $\sqrt{n\delta \log e}$ and $(2)^{1/2}(n\delta)^{1/4}(\log e)^{3/4}\cdot Q^{-1}(\epsilon)$, respectively. The technique we use in the achievability is quasi-$\varepsilon$-neighborhood notion from information geometry. We prove that if the generating distribution of the codebook is close to Dirac measure in the weak sense, then the corresponding output distribution at the adversary satisfies covert constraint in terms of most common divergences. This helps link the local differential geometry of the distribution of noise with covert constraint. For the converse, the optimality of Gaussian distribution for minimizing KL divergence under second order moment constraint is extended from dimension $1$ to dimension $n$. It helps to establish the upper bound on the average power of the code to satisfy the covert constraint, which further leads to the direct converse bound in terms of covert metric.
We study three models of the problem of adversarial training in multiclass classification designed to construct robust classifiers against adversarial perturbations of data in the agnostic-classifier setting. We prove the existence of Borel measurable robust classifiers in each model and provide a unified perspective of the adversarial training problem, expanding the connections with optimal transport initiated by the authors in previous work and developing new connections between adversarial training in the multiclass setting and total variation regularization. As a corollary of our results, we prove the existence of Borel measurable solutions to the agnostic adversarial training problem in the binary classification setting, a result that improves results in the literature of adversarial training, where robust classifiers were only known to exist within the enlarged universal $\sigma$-algebra of the feature space.
We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.
This paper delves into stochastic optimization problems that involve Markovian noise. We present a unified approach for the theoretical analysis of first-order gradient methods for stochastic optimization and variational inequalities. Our approach covers scenarios for both non-convex and strongly convex minimization problems. To achieve an optimal (linear) dependence on the mixing time of the underlying noise sequence, we use the randomized batching scheme, which is based on the multilevel Monte Carlo method. Moreover, our technique allows us to eliminate the limiting assumptions of previous research on Markov noise, such as the need for a bounded domain and uniformly bounded stochastic gradients. Our extension to variational inequalities under Markovian noise is original. Additionally, we provide lower bounds that match the oracle complexity of our method in the case of strongly convex optimization problems.
The successes of modern deep machine learning methods are founded on their ability to transform inputs across multiple layers to build good high-level representations. It is therefore critical to understand this process of representation learning. However, standard theoretical approaches (formally NNGPs) involving infinite width limits eliminate representation learning. We therefore develop a new infinite width limit, the Bayesian representation learning limit, that exhibits representation learning mirroring that in finite-width models, yet at the same time, retains some of the simplicity of standard infinite-width limits. In particular, we show that Deep Gaussian processes (DGPs) in the Bayesian representation learning limit have exactly multivariate Gaussian posteriors, and the posterior covariances can be obtained by optimizing an interpretable objective combining a log-likelihood to improve performance with a series of KL-divergences which keep the posteriors close to the prior. We confirm these results experimentally in wide but finite DGPs. Next, we introduce the possibility of using this limit and objective as a flexible, deep generalisation of kernel methods, that we call deep kernel machines (DKMs). Like most naive kernel methods, DKMs scale cubically in the number of datapoints. We therefore use methods from the Gaussian process inducing point literature to develop a sparse DKM that scales linearly in the number of datapoints. Finally, we extend these approaches to NNs (which have non-Gaussian posteriors) in the Appendices.
PCA-Net is a recently proposed neural operator architecture which combines principal component analysis (PCA) with neural networks to approximate operators between infinite-dimensional function spaces. The present work develops approximation theory for this approach, improving and significantly extending previous work in this direction: First, a novel universal approximation result is derived, under minimal assumptions on the underlying operator and the data-generating distribution. Then, two potential obstacles to efficient operator learning with PCA-Net are identified, and made precise through lower complexity bounds; the first relates to the complexity of the output distribution, measured by a slow decay of the PCA eigenvalues. The other obstacle relates to the inherent complexity of the space of operators between infinite-dimensional input and output spaces, resulting in a rigorous and quantifiable statement of the curse of dimensionality. In addition to these lower bounds, upper complexity bounds are derived. A suitable smoothness criterion is shown to ensure an algebraic decay of the PCA eigenvalues. Furthermore, it is shown that PCA-Net can overcome the general curse of dimensionality for specific operators of interest, arising from the Darcy flow and the Navier-Stokes equations.
A proper fusion of complex data is of interest to many researchers in diverse fields, including computational statistics, computational geometry, bioinformatics, machine learning, pattern recognition, quality management, engineering, statistics, finance, economics, etc. It plays a crucial role in: synthetic description of data processes or whole domains, creation of rule bases for approximate reasoning tasks, reaching consensus and selection of the optimal strategy in decision support systems, imputation of missing values, data deduplication and consolidation, record linkage across heterogeneous databases, and clustering. This open-access research monograph integrates the spread-out results from different domains using the methodology of the well-established classical aggregation framework, introduces researchers and practitioners to Aggregation 2.0, as well as points out the challenges and interesting directions for further research.
Graph neural networks generalize conventional neural networks to graph-structured data and have received widespread attention due to their impressive representation ability. In spite of the remarkable achievements, the performance of Euclidean models in graph-related learning is still bounded and limited by the representation ability of Euclidean geometry, especially for datasets with highly non-Euclidean latent anatomy. Recently, hyperbolic space has gained increasing popularity in processing graph data with tree-like structure and power-law distribution, owing to its exponential growth property. In this survey, we comprehensively revisit the technical details of the current hyperbolic graph neural networks, unifying them into a general framework and summarizing the variants of each component. More importantly, we present various HGNN-related applications. Last, we also identify several challenges, which potentially serve as guidelines for further flourishing the achievements of graph learning in hyperbolic spaces.
In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first modern treatments on matrix decomposition that favored a (block) LU decomposition-the factorization of a matrix into the product of lower and upper triangular matrices. And now, matrix decomposition has become a core technology in machine learning, largely due to the development of the back propagation algorithm in fitting a neural network. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in numerical linear algebra and matrix analysis in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we clearly realize our inability to cover all the useful and interesting results concerning matrix decomposition and given the paucity of scope to present this discussion, e.g., the separated analysis of the Euclidean space, Hermitian space, Hilbert space, and things in the complex domain. We refer the reader to literature in the field of linear algebra for a more detailed introduction to the related fields.
Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.