We propose and study kernel conjugate gradient methods (KCGM) with random projections for least-squares regression over a separable Hilbert space. Considering two types of random projections, generated by randomized sketches and Nystr\"{o}m subsampling, we prove optimal statistical results with respect to variants of norms for the algorithms under a suitable stopping rule. In particular, our results show that if the projection dimension is proportional to the effective dimension of the problem, KCGM with randomized sketches generalizes optimally while achieving a computational advantage. As a corollary, we derive optimal rates for classic KCGM in the well-conditioned regime when the target function may not lie in the hypothesis space.
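The following minimal sketch conveys the flavor of the method: conjugate gradient least-squares regression restricted to a Nyström subspace, with the iteration count playing the role of the stopping rule. The Gaussian kernel, uniform subsampling, and all names (`rbf_kernel`, `kcgm_nystrom`) are illustrative assumptions, not the exact projected recursion analyzed in the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kcgm_nystrom(X, y, m=100, gamma=1.0, n_iter=10, seed=0):
    """Conjugate-gradient kernel regression restricted to a rank-m Nystrom
    subspace; n_iter acts as the early-stopping rule (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)        # Nystrom centres
    Knm = rbf_kernel(X, X[idx], gamma)                     # n x m
    Kmm = rbf_kernel(X[idx], X[idx], gamma)                # m x m
    # Projected normal equations  Knm^T Knm a = Knm^T y, lightly stabilized.
    A, b = Knm.T @ Knm + 1e-10 * Kmm, Knm.T @ y
    a, r = np.zeros(m), b.copy()
    p = r.copy()
    for _ in range(n_iter):                                # early stopping
        Ap = A @ p
        step = (r @ r) / (p @ Ap)
        a = a + step * p
        r_new = r - step * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return lambda Xt: rbf_kernel(Xt, X[idx], gamma) @ a    # predictor
```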
This work contributes to the limited literature on estimating the diffusivity or drift coefficient of nonlinear SPDEs driven by additive noise. Assuming that the solution is measured locally in space and over a finite time interval, we show that the augmented maximum likelihood estimator introduced in Altmeyer, Reiss (2020) retains its asymptotic properties when used for semilinear SPDEs that satisfy some abstract, verifiable conditions. The proofs of the asymptotic results are based on splitting the solution into linear and nonlinear parts and on fine regularity properties in $L^p$-spaces. The general results are applied to particular classes of equations, including stochastic reaction-diffusion equations. The stochastic Burgers equation, as an example with a first-order nonlinearity, is an interesting borderline case of the general results and is treated by a Wiener chaos expansion. We conclude with numerical examples that validate the theoretical results.
In this paper, by introducing a low-noise condition, we study the privacy and utility (generalization) performance of differentially private stochastic gradient descent (SGD) algorithms in the setting of stochastic convex optimization (SCO) for both pointwise and pairwise learning problems. For pointwise learning, we establish sharper excess risk bounds of order $\mathcal{O}\Big( \frac{\sqrt{d\log(1/\delta)}}{n\epsilon} \Big)$ and $\mathcal{O}\Big( {n^{- \frac{1+\alpha}{2}}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ for the $(\epsilon,\delta)$-differentially private SGD algorithm for strongly smooth and $\alpha$-H\"older smooth losses, respectively, where $n$ is the sample size and $d$ is the dimensionality. For pairwise learning, inspired by \cite{lei2020sharper,lei2021generalization}, we propose a simple private SGD algorithm based on gradient perturbation that satisfies $(\epsilon,\delta)$-differential privacy, and we develop novel utility bounds for the proposed algorithm. In particular, we prove that our algorithm can achieve excess risk rates $\mathcal{O}\Big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ with gradient complexity $\mathcal{O}(n)$ and $\mathcal{O}\big(n^{\frac{2-\alpha}{1+\alpha}}+n\big)$ for strongly smooth and $\alpha$-H\"older smooth losses, respectively. Further, faster learning rates are established in a low-noise setting for both smooth and non-smooth losses. To the best of our knowledge, this is the first utility analysis that provides excess population risk bounds better than $\mathcal{O}\Big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big)$ for privacy-preserving pairwise learning.
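As a rough illustration of the gradient-perturbation mechanism for pairwise learning, the sketch below clips a per-pair gradient and adds Gaussian noise. The noise calibration is a standard Gaussian-mechanism heuristic and the names (`dp_sgd_pairwise`, `sq_pair_grad`) are hypothetical; the paper's exact constants and schedule differ.

```python
import numpy as np

def dp_sgd_pairwise(X, y, loss_grad, epsilon, delta, T=None, lr=0.1, clip=1.0):
    """Gradient-perturbation DP-SGD for pairwise learning (illustrative sketch;
    the noise scale below is a generic Gaussian-mechanism heuristic)."""
    n, d = X.shape
    T = T or n                                      # O(n) gradient complexity
    sigma = clip * np.sqrt(2 * T * np.log(1.25 / delta)) / (n * epsilon)
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(T):
        i, j = rng.integers(n, size=2)              # sample a pair of examples
        g = loss_grad(w, X[i], y[i], X[j], y[j])
        g = g / max(1.0, np.linalg.norm(g) / clip)  # clip per-pair gradient
        w -= lr * (g + rng.normal(0.0, sigma, size=d))
    return w

def sq_pair_grad(w, xi, yi, xj, yj):
    """Gradient of a pairwise least-squares loss (w'(x_i - x_j) - (y_i - y_j))^2."""
    diff = xi - xj
    return 2.0 * (w @ diff - (yi - yj)) * diff
```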
Recent studies of the generalization properties of distributed learning with random features have assumed that the target concept lies in the hypothesis space. However, this strict condition fails in the more common non-attainable case. In this paper, using refined proof techniques, we first extend the optimal rates for distributed learning with random features to the non-attainable case. Then, we reduce the number of required random features via a data-dependent generating strategy, and improve the allowed number of partitions with additional unlabeled data. Theoretical analysis shows that these techniques substantially reduce the computational cost while preserving the optimal generalization accuracy under standard assumptions. Finally, we conduct several experiments on both simulated and real-world datasets, and the empirical results validate our theoretical findings.
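For orientation, a plain data-independent random Fourier feature regression is sketched below; the data-dependent generating strategy studied in the paper would replace the Gaussian draw of the frequencies `W`, and all names here are illustrative assumptions.

```python
import numpy as np

def rff_ridge(X, y, n_features=200, gamma=1.0, lam=1e-3, seed=0):
    """Ridge regression with random Fourier features approximating the Gaussian
    kernel exp(-gamma ||x - x'||^2); data-independent feature generation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(d, n_features))  # frequencies
    b = rng.uniform(0, 2 * np.pi, size=n_features)                 # phases
    Phi = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)            # n x m map
    alpha = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(n_features), Phi.T @ y)
    return lambda Xt: np.sqrt(2.0 / n_features) * np.cos(Xt @ W + b) @ alpha
```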
Two-sample tests utilizing a similarity graph on observations are useful for high-dimensional and non-Euclidean data due to their flexibility and good performance under a wide range of alternatives. Existing works have mainly focused on sparse graphs, such as graphs with a number of edges on the order of the number of observations, and their asymptotic results impose strong conditions on the graph that can easily be violated by the commonly constructed graphs those works themselves recommend. Moreover, graph-based tests often perform better with denser graphs. In this work, we establish the theoretical foundation for graph-based tests with graphs ranging from those recommended in the current literature to much denser ones.
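A minimal example of a graph-based two-sample test is sketched below, assuming a k-nearest-neighbor similarity graph and an edge-count statistic calibrated by permutation (graph density is controlled by `k`); the construction is illustrative rather than the specific tests analyzed in this work.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_edge_count_test(X, Y, k=5, n_perm=1000, seed=0):
    """Graph-based two-sample test: build a k-NN graph on the pooled sample,
    count between-sample edges, and calibrate the statistic by permutation."""
    Z = np.vstack([X, Y]); n, m = len(X), len(Y)
    tree = cKDTree(Z)
    _, nbrs = tree.query(Z, k=k + 1)          # first neighbour is the point itself
    edges = {(min(i, j), max(i, j)) for i in range(n + m) for j in nbrs[i, 1:]}
    edges = np.array(list(edges))
    labels = np.array([0] * n + [1] * m)
    def between(lab):                          # edges connecting the two samples
        return np.sum(lab[edges[:, 0]] != lab[edges[:, 1]])
    obs = between(labels)
    rng = np.random.default_rng(seed)
    perms = [between(rng.permutation(labels)) for _ in range(n_perm)]
    # Fewer between-sample edges than expected indicates a distributional difference
    return obs, np.mean([p <= obs for p in perms])
```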
Acoustic wave propagation through a homogeneous material embedded in an unbounded medium can be formulated as a boundary integral equation and accurately solved with the boundary element method. Computational efficiency deteriorates at high frequencies because the mesh grows when a fixed number of elements per wavelength is maintained, and because high material contrasts render the linear system ill-conditioned. This study presents the design of boundary element methods that accommodate nonconforming surface meshes at the material interface. The nonconforming algorithm allows for independent grid generation, which improves flexibility and reduces the degrees of freedom. It works with different boundary integral formulations for Helmholtz transmission problems, with operator preconditioning, and with coupling to finite element solvers. Extensive numerical benchmarks on canonical configurations and an acoustic foam model confirm the significant improvements in computational efficiency when employing the nonconforming grid coupling in the boundary element method.
Although the field of multi-agent reinforcement learning (MARL) has made considerable progress in recent years, solving systems with a large number of agents remains a hard challenge. Graphon mean field games (GMFGs) enable the scalable analysis of MARL problems that are otherwise intractable. Owing to the mathematical structure of graphons, however, this approach is limited to dense graphs, which are insufficient to describe many real-world networks such as power-law graphs. Our paper introduces a novel formulation of GMFGs, called LPGMFGs, which leverages the graph-theoretical concept of $L^p$ graphons and provides a machine learning tool to efficiently and accurately approximate solutions for sparse network problems. This especially includes power-law networks, which are empirically observed in various application areas and cannot be captured by standard graphons. We derive theoretical existence and convergence guarantees and give empirical examples that demonstrate the accuracy of our learning approach for systems with many agents. Furthermore, we rigorously extend the Online Mirror Descent (OMD) learning algorithm to our setup to accelerate learning speed, allow for agent interaction through the mean field in the transition kernel, and empirically show its capabilities. In general, we provide a scalable, mathematically well-founded machine learning approach to a large class of otherwise intractable problems of great relevance in numerous research fields.
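As a building block, a single Online Mirror Descent step with the entropic mirror map over a finite action simplex can be sketched as follows; this is a generic OMD update, not the full graphon mean field iteration developed in the paper, and all names are illustrative.

```python
import numpy as np

def omd_policy_update(policy, q_values, lr=0.1):
    """One Online Mirror Descent step with the entropic mirror map
    (multiplicative weights) on a finite action simplex."""
    logits = np.log(policy + 1e-12) + lr * q_values   # ascend on action values
    new_policy = np.exp(logits - logits.max())        # stabilized exponentiation
    return new_policy / new_policy.sum()              # project back to the simplex
```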
Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods, which depend heavily on the model assumption. In this paper, we consider a model-free subsampling strategy for generating subdata from the original full data. To measure how well a subdata set represents the original data, we propose a criterion, the generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy in the theory of uniform designs. These properties allow us to develop a low-GEFD, data-driven subsampling method based on existing uniform designs. Through simulation examples and a real case study, we show that the proposed subsampling method is superior to random sampling. Moreover, our method remains robust under diverse model specifications, whereas other popular subsampling methods underperform. In practice, such a model-free property is more appealing than model-based subsampling, which may perform poorly when the model is misspecified, as demonstrated in our simulation studies.
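To convey the model-free idea, the sketch below scores candidate subsets by a crude marginal-CDF discrepancy and keeps the best one; the GEFD criterion and the uniform-design construction from the paper are more refined, and all names here are illustrative.

```python
import numpy as np

def marginal_l2_discrepancy(sub, full):
    """Crude stand-in for a discrepancy criterion: average squared distance between
    the marginal empirical CDFs of the subdata and the full data."""
    d = 0.0
    for j in range(full.shape[1]):
        grid = np.sort(full[:, j])
        F_full = np.searchsorted(grid, grid, side='right') / len(full)
        F_sub = np.searchsorted(np.sort(sub[:, j]), grid, side='right') / len(sub)
        d += np.mean((F_full - F_sub) ** 2)
    return d / full.shape[1]

def low_discrepancy_subsample(X, m, n_candidates=50, seed=0):
    """Model-free subsampling sketch: among random candidate subsets, keep the one
    with the smallest marginal discrepancy to the full data."""
    rng = np.random.default_rng(seed)
    best_idx, best_val = None, np.inf
    for _ in range(n_candidates):
        idx = rng.choice(len(X), size=m, replace=False)
        val = marginal_l2_discrepancy(X[idx], X)
        if val < best_val:
            best_idx, best_val = idx, val
    return best_idx
```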
Template-based synthesis, also known as sketching, is a localized approach to program synthesis in which the programmer provides not only a specification, but also a high-level ``sketch'' of the program. The sketch is basically a partial program that models the general intuition of the programmer, while leaving the low-level details as unimplemented ``holes''. The role of the synthesis engine is then to fill in these holes such that the completed program satisfies the desired specification. In this work, we focus on template-based synthesis of polynomial imperative programs with real variables, i.e.~imperative programs in which all expressions appearing in assignments, conditions and guards are polynomials over program variables. While this problem can be solved in a sound and complete manner by a reduction to the first-order theory of the reals, the resulting formulas will contain a quantifier alternation and are extremely hard for modern SMT solvers, even when considering toy programs with a handful of lines. Moreover, the classical algorithms for quantifier elimination are notoriously unscalable and not at all applicable to this use-case. In contrast, our main contribution is an algorithm, based on several well-known theorems in polyhedral and real algebraic geometry, namely Putinar's Positivstellensatz, the Real Nullstellensatz, Handelman's Theorem and Farkas' Lemma, which sidesteps the quantifier elimination difficulty and reduces the problem directly to Quadratic Programming (QP). Alternatively, one can view our algorithm as an efficient way of eliminating quantifiers in the particular formulas that appear in the synthesis problem. The resulting QP instances can then be handled quite easily by SMT solvers. Notably, our reduction to QP is sound and semi-complete, i.e.~it is complete if polynomials of a sufficiently high degree are used in the templates...
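To give a flavor of the Farkas-based encoding that sidesteps the universal quantifier, the sketch below asks an SMT solver for template coefficients and multipliers witnessing an entailment. With a known guard this particular instance happens to be linear; quadratic (QP) constraints arise once unknown template coefficients appear on both sides of the entailment, e.g. for loop invariants. The concrete guard and all names are illustrative, not the paper's benchmarks.

```python
from z3 import Reals, Real, Solver, sat

# Known guard: 0 <= x <= 2, 0 <= y <= 3, written as A [x y]^T <= b
A = [(-1, 0), (1, 0), (0, -1), (0, 1)]
b = [0, 2, 0, 3]

c1, c2, d = Reals('c1 c2 d')                # template: c1*x + c2*y <= d
lam = [Real(f'l{i}') for i in range(4)]     # one Farkas multiplier per guard row

s = Solver()
s.add([l >= 0 for l in lam])
# Farkas' Lemma: lam^T A == (c1, c2) and lam^T b <= d  implies that the
# guard entails the template, with no quantifier over the program variables.
s.add(sum(l * a[0] for l, a in zip(lam, A)) == c1)
s.add(sum(l * a[1] for l, a in zip(lam, A)) == c2)
s.add(sum(l * bi for l, bi in zip(lam, b)) <= d)
# Ask for a bound on x + y at least as tight as 5
s.add(c1 == 1, c2 == 1, d <= 5)
if s.check() == sat:
    print(s.model())                        # e.g. d = 5 with lam = (0, 1, 0, 1)
```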
We consider risk-averse learning in repeated unknown games where the goal of the agents is to minimize their individual risk of incurring significantly high costs. Specifically, the agents use the conditional value at risk (CVaR) as a risk measure and rely on bandit feedback, in the form of the cost values of the selected actions at every episode, to estimate their CVaR values and update their actions. A major challenge in using bandit feedback to estimate CVaR is that the agents can only access their own cost values, which, however, depend on the actions of all agents. To address this challenge, we propose a new risk-averse learning algorithm with momentum that utilizes the full historical information on the cost values. We show that this algorithm achieves sub-linear regret, matching the best known algorithms in the literature. Numerical experiments on a Cournot game show that our method outperforms existing methods.
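For concreteness, a simple empirical CVaR estimator and a momentum-style update over episodes might look as follows; the paper's estimator reweights the full cost history, so this is only an illustrative sketch with hypothetical names.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.1):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of observed costs."""
    costs = np.sort(np.asarray(costs))
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[-k:].mean()

def momentum_cvar_update(prev_est, new_costs, alpha=0.1, beta=0.9):
    """Momentum-style CVaR tracking: blend the running estimate with the CVaR of
    the newest batch of bandit cost samples (illustrative only)."""
    return beta * prev_est + (1 - beta) * empirical_cvar(new_costs, alpha)
```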
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies that differ from the teacher's. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
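A minimal sketch of the shared-projection idea (assuming PyTorch; the class name and dimensions are illustrative): student hidden states are mapped into the teacher's space by one shared linear projection and matched layer-wise with an MSE loss. The dual-training of embeddings and the downstream task loss are omitted.

```python
import torch
import torch.nn as nn

class ProjectionDistillLoss(nn.Module):
    """Layer-wise distillation through a shared projection: student hidden states
    (dim d_student) are mapped into the teacher's space (dim d_teacher) and
    matched with MSE. Illustrative sketch only."""
    def __init__(self, d_student, d_teacher):
        super().__init__()
        # One projection matrix shared across all distilled layers
        self.proj = nn.Linear(d_student, d_teacher, bias=False)
        self.mse = nn.MSELoss()

    def forward(self, student_layers, teacher_layers):
        # student_layers / teacher_layers: lists of (batch, seq, dim) tensors
        return sum(self.mse(self.proj(s), t.detach())
                   for s, t in zip(student_layers, teacher_layers))

# Usage sketch: loss = task_loss + lam * ProjectionDistillLoss(312, 768)(s_hid, t_hid)
```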