This paper pushes further the intrinsic capabilities of the GFEM$^{gl}$ global-local approach introduced initially in [1]. We develop a distributed computing approach using MPI (Message Passing Interface) both for the global and local problems. Regarding local problems, a specific scheduling strategy is introduced. Then, to measure correctly the convergence of the iterative process, we introduce a reference solution that revisits the product of classical and enriched functions. As a consequence, we are able to propose a purely matrix-based implementation of the global-local problem. The distributed approach is then compared to other parallel solvers either direct or iterative with domain decomposition. The comparison addresses the scalability as well as the elapsed time. Numerical examples deal with linear elastic problems: a polynomial exact solution problem, a complex micro-structure, and, finally, a pull-out test (with different crack extent). 1: C. A. Duarte, D.-J. Kim, and I. Babu\v{s}ka. A global-local approach for the construction of enrichment functions for the generalized fem and its application to three-dimensional cracks. In Advances in Meshfree Techniques, Dordrecht, 2007. Springer
Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of k-parities on d bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, allows to reduce significantly the computational cost of learning this class of functions, compared to learning under the uniform distribution. We conduct experiments to support our analysis. Furthermore, we show that for another class of functions - namely the `Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial, while we conjecture that CL with unbounded many curriculum steps can learn this class efficiently.
Using techniques developed recently in the field of compressed sensing we prove new upper bounds for general (non-linear) sampling numbers of (quasi-)Banach smoothness spaces in $L^2$. In relevant cases such as mixed and isotropic weighted Wiener classes or Sobolev spaces with mixed smoothness, sampling numbers in $L^2$ can be upper bounded by best $n$-term trigonometric widths in $L^\infty$. We describe a recovery procedure based on $\ell^1$-minimization (basis pursuit denoising) using only $m$ function values. With this method, a significant gain in the rate of convergence compared to recently developed linear recovery methods is achieved. In this deterministic worst-case setting we see an additional speed-up of $n^{-1/2}$ compared to linear methods in case of weighted Wiener spaces. For their quasi-Banach counterparts even arbitrary polynomial speed-up is possible. Surprisingly, our approach allows to recover mixed smoothness Sobolev functions belonging to $S^r_pW(\mathbb{T}^d)$ on the $d$-torus with a logarithmically better rate of convergence than any linear method can achieve when $1 < p < 2$ and $d$ is large. This effect is not present for isotropic Sobolev spaces.
We present substantially generalized and improved quantum algorithms over prior work for inhomogeneous linear and nonlinear ordinary differential equations (ODE). Specifically, we show how the norm of the matrix exponential characterizes the run time of quantum algorithms for linear ODEs opening the door to an application to a wider class of linear and nonlinear ODEs. In Berry et al., (2017), a quantum algorithm for a certain class of linear ODEs is given, where the matrix involved needs to be diagonalizable. The quantum algorithm for linear ODEs presented here extends to many classes of non-diagonalizable matrices. The algorithm here is also exponentially faster than the bounds derived in Berry et al., (2017) for certain classes of diagonalizable matrices. Our linear ODE algorithm is then applied to nonlinear differential equations using Carleman linearization (an approach taken recently by us in Liu et al., (2021)). The improvement over that result is two-fold. First, we obtain an exponentially better dependence on error. This kind of logarithmic dependence on error has also been achieved by Xue et al., (2021), but only for homogeneous nonlinear equations. Second, the present algorithm can handle any sparse, invertible matrix (that models dissipation) if it has a negative log-norm (including non-diagonalizable matrices), whereas Liu et al., (2021) and Xue et al., (2021) additionally require normality.
In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test, and to test these hypotheses-that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias, leading to many false discoveries. Selective inference is a framework that allows for performing valid inference even when the same data is reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for doing selective inference for this setting, where the hierarchical clustering algorithm is used for producing the cluster assignments, which was then extended to k-means clustering by Chen and Witten [2022]. Both these works rely on assuming a known covariance structure for the data, but in practice, the noise level needs to be estimated-and this is particularly challenging when the true cluster structure is unknown. In our work, we extend to the setting of noise with unknown variance, and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.
Local search is an effective method for solving large-scale combinatorial optimization problems, and it has made remarkable progress in recent years through several subtle mechanisms. In this paper, we found two ways to improve the local search algorithms in solving Pseudo-Boolean Optimization(PBO): Firstly, some of those mechanisms such as unit propagation are merely used in solving MaxSAT before, which can be generalized to solve PBO as well; Secondly, the existing local search algorithms utilize the heuristic on variables, so-called score, to mainly guide the search. We attempt to gain more insights into the clause, as it plays the role of a middleman who builds a bridge between variables and the given formula. Hence, we first extended the combination of unit propagation-based decimation algorithm to PBO problem, giving a further generalized definition of unit clause for PBO problem, and apply it to the existing solver LS-PBO for constructing an initial assignment; then, we introduced a new heuristic on clauses, dubbed care, to set a higher priority for the clauses that are less satisfied in current iterations. Experiments on three real-world application benchmarks including minimum-width confidence band, wireless sensor network optimization, and seating arrangement problems show that our algorithm DeciLS-PBO has a promising performance compared to the state-of-the-art algorithms.
Human visual perception can easily generalize to out-of-distributed visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.
The geometric optimisation of crystal structures is a procedure widely used in Chemistry that changes the geometrical placement of the particles inside a structure. It is called structural relaxation and constitutes a local minimization problem with a non-convex objective function whose domain complexity increases along with the number of particles involved. In this work we study the performance of the two most popular first order optimisation methods, Gradient Descent and Conjugate Gradient, in structural relaxation. The respective pseudocodes can be found in Section 6. Although frequently employed, there is a lack of their study in this context from an algorithmic point of view. In order to accurately define the problem, we provide a thorough derivation of all necessary formulae related to the crystal structure energy function and the function's differentiation. We run each algorithm in combination with a constant step size, which provides a benchmark for the methods' analysis and direct comparison. We also design dynamic step size rules and study how these improve the two algorithms' performance. Our results show that there is a trade-off between convergence rate and the possibility of an experiment to succeed, hence we construct a function to assign utility to each method based on our respective preference. The function is built according to a recently introduced model of preference indication concerning algorithms with deadline and their run time. Finally, building on all our insights from the experimental results, we provide algorithmic recipes that best correspond to each of the presented preferences and select one recipe as the optimal for equally weighted preferences.
Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.
The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equation are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.