We propose a new paradigm for designing efficient p-adaptive arbitrary high order methods. We consider arbitrary high order iterative schemes that gain one order of accuracy at each iteration, and we modify them so that the accuracy achieved at a given iteration matches the discretization accuracy of that same iteration. Beyond the computational advantage, the modified methods naturally lend themselves to p-adaptivity: the iterations are simply stopped when appropriate conditions are met. Moreover, the modification is easy to include in an existing implementation of an arbitrary high order iterative scheme, and it does not compromise parallelization if the original method allowed it. An application to the ADER method for hyperbolic Partial Differential Equations (PDEs) is presented here. We explain how this framework can be interpreted as an arbitrary high order iterative scheme, by recasting it as a Deferred Correction (DeC) method, and how to easily modify it to obtain a more efficient formulation, in which a local a posteriori limiter can be naturally integrated, leading to p-adaptivity and structure-preserving properties. Finally, the novel approach is extensively tested against classical benchmarks for compressible gas dynamics to demonstrate its robustness and computational efficiency.
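To make the stopping idea concrete, here is a minimal Python sketch of an order-adaptive iteration loop. The operators `residual` and `correction`, the tolerance, and the stopping test are illustrative assumptions, not the paper's actual DeC/ADER formulation.

```python
import numpy as np

def adaptive_dec_step(u0, residual, correction, p_max, tol):
    """One p-adaptive iterative step in the spirit of a DeC/ADER scheme (sketch).

    Each iteration is assumed to gain one order of accuracy; the loop stops as
    soon as successive iterates differ by less than `tol`, so the number of
    iterations (and hence the effective order) adapts to the accuracy needed.
    `residual` and `correction` are hypothetical user-supplied operators.
    """
    u = u0.copy()
    for p in range(1, p_max + 1):
        u_new = u - correction(residual(u))   # one correction sweep: +1 order
        if np.linalg.norm(u_new - u) < tol:   # a posteriori stopping criterion
            return u_new, p                   # required accuracy reached: stop early
        u = u_new
    return u, p_max
```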
When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. Yet we show, through a novel set of experiments, that stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining the best performance. We demonstrate the same effects in the noise-less case as well, i.e., for full-batch GD. We formally prove that GD with a large step size -- on certain non-convex function classes -- follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Our settings provide a framework for future analysis which allows comparing algorithms based on behaviors that cannot be observed in the traditional settings.
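The trajectory argument can be illustrated on a toy one-dimensional objective. The function, starting point, and step sizes below are my own illustrative choices, not the function classes analyzed in the paper: with a small step size GD settles in the nearby shallow basin, while a large step size overshoots the barrier and reaches the deeper one.

```python
# Toy non-convex objective with a shallow local minimum near x ≈ 0.93
# and a deeper global minimum near x ≈ -1.06 (illustrative example only).
f = lambda x: x**4 - 2 * x**2 + 0.5 * x
grad = lambda x: 4 * x**3 - 4 * x + 0.5

def gd(x0, lr, steps=200):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x_small = gd(x0=1.5, lr=0.05)  # small step: stays in the nearby (local) basin
x_large = gd(x0=1.5, lr=0.2)   # large step: jumps the barrier, reaches the global basin
print(f"small lr -> x = {x_small:.3f}, f = {f(x_small):.3f}")
print(f"large lr -> x = {x_large:.3f}, f = {f(x_large):.3f}")
```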
How have individuals of social animals in nature evolved to learn from each other, and what would be the optimal strategy for such learning in a specific environment? Here, we address both problems by employing a deep reinforcement learning model to optimize the social learning strategies (SLSs) of agents in a cooperative game in a multi-dimensional landscape. Throughout training to maximize the overall payoff, we find that the agent spontaneously learns various concepts of social learning, such as copying, focusing on frequent and well-performing neighbors, self-comparison, and the importance of balancing between individual and social learning, without any explicit guidance or prior knowledge about the system. The SLS of a fully trained agent outperforms all of the traditional baseline SLSs in terms of mean payoff. We demonstrate the superior performance of the reinforcement learning agent in various environments, including temporally changing environments and real social networks, which also verifies the adaptability of our framework to different social settings.
The paper is concerned with efficient numerical methods for solving a linear system $\phi(A) x = b$, where $\phi(z)$ is a $\phi$-function and $A\in \mathbb R^{N\times N}$. In particular, in this work we are interested in the computation of ${\phi_\ell(A)}^{-1} b$ for $\ell = 1, 2$, where $\phi_1(z)=\displaystyle\frac{e^z-1}{z}$ and $\phi_2(z)=\displaystyle\frac{e^z-1-z}{z^2}$. Under suitable conditions on the spectrum of $A$, we design fast algorithms for computing both ${\phi_\ell(A)}^{-1}$ and ${\phi_\ell(A)}^{-1} b$ based on Newton's iteration and Krylov-type methods, respectively. Adaptations of these schemes for structured matrices are considered; in particular, the cases of banded and, more generally, quasiseparable matrices are investigated. Numerical results are presented to show the effectiveness of our proposed algorithms.
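For reference, here is a minimal dense Python sketch of what is being computed for $\ell = 1$, using explicit matrix functions rather than the fast structured algorithms of the paper; the helper names and the test matrix are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm, solve

def phi1(A):
    """Dense evaluation of phi_1(A) = A^{-1}(e^A - I).

    Reference implementation only; assumes A is nonsingular and well
    conditioned. The paper's fast algorithms avoid forming phi_1(A)
    explicitly and exploit banded/quasiseparable structure instead.
    """
    return solve(A, expm(A) - np.eye(A.shape[0]))

def phi1_inv_apply(A, b):
    """Solve phi_1(A) x = b by a direct dense factorization (reference only)."""
    return solve(phi1(A), b)

# tiny sanity check on a random well-conditioned matrix
rng = np.random.default_rng(0)
A = -np.eye(5) + 0.1 * rng.standard_normal((5, 5))
b = rng.standard_normal(5)
x = phi1_inv_apply(A, b)
print(np.linalg.norm(phi1(A) @ x - b))  # residual should be near machine precision
```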
In this paper, we construct a novel Eulerian-Lagrangian finite volume (ELFV) method for nonlinear scalar hyperbolic equations in one space dimension. It is well known that the exact solutions to such problems may contain shocks even though the initial conditions are smooth, and direct numerical methods may suffer from restricted time step sizes. To relieve this restriction, we propose an ELFV method in which the space-time domain is partitioned by lines originating from the cell interfaces, whose slopes are obtained from the Rankine-Hugoniot jump condition. Unfortunately, to avoid intersections of the partition lines, the time step sizes are still limited. To overcome this limitation, we detect effective troubled cells (ETCs) and carefully design the influence region of each ETC, within which the partitioned space-time regions are merged together to form a new one. With this new partition of the space-time domain, we theoretically prove that the proposed first-order scheme with forward Euler time discretization is total-variation-diminishing and maximum-principle-preserving under time step constraints at least twice as large as those of the classical first-order Eulerian method for Burgers' equation. Numerical experiments verify the optimality of the designed time step sizes.
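As a concrete illustration of how the partition-line slopes follow from the Rankine-Hugoniot condition, here is a small Python sketch for Burgers' equation; the helper and its interface are hypothetical and do not reproduce the paper's full ELFV construction.

```python
import numpy as np

def partition_line_slopes(u, eps=1e-14):
    """Slopes dx/dt of space-time partition lines at cell interfaces for
    Burgers' equation u_t + (u^2/2)_x = 0, via the Rankine-Hugoniot condition:
        s = (f(u_R) - f(u_L)) / (u_R - u_L) = (u_L + u_R) / 2.
    `u` holds cell averages; this is a hypothetical helper, not the full scheme.
    """
    uL, uR = u[:-1], u[1:]
    flux = lambda v: 0.5 * v**2
    s = np.where(np.abs(uR - uL) > eps,
                 (flux(uR) - flux(uL)) / (uR - uL),  # shock speed across the jump
                 uL)                                 # degenerate case: characteristic speed f'(u_L)
    return s

print(partition_line_slopes(np.array([1.0, 0.0, -0.5])))  # -> [ 0.5  -0.25]
```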
Despite recent success on various tasks, deep learning techniques still perform poorly on adversarial examples with small perturbations. While optimization-based methods for adversarial attacks are well explored in the field of computer vision, it is impractical to apply them directly in natural language processing due to the discrete nature of text. To address this problem, we propose a unified framework that extends existing optimization-based adversarial attack methods from the vision domain to craft textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified during forward propagation. The final perturbed latent representations are then decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find our algorithm effective even when using only proxy gradient information; we therefore perform the more challenging transfer black-box attack and conduct comprehensive experiments to evaluate our attack algorithm with several models on three benchmark datasets. Experimental results demonstrate that our method achieves overall better performance and produces more fluent and grammatical adversarial samples compared to strong baseline methods. All the code and data will be made public.
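The perturb-then-decode loop described above can be sketched as follows; the model interfaces (`victim`, `mlm_head`), the step sizes, and the sign-of-gradient update are illustrative assumptions rather than the authors' exact T-PGD implementation.

```python
import torch

def t_pgd_sketch(victim, mlm_head, embeds, label, steps=20, alpha=1e-2, eps=0.5):
    """Minimal sketch of a T-PGD-style attack (assumed interface, not the
    authors' code): a continuous perturbation `delta` on the embedding layer
    is optimized by projected gradient ascent on the victim's loss, then the
    perturbed embeddings are decoded back to tokens with a masked-LM head.

    `victim(inputs)` -> classification logits and `mlm_head(inputs)` -> vocab
    logits are hypothetical callables standing in for the actual models.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(victim(embeds + delta), label)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascend the victim's loss
            delta.clamp_(-eps, eps)             # project onto the L_inf ball
            delta.grad.zero_()
    adv_token_ids = mlm_head(embeds + delta).argmax(dim=-1)  # decode candidate tokens
    return adv_token_ids
```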
A Peskun ordering between two samplers, implying a dominance of one over the other, is known among the Markov chain Monte Carlo community for being a remarkably strong result, but it is also known for being notably difficult to establish. Indeed, one has to prove that the probability of reaching a state $\mathbf{y}$ from a state $\mathbf{x}$ under one sampler is greater than or equal to the probability under the other sampler, and this must hold for all pairs $(\mathbf{x}, \mathbf{y})$ such that $\mathbf{x} \neq \mathbf{y}$. We provide in this paper a weaker version that does not require the inequality between the probabilities to hold for all such states: essentially, the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the inequality holds belong to a set that concentrates the probability mass. This weak ordering turns out to be useful for comparing lifted samplers for partially-ordered discrete state-spaces with their Metropolis--Hastings counterparts. An analysis in great generality yields a qualitative conclusion: they asymptotically perform better in certain situations (and we are able to identify them), but not necessarily in others (and the reasons why are made clear). A thorough study in a specific context of graphical-model simulation is also conducted.
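In symbols, the strong (Peskun) condition described above, for two transition kernels $P_1$ and $P_2$ (notation assumed, following the verbal description), reads
$$
P_1(\mathbf{x}, \mathbf{y}) \;\geq\; P_2(\mathbf{x}, \mathbf{y}) \qquad \text{for all } \mathbf{x} \neq \mathbf{y},
$$
whereas the weak version only requires the inequality on a set of pairs that asymptotically concentrates the probability mass.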
Graph Convolutional Networks (GCNs) have been widely applied in various fields due to their significant power in processing graph-structured data. Typical GCNs and their variants work under a homophily assumption (i.e., nodes with the same class tend to connect to each other), while ignoring the heterophily which exists in many real-world networks (i.e., nodes with different classes tend to form edges). Existing methods deal with heterophily mainly by aggregating higher-order neighborhoods or combining intermediate representations, which introduces noise and irrelevant information into the result. These methods, however, do not change the propagation mechanism itself, which works under the homophily assumption and is a fundamental part of GCNs; this makes it difficult to distinguish the representations of nodes from different classes. To address this problem, in this paper we design a novel propagation mechanism that can automatically adjust the propagation and aggregation process according to the homophily or heterophily between node pairs. To adaptively learn the propagation process, we introduce two measurements of the homophily degree between node pairs, which are learned from topological and attribute information, respectively. We then incorporate the learnable homophily degree into the graph convolution framework, which is trained in an end-to-end fashion, enabling it to go beyond the assumption of homophily. More importantly, we theoretically prove that our model can constrain the similarity of representations between nodes according to their homophily degree. Experiments on seven real-world datasets demonstrate that this new approach outperforms the state-of-the-art methods under heterophily or low homophily, and gains competitive performance under homophily.
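The idea of letting a learnable homophily degree modulate propagation can be sketched as follows in PyTorch; the edge-wise scoring function, the signed-message rule, and all shapes are illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

class HomophilyAwarePropagation(nn.Module):
    """Illustrative sketch (not the paper's architecture): an edge-wise
    homophily degree in [0, 1] is predicted from node attributes and used to
    modulate propagation, so that heterophilous edges can push representations
    apart instead of averaging them together.
    """
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # learnable homophily estimator

    def forward(self, x, edge_index):
        src, dst = edge_index                                # edge_index: (2, num_edges)
        h = torch.sigmoid(self.score(torch.cat([x[src], x[dst]], dim=-1)))  # homophily degree per edge
        msg = (2 * h - 1) * x[src]                           # homophilous edges attract (+), heterophilous repel (-)
        out = torch.zeros_like(x).index_add_(0, dst, msg)    # aggregate signed messages per node
        return x + out
```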
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
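As a minimal concrete example of the design question posed above, here is a generic uniform affine quantizer in Python; it is a textbook scheme with arbitrary choices of bit-width and range handling, not a specific method from this survey.

```python
import numpy as np

def uniform_quantize(x, num_bits=4):
    """Illustrative uniform affine quantization of a real-valued tensor to
    `num_bits` integers: map [min, max] onto the integer grid, round, then
    dequantize to measure the induced error.
    """
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)            # real-valued step size
    zero_point = np.round(qmin - x.min() / scale)          # integer offset for zero
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)  # integer codes
    x_hat = (q - zero_point) * scale                       # dequantized values
    return q.astype(np.int32), x_hat

x = np.random.randn(1000).astype(np.float32)
q, x_hat = uniform_quantize(x, num_bits=4)
print("mean abs quantization error:", np.abs(x - x_hat).mean())
```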
Invariant approaches have been remarkably successful in tackling the problem of domain generalization, where the objective is to perform inference on data distributions different from those used in training. In our work, we investigate whether it is possible to leverage domain information from the unseen test samples themselves. We propose a domain-adaptive approach consisting of two steps: a) we first learn a discriminative domain embedding from unsupervised training examples, and b) we use this domain embedding as supplementary information to build a domain-adaptive model that takes both the input and its domain into account while making predictions. For unseen domains, our method simply uses a few unlabelled test examples to construct the domain embedding. This enables adaptive classification on any unseen domain. Our approach achieves state-of-the-art performance on various domain generalization benchmarks. In addition, we introduce the first real-world, large-scale domain generalization benchmark, Geo-YFCC, containing 1.1M samples over 40 training, 7 validation, and 15 test domains, orders of magnitude larger than prior work. We show that existing approaches either do not scale to this dataset or underperform compared to the simple baseline of training a model on the union of data from all training domains. In contrast, our approach achieves a significant improvement.
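A schematic of the two-step recipe might look as follows in PyTorch; the architecture, the mean-pooled domain embedding, and all sizes are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

class DomainAdaptiveClassifier(nn.Module):
    """Schematic version of the two-step recipe described above (interfaces
    and sizes are illustrative): a feature extractor produces per-example
    features; a domain embedding is formed by averaging the features of a few
    unlabelled examples from the same domain and is concatenated with the
    input features before classification.
    """
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def domain_embedding(self, unlabeled_x):
        with torch.no_grad():
            return self.backbone(unlabeled_x).mean(dim=0)   # one vector per domain

    def forward(self, x, d_emb):
        f = self.backbone(x)
        d = d_emb.expand(f.size(0), -1)                     # broadcast domain info to the batch
        return self.head(torch.cat([f, d], dim=-1))
```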
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial on sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
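As one of the simplest instances of the pruning approaches surveyed, here is a hedged sketch of global magnitude pruning in PyTorch; the sparsity level and the mask-based mechanics are illustrative choices, not a recommendation from the survey.

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    """Illustrative global magnitude pruning: zero out the fraction `sparsity`
    of weights with the smallest absolute value, returning a mask per
    parameter so the sparsity pattern can be re-applied during training.
    """
    weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(weights, sparsity)        # global magnitude cutoff
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            masks[name] = (p.abs() > threshold).float()  # 1 = keep, 0 = pruned
            p.mul_(masks[name])                          # apply the mask in place
    return masks
```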