In stochastic gradient descent (SGD) for sequential simulations, such as neural stochastic differential equations, the Multilevel Monte Carlo (MLMC) method is known to offer better theoretical computational complexity than the naive Monte Carlo approach. However, in practice, MLMC scales poorly on massively parallel computing platforms such as modern GPUs because of its large parallel complexity, which is equivalent to that of the naive Monte Carlo method. To cope with this issue, we propose the delayed MLMC gradient estimator, which drastically reduces the parallel complexity of MLMC by recycling gradient components computed at earlier steps of SGD. The proposed estimator provably reduces the average parallel complexity per iteration at the cost of a slightly worse per-iteration convergence rate. In our numerical experiments, we use a deep hedging example to demonstrate the superior parallel complexity of our method compared to standard MLMC in SGD.
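To make the delayed-update idea concrete, here is a minimal sketch on a toy one-dimensional objective. Everything named below (the helper `level_correction`, the `2**l` refresh schedule, and all constants) is an illustrative assumption, not the estimator from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def level_correction(theta, level, n_samples):
    """Coupled MLMC correction D_l = g_l - g_{l-1} for a toy objective whose
    level-l gradient is (theta - 1) + O(2**-level) bias + sampling noise."""
    noise = rng.standard_normal(n_samples)            # shared across levels
    g_fine = (theta - 1.0) + 2.0 ** -level + noise
    if level == 0:
        return np.mean(g_fine)
    g_coarse = (theta - 1.0) + 2.0 ** -(level - 1) + noise
    return np.mean(g_fine - g_coarse)                 # coupling shrinks variance

def delayed_mlmc_sgd(theta0, num_levels=5, iters=200, lr=0.05):
    theta = theta0
    corrections = np.zeros(num_levels)                # cached D_l terms
    for t in range(1, iters + 1):
        for l in range(num_levels):
            # Refresh level l only every 2**l iterations: the expensive fine
            # levels are recomputed rarely and recycled in between, which
            # caps the parallel work launched in any single step.
            if t % (2 ** l) == 0:
                corrections[l] = level_correction(theta, l, 2 ** (num_levels - l))
        theta -= lr * corrections.sum()               # telescoping MLMC sum
    return theta

print(delayed_mlmc_sgd(theta0=5.0))  # approaches the minimizer near 1.0
```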
The ability to construct a realistic simulator of financial exchanges, including reproducing the dynamics of the limit order book, can give insight into many counterfactual scenarios, such as a flash crash, a margin call, or changes in macroeconomic outlook. In recent years, agent-based models have been developed that reproduce many features of an exchange, as summarised by a set of stylised facts and statistics. However, the ability to calibrate simulators to a specific period of trading remains an open challenge. In this work, we develop a novel approach to the calibration of market simulators by leveraging recent advances in deep learning, specifically using neural density estimators and embedding networks. We demonstrate that our approach correctly identifies high-probability parameter sets when applied to both synthetic and historical data, without relying on manually selected or weighted ensembles of stylised facts.
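As a rough sketch of this style of calibration, the following trains an embedding network together with a simple Gaussian density head on (parameter, summary) pairs drawn from a toy simulator. The simulator, the prior, and the Gaussian head (standing in for the normalizing-flow density estimators used in practice) are all assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy simulator: parameter theta -> noisy summary statistics x,
# standing in for summaries of a simulated limit order book.
def simulate(theta):                                  # theta: (batch, 1)
    return theta * torch.linspace(0.5, 1.5, 8) + 0.1 * torch.randn(theta.shape[0], 8)

embed = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
head = nn.Linear(16, 2)      # Gaussian head: mean and log-std of p(theta | x)
opt = torch.optim.Adam([*embed.parameters(), *head.parameters()], lr=1e-3)

for step in range(2000):
    theta = torch.rand(256, 1) * 2.0                  # draws from the prior
    x = simulate(theta)
    mu, log_std = head(embed(x)).chunk(2, dim=1)
    # Negative log-likelihood of the true parameters under the estimated density.
    loss = (log_std + 0.5 * ((theta - mu) / log_std.exp()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

x_obs = simulate(torch.tensor([[1.3]]))               # stand-in "observed" data
mu, log_std = head(embed(x_obs)).chunk(2, dim=1)
print(mu.item(), log_std.exp().item())                # approximate posterior
```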
This paper categorizes the parameterized complexity of the algorithmic problems Perfect Phylogeny and Triangulating Colored Graphs when parameterized by the number of genes and colors, respectively. We show that they are complete for the parameterized complexity class XALP using a reduction from Tree-chained Multicolor Independent Set and a proof of membership. We introduce the problem Triangulating Multicolored Graphs as a stepping stone and prove XALP-completeness for this problem as well. We also show that, assuming the Exponential Time Hypothesis, there exists no algorithm that solves any of these problems in time $f(k) n^{o(k)}$, where $n$ is the input size, $k$ the parameter, and $f$ any computable function.
The graduated optimization approach is a heuristic method for finding globally optimal solutions of nonconvex functions and has been theoretically analyzed in several studies. This paper defines a new family of nonconvex functions amenable to graduated optimization, gives sufficient conditions for membership in this family, and provides a convergence analysis of the graduated optimization algorithm on it. It shows that stochastic gradient descent (SGD) with mini-batch stochastic gradients has the effect of smoothing the objective function, with the degree of smoothing determined by the learning rate and batch size. This finding provides theoretical insight into why large batch sizes tend to fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates and batch sizes, and what the optimal learning-rate schedule is. To the best of our knowledge, this is the first paper to provide a theoretical explanation for these aspects. Moreover, a new graduated optimization framework that uses a decaying learning rate and an increasing batch size is analyzed, and experimental results on image classification that support our theoretical findings are reported.
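A minimal sketch of this framework, under the stated connection between implicit smoothing and the learning rate and batch size: the toy nonconvex loss, the stage lengths, and the decay/growth factors below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x, batch_size):
    """Mini-batch gradient of the toy nonconvex loss f(x) = x**2 + 0.4*sin(5*x),
    with additive per-sample noise; larger batches average the noise away."""
    noise = rng.standard_normal(batch_size).mean()
    return 2 * x + 2 * np.cos(5 * x) + noise

x = 3.0
lr, batch = 0.1, 4
for stage in range(6):
    # One graduated-optimization stage: the implicit smoothing level is set by
    # the learning rate and batch size, so decaying lr while growing the batch
    # anneals from a heavily smoothed objective toward the original one.
    for _ in range(500):
        x -= lr * stoch_grad(x, batch)
    lr *= 0.7
    batch *= 2
print(x)   # should settle near a minimizer of the unsmoothed loss
```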
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.
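The two regimes can be illustrated on a toy diagonal quadratic, where the bias dynamics decouple across eigen-directions. Heavy-ball momentum stands in here for the accelerated scheme, and exact gradients are used so that only the bias error is visible; all constants are illustrative:

```python
import numpy as np

# Diagonal quadratic: the bias dynamics of (A)SGD decouple per eigen-direction.
lams = np.array([1.0, 1e-3])        # one large and one small eigenvalue
w_star = np.ones(2)
lr = 0.5                            # step size, safely below 1 / lambda_max

def run(momentum, steps=2000):
    w, v = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        g = lams * (w - w_star)     # exact gradients: only bias error appears
        v = momentum * v - lr * g   # momentum = 0 recovers plain gradient steps
        w = w + v
    return np.abs(w - w_star)       # remaining bias per subspace [large, small]

print("GD  :", run(0.0))            # essentially zero in the large-eigenvalue direction
print("ASGD:", run(0.9))            # far smaller in the small-eigenvalue direction
```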
Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework, separating rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy significantly improves, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process. This approach not only enhances the quality of generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, we show that even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of our approach in harnessing the power of rationales for improved multimodal reasoning. The code is available at https://github.com/chengtan9907/mc-cot.
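The voting step at the heart of such a strategy is easy to sketch. The function below is a generic self-consistency vote, not the released MC-CoT code, and the sampled rationales are invented for illustration:

```python
from collections import Counter

def vote(candidates):
    """Self-consistency voting: given several sampled (rationale, answer)
    pairs, keep the majority answer and one rationale that supports it."""
    answers = Counter(ans for _, ans in candidates)
    best_answer, _ = answers.most_common(1)[0]
    rationale = next(r for r, ans in candidates if ans == best_answer)
    return rationale, best_answer

# Hypothetical samples (in practice these come from stochastic decoding):
samples = [("the object is round, so...", "ball"),
           ("it bounces, so...", "ball"),
           ("it has laces, so...", "shoe")]
print(vote(samples))   # ('the object is round, so...', 'ball')
```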
The problems of determining the permutation-representation number (prn) and the representation number of bipartite graphs are open in the literature. Moreover, the decision problem corresponding to determining the prn of a bipartite graph is NP-complete. However, these numbers have been established for certain subclasses of bipartite graphs, e.g., for crown graphs. Further, it was conjectured that crown graphs have the highest representation number among bipartite graphs. In this work, we first reconcile the relation between the prn of a comparability graph and the dimension of its induced poset and review the upper bounds on the prn of bipartite graphs. We then study the prn of bipartite graphs using the notion of neighborhood graphs. This approach substantiates the aforesaid conjecture and provides theoretical evidence for it. In this connection, we devise a polynomial-time procedure to construct a word that represents a given bipartite graph permutationally. Accordingly, we provide a better upper bound on the prn of bipartite graphs. Further, we construct a class of bipartite graphs, viz., extended crown graphs, defined over posets, and investigate their prn using neighborhood graphs.
With the aim of reducing the computational cost of solving parameter-dependent eigenvalue problems, a model order reduction (MOR) procedure is proposed. We focus on the case of non-self-adjoint generalized eigenvalue problems, such as the stationary multigroup neutron diffusion equations. The method relies on approximating the manifold of solutions using a Proper Orthogonal Decomposition (POD) approach. The numerical method consists of two stages. In the offline stage, we build a reduced space that approximates the manifold. In the online stage, for any new set of parameters, we solve a reduced problem on the reduced space in a much shorter computational time than that required to solve the high-fidelity problem. The method is applied to core computations in the APOLLO3 code.
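A minimal sketch of the two-stage procedure on a synthetic non-self-adjoint pencil. The toy operators below merely stand in for a discretization of the multigroup neutron diffusion problem, and the snapshot count and reduced dimension are arbitrary:

```python
import numpy as np
from scipy.linalg import eig

n, r = 200, 10                       # high-fidelity and reduced dimensions

def operators(mu):
    """Hypothetical non-self-adjoint pencil (A(mu), B(mu))."""
    main = np.linspace(1.0, 2.0, n) + mu * np.sin(np.arange(n) / n)
    A = (np.diag(main) - np.diag(np.ones(n - 1), 1)
         - (1.0 + 0.5 * mu) * np.diag(np.ones(n - 1), -1))
    B = np.eye(n) + 0.1 * mu * np.diag(np.linspace(0.5, 1.5, n))
    return A, B

# Offline stage: collect eigenvector snapshots over training parameters and
# compress them with POD (a truncated SVD) to obtain the reduced basis V.
snapshots = []
for mu in np.linspace(0.0, 1.0, 20):
    A, B = operators(mu)
    vals, vecs = eig(A, B)
    snapshots.append(vecs[:, np.argmin(vals.real)].real)
U, _, _ = np.linalg.svd(np.array(snapshots).T, full_matrices=False)
V = U[:, :r]

# Online stage: Galerkin-project onto V and solve an r x r eigenproblem.
A, B = operators(0.37)               # a new, unseen parameter value
vals_red, _ = eig(V.T @ A @ V, V.T @ B @ V)
vals_hf, _ = eig(A, B)
print(vals_red.real.min(), vals_hf.real.min())   # reduced vs high-fidelity
```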
We introduce and compare computational techniques for sharp extreme event probability estimates in stochastic differential equations with small additive Gaussian noise. In particular, we focus on strategies that are scalable, i.e., their efficiency does not degrade upon temporal and possibly spatial refinement. For that purpose, we extend algorithms based on the Laplace method for estimating the probability of an extreme event to infinite-dimensional path space. The method estimates the limiting exponential scaling using a single realization of the random variable, the large deviation minimizer. Finding this minimizer amounts to solving an optimization problem governed by a differential equation. The probability estimate becomes sharp when it additionally includes prefactor information, which necessitates computing the determinant of a second derivative operator to evaluate a Gaussian integral around the minimizer. We present an approach in infinite dimensions based on Fredholm determinants, and develop numerical algorithms to compute these determinants efficiently for the high-dimensional systems that arise upon discretization. We also give an interpretation of this approach using Gaussian process covariances and transition tubes. An example model problem, for which we provide an open-source Python implementation, is used throughout the paper to illustrate all methods discussed. To study the performance of the methods, we consider examples of stochastic differential and stochastic partial differential equations, including the randomly forced incompressible three-dimensional Navier-Stokes equations.
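The Fredholm-determinant ingredient can be sketched with a plain Nystrom discretization. The rank-one test kernel below is chosen only because its determinant has a closed form, not because it arises from any particular SDE:

```python
import numpy as np

def fredholm_det(kernel, n=500):
    """Nystrom approximation of the Fredholm determinant det(I + K) for an
    integral operator with kernel k(s, t) on [0, 1]."""
    t = np.linspace(0.0, 1.0, n)
    h = t[1] - t[0]
    K = kernel(t[:, None], t[None, :]) * h   # quadrature-weighted kernel matrix
    return np.linalg.det(np.eye(n) + K)

# Rank-one test kernel k(s, t) = s*t, with closed form det(I + K) = 1 + 1/3.
print(fredholm_det(lambda s, t: s * t), 1 + 1 / 3)
```

In the Laplace-method estimate, one over the square root of such a determinant, evaluated for the second-variation operator around the large-deviation minimizer, supplies the Gaussian prefactor multiplying the exponential scaling.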
Superposed orders of quantum channels have already been proved, both theoretically and experimentally, to enable unparalleled opportunities in the quantum communication domain. As a matter of fact, superposition of orders can be exploited within the quantum computing domain as well, by relaxing the traditional assumption that quantum computations apply gates in a well-defined causal order. In this context, we address a fundamental question arising in quantum computing: can superposed orders of single-qubit gates enable universal quantum computation? As shown in this paper, the answer to this key question is a definitive "yes". Indeed, we prove that any two-qubit controlled quantum gate can be deterministically realized, including the so-called Barenco gate, which alone enables universal quantum computation.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
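As a point of reference for the methods surveyed, here is a minimal sketch of uniform affine quantization, the basic building block behind most low-precision schemes; the 4-bit width and min/max calibration are illustrative choices:

```python
import numpy as np

def quantize(x, num_bits=4):
    """Uniform affine quantization: map float values onto 2**num_bits integer
    levels, returning the integers plus the (scale, zero_point) to invert."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)      # min/max calibration
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.default_rng(0).standard_normal(6).astype(np.float32)
q, s, z = quantize(w, num_bits=4)
print(w)
print(dequantize(q, s, z))   # round-trip error is bounded by scale / 2
```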