Yurinskii's coupling is a popular theoretical tool for non-asymptotic distributional analysis in mathematical statistics and applied probability, offering a Gaussian strong approximation with an explicit error bound under easily verified conditions. Originally stated in $\ell^2$-norm for sums of independent random vectors, it has recently been extended both to the $\ell^p$-norm, for $1 \leq p \leq \infty$, and to vector-valued martingales in $\ell^2$-norm, under some strong conditions. We present as our main result a Yurinskii coupling for approximate martingales in $\ell^p$-norm, under substantially weaker conditions than those previously imposed. Our formulation further allows for the coupling variable to follow a more general Gaussian mixture distribution, and we provide a novel third-order coupling method which gives tighter approximations in certain settings. We specialize our main result to mixingales, martingales, and independent data, and derive uniform Gaussian mixture strong approximations for martingale empirical processes. Applications to nonparametric partitioning-based and local polynomial regression procedures are provided.
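For orientation, one classical $\ell^2$-norm formulation of Yurinskii's coupling for independent sums reads, schematically, as follows; this is only a sketch, and constants and logarithmic factors vary across statements in the literature.

\[
S = \sum_{i=1}^{n} X_i, \qquad X_i \in \mathbb{R}^d \ \text{independent}, \quad \mathbb{E} X_i = 0, \qquad \beta = \sum_{i=1}^{n} \mathbb{E}\|X_i\|_2^3 < \infty,
\]
\[
\text{for each } \delta > 0 \ \text{there exists } T \sim \mathcal{N}\big(0, \operatorname{Var}(S)\big) \ \text{with} \quad \mathbb{P}\big(\|S - T\|_2 > 3\delta\big) \lesssim B\big(1 + |\log(1/B)|\big), \qquad B = \beta\, d\, \delta^{-3}.
\]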
Approximation theory is one of the most important areas of numerical analysis and has evolved over the years through many different approaches. Some of the most popular approximation methods include the Lebesgue approximation theorem, the Weierstrass approximation theorem, and the Fourier approximation theorem. The limitations of each approximation method are too important to ignore, and thus the nature of a specific dataset may require a specific approximation method for such estimates. In this report, we examine Chebyshev polynomial interpolation in detail as an alternative approach to reconstructing signals and compare the reconstruction to that obtained with Fourier polynomials. We also explore the advantages and limitations of Chebyshev polynomials and discuss in detail their mathematical formulation and their equivalence to the cosine function over a given interval [a, b].
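To make the construction concrete, below is a minimal sketch of Chebyshev interpolation on an interval [a, b] using NumPy's numpy.polynomial.chebyshev module (an illustration written for this summary, not the report's own code); the identity T_n(cos θ) = cos(nθ) behind the cosine equivalence is noted in the comments.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_interp(f, n, a, b):
    """Degree n-1 Chebyshev interpolant of f on [a, b], built at the
    Chebyshev nodes cos((2k+1)*pi/(2n)) mapped onto [a, b]."""
    t = np.cos((2 * np.arange(n) + 1) * np.pi / (2 * n))   # nodes in [-1, 1]
    x = 0.5 * (a + b) + 0.5 * (b - a) * t                  # nodes in [a, b]
    coeffs = C.chebfit(t, f(x), n - 1)                     # exact fit at the nodes
    return lambda s: C.chebval(2 * (s - a) / (b - a) - 1, coeffs)

# T_n(cos(theta)) = cos(n*theta): on [-1, 1] a Chebyshev expansion is a cosine
# series in theta = arccos(t), which is the "equivalence to the cosine
# function" over [a, b] after the affine change of variables above.
recon = cheb_interp(np.sin, 33, 0.0, 2 * np.pi)            # reconstructed signal
```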
Despite the considerable success of Bregman proximal-type algorithms, such as mirror descent, in machine learning, a critical question remains: Can existing stationarity measures, often based on Bregman divergence, reliably distinguish between stationary and non-stationary points? In this paper, we present a groundbreaking finding: All existing stationarity measures necessarily imply the existence of spurious stationary points. We further establish an algorithm-independent hardness result: Bregman proximal-type algorithms are unable to escape from a spurious stationary point in finitely many steps when the initial point is unfavorable, even for convex problems. Our hardness result highlights the inherent distinction between Euclidean and Bregman geometries, and poses fundamental theoretical and numerical challenges to both the machine learning and optimization communities.
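As an illustration of the algorithm family in question (a generic sketch, not the paper's construction), the following implements entropic mirror descent on the probability simplex, whose mirror map induces the KL divergence as the Bregman divergence; the final comment notes the boundary behaviour that motivates the concern about spurious stationary points.

```python
import numpy as np

def entropic_mirror_descent(grad_f, x0, eta=0.1, steps=100):
    """Mirror descent on the probability simplex with the negative-entropy
    mirror map; the induced Bregman divergence is the KL divergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-eta * grad_f(x))   # multiplicative (exponentiated-gradient) update
        x /= x.sum()                       # Bregman projection back onto the simplex
    return x

# Note: any coordinate of x0 that is exactly 0 remains 0 at every iteration,
# regardless of the gradient, which is the kind of boundary behaviour that
# Bregman-divergence-based stationarity measures can fail to distinguish
# from genuine stationarity.
```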
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem, e.g., low-rank approximation or regression. In the learning-based sketching paradigm proposed by~\cite{indyk2019learning}, the sketch matrix is found by choosing a random sparse matrix, e.g., CountSketch, and then updating the values of its non-zero entries by running gradient descent on a training data set. Despite the growing body of work on this paradigm, a noticeable omission is that in previous algorithms the locations of the non-zero entries were fixed and only their values were learned. In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries. Our first algorithm is greedy; however, its drawback is a slow training time. We address this issue and propose approaches for learning a sketching matrix for both low-rank approximation and Hessian approximation for second-order optimization. The latter is helpful for a range of constrained optimization problems, such as LASSO and matrix estimation with a nuclear norm constraint. Both approaches achieve good accuracy with a fast running time. Moreover, our experiments suggest that our algorithms can still reduce the error significantly even when only a very limited number of training matrices is available.
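For reference, here is a minimal sketch of the classical CountSketch construction that such learned sketches start from (an illustration, not the paper's code): each column has a single non-zero entry whose value is a random sign; in the learning-based setting the values, and in this work also the locations, become trainable.

```python
import numpy as np

def countsketch(n, m, rng=None):
    """Dense m x n CountSketch matrix: one non-zero per column, at a random
    row (the "location"), with a random +/-1 value."""
    rng = np.random.default_rng(0) if rng is None else rng
    S = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)         # locations of the non-zero entries
    signs = rng.choice([-1.0, 1.0], size=n)   # values of the non-zero entries
    S[rows, np.arange(n)] = signs
    return S

A = np.random.randn(1000, 50)
SA = countsketch(1000, 100) @ A               # compressed data for downstream solves
```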
Within the statistical literature, a significant gap exists in methods capable of modeling asymmetric multivariate spatial effects that elucidate the relationships underlying complex spatial phenomena. For such phenomena, observations at any location are expected to arise from a combination of within- and between-location effects, where the latter exhibit asymmetry. This asymmetry arises from heterogeneous spatial effects between locations belonging to different categories, where the category is a feature inherent to each location in the data; based on these feature labels, asymmetric spatial relations are postulated between neighbouring locations with different labels. Our novel approach combines the principles of multivariate spatial autoregressive models with those of the Gaussian graphical model. This combination enables us to address the gap effectively by accommodating asymmetric spatial relations, overcoming the usual symmetry constraints of spatial analyses. Model performance is assessed in a simulation study using a Bayesian estimation framework. To illustrate the proposed methodology, we apply the model to intercropping data, where spatial effects between different crops are unlikely to be symmetric. An R package containing the proposed methodology is available at https://CRAN.R-project.org/package=SAGM.
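As a purely illustrative sketch (not the proposed model), the following simulates a spatial autoregressive process in which the strength of the effect from one location to another depends on the (crop-type) labels of the two locations, so that spatial effects between locations with different labels are asymmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
labels = rng.integers(0, 2, size=n)                    # two hypothetical crop types
adjacency = (rng.random((n, n)) < 0.2).astype(float)   # random neighbourhood structure
np.fill_diagonal(adjacency, 0.0)

# rho[i, j]: strength of the effect of a label-j neighbour on a label-i location;
# rho[0, 1] != rho[1, 0] encodes the asymmetry between the two categories.
rho = np.array([[0.05, 0.15],
                [0.02, 0.05]])
W = adjacency * rho[np.ix_(labels, labels)]            # asymmetric spatial weights

# Simultaneous autoregressive simulation: Y = (I - W)^{-1} E
Y = np.linalg.solve(np.eye(n) - W, rng.normal(size=n))
```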
We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Attention becomes expressible as a composition of log-sums of exponentials that is linearizable, with a latent space of constant size, enabling sequential application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
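A minimal non-causal sketch of the modified similarity (scaling factors omitted): because the softmax of log(exp(q) · exp(k)) yields weights proportional to exp(q) · exp(k), the output can be computed from a constant-size summary of the keys and values, which is what enables the constant per-token cost in the sequential setting.

```python
import numpy as np

def lse_attention(Q, K, V):
    """Attention with similarity log(exp(q) . exp(k)) (scaling omitted).
    Softmax over this similarity gives weights proportional to exp(q) . exp(k),
    so the whole layer reduces to two fixed-size summaries of (K, V)."""
    phi_q = np.exp(Q)                      # (T, d) feature map of queries
    phi_k = np.exp(K)                      # (T, d) feature map of keys
    kv = phi_k.T @ V                       # (d, d_v) constant-size state
    z = phi_k.sum(axis=0)                  # (d,)    normalizer state
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = 0.1 * rng.normal(size=(3, 8, 16))   # toy sequence of length 8
out = lse_attention(Q, K, V)                  # (8, 16)
```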
This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of the fine-tuned models surpasses that of GPT-3.5 and rivals that of GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may rival that of GPT models if appropriately structured derivation dependencies are incorporated during training, and they highlight a shared weakness between BERT and GPT: a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.
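To illustrate the kind of symbolic perturbations involved (a hypothetical example using SymPy, written for this summary rather than taken from the released codebase), one can construct a derivation step and then alter its variable surface form or its symmetry:

```python
import sympy as sp

x, y, alpha = sp.symbols('x y alpha')
f = sp.Function('f')

# A toy derivation step: differentiate the premise with respect to x.
premise = sp.Eq(f(x), sp.sin(x) * sp.exp(y))
step = sp.Eq(sp.Symbol('f_x'), sp.diff(premise.rhs, x))

# "Variable surface form" perturbation: rename x -> alpha throughout the step.
renamed = step.subs(x, alpha)

# "Symmetry" perturbation: swap the two sides of the equation.
swapped = sp.Eq(step.rhs, step.lhs)
```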
Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\textit{partial}$ identification in linear models given a set of $\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art.
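For intuition, in the scalar linear model Y = beta*X + gamma*Z + U with Cov(Z, U) = 0 and leakage |gamma| <= tau, the identity Cov(Z, Y) = beta*Cov(Z, X) + gamma*Var(Z) yields simple plug-in bounds on beta; the sketch below computes them and is only an illustration of the idea, not the paper's convex-programming estimator.

```python
import numpy as np

def leaky_iv_bounds(z, x, y, tau):
    """Plug-in bounds on beta in Y = beta*X + gamma*Z + U, assuming a single
    instrument and treatment, Cov(Z, U) = 0, and |gamma| <= tau."""
    czy = np.cov(z, y)[0, 1]
    czx = np.cov(z, x)[0, 1]
    vz = np.var(z, ddof=1)
    ends = [(czy - tau * vz) / czx, (czy + tau * vz) / czx]
    return min(ends), max(ends)

# tau = 0 recovers the classical single-instrument (Wald) IV point estimate
# Cov(Z, Y) / Cov(Z, X).
```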
One of the current challenges in physically-based simulations, and fluid simulations in particular, is to produce visually appealing results at interactive rates that can be used in multiple forms of media. In recent years, considerable effort has been directed at this goal through the use of multi-core architectures, as many of the computations involved in these simulation algorithms are well suited to such architectures. Although there is a considerable body of work on acceleration techniques in this field, there is still room to further explore and analyze some of them. To investigate this problem, we surveyed the topic of fluid simulation and some recent contributions to the field. Additionally, we implemented two versions of a fluid simulation algorithm, one on the CPU and one on the GPU using NVIDIA's CUDA framework, with the intent of better understanding the effort required to port these simulations to a multi-core architecture and the performance gains obtained by doing so.
We propose a supplement matrix method for computing the eigenvalues of a dual Hermitian matrix and discuss its application in multi-agent formation control. Suppose we have a ring, which can be the real field, the complex field, or the quaternion ring. We study dual number symmetric matrices, dual complex Hermitian matrices, and dual quaternion Hermitian matrices in a unified framework of dual Hermitian matrices. An $n \times n$ dual Hermitian matrix has $n$ dual number eigenvalues. We define the determinant, the characteristic polynomial, and supplement matrices for a dual Hermitian matrix. Supplement matrices are Hermitian matrices in the original ring. The standard parts of the eigenvalues of a dual Hermitian matrix are the eigenvalues of its standard-part Hermitian matrix in the original ring, while the dual parts of its eigenvalues are the eigenvalues of the supplement matrices. Hence, by applying any practical method for computing the eigenvalues of Hermitian matrices in the original ring, we obtain a practical method for computing the eigenvalues of a dual Hermitian matrix. We call this method the supplement matrix method. In multi-agent formation control, a desired relative configuration scheme may be given, and one needs to know whether this scheme is reasonable, i.e., whether a feasible configuration of the agents exists. By exploring the eigenvalue problem of dual Hermitian matrices and its link with unit gain graph theory, we open a cross-disciplinary approach to the relative configuration problem. Numerical experiments are reported.
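A minimal sketch for the dual complex Hermitian case (an illustration of the construction described above, not the authors' implementation): take the eigendecomposition of the standard part, and for each eigenspace compute the eigenvalues of the compression of the dual part onto that eigenspace.

```python
import numpy as np

def dual_hermitian_eigs(A_std, A_dual, tol=1e-8):
    """Eigenvalues of the dual complex Hermitian matrix A_std + A_dual * eps,
    returned as (standard part, dual part) pairs. A_std and A_dual must be
    Hermitian matrices of the same size."""
    lam, U = np.linalg.eigh(A_std)            # standard parts of the eigenvalues
    out, i = [], 0
    while i < len(lam):
        j = i + 1
        while j < len(lam) and abs(lam[j] - lam[i]) < tol:
            j += 1                            # group numerically repeated eigenvalues
        V = U[:, i:j]                         # orthonormal basis of the eigenspace
        dual_parts = np.linalg.eigvalsh(V.conj().T @ A_dual @ V)   # supplement-type matrix
        out.extend((lam[i], d) for d in dual_parts)
        i = j
    return out
```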
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
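A minimal PyTorch-style sketch of the OE training objective for classification (following the paper's formulation up to constants; lam is a tuning weight): standard cross-entropy on in-distribution data plus a term pushing predictions on auxiliary outliers toward the uniform distribution.

```python
import torch.nn.functional as F

def oe_loss(logits_in, labels_in, logits_out, lam=0.5):
    # Standard cross-entropy on the in-distribution batch.
    ce = F.cross_entropy(logits_in, labels_in)
    # Cross-entropy between softmax(logits_out) and the uniform distribution
    # over the K classes, encouraging low-confidence predictions on outliers.
    log_p_out = F.log_softmax(logits_out, dim=1)
    uniform_ce = -log_p_out.mean(dim=1).mean()
    return ce + lam * uniform_ce
```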