We can define the error distribution as the limiting distribution of the error between the solution $Y$ of a given stochastic differential equation (SDE) and its numerical approximation $\hat{Y}^{(m)}$, weighted by the convergence rate between the two. A goal when studying the error distribution is to provide a way of determination for error distributions for any SDE and numerical scheme that converge to the exact solution. By dividing the error into a main term and a remainder term in a particular way, the author shows that the remainder term can be negligible compared to the main term under certain suitable conditions. Under these conditions, deriving the error distribution reduces to deriving the limiting distribution of the main term. Even if the dimension is one, there are unsolved problems about the asymptotic behavior of the error when the SDE has a drift term and $0<H\leq 1/3$, but our result in the one-dimensional case can be adapted to any Hurst exponent. The main idea of the proof is to define a stochastic process $Y^{m, \rho}$ with the parameter $\rho$ interpolating between $Y$ and $\hat{Y}^{(m)}$ and to estimate the asymptotic expansion for it. Using this estimate, we determine the error distribution of the ($k$)-Milstein scheme and of the Crank-Nicholson scheme in unsolved cases.
We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function with moderate bias under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves an error that is within a constant factor of the optimal error of the best ReLU function with moderate bias. We also provide finite sample guarantees, and these techniques generalize to a broader class of marginal distributions beyond Gaussians.
We show through numerical simulation that the Quantum Approximate Optimization Algorithm (QAOA) for higher-order, random-coefficient, heavy-hex compatible spin glass Ising models has strong parameter concentration across problem sizes from $16$ up to $127$ qubits for $p=1$ up to $p=5$, which allows for straight-forward transfer learning of QAOA angles on instance sizes where exhaustive grid-search is prohibitive even for $p>1$. We use Matrix Product State (MPS) simulation at different bond dimensions to obtain confidence in these results, and we obtain the optimal solutions to these combinatorial optimization problems using CPLEX. In order to assess the ability of current noisy quantum hardware to exploit such parameter concentration, we execute short-depth QAOA circuits (with a CNOT depth of 6 per $p$, resulting in circuits which contain $1420$ two qubit gates for $127$ qubit $p=5$ QAOA) on $100$ higher-order (cubic term) Ising models on IBM quantum superconducting processors with $16, 27, 127$ qubits using QAOA angles learned from a single $16$-qubit instance. We show that (i) the best quantum processors generally find lower energy solutions up to $p=3$ for 27 qubit systems and up to $p=2$ for 127 qubit systems and are overcome by noise at higher values of $p$, (ii) the best quantum processors find mean energies that are about a factor of two off from the noise-free numerical simulation results. Additional insights from our experiments are that large performance differences exist among different quantum processors even of the same generation and that dynamical decoupling significantly improve performance for some, but decrease performance for other quantum processors. Lastly we show $p=1$ QAOA angle mean energy landscapes computed using up to a $414$ qubit quantum computer, showing that the mean QAOA energy landscapes remain very similar as the problem size changes.
Using statistical learning methods to analyze stochastic simulation outputs can significantly enhance decision-making by uncovering relationships between different simulated systems and between a system's inputs and outputs. We focus on clustering multivariate empirical distributions of simulation outputs to identify patterns and trade-offs among performance measures. We present a novel agglomerative clustering algorithm that utilizes the regularized Wasserstein distance to cluster these multivariate empirical distributions. This framework has several important use cases, including anomaly detection, pre-optimization, and online monitoring. In numerical experiments involving a call-center model, we demonstrate how this methodology can identify staffing plans that yield similar performance outcomes and inform policies for intervening when queue lengths signal potentially worsening system performance.
Computing the approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream of elements $x_1, x_2, \dots, x_n$ and a query $x$, a relative-error quantile estimation algorithm can estimate the rank of $x$ with respect to the stream, up to a multiplicative $\pm \epsilon \cdot \mathrm{rank}(x)$ error. Notably, this requires the sketch to obtain more precise estimates for the ranks of elements on the tails of the distribution, as compared to the additive $\pm \epsilon n$ error regime. Previously, the best-known algorithms for relative error achieved space $\tilde O(\epsilon^{-1}\log^{1.5}(\epsilon n))$ (Cormode, Karnin, Liberty, Thaler, Vesel{\`y}, 2021) and $\tilde O(\epsilon^{-2}\log(\epsilon n))$ (Zhang, Lin, Xu, Korn, Wang, 2006). In this work, we present a nearly-optimal streaming algorithm for the relative-error quantile estimation problem using $\tilde O(\epsilon^{-1}\log(\epsilon n))$ space, which almost matches the trivial $\Omega(\epsilon^{-1} \log (\epsilon n))$ lower bound. To surpass the $\Omega(\epsilon^{-1}\log^{1.5}(\epsilon n))$ barrier of the previous approach, our algorithm crucially relies on a new data structure, called an elastic compactor, which can be dynamically resized over the course of the stream. Interestingly, we design a space allocation scheme which adaptively allocates space to each compactor based on the "hardness" of the input stream. This approach allows us to avoid using the maximal space simultaneously for every compactor and facilitates the improvement in the total space complexity. Along the way, we also propose and study a new problem called the Top Quantiles Problem, which only requires the sketch to provide estimates for a fixed-length tail of the distribution. This problem serves as an important subproblem in our algorithm, though it is also an interesting problem of its own right.
It is the purpose of this paper to investigate the issue of estimating the regularity index $\beta>0$ of a discrete heavy-tailed r.v. $S$, \textit{i.e.} a r.v. $S$ valued in $\mathbb{N}^*$ such that $\mathbb{P}(S>n)=L(n)\cdot n^{-\beta}$ for all $n\geq 1$, where $L:\mathbb{R}^*_+\to \mathbb{R}_+$ is a slowly varying function. As a first go, we consider the situation where inference is based on independent copies $S_1,\; \ldots,\; S_n$ of the generic variable $S$. Just like the popular Hill estimator in the continuous heavy-tail situation, the estimator $\widehat{\beta}$ we propose can be derived by means of a suitable reformulation of the regularly varying condition, replacing $S$'s survivor function by its empirical counterpart. Under mild assumptions, a non-asymptotic bound for the deviation between $\widehat{\beta}$ and $\beta$ is established, as well as limit results (consistency and asymptotic normality). Beyond the i.i.d. case, the inference method proposed is extended to the estimation of the regularity index of a regenerative $\beta$-null recurrent Markov chain. Since the parameter $\beta$ can be then viewed as the tail index of the (regularly varying) distribution of the return time of the chain $X$ to any (pseudo-) regenerative set, in this case, the estimator is constructed from the successive regeneration times. Because the durations between consecutive regeneration times are asymptotically independent, we can prove that the consistency of the estimator promoted is preserved. In addition to the theoretical analysis carried out, simulation results provide empirical evidence of the relevance of the inference technique proposed.
Federated Learning (FL) is a decentralized machine-learning paradigm, in which a global server iteratively averages the model parameters of local users without accessing their data. User heterogeneity has imposed significant challenges to FL, which can incur drifted global models that are slow to converge. Knowledge Distillation has recently emerged to tackle this issue, by refining the server model using aggregated knowledge from heterogeneous users, other than directly averaging their model parameters. This approach, however, depends on a proxy dataset, making it impractical unless such a prerequisite is satisfied. Moreover, the ensemble knowledge is not fully utilized to guide local model learning, which may in turn affect the quality of the aggregated model. Inspired by the prior art, we propose a data-free knowledge distillation} approach to address heterogeneous FL, where the server learns a lightweight generator to ensemble user information in a data-free manner, which is then broadcasted to users, regulating local training using the learned knowledge as an inductive bias. Empirical studies powered by theoretical implications show that, our approach facilitates FL with better generalization performance using fewer communication rounds, compared with the state-of-the-art.
Approaches based on deep neural networks have achieved striking performance when testing data and training data share similar distribution, but can significantly fail otherwise. Therefore, eliminating the impact of distribution shifts between training and testing data is crucial for building performance-promising deep models. Conventional methods assume either the known heterogeneity of training data (e.g. domain labels) or the approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels. Extensive experiments clearly demonstrate the effectiveness of our method on multiple distribution generalization benchmarks compared with state-of-the-art counterparts. Through extensive experiments on distribution generalization benchmarks including PACS, VLCS, MNIST-M, and NICO, we show the effectiveness of our method compared with state-of-the-art counterparts.
Domain shift is a fundamental problem in visual recognition which typically arises when the source and target data follow different distributions. The existing domain adaptation approaches which tackle this problem work in the closed-set setting with the assumption that the source and the target data share exactly the same classes of objects. In this paper, we tackle a more realistic problem of open-set domain shift where the target data contains additional classes that are not present in the source data. More specifically, we introduce an end-to-end Progressive Graph Learning (PGL) framework where a graph neural network with episodic training is integrated to suppress underlying conditional shift and adversarial learning is adopted to close the gap between the source and target distributions. Compared to the existing open-set adaptation approaches, our approach guarantees to achieve a tighter upper bound of the target error. Extensive experiments on three standard open-set benchmarks evidence that our approach significantly outperforms the state-of-the-arts in open-set domain adaptation.
Multi-relation Question Answering is a challenging task, due to the requirement of elaborated analysis on questions and reasoning over multiple fact triples in knowledge base. In this paper, we present a novel model called Interpretable Reasoning Network that employs an interpretable, hop-by-hop reasoning process for question answering. The model dynamically decides which part of an input question should be analyzed at each hop; predicts a relation that corresponds to the current parsed results; utilizes the predicted relation to update the question representation and the state of the reasoning process; and then drives the next-hop reasoning. Experiments show that our model yields state-of-the-art results on two datasets. More interestingly, the model can offer traceable and observable intermediate predictions for reasoning analysis and failure diagnosis, thereby allowing manual manipulation in predicting the final answer.
Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc, and 2) the instance-level shift, such as object appearance, size, etc. We build our approach based on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on image level and instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.