We study Stochastic Gradient Descent with AdaGrad stepsizes: a popular adaptive (self-tuning) method for first-order stochastic optimization. Despite being well studied, existing analyses of this method suffer from various shortcomings: they either assume some knowledge of the problem parameters, impose strong global Lipschitz conditions, or fail to give bounds that hold with high probability. We provide a comprehensive analysis of this basic method without any of these limitations, in both the convex and non-convex (smooth) cases, that additionally supports a general ``affine variance'' noise model and provides sharp rates of convergence in both the low-noise and high-noise~regimes.
We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen learning agent in order to induce desirable outcomes. This is relevant to many real-world settings like auctions or taxation, where the principal may not know the learning behavior nor the rewards of real people. Moreover, the principal should be few-shot adaptable and minimize the number of interventions, because interventions are often costly. We introduce MERMAIDE, a model-based meta-learning framework to train a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $K=1$-shot settings with partial agent information.
In this paper we propose an estimator of spot covariance matrix which ensure symmetric positive semi-definite estimations. The proposed estimator relies on a suitable modification of the Fourier covariance estimator in Malliavin and Mancino (2009) and it is consistent for suitable choices of the weighting kernel. The accuracy and the ability of the estimator to produce positive semi-definite covariance matrices is evaluated with an extensive numerical study, in comparison with the competitors present in the literature. The results of the simulation study are confirmed under many scenarios, that consider the dimensionality of the problem, the asynchronicity of data and the presence of several specification of market microstructure noise.
In bandit algorithms, the randomly time-varying adaptive experimental design makes it difficult to apply traditional limit theorems to off-policy evaluation of the treatment effect. Moreover, the normal approximation by the central limit theorem becomes unsatisfactory for lack of information due to the small sample size of the inferior arm. To resolve this issue, we introduce a backwards asymptotic expansion method and prove the validity of this scheme based on the partial mixing, that was originally introduced for the expansion of the distribution of a functional of a jump-diffusion process in a random environment. The theory is generalized in this paper to incorporate the backward propagation of random functions in the bandit algorithm. Besides the analytical validation, the simulation studies also support the new method. Our formulation is general and applicable to nonlinearly parametrized differentiable statistical models having an adaptive design.
Cochran's $Q$ statistic is routinely used for testing heterogeneity in meta-analysis. Its expected value (under an incorrect null distribution) is part of several popular estimators of the between-study variance, $\tau^2$. Those applications generally do not account for the studies' use of estimated variances in the inverse-variance weights that define $Q$ (more explicitly, $Q_{IV}$). Importantly, those weights make approximating the distribution of $Q_{IV}$ rather complicated. As an alternative, we are investigating a $Q$ statistic, $Q_F$, whose constant weights use only the studies' arm-level sample sizes. For log-odds-ratio, log-relative-risk, and risk difference as the measure of effect, these simulations study approximations to the distributions of $Q_F$ and $Q_{IV}$, as the basis for tests of heterogeneity. We present the results in 132 Figures, 153 pages in total.
We investigate the high-dimensional linear regression problem in situations where there is noise correlated with Gaussian covariates. In regression models, the phenomenon of the correlated noise is called endogeneity, which is due to unobserved variables and others, and has been a major problem setting in causal inference and econometrics. When the covariates are high-dimensional, it has been common to assume sparsity on the true parameters and estimate them using regularization, even with the endogeneity. However, when sparsity does not hold, it has not been well understood to control the endogeneity and high dimensionality simultaneously. In this paper, we demonstrate that an estimator without regularization can achieve consistency, i.e., benign overfitting, under certain assumptions on the covariance matrix. Specifically, we show that the error of this estimator converges to zero when covariance matrices of the correlated noise and instrumental variables satisfy a condition on their eigenvalues. We consider several extensions to relax these conditions and conduct experiments to support our theoretical findings. As a technical contribution, we utilize the convex Gaussian minimax theorem (CGMT) in our dual problem and extend the CGMT itself.
We introduce a novel continual learning method based on multifidelity deep neural networks. This method learns the correlation between the output of previously trained models and the desired output of the model on the current training dataset, limiting catastrophic forgetting. On its own the multifidelity continual learning method shows robust results that limit forgetting across several datasets. Additionally, we show that the multifidelity method can be combined with existing continual learning methods, including replay and memory aware synapses, to further limit catastrophic forgetting. The proposed continual learning method is especially suited for physical problems where the data satisfy the same physical laws on each domain, or for physics-informed neural networks, because in these cases we expect there to be a strong correlation between the output of the previous model and the model on the current training domain.
Backward Stochastic Differential Equations (BSDEs) have been widely employed in various areas of social and natural sciences, such as the pricing and hedging of financial derivatives, stochastic optimal control problems, optimal stopping problems and gene expression. Most BSDEs cannot be solved analytically and thus numerical methods must be applied to approximate their solutions. There have been a variety of numerical methods proposed over the past few decades as well as many more currently being developed. For the most part, they exist in a complex and scattered manner with each requiring a variety of assumptions and conditions. The aim of the present work is thus to systematically survey various numerical methods for BSDEs, and in particular, compare and categorize them, for further developments and improvements. To achieve this goal, we focus primarily on the core features of each method based on an extensive collection of 333 references: the main assumptions, the numerical algorithm itself, key convergence properties and advantages and disadvantages, to provide an up-to-date coverage of numerical methods for BSDEs, with insightful summaries of each and a useful comparison and categorization.
Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) has become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage that necessities mitigating both types of variance to obtain faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails a better generalization compared to the existing methods.
In this monograph, I introduce the basic concepts of Online Learning through a modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiation of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are dealt through convex surrogate losses and through randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. These notes do not require prior knowledge of convex analysis and all the required mathematical tools are rigorously explained. Moreover, all the proofs have been carefully chosen to be as simple and as short as possible.