We consider power means of independent and identically distributed (i.i.d.) non-integrable random variables. The power mean is a homogeneous quasi-arithmetic mean, and under some conditions, several limit theorems hold for the power mean as well as for the arithmetic mean of i.i.d. integrable random variables. We establish integrabilities and a limit theorem for the variances of the power mean of i.i.d. non-integrable random variables. We also consider behaviors of the power mean when the parameter of the power varies. Our feature is that the generator of the power mean is allowed to be complex-valued, which enables us to consider the power mean of random variables supported on the whole set of real numbers. The complex-valued power mean is an unbiased strongly-consistent estimator for the joint of the location and scale parameters of the Cauchy distribution.
Recently there is a large amount of work devoted to the study of Markov chain stochastic gradient methods (MC-SGMs) which mainly focus on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs for both minimization and minimax problems through the lens of algorithmic stability in the framework of statistical learning theory. For empirical risk minimization (ERM) problems, we establish the optimal excess population risk bounds for both smooth and non-smooth cases by introducing on-average argument stability. For minimax problems, we develop a quantitative connection between on-average argument stability and generalization error which extends the existing results for uniform stability \cite{lei2021stability}. We further develop the first nearly optimal convergence rates for convex-concave problems both in expectation and with high probability, which, combined with our stability results, show that the optimal generalization bounds can be attained for both smooth and non-smooth cases. To the best of our knowledge, this is the first generalization analysis of SGMs when the gradients are sampled from a Markov process.
Positive and unlabelled learning is an important problem which arises naturally in many applications. The significant limitation of almost all existing methods lies in assuming that the propensity score function is constant (SCAR assumption), which is unrealistic in many practical situations. Avoiding this assumption, we consider parametric approach to the problem of joint estimation of posterior probability and propensity score functions. We show that under mild assumptions when both functions have the same parametric form (e.g. logistic with different parameters) the corresponding parameters are identifiable. Motivated by this, we propose two approaches to their estimation: joint maximum likelihood method and the second approach based on alternating maximization of two Fisher consistent expressions. Our experimental results show that the proposed methods are comparable or better than the existing methods based on Expectation-Maximisation scheme.
The edges of the characteristic imset polytope, $\operatorname{CIM}_p$, were recently shown to have strong connections to causal discovery as many algorithms could be interpreted as greedy restricted edge-walks, even though only a strict subset of the edges are known. To better understand the general edge structure of the polytope we describe the edge structure of faces with a clear combinatorial interpretation: for any undirected graph $G$ we have the face $\operatorname{CIM}_G$, the convex hull of the characteristic imsets of DAGs with skeleton $G$. We give a full edge-description of $\operatorname{CIM}_G$ when $G$ is a tree, leading to interesting connections to other polytopes. In particular the well-studied stable set polytope can be recovered as a face of $\operatorname{CIM}_G$ when $G$ is a tree. Building on this connection we are also able to give a description of all edges of $\operatorname{CIM}_G$ when $G$ is a cycle, suggesting possible inroads for generalization. We then introduce an algorithm for learning directed trees from data, utilizing our newly discovered edges, that outperforms classical methods on simulated Gaussian data.
We introduce a two-person non-zero-sum random-turn game that is a variant of the stake-governed games introduced recently in [HP2022]. We call the new game the Trail of Lost Pennies. At time zero, a counter is placed at a given integer location: $X_0 = k \in \mathbb{Z}$, say. At the $i$-th turn (for $i \in \mathbb{N}_+$), Maxine and Mina place non-negative stakes, $a_i$ and $b_i$, for which each pays from her own savings. Maxine is declared to be the turn victor with probability $\tfrac{a_i}{a_i+b_i}$; otherwise, Mina is. If Maxine wins the turn, she will move the counter one place to the right, so that $X_i = X_{i-1} +1$; if Mina does so, the counter will move one place to the left, so that $X_i = X_{i-1} -1$. If $\liminf X_i = \infty$, then Maxine wins the game; if $\limsup X_i = -\infty$, then Mina does. (A special rule is needed to treat the remaining, indeterminate, case.) When Maxine wins, she receives a terminal payment of $m_\infty$, while Mina receives $n_\infty$. If Mina wins, these respective receipts are $m_{-\infty}$ and $n_{-\infty}$. The four terminal payment values are supposed to be real numbers that satisfy $m_\infty > m_{-\infty}$ and $n_\infty < n_{-\infty}$, where these bounds accord with the notion that Maxine wins when the counter ends far to the right, and that Mina does so when it reaches far to the left. Each player is motivated to offer stakes at each turn of the game, in order to secure the higher terminal payment that will arise from her victory; but since these stake amounts accumulate to act as a cost depleting the profit arising from victory, each player must also seek to control these expenses. In this article, we study the Trail of Lost Pennies, formulating strategies for the two players and defining and analysing Nash equilibria in the game.
We consider the stochastic gradient descent (SGD) algorithm driven by a general stochastic sequence, including i.i.d noise and random walk on an arbitrary graph, among others; and analyze it in the asymptotic sense. Specifically, we employ the notion of `efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers, for SGD algorithms in the form of Loewner ordering of covariance matrices associated with the scaled iterate errors in the long term. Using this ordering, we show that input sequences that are more efficient for MCMC sampling also lead to smaller covariance of the errors for SGD algorithms in the limit. This also suggests that an arbitrarily weighted MSE of SGD iterates in the limit becomes smaller when driven by more efficient chains. Our finding is of particular interest in applications such as decentralized optimization and swarm learning, where SGD is implemented in a random walk fashion on the underlying communication graph for cost issues and/or data privacy. We demonstrate how certain non-Markovian processes, for which typical mixing-time based non-asymptotic bounds are intractable, can outperform their Markovian counterparts in the sense of efficiency ordering for SGD. We show the utility of our method by applying it to gradient descent with shuffling and mini-batch gradient descent, reaffirming key results from existing literature under a unified framework. Empirically, we also observe efficiency ordering for variants of SGD such as accelerated SGD and Adam, open up the possibility of extending our notion of efficiency ordering to a broader family of stochastic optimization algorithms.
This thesis investigates the quality of randomly collected data by employing a framework built on information-based complexity, a field related to the numerical analysis of abstract problems. The quality or power of gathered information is measured by its radius which is the uniform error obtainable by the best possible algorithm using it. The main aim is to present progress towards understanding the power of random information for approximation and integration problems.
Neural Architecture Search (NAS) has fostered the automatic discovery of neural architectures, which achieve state-of-the-art accuracy in image recognition. Despite the progress achieved with NAS, so far there is little attention to theoretical guarantees on NAS. In this work, we study the generalization properties of NAS under a unifying framework enabling (deep) layer skip connection search and activation function search. To this end, we derive the lower (and upper) bounds of the minimum eigenvalue of Neural Tangent Kernel under the (in)finite width regime from a search space including mixed activation functions, fully connected, and residual neural networks. Our analysis is non-trivial due to the coupling of various architectures and activation functions under the unifying framework. Then, we leverage the eigenvalue bounds to establish generalization error bounds of NAS in the stochastic gradient descent training. Importantly, we theoretically and experimentally show how the derived results can guide NAS to select the top-performing architectures, even in the case without training, leading to a training-free algorithm based on our theory. Accordingly, our numerical validation shed light on the design of computationally efficient methods for NAS.
Variational Bayesian posterior inference often requires simplifying approximations such as mean-field parametrisation to ensure tractability. However, prior work has associated the variational mean-field approximation for Bayesian neural networks with underfitting in the case of small datasets or large model sizes. In this work, we show that invariances in the likelihood function of over-parametrised models contribute to this phenomenon because these invariances complicate the structure of the posterior by introducing discrete and/or continuous modes which cannot be well approximated by Gaussian mean-field distributions. In particular, we show that the mean-field approximation has an additional gap in the evidence lower bound compared to a purpose-built posterior that takes into account the known invariances. Importantly, this invariance gap is not constant; it vanishes as the approximation reverts to the prior. We proceed by first considering translation invariances in a linear model with a single data point in detail. We show that, while the true posterior can be constructed from a mean-field parametrisation, this is achieved only if the objective function takes into account the invariance gap. Then, we transfer our analysis of the linear model to neural networks. Our analysis provides a framework for future work to explore solutions to the invariance problem.
Classic machine learning methods are built on the $i.i.d.$ assumption that training and testing data are independent and identically distributed. However, in real scenarios, the $i.i.d.$ assumption can hardly be satisfied, rendering the sharp drop of classic machine learning algorithms' performances under distributional shifts, which indicates the significance of investigating the Out-of-Distribution generalization problem. Out-of-Distribution (OOD) generalization problem addresses the challenging setting where the testing distribution is unknown and different from the training. This paper serves as the first effort to systematically and comprehensively discuss the OOD generalization problem, from the definition, methodology, evaluation to the implications and future directions. Firstly, we provide the formal definition of the OOD generalization problem. Secondly, existing methods are categorized into three parts based on their positions in the whole learning pipeline, namely unsupervised representation learning, supervised model learning and optimization, and typical methods for each category are discussed in detail. We then demonstrate the theoretical connections of different categories, and introduce the commonly used datasets and evaluation metrics. Finally, we summarize the whole literature and raise some future directions for OOD generalization problem. The summary of OOD generalization methods reviewed in this survey can be found at //out-of-distribution-generalization.com.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.