We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Although this method is widely utilized in practice, e.g., in the training of neural networks, its convergence behavior is only understood in a few limited settings. In this paper, under the well-known Kurdyka-Łojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, which depends on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$, with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
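For reference, the iteration under study can be sketched in a few lines. The step-size schedule $\alpha_t \propto (t+1)^{-\gamma}$ and the constants below are illustrative choices, not the paper's exact conditions.

```python
import numpy as np

def random_reshuffling(grad_i, x0, n, epochs, step0=0.1, gamma=0.75, seed=0):
    """Random reshuffling for f(x) = (1/n) * sum_i f_i(x).

    grad_i(i, x) returns the gradient of the i-th component f_i at x.
    A diminishing step size step0 / (epoch + 1)**gamma is used within
    each epoch (an illustrative schedule only).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(epochs):
        alpha = step0 / (t + 1) ** gamma      # diminishing step size
        perm = rng.permutation(n)             # fresh shuffle each epoch
        for i in perm:                        # one full pass over the components
            x = x - alpha * grad_i(i, x)
    return x

# Toy example: f_i(x) = 0.5 * ||x - a_i||^2, so the minimizer is mean(a_i).
a = np.random.default_rng(1).normal(size=(50, 3))
x_final = random_reshuffling(lambda i, x: x - a[i], np.zeros(3), n=50, epochs=200)
print(np.linalg.norm(x_final - a.mean(axis=0)))  # small residual
```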
Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio test, and parallel analysis) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous, i.e., where each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares data singular values to those of "empirical null" data generated by flipping the sign of each entry randomly with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do this, we rely on recent results in random matrix theory, such as dimension-free operator norm bounds [Latala et al., 2018, Inventiones Mathematicae] and large deviations for the top eigenvalues of nonhomogeneous matrices [Husson, 2020]. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
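The core of the procedure can be sketched directly from the description above. The acceptance rule below (keep singular values exceeding a high quantile of the sign-flipped null) is an illustrative choice, not necessarily the exact thresholding used in the paper.

```python
import numpy as np

def signflip_pa(X, n_null=20, quantile=0.99, seed=0):
    """Signflip parallel analysis: estimate the number of factors in X.

    Data singular values are compared against those of "empirical null"
    matrices obtained by flipping the sign of each entry of X
    independently with probability 1/2.
    """
    rng = np.random.default_rng(seed)
    data_sv = np.linalg.svd(X, compute_uv=False)
    null_top = []
    for _ in range(n_null):
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_top.append(np.linalg.svd(signs * X, compute_uv=False)[0])
    threshold = np.quantile(null_top, quantile)
    return int(np.sum(data_sv > threshold))

# Rank-2 signal plus heterogeneous noise (each entry has its own scale).
rng = np.random.default_rng(1)
n, p, k = 300, 200, 2
signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) * 0.5
noise = rng.normal(size=(n, p)) * rng.uniform(0.2, 2.0, size=(n, p))
print(signflip_pa(signal + noise))  # typically recovers 2
```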
The organiser of the UEFA Champions League, one of the most prestigious football tournaments in the world, faces a non-trivial mechanism design problem each autumn: how to randomly choose a perfect matching in a balanced bipartite graph. For the sake of credibility and transparency, the Round of 16 draw consists of a sequence of discrete uniform choices from two urns whose compositions are dynamically updated with computer assistance. Even though the adopted mechanism is not uniformly distributed over all valid assignments, it resembles the fairest possible lottery according to a recent result. We challenge this finding by analysing the effect of reversing the order of the urns. The optimal draw procedure is shown to depend primarily on the lexicographic order of the degree sequences of the two sets of teams. An example is provided where exchanging the urns reduces unfairness by one-third on average and almost halves the worst bias across all pairs of teams. Nonetheless, the current policy of starting the draw with the runners-up remains the best option if the draw order must be determined before the national associations of the clubs are known.
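A generic sketch of this kind of sequential urn draw is given below. The constraint structure in the example (each runner-up may not meet the winner with the same index) is a hypothetical simplification; the real draw also excludes opponents from the same national association, and it draws the runners-up from a physical urn rather than uniformly in code.

```python
import random

def _can_complete(remaining, taken, allowed):
    """Backtracking check: can every remaining runner-up still be paired
    with a distinct admissible group winner?"""
    if not remaining:
        return True
    r, rest = remaining[0], remaining[1:]
    for w in allowed[r] - taken:
        if _can_complete(rest, taken | {w}, allowed):
            return True
    return False

def draw(runners_up, allowed, rng=random):
    """Sequential draw: pick a runner-up uniformly from the first urn, then
    pick its opponent uniformly among the winners that keep the draw
    completable (the role of the 'computer assistance')."""
    pairing, remaining, taken = {}, list(runners_up), set()
    while remaining:
        r = rng.choice(remaining)
        remaining.remove(r)
        admissible = [w for w in allowed[r] - taken
                      if _can_complete(remaining, taken | {w}, allowed)]
        w = rng.choice(admissible)
        pairing[r], taken = w, taken | {w}
    return pairing

# Hypothetical 4-team example: runner-up i may not meet winner i (same group).
allowed = {i: {j for j in range(4) if j != i} for i in range(4)}
print(draw(list(range(4)), allowed))
```

Repeating the draw many times and tabulating pair frequencies is the natural way to see that the induced distribution over valid assignments is not uniform.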
We study the probabilistic sampling of a random variable, in which the variable is sampled only if it falls outside a given set, called the silence set. This helps us understand optimal event-based sampling for the special case of IID random processes, and also the design of sub-optimal schemes for other cases. We consider the design of this probabilistic sampling for a scalar, log-concave random variable, to minimize either the mean square estimation error or the mean absolute estimation error. We show that the optimal silence interval: (i) is essentially unique, and (ii) is the limit of an iterative procedure of centering. Further, we show through numerical experiments that super-level intervals appear to be remarkably near-optimal for mean square estimation. Finally, we use the Gauss inequality for scalar unimodal densities to show that probabilistic sampling yields a mean square distortion that is less than a third of the distortion incurred by periodic sampling, provided the average sampling rate is between 0.3 and 0.9 samples per tick.
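The sketch below illustrates the setup numerically for a Gaussian density, under two assumptions that are ours rather than the paper's: the estimator reports the conditional mean over the silence interval when no sample arrives, and "centering" means moving the interval's centre to that conditional mean.

```python
import numpy as np
from scipy import integrate, stats

pdf = stats.norm(loc=0.7, scale=1.0).pdf  # an illustrative density

def conditional_mean(a, b):
    mass, _ = integrate.quad(pdf, a, b)
    mean, _ = integrate.quad(lambda x: x * pdf(x), a, b)
    return mean / mass

def mse_distortion(a, b):
    """Mean square error when X is reported exactly outside [a, b] and
    estimated by its conditional mean when it stays silent inside [a, b]."""
    m = conditional_mean(a, b)
    err, _ = integrate.quad(lambda x: (x - m) ** 2 * pdf(x), a, b)
    return err

def center_interval(width, c0=0.0, iters=50):
    """Illustrative centering iteration: move the interval's centre to the
    conditional mean of the density restricted to the current interval."""
    c = c0
    for _ in range(iters):
        c = conditional_mean(c - width / 2, c + width / 2)
    return c

c = center_interval(width=1.5)
print(c, mse_distortion(c - 0.75, c + 0.75))
```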
Suppose that a random variable $X$ of interest is observed. This paper concerns "the least favorable noise" $\hat{Y}_{\epsilon}$, which maximizes the prediction error $E\,[X - E[X|X+Y]]^2$ (or, equivalently, minimizes the variance of $E[X|X+Y]$) over the class of $Y$ independent of $X$ with $\mathrm{var}\,Y \leq \epsilon^2$. This problem was first studied by Ernst, Kagan, and Rogers [3]. In the present manuscript, we show that the least favorable noise $\hat{Y}_{\epsilon}$ must exist and that its variance must be $\epsilon^2$. The proof of existence relies on a convergence result we develop for variances of conditional expectations. Further, we show that the function $\epsilon \mapsto \inf_{\mathrm{var}\,Y \leq \epsilon^2} \mathrm{var}\,E[X|X+Y]$ is both strictly decreasing and right continuous.
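For jointly Gaussian inputs the objective has a closed form, which gives a quick numerical feel for the quantity being minimized; Gaussian noise is only one admissible choice of $Y$ (so this is an upper bound on the infimum, not the least favorable noise itself).

```python
import numpy as np

def var_cond_exp_gaussian(sigma_x, sigma_y):
    """var E[X | X + Y] for independent X ~ N(0, sigma_x^2), Y ~ N(0, sigma_y^2);
    here E[X | X + Y] = sigma_x^2 / (sigma_x^2 + sigma_y^2) * (X + Y)."""
    return sigma_x ** 4 / (sigma_x ** 2 + sigma_y ** 2)

sigma_x = 1.0
for eps in [0.5, 1.0, 2.0, 4.0]:
    # Monte Carlo check of the closed form at variance budget eps^2.
    rng = np.random.default_rng(0)
    x = rng.normal(0, sigma_x, 1_000_000)
    y = rng.normal(0, eps, 1_000_000)
    cond_exp = sigma_x ** 2 / (sigma_x ** 2 + eps ** 2) * (x + y)
    print(eps, var_cond_exp_gaussian(sigma_x, eps), cond_exp.var())
```

The printed values decrease as $\epsilon$ grows, consistent with the monotonicity stated above.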
We provide a unified operational framework for the study of causality, non-locality and contextuality, in a fully device-independent and theory-independent setting. We define causaltopes, our chosen portmanteau of "causal polytopes", for arbitrary spaces of input histories and arbitrary choices of input contexts. We show that causaltopes are obtained by slicing simpler polytopes of conditional probability distributions with a set of causality equations, which we fully characterise. We provide efficient linear programs to compute the maximal component of an empirical model supported by any given sub-causaltope, as well as the associated causal fraction. We introduce a notion of causal separability relative to arbitrary causal constraints. We provide efficient linear programs to compute the maximal causally separable component of an empirical model, and hence its causally separable fraction, as the component jointly supported by certain sub-causaltopes. We study causal fractions and causal separability for several novel examples, including a selection of quantum switches with entangled or contextual control. In the process, we demonstrate the existence of "causal contextuality", a phenomenon where causal inseparability is clearly correlated to, or even directly implied by, non-locality and contextuality.
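The "maximal component supported by a polytope" computation has the same shape as the familiar noncontextual-fraction linear programs; a generic hedged sketch, assuming the sub-causaltope is given by its vertices (the columns of V), is shown below. The paper's own LPs operate on causaltopes defined by causality equations, which are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

def supported_fraction(e, V):
    """Largest lambda such that e = lambda * p + (1 - lambda) * q with p a
    convex combination of the columns of V and q an arbitrary nonnegative
    vector.  Equivalently: maximise sum(w) subject to V @ w <= e
    (componentwise), w >= 0, sum(w) <= 1."""
    n_vertices = V.shape[1]
    A_ub = np.vstack([V, np.ones((1, n_vertices))])
    b_ub = np.concatenate([e, [1.0]])
    res = linprog(c=-np.ones(n_vertices), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * n_vertices, method="highs")
    return -res.fun

# Tiny example: a distribution on 3 outcomes versus the polytope spanned by
# two deterministic vertices.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
e = np.array([0.4, 0.4, 0.2])
print(supported_fraction(e, V))  # 0.8
```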
(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications such as meta-learning, hyper-parameter optimization, and reinforcement learning. Most existing studies of this problem have focused only on analyzing convergence or improving convergence rates, while little effort has been devoted to understanding its generalization behavior. In this paper, we conduct a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization problem. We first establish a fundamental connection between algorithmic stability and generalization error in different forms and give a high-probability generalization bound which improves the previous best one from $\mathcal{O}(\sqrt{n})$ to $\mathcal{O}(\log n)$, where $n$ is the sample size. We then provide the first stability bounds for the general case where both inner- and outer-level parameters are subject to continuous updates, whereas existing work allows only the outer-level parameter to be updated. Our analysis can be applied in various standard settings such as strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC). Our analysis for the NC-NC setting can also be extended to a particular nonconvex-strongly-convex (NC-SC) setting that is commonly encountered in practice. Finally, we corroborate our theoretical analysis and demonstrate how the number of iterations affects the generalization error through experiments on meta-learning and hyper-parameter optimization.
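As a minimal sketch of the kind of first-order scheme in which both inner and outer parameters are continuously updated, consider a toy quadratic bilevel problem where the hypergradient can be formed exactly via implicit differentiation. This is an illustration of the algorithmic template only; the paper's algorithms, assumptions, and bounds are more general.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 4
H = np.eye(d_in) * 2.0                    # inner Hessian (strongly convex inner problem)
B = rng.normal(size=(d_in, d_out))
y_target = rng.normal(size=d_in)
mu = 0.1

def inner_grad(x, y):
    """Gradient in y of the inner objective g(x, y) = 0.5*y'Hy - y'Bx."""
    return H @ y - B @ x

def hypergrad(x, y):
    """Implicit-differentiation hypergradient of F(x) = f(x, y*(x)) with
    f(x, y) = 0.5*||y - y_target||^2 + 0.5*mu*||x||^2, using the current
    y as a proxy for the exact inner solution y*(x) = H^{-1} B x."""
    return mu * x + B.T @ np.linalg.solve(H, y - y_target)

x, y = np.zeros(d_out), np.zeros(d_in)
alpha, beta = 0.05, 0.2
for t in range(2000):
    y = y - beta * inner_grad(x, y)       # inner (lower-level) update
    x = x - alpha * hypergrad(x, y)       # outer (upper-level) update

# Residual between the tracked y and the exact inner solution at the final x.
print(np.linalg.norm(y - np.linalg.solve(H, B @ x)))
```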
Nonlinear control systems with partial information available to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
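A toy sketch of the setup: a one-dimensional regulator with a small nonlinear term, a policy of the same linear-plus-nonlinear form, and a zeroth-order (finite-difference) policy gradient on finite-horizon rollouts. The dynamics, cost weights, and gradient estimator here are illustrative assumptions, not the paper's model or algorithm.

```python
import numpy as np

# x_{t+1} = a*x + b*u + eps*tanh(x); policy u = -(k*x + eps*m*tanh(x)).
a, b, eps = 0.9, 1.0, 0.05
q, r, horizon = 1.0, 0.5, 60
x0s = np.linspace(-1.0, 1.0, 9)          # fixed initial conditions

def cost(theta):
    k, m = theta
    total = 0.0
    for x in x0s:
        for _ in range(horizon):
            u = -(k * x + eps * m * np.tanh(x))
            total += q * x ** 2 + r * u ** 2
            x = a * x + b * u + eps * np.tanh(x)
    return total / len(x0s)

def grad_fd(theta, h=1e-4):
    """Central finite-difference estimate of the policy gradient."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = h
        g[i] = (cost(theta + d) - cost(theta - d)) / (2 * h)
    return g

theta = np.array([0.0, 0.0])
for t in range(300):
    theta -= 0.01 * grad_fd(theta)       # policy gradient step
print(theta, cost(theta))
```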
Dependence is undoubtedly a central concept in statistics. Yet it proves difficult to locate in the literature a formal definition which goes beyond the self-evident 'dependence = non-independence'. This absence has allowed the term 'dependence' and its variants to be used vaguely and indiscriminately for qualifying a variety of disparate notions, leading to numerous incongruities. For example, the classical Pearson's, Spearman's, and Kendall's correlations are widely regarded as 'dependence measures' of major interest, despite returning 0 in some cases of deterministic relationships between the variables at play, thus evidently not measuring dependence at all. Arguing that research on such a fundamental topic would benefit from a slightly more rigid framework, this paper suggests a general definition of the dependence between two random variables defined on the same probability space. Natural enough to align with intuition, that definition is still sufficiently precise to allow unequivocal identification of a 'universal' representation of the dependence structure of any bivariate distribution. Links between this representation and familiar concepts are highlighted, and ultimately, the idea of a dependence measure based on that universal representation is explored and shown to satisfy Rényi's postulates.
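The claim about classical correlations vanishing under a deterministic relationship is easy to check numerically, e.g. with $Y = X^2$ for $X$ symmetric about zero; the population values of all three coefficients are exactly zero, while the sample versions below are only approximately zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2                              # Y is a deterministic function of X

r_pearson, _ = stats.pearsonr(x, y)     # population value 0: cov(X, X^2) = E[X^3] = 0
r_spearman, _ = stats.spearmanr(x, y)   # population value 0 by symmetry of X about 0
r_kendall, _ = stats.kendalltau(x, y)   # population value 0 by the same symmetry
print(r_pearson, r_spearman, r_kendall)  # all close to 0
```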
Autoregressive language models, which use deep learning to produce human-like texts, have become increasingly widespread. Such models power popular virtual assistants in areas like smart health, finance, and autonomous driving. While these large language models continue to improve, concerns persist that they might not work equally well for all subgroups in society. Despite growing discussions of AI fairness across disciplines, there is a lack of systematic metrics for assessing what equity means in dialogue systems and for engaging different populations in the assessment loop. Grounded in theories of deliberative democracy and science and technology studies, this paper proposes an analytical framework for unpacking the meaning of equity in human-AI dialogues. Using this framework, we conducted an auditing study to examine how GPT-3 responded to different sub-populations on crucial science and social topics: climate change and the Black Lives Matter (BLM) movement. Our corpus consists of over 20,000 rounds of dialogues between GPT-3 and 3,290 individuals who vary in gender, race and ethnicity, education level, English as a first language, and opinions toward the issues. We found a substantively worse user experience with GPT-3 among the opinion-minority and education-minority subpopulations; however, these two groups achieved the largest knowledge gains and attitude changes toward supporting BLM and climate change efforts after the chat. We traced these user experience divides to conversational differences and found that GPT-3 used more negative expressions when responding to the education- and opinion-minority groups than in its responses to the majority groups. We discuss the implications of our findings for a deliberative conversational AI system that centralizes diversity, equity, and inclusion.
The citation network of patents citing prior art arises from the legal obligation of patent applicants to properly disclose their inventions. One way to study the relationship between current patents and their antecedents is to analyze the similarity between the textual elements of patents. Many patent similarity indicators have shown a steady decrease since the mid-1970s. Although several explanations have been proposed, more comprehensive analyses of this phenomenon have been rare. In this paper, we use a computationally efficient measure of patent similarity that leverages state-of-the-art Natural Language Processing tools to investigate potential drivers of this apparent decrease in similarity. We do so by modeling patent similarity scores by means of generalized additive models. We find that non-linear model specifications can distinguish between distinct, temporally varying drivers of patent similarity levels and explain more variation in the data ($R^2\sim 18\%$) than previous methods. Moreover, the model reveals an underlying trend in similarity scores that is fundamentally different from the one reported previously.
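A minimal sketch of the modeling step, using the pygam package as one convenient option and purely synthetic data; the paper's covariates, data, and exact model specification are not reproduced here.

```python
import numpy as np
from pygam import LinearGAM, s

# Synthetic stand-in for patent-pair similarity scores with a smooth,
# non-linear dependence on filing year and on a second (log-scale) driver.
rng = np.random.default_rng(0)
n = 5000
year = rng.uniform(1975, 2020, n)
log_size = rng.normal(0, 1, n)
similarity = (0.4 - 0.002 * (year - 1975) + 0.05 * np.sin(log_size)
              + rng.normal(0, 0.05, n))

X = np.column_stack([year, log_size])
gam = LinearGAM(s(0) + s(1)).fit(X, similarity)  # one smooth term per driver
gam.summary()                                    # effective dof, pseudo-R^2, etc.
```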