An empirical measure that results from the nearest neighbors to a given point, the nearest neighbor measure, is introduced and studied as a central statistical quantity. First, the resulting empirical process is shown to satisfy a uniform central limit theorem under a (local) bracketing entropy condition on the underlying class of functions (reflecting the localizing nature of the nearest neighbor algorithm). Second, a uniform non-asymptotic bound is established under a well-known condition, often referred to as Vapnik-Chervonenkis, on the uniform entropy numbers.
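As a minimal illustration of the object studied here, the sketch below forms a nearest neighbor measure for a one-dimensional sample by placing mass 1/k on each of the k sample points closest to a query point x, and integrates a test function against it. The uniform 1/k weighting, the absolute-value distance, and the names `knn_indices` and `knn_measure_integral` are assumptions made for illustration; the paper's definition may differ in detail.

```python
def knn_indices(sample, x, k):
    """Indices of the k sample points closest to x (ties broken by index)."""
    order = sorted(range(len(sample)), key=lambda i: (abs(sample[i] - x), i))
    return order[:k]


def knn_measure_integral(sample, x, k, g):
    """Integrate g against the measure with mass 1/k on each of the k nearest neighbors of x."""
    idx = knn_indices(sample, x, k)
    return sum(g(sample[i]) for i in idx) / k


# Hypothetical sample and query point x = 0.
sample = [-2.0, -0.5, 0.25, 1.0, 3.0]
# Local mean: average of the 2 nearest sample points.
local_mean = knn_measure_integral(sample, 0.0, 2, lambda t: t)
# Local mass of the positive half-line among the 3 nearest sample points.
local_mass = knn_measure_integral(sample, 0.0, 3, lambda t: 1.0 if t > 0 else 0.0)
```

Varying the test function `g` over a class of functions gives the kind of function-indexed process to which the uniform results apply.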
Iterative hard thresholding (IHT) has gained in popularity over the past decades in large-scale optimization. However, convergence properties of this method have only been explored recently in non-convex settings. In matrix completion, existing works often focus on the guarantee of global convergence of IHT via standard assumptions such as incoherence property and uniform sampling. While such analysis provides a global upper bound on the linear convergence rate, it does not describe the actual performance of IHT in practice. In this paper, we provide a novel insight into the local convergence of a specific variant of IHT for matrix completion. We uncover the exact linear rate of IHT in a closed-form expression and identify the region of convergence in which the algorithm is guaranteed to converge. Furthermore, we utilize random matrix theory to study the linear rate of convergence of IHTSVD for large-scale matrix completion. We find that asymptotically, the rate can be expressed in closed form in terms of the relative rank and the sampling rate. Finally, we present various numerical results to verify the aforementioned theoretical analysis.
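The basic IHT iteration for matrix completion can be sketched as follows: fill the observed entries with their known values, keep the current estimate elsewhere (a gradient step with unit step size), then project back onto low-rank matrices. This is a generic sketch under stated assumptions, not the specific variant analyzed in the paper: the rank is fixed to 1, the projection uses power iteration instead of a full SVD, the start is the zero matrix, and the names `best_rank1` and `iht_complete` are hypothetical.

```python
import math


def best_rank1(A, iters=200):
    """Best rank-1 approximation of A by power iteration (assumes a gap between the top two singular values)."""
    m, n = len(A), len(A[0])
    zeros = [[0.0] * n for _ in range(m)]
    v = [1.0 / math.sqrt(n)] * n
    u, sigma = [0.0] * m, 0.0
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(t * t for t in u))
        if norm_u == 0.0:
            return zeros
        u = [t / norm_u for t in u]
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(t * t for t in w))
        if sigma == 0.0:
            return zeros
        v = [t / sigma for t in w]
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]


def iht_complete(M, mask, iters=200):
    """Rank-1 IHT: X <- P_1(P_Omega(M) + P_Omega^c(X)), starting from X = 0."""
    m, n = len(M), len(M[0])
    X = [[0.0] * n for _ in range(m)]
    for _ in range(iters):
        # Unit-step gradient update: observed entries come from M, the rest from X.
        Y = [[M[i][j] if mask[i][j] else X[i][j] for j in range(n)] for i in range(m)]
        # Hard thresholding: keep only the leading singular component.
        X = best_rank1(Y)
    return X


# Rank-1 ground truth with the bottom-right entry unobserved (its stored value is ignored).
M = [[1.0, 1.0], [1.0, 0.0]]
mask = [[True, True], [True, False]]
X = iht_complete(M, mask)
```

On this toy problem the only rank-1 matrix consistent with the three observed entries is the all-ones matrix, so `X[1][1]` should approach 1.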
The goal of this paper is to investigate a control theoretic analysis of linear stochastic iterative algorithms and temporal difference (TD) learning. TD-learning, one of the most popular and fundamental reinforcement learning algorithms, is a linear stochastic iterative algorithm that estimates the value function of a given policy for a Markov decision process. While there has been a series of successful works on the theoretical analysis of TD-learning, it was not until recently that researchers found some guarantees on its statistical efficiency. In this paper, we propose a control theoretic finite-time analysis of TD-learning, which exploits standard notions from the linear system control community. The proposed work therefore provides additional insights into TD-learning and reinforcement learning using simple concepts and analysis tools from control theory.
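To make the "linear stochastic iterative algorithm" view concrete, the sketch below shows one tabular TD(0) step and its expected (mean) dynamics, which form a deterministic discrete-time linear system in the value estimate. It is only an illustration under stated assumptions, not the paper's analysis: the two-state chain, the rewards, the unit step size, and the names `td0_update` and `mean_dynamics_step` are all hypothetical.

```python
def td0_update(theta, s, r, s_next, gamma, alpha):
    """One tabular TD(0) step: theta[s] += alpha * (r + gamma * theta[s_next] - theta[s])."""
    new = list(theta)
    new[s] += alpha * (r + gamma * theta[s_next] - theta[s])
    return new


def mean_dynamics_step(theta, P, rewards, d, gamma, alpha):
    """Expected TD(0) update under state distribution d: theta + alpha * D (r + gamma * P theta - theta)."""
    n = len(theta)
    out = []
    for s in range(n):
        expected_next = sum(P[s][t] * theta[t] for t in range(n))
        out.append(theta[s] + alpha * d[s] * (rewards[s] + gamma * expected_next - theta[s]))
    return out


# Hypothetical two-state Markov chain induced by a fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]
rewards = [1.0, 0.0]
d = [2.0 / 3.0, 1.0 / 3.0]  # stationary distribution of P
gamma = 0.9

# Iterate the mean dynamics; the fixed point solves theta = r + gamma * P theta.
theta = [0.0, 0.0]
for _ in range(2000):
    theta = mean_dynamics_step(theta, P, rewards, d, gamma, alpha=1.0)
```

The stochastic iterate from `td0_update` equals this linear recursion plus a noise term, which is why stability tools for linear systems are natural here.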
We study continuity of the roots of nonmonic polynomials as a function of their coefficients using only the most elementary results from an introductory course in real analysis and the theory of single variable polynomials. Our approach gives both qualitative and quantitative results in the case that the degree of the unperturbed polynomial can change under a perturbation of its coefficients, a case that naturally occurs, for instance, in stability theory of polynomials, singular perturbation theory, or in the perturbation theory for generalized eigenvalue problems. An application of our results in multivariate stability theory is provided which is important in, for example, the study of hyperbolic polynomials or realizability and synthesis problems in passive electrical network theory, and will be of general interest to mathematicians as well as physicists and engineers.
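A small numerical illustration of the degree-drop phenomenon, assuming the hypothetical family p(x) = eps*x^2 + x - 1 (not an example taken from the paper): as the leading coefficient eps tends to zero, the polynomial degenerates to x - 1, one root stays bounded and one escapes to infinity. The helper name `quadratic_roots` is an assumption for illustration.

```python
import math


def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c (a != 0, b != 0, discriminant >= 0), computed without cancellation."""
    disc = b * b - 4.0 * a * c
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    # q / a is the large-magnitude root, c / q the small-magnitude root.
    return q / a, c / q


# Shrink the leading coefficient and watch the two roots separate.
for eps in (1e-1, 1e-3, 1e-6):
    escaping, finite = quadratic_roots(eps, 1.0, -1.0)
    print(eps, finite, escaping)
```

The bounded root tends to 1, the root of the limiting degree-one polynomial, while the other behaves like -1/eps.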
The paper concerns convergence and asymptotic statistics for stochastic approximation driven by Markovian noise: $$ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) \,,\quad n\ge 0, $$ in which each $\theta_n\in\Re^d$, $\{\Phi_n\}$ is a Markov chain on a general state space $\text{X}$ with stationary distribution $\pi$, and $f:\Re^d\times \text{X} \to\Re^d$. In addition to standard Lipschitz bounds on $f$, and conditions on the vanishing step-size sequence $\{\alpha_n\}$, it is assumed that the associated ODE, $\frac{d}{dt}\vartheta_t=\bar f(\vartheta_t)$ with $\bar f(\theta)=E[f(\theta,\Phi)]$ and $\Phi\sim\pi$, is globally asymptotically stable with stationary point denoted $\theta^*$. Moreover, the ODE@$\infty$ defined with respect to the vector field, $$ \bar f_\infty(\theta):= \lim_{r\to\infty} r^{-1} \bar f(r\theta) \,,\qquad \theta\in\Re^d, $$ is asymptotically stable. The main contributions are summarized as follows: (i) The sequence $\{\theta_n\}$ is convergent if $\Phi$ is geometrically ergodic, and subject to compatible bounds on $f$. The remaining results are established under a stronger assumption on the Markov chain: a slightly weaker version of the Donsker-Varadhan Lyapunov drift condition known as (DV3). (ii) A Lyapunov function is constructed for the joint process $\{\theta_n,\Phi_n\}$ that implies convergence of $\{ \theta_n\}$ in $L_4$. (iii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error $z_n:= (\theta_n-\theta^*)/\sqrt{\alpha_n}$. Moment bounds combined with the CLT imply convergence of the normalized covariance, $$ \lim_{n \to \infty} E [ z_n z_n^T ] = \Sigma_\theta, $$ where $\Sigma_\theta$ is the asymptotic covariance appearing in the CLT. (iv) An example is provided where the Markov chain $\Phi$ is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment is unbounded.
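As a simple worked instance of the ODE@$\infty$ vector field (an illustrative special case, not one taken from the paper), suppose the mean field is affine, $\bar f(\theta)=A\theta+b$ for a hypothetical matrix $A\in\Re^{d\times d}$ and vector $b\in\Re^d$. Then $$ \bar f_\infty(\theta)=\lim_{r\to\infty} r^{-1}\big(rA\theta+b\big)=A\theta \,,\qquad \theta\in\Re^d, $$ so the constant term is scaled away, and the ODE@$\infty$ is asymptotically stable exactly when $A$ is Hurwitz (all eigenvalues have strictly negative real part).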
Highly oscillatory integrals of composite type arise in electronic engineering, and their calculation is a challenging problem. In this paper, we propose two Gaussian quadrature rules for computing such integrals. The first one is constructed based on the classical theory of orthogonal polynomials, and its nodes and weights can be computed efficiently by using tools of numerical linear algebra. We show that the rate of convergence of this rule depends solely on the regularity of the non-oscillatory part of the integrand. The second one is constructed with respect to a sign-changing function, so the classical theory of Gaussian quadrature can no longer be used. We explore theoretical properties of this Gaussian quadrature, including the trajectories of the quadrature nodes and the convergence rate of these nodes to the endpoints of the integration interval, and prove its asymptotic error estimate under suitable hypotheses. Numerical experiments are presented to demonstrate the performance of the proposed methods.
The aim of noisy phase retrieval is to estimate a signal $\mathbf{x}_0\in \mathbb{C}^d$ from $m$ noisy intensity measurements $b_j=\left\lvert \langle \mathbf{a}_j,\mathbf{x}_0 \rangle \right\rvert^2+\eta_j, \; j=1,\ldots,m$, where $\mathbf{a}_j \in \mathbb{C}^d$ are known measurement vectors and $\eta=(\eta_1,\ldots,\eta_m)^\top \in \mathbb{R}^m$ is a noise vector. A commonly used model for estimating $\mathbf{x}_0$ is the intensity-based model $\widehat{\mathbf{x}}:=\mbox{argmin}_{\mathbf{x} \in \mathbb{C}^d} \sum_{j=1}^m \big(\left\lvert \langle \mathbf{a}_j,\mathbf{x} \rangle \right\rvert^2-b_j \big)^2$. Although many efficient algorithms have been developed to solve the intensity-based model, there are very few results about its estimation performance. In this paper, we focus on the estimation performance of the intensity-based model and prove that the error bound satisfies $\min_{\theta\in \mathbb{R}}\|\widehat{\mathbf{x}}-e^{i\theta}\mathbf{x}_0\|_2 \lesssim \min\Big\{\frac{\sqrt{\|\eta\|_2}}{{m}^{1/4}}, \frac{\|\eta\|_2}{\| \mathbf{x}_0\|_2 \cdot \sqrt{m}}\Big\}$ under the assumption of $m \gtrsim d$ and $\mathbf{a}_j, j=1,\ldots,m,$ being Gaussian random vectors. We also show that the error bound is sharp. For the case where $\mathbf{x}_0$ is an $s$-sparse signal, we present a similar result under the assumption of $m \gtrsim s \log (ed/s)$. To the best of our knowledge, our results are the first theoretical guarantees for the intensity-based model and its sparse version. Our proofs employ Mendelson's small ball method, which can deliver an effective lower bound on a nonnegative empirical process.
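The intensity-based objective and the phase-invariant error used in the bound can be written down directly. The sketch below assumes the convention $\langle \mathbf{a},\mathbf{x}\rangle=\sum_k \overline{a_k}x_k$, uses small hand-picked (not Gaussian) measurement vectors and zero noise, and evaluates the minimum over $\theta$ in closed form; the names `inner`, `intensity_loss` and `phase_distance` are hypothetical, and no solver for the argmin is attempted.

```python
import cmath
import math


def inner(a, x):
    """Inner product <a, x> = sum_k conj(a_k) * x_k (an assumed convention)."""
    return sum(ak.conjugate() * xk for ak, xk in zip(a, x))


def intensity_loss(x, A, b):
    """Intensity-based objective: sum_j (|<a_j, x>|^2 - b_j)^2."""
    return sum((abs(inner(a, x)) ** 2 - bj) ** 2 for a, bj in zip(A, b))


def phase_distance(x_hat, x0):
    """min over theta of ||x_hat - e^{i theta} x0||_2, attained at theta = arg <x0, x_hat>."""
    sq = (sum(abs(v) ** 2 for v in x_hat) + sum(abs(v) ** 2 for v in x0)
          - 2.0 * abs(inner(x0, x_hat)))
    return math.sqrt(max(sq, 0.0))


# Hypothetical signal and measurement vectors; noise-free intensities (eta = 0).
x0 = [1 + 1j, 0.5 - 2j]
A = [[1 + 0j, 0 + 1j], [2 - 1j, 1 + 1j], [0.5 + 0.5j, -1 + 0j]]
b = [abs(inner(a, x0)) ** 2 for a in A]

# A global phase rotation changes neither the intensities nor the error metric.
rotated = [cmath.exp(0.7j) * v for v in x0]
loss_at_rotation = intensity_loss(rotated, A, b)
error_at_rotation = phase_distance(rotated, x0)
```

Both quantities vanish up to rounding, which is why the error bound is stated modulo the global phase $e^{i\theta}$.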
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space. It is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
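A Boltzmann policy selects each action with probability proportional to the exponential of its value divided by the temperature. The sketch below shows only that final step, for hypothetical K-values in a single state; it does not compute K-values or the reward bonus, and the name `boltzmann_policy` is an assumption for illustration.

```python
import math


def boltzmann_policy(values, temperature):
    """Probabilities proportional to exp(value / temperature), with a max-shift for numerical stability."""
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical K-values for three actions in one state.
k_values = [1.0, 2.0, 0.5]
cold = boltzmann_policy(k_values, temperature=0.1)   # nearly greedy
warm = boltzmann_policy(k_values, temperature=10.0)  # nearly uniform
```

A small temperature concentrates the policy on the highest-valued action, while a large one spreads probability across actions.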
Large margin nearest neighbor (LMNN) is a metric learner which optimizes the performance of the popular $k$NN classifier. However, its resulting metric relies on pre-selected target neighbors. In this paper, we address the feasibility of LMNN's optimization constraints regarding these target points, and introduce a mathematical measure to evaluate the size of the feasible region of the optimization problem. We enhance the optimization framework of LMNN with a weighting scheme that favors data triplets yielding a larger feasible region. This increases the chances of obtaining a good metric as the solution of LMNN's problem. We evaluate the performance of the resulting feasibility-based LMNN algorithm using synthetic and real datasets. The empirical results show improved accuracy on different types of datasets in comparison to regular LMNN.
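For orientation, a single LMNN triplet constraint asks that a differently labeled point (an impostor) lie at least a unit margin farther from a point than its target neighbor, measured in the learned squared Mahalanobis distance. The sketch below evaluates the hinge violation of one such triplet in its standard form; it does not include the feasibility measure or the weighting scheme proposed in the paper, and the names `mahalanobis_sq` and `triplet_hinge` and the example points are hypothetical.

```python
def mahalanobis_sq(M, x, y):
    """Squared distance (x - y)^T M (x - y) for a symmetric positive semidefinite matrix M."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))


def triplet_hinge(M, x, target, impostor, margin=1.0):
    """Violation of the constraint d_M(x, impostor) >= d_M(x, target) + margin (0 if satisfied)."""
    return max(0.0, margin + mahalanobis_sq(M, x, target) - mahalanobis_sq(M, x, impostor))


# Hypothetical triplet: target shares the label of x, impostor does not.
x, target, impostor = [0.0, 0.0], [1.0, 0.0], [0.0, 1.2]
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[0.25, 0.0], [0.0, 4.0]]  # shrinks axis 0, stretches axis 1

violation_euclidean = triplet_hinge(identity, x, target, impostor)
violation_learned = triplet_hinge(stretched, x, target, impostor)
```

Under the Euclidean metric this triplet violates the margin, whereas the stretched metric satisfies it, showing how the choice of metric determines which triplet constraints are feasible.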
In this paper, we study the frequentist convergence rate for the Latent Dirichlet Allocation (Blei et al., 2003) topic model. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in the Wasserstein distance at a rate of $n^{-1/4}$ without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretical perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.
Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.