A common misconception is that the Oracle eigenvalue estimator of the covariance matrix yields the best realized portfolio performance. In reality, the Oracle estimator simply modifies the eigenvalues of the empirical covariance matrix so as to minimize the Frobenius distance between the filtered and the realized covariance matrices. This leads to the best portfolios only when the in-sample eigenvectors coincide with the out-of-sample ones. In all other cases, the optimal eigenvalue correction can be obtained by solving a quadratic programming problem. Solving it shows that the Oracle estimator yields the best portfolios only in the limit of infinitely many data points per asset, and only in stationary systems.
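As a point of reference, the Frobenius-optimal (Oracle) eigenvalue correction has a closed form: keeping the in-sample eigenvectors $u_i$, the optimal eigenvalues are $u_i^T \Sigma_{\text{out}} u_i$. A minimal numpy sketch (variable names are illustrative, assuming the in-sample and realized covariance matrices are available):

```python
import numpy as np

# Minimal sketch: given the in-sample covariance Sigma_in and the realized
# covariance Sigma_out, the Oracle estimator keeps the in-sample eigenvectors
# u_i and replaces each eigenvalue by u_i^T Sigma_out u_i, which minimizes
# the Frobenius distance to Sigma_out over all such eigenvalue corrections.
def oracle_filtered_covariance(Sigma_in, Sigma_out):
    _, U = np.linalg.eigh(Sigma_in)                  # in-sample eigenvectors
    lam = np.einsum('ji,jk,ki->i', U, Sigma_out, U)  # Oracle eigenvalues
    return U @ np.diag(lam) @ U.T                    # filtered covariance
```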
The idea of using polynomial methods to improve simple smoother iterations within a multigrid method for a symmetric positive definite (SPD) system is revisited. When the single-step smoother itself corresponds to an SPD operator, there is, in particular, a very simple iteration, a close cousin of the Chebyshev semi-iterative method based on the Chebyshev polynomials of the fourth kind instead of the first, that optimizes a two-level bound going back to Hackbusch. A full V-cycle bound for general polynomial smoothers is derived using the V-cycle theory of McCormick. The fourth-kind Chebyshev iteration is quasi-optimal for the V-cycle bound. The optimal polynomials for the V-cycle bound can be found numerically, achieving a roughly 18% lower error-contraction-factor bound than the fourth-kind Chebyshev iteration, asymptotically as the number of smoothing steps goes to infinity. Implementation of the optimized iteration is discussed, and the performance of the polynomial smoothers is illustrated with a simple numerical example.
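For concreteness, a fourth-kind Chebyshev smoother can be written as a short recurrence. The sketch below assumes an SPD $A$, an SPD single-step smoother applied through `M_inv` (e.g., damped Jacobi), and an upper bound `rho` on $\lambda_{\max}(M^{-1}A)$; the coefficients follow the standard fourth-kind recurrence and should be checked against the paper rather than read as its exact algorithm:

```python
import numpy as np

# Sketch of a fourth-kind-Chebyshev polynomial smoother. Assumptions: A is
# SPD, M_inv applies the inverse of the SPD single-step smoother, and
# rho >= lambda_max(M^{-1} A). Coefficients are the standard fourth-kind
# recurrence, stated here as an assumption rather than quoted from the paper.
def cheb4_smooth(A, M_inv, b, x, m, rho):
    r = b - A @ x
    d = (4.0 / (3.0 * rho)) * M_inv(r)
    x = x + d
    for k in range(2, m + 1):
        r = r - A @ d
        d = ((2 * k - 3) / (2 * k + 1)) * d \
            + ((8 * k - 4) / ((2 * k + 1) * rho)) * M_inv(r)
        x = x + d
    return x
```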
Consider the problem of nonparametric estimation of an unknown $\beta$-H\"older smooth density $p_{XY}$ at a given point, where $X$ and $Y$ are both $d$-dimensional. An infinite sequence of i.i.d.\ samples $(X_i,Y_i)$ is generated according to this distribution, and two terminals, Alice and Bob, observe $(X_i)$ and $(Y_i)$, respectively. They are allowed to exchange $k$ bits, either one-way or interactively, in order for Bob to estimate the unknown density. We show that the minimax mean squared risk is of order $\left(\frac{k}{\log k} \right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. The logarithmic improvement is absent in the parametric counterparts of this problem, and can therefore be regarded as a consequence of its nonparametric nature. Moreover, a few rounds of interaction suffice to achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., the inverse tetration) of $k$. The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables.
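To make the Bernoulli sub-problem concrete, here is a naive one-way baseline (purely illustrative, not the paper's multi-round scheme): Alice forwards her first $k$ raw bits, and Bob combines them with his own observations to form the empirical joint frequency:

```python
import numpy as np

# Naive one-way baseline (illustration only, not the paper's scheme):
# Alice sends her first k raw bits; Bob estimates the joint pmf of the
# biased Bernoulli pair (X, Y) from the k aligned samples he also observes.
rng = np.random.default_rng(0)
k = 10_000
p11, p10, p01 = 0.15, 0.05, 0.10     # assumed joint pmf entries of (X, Y)
u = rng.random(k)
x = (u < p11 + p10).astype(int)      # Alice's bits
y = ((u < p11) | ((u >= p11 + p10) & (u < p11 + p10 + p01))).astype(int)
p11_hat = np.mean(x * y)             # Bob's estimate of P(X=1, Y=1)
print(p11_hat)
```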
We revisit convergence rates for maximum likelihood estimation (MLE) under finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing weights. However, the Wasserstein metric is only able to capture the worst-case convergence rate among the remaining fitted mixture components. We demonstrate that when the log-likelihood function is penalized to discourage vanishing mixing weights, stronger loss functions can be derived to resolve this shortcoming of the Wasserstein distance. These new loss functions accurately capture the heterogeneity in convergence rates of fitted mixture components, and we use them to sharpen existing pointwise and uniform convergence rates in various classes of mixture models. In particular, these results imply that a subset of the components of the penalized MLE typically converges significantly faster than could have been anticipated from past work. We further show that some of these conclusions extend to the traditional MLE. Our theoretical findings are supported by a simulation study illustrating these improved convergence rates.
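As one concrete instance of weight penalization, a Dirichlet-type penalty $\lambda \sum_j \log \pi_j$ added to the log-likelihood keeps the M-step weights bounded away from zero. A minimal EM sketch for a univariate Gaussian mixture follows; the penalty choice here is an assumption for illustration, and the paper's penalty may differ:

```python
import numpy as np

# Penalized EM sketch (assumption: Dirichlet-type penalty lam * sum_j log pi_j,
# one standard way to discourage vanishing mixing weights). Univariate
# Gaussian mixture with known number of components K.
def penalized_em(x, K, lam=1.0, iters=200):
    n = x.size
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    sig = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities via log-sum-exp-stable weights
        logp = (-0.5 * ((x[:, None] - mu) / sig) ** 2
                - np.log(sig) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0)
        # M-step: the penalty shifts the weight update away from zero
        pi = (nk + lam) / (n + K * lam)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-8
    return pi, mu, sig
```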
This paper is concerned with two improved variants of the Hutch++ algorithm for estimating the trace of a square matrix that is implicitly given through matrix-vector products. Hutch++ combines randomized low-rank approximation in a first phase with stochastic trace estimation in a second phase. As a result, Hutch++ only requires $O\left(\varepsilon^{-1}\right)$ matrix-vector products to approximate the trace within a relative error $\varepsilon$ with high probability. This compares favorably with the $O\left(\varepsilon^{-2}\right)$ matrix-vector products needed when using stochastic trace estimation alone. In Hutch++, the number of matrix-vector products is fixed a priori and distributed in a prescribed fashion between the two phases. In this work, we derive an adaptive variant of Hutch++, which outputs an estimate of the trace that is within some prescribed error tolerance with a controllable failure probability, while splitting the matrix-vector products in a near-optimal way between the two phases. For the special case of symmetric positive semi-definite matrices, we present another variant of Hutch++, called Nystr\"om++, which utilizes the so-called Nystr\"om approximation and requires only one pass over the matrix, compared to the two passes needed by Hutch++. We extend the analysis of Hutch++ to Nystr\"om++. Numerical experiments demonstrate the effectiveness of our two new algorithms.
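For reference, the basic (non-adaptive) Hutch++ estimator that both variants build on fits in a few lines; `matvec` below is assumed to apply the implicit matrix $A$ to a block of vectors:

```python
import numpy as np

# Minimal sketch of the basic (non-adaptive) Hutch++ estimator. The m
# available matrix-vector products are split evenly among the three stages.
def hutchpp(matvec, n, m, rng=np.random.default_rng()):
    k = m // 3
    S = rng.standard_normal((n, k))
    G = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(matvec(S))      # phase 1: randomized range finder
    G = G - Q @ (Q.T @ G)               # deflate the captured subspace
    # phase 2: exact trace on the low-rank part + Hutchinson on the rest
    return np.trace(Q.T @ matvec(Q)) + np.trace(G.T @ matvec(G)) / k

# Example: A = np.random.randn(500, 20); hutchpp(lambda X: A @ (A.T @ X), 500, 60)
```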
Computing the top eigenvectors of a matrix is a problem of fundamental interest in various fields. While the majority of the literature has focused on analyzing the reconstruction error of low-rank matrices associated with the retrieved eigenvectors, in many applications one is interested in finding a single vector with a high Rayleigh quotient. In this paper we study the problem of approximating the top eigenvector. Given a symmetric matrix $\mathbf{A}$ with largest eigenvalue $\lambda_1$, our goal is to find a vector $\hat{\mathbf{u}}$ that approximates the leading eigenvector $\mathbf{u}_1$ with high accuracy, as measured by the ratio $R(\hat{\mathbf{u}})=\lambda_1^{-1}\,\hat{\mathbf{u}}^T\mathbf{A}\hat{\mathbf{u}}/(\hat{\mathbf{u}}^T\hat{\mathbf{u}})$. We present a novel analysis of the randomized SVD algorithm of \citet{halko2011finding} and derive tight bounds in many cases of interest. Notably, this is the first work that provides non-trivial bounds on $R(\hat{\mathbf{u}})$ for randomized SVD with any number of iterations. Our theoretical analysis is complemented by a thorough experimental study that confirms the efficiency and accuracy of the method.
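A minimal sketch of randomized subspace iteration in the spirit of Halko et al. (2011), returning a candidate $\hat{\mathbf{u}}$ together with its Rayleigh quotient (dividing the latter by $\lambda_1$, when known, gives $R(\hat{\mathbf{u}})$); the block size and iteration count below are illustrative:

```python
import numpy as np

# Randomized subspace iteration sketch: A symmetric, q power iterations,
# block size b. Returns an approximate leading eigenvector and its
# Rayleigh quotient.
def randomized_top_eigvec(A, q=3, b=8, rng=np.random.default_rng()):
    n = A.shape[0]
    Y = A @ rng.standard_normal((n, b))
    for _ in range(q):
        Q, _ = np.linalg.qr(Y)          # re-orthonormalize for stability
        Y = A @ Q
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ (A @ Q)                   # small b x b projected matrix
    w, V = np.linalg.eigh(B)
    u = Q @ V[:, -1]                    # approximate leading eigenvector
    return u, u @ (A @ u)               # u is unit-norm, so this is its R.Q.
```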
This paper deals with solving stochastic optimization problems with continuous time, state, and action spaces using reinforcement learning algorithms, and considers the policy evaluation process. We prove that standard learning algorithms based on the discretized temporal difference are doomed to fail as the time discretization tends to zero, because of the stochastic part of the dynamics. We propose a variance-reduction correction of the temporal difference, leading to new learning algorithms that are stable with respect to vanishing time steps. This allows us to give theoretical guarantees of convergence of our algorithms to the solutions of continuous-time stochastic optimization problems.
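The failure mode can be seen in a toy simulation (assumed dynamics: an Ornstein-Uhlenbeck process with test function $V(x)=x^2$, chosen purely for illustration): the one-step change of $V$ has mean of order $\delta t$ but fluctuations of order $\sqrt{\delta t}$, so the naive discretized temporal difference becomes noise-dominated as $\delta t \to 0$:

```python
import numpy as np

# Toy illustration: for dX = -X dt + dW and V(x) = x^2, the one-step change
# of V has mean O(dt) but standard deviation O(sqrt(dt)), so the signal-to-
# noise ratio of the naive discretized TD update degrades as dt -> 0.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
for dt in (1e-1, 1e-2, 1e-3):
    x_next = x - x * dt + np.sqrt(dt) * rng.standard_normal(x.size)
    dV = x_next ** 2 - x ** 2
    print(f"dt={dt:.0e}  mean(dV)={dV.mean():+.2e}  std(dV)={dV.std():.2e}")
```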
$\newcommand{\NP}{\mathsf{NP}}\newcommand{\GapSVP}{\textrm{GapSVP}}$We give a simple proof that the (approximate, decisional) Shortest Vector Problem is $\NP$-hard under a randomized reduction. Specifically, we show that for any $p \geq 1$ and any constant $\gamma < 2^{1/p}$, the $\gamma$-approximate problem in the $\ell_p$ norm ($\gamma$-$\GapSVP_p$) is not in $\mathsf{RP}$ unless $\NP \subseteq \mathsf{RP}$. Our proof follows an approach pioneered by Ajtai (STOC 1998), and strengthened by Micciancio (FOCS 1998 and SICOMP 2000), for showing hardness of $\gamma$-$\GapSVP_p$ using locally dense lattices. We construct such lattices simply by applying "Construction A" to Reed-Solomon codes with suitable parameters, and prove their local density via an elementary argument originally used in the context of Craig lattices. As in all known $\NP$-hardness results for $\GapSVP_p$ with $p < \infty$, our reduction uses randomness. Indeed, it is a notorious open problem to prove $\NP$-hardness via a deterministic reduction. To this end, we additionally discuss potential directions and associated challenges for derandomizing our reduction. In particular, we show that a close deterministic analogue of our local density construction would improve on the state-of-the-art explicit Reed-Solomon list-decoding lower bounds of Guruswami and Rudra (STOC 2005 and IEEE Trans. Inf. Theory 2006). As a related contribution of independent interest, we also give a polynomial-time algorithm for decoding $n$-dimensional "Construction A Reed-Solomon lattices" (with different parameters than those used in our hardness proof) to a distance within an $O(\sqrt{\log n})$ factor of Minkowski's bound. This asymptotically matches the best known distance for decoding near Minkowski's bound, due to Mook and Peikert (IEEE Trans. Inf. Theory 2022), whose work we build on with a somewhat simpler construction and analysis.
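As a toy illustration of the lattice construction (tiny parameters, not those of the hardness proof), Construction A applied to a Reed-Solomon code $C \subseteq \mathbb{F}_p^n$ yields the lattice $\{x \in \mathbb{Z}^n : x \bmod p \in C\}$:

```python
import numpy as np

# Toy "Construction A" lattice from a Reed-Solomon code (illustrative
# parameters only). The lattice is L = { x in Z^n : x mod p in C }, with
# determinant p^(n-k) since L contains p*Z^n with index p^k in Z^n.
p, n, k = 7, 7, 3                      # prime field size, length, dimension
alphas = list(range(n))                # distinct evaluation points in F_p
# Vandermonde generator: codewords evaluate polynomials of degree < k.
G = np.array([[pow(a, j, p) for j in range(k)] for a in alphas])   # n x k
rng = np.random.default_rng(0)
msg = rng.integers(0, p, k)
codeword = (G @ msg) % p               # a random codeword of C
v = codeword + p * rng.integers(-2, 3, n)   # adding p*Z^n stays in L
# A generating set of L as an integer row span: lifted generators + p*Z^n.
B = np.vstack([G.T % p, p * np.eye(n, dtype=int)])
print(v, "\ndet(L) = p^(n-k) =", p ** (n - k))
```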
We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms and that estimates the number of topics $K$ from the observed data. We derive new finite-sample minimax lower bounds for the estimation of the topic matrix $A$, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any number of documents ($n$), individual document lengths ($N_i$), dictionary size ($p$), and number of topics ($K$), and both $p$ and $K$ are allowed to increase with $n$, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though it starts out with the computational and theoretical disadvantage of not knowing the correct number of topics $K$, while the competing methods are given the correct value in our simulations.
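To fix notation, a generative sketch of the standard topic-model setup, with $A$ the $p \times K$ word-topic matrix and $N_i$ the individual document lengths (all parameter values are illustrative):

```python
import numpy as np

# Generative sketch of the standard topic model (to fix notation only):
# each document i draws N_i words from the mixture A @ w_i, where the
# columns of A are topic distributions over the p dictionary words.
rng = np.random.default_rng(0)
n, p, K = 100, 500, 5                        # documents, dictionary, topics
A = rng.dirichlet(np.ones(p), size=K).T      # p x K word-topic matrix
W = rng.dirichlet(np.ones(K), size=n).T      # K x n per-document weights
N = rng.integers(50, 200, size=n)            # individual document lengths N_i
X = np.column_stack([rng.multinomial(N[i], A @ W[:, i]) for i in range(n)])
# X has shape (p, n): observed word counts from which A and K are estimated.
```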
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm, called multi-step primal-dual (MSPD), together with its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS), based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
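The smoothing underlying DRS can be sketched as follows: averaging subgradients at Gaussian-perturbed points gives an unbiased gradient estimate of the smoothed objective $f_\gamma(x) = \mathbb{E}[f(x + \gamma e)]$ (a generic estimator for illustration, not the paper's full algorithm):

```python
import numpy as np

# Generic randomized-smoothing gradient oracle (illustration only):
# averaging subgradients of f at Gaussian-perturbed points yields an
# unbiased gradient estimate of f_gamma(x) = E[f(x + gamma * e)],
# which is smooth even when f itself is not.
def smoothed_grad(subgrad_f, x, gamma=0.1, batch=16,
                  rng=np.random.default_rng()):
    E = rng.standard_normal((batch, x.size))
    return np.mean([subgrad_f(x + gamma * e) for e in E], axis=0)

# Example with the non-smooth f(x) = ||x||_1, whose subgradient is sign(x):
g = smoothed_grad(np.sign, np.array([1.0, -0.5, 0.0]))
```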
In this paper, we study optimal convergence rates for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m} f_i(\mathbf{x})$ is (i) strongly convex and smooth, (ii) strongly convex, (iii) smooth, or (iv) just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improved condition numbers.
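As a toy illustration of the dual approach (not the paper's exact algorithm), consider the consensus problem $\min_x \sum_i \frac{1}{2}\|x - c_i\|^2$ with the network constraint encoded by a graph Laplacian; accelerated gradient ascent on the dual uses one multiplication by the Laplacian, i.e., one round of local communication, per gradient evaluation:

```python
import numpy as np

# Toy sketch: accelerated dual ascent for consensus averaging. Lap is a
# graph Laplacian; each multiplication by Lap is one communication round
# between neighbouring nodes. The dual gradient is Lap @ X with
# X = C - Lap @ lam the locally computed primal variables.
def accelerated_dual_consensus(Lap, C, iters=200):
    lam = np.zeros_like(C)
    lam_prev = lam.copy()
    eta = 1.0 / np.linalg.eigvalsh(Lap).max() ** 2   # 1 / lambda_max(Lap)^2
    for t in range(iters):
        y = lam + (t / (t + 3.0)) * (lam - lam_prev)  # Nesterov momentum
        X = C - Lap @ y                # local primal update given duals
        lam_prev, lam = lam, y + eta * (Lap @ X)      # dual gradient ascent
    return C - Lap @ lam               # rows converge to the mean of the c_i

# Example on a 4-node path graph:
Lap = np.array([[1, -1, 0, 0], [-1, 2, -1, 0],
                [0, -1, 2, -1], [0, 0, -1, 1]], float)
C = np.arange(8.0).reshape(4, 2)
print(accelerated_dual_consensus(Lap, C))   # each row ~ column means of C
```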