We study the fundamental problem of estimating the mean of a $d$-dimensional distribution with covariance $\Sigma \preccurlyeq \sigma^2 I_d$ given $n$ samples. When $d = 1$, Catoni \cite{catoni} gave an estimator with error $(1+o(1)) \cdot \sigma \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$, with probability $1 - \delta$, matching the Gaussian error rate. For $d>1$, a natural estimator outputs the center of the minimum enclosing ball of one-dimensional confidence intervals to achieve a $1-\delta$ confidence radius of $\sqrt{\frac{2 d}{d+1}} \cdot \sigma \left(\sqrt{\frac{d}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}\right)$, incurring a $\sqrt{\frac{2d}{d+1}}$-factor loss over the Gaussian rate. When the $\sqrt{\frac{d}{n}}$ term dominates by a $\sqrt{\log \frac{1}{\delta}}$ factor, \cite{lee2022optimal-highdim} showed an improved estimator matching the Gaussian rate. This raises a natural question: is the Gaussian rate achievable in general? Or is the $\sqrt{\frac{2 d}{d+1}}$ loss \emph{necessary} when the $\sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$ term dominates? We show that the answer to both questions is \emph{no}: \emph{some} constant-factor loss over the Gaussian rate is necessary, yet we construct an estimator that improves over the naive estimator above by a constant factor. We also consider robust estimation, where an adversary is allowed to corrupt an $\epsilon$-fraction of samples arbitrarily: in this case, we show that the above strategy of combining one-dimensional estimates and incurring the $\sqrt{\frac{2d}{d+1}}$-factor loss \emph{is} optimal in the infinite-sample limit.
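For concreteness, the dimension-dependent loss factor above is modest: $\frac{2d}{d+1} = 2 - \frac{2}{d+1}$ is increasing in $d$, so
\[
\sqrt{\tfrac{2d}{d+1}}\,\Big|_{d=1} = 1
\qquad\text{and}\qquad
\sqrt{\tfrac{2d}{d+1}} \nearrow \sqrt{2} \approx 1.414 \ \text{ as } d \to \infty.
\]
The questions above thus ask whether this at-most-$\sqrt{2}$ factor can be removed entirely or must be partially paid.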
Given an $n$-vertex $m$-edge digraph $G = (V,E)$ and a set $S \subseteq V$, $|S| = n^{\sigma}$ (for some $0 < \sigma \le 1$) of designated sources, the $S \times V$-direachability problem is to compute, for every $s \in S$, the set of all the vertices reachable from $s$ in $G$. Known naive algorithms for this problem either run a BFS/DFS separately from every source, and as a result require $O(m \cdot n^{\sigma})$ time, or compute the transitive closure of $G$ in $\tilde O(n^{\omega})$ time, where $\omega < 2.371552$ is the matrix multiplication exponent. Hence, the current state-of-the-art bound for the problem on graphs with $m = \Theta(n^{\mu})$ edges is $\tilde O(n^{\min \{\mu + \sigma, \omega \}})$. Our first contribution is an algorithm with running time $\tilde O(n^{1 + \frac{2}{3} \omega(\sigma)})$ for this problem, where $\omega(\sigma)$ is the rectangular matrix multiplication exponent. Using current state-of-the-art estimates of $\omega(\sigma)$, our exponent is better than $\min \{2 + \sigma, \omega \}$ for $\tilde \sigma \le \sigma \le 0.53$, where $1/3 < \tilde \sigma < 0.3336$ is a universal constant. Our second contribution is a sequence of algorithms $\mathcal A_0, \mathcal A_1, \mathcal A_2, \ldots$ for the $S \times V$-direachability problem. We argue that under a certain assumption that we introduce, for every $\tilde \sigma \le \sigma < 1$, there exists a sufficiently large index $k = k(\sigma)$ so that $\mathcal A_k$ improves upon the current state-of-the-art bounds for $S \times V$-direachability with $|S| = n^{\sigma}$, in the densest regime $\mu = 2$. We show that to prove this assumption, it suffices to devise an algorithm that computes a rectangular max-min matrix product roughly as efficiently as the ordinary $(+, \cdot)$ matrix product. Our algorithms heavily exploit recent constructions of directed shortcuts by Kogan and Parter.
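For reference, a minimal sketch of the first naive baseline mentioned above (a separate BFS from every source, $O(m \cdot |S|)$ time overall); the function and variable names are ours and purely illustrative:

```python
from collections import deque

def s_times_v_direachability(n, edges, sources):
    """Naive baseline from the abstract: a separate BFS from every designated
    source s in S, for O(m * |S|) total time."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    reach = {}
    for s in sources:
        seen = [False] * n
        seen[s] = True
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
        reach[s] = {v for v in range(n) if seen[v]}
    return reach
```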
We show new lower bounds in the \emph{Merlin-Arthur} (MA) communication model and the related \emph{annotated streaming} or stream verification model. The MA communication model is an enhancement of the classical communication model, where in addition to the usual players Alice and Bob, there is an all-powerful but untrusted player Merlin who knows their inputs and tries to convince them of the output. Most functions have MA protocols with total communication significantly smaller than what would be needed without Merlin. We focus on the online MA (OMA) model, which is the MA analogue of one-way communication, and introduce the notion of \emph{non-trivial-OMA} complexity of a function. This is the minimum total communication needed by any non-trivial OMA protocol computing the function, where a trivial OMA protocol is one in which Alice sends Bob roughly as many bits as she would have sent without Merlin. We prove a lower bound on the non-trivial-OMA complexity of a natural function, \emph{Equals-Index} (essentially the well-known Index problem on large domains), and identify it as a canonical problem for proving strong lower bounds on this complexity: reductions from it (i) reproduce and/or improve upon the lower bounds for all functions that were previously known to have large non-trivial-OMA complexity, (ii) exhibit the first explicit functions whose non-trivial-OMA complexity is superlinear, and even exponential, in their classical one-way complexity, and (iii) show functions on inputs of size $n$ for which this complexity is as large as $n/\log n$. While exhibiting a function with $\omega(\sqrt{n})$ (standard) OMA complexity is a longstanding open problem, previously no function was even known to have $\omega(\sqrt{n})$ non-trivial-OMA complexity. We further extend these lower bounds to the annotated streaming model.
This work concerns elementwise transformations of spiked matrices: $Y_n = n^{-1/2} f( \sqrt{n} X_n + Z_n)$. Here, $f$ is a function applied elementwise, $X_n$ is a low-rank signal matrix, and $Z_n$ is white noise. We find that principal component analysis is powerful for recovering signal under highly nonlinear or discontinuous transformations. Specifically, in the high-dimensional setting where $Y_n$ is of size $n \times p$ with $n,p \rightarrow \infty$ and $p/n \rightarrow \gamma > 0$, we uncover a phase transition: for signal-to-noise ratios above a sharp threshold -- depending on $f$, the distribution of the elements of $Z_n$, and the limiting aspect ratio $\gamma$ -- the principal components of $Y_n$ (partially) recover those of $X_n$. Below this threshold, the principal components of $Y_n$ are asymptotically orthogonal to the signal. In contrast, in the standard setting where $X_n + n^{-1/2}Z_n$ is observed directly, the analogous phase transition depends only on $\gamma$. A similar phenomenon occurs when $X_n$ is square and symmetric and $Z_n$ is a generalized Wigner matrix.
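A minimal simulation sketch of this model (the choice of $f$, the rank-one signal, and the signal strength are illustrative assumptions, not taken from the paper): generate $Y_n = n^{-1/2} f(\sqrt{n} X_n + Z_n)$ and measure how well the top principal component of $Y_n$ aligns with the signal direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 2000, 1000, 3.0            # theta: illustrative signal strength
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(p); v /= np.linalg.norm(v)
X = theta * np.outer(u, v)               # rank-one signal matrix X_n
Z = rng.standard_normal((n, p))          # white-noise matrix Z_n

f = np.sign                              # a discontinuous elementwise transformation
Y = f(np.sqrt(n) * X + Z) / np.sqrt(n)   # Y_n = n^{-1/2} f(sqrt(n) X_n + Z_n)

# Overlap of the top right singular vector of Y_n with the signal direction v:
# per the phase transition described above, it is near 0 below the threshold
# and bounded away from 0 above it.
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
print(abs(Vt[0] @ v))
```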
In this paper, we study the problem of estimating the normalizing constant $\int e^{-\lambda f(x)}\,dx$ through queries to the black-box function $f$, where $f$ belongs to a reproducing kernel Hilbert space (RKHS), and $\lambda$ is a problem parameter. We show that to estimate the normalizing constant within a small relative error, the level of difficulty depends on the value of $\lambda$: when $\lambda$ approaches zero, the problem is similar to Bayesian quadrature (BQ), while when $\lambda$ approaches infinity, the problem is similar to Bayesian optimization (BO). More generally, the problem interpolates between BQ and BO. We find that this pattern holds even when the function evaluations are noisy, which brings new aspects to the problem. Our findings are supported by both algorithm-independent lower bounds and algorithmic upper bounds, as well as by simulation studies conducted on a variety of benchmark functions.
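As a point of reference for the quantity being estimated, here is a plain Monte Carlo estimator of $\int_{[0,1]^d} e^{-\lambda f(x)}\,dx$ from black-box queries to $f$; this only illustrates the problem setup and is not the RKHS-based procedure analyzed in the paper (the bounded domain and the toy $f$ are our assumptions).

```python
import numpy as np

def normalizing_constant_mc(f, lam, dim, n_queries, rng=None):
    """Plain Monte Carlo estimate of Z(lam) = int_{[0,1]^dim} exp(-lam * f(x)) dx
    from n_queries black-box evaluations of f.  Reference estimator only,
    not the RKHS-based procedure analyzed in the paper."""
    rng = rng or np.random.default_rng(0)
    x = rng.random((n_queries, dim))            # uniform queries on [0, 1]^dim
    return float(np.mean(np.exp(-lam * f(x))))

# Toy example (our choice): f(x) = ||x||^2 on [0, 1]^2, moderate lambda.
print(normalizing_constant_mc(lambda x: (x ** 2).sum(axis=1),
                              lam=5.0, dim=2, n_queries=10_000))
```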
We present Classy Ensemble, a novel ensemble-generation algorithm for classification tasks, which aggregates models through a weighted combination of per-class accuracy. Testing over 153 machine learning datasets, we demonstrate that Classy Ensemble outperforms two other well-known aggregation algorithms -- order-based pruning and clustering-based pruning -- as well as the recently introduced lexigarden ensemble generator. We then present three enhancements: 1) Classy Cluster Ensemble, which combines Classy Ensemble and cluster-based pruning; 2) deep learning experiments, showing the merits of Classy Ensemble over four image datasets: Fashion MNIST, CIFAR10, CIFAR100, and ImageNet; and 3) Classy Evolutionary Ensemble, wherein an evolutionary algorithm is used to select the set of models from which Classy Ensemble picks. The latter, which combines learning and evolution, achieved improved performance on the hardest dataset.
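The abstract does not spell out the aggregation rule, but one natural reading of ``a weighted combination of per-class accuracy'' is to weight each model's class-$c$ probability by that model's validation accuracy on class $c$; the sketch below implements this hypothetical reading and should not be taken as the paper's exact algorithm.

```python
import numpy as np

def classy_style_aggregate(probas, per_class_acc):
    """Hypothetical reading of 'weighted combination of per-class accuracy':
    weight model m's predicted probability for class c by model m's validation
    accuracy on class c, sum over models, and predict the argmax class.
    probas:        list of (n_samples, n_classes) arrays, one per model
    per_class_acc: (n_models, n_classes) array of validation accuracies
    NOTE: illustrative sketch only -- not necessarily the paper's exact rule."""
    weighted = sum(w * p for w, p in zip(per_class_acc, probas))
    return np.argmax(weighted, axis=1)
```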
In 1996, Karger [Kar96] gave a startling randomized algorithm that finds a minimum cut in a (weighted) graph in time $O(m\log^3 n)$, which he termed near-linear time, meaning linear (in the size of the input) times a polylogarithmic factor. In this paper, we give the first deterministic algorithm which runs in near-linear time for weighted graphs. Previously, the breakthrough results of Kawarabayashi and Thorup [KT19] gave a near-linear time algorithm for simple graphs. The main technique there is a clustering procedure that perfectly preserves minimum cuts. Recently, Li [Li21] gave an $m^{1+o(1)}$ deterministic minimum-cut algorithm for weighted graphs; this form of running time has been termed "almost-linear''. Li uses almost-linear time deterministic expander decompositions which do not perfectly preserve minimum cuts, but he can use these clusterings to, in a sense, "derandomize'' the methods of Karger. In terms of techniques, we provide a structural theorem which says that there exists a sparse clustering that preserves minimum cuts in a weighted graph with $o(1)$ error, and we construct it deterministically in near-linear time. This was done exactly for simple graphs in [KT19, HRW20] and with polylogarithmic error for weighted graphs in [Li21]. Extending the techniques in [KT19, HRW20] to weighted graphs presents significant challenges, and moreover, the resulting algorithm only preserves minimum cuts up to a polylogarithmic approximation. A remaining challenge is to reduce the polylogarithmic-approximate clusterings to $(1+o(1/\log n))$-approximate ones, so that they can be applied recursively as in [Li21] over $O(\log n)$ many levels. This is an additional challenge that requires building on properties of tree packings in the presence of a wide range of edge weights to, for example, find sources for local flow computations which identify minimum cuts that cross clusters.
We introduce a novel method that combines differential geometry, kernel smoothing, and spectral analysis to quantify facial muscle activity from widely accessible video recordings, such as those captured on personal smartphones. Our approach emphasizes practicality and accessibility. It has significant potential for applications in national security and plastic surgery. Additionally, it offers remote diagnosis and monitoring for medical conditions such as stroke, Bell's palsy, and acoustic neuroma. Moreover, it is adept at detecting and classifying emotions, from the overt to the subtle. The proposed facial muscle analysis technique is an explainable alternative to deep learning methods and a non-invasive substitute for facial electromyography (fEMG).
We consider the performance of a least-squares regression model, as judged by out-of-sample $R^2$. Shapley values give a fair attribution of the performance of a model to its input features, taking into account interdependencies between features. Evaluating the Shapley values exactly requires solving a number of regression problems that is exponential in the number of features, so a Monte Carlo-type approximation is typically used. We focus on the special case of least-squares regression models, where several tricks can be used to compute and evaluate the regression models efficiently. These tricks give a substantial speedup, allowing many more Monte Carlo samples to be evaluated and thereby achieving better accuracy. We refer to our method as least-squares Shapley performance attribution (LS-SPA), and describe our open-source implementation.
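A minimal sketch of the permutation-sampling Shapley attribution of out-of-sample $R^2$ that LS-SPA accelerates; this plain version refits the least-squares model from scratch at every step and omits the paper's speedup tricks (function names and the no-intercept convention are ours).

```python
import numpy as np

def shapley_r2_mc(X_tr, y_tr, X_te, y_te, n_perms=100, rng=None):
    """Monte Carlo Shapley attribution of out-of-sample R^2 across features:
    for random feature orderings, credit each feature with the increase in
    test R^2 observed when it is added after the features preceding it.
    (Plain permutation-sampling baseline; LS-SPA adds least-squares-specific
    speedups on top of this idea.)"""
    rng = rng or np.random.default_rng(0)
    p = X_tr.shape[1]
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)

    def test_r2(cols):
        beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
        resid = y_te - X_te[:, cols] @ beta
        return 1.0 - np.sum(resid ** 2) / ss_tot

    attr = np.zeros(p)
    for _ in range(n_perms):
        order = rng.permutation(p)
        prev, cols = 0.0, []
        for j in order:
            cols.append(j)
            cur = test_r2(cols)
            attr[j] += cur - prev
            prev = cur
    return attr / n_perms        # entries sum to the full model's test R^2
```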
Standard multidimensional scaling takes as input a dissimilarity matrix whose general term $\delta_{ij}$ is a numerical value. In this paper, the input is $\delta_{ij}=[\underline{\delta_{ij}},\overline{\delta_{ij}}]$, where $\underline{\delta_{ij}}$ and $\overline{\delta_{ij}}$ are, respectively, the lower and upper bounds of the ``dissimilarity'' between the stimulus/object $S_i$ and the stimulus/object $S_j$. As output, instead of representing each stimulus/object by a point on a factorial plane, as in other multidimensional scaling methods, the proposed method visualizes each stimulus/object by a rectangle, in order to represent the variation of the dissimilarities. We generalize the classical scaling method, seeking an approach that produces results similar to those obtained by Tops Principal Components Analysis. Two examples are presented to illustrate the effectiveness of the proposed method.
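For context, a minimal implementation of the classical (Torgerson) scaling step that the proposed method generalizes; it handles only numerical $\delta_{ij}$ and does not include the rectangle construction for interval-valued dissimilarities.

```python
import numpy as np

def classical_scaling(D, k=2):
    """Classical (Torgerson) scaling of a numerical dissimilarity matrix D:
    double-center the squared dissimilarities and embed via the top-k
    eigenpairs, giving one point per stimulus/object."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]          # largest k eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```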
Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 6 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.