In this work, we study the problem of approximating the distance to subsequence-freeness in the sample-based distribution-free model. For a given subsequence (word) $w = w_1 \dots w_k$, a sequence (text) $T = t_1 \dots t_n$ is said to contain $w$ if there exist indices $1 \leq i_1 < \dots < i_k \leq n$ such that $t_{i_j} = w_j$ for every $1 \leq j \leq k$. Otherwise, $T$ is $w$-free. Ron and Rosin (ACM TOCT 2022) showed that the number of samples both necessary and sufficient for one-sided error testing of subsequence-freeness in the sample-based distribution-free model is $\Theta(k/\epsilon)$. Denoting by $\Delta(T,w,p)$ the distance of $T$ to $w$-freeness under a distribution $p : [n] \to [0,1]$, we are interested in obtaining an estimate $\widehat{\Delta}$ such that $|\widehat{\Delta} - \Delta(T,w,p)| \leq \delta$ with probability at least $2/3$, for a given error parameter $\delta$. Our main result is an algorithm whose sample complexity is $\tilde{O}(k^2/\delta^2)$. We first present an algorithm that works when the underlying distribution $p$ is uniform, and then show how it can be modified to work for any (unknown) distribution $p$. We also show that a quadratic dependence on $1/\delta$ is necessary.
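To make the containment definition concrete, here is a short illustrative check that a text contains a word as a subsequence; the greedy left-to-right matching is the standard folklore algorithm, not a procedure taken from the paper.

```python
def contains_subsequence(text, word):
    """Check the definition above: text contains word as a subsequence
    iff word's letters can be matched left to right at increasing
    indices. Greedy matching is correct and runs in O(len(text))."""
    it = iter(text)
    return all(ch in it for ch in word)  # `in` consumes the iterator

assert contains_subsequence("abcabc", "aba")   # match at indices 1, 2, 4
assert not contains_subsequence("abc", "acb")  # "abc" is "acb"-free
```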
We propose PROPAGATE, a fast approximation framework for estimating distance-based metrics on very large graphs, such as the (effective) diameter, the (effective) radius, or the average distance, within a small error. The framework assigns seeds to nodes and propagates them in a BFS-like fashion, computing the neighborhood sets until we obtain either the whole vertex set (for the diameter) or a given percentage of it (for the effective diameter). At each iteration, we derive compressed Boolean representations of the neighborhood sets discovered so far. The PROPAGATE framework yields two algorithms: PROPAGATE-P, which propagates all the $s$ seeds in parallel, and PROPAGATE-S, which propagates the seeds sequentially. For each node, the compressed representation of the PROPAGATE-P algorithm requires $s$ bits, while that of PROPAGATE-S requires only $1$ bit. Both algorithms compute the average distance, the effective diameter, the diameter, and the connectivity rate within a small error with high probability: for any $\varepsilon>0$ and using $s=\Theta\left(\frac{\log n}{\varepsilon^2}\right)$ sample nodes, the error for the average distance is bounded by $\xi = \frac{\varepsilon \Delta}{\alpha}$, the errors for the effective diameter and the diameter are bounded by $\xi = \frac{\varepsilon}{\alpha}$, and the error for the connectivity rate is bounded by $\varepsilon$, where $\Delta$ is the diameter and $\alpha$ is a measure of the connectivity of the graph. The time complexity is $\mathcal{O}\left(m\Delta \frac{\log n}{\varepsilon^2}\right)$, where $m$ is the number of edges of the graph. The experimental results show that the PROPAGATE framework improves on the current state of the art in both accuracy and speed. Moreover, we experimentally show that PROPAGATE-S is also very efficient for solving the All-Pairs Shortest Path problem in very large graphs.
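As a rough illustration of the propagation scheme described above, the following minimal sketch mimics a PROPAGATE-P-style computation under assumed details: unweighted adjacency lists, Python integers used as $s$-bit registers, and no compression beyond the bit masks. The actual representations and estimators are those of the paper.

```python
import random

def propagate_p(adj, s, rng_seed=0):
    """Sketch of parallel seed propagation. adj: adjacency lists of an
    undirected graph. Each of s sampled seeds starts a BFS; every node
    keeps an s-bit mask recording which seeds have reached it.
    Returns counts[r] = number of (node, seed) pairs within distance r,
    from which average distance / effective diameter can be estimated."""
    n = len(adj)
    seeds = random.Random(rng_seed).sample(range(n), s)
    mask = [0] * n                         # one s-bit register per node
    for i, v in enumerate(seeds):
        mask[v] |= 1 << i
    counts = [sum(bin(m).count("1") for m in mask)]   # distance 0
    while True:
        new_mask = mask[:]
        for u in range(n):
            if mask[u]:
                for v in adj[u]:
                    new_mask[v] |= mask[u]  # push seed bits one hop
        if new_mask == mask:                # masks stabilized: done
            break
        mask = new_mask
        counts.append(sum(bin(m).count("1") for m in mask))
    return counts
```

Here `len(counts) - 1` is the largest eccentricity among the sampled seeds, which approximates the diameter as $s$ grows.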
Standard conformal prediction methods provide a marginal coverage guarantee, which means that for a random test point, the conformal prediction set contains the true label with a user-chosen probability. In many classification problems, we would like to obtain a stronger guarantee -- that for test points of a specific class, the prediction set contains the true label with the same user-chosen probability. Existing conformal prediction methods do not work well when there is a limited amount of labeled data per class, as is often the case in real applications where the number of classes is large. We propose a method called clustered conformal prediction, which clusters together classes that have "similar" conformal scores and then performs conformal prediction at the cluster level. Based on empirical evaluation across four image data sets with many (up to 1000) classes, we find that clustered conformal prediction typically outperforms existing methods in terms of class-conditional coverage and set size metrics.
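The cluster-level calibration step can be sketched as follows. This is a toy rendering under assumed details (quantile-based class embeddings, k-means clustering, the usual finite-sample conformal quantile); the function and variable names are illustrative, not the paper's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_conformal_thresholds(scores, labels, n_clusters, alpha):
    """scores: calibration conformal scores (higher = worse fit);
    labels: true classes for the calibration points. Classes with
    similar score distributions are clustered, then one (1 - alpha)
    conformal quantile is computed per cluster instead of per class.
    Returns a dict mapping each class to its threshold."""
    classes = np.unique(labels)
    # Embed each class by a few quantiles of its conformal scores.
    embed = np.array([np.quantile(scores[labels == c], [0.5, 0.7, 0.9])
                      for c in classes])
    cluster_of = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embed)
    qhat = {}
    for k in range(n_clusters):
        pooled = scores[np.isin(labels, classes[cluster_of == k])]
        m = len(pooled)
        # Finite-sample conformal quantile at level ceil((m+1)(1-alpha))/m.
        qhat[k] = np.quantile(pooled,
                              min(1.0, np.ceil((m + 1) * (1 - alpha)) / m))
    return {c: qhat[cluster_of[i]] for i, c in enumerate(classes)}
```

At test time, the prediction set for $x$ would contain every class $c$ whose score is at most the threshold of $c$'s cluster.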
Learning to control unknown nonlinear dynamical systems is a fundamental problem in reinforcement learning and control theory. A commonly applied approach is to first explore the environment (exploration), learn an accurate model of it (system identification), and then compute an optimal controller with the minimum cost on this estimated system (policy optimization). While existing work has shown that it is possible to learn a uniformly good model of the system~\citep{mania2020active}, in practice, if we aim to learn a good controller with a low cost on the actual system, certain system parameters may be significantly more critical than others, and we therefore ought to focus our exploration on learning such parameters. In this work, we consider the setting of nonlinear dynamical systems and seek to formally quantify (a) which parameters are most relevant to learning a good controller, and (b) how we can best explore so as to minimize uncertainty in such parameters. Inspired by recent work in linear systems~\citep{wagenmaker2021task}, we show that minimizing the controller loss in nonlinear systems translates to estimating the system parameters in a particular, task-dependent metric. Motivated by this, we develop an algorithm that efficiently explores the system to reduce uncertainty in this metric, and prove a lower bound showing that our approach learns a controller at a near-instance-optimal rate. Our algorithm relies on a general reduction from policy optimization to optimal experiment design in arbitrary systems, and may be of independent interest. We conclude with experiments demonstrating the effectiveness of our method in realistic nonlinear robotic systems.
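The reduction from policy optimization to experiment design suggests a simple selection rule, sketched below on a toy formulation. Everything here is an illustrative assumption (the matrix `H` standing in for the task-dependent metric, candidate policies summarized by their Fisher information matrices); it is not the paper's algorithm.

```python
import numpy as np

def pick_exploration_policy(candidate_infos, H):
    """candidate_infos[i]: parameter information matrix gathered by
    candidate exploration policy i; H: task-dependent weighting induced
    by the controller loss. Pick the policy minimizing the weighted
    posterior uncertainty tr(H @ inv(information))."""
    def weighted_uncertainty(info):
        reg = info + 1e-8 * np.eye(len(info))   # regularize for inversion
        return np.trace(H @ np.linalg.inv(reg))
    return min(range(len(candidate_infos)),
               key=lambda i: weighted_uncertainty(candidate_infos[i]))
```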
In this paper we consider the weighted $k$-Hamming and $k$-Edit distances, which are natural generalizations of the classical Hamming and Edit distances. As the main results of this paper, we prove that for any $k\geq 2$ the DECIS-$k$-Hamming problem is PSPACE-complete and the DECIS-$k$-Edit problem is NEXPTIME-complete.
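Since the abstract does not define the weighted $k$-variants, the sketch below shows only the classical weighted Edit distance, the $k=1$ baseline being generalized, computed by the textbook dynamic program.

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost, del_cost):
    """Classical weighted edit distance between strings a and b.
    sub_cost(x, y), ins_cost(y), del_cost(x) are user-supplied weights.
    Standard O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                  # delete all of a's prefix
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):                  # insert all of b's prefix
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1]
                                   else sub_cost(a[i - 1], b[j - 1])),
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
            )
    return d[n][m]
```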
Classic learning theory suggests that proper regularization is the key to good generalization and robustness. In classification, current training schemes only target the complexity of the classifier itself, which can be misleading and ineffective. Instead, we advocate directly measuring the complexity of the decision boundary. The existing literature in this area is limited, with few well-established definitions of boundary complexity. As a proof of concept, we start by analyzing ReLU neural networks, whose boundary complexity can be conveniently characterized by the number of affine pieces. With the help of tropical geometry, we develop a novel method that can explicitly count the exact number of boundary pieces, and as a by-product, the exact number of total affine pieces. Numerical experiments are conducted and distinctive properties of our boundary complexity are uncovered. First, the boundary piece count appears largely independent of other measures, e.g., the total piece count and the $l_2$ norm of the weights, during the training process. Second, the boundary piece count is negatively correlated with robustness: popular robust training techniques, e.g., adversarial training or random noise injection, are found to reduce the number of boundary pieces.
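To illustrate what is being counted, the brute-force sketch below samples activation patterns of a toy two-input ReLU network on a dense grid; this is NOT the paper's exact tropical-geometry computation, only a sampling approximation of the same two quantities.

```python
import numpy as np

def count_pieces(W1, b1, w2, b2, box=5.0, grid=400):
    """For a one-hidden-layer net f(x) = w2 @ relu(W1 @ x + b1) + b2
    with 2-dimensional input, enumerate activation patterns seen on a
    grid over [-box, box]^2. Each pattern is one affine piece; a piece
    meets the decision boundary f(x) = 0 iff f takes both signs on it."""
    xs = np.linspace(-box, box, grid)
    signs_on_piece = {}                       # pattern -> set of signs of f
    for x in xs:
        for y in xs:
            pre = W1 @ np.array([x, y]) + b1
            pat = tuple(pre > 0)              # the activation pattern
            f = float(w2 @ np.maximum(pre, 0.0) + b2)
            signs_on_piece.setdefault(pat, set()).add(f > 0)
    total = len(signs_on_piece)
    boundary = sum(len(s) == 2 for s in signs_on_piece.values())
    return total, boundary
```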
We consider the task of estimating a conditional density using i.i.d. samples from a joint distribution, which is a fundamental problem with applications in both classification and uncertainty quantification for regression. For joint density estimation, minimax rates have been characterized for general density classes in terms of uniform (metric) entropy, a well-studied notion of statistical capacity. When applying these results to conditional density estimation, the use of uniform entropy -- which is infinite when the covariate space is unbounded and suffers from the curse of dimensionality -- can lead to suboptimal rates. Consequently, minimax rates for conditional density estimation cannot be characterized using these classical results. We resolve this problem for well-specified models, obtaining matching (within logarithmic factors) upper and lower bounds on the minimax Kullback--Leibler risk in terms of the empirical Hellinger entropy for the conditional density class. The use of empirical entropy allows us to appeal to concentration arguments based on local Rademacher complexity, which -- in contrast to uniform entropy -- leads to matching rates for large, potentially nonparametric classes and captures the correct dependence on the complexity of the covariate space. Our results require only that the conditional densities are bounded above, and do not require that they are bounded below or otherwise satisfy any tail conditions.
This paper develops an approximation to the (effective) $p$-resistance and applies it to multi-class clustering. Spectral methods based on the graph Laplacian and its generalization to the graph $p$-Laplacian have been a backbone of non-Euclidean clustering techniques. The advantage of the $p$-Laplacian is that the parameter $p$ induces a controllable bias on cluster structure. The drawback of $p$-Laplacian eigenvector-based methods is that the third and higher eigenvectors are difficult to compute. Thus, instead, we are motivated to use the $p$-resistance induced by the $p$-Laplacian for clustering. For the $p$-resistance, small $p$ biases towards clusters with high internal connectivity, while large $p$ biases towards clusters of small ``extent,'' that is, a preference for smaller shortest-path distances between vertices in the cluster. However, the $p$-resistance is expensive to compute. We overcome this by developing an approximation to the $p$-resistance. We prove upper and lower bounds on this approximation and observe that it is exact when the graph is a tree. We also provide theoretical justification for the use of $p$-resistance for clustering. Finally, we provide experiments comparing our approximated $p$-resistance clustering to other $p$-Laplacian based methods.
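For intuition, the classical $p = 2$ effective resistance, the special case that the paper's approximation generalizes, can be computed exactly from the Laplacian pseudoinverse via the standard identity $R_{ij} = L^+_{ii} + L^+_{jj} - 2L^+_{ij}$.

```python
import numpy as np

def effective_resistance(adj):
    """Classical (p = 2) effective resistance of a graph. adj: symmetric
    weighted adjacency matrix. Returns R with R[i, j] the effective
    resistance between nodes i and j."""
    L = np.diag(adj.sum(axis=1)) - adj        # graph Laplacian
    Lp = np.linalg.pinv(L)                    # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp   # R_ij = L+_ii + L+_jj - 2 L+_ij
```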
We study the Electrical Impedance Tomography Bayesian inverse problem for recovering the conductivity given noisy measurements of the voltage on some boundary surface electrodes. The uncertain conductivity depends linearly on a countable number of uniformly distributed random parameters in a compact interval, with the coefficient functions in the linear expansion decaying at an algebraic rate. We analyze the surrogate Markov Chain Monte Carlo (MCMC) approach for sampling the posterior probability measure, where the multivariate sparse adaptive interpolation, with interpolating points chosen according to a lower index set, is used for approximating the forward map. The forward equation is approximated once before running the MCMC for all the realizations, using interpolation on the finite element (FE) approximation at the parametric interpolating points. When evaluation of the solution is needed for a realization, we only need to compute a polynomial, thus drastically cutting the computation time. We contribute a rigorous error estimate for the MCMC convergence. In particular, we show that there is a nested sequence of interpolating lower index sets for which we can derive an interpolation error estimate in terms of the cardinality of these sets, uniformly for all the parameter realizations. An explicit convergence rate for the MCMC sampling of the posterior expectation of the conductivity is rigorously derived, in terms of the number of interpolation points, the accuracy of the FE approximation of the forward equation, and the number of MCMC samples. We perform numerical experiments using an adaptive greedy approach to construct the sets of interpolation points. We show the benefits of this approach over plain MCMC, where the forward equation is solved repeatedly for every sample, and over non-adaptive surrogate MCMC with an isotropic index set that treats all the random parameters equally.
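The overall sampling loop can be sketched on a toy problem as follows: a precomputed surrogate replaces the forward solve inside a random-walk Metropolis iteration. The uniform prior on $[-1,1]^{\dim}$ matches the parametrization above, but the Gaussian noise model, step size, and names are illustrative assumptions.

```python
import numpy as np

def surrogate_mcmc(surrogate, data, sigma, n_steps, dim, step=0.1):
    """surrogate(theta): precomputed polynomial approximation of the
    forward map (built once, before sampling); each MCMC step evaluates
    only this polynomial instead of solving the forward PDE.
    Random-walk Metropolis with uniform prior on [-1, 1]^dim and
    Gaussian observation noise of standard deviation sigma."""
    def log_post(theta):
        if np.any(np.abs(theta) > 1.0):
            return -np.inf                      # outside the prior support
        r = surrogate(theta) - data
        return -0.5 * np.sum(r * r) / sigma**2  # Gaussian log-likelihood
    theta, lp = np.zeros(dim), 0.0
    lp = log_post(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + step * np.random.randn(dim)
        lp_prop = log_post(prop)
        if np.log(np.random.rand()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)   # average these to estimate posterior means
```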
We prove the three-candidate Plurality is Stablest Conjecture of Khot-Kindler-Mossel-O'Donnell from 2005 for correlations $\rho$ satisfying $-1/36<\rho<1/11$: the Plurality function is the most noise stable three-candidate election method with small influences, when the corrupted votes have correlation $-1/36<\rho<1/11$ with the original votes. The previous best result of this type only achieved positive correlations at most $10^{-10^{10}}$. Our result follows by solving the three-set Standard Simplex Conjecture of Isaksson-Mossel from 2011 for all correlations $-1/36<\rho<1/11$. The Gaussian Double Bubble Theorem corresponds to the case $\rho\to1^{-}$, so in some sense, our result is a generalization of the Gaussian Double Bubble Theorem. Our result is also notable since it is the first result for any $\rho<0$, which is the only relevant case for computational hardness of MAX-3-CUT. In fact, assuming the Unique Games Conjecture, we show that MAX-3-CUT is NP-hard to approximate within a multiplicative factor of $.9875$, which improves on the known (unconditional) NP-hardness of approximation within a factor of $1-(1/102)$, proven in 1997. As an additional corollary, we conclude that three-candidate Borda Count is stablest for all $-1/36<\rho<1/11$.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
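The re-weighting scheme follows directly from the stated formula for the effective number of samples. A minimal sketch (the normalization convention, scaling the weights to sum to the number of classes, is an assumption here):

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights from the effective number of samples
    E_n = (1 - beta**n) / (1 - beta). beta = 0 gives uniform weights
    (no re-balancing); beta -> 1 recovers inverse class frequency.
    Weights are normalized to sum to the number of classes."""
    n = np.asarray(samples_per_class, dtype=float)
    eff = (1.0 - np.power(beta, n)) / (1.0 - beta)  # effective numbers
    w = 1.0 / eff                                   # inverse effective number
    return w * len(n) / w.sum()

# Example: a long-tailed class distribution.
print(class_balanced_weights([5000, 500, 50, 5]))
```

These weights would then multiply the per-class terms of the training loss (e.g., softmax cross-entropy) to form the class-balanced loss.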