The Chebyshev or $\ell_{\infty}$ estimator is an unconventional alternative to the ordinary least squares in solving linear regressions. It is defined as the minimizer of the $\ell_{\infty}$ objective function \begin{align*} \hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta}} \|\boldsymbol{Y} - \mathbf{X}\boldsymbol{\beta}\|_{\infty}. \end{align*} The asymptotic distribution of the Chebyshev estimator under fixed number of covariates was recently studied (Knight, 2020), yet finite sample guarantees and generalizations to high-dimensional settings remain open. In this paper, we develop non-asymptotic upper bounds on the estimation error $\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^*\|_2$ for a Chebyshev estimator $\hat{\boldsymbol{\beta}}$, in a regression setting with uniformly distributed noise $\varepsilon_i\sim U([-a,a])$ where $a$ is either known or unknown. With relatively mild assumptions on the (random) design matrix $\mathbf{X}$, we can bound the error rate by $\frac{C_p}{n}$ with high probability, for some constant $C_p$ depending on the dimension $p$ and the law of the design. Furthermore, we illustrate that there exist designs for which the Chebyshev estimator is (nearly) minimax optimal. On the other hand we also argue that there exist designs for which this estimator behaves sub-optimally in terms of the constant $C_p$'s dependence on $p$. In addition we show that "Chebyshev's LASSO" has advantages over the regular LASSO in high dimensional situations, provided that the noise is uniform. Specifically, we argue that it achieves a much faster rate of estimation under certain assumptions on the growth rate of the sparsity level and the ambient dimension with respect to the sample size.
Fair distribution of indivisible tasks with non-positive valuations (aka chores) has given rise to a large body of work in recent years. A popular approximate fairness notion is envy-freeness up to one item (EF1), which requires that any pairwise envy can be eliminated by the removal of a single item. While an EF1 and Pareto optimal (PO) allocation of goods always exists and can be computed via several well-known algorithms, even the existence of such solutions for chores remains open, to date. We take an epistemic approach utilizing information asymmetry by introducing dubious chores -- items that inflict no cost on receiving agents, but are perceived costly by others. On a technical level, dubious chores provide a more fine-grained approximation of envy-freeness -- compared to relaxations such as EF1 -- which enables progress towards addressing open problems on the existence and computation of EF1 and PO. In particular, we show that finding allocations with optimal number of dubious chores is computationally hard even for highly restricted classes of valuations. Nonetheless, we prove the existence of envy-free and PO allocations for $n$ agents with only $2n-2$ dubious chores and strengthen it to $n-1$ dubious chores in four special classes of valuations. Our experimental analysis demonstrate that baseline algorithms only require a relatively small number of dubious chores to achieve envy-freeness in practice.
When an exposure of interest is confounded by unmeasured factors, an instrumental variable (IV) can be used to identify and estimate certain causal contrasts. Identification of the marginal average treatment effect (ATE) from IVs relies on strong untestable structural assumptions. When one is unwilling to assert such structure, IVs can nonetheless be used to construct bounds on the ATE. Famously, Balke and Pearl (1997) proved tight bounds on the ATE for a binary outcome, in a randomized trial with noncompliance and no covariate information. We demonstrate how these bounds remain useful in observational settings with baseline confounders of the IV, as well as randomized trials with measured baseline covariates. The resulting bounds on the ATE are non-smooth functionals, and thus standard nonparametric efficiency theory is not immediately applicable. To remedy this, we propose (1) under a novel margin condition, influence function-based estimators of the bounds that can attain parametric convergence rates when the nuisance functions are modeled flexibly, and (2) estimators of smooth approximations of these bounds. We propose extensions to continuous outcomes, explore finite sample properties in simulations, and illustrate the proposed estimators in a randomized experiment studying the effects of vaccination encouragement on flu-related hospital visits.
The question of whether $Y$ can be predicted based on $X$ often arises and while a well adjusted model may perform well on observed data, the risk of overfitting always exists, leading to poor generalization error on unseen data. This paper proposes a rigorous permutation test to assess the credibility of high $R^2$ values in regression models, which can also be applied to any measure of goodness of fit, without the need for sample splitting, by generating new pairings of $(X_i, Y_j)$ and providing an overall interpretation of the model's accuracy. It introduces a new formulation of the null hypothesis and justification for the test, which distinguishes it from previous literature. The theoretical findings are applied to both simulated data and sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test, showing that the less informative the predictors, the lower the probability of rejecting the null hypothesis, and emphasizing that detecting weaker dependence between variables requires a sufficient sample size.
We describe a practical algorithm for computing normal forms for semigroups and monoids with finite presentations satisfying so-called small overlap conditions. Small overlap conditions are natural conditions on the relations in a presentation, which were introduced by J. H. Remmers and subsequently studied extensively by M. Kambites. Presentations satisfying these conditions are ubiquitous; Kambites showed that a randomly chosen finite presentation satisfies the $C(4)$ condition with probability tending to 1 as the sum of the lengths of relation words tends to infinity. Kambites also showed that several key problems for finitely presented semigroups and monoids are tractable in $C(4)$ monoids: the word problem is solvable in $O(\min\{|u|, |v|\})$ time in the size of the input words $u$ and $v$; the uniform word problem for $\langle A|R\rangle$ is solvable in $O(N ^ 2 \min\{|u|, |v|\})$ where $N$ is the sum of the lengths of the words in $R$; and a normal form for any given word $u$ can be found in $O(|u|)$ time. Although Kambites' algorithm for solving the word problem in $C(4)$ monoids is highly practical, it appears that the coefficients in the linear time algorithm for computing normal forms are too large in practice. In this paper, we present an algorithm for computing normal forms in $C(4)$ monoids that has time complexity $O(|u| ^ 2)$ for input word $u$, but where the coefficients are sufficiently small to allow for practical computation. Additionally, we show that the uniform word problem for small overlap monoids can be solved in $O(N \min\{|u|, |v|\})$ time.
Score-based generative modeling (SGM) is a highly successful approach for learning a probability distribution from data and generating further samples. We prove the first polynomial convergence guarantees for the core mechanic behind SGM: drawing samples from a probability density $p$ given a score estimate (an estimate of $\nabla \ln p$) that is accurate in $L^2(p)$. Compared to previous works, we do not incur error that grows exponentially in time or that suffers from a curse of dimensionality. Our guarantee works for any smooth distribution and depends polynomially on its log-Sobolev constant. Using our guarantee, we give a theoretical analysis of score-based generative modeling, which transforms white-noise input into samples from a learned data distribution given score estimates at different noise scales. Our analysis gives theoretical grounding to the observation that an annealed procedure is required in practice to generate good samples, as our proof depends essentially on using annealing to obtain a warm start at each step. Moreover, we show that a predictor-corrector algorithm gives better convergence than using either portion alone.
We consider the design of sublinear space and query complexity algorithms for estimating the cost of a minimum spanning tree (MST) and the cost of a minimum traveling salesman (TSP) tour in a metric on $n$ points. We first consider the $o(n)$-space regime and show that, when the input is a stream of all $\binom{n}{2}$ entries of the metric, for any $\alpha \ge 2$, both MST and TSP cost can be $\alpha$-approximated using $\tilde{O}(n/\alpha)$ space, and that $\Omega(n/\alpha^2)$ space is necessary for this task. Moreover, we show that even if the streaming algorithm is allowed $p$ passes over a metric stream, it still requires $\tilde{\Omega}(\sqrt{n/\alpha p^2})$ space. We next consider the semi-streaming regime, where computing even the exact MST cost is easy and the main challenge is to estimate TSP cost to within a factor that is strictly better than $2$. We show that, if the input is a stream of all edges of the weighted graph that induces the underlying metric, for any $\varepsilon > 0$, any one-pass $(2-\varepsilon)$-approximation of TSP cost requires $\Omega(\varepsilon^2 n^2)$ space; on the other hand, there is an $\tilde{O}(n)$ space two-pass algorithm that approximates the TSP cost to within a factor of 1.96. Finally, we consider the query complexity of estimating metric TSP cost to within a factor that is strictly better than $2$, when the algorithm is given access to a matrix that specifies pairwise distances between all points. For MST estimation in this model, it is known that a $(1+\varepsilon)$-approximation is achievable with $\tilde{O}(n/\varepsilon^{O(1)})$ queries. We design an algorithm that performs $\tilde{O}(n^{1.5})$ distance queries and achieves a strictly better than $2$-approximation when either the metric is known to contain a spanning tree supported on weight-$1$ edges or the algorithm is given access to a minimum spanning tree of the graph.
In [Meurant, Pape\v{z}, Tich\'y; Numerical Algorithms 88, 2021], we presented an adaptive estimate for the energy norm of the error in the conjugate gradient (CG) method. In this paper, we extend the estimate to algorithms for solving linear approximation problems with a general, possibly rectangular matrix that are based on applying CG to a system with a positive (semi-)definite matrix build from the original matrix. We show that the resulting estimate preserves its key properties: it can be very cheaply evaluated, and it is numerically reliable in finite-precision arithmetic under some mild assumptions. We discuss algorithms based on Hestenes-Stiefel-like implementation (often called CGLS and CGNE in the literature) as well as on bidiagonalization (LSQR and CRAIG), and both unpreconditioned and preconditioned variants. The numerical experiments confirm the robustness and very satisfactory behaviour of the estimate.
This article introduces new multiplicative updates for nonnegative matrix factorization with the $\beta$-divergence and sparse regularization of one of the two factors (say, the activation matrix). It is well known that the norm of the other factor (the dictionary matrix) needs to be controlled in order to avoid an ill-posed formulation. Standard practice consists in constraining the columns of the dictionary to have unit norm, which leads to a nontrivial optimization problem. Our approach leverages a reparametrization of the original problem into the optimization of an equivalent scale-invariant objective function. From there, we derive block-descent majorization-minimization algorithms that result in simple multiplicative updates for either $\ell_{1}$-regularization or the more "aggressive" log-regularization. In contrast with other state-of-the-art methods, our algorithms are universal in the sense that they can be applied to any $\beta$-divergence (i.e., any value of $\beta$) and that they come with convergence guarantees. We report numerical comparisons with existing heuristic and Lagrangian methods using various datasets: face images, an audio spectrogram, hyperspectral data, and song play counts. We show that our methods obtain solutions of similar quality at convergence (similar objective values) but with significantly reduced CPU times.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.