We establish higher-order nonasymptotic expansions for the difference between the probability distributions of sums of i.i.d. random vectors in a Euclidean space. The derived bounds are uniform over two classes of sets: the set of all Euclidean balls and the set of all half-spaces. These results make it possible to account for the impact of higher-order moments or cumulants of the considered distributions; the resulting error terms depend explicitly on the sample size and the dimension. The new inequalities outperform the accuracy of the normal approximation provided by existing Berry-Esseen inequalities under very general conditions. Under some symmetry assumptions on the probability distribution of the random summands, the obtained results are optimal in terms of the ratio between the dimension and the sample size. The new technique that we develop for establishing nonasymptotic higher-order expansions may be of independent interest. Using the new higher-order inequalities, we study the accuracy of the nonparametric bootstrap approximation and propose a bootstrap score test under possible model misspecification. The results of the paper also include explicit error bounds for general elliptic confidence regions for the expected value of the random summands, and optimality of the Gaussian anti-concentration inequality over the set of all Euclidean balls.
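For concreteness, the quantity being bounded admits the standard formulation below (our notation, not taken from the paper):
\[
  \Delta_n(\mathcal{A}) \;=\; \sup_{A \in \mathcal{A}} \Bigl| \mathbb{P}\Bigl( n^{-1/2}\textstyle\sum_{i=1}^{n} (X_i - \mathbb{E}X_1) \in A \Bigr) - \mathbb{P}( Z \in A ) \Bigr|,
\]
where $Z$ is a centered Gaussian vector with the same covariance as $X_1$ and $\mathcal{A}$ is either the class of all Euclidean balls or the class of all half-spaces; the higher-order expansions refine the Gaussian term $\mathbb{P}(Z\in A)$ by corrections depending on higher-order cumulants.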
We consider a distributionally robust stochastic optimization problem and formulate it as a stochastic two-level composition optimization problem with the use of the mean--semideviation risk measure. In this setting, we consider a single time-scale algorithm, involving two versions of the inner function value tracking: linearized tracking of a continuously differentiable loss function, and SPIDER tracking of a weakly convex loss function. We adopt the norm of the gradient of the Moreau envelope as our measure of stationarity and show that a sample complexity of $\mathcal{O}(\varepsilon^{-3})$ is achievable in both cases, with only a larger constant in the second case. Finally, we demonstrate the performance of our algorithm with a robust learning example and a weakly convex, non-smooth regression example.
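For reference, the stationarity measure mentioned above is the standard one based on the Moreau envelope: for a $\rho$-weakly convex $\varphi$ and $0<\lambda<\rho^{-1}$,
\[
  \varphi_\lambda(x) = \min_{y}\Bigl\{\varphi(y) + \tfrac{1}{2\lambda}\|y-x\|^2\Bigr\}, \qquad \|\nabla\varphi_\lambda(x)\| = \tfrac{1}{\lambda}\,\bigl\|x - \operatorname{prox}_{\lambda\varphi}(x)\bigr\|,
\]
and a point with small $\|\nabla\varphi_\lambda(x)\|$ is close to a near-stationary point of $\varphi$.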
We study the problem of fairly allocating a set of indivisible items to a set of agents with additive valuations. Recently, Feige et al. (WINE'21) proved that a maximin share (MMS) allocation exists for all instances with $n$ agents and no more than $n + 5$ items. Moreover, they proved that an MMS allocation is not guaranteed to exist for instances with $3$ agents and at least $9$ items, or $n \ge 4$ agents and at least $3n + 3$ items. In this work, we shrink the gap between these upper and lower bounds for guaranteed existence of MMS allocations. We prove that for any integer $c > 0$, there exists a number of agents $n_c$ such that an MMS allocation exists for any instance with $n \ge n_c$ agents and at most $n + c$ items, where $n_c \le \lfloor 0.6597^c \cdot c!\rfloor$ for allocation of goods and $n_c \le \lfloor 0.7838^c \cdot c!\rfloor$ for chores. Furthermore, we show that for $n \neq 3$ agents, all instances with $n + 6$ goods have an MMS allocation.
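To fix ideas, the maximin share of an agent is the value she can guarantee by partitioning the items into $n$ bundles and keeping the least valuable one. A brute-force sketch of this definition (our own illustration, not code from the paper):

```python
from itertools import product

def maximin_share(values, n):
    """Brute-force maximin share (MMS) of one agent for goods.

    values: the agent's additive valuations of the items
    n: number of agents (= number of bundles in the partition)
    Returns the maximum, over all partitions into n bundles, of the
    minimum bundle value.  Exponential time; only for tiny instances.
    """
    best = float("-inf")
    for assignment in product(range(n), repeat=len(values)):
        bundles = [0] * n
        for item, bundle in enumerate(assignment):
            bundles[bundle] += values[item]
        best = max(best, min(bundles))
    return best

# Example: one agent's view of 9 items, to be split among 3 agents.
print(maximin_share([9, 8, 7, 6, 5, 4, 3, 2, 1], 3))  # 15, e.g. {9,6}, {8,7}, {5,4,3,2,1}
```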
Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$, where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach to solving such $\ell_2$ regression problems is sketching: pick a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ for which $SA$ can be computed quickly, and solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution over dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ rows such that solving the sketched regression problem gives the $\ell_\infty$ guarantee with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly optimal up to logarithmic factors and significantly improves the result of [Price, Song and Woodruff, ICALP'17], which requires a number of rows super-linear in $d$, namely $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$. We also develop a novel analytical framework for $\ell_\infty$-guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than that of [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor products of vectors.
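As a minimal illustration of the sketch-and-solve template with a dense sketch (a plain Gaussian sketch for simplicity; the paper's construction and its $O(nd\log n)$ application time are different):

```python
import numpy as np

def sketched_least_squares(A, b, m, seed=None):
    """Solve min_x ||SAx - Sb||_2 for a dense Gaussian sketch S with m rows."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)  # dense sketching matrix
    x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x_sketch

# Compare the coordinate-wise (l_inf) error against the exact least-squares solution.
rng = np.random.default_rng(0)
n, d = 2000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
x_hat = sketched_least_squares(A, b, m=400, seed=1)
print(np.max(np.abs(x_hat - x_star)))
```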
In this paper, we study the identifiability and the estimation of the parameters of a copula-based multivariate model when the margins are unknown and arbitrary, meaning that they can be continuous, discrete, or mixtures of continuous and discrete. When at least one margin is not continuous, the range of values determining the copula is not the entire unit square, and this situation can lead to identifiability issues that are discussed here. Next, we propose estimation methods when the margins are unknown and arbitrary, using a pseudo log-likelihood adapted to the case of discontinuities. In view of applications to large data sets, we also propose a pairwise composite pseudo log-likelihood. These methodologies can also be easily modified to cover the case of parametric margins. One of the main theoretical results is an extension to arbitrary distributions of known convergence results for rank-based statistics when the margins are continuous. As a by-product, under smoothness assumptions, we obtain that the asymptotic distribution of the estimation errors of our estimators is Gaussian. Finally, numerical experiments are presented to assess the finite-sample performance of the estimators, and the usefulness of the proposed methodologies is illustrated with a copula-based regression model for hydrological data.
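For context, in the continuous-margin case the pseudo log-likelihood is built from rank-based pseudo-observations; a sketch for a Clayton copula (the standard continuous-case recipe, shown here only to fix notation; the paper's contribution is the adaptation to margins with discontinuities):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import rankdata

def clayton_loglik(theta, u, v):
    """Log pseudo-likelihood of the Clayton copula, summed over the sample."""
    t = u ** (-theta) + v ** (-theta) - 1.0
    return np.sum(np.log(theta + 1.0)
                  - (theta + 1.0) * (np.log(u) + np.log(v))
                  - (2.0 + 1.0 / theta) * np.log(t))

def fit_clayton_pseudo(x, y):
    """Rank-based pseudo log-likelihood estimator (continuous margins)."""
    n = len(x)
    u = rankdata(x) / (n + 1.0)  # pseudo-observations
    v = rankdata(y) / (n + 1.0)
    res = minimize_scalar(lambda th: -clayton_loglik(th, u, v),
                          bounds=(1e-3, 20.0), method="bounded")
    return res.x
```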
A proper $k$-coloring of a graph $G$ is a \emph{neighbor-locating $k$-coloring} if for each pair of vertices in the same color class, the sets of colors found in their neighborhoods are different. The neighbor-locating chromatic number $\chi_{NL}(G)$ is the minimum $k$ for which $G$ admits a neighbor-locating $k$-coloring. A proper $k$-coloring of a graph $G$ is a \emph{locating $k$-coloring} if for each pair of vertices $x$ and $y$ in the same color class, there exists a color class $S_i$ such that $d(x,S_i)\neq d(y,S_i)$. The locating chromatic number $\chi_{L}(G)$ is the minimum $k$ for which $G$ admits a locating $k$-coloring. It follows that $\chi(G)\leq\chi_L(G)\leq\chi_{NL}(G)$ for any graph $G$, where $\chi(G)$ is the usual chromatic number of $G$. We show that for any three integers $p,q,r$ with $2\leq p\leq q\leq r$ (except when $2=p=q<r$), there exists a connected graph $G_{p,q,r}$ with $\chi(G_{p,q,r})=p$, $\chi_L(G_{p,q,r})=q$ and $\chi_{NL}(G_{p,q,r})=r$. We also show that the locating chromatic number (resp., neighbor-locating chromatic number) of an induced subgraph of a graph $G$ can be arbitrarily larger than that of $G$. Alcon \textit{et al.} showed that the number $n$ of vertices of a connected graph $G$ with $\chi_{NL}(G)=k$ is bounded above by $k(2^{k-1}-1)$, and that this bound is tight. When $G$ has maximum degree $\Delta$, they also showed that a smaller upper bound on $n$, of order $k^{\Delta+1}$, holds. We generalize the latter by proving that if $G$ has order $n$ and at most $an+b$ edges, then $n$ is bounded above by a quantity of order $k^{2a+1}+2b$. Moreover, we describe constructions of such graphs that come close to attaining this bound.
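A small checker makes the first definition concrete (an illustrative sketch, not from the paper):

```python
from itertools import combinations

def is_neighbor_locating(adj, coloring):
    """Check that `coloring` is a proper, neighbor-locating coloring.

    adj: dict mapping each vertex to the set of its neighbors
    coloring: dict mapping each vertex to its color
    """
    # The coloring must be proper.
    for v, nbrs in adj.items():
        if any(coloring[v] == coloring[u] for u in nbrs):
            return False
    # Vertices in the same color class must see different sets of colors.
    for x, y in combinations(adj, 2):
        if coloring[x] == coloring[y]:
            if {coloring[u] for u in adj[x]} == {coloring[u] for u in adj[y]}:
                return False
    return True

# Path 0-1-2-3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(is_neighbor_locating(adj, {0: 1, 1: 2, 2: 1, 3: 3}))  # True
print(is_neighbor_locating(adj, {0: 1, 1: 2, 2: 1, 3: 2}))  # False: 0 and 2 both see {2}
```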
In statistics, independent, identically distributed random samples do not carry a natural ordering, and their statistics are typically invariant with respect to permutations of their order. Thus, an $n$-sample in a space $M$ can be considered as an element of the quotient space of $M^n$ modulo the permutation group. The present paper takes this definition of sample space and the related concept of orbit types as a starting point for developing a geometric perspective on statistics. We aim at deriving a general mathematical setting for studying the behavior of empirical and population means in spaces ranging from smooth Riemannian manifolds to general stratified spaces. We fully describe the orbifold and path-metric structure of the sample space when $M$ is a manifold or path-metric space, respectively. These results are non-trivial even when $M$ is Euclidean. We show that the infinite sample space exists in a Gromov-Hausdorff type sense and coincides with the Wasserstein space of probability distributions on $M$. We exhibit Fr\'echet means and $k$-means as metric projections onto 1-skeleta or $k$-skeleta in Wasserstein space, and we define a new and more general notion of polymeans. This geometric characterization via metric projections applies equally to sample and population means, and we use it to establish asymptotic properties of polymeans such as consistency and asymptotic normality.
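For reference, the Fr\'echet mean and its $k$-means analogue referred to above are the standard minimizers
\[
  \bar{x} \in \operatorname*{arg\,min}_{\mu\in M}\ \sum_{i=1}^{n} d(x_i,\mu)^2, \qquad (\mu_1,\dots,\mu_k) \in \operatorname*{arg\,min}_{\mu_1,\dots,\mu_k\in M}\ \sum_{i=1}^{n}\ \min_{1\le j\le k} d(x_i,\mu_j)^2;
\]
both functionals are invariant under permutations of the sample, so they descend to the quotient sample space.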
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training, using accumulator bit-width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average of 98.2% sparsity with an estimated compression rate of 46.5x, all while maintaining 99.2% of the floating-point performance.
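The constraint being enforced can be illustrated with a generic worst-case account of accumulator magnitude (our sketch; the paper derives its own bounds and enforces them through weight normalization):

```python
import math

def accumulator_bits(l1_norm_w, act_bits, signed_acts=False):
    """Conservative signed-accumulator bit width for one dot product.

    If the weights feeding a dot product have l1 norm ||w||_1 and the
    activations are act_bits-bit integers, then |sum_i w_i a_i| is at most
    ||w||_1 * max|a|, so the accumulator needs enough bits for that
    magnitude plus a sign bit.  Conversely, capping ||w||_1 during training
    keeps the accumulation within a target bit width.
    """
    max_act = 2 ** (act_bits - 1) if signed_acts else 2 ** act_bits - 1
    max_mag = l1_norm_w * max_act
    return math.ceil(math.log2(max_mag + 1)) + 1  # +1 for the sign bit

# Example: 512 signed 8-bit weights of magnitude <= 127, unsigned 8-bit activations.
print(accumulator_bits(l1_norm_w=512 * 127, act_bits=8))  # 25
```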
Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\xi\cdot\ell(\text{teacher's predictions}, \text{student's predictions}) + (1-\xi)\cdot\ell(\text{given labels}, \text{student's predictions})$, where $\ell$ is some loss function and $\xi$ is a parameter in $[0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than $1$. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
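A minimal sketch of this objective in PyTorch (cross-entropy for both terms; the helper below is our illustration, not the paper's code):

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, xi):
    """xi * loss(teacher preds, student preds) + (1 - xi) * loss(labels, student preds).

    Values xi > 1 are allowed and effectively put a negative weight on the
    given (possibly noisy) labels.
    """
    teacher_term = F.cross_entropy(student_logits, teacher_logits.softmax(dim=-1))
    label_term = F.cross_entropy(student_logits, labels)
    return xi * teacher_term + (1.0 - xi) * label_term
```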
We consider $L^2$-approximation on weighted reproducing kernel Hilbert spaces of functions depending on infinitely many variables. We focus on unrestricted linear information, admitting evaluations of arbitrary continuous linear functionals. We distinguish between ANOVA and non-ANOVA spaces, where, by ANOVA spaces, we refer to function spaces whose norms are induced by an underlying ANOVA function decomposition. In ANOVA spaces, we prove that there is an optimal algorithm to solve the approximation problem using linear information. This way, we can determine the exact polynomial convergence rate of $n$-th minimal worst-case errors. For non-ANOVA spaces, we also establish upper and lower error bounds. Even though the bounds do not match in this case, they reveal that for weights with a moderate decay behavior, the convergence rate of $n$-th minimal errors is strictly higher in ANOVA than in non-ANOVA spaces.
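Here the $n$-th minimal worst-case error is the usual information-based complexity quantity (standard definition, stated for orientation):
\[
  e(n) \;=\; \inf_{L_1,\dots,L_n}\ \inf_{\phi}\ \sup_{\|f\|\le 1}\ \bigl\| f - \phi\bigl(L_1(f),\dots,L_n(f)\bigr) \bigr\|_{L^2},
\]
where the infima run over continuous linear functionals $L_1,\dots,L_n$ and over all reconstruction maps $\phi$, the supremum is over the unit ball of the weighted space, and the polynomial convergence rate is $\sup\{\alpha\ge 0: e(n)\le C_\alpha n^{-\alpha} \text{ for some } C_\alpha<\infty\}$.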
Biedl et al. introduced the minimum ply cover problem in CG 2021, following the seminal work of Erlebach and van Leeuwen in SODA 2008. They showed that determining the minimum ply cover number of a given set of points by a given set of axis-parallel unit squares is NP-hard, and gave a polynomial-time $2$-approximation algorithm for instances in which the minimum ply cover number is bounded by a constant. Durocher et al. recently presented a polynomial-time $(8 + \epsilon)$-approximation algorithm for the general case when the minimum ply cover number is $\omega(1)$, for every fixed $\epsilon > 0$. They divide the problem into subproblems using a standard grid decomposition technique, design an involved dynamic programming scheme to solve each subproblem, where a subproblem is defined by a grid cell that is a square of unit side length, and then merge the solutions of the subproblems to obtain the final ply cover. We instead use a horizontal slab decomposition to divide the problem into subproblems. Our algorithm uses a simple greedy heuristic and achieves a $(27+\epsilon)$-approximation for the general problem, for any small constant $\epsilon>0$. It runs considerably faster than the algorithm of Durocher et al. We also give a fast $2$-approximation algorithm for the special case where the input squares are intersected by a horizontal line. The hardness of this special case is still open. Our algorithm can potentially be extended to minimum ply covering with other geometric objects such as unit disks and identical rectangles.
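Recall that the ply of a cover is the largest number of chosen squares stabbed by a single point; a brute-force evaluation over candidate corner points makes the objective concrete (our own illustration, unrelated to the approximation algorithms above):

```python
def ply(squares, side=1.0):
    """Ply of a set of axis-parallel squares given by lower-left corners (x, y):
    the maximum number of squares covering a common point.  A deepest point can
    be taken at some (left edge, bottom edge) pair, so brute force over those.
    """
    xs = [x for x, _ in squares]
    ys = [y for _, y in squares]
    return max(sum(1 for (x, y) in squares
                   if x <= px <= x + side and y <= py <= y + side)
               for px in xs for py in ys)

# Three unit squares, two of which overlap.
print(ply([(0.0, 0.0), (0.5, 0.5), (3.0, 3.0)]))  # 2
```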