Challenges with data in the big-data era include (i) the dimension $p$ often being larger than the sample size $n$, and (ii) outliers or contaminated points frequently being hidden and harder to detect. Challenge (i) renders most conventional methods inapplicable and has therefore attracted tremendous attention in the statistics, computer science, and biomedical communities; numerous penalized regression methods have been introduced as modern tools for analyzing high-dimensional data. Challenge (ii), by contrast, has received disproportionately little attention. Penalized regression methods handle challenge (i) very well and are often expected to cope with challenge (ii) simultaneously. Most of them, however, can break down in the presence of a single outlier (or a single adversarially contaminated point), as revealed in this article. We systematically examine leading penalized regression methods in the literature in terms of their robustness, provide a quantitative assessment, and show that most of them can indeed break down under a single outlier. Consequently, a novel robust penalized regression method based on the least sum of squares of depth-trimmed residuals is proposed and studied carefully. Experiments with simulated and real data reveal that the newly proposed method outperforms some leading competitors in estimation and prediction accuracy in the cases considered.
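As a rough illustration of the general idea of fitting after trimming residuals by a depth-type score, here is a minimal Python sketch written for this summary; it is not the estimator proposed in the paper (the ridge penalty, the centrality score, and the trimming fraction are all assumptions made for illustration).

    import numpy as np

    def trimmed_penalized_fit(X, y, lam=1.0, keep=0.8, iters=20):
        """Toy sketch: alternate between (a) scoring residuals by a simple
        depth-type centrality measure and (b) ridge-penalized least squares on
        the most central points.  Illustrative only; not the paper's method."""
        n, p = X.shape
        beta = np.zeros(p)
        m = int(keep * n)
        for _ in range(iters):
            r = y - X @ beta
            scale = np.median(np.abs(r - np.median(r))) + 1e-12
            depth = 1.0 / (1.0 + np.abs(r - np.median(r)) / scale)
            idx = np.argsort(-depth)[:m]                   # most central residuals
            Xk, yk = X[idx], y[idx]
            beta = np.linalg.solve(Xk.T @ Xk + lam * np.eye(p), Xk.T @ yk)
        return beta

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    beta_true = np.array([2.0, -1.0, 0.0, 0.0, 1.0])
    y = X @ beta_true + 0.1 * rng.normal(size=100)
    y[0] += 100.0                                          # one adversarial point
    print(np.round(trimmed_penalized_fit(X, y), 2))        # close to beta_true

On this toy data set the trimmed fit stays close to beta_true despite the gross outlier, which is the qualitative behavior the abstract is after.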
We give a strongly explicit construction of $\varepsilon$-approximate $k$-designs for the orthogonal group $\mathrm{O}(N)$ and the unitary group $\mathrm{U}(N)$, for $N=2^n$. Our designs are of cardinality $\mathrm{poly}(N^k/\varepsilon)$ (equivalently, they have seed length $O(nk + \log(1/\varepsilon))$); up to the polynomial, this matches the number of design elements used by the construction consisting of completely random matrices.
A code $C \subseteq \{0, 1, 2\}^n$ of length $n$ is called trifferent if for any three distinct elements of $C$ there exists a coordinate in which they all differ. By $T(n)$ we denote the maximum cardinality of a trifferent code of length $n$. The values $T(5)=10$ and $T(6)=13$ were recently determined. Here we determine $T(7)=16$, $T(8)=20$, and $T(9)=27$. In the latter case, $n=9$, there also exist linear codes attaining the maximum possible cardinality $27$.
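A small Python check of the trifference property, written for this summary (the length-3 example below, a code of size 6, is not taken from the paper):

    from itertools import combinations

    def is_trifferent(code):
        """Check whether a set of equal-length words over {0,1,2} is trifferent:
        every three distinct codewords have a coordinate where all three differ."""
        code = [tuple(w) for w in code]
        for a, b, c in combinations(code, 3):
            if not any(len({a[i], b[i], c[i]}) == 3 for i in range(len(a))):
                return False
        return True

    C = ["000", "111", "222", "012", "120", "201"]
    print(is_trifferent([list(map(int, w)) for w in C]))  # True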
We study the problem of nonparametric estimation of the density $\pi$ of the stationary distribution of a $d$-dimensional stochastic differential equation $(X_t)_{t \in [0, T]}$. From continuous observation of the sample path on $[0, T]$, we study the estimation of $\pi(x)$ as $T$ goes to infinity. For $d\ge2$, we characterize the minimax rate for the $\mathbf{L}^2$-risk in pointwise estimation over a class of anisotropic H\"older functions $\pi$ with regularity $\beta = (\beta_1, ... , \beta_d)$. For $d \ge 3$, our finding is that, having ordered the smoothness so that $\beta_1 \le ... \le \beta_d$, the minimax rate depends on whether $\beta_2 < \beta_3$ or $\beta_2 = \beta_3$. In the first case, the rate is $(\frac{\log T}{T})^\gamma$, and in the second case it is $(\frac{1}{T})^\gamma$, where $\gamma$ is an explicit exponent depending on the dimension and on $\bar{\beta}_3$, the harmonic mean of the smoothness over the remaining $d-2$ directions after excluding $\beta_1$ and $\beta_2$, the two smallest ones. We also demonstrate that kernel-based estimators achieve the optimal minimax rate. Furthermore, we propose an adaptive procedure for both the $L^2$ integrated and the pointwise risk. In the two-dimensional case, we show that kernel density estimators achieve the rate $\frac{\log T}{T}$, which is optimal in the minimax sense. Finally, we illustrate the validity of our theoretical findings with numerical results.
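To make the kernel estimator concrete, here is a hedged Python sketch that approximates $\pi(x)$ by a product-kernel average along a discretized path, using a two-dimensional Ornstein-Uhlenbeck process as a stand-in diffusion; the process, the Gaussian kernel, and the bandwidths are illustrative choices and not those analyzed in the paper.

    import numpy as np

    # simulate a 2-d Ornstein-Uhlenbeck path (stationary law N(0, I/2))
    rng = np.random.default_rng(0)
    T, dt = 500.0, 0.01
    n = int(T / dt)
    X = np.zeros((n, 2))
    for i in range(1, n):
        X[i] = X[i - 1] - X[i - 1] * dt + np.sqrt(dt) * rng.normal(size=2)

    def pi_hat(x, path, h):
        """Riemann-sum approximation of (1/T) * int_0^T K_h(x - X_t) dt
        with a Gaussian product kernel and bandwidth vector h."""
        u = (x - path) / h                           # (n, d) standardized differences
        K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        return np.mean(np.prod(K / h, axis=1))

    x0 = np.array([0.0, 0.0])
    print(pi_hat(x0, X, h=np.array([0.2, 0.2])))     # compare with 1/pi ~ 0.318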
Statistical depth functions order the elements of a space with respect to their centrality in a probability distribution or dataset. Since many depth functions are maximized on the real line by the median, they provide a natural approach to defining median-like location estimators for more general types of data (in our case, fuzzy data). We analyze the relationships between depth-based medians, medians based on the support function, and some notions of a median for fuzzy data in the literature. We take advantage of specific depth functions for fuzzy data defined in our previous papers: adaptations of Tukey depth, simplicial depth, $L^1$-depth, and projection depth.
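As a quick reminder of the median-as-deepest-point principle invoked above (ordinary real-valued Tukey depth here, not the fuzzy-data adaptations studied in the paper), the following Python sketch shows the empirical halfspace depth on the real line being maximized at the sample median:

    import numpy as np

    def tukey_depth_1d(x, sample):
        """Empirical halfspace (Tukey) depth of a point x in a univariate sample:
        the smaller of the two tail proportions at x."""
        sample = np.asarray(sample)
        return min(np.mean(sample <= x), np.mean(sample >= x))

    rng = np.random.default_rng(0)
    data = rng.exponential(size=101)
    grid = np.linspace(data.min(), data.max(), 2001)
    deepest = grid[np.argmax([tukey_depth_1d(x, data) for x in grid])]
    print(deepest, np.median(data))   # deepest grid point sits at (or next to) the median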
Given a pair of non-negative random variables $X$ and $Y$, we introduce a class of nonparametric tests for the null hypothesis that $X$ dominates $Y$ in the total time on test (TTT) order. Critical values are determined using bootstrap-based inference, and the tests are shown to be consistent. The same approach is used to construct tests for the excess wealth order. As a byproduct, we also obtain a class of goodness-of-fit tests for the NBUE (new better than used in expectation) family of distributions.
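One way to picture the ingredients of such a test (a rough Python sketch written for this summary, not the authors' procedure): compare the empirical total-time-on-test transforms of the two samples and measure the largest violation of the hypothesized ordering; bootstrap critical values would then be attached to this statistic. The sign convention used here (dominance meaning a pointwise larger TTT transform) and the statistic itself are assumptions made for illustration.

    import numpy as np

    def ttt_curve(sample):
        """Empirical total time on test at p = i/n, i = 1..n:
        (1/n) * sum_{j<=i} (n - j + 1) * (X_(j) - X_(j-1)), with X_(0) = 0."""
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        gaps = np.diff(np.concatenate(([0.0], x)))
        return np.cumsum((n - np.arange(n)) * gaps) / n

    def largest_violation(x, y, grid_size=200):
        """Largest positive deviation of TTT(y) above TTT(x) on a grid of p-values;
        values near zero are consistent with 'x dominates y' under the convention
        assumed here."""
        p = np.linspace(0.05, 0.95, grid_size)
        tx = np.interp(p, np.arange(1, len(x) + 1) / len(x), ttt_curve(x))
        ty = np.interp(p, np.arange(1, len(y) + 1) / len(y), ttt_curve(y))
        return float(np.max(np.clip(ty - tx, 0.0, None)))

    rng = np.random.default_rng(1)
    x, y = rng.exponential(2.0, 300), rng.exponential(1.0, 300)
    print(largest_violation(x, y), largest_violation(y, x))   # near 0 vs. clearly positive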
We describe a new dependent-rounding algorithmic framework for bipartite graphs. Given a fractional assignment $y$ of values to edges of a graph $G = (U \cup V, E)$, the algorithms return an integral solution $Y$ such that each right-node $v \in V$ has at most one neighboring edge $f$ with $Y_f = 1$, and such that the variables $Y_e$ satisfy broad nonpositive-correlation properties. In particular, for any edges $e_1, e_2$ sharing a left-node $u \in U$, the variables $Y_{e_1}, Y_{e_2}$ have strong negative-correlation properties, i.e., the expectation of $Y_{e_1} Y_{e_2}$ is significantly below $y_{e_1} y_{e_2}$. Our algorithm is based on generating negatively correlated Exponential random variables and using them in a contention-resolution scheme inspired by an algorithm of Im & Shadloo (2020); it gives stronger and much more flexible negative-correlation properties. Dependent-rounding schemes with negative-correlation properties have been used in approximation algorithms for job scheduling on unrelated machines to minimize weighted completion times (Bansal, Srinivasan, & Svensson (2021), Im & Shadloo (2020), Im & Li (2023)). Using our new dependent-rounding algorithm, among other improvements, we obtain a $1.4$-approximation for this problem, significantly improving over the prior $1.45$-approximation ratio of Im & Li (2023).
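To give a feel for exponential-clock contention resolution in this setting, here is a simplified Python sketch with independent clocks; it preserves the marginals and selects at most one edge per right-node, but does not reproduce the paper's key ingredient, the negatively correlated Exponential variables and the resulting correlation guarantees.

    import numpy as np

    def round_once(y, rng):
        """y: dict mapping (u, v) -> fractional value, with sum_u y[(u, v)] <= 1
        for every right-node v.  Returns the set of edges rounded to 1, at most
        one per right-node, with P[e selected] = y[e]."""
        chosen = set()
        right_nodes = {v for (_, v) in y}
        for v in right_nodes:
            edges = [e for e in y if e[1] == v]
            clocks = {e: rng.exponential(1.0) / y[e] for e in edges}   # Exp(rate y_e)
            winner = min(clocks, key=clocks.get)
            if rng.random() < sum(y[e] for e in edges):                # restore marginal
                chosen.add(winner)
        return chosen

    # small example: two left-nodes, two right-nodes
    y = {("u1", "v1"): 0.3, ("u2", "v1"): 0.5, ("u1", "v2"): 0.4, ("u2", "v2"): 0.2}
    rng = np.random.default_rng(0)
    counts = {e: 0 for e in y}
    for _ in range(20000):
        for e in round_once(y, rng):
            counts[e] += 1
    print({e: round(c / 20000, 3) for e, c in counts.items()})   # approx. the y values

The acceptance step with probability $\sum_{e \ni v} y_e$ restores $\mathbb{P}[Y_e = 1] = y_e$ exactly in this toy scheme.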
We consider the problem of finding edge-disjoint paths between given pairs of vertices in a sufficiently strong $d$-regular expander graph $G$ with $n$ vertices. In particular, we describe a deterministic, polynomial-time algorithm which maintains an initially empty collection of edge-disjoint paths $\mathcal P$ in $G$ and processes an arbitrary sequence of requests of the following two types: 1. Given two vertices $a$ and $b$ such that each appears as an endpoint in $O(d)$ paths of $\mathcal P$ and, additionally, $|\mathcal P| = O(n d / \log n)$, find a path of length at most $\log n$ connecting $a$ and $b$ which is edge-disjoint from all other paths in $\mathcal P$, and add it to $\mathcal P$. 2. Remove a given path $P \in \mathcal{P}$ from $\mathcal{P}$. Importantly, each request is processed before the next one is seen. The upper bound on the length of the found paths and the constraints on the requests are best possible up to a constant factor. This establishes the first online algorithm for finding edge-disjoint paths in expanders which also allows removals, significantly strengthening a long list of previous results on the topic.
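For concreteness, a hedged Python sketch of the request interface only (plain BFS over currently unused edges; it carries none of the expander-based guarantees on path length or feasibility established in the paper):

    from collections import deque

    class DisjointPaths:
        """Maintain edge-disjoint paths in an undirected graph, supporting the two
        request types from the abstract: add a path between a and b that avoids
        all edges already in use, and remove a previously added path."""

        def __init__(self, adj):
            self.adj = adj                 # adjacency lists: {vertex: [neighbors]}
            self.used = set()              # edges (as frozensets) on current paths
            self.paths = []

        def add_path(self, a, b):
            parent = {a: None}
            queue = deque([a])
            while queue:                   # BFS restricted to unused edges
                x = queue.popleft()
                if x == b:
                    break
                for w in self.adj[x]:
                    if w not in parent and frozenset((x, w)) not in self.used:
                        parent[w] = x
                        queue.append(w)
            if b not in parent:
                return None                # no edge-disjoint path found
            path, x = [b], b
            while parent[x] is not None:
                x = parent[x]
                path.append(x)
            path.reverse()
            for x, w in zip(path, path[1:]):
                self.used.add(frozenset((x, w)))
            self.paths.append(path)
            return path

        def remove_path(self, path):
            self.paths.remove(path)
            for x, w in zip(path, path[1:]):
                self.used.discard(frozenset((x, w)))

    # cycle on 5 vertices as a toy graph
    adj = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    dp = DisjointPaths(adj)
    print(dp.add_path(0, 2), dp.add_path(0, 2))   # second request uses the other arc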
High-dimensional central limit theorems have been intensively studied with most focus being on the case where the data is sub-Gaussian or sub-exponential. However, heavier tails are omnipresent in practice. In this article, we study the critical growth rates of dimension $d$ below which Gaussian approximations are asymptotically valid but beyond which they are not. We are particularly interested in how these thresholds depend on the number of moments $m$ that the observations possess. For every $m\in(2,\infty)$, we construct i.i.d. random vectors $\textbf{X}_1,...,\textbf{X}_n$ in $\mathbb{R}^d$, the entries of which are independent and have a common distribution (independent of $n$ and $d$) with finite $m$th absolute moment, and such that the following holds: if there exists an $\varepsilon\in(0,\infty)$ such that $d/n^{m/2-1+\varepsilon}\not\to 0$, then the Gaussian approximation error (GAE) satisfies $$ \limsup_{n\to\infty}\sup_{t\in\mathbb{R}}\left[\mathbb{P}\left(\max_{1\leq j\leq d}\frac{1}{\sqrt{n}}\sum_{i=1}^n\textbf{X}_{ij}\leq t\right)-\mathbb{P}\left(\max_{1\leq j\leq d}\textbf{Z}_j\leq t\right)\right]=1,$$ where $\textbf{Z} \sim \mathsf{N}_d(\textbf{0}_d,\mathbf{I}_d)$. On the other hand, a result in Chernozhukov et al. (2023a) implies that the left-hand side above is zero if just $d/n^{m/2-1-\varepsilon}\to 0$ for some $\varepsilon\in(0,\infty)$. In this sense, there is a moment-dependent phase transition at the threshold $d=n^{m/2-1}$ above which the limiting GAE jumps from zero to one.
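A small simulation sketch of the Gaussian approximation error appearing in the display (illustrative only: the Student-$t$ design, the sizes $n$ and $d$, and the evaluation grid are convenience choices, not the construction from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, B = 200, 1000, 200
    nu = 2.5                                   # finite absolute moments only up to order < 2.5

    max_stats, gauss_max = np.empty(B), np.empty(B)
    for b in range(B):
        X = rng.standard_t(nu, size=(n, d)) / np.sqrt(nu / (nu - 2))  # unit variance entries
        max_stats[b] = np.max(X.sum(axis=0) / np.sqrt(n))
        gauss_max[b] = np.max(rng.standard_normal(d))

    # empirical version of sup_t [ P(max_j n^{-1/2} sum_i X_ij <= t) - P(max_j Z_j <= t) ]
    grid = np.linspace(min(max_stats.min(), gauss_max.min()),
                       max(max_stats.max(), gauss_max.max()), 1000)
    gae = np.max(np.mean(max_stats[:, None] <= grid, axis=0)
                 - np.mean(gauss_max[:, None] <= grid, axis=0))
    print(gae)     # Monte Carlo estimate of the signed GAE for this particular design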
We investigate the randomized decision tree complexity of a specific class of read-once threshold functions. A read-once threshold formula can be defined by a rooted tree, every internal node of which is labeled by a threshold function $T_k^n$ (which outputs 1 exactly when at least $k$ of its $n$ input bits are 1) and every leaf by a distinct variable. Such a tree defines a Boolean function in a natural way. We focus on the randomized decision tree complexity of such functions when the underlying tree is a uniform tree with all of its internal nodes labeled by the same threshold function. We prove lower bounds of the form $c(k,n)^d$, where $d$ is the depth of the tree. We also treat trees with alternating levels of AND and OR gates separately and show asymptotically optimal bounds, extending the known bounds for the binary case.
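For concreteness, a short Python sketch (written for this summary) evaluating a uniform read-once threshold formula: a depth-$d$ tree whose internal nodes all compute the same gate $T_k^n$ and whose $n^d$ leaves read distinct input bits.

    def eval_uniform_threshold_tree(bits, n, k, depth):
        """Evaluate a uniform depth-`depth` tree in which every internal node is
        the threshold gate T_k^n (output 1 iff at least k of its n children output
        1) and the leaves read the entries of `bits` in order."""
        assert len(bits) == n ** depth
        level = list(bits)
        for _ in range(depth):
            level = [int(sum(level[i:i + n]) >= k)
                     for i in range(0, len(level), n)]
        return level[0]

    # depth-2 tree of T_2^3 gates (2-out-of-3 majority) on 9 input bits
    print(eval_uniform_threshold_tree([1, 0, 1, 0, 0, 1, 1, 1, 0], n=3, k=2, depth=2))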
The deletion distance between two binary words $u,v \in \{0,1\}^n$ is the smallest $k$ such that $u$ and $v$ share a common subsequence of length $n-k$. A set $C$ of binary words of length $n$ is called a $k$-deletion code if every pair of distinct words in $C$ has deletion distance greater than $k$. In 1965, Levenshtein initiated the study of deletion codes by showing that, for $k\ge 1$ fixed and $n$ going to infinity, a $k$-deletion code $C\subseteq \{0,1\}^n$ of maximum size satisfies $\Omega_k(2^n/n^{2k}) \leq |C| \leq O_k(2^n/n^k)$. We make the first asymptotic improvement to these bounds by showing that there exist $k$-deletion codes with size at least $\Omega_k(2^n \log n/n^{2k})$. Our proof is inspired by Jiang and Vardy's improvement to the classical Gilbert--Varshamov bounds. We also establish several related results on the number of longest common subsequences and shortest common supersequences of a pair of words with given length and deletion distance.
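For concreteness, a brief Python sketch (not from the paper) computing the deletion distance via the standard longest-common-subsequence dynamic program and checking the $k$-deletion-code condition for a tiny set of words:

    from itertools import combinations

    def lcs_length(u, v):
        """Classical O(|u||v|) dynamic program for the longest common subsequence."""
        dp = [[0] * (len(v) + 1) for _ in range(len(u) + 1)]
        for i, a in enumerate(u, 1):
            for j, b in enumerate(v, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
        return dp[len(u)][len(v)]

    def deletion_distance(u, v):
        """Smallest k such that the length-n words u, v share a common
        subsequence of length n - k."""
        return len(u) - lcs_length(u, v)

    def is_k_deletion_code(code, k):
        return all(deletion_distance(u, v) > k for u, v in combinations(code, 2))

    C = ["0000", "1111", "0101"]
    print([deletion_distance(u, v) for u, v in combinations(C, 2)])  # [4, 2, 2]
    print(is_k_deletion_code(C, 1))                                  # True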