In the random geometric graph model $\mathsf{Geo}_d(n,p)$, we identify each of our $n$ vertices with an independently and uniformly sampled vector from the $d$-dimensional unit sphere, and we connect pairs of vertices whose vectors are ``sufficiently close'', such that the marginal probability of an edge is $p$. We investigate the problem of testing for this latent geometry, or in other words, distinguishing an Erd\H{o}s-R\'enyi graph $\mathsf{G}(n, p)$ from a random geometric graph $\mathsf{Geo}_d(n, p)$. It is not too difficult to show that if $d\to \infty$ while $n$ is held fixed, the two distributions become indistinguishable; we wish to understand how fast $d$ must grow as a function of $n$ for indistinguishability to occur. When $p = \frac{\alpha}{n}$ for constant $\alpha$, we prove that if $d \ge \mathrm{polylog}(n)$, the total variation distance between the two distributions is close to $0$; this improves upon the best previous bound of Brennan, Bresler, and Nagaraj (2020), which required $d \gg n^{3/2}$. Moreover, our result is nearly tight, resolving a conjecture of Bubeck, Ding, Eldan, and R\'{a}cz (2016) up to logarithmic factors. We also obtain improved upper bounds on the statistical indistinguishability thresholds in $d$ for the full range of $p$ satisfying $\frac{1}{n}\le p\le \frac{1}{2}$, improving upon the previous bounds by polynomial factors. Our analysis uses the Belief Propagation algorithm to characterize the distributions of (subsets of) the random vectors {\em conditioned on producing a particular graph}. In this sense, our analysis is connected to the ``cavity method'' from statistical physics. To analyze this process, we rely on novel sharp estimates for the area of the intersection of a random sphere cap with an arbitrary subset of the sphere, which we prove using optimal transport maps and entropy-transport inequalities on the unit sphere.
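To make the two distributions concrete, here is a minimal NumPy sketch of sampling $\mathsf{Geo}_d(n,p)$ and $\mathsf{G}(n,p)$; it calibrates the connection threshold by an empirical quantile rather than the exact formula (which involves the regularized incomplete beta function), so the calibration and all names are illustrative.

```python
import numpy as np

def sample_geo(n, d, p, rng=np.random.default_rng(0)):
    """Adjacency matrix of Geo_d(n, p): i.i.d. uniform points on the
    sphere S^{d-1}, connected when their inner product is large."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on the sphere
    gram = x @ x.T
    pairs = gram[np.triu_indices(n, k=1)]
    t = np.quantile(pairs, 1 - p)                   # empirical calibration of the cap
    return (gram > t) & ~np.eye(n, dtype=bool)

def sample_er(n, p, rng=np.random.default_rng(1)):
    """Adjacency matrix of the Erdos-Renyi graph G(n, p)."""
    u = np.triu(rng.random((n, n)) < p, k=1)
    return u | u.T
```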
Motivated by the serious problem that hospitals in rural areas suffer from a shortage of residents, we study the Hospitals/Residents model in which hospitals are associated with lower quotas and the objective is to satisfy them as much as possible. When preference lists are strict, the number of residents assigned to each hospital is the same in every stable matching by the well-known rural hospitals theorem; thus there is no room for algorithmic intervention. However, when ties are introduced into preference lists, this no longer holds, because the number of assigned residents may vary across stable matchings. In this paper, we formulate an optimization problem to find a stable matching with the maximum total satisfaction ratio for lower quotas. We first investigate how the total satisfaction ratio varies over choices of stable matchings in four natural scenarios and provide the exact values of these maximum gaps. Subsequently, we propose a strategy-proof approximation algorithm for our problem; in one scenario it solves the problem optimally, and in the other three scenarios, which are NP-hard, it yields a better approximation factor than that of a naive tie-breaking method. Finally, we show inapproximability results for the above-mentioned three NP-hard scenarios.
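As a point of reference for the objective, the following sketch computes a total satisfaction ratio for a given matching, under the (labeled) assumption that a hospital $h$ with lower quota $\ell(h) > 0$ contributes $\min\{1, |M(h)|/\ell(h)\}$, summed over hospitals:

```python
from collections import Counter

def total_satisfaction_ratio(matching, lower_quota):
    """Total satisfaction ratio of `matching` (resident -> hospital),
    assuming hospital h contributes min(1, |M(h)| / lower_quota[h]);
    this definition is an assumption made for illustration."""
    load = Counter(matching.values())
    return sum(min(1.0, load[h] / q) for h, q in lower_quota.items() if q > 0)
```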
We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that, under a mild condition, the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posedness property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of the $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on well-posedness and minimax lower bounds are of independent interest for studying not only other nonparametric estimators of the $Q$-function but also efficient estimation of the value of any target policy in off-policy settings.
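For intuition, the sieve two-stage least squares estimator can be sketched as ordinary 2SLS applied to the Bellman residual, using the NPIV moment condition $E[\psi(s,a)\,(Q(s,a) - \gamma Q(s',\pi) - r)] = 0$ with $Q(s,a) = \phi(s,a)^\top \beta$; the basis matrices, shapes, and names below are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def sieve_2sls_q(phi_sa, phi_next, psi, r, gamma):
    """Sketch of a sieve 2SLS estimate of Q-function coefficients.
    phi_sa:   (N, k) sieve basis at observed (s, a)
    phi_next: (N, k) basis at s' averaged over the target policy pi
    psi:      (N, m) instrument basis, m >= k
    r:        (N,)   observed rewards
    """
    A = phi_sa - gamma * phi_next                      # Bellman "regressors"
    proj = psi @ np.linalg.pinv(psi.T @ psi) @ psi.T   # projection onto instruments
    beta, *_ = np.linalg.lstsq(A.T @ proj @ A, A.T @ proj @ r, rcond=None)
    return beta                                        # Q(s, a) ~ phi(s, a) @ beta
```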
We study the two inference problems of detecting and recovering an isolated community of \emph{general} structure planted in a random graph. The detection problem is formalized as a hypothesis testing problem, where under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph $\mathcal{G}(n,q)$ with edge density $q\in(0,1)$; under the alternative, there is an unknown structure $\Gamma_k$ on $k$ nodes, planted in $\mathcal{G}(n,q)$, such that it appears as an \emph{induced subgraph}. In the case of successful detection, we are concerned with the task of recovering the corresponding structure. For these problems, we investigate the fundamental limits from both the statistical and computational perspectives. Specifically, we derive lower bounds for detecting/recovering the structure $\Gamma_k$ in terms of the parameters $(n,k,q)$, as well as certain properties of $\Gamma_k$, and exhibit computationally unbounded optimal algorithms that achieve these lower bounds. We also consider the problem of testing in polynomial time. As is common in many similar structured high-dimensional problems, our model undergoes an ``easy-hard-impossible'' phase transition, and computational constraints can severely penalize the statistical performance. To provide evidence for this phenomenon, we show that the class of low-degree polynomial algorithms matches the statistical performance of the polynomial-time algorithms we develop.
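As a toy illustration of the low-degree framework (and emphatically not the paper's statistic), the total edge count is a degree-1 polynomial of the adjacency entries, and thresholding its deviation from the null mean already gives a simple polynomial-time test:

```python
import numpy as np

def edge_count_test(adj, q, tau=3.0):
    """Reject the null G(n, q) when the edge count deviates from its null
    mean by more than tau null standard deviations (illustrative only)."""
    n = adj.shape[0]
    m = adj.sum() / 2
    mean = q * n * (n - 1) / 2
    std = np.sqrt(n * (n - 1) / 2 * q * (1 - q))
    return abs(m - mean) > tau * std
```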
We revisit the outlier hypothesis testing framework of Li \emph{et al.} (TIT 2014) and derive fundamental limits for the optimal test. In outlier hypothesis testing, one is given multiple observed sequences, where most sequences are generated i.i.d. from a nominal distribution. The task is to discern the set of outlying sequences that are generated according to anomalous distributions. The nominal and anomalous distributions are \emph{unknown}. We consider the case of multiple outliers, where the number of outliers is unknown and each outlier can follow a different anomalous distribution. In this setting, we study the tradeoff among the probabilities of misclassification error, false alarm, and false reject. Specifically, we propose a threshold-based test that ensures exponential decay of the misclassification error and false alarm probabilities. We study two constraints on the false reject probability: one requires it to be a non-vanishing constant, and the other requires it to decay at an exponential rate. For both cases, we characterize bounds on the false reject probability, as a function of the threshold, for each tuple of nominal and anomalous distributions. Finally, we demonstrate the asymptotic optimality of our test under the generalized Neyman-Pearson criterion.
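A hedged sketch of one natural threshold-based test: score each sequence by the KL divergence between its empirical distribution and the empirical distribution pooled over the remaining sequences, and flag the sequences whose score exceeds the threshold (the paper's exact scoring rule and threshold analysis may differ).

```python
import numpy as np

def threshold_outlier_test(seqs, alphabet_size, thresh):
    """Return the indices declared outliers; `seqs` is a list of integer
    arrays over {0, ..., alphabet_size - 1}."""
    emp = np.array([np.bincount(s, minlength=alphabet_size) / len(s) for s in seqs])
    outliers = []
    for i in range(len(seqs)):
        pooled = np.delete(emp, i, axis=0).mean(axis=0)   # leave-one-out estimate
        mask = emp[i] > 0
        kl = np.sum(emp[i][mask] * np.log(emp[i][mask] / np.maximum(pooled[mask], 1e-12)))
        if kl > thresh:
            outliers.append(i)
    return outliers
```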
Let $P$ be a polyhedron defined by a system $A x \leq b$, where $A \in \mathbb{Z}^{m \times n}$, $\operatorname{rank}(A) = n$, and $b \in \mathbb{Z}^{m}$. In the Integer Feasibility Problem, we need to decide whether $P \cap \mathbb{Z}^n = \emptyset$ or to find some $x \in P \cap \mathbb{Z}^n$ in the opposite case. Currently, the state-of-the-art algorithm, due to \cite{DadushDis,DadushFDim} (see also \cite{Convic,ConvicComp,DConvic} for more general formulations), has the complexity bound $O(n)^n \cdot \mathrm{poly}(\phi)$, where $\phi = \mathrm{size}(A,b)$. It is a long-standing open problem to break the $O(n)^n$ dimension-dependence in the complexity of ILP algorithms. We show that if the matrix $A$ has a small $l_1$ or $l_\infty$ norm, or $A$ is sparse and has bounded elements, then the integer feasibility problem can be solved faster. More precisely, we give the following complexity bounds: \begin{gather*} \min\{\|A\|_{\infty}, \|A\|_1\}^{5 n} \cdot 2^n \cdot \mathrm{poly}(\phi), \\ \bigl( \|A\|_{\max} \bigr)^{5 n} \cdot \min\{cs(A),rs(A)\}^{3 n} \cdot 2^n \cdot \mathrm{poly}(\phi). \end{gather*} Here $\|A\|_{\max}$ denotes the maximal absolute value of the elements of $A$, and $cs(A)$ and $rs(A)$ denote the maximal number of nonzero elements in the columns and rows of $A$, respectively. We present similar results for the integer linear counting and optimization problems. Additionally, we apply the last result to multipacking and multicover problems on graphs and hypergraphs, where we need to choose a minimal/maximal multiset of vertices that covers/packs the edges a prescribed number of times. For example, we show that the stable multiset and vertex multicover problems on simple graphs admit FPT-algorithms with the complexity bound $2^{O(|V|)} \cdot \mathrm{poly}(\phi)$, where $V$ is the vertex set of the given graph.
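The matrix parameters entering these bounds are elementary to compute; for concreteness, the following evaluates each of them directly from the definitions above.

```python
import numpy as np

def ilp_parameters(A):
    """Parameters of A used in the complexity bounds."""
    absA = np.abs(A)
    return {
        "norm_inf": absA.sum(axis=1).max(),     # ||A||_inf: max l1-norm of a row
        "norm_1":   absA.sum(axis=0).max(),     # ||A||_1:   max l1-norm of a column
        "norm_max": absA.max(),                 # ||A||_max: largest |entry|
        "rs": int((A != 0).sum(axis=1).max()),  # max nonzeros in a row
        "cs": int((A != 0).sum(axis=0).max()),  # max nonzeros in a column
    }
```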
Consider a set $P$ of $n$ points in $\mathbb{R}^d$. In the discrete median line segment problem, the objective is to find a line segment bounded by a pair of points in $P$ such that the sum of the Euclidean distances from $P$ to the line segment is minimized. In the continuous median line segment problem, a real number $\ell>0$ is given, and the goal is to locate a line segment of length $\ell$ in $\mathbb{R}^d$ such that the sum of the Euclidean distances between $P$ and the line segment is minimized. To begin with, we show how to compute $(1+\epsilon\Delta)$- and $(1+\epsilon)$-approximations to a discrete median line segment in time $O(n\epsilon^{-2d}\log n)$ and $O(n^2\epsilon^{-d})$, respectively, where $\Delta$ is the spread of the line segments spanned by pairs of points. While developing our algorithms, by using the principle of pair decomposition, we derive new data structures that allow us to quickly approximate the sum of the distances from a set of points to a given line segment or point. To our knowledge, our use of pair decompositions for solving minsum facility location problems is the first of its kind; it is versatile and easily implementable. Furthermore, we prove that it is impossible to construct a continuous median line segment for $n\geq3$ non-collinear points in the plane using only ruler and compass. In view of this, we present an $O(n^d\epsilon^{-d})$-time algorithm for approximating a continuous median line segment in $\mathbb{R}^d$ within a factor of $1+\epsilon$. The algorithm is based upon generalizing the point-segment pair decomposition from the discrete to the continuous domain. Last but not least, we give a $(1+\epsilon)$-approximation algorithm, whose time complexity is sub-quadratic in $n$, for solving the constrained median line segment problem in $\mathbb{R}^2$, where an endpoint or the slope of the median line segment is given as input.
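Both variants minimize the same objective, the sum of point-to-segment distances; a direct NumPy evaluation is below, and the exact discrete problem then amounts to minimizing it over all $\binom{n}{2}$ segments spanned by pairs of points (the approximation algorithms in the paper avoid this brute force).

```python
import numpy as np

def sum_dist_to_segment(P, a, b):
    """Sum of Euclidean distances from the rows of P to the segment [a, b]."""
    ab = b - a
    denom = ab @ ab
    if denom == 0:                                   # degenerate segment: a point
        return np.linalg.norm(P - a, axis=1).sum()
    t = np.clip((P - a) @ ab / denom, 0.0, 1.0)      # project onto the line, clamp
    closest = a + t[:, None] * ab
    return np.linalg.norm(P - closest, axis=1).sum()
```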
We study edge-labelings of the complete bidirected graph $\overset{\tiny\leftrightarrow}{K}_n$ with functions from the set $[d] = \{1, \dots, d\}$ to itself. We call a cycle in $\overset{\tiny\leftrightarrow}{K}_n$ a fixed-point cycle if composing the labels of its edges results in a map that has a fixed point, and we say that a labeling is fixed-point-free if no fixed-point cycle exists. For a given $d$, we ask for the largest value of $n$, denoted $R_f(d)$, for which there exists a fixed-point-free labeling of $\overset{\tiny\leftrightarrow}{K}_n$. Determining $R_f(d)$ for all $d >0$ is a natural Ramsey-type question, generalizing some well-studied zero-sum problems in extremal combinatorics. The problem was recently introduced by Chaudhury, Garg, Mehlhorn, Mehta, and Misra, who proved that $d \leq R_f(d) \leq d^4+d$ and showed that the problem has close connections to EFX allocations, a central problem of fair allocation in social choice theory. In this paper we show the improved bound $R_f(d) \leq d^{2 + o(1)}$, yielding an efficient $(1-\varepsilon)$-EFX allocation with $n$ agents and $O(n^{0.67})$ unallocated goods for any constant $\varepsilon \in (0,1/2]$; this improves the bound of $O(n^{0.8})$ of Chaudhury, Garg, Mehlhorn, Mehta, and Misra. Additionally, we prove the stronger upper bound $2d-2$ in the case where all edge labels are permutations. A very special case of this problem, that of finding zero-sum cycles in digraphs whose edges are labeled with elements of $\mathbb{Z}_d$, was recently considered by Alon and Krivelevich and by M\'{e}sz\'{a}ros and Steiner. Our result improves the bounds obtained by these authors and extends them to labelings from an arbitrary (not necessarily commutative) group, while also simplifying the proof.
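To fix ideas, testing whether one given cycle is a fixed-point cycle is a direct composition of the edge labels; in this sketch `labels` is a hypothetical dictionary mapping each directed edge to a function on $\{0, \dots, d-1\}$ (0-indexed for convenience), represented as a tuple.

```python
def is_fixed_point_cycle(labels, cycle):
    """Compose the labels along `cycle` (a list of vertices) and check
    whether the resulting map on {0, ..., d-1} has a fixed point."""
    d = len(next(iter(labels.values())))
    comp = list(range(d))                            # identity map
    for u, v in zip(cycle, cycle[1:] + cycle[:1]):   # edges of the cycle in order
        f = labels[(u, v)]
        comp = [f[x] for x in comp]                  # new map: f after comp
    return any(comp[x] == x for x in range(d))
```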
Given a random sample of size $n$ from a $p$-dimensional random vector, where both $n$ and $p$ are large, we are interested in testing whether the $p$ components of the random vector are mutually independent. This is the so-called complete independence test. In the multivariate normal case, it is equivalent to testing whether the correlation matrix is an identity matrix. In this paper, we propose a one-sided empirical likelihood method for the complete independence test for multivariate normal data, based on squared sample correlation coefficients. The limiting distribution of our one-sided empirical likelihood test statistic is proved to be $Z^2I(Z>0)$ as both $n$ and $p$ tend to infinity, where $Z$ is a standard normal random variable. In order to improve the power of the empirical likelihood test statistic, we also introduce a rescaled empirical likelihood test statistic. We carry out an extensive simulation study to compare the performance of the rescaled empirical likelihood method and two other statistics related to the sum of squared sample correlation coefficients.
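The common building block of these statistics, the sum of squared off-diagonal sample correlation coefficients, is straightforward to compute:

```python
import numpy as np

def sum_squared_correlations(X):
    """X is an n x p data matrix; return the sum of squared sample
    correlations over all pairs of distinct components."""
    R = np.corrcoef(X, rowvar=False)        # p x p sample correlation matrix
    iu = np.triu_indices_from(R, k=1)
    return float(np.sum(R[iu] ** 2))
```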
We show that for the problem of testing if a matrix $A \in \mathbb{F}^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $\mathbb{F}$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14), which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model, in which query access comes in the form of $\langle X_i, A\rangle := \operatorname{tr}(X_i^\top A)$; perhaps surprisingly, these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, including the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.
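For contrast with the non-adaptive pattern described above, here is the naive submatrix-reading baseline to which the (KDD'14) lower bound applies; it is not the paper's algorithm, and the submatrix size chosen here is only a rough illustration.

```python
import numpy as np

def naive_submatrix_rank_test(A, d, eps, rng=np.random.default_rng(0)):
    """Read a random square submatrix and accept iff its rank is <= d."""
    n = A.shape[0]
    k = min(n, int(np.ceil(2 * d / eps)))            # illustrative size choice
    rows = rng.choice(n, size=k, replace=False)
    cols = rng.choice(n, size=k, replace=False)
    return np.linalg.matrix_rank(A[np.ix_(rows, cols)]) <= d
```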
We consider the exploration-exploitation trade-off in reinforcement learning, and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and is closely related to optimism-based and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
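A schematic tabular rendering of the last point, in the spirit of the abstract rather than the paper's exact algorithm: add a bonus to the reward, solve a soft (log-sum-exp) Bellman equation to obtain K-values, and act with the induced Boltzmann policy; the bonus and temperature schedules are left as inputs.

```python
import numpy as np

def k_values(P, R, bonus, tau, horizon):
    """Finite-horizon soft Bellman backup (sketch).
    P: (S, A, S) transition probabilities; R, bonus: (S, A); tau > 0."""
    S, A = R.shape
    K = np.zeros((horizon + 1, S, A))
    for t in range(horizon - 1, -1, -1):
        V = tau * np.log(np.exp(K[t + 1] / tau).sum(axis=1))   # soft value per state
        K[t] = R + bonus + P @ V                               # Bellman backup
    return K

def boltzmann_policy(K_t, tau):
    """Boltzmann exploration policy induced by K-values at one period."""
    w = np.exp((K_t - K_t.max(axis=1, keepdims=True)) / tau)   # stable softmax
    return w / w.sum(axis=1, keepdims=True)
```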