The formula $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ has been empirically shown to model the rank-frequency ($r$-$f$) relation of words in natural languages more precisely than a na\"ive power law $f\propto r^{-\alpha}$. This work shows that the only crucial parameter in the formulation is $\gamma$, which represents the resistance to vocabulary growth on a corpus. A method of parameter estimation by searching for an optimal $\gamma$ is proposed, in which a ``zeroth word'' is introduced as a technical device for the calculation. The formulation and parameters are further discussed with several case studies.
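As a minimal illustration of the formulation above, the following Python sketch evaluates normalized frequencies for the plain power law and for the shifted form. The function name `rank_frequency` and all parameter values are hypothetical and chosen only for illustration; they are not estimates reported in the work.

```python
def rank_frequency(ranks, alpha, beta, gamma):
    """Normalized frequencies with f(r) proportional to r^(-alpha) * (r + gamma)^(-beta)."""
    weights = [r ** (-alpha) * (r + gamma) ** (-beta) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]


ranks = list(range(1, 1001))

# Plain power law: beta = 0 removes the shifted factor entirely.
power_law = rank_frequency(ranks, alpha=1.0, beta=0.0, gamma=0.0)

# Shifted form with illustrative (assumed) parameters.
shifted = rank_frequency(ranks, alpha=0.5, beta=0.5, gamma=20.0)

# A positive gamma flattens the head of the curve: the ratio f(1)/f(2)
# is smaller than under the plain power law.
head_ratio_power_law = power_law[0] / power_law[1]
head_ratio_shifted = shifted[0] / shifted[1]
```

With these assumed values, the two forms share the same total exponent $\alpha+\beta = 1$, so they differ mainly at small ranks, where $\gamma$ is not negligible compared with $r$.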
A partition $\mathcal{P}$ of a weighted graph $G$ is $(\sigma,\tau,\Delta)$-sparse if every cluster has diameter at most $\Delta$, and every ball of radius $\Delta/\sigma$ intersects at most $\tau$ clusters. Similarly, $\mathcal{P}$ is $(\sigma,\tau,\Delta)$-scattering if, instead of the condition on balls, we require that every shortest path of length at most $\Delta/\sigma$ intersects at most $\tau$ clusters. Given a graph $G$ that admits a $(\sigma,\tau,\Delta)$-sparse partition for all $\Delta>0$, Jia et al. [STOC05] constructed a solution for the Universal Steiner Tree problem (and also Universal TSP) with stretch $O(\tau\sigma^2\log_\tau n)$. Given a graph $G$ that admits a $(\sigma,\tau,\Delta)$-scattering partition for all $\Delta>0$, we construct a solution for the Steiner Point Removal problem with stretch $O(\tau^3\sigma^3)$. We then construct sparse and scattering partitions for various graph families, obtaining many new results for the Universal Steiner Tree and Steiner Point Removal problems.
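The definition of a $(\sigma,\tau,\Delta)$-sparse partition can be checked directly on a small graph. The sketch below is only a brute-force verifier, not a construction from the work. It assumes that cluster diameter is measured with distances in $G$ (weak diameter) and that balls are centered at vertices; the function names are hypothetical.

```python
import itertools


def all_pairs_distances(vertices, weighted_edges):
    """Floyd-Warshall shortest-path distances for an undirected weighted graph."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in vertices} for u in vertices}
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def is_sparse_partition(vertices, weighted_edges, clusters, sigma, tau, delta):
    """Check both conditions of a (sigma, tau, delta)-sparse partition."""
    dist = all_pairs_distances(vertices, weighted_edges)
    # Condition 1: every cluster has (weak) diameter at most delta.
    for cluster in clusters:
        for u, v in itertools.combinations(cluster, 2):
            if dist[u][v] > delta:
                return False
    # Condition 2: every ball of radius delta / sigma meets at most tau clusters.
    radius = delta / sigma
    for center in vertices:
        hit = sum(
            1 for cluster in clusters
            if any(dist[center][u] <= radius for u in cluster)
        )
        if hit > tau:
            return False
    return True


# Unit-weight path 0 - 1 - 2 - 3, split into two clusters.
path_vertices = [0, 1, 2, 3]
path_edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
path_clusters = [{0, 1}, {2, 3}]
```

On this path, the ball of radius $1$ around vertex $1$ touches both clusters, so the partition passes with $\tau = 2$ and fails with $\tau = 1$ when $\sigma = \Delta = 1$.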
We study the problem of robust multivariate polynomial regression: let $p\colon\mathbb{R}^n\to\mathbb{R}$ be an unknown $n$-variate polynomial of degree at most $d$ in each variable. We are given as input a set of random samples $(\mathbf{x}_i,y_i) \in [-1,1]^n \times \mathbb{R}$ that are noisy versions of $(\mathbf{x}_i,p(\mathbf{x}_i))$. More precisely, each $\mathbf{x}_i$ is sampled independently from some distribution $\chi$ on $[-1,1]^n$, and for each $i$ independently, $y_i$ is arbitrary (i.e., an outlier) with probability at most $\rho < 1/2$, and otherwise satisfies $|y_i-p(\mathbf{x}_i)|\leq\sigma$. The goal is to output a polynomial $\hat{p}$, of degree at most $d$ in each variable, within an $\ell_\infty$-distance of at most $O(\sigma)$ from $p$. Kane, Karmalkar, and Price [FOCS'17] solved this problem for $n=1$. We generalize their results to the $n$-variate setting, showing an algorithm that achieves a sample complexity of $O_n(d^n\log d)$, where the hidden constant depends on $n$, if $\chi$ is the $n$-dimensional Chebyshev distribution. The sample complexity is $O_n(d^{2n}\log d)$, if the samples are drawn from the uniform distribution instead. The approximation error is guaranteed to be at most $O(\sigma)$, and the run-time depends on $\log(1/\sigma)$. In the setting where each $\mathbf{x}_i$ and $y_i$ are known up to $N$ bits of precision, the run-time's dependence on $N$ is linear. We also show that our sample complexities are optimal in terms of $d^n$. Furthermore, we show that it is possible to have the run-time be independent of $1/\sigma$, at the cost of a higher sample complexity.
Depth-3 circuit lower bounds and $k$-SAT algorithms are intimately related; the state-of-the-art $\Sigma^k_3$-circuit lower bound and the $k$-SAT algorithm are based on the same combinatorial theorem. In this paper we define a problem which reveals new interactions between the two. Define the Enum($k$, $t$) problem as follows: given an $n$-variable $k$-CNF and an initial assignment $\alpha$, output all satisfying assignments at Hamming distance $t$ from $\alpha$, assuming that there are no satisfying assignments at Hamming distance less than $t$ from $\alpha$. Observe that an upper bound $b(n, k, t)$ on the complexity of Enum($k$, $t$) implies the following. (i) Depth-3 circuits: any $\Sigma^k_3$ circuit computing the Majority function has size at least $\binom{n}{\frac{n}{2}}/b(n, k, \frac{n}{2})$. (ii) $k$-SAT: there exists an algorithm solving $k$-SAT in time $O(\sum_{t = 1}^{n/2}b(n, k, t))$. A simple construction shows that $b(n, k, \frac{n}{2}) \ge 2^{(1 - O(\log(k)/k))n}$. Thus, matching upper bounds would imply a $\Sigma^k_3$-circuit lower bound of $2^{\Omega(\log(k)n/k)}$ and a $k$-SAT upper bound of $2^{(1 - \Omega(\log(k)/k))n}$. The former yields an unrestricted depth-3 lower bound of $2^{\omega(\sqrt{n})}$, solving a long-standing open problem, and the latter breaks the Super Strong Exponential Time Hypothesis. In this paper, we propose a randomized algorithm for Enum($k$, $t$) and introduce new ideas to analyze it. We demonstrate the power of our ideas by considering the first non-trivial instance of the problem, i.e., Enum($3$, $\frac{n}{2}$). We show that the expected running time of our algorithm is $1.598^n$, substantially improving on the trivial bound of $3^{n/2} \simeq 1.732^n$. This already improves the $\Sigma^3_3$ lower bound for the Majority function to $1.251^n$. The previous bound was $1.154^n$, which follows from the work of H{\aa}stad, Jukna, and Pudl\'ak (Comput. Complex.'95).
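To make the Enum($k$, $t$) problem concrete, the sketch below enumerates solutions by a simple deterministic branching rule: pick a falsified clause, flip one of its at most $k$ variables, and recurse with budget $t-1$, which explores at most $k^t$ branches. This is a baseline written under our own assumptions, presumably the kind of argument behind the trivial $3^{n/2}$ bound, and it is not the randomized algorithm proposed in the paper. Clauses are tuples of nonzero integers in the usual DIMACS-style convention, and all names are hypothetical.

```python
def is_satisfied(clause, assignment):
    """Literal v > 0 means variable v is True; v < 0 means variable -v is False."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)


def enum_by_branching(cnf, alpha, t):
    """Return all satisfying assignments at Hamming distance exactly t from alpha.

    Relies on the promise that no satisfying assignment lies at distance < t.
    """
    found = set()

    def recurse(current, flipped, budget):
        falsified = next((c for c in cnf if not is_satisfied(c, current)), None)
        if budget == 0:
            if falsified is None:
                found.add(tuple(current))
            return
        if falsified is None:
            return  # Excluded by the promise: satisfied at distance < t.
        # Any solution must differ from `current` on a variable of this clause.
        for lit in falsified:
            var = abs(lit) - 1
            if var in flipped:
                continue  # Never flip a variable back.
            current[var] = not current[var]
            recurse(current, flipped | {var}, budget - 1)
            current[var] = not current[var]

    recurse(list(alpha), frozenset(), t)
    return found


# (x1 or x2 or x3), starting from the all-False assignment, distance 1.
solutions = enum_by_branching([(1, 2, 3)], (False, False, False), 1)
```

Different branches can reach the same assignment, so the results are collected in a set.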
We consider the quasi-likelihood analysis for a linear regression model driven by a Student-t L\'{e}vy process with constant scale and arbitrary degrees of freedom. The model is observed at high frequency over an extending period, under which we can quantify how the sampling frequency affects estimation accuracy. In that setting, joint estimation of trend, scale, and degrees of freedom is a non-trivial problem. The bottleneck is that the Student-t distribution is not closed under convolution, making it difficult to estimate all the parameters fully based on the high-frequency time scale. To handle this difficulty efficiently from both theoretical and computational points of view, we propose a two-step quasi-likelihood analysis: first, we make use of the Cauchy quasi-likelihood for estimating the regression-coefficient vector and the scale parameter; then, we construct the sequence of the unit-period cumulative residuals to estimate the remaining degrees of freedom. In particular, using the full data in the first step causes a problem stemming from the small-time Cauchy approximation, showing the need for data thinning.
Elkin, Filtser, and Neiman (2017) recently introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, Narayanan and Nelson (2019), following work of Mahabadi, Makarychev, Makarychev, and Razenshteyn (2018), showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.
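The distortion in the definition above can be computed on a finite point set: the best constant $C$ is the smallest ratio $d_Y(f(x),f(q))/d_X(x,q)$ over terminal-point pairs, so $\rho$ is the largest ratio divided by the smallest. The sketch below assumes Euclidean metrics on both sides and a finite sample of $X$; it only measures the distortion of a given map and has nothing to do with the data structure of this work. The names `terminal_distortion` and `stretch_second_axis` are hypothetical.

```python
import math


def terminal_distortion(points, terminals, embed):
    """Distortion of `embed` over pairs (x, q) with x a terminal and q any point."""
    ratios = []
    for x in terminals:
        for q in points:
            d_source = math.dist(x, q)
            if d_source == 0:
                continue  # The inequality is trivial when q coincides with x.
            ratios.append(math.dist(embed(x), embed(q)) / d_source)
    smallest = min(ratios)
    if smallest == 0:
        return math.inf  # Two distinct points collapse: no valid C > 0 exists.
    return max(ratios) / smallest


def stretch_second_axis(p):
    """Toy map that doubles the second coordinate."""
    return (p[0], 2 * p[1])


square = [(0, 0), (1, 0), (0, 1), (1, 1)]
rho = terminal_distortion(square, [(0, 0)], stretch_second_axis)
```

Only pairs involving a terminal enter the computation, which is what distinguishes a terminal embedding from an embedding that must preserve all pairwise distances.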
For a $P$-indexed persistence module ${\sf M}$, the (generalized) rank of ${\sf M}$ is defined as the rank of the limit-to-colimit map for ${\sf M}$ over the poset $P$. For $2$-parameter persistence modules, a zigzag-persistence-based algorithm has recently been proposed that takes advantage of the fact that the generalized rank of a $2$-parameter module is equal to the number of full intervals in a zigzag module defined on the boundary of the poset. An analogous definition of boundary for $d$-parameter persistence modules or general $P$-indexed persistence modules does not seem plausible. To overcome this difficulty, we first unfold a given $P$-indexed module ${\sf M}$ into a zigzag module ${\sf M}_{ZZ}$ and then check how many full interval modules in a decomposition of ${\sf M}_{ZZ}$ can be folded back to remain full in ${\sf M}$. This number determines the generalized rank of ${\sf M}$. For special cases of degree-$d$ homology for $d$-complexes, we obtain a more efficient algorithm, including a linear-time algorithm for degree-$1$ homology in graphs.
Given an increasing sequence of integers $x_1,\ldots,x_n$ from a universe $\{0,\ldots,u-1\}$, the monotone minimal perfect hash function (MMPHF) for this sequence is a data structure that answers the following rank queries: $rank(x) = i$ if $x = x_i$, for $i\in \{1,\ldots,n\}$, and $rank(x)$ is arbitrary otherwise. Assadi, Farach-Colton, and Kuszmaul recently presented at SODA'23 a proof of the lower bound $\Omega(n \min\{\log\log\log u, \log n\})$ for the bits of space required by MMPHF, provided $u \ge n 2^{2^{\sqrt{\log\log n}}}$, which is tight since there is a data structure for MMPHF that attains this space bound (and answers the queries in $O(\log u)$ time). In this paper, we close the remaining gap by proving that, for $u \ge (1+\epsilon)n$, where $\epsilon > 0$ is any constant, the tight lower bound is $\Omega(n \min\{\log\log\log \frac{u}{n}, \log n\})$, which is also attainable; we observe that, for all reasonable cases when $n < u < (1+\epsilon)n$, known facts imply tight bounds, which virtually settles the problem. Along the way, we substantially simplify the proof of Assadi et al., replacing part of their heavy combinatorial machinery with trivial observations. However, an important part of the proof still remains complicated. This part of our paper repeats arguments of Assadi et al. and is not novel. Nevertheless, we include it for completeness, offering a somewhat different perspective on these arguments.
For the vertex selection problem $(\sigma,\rho)$-DomSet one is given two fixed sets $\sigma$ and $\rho$ of integers and the task is to decide whether we can select vertices of the input graph such that, for every selected vertex, the number of selected neighbors is in $\sigma$ and, for every unselected vertex, the number of selected neighbors is in $\rho$. For example, this framework covers Independent Set and Dominating Set. We investigate the case when $\sigma$ and $\rho$ are periodic sets with the same period $m\ge 2$, that is, the sets are two (potentially different) residue classes modulo $m$. We study the problem parameterized by treewidth and present an algorithm that solves the decision, minimization, and maximization versions of the problem in time $m^{tw} \cdot n^{O(1)}$. This significantly improves upon the known algorithms, where for the case $m \ge 3$ not even an explicit running time is known. We complement our algorithm by providing matching lower bounds which state that there is no $(m-\epsilon)^{pw} \cdot n^{O(1)}$-time algorithm unless SETH fails. For $m = 2$, we extend these bounds to the minimization version, as the decision version is efficiently solvable.
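For the periodic case described above, checking whether a given vertex selection is feasible reduces to counting selected neighbors modulo $m$. The sketch below is only such a verifier, under the assumption that $\sigma$ and $\rho$ are given as residues modulo $m$; it is not the treewidth-based algorithm of the work, and the names are hypothetical.

```python
def is_sigma_rho_set(adjacency, selected, sigma_residue, rho_residue, m):
    """Check a selection against sigma = {x : x = sigma_residue (mod m)}
    and rho = {x : x = rho_residue (mod m)}."""
    for vertex, neighbors in adjacency.items():
        count = sum(1 for u in neighbors if u in selected)
        required = sigma_residue if vertex in selected else rho_residue
        if count % m != required:
            return False
    return True


# Path a - b - c with m = 2: selected vertices need an even number of
# selected neighbors, unselected vertices an odd number.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
middle_only = is_sigma_rho_set(path, {"b"}, sigma_residue=0, rho_residue=1, m=2)
both_ends = is_sigma_rho_set(path, {"a", "c"}, sigma_residue=0, rho_residue=1, m=2)
```

Selecting only the middle vertex is feasible here, whereas selecting both endpoints leaves the middle vertex with two selected neighbors, an even count where an odd one is required.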
Let $\mathsf{TH}_k$ denote the $k$-out-of-$n$ threshold function: given $n$ input Boolean variables, the output is $1$ if and only if at least $k$ of the inputs are $1$. We consider the problem of computing the $\mathsf{TH}_k$ function using noisy readings of the Boolean variables, where each reading is incorrect with some fixed and known probability $p \in (0,1/2)$. As our main result, we show that, when $k = o(n)$, it is both sufficient and necessary to use $$(1 \pm o(1)) \frac{n\log \frac{k}{\delta}}{D_{\mathsf{KL}}(p || 1-p)}$$ queries in expectation to compute the $\mathsf{TH}_k$ function with a vanishing error probability $\delta = o(1)$, where $D_{\mathsf{KL}}(p || 1-p)$ denotes the Kullback-Leibler divergence between $\mathsf{Bern}(p)$ and $\mathsf{Bern}(1-p)$ distributions. In particular, this says that $(1 \pm o(1)) \frac{n\log \frac{1}{\delta}}{D_{\mathsf{KL}}(p || 1-p)}$ queries in expectation are both sufficient and necessary to compute the $\mathsf{OR}$ and $\mathsf{AND}$ functions of $n$ Boolean variables. Compared to previous work, our result tightens the dependence on $p$ in both the upper and lower bounds.
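The leading-order query count stated above is easy to evaluate numerically. The sketch below uses the closed form $D_{\mathsf{KL}}(p || 1-p) = (1-2p)\log\frac{1-p}{p}$, which follows directly from the definition of the divergence between $\mathsf{Bern}(p)$ and $\mathsf{Bern}(1-p)$, and it ignores the $(1 \pm o(1))$ factor. The function names and the numerical inputs are hypothetical.

```python
import math


def kl_bern_flip(p):
    """D_KL(Bern(p) || Bern(1 - p)) = (1 - 2p) * ln((1 - p) / p), in nats."""
    return (1 - 2 * p) * math.log((1 - p) / p)


def leading_order_queries(n, k, delta, p):
    """n * log(k / delta) / D_KL(p || 1 - p), without the (1 +/- o(1)) factor.

    The same logarithm base is used in numerator and denominator,
    so the ratio does not depend on the choice of base.
    """
    return n * math.log(k / delta) / kl_bern_flip(p)


# Illustrative (assumed) inputs: n = 1000 variables, error 0.01, noise 0.1.
threshold_queries = leading_order_queries(n=1000, k=10, delta=0.01, p=0.1)
or_queries = leading_order_queries(n=1000, k=1, delta=0.01, p=0.1)
```

Setting $k = 1$ recovers the expression quoted for the $\mathsf{OR}$ function, and the count grows as $p$ approaches $1/2$ because the divergence in the denominator vanishes there.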
Quantum circuit compilation comprises many computationally hard reasoning tasks that nonetheless lie inside $\#\mathbf{P}$ and its decision counterpart $\mathbf{PP}$. The classical simulation of general quantum circuits is a core example. We show for the first time that a strong simulation of universal quantum circuits can be efficiently tackled through weighted model counting by providing a linear encoding of Clifford+T circuits. To achieve this, we exploit the stabilizer formalism by Knill, Gottesman, and Aaronson and the fact that stabilizer states form a basis for density operators. With an open-source simulator implementation, we demonstrate empirically that model counting often outperforms state-of-the-art simulation techniques based on the ZX calculus and decision diagrams. Our work paves the way to apply the existing array of powerful classical reasoning tools to realize efficient quantum circuit compilation, one of the obstacles on the road towards quantum supremacy.