We study the $c$-approximate near neighbor problem under the continuous Fr\'echet distance: Given a set of $n$ polygonal curves with $m$ vertices, a radius $\delta > 0$, and a parameter $k \leq m$, we want to preprocess the curves into a data structure that, given a query curve $q$ with $k$ vertices, either returns an input curve with Fr\'echet distance at most $c\cdot \delta$ to $q$, or returns that there exists no input curve with Fr\'echet distance at most $\delta$ to $q$. We focus on the case where the input and the queries are one-dimensional polygonal curves -- also called time series -- and we give a comprehensive analysis for this case. We obtain new upper bounds that provide different tradeoffs between approximation factor, preprocessing time, and query time. Our data structures improve upon the state of the art in several ways. We show that for any $0 < \varepsilon \leq 1$ an approximation factor of $(1+\varepsilon)$ can be achieved within the same asymptotic time bounds as the previously best result for $(2+\varepsilon)$. Moreover, we show that an approximation factor of $(2+\varepsilon)$ can be obtained by using preprocessing time and space $O(nm)$, which is linear in the input size, and query time in $O(\frac{1}{\varepsilon})^{k+2}$, where the previously best result used preprocessing time in $n \cdot O(\frac{m}{\varepsilon k})^k$ and query time in $O(1)^k$. We complement our upper bounds with matching conditional lower bounds based on the Orthogonal Vectors Hypothesis. Interestingly, some of our lower bounds already hold for any super-constant value of $k$. This is achieved by proving hardness of a one-sided sparse version of the Orthogonal Vectors problem as an intermediate problem, which we believe to be of independent interest.
We consider the task of lexicographic direct access to query answers. That is, we want to simulate an array containing the answers of a join query sorted in a lexicographic order chosen by the user. A recent dichotomy showed for which queries and orders this task can be done in polylogarithmic access time after quasilinear preprocessing, but this dichotomy does not tell us how much time is required in the cases classified as hard. We determine the preprocessing time needed to achieve polylogarithmic access time for all self-join free queries and all lexicographic orders. To this end, we propose a decomposition-based general algorithm for direct access on join queries. We then explore its optimality by proving lower bounds for the preprocessing time based on the hardness of a certain online Set-Disjointness problem, which shows that our algorithm's bounds are tight for all lexicographic orders on self-join free queries. Then, we prove the hardness of Set-Disjointness based on the Zero-Clique Conjecture, which is an established conjecture from fine-grained complexity theory. We also show that similar techniques can be used to prove that, for enumerating answers to Loomis-Whitney joins, it is not possible to significantly improve upon trivially computing all answers at preprocessing. This, in turn, gives further evidence (based on the Zero-Clique Conjecture) for the enumeration hardness of self-join free cyclic joins with respect to linear preprocessing and constant delay.
This paper considers the temporal discretization of an inverse problem subject to a time fractional diffusion equation. Firstly, the convergence of the L1 scheme is established with an arbitrary sectorial operator of spectral angle $< \pi/2 $, that is, the resolvent set of this operator contains $ \{z\in\mathbb C\setminus\{0\}:\ |\operatorname{Arg} z|< \theta\}$ for some $ \pi/2 < \theta < \pi $. The relationship between the time fractional order $\alpha \in (0, 1)$ and the constants in the error estimates is precisely characterized, revealing that the L1 scheme is robust as $ \alpha $ approaches $ 1 $. Then an inverse problem of a fractional diffusion equation is analyzed, and the convergence analysis of a temporal discretization of this inverse problem is given. Finally, numerical results are provided to confirm the theoretical results.
The Fr\'{e}chet distance is a well-studied similarity measure between curves that is widely used throughout computer science. Motivated by applications where curves stem from paths and walks on an underlying graph (such as a road network), we define and study the Fr\'{e}chet distance for paths and walks on graphs. When provided with a distance oracle of $G$ with $O(1)$ query time, the classical quadratic-time dynamic program can compute the Fr\'{e}chet distance between two walks $P$ and $Q$ in a graph $G$ in $O(|P| \cdot |Q|)$ time. We show that there are situations where the graph structure helps with computing the Fr\'{e}chet distance: when the graph $G$ is planar, we apply existing (approximate) distance oracles to compute a $(1+\varepsilon)$-approximation of the Fr\'{e}chet distance between any shortest path $P$ and any walk $Q$ in $O(|G| \log |G| / \sqrt{\varepsilon} + |P| + \frac{|Q|}{\varepsilon } )$ time. We generalise this result to near-shortest paths, i.e. $\kappa$-straight paths, by showing how to compute a $(1+\varepsilon)$-approximation between a $\kappa$-straight path $P$ and any walk $Q$ in $O(|G| \log |G| / \sqrt{\varepsilon} + |P| + \frac{\kappa|Q|}{\varepsilon } )$ time. Our algorithmic results hold for both the strong and the weak discrete Fr\'{e}chet distance over the shortest path metric in $G$. Finally, we show that additional assumptions on the input, such as our assumption on path straightness, are indeed necessary to obtain truly subquadratic running time. We provide a conditional lower bound showing that the Fr\'{e}chet distance, or even its $1.01$-approximation, between arbitrary \emph{paths} in a weighted planar graph cannot be computed in $O((|P|\cdot|Q|)^{1-\delta})$ time for any $\delta > 0$ unless the Orthogonal Vectors Hypothesis fails. For walks, this lower bound holds even when $G$ is planar, unit-weight and has $O(1)$ vertices.
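As an illustration of the classical quadratic-time dynamic program mentioned in this abstract, the following is a minimal sketch for the strong discrete Fr\'{e}chet distance. It is not taken from the paper: the function name `discrete_frechet` and the `dist` callable (standing in for an $O(1)$-time distance oracle on $G$) are assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

V = TypeVar("V")


def discrete_frechet(
    P: Sequence[V],
    Q: Sequence[V],
    dist: Callable[[V, V], float],
) -> float:
    """Strong discrete Frechet distance between vertex sequences P and Q.

    `dist` is a hypothetical stand-in for a constant-time distance oracle.
    Uses O(|P| * |Q|) oracle calls and O(|Q|) extra memory.
    """
    if not P or not Q:
        raise ValueError("both walks must be non-empty")

    prev: List[float] = []  # DP row for the previous vertex of P
    for i, p in enumerate(P):
        cur: List[float] = []  # DP row for the current vertex of P
        for j, q in enumerate(Q):
            d = dist(p, q)
            if i == 0 and j == 0:
                best = d
            elif i == 0:
                # Only Q can have advanced.
                best = max(cur[j - 1], d)
            elif j == 0:
                # Only P can have advanced.
                best = max(prev[0], d)
            else:
                # Advance on P, on Q, or on both; keep the cheapest option.
                best = max(min(prev[j], prev[j - 1], cur[j - 1]), d)
            cur.append(best)
        prev = cur
    return prev[-1]
```

For example, labelling the vertices of a unit-weight path graph by integers and using `dist = lambda a, b: abs(a - b)` turns the shortest-path metric into a one-line oracle, which is enough to try the routine on small inputs.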
Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability including online CANDECOMP/PARAFAC (CP) dictionary learning and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis of the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\varepsilon}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\varepsilon}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $O((\log n)^{1+\varepsilon}/n^{1/2})$. Our results provide the first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.
Given a polyline on $n$ vertices, the polyline simplification problem asks for a minimum size subsequence of these vertices defining a new polyline whose distance to the original polyline is at most a given threshold under some distance measure. In this paper, we improve the long-standing running time bound for the simplification of polylines under the local Fr\'echet distance. The best algorithm known so far is by Imai and Iri and has a cubic running time in $n$. We present an algorithm with a running time of $O(n^2)$ under the $L_1$ and $L_\infty$ norm, and $O(n^2 \log n)$ under $L_{p \in (1,\infty)}$ (including the Euclidean norm $L_2$). Our approach is based on the ideas of Chan and Chin, who showed that under the local Hausdorff distance, the Imai-Iri algorithm can be improved to run in quadratic time for $L_1$, $L_2$, and $L_\infty$. However, the Hausdorff distance does not take the order of the points along the polyline into account. The Fr\'echet distance, which is sensitive to the course of the polylines, is hence often deemed the superior distance measure for polyline similarity, but it is also more intricate to compute. So far, the significantly faster simplification algorithms for the Hausdorff distance made them preferable for practical application. But our new algorithm for simplification under the Fr\'echet distance matches the running time bounds for the Hausdorff distance up to logarithmic factors and thus allows the usage of this more suitable distance measure.
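To make the problem statement in this abstract concrete, here is a minimal sketch of the shortcut-graph view commonly associated with the Imai-Iri approach: a shortcut $(i, j)$ is allowed when the segment from vertex $i$ to vertex $j$ is within the threshold of the subpolyline it replaces, and a minimum-size simplification is a fewest-hops path from the first to the last vertex. The names `min_link_simplification` and `is_valid_shortcut` are hypothetical, and the validity test (the local Fr\'echet or Hausdorff check) is deliberately left as a caller-supplied predicate, so this is not the algorithm of the paper.

```python
from typing import Callable, List


def min_link_simplification(
    n: int,
    is_valid_shortcut: Callable[[int, int], bool],
) -> List[int]:
    """Indices of a minimum-size simplification of a polyline on n vertices.

    `is_valid_shortcut(i, j)` (hypothetical, supplied by the caller) must
    report whether the segment from vertex i to vertex j, with i < j,
    stays within the distance threshold of the subpolyline it replaces.
    Evaluates the predicate O(n^2) times; if each test costs O(n), the
    total is cubic.
    """
    if n <= 0:
        raise ValueError("the polyline needs at least one vertex")

    inf = float("inf")
    hops = [inf] * n   # fewest shortcuts needed to reach each vertex
    pred = [-1] * n    # predecessor on one optimal path
    hops[0] = 0

    # Shortcuts only go forward, so the graph is a DAG in index order.
    for j in range(1, n):
        for i in range(j):
            if hops[i] + 1 < hops[j] and is_valid_shortcut(i, j):
                hops[j] = hops[i] + 1
                pred[j] = i

    if hops[n - 1] == inf:
        raise ValueError("no simplification reaches the last vertex")

    # Walk the predecessor chain back from the last vertex.
    path = []
    v = n - 1
    while v != -1:
        path.append(v)
        v = pred[v]
    return path[::-1]
```

In practice the predicate for consecutive vertices `(i, i + 1)` is always true, so a solution exists; the sketch still guards against a predicate that rejects them.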
In this paper, we show that the diagonal of a high-dimensional sample covariance matrix stemming from $n$ independent observations of a $p$-dimensional time series with finite fourth moments can be approximated in spectral norm by the diagonal of the population covariance matrix. We assume that $n,p\to \infty$ with $p/n$ tending to a constant which might be positive or zero. As applications, we provide an approximation of the sample correlation matrix ${\mathbf R}$ and derive a variety of results for its eigenvalues. We identify the limiting spectral distribution of ${\mathbf R}$ and construct an estimator for the population correlation matrix and its eigenvalues. Finally, the almost sure limits of the extreme eigenvalues of ${\mathbf R}$ in a generalized spiked correlation model are analyzed.
The many-user MAC is an important model for understanding energy efficiency of massive random access in 5G and beyond. The model was introduced in Polyanskiy'2017 for the AWGN channel; subsequent works have provided improved bounds on the asymptotic minimum energy-per-bit required to achieve a target per-user error at a given user density and payload, going beyond the AWGN setting. The best known rigorous bounds use spatially coupled codes along with the optimal AMP algorithm, but these bounds are infeasible to compute beyond a few (around 10) bits of payload. In this paper, we provide new achievability bounds for the many-user AWGN and quasi-static Rayleigh fading MACs using the spatially coupled codebook design along with a scalar AMP algorithm. The obtained bounds are computable even up to 100 bits and outperform the previous ones at this payload.
We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and lower computational complexity than existing moment-based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvement in runtime and accuracy.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.