Motivated by the problem of determining the atomic structure of macromolecules using single-particle cryo-electron microscopy (cryo-EM), we study the sample and computational complexities of the sparse multi-reference alignment (MRA) model: the problem of estimating a sparse signal from its noisy, circularly shifted copies. Based on its tight connection to the crystallographic phase retrieval problem, we establish that if the number of observations is proportional to the square of the variance of the noise, then the sparse MRA problem is statistically feasible for sufficiently sparse signals. To investigate its computational hardness, we consider three types of computational frameworks: projection-based algorithms, bispectrum inversion, and convex relaxations. We show that a state-of-the-art projection-based algorithm achieves the optimal estimation rate, but its computational complexity is exponential in the sparsity level. The bispectrum framework provides a statistical-computational trade-off: it requires more observations (so its estimation rate is suboptimal), but its computational load is provably polynomial in the signal's length. The convex relaxation approach provides polynomial time algorithms (with a large exponent) that recover sufficiently sparse signals at the optimal estimation rate. We conclude the paper by discussing potential statistical and algorithmic implications for cryo-EM.
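To make the observation model concrete, the following is a minimal Python sketch (not taken from the paper) that simulates sparse MRA data: a $k$-sparse signal is circularly shifted at random and corrupted by additive Gaussian noise, and the shift-invariant power spectrum is estimated by averaging over observations. The signal length L, sparsity k, noise level sigma, and number of observations n below are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the sparse MRA observation model
# y_j = R_{s_j} x + noise, where R_s circularly shifts x by s entries.
import numpy as np

rng = np.random.default_rng(0)
L, k, sigma, n = 50, 5, 1.0, 10000        # signal length, sparsity, noise std, #observations

x = np.zeros(L)
x[rng.choice(L, size=k, replace=False)] = rng.normal(size=k)   # k-sparse signal

shifts = rng.integers(0, L, size=n)
obs = np.stack([np.roll(x, s) for s in shifts]) + sigma * rng.normal(size=(n, L))

# The shift-invariant power spectrum can be estimated by averaging over observations
# and subtracting the noise bias L * sigma**2 at each frequency.
power_spectrum_est = np.mean(np.abs(np.fft.fft(obs, axis=1)) ** 2, axis=0) - L * sigma ** 2
```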
We lay the foundations of a new theory for algorithms and computational complexity by parameterizing the instances of a computational problem as a moduli scheme. Considering the geometry of the scheme associated to 3-SAT, we separate P and NP.
This article is motivated by studying multisensory effects on brain activities in intracranial electroencephalography (iEEG) experiments. Differential brain activities in response to multisensory stimulus presentations are zero in most regions and non-zero in some local regions, yielding locally sparse functions. Such studies essentially pose a function-on-scalar regression problem, with interest focused not only on estimating nonparametric functions but also on recovering their supports. We propose a weighted group bridge approach for simultaneous function estimation and support recovery in function-on-scalar mixed effect models, while accounting for heterogeneity present in functional data. We use B-splines to transform the sparsity of functions into sparsity of coefficient vectors of increasing dimension, and propose a fast non-convex optimization algorithm using a nested alternating direction method of multipliers (ADMM) for estimation. Large sample properties are established. In particular, we show that the estimated coefficient functions are rate optimal in the minimax sense under the $L_2$ norm and exhibit a phase transition phenomenon. For support estimation, we derive a convergence rate under the $L_{\infty}$ norm that leads to a sparsistency property under $\delta$-sparsity, and provide a simple sufficient regularity condition under which a strict sparsistency property is established. An adjusted extended Bayesian information criterion is proposed for parameter tuning. The developed method is illustrated through simulation and an application to a novel iEEG dataset to study multisensory integration. We integrate the proposed method into RAVE, an R package that is gaining popularity in the iEEG community.
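As a rough illustration of the B-spline step described above (not the paper's implementation, which additionally involves the weighted group bridge penalty and the nested ADMM), the sketch below expands a locally sparse function in a clamped cubic B-spline basis and shows that most basis coefficients are negligible. The knot grid, spline degree, and toy function are illustrative assumptions.

```python
# Minimal sketch: functional sparsity maps to sparsity of B-spline coefficients.
import numpy as np
from scipy.interpolate import BSpline

degree = 3
interior_knots = np.linspace(0, 1, 21)
knots = np.r_[[0.0] * degree, interior_knots, [1.0] * degree]   # clamped knot vector
n_basis = len(knots) - degree - 1

grid = np.linspace(0, 1, 200)
# Design matrix: each column is one B-spline basis function evaluated on the grid.
B = np.column_stack([
    BSpline(knots, np.eye(n_basis)[j], degree)(grid) for j in range(n_basis)
])

# A locally sparse function: non-zero only on (0.4, 0.6).
f = np.where((grid > 0.4) & (grid < 0.6), np.sin(10 * np.pi * (grid - 0.4)), 0.0)
coef, *_ = np.linalg.lstsq(B, f, rcond=None)
print(np.sum(np.abs(coef) > 1e-3), "of", n_basis, "coefficients are non-negligible")
```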
The asymptotic behaviour of Linear Spectral Statistics (LSS) of the smoothed periodogram estimator of the spectral coherency matrix of a complex Gaussian high-dimensional time series $(\mathbf{y}_n)_{n \in \mathbb{Z}}$ with independent components is studied under the asymptotic regime where the sample size $N$ converges towards $+\infty$ while the dimension $M$ of $\mathbf{y}_n$ and the smoothing span of the estimator grow to infinity at the same rate, in such a way that $\frac{M}{N} \rightarrow 0$. It is established that, at each frequency, the estimated spectral coherency matrix is close to the sample covariance matrix of an independent identically $\mathcal{N}_{\mathbb{C}}(0,\mathbf{I}_M)$ distributed sequence, and that its empirical eigenvalue distribution converges towards the Marcenko-Pastur distribution. This allows us to conclude that each LSS has a deterministic behaviour that can be evaluated explicitly. Using concentration inequalities, it is shown that the supremum over the frequencies of the deviation of each LSS from its deterministic approximation is of the order of $\frac{1}{M} + \frac{\sqrt{M}}{N}+ \left(\frac{M}{N}\right)^{3}$. Numerical simulations support our results.
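The following Python sketch (not the authors' code) illustrates the objects involved: a frequency-smoothed periodogram of an $M$-dimensional series with independent components, the corresponding spectral coherency estimate at one frequency, and a linear spectral statistic of its eigenvalues. The dimensions M, N and the smoothing span B are illustrative assumptions.

```python
# Minimal sketch: smoothed periodogram, spectral coherency estimate, and one LSS.
import numpy as np

rng = np.random.default_rng(1)
M, N, B = 50, 10000, 500                       # dimension, sample size, smoothing span
y = rng.normal(size=(M, N))                    # independent components (white noise for simplicity)

Y = np.fft.fft(y, axis=1) / np.sqrt(N)         # normalized DFT along time
nu0 = N // 4                                   # index of the frequency of interest
idx = (nu0 + np.arange(-(B // 2), B // 2 + 1)) % N

S_hat = (Y[:, idx] @ Y[:, idx].conj().T) / len(idx)   # smoothed periodogram at frequency nu0
d = np.sqrt(np.real(np.diag(S_hat)))
C_hat = S_hat / np.outer(d, d)                        # spectral coherency estimate

# A linear spectral statistic, e.g. the log-determinant of the coherency matrix:
lss = np.sum(np.log(np.linalg.eigvalsh(C_hat)))
```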
The multi-commodity flow-cut gap is a fundamental parameter that affects the performance of several divide \& conquer algorithms, and has been extensively studied for various classes of undirected graphs. It has been shown by Linial, London and Rabinovich and by Aumann and Rabani that for general $n$-vertex graphs it is bounded by $O(\log n)$, and the Gupta-Newman-Rabinovich-Sinclair conjecture asserts that it is $O(1)$ for any family of graphs that excludes some fixed minor. We show that the multi-commodity flow-cut gap on \emph{directed} planar graphs is $O(\log^3 n)$. This is the first \emph{sub-polynomial} bound for any family of directed graphs of super-constant treewidth. We remark that for general directed graphs, it has been shown by Chuzhoy and Khanna that the gap is $\widetilde{\Omega}(n^{1/7})$, even for directed acyclic graphs. As a direct consequence of our result, we also obtain the first polynomial-time polylogarithmic-approximation algorithms for the Directed Non-Bipartite Sparsest-Cut and Directed Multicut problems on directed planar graphs, which extends the long-standing result for undirected planar graphs by Rao (with a slightly weaker bound). At the heart of our result we investigate low-distortion quasimetric embeddings into \emph{directed} $\ell_1$. More precisely, we construct $O(\log^2 n)$-Lipschitz quasipartitions for the shortest-path quasimetric spaces of planar digraphs, which generalize the notion of Lipschitz partitions from the theory of metric embeddings. This construction combines ideas from the theory of bi-Lipschitz embeddings with tools from data structures on directed planar graphs.
We consider the problem of inferring the conditional independence graph (CIG) of a sparse, high-dimensional stationary multivariate Gaussian time series. A sparse-group lasso-based formulation of the problem in the frequency domain, based on a frequency-domain sufficient statistic for the observed time series, is presented. We investigate an alternating direction method of multipliers (ADMM) approach for optimization of the sparse-group lasso penalized log-likelihood. We provide sufficient conditions for convergence in the Frobenius norm of the inverse PSD estimators to the true value, jointly across all frequencies, where the number of frequencies is allowed to increase with sample size. This result also yields a rate of convergence. We also empirically investigate selection of the tuning parameters based on the Bayesian information criterion, and illustrate our approach using numerical examples utilizing both synthetic and real data.
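For concreteness, the sketch below shows the proximal operator of a sparse-group lasso penalty for a single coefficient group, which is the kind of elementary update that appears inside an ADMM iteration for such penalized likelihoods. It is a generic textbook operator, not the paper's full algorithm, and the parameter values are illustrative.

```python
# Minimal sketch: prox of the sparse-group lasso penalty
#   alpha*lam*||x||_1 + (1-alpha)*lam*||x||_2
# for one group (soft-thresholding followed by group soft-thresholding).
import numpy as np

def prox_sparse_group_lasso(v, lam, alpha):
    """Prox of alpha*lam*||.||_1 + (1-alpha)*lam*||.||_2 for one coefficient group v."""
    u = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)   # element-wise shrinkage
    norm_u = np.linalg.norm(u)
    if norm_u <= (1 - alpha) * lam:                             # whole group is zeroed out
        return np.zeros_like(v)
    return (1.0 - (1 - alpha) * lam / norm_u) * u               # group-wise shrinkage

v = np.array([0.1, -2.0, 0.5, 3.0])
print(prox_sparse_group_lasso(v, lam=0.8, alpha=0.5))
```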
In this work we study the orbit recovery problem over $SO(3)$, where the goal is to recover a band-limited function on the sphere from noisy measurements of randomly rotated copies of it. This is a natural abstraction for the problem of recovering the three-dimensional structure of a molecule through cryo-electron tomography. Symmetries play an important role: recovering the function up to rotation is equivalent to solving a system of polynomial equations that comes from the invariant ring associated with the group action. Prior work investigated this system through computational algebra tools up to a certain size. However, many statistical and algorithmic questions remain: How many moments suffice for recovery, or equivalently at what degree do the invariant polynomials generate the full invariant ring? And is it possible to algorithmically solve this system of polynomial equations? We revisit these problems from the perspective of smoothed analysis, whereby we perturb the coefficients of the function in the basis of spherical harmonics. Our main result is a quasi-polynomial time algorithm for orbit recovery over $SO(3)$ in this model. We analyze a popular heuristic called frequency marching that exploits the layered structure of the system of polynomial equations by setting up a system of {\em linear} equations to solve for the higher-order frequencies, assuming the lower-order ones have already been found. The main questions are: Do these systems have a unique solution? And how fast can the errors compound? Our main technical contribution is in bounding the condition number of these algebraically-structured linear systems. Thus smoothed analysis provides a compelling model in which we can expand the types of group actions we can handle in orbit recovery, beyond the finite and/or abelian case.
We establish theoretical results about the low frequency contamination (i.e., long memory effects) induced by general nonstationarity for estimates such as the sample autocovariance and the periodogram, and deduce consequences for heteroskedasticity and autocorrelation robust (HAR) inference. We present explicit expressions for the asymptotic bias of these estimates. We distinguish cases where this contamination only occurs as a small-sample problem and cases where the contamination continues to hold asymptotically. We show theoretically that nonparametric smoothing over time is robust to low frequency contamination. Our results provide new insights on the debate between consistent versus inconsistent long-run variance (LRV) estimation. Existing LRV estimators tend to be inflated when the data are nonstationary. This results in HAR tests that can be undersized and exhibit dramatic power losses. Our theory indicates that long-bandwidth or fixed-b HAR tests suffer more from low frequency contamination than HAR tests based on HAC estimators, whereas recently introduced double kernel HAC (DK-HAC) estimators do not suffer from this problem. Finally, we present second-order Edgeworth expansions under nonstationarity for the distribution of HAC and DK-HAC estimators and for the corresponding t-test in the linear regression model.
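To fix ideas about what a HAC long-run variance estimate is, the sketch below implements a standard Bartlett-kernel (Newey-West) estimator for a scalar series. It is a generic baseline, not the DK-HAC estimators discussed in the paper, and the bandwidth and simulated process are illustrative assumptions.

```python
# Minimal sketch: Bartlett-kernel (Newey-West) long-run variance estimate.
import numpy as np

def hac_lrv(u, bandwidth):
    """Bartlett/Newey-West estimate of the long-run variance of a demeaned series u."""
    n = len(u)
    lrv = np.sum(u * u) / n                                    # lag-0 autocovariance
    for k in range(1, bandwidth + 1):
        gamma_k = np.sum(u[k:] * u[:-k]) / n                   # lag-k autocovariance
        lrv += 2.0 * (1.0 - k / (bandwidth + 1)) * gamma_k     # Bartlett kernel weight
    return lrv

rng = np.random.default_rng(2)
e = rng.normal(size=2001)
u = e[1:] + 0.5 * e[:-1]                                       # MA(1); true LRV = (1 + 0.5)**2 = 2.25
print(hac_lrv(u - u.mean(), bandwidth=12))
```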
Recent research in differential privacy demonstrated that (sub)sampling can amplify the level of protection. For example, for $\epsilon$-differential privacy and simple random sampling with sampling rate $r$, the actual privacy guarantee is approximately $r\epsilon$ if a value of $\epsilon$ is used to protect the output from the sample. In this paper, we study whether this amplification effect can be exploited systematically to improve the accuracy of the privatized estimate. Specifically, assuming the agency has information for the full population, we ask under which circumstances accuracy gains can be expected if the privatized estimate is computed on a random sample instead of the full population. We find that accuracy gains can be achieved in certain regimes. However, gains can typically only be expected if the sensitivity of the output with respect to small changes in the database does not depend too strongly on the size of the database. We focus only on algorithms that achieve differential privacy by adding noise to the final output, and illustrate the accuracy implications for two commonly used statistics: the mean and the median. We see our research as a first step towards understanding the conditions required for accuracy gains in practice, and we hope that these findings will stimulate further research broadening the scope of differential privacy algorithms and outputs considered.
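The sketch below (not the paper's analysis) shows the basic mechanism in question: a Laplace-noised mean of values bounded in $[0,1]$, whose noise scale is sensitivity/$\epsilon = 1/(n\epsilon)$, computed once on the full population and once on a 10% subsample with the privacy budget rescaled according to the amplification heuristic $r\epsilon$ from the abstract. The population, sampling rate, and $\epsilon$ are illustrative assumptions, and the accounting here is deliberately cruder than in the paper.

```python
# Minimal sketch: Laplace mechanism for a bounded mean, on the full population
# versus on a random subsample with a rescaled privacy budget.
import numpy as np

rng = np.random.default_rng(3)
population = rng.uniform(0, 1, size=100_000)

def private_mean(x, epsilon):
    """Laplace mechanism for the mean of values in [0, 1] (replace-one sensitivity 1/len(x))."""
    return np.mean(x) + rng.laplace(scale=1.0 / (len(x) * epsilon))

eps = 0.5
full = private_mean(population, eps)

sample = rng.choice(population, size=10_000, replace=False)    # sampling rate r = 0.1
# Amplification heuristic from the abstract: protecting the sample output at roughly
# eps / r yields about eps-level protection with respect to the population.
sub = private_mean(sample, eps / 0.1)
print(full, sub, population.mean())
```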
Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the spread of influence from these seeds, and it has been widely investigated over the past two decades. In the canonical setting, the whole social network and its diffusion parameters are given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the set of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS), and present constant approximation algorithms for this task under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Compared with prior solutions, our network inference algorithm requires weaker assumptions and does not rely on maximum-likelihood estimation or convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.
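The sketch below (not the paper's algorithms) generates the kind of data the IMS task consumes: independent-cascade diffusions on a directed graph, recorded as the set of newly activated nodes at each step. The random graph, edge probability, and influence probability are illustrative assumptions.

```python
# Minimal sketch: passively observed cascades under an independent cascade model.
import numpy as np

rng = np.random.default_rng(5)
n_nodes, p_edge, p_influence = 50, 0.1, 0.3
adj = (rng.random((n_nodes, n_nodes)) < p_edge) & ~np.eye(n_nodes, dtype=bool)   # directed graph

def sample_cascade(seeds):
    """One independent-cascade diffusion; returns the set of newly activated nodes per step."""
    active, frontier, steps = set(seeds), set(seeds), [set(seeds)]
    while frontier:
        newly = set()
        for u in frontier:
            for v in np.flatnonzero(adj[u]):
                if v not in active and rng.random() < p_influence:
                    newly.add(int(v))
        active |= newly
        frontier = newly
        if newly:
            steps.append(newly)
    return steps

cascades = [sample_cascade(rng.choice(n_nodes, size=2, replace=False)) for _ in range(1000)]
```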
Because of continuous advances in mathematical programming, Mixed Integer Optimization has become competitive vis-a-vis popular regularization methods for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. We tackle these challenges, reducing the computational burden of tuning the sparsity bound (a parameter which is critical for effectiveness) and improving performance in the presence of feature collinearity and of signals that vary in nature and strength. Importantly, we render the approach efficient and effective in applications of realistic size and complexity - without resorting to relaxations or heuristics in the optimization, or abandoning rigorous cross-validation tuning. Computational viability and improved performance in subtler scenarios are achieved with a multi-pronged blueprint, leveraging characteristics of the Mixed Integer Programming framework and employing whitening, a data pre-processing step.
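As a rough illustration of the whitening pre-processing step mentioned above (not the paper's pipeline), the sketch below applies ZCA whitening to a collinear design matrix before any subset-selection step would be run; the simulated data and the eigenvalue floor are illustrative assumptions.

```python
# Minimal sketch: ZCA whitening of a collinear design matrix.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 10
Sigma = 0.5 * np.eye(p) + 0.5                                   # equicorrelated: collinear features
X = rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-10))) @ eigvec.T   # ZCA whitening matrix
Xw = Xc @ W                                                                 # whitened design

print(np.round(Xw.T @ Xw / n, 2))   # approximately the identity: collinearity removed
```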