We present an adaptive algorithm with one-sided error for the problem of junta testing for Boolean functions in the challenging distribution-free setting; its query complexity is $\tilde O(k)/\epsilon$. This improves the $\tilde O(k^2)/\epsilon$ upper bound of \cite{liu2019distribution}. In light of the $\Omega(k\log k)$ lower bound for junta testing under the uniform distribution by \cite{sauglam2018near}, our algorithm is nearly optimal. Under the standard uniform distribution, the optimal junta testing algorithm is designed mainly by bridging relevant variables and relevant blocks, and at the heart of its analysis is the Efron-Stein orthogonal decomposition. However, it is not clear how to generalize this tool to the general setting. Surprisingly, we find that juntas can be tested in a very simple and efficient way even in the distribution-free setting. Interestingly, the analysis does not directly rely on the Fourier tools that are commonly used in junta testing. Furthermore, we present a simpler algorithm with the same query complexity.
We study the MARINA method of Gorbunov et al. (2021) -- the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. The theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and the reliance on {\em independent} stochastic communication compression operators, which leads to a reduction in the number of transmitted bits within each communication round. In this paper we i) extend the theory of MARINA to support a much wider class of potentially {\em correlated} compressors, extending the reach of the method beyond the classical independent compressors setting, ii) show that a new quantity, for which we coin the name {\em Hessian variance}, allows us to significantly refine the original analysis of MARINA without any additional assumptions, and iii) identify a special class of correlated compressors based on the idea of {\em random permutations}, for which we coin the term Perm$K$, whose use leads to an $O(\sqrt{n})$ (resp. $O(1 + d/\sqrt{n})$) improvement in the theoretical communication complexity of MARINA in the low Hessian variance regime when $d\geq n$ (resp. $d \leq n$), where $n$ is the number of workers and $d$ is the number of parameters describing the model we are learning. We corroborate our theoretical results with carefully engineered synthetic experiments on minimizing the average of non-convex quadratics, and with autoencoder training on the MNIST dataset.
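As a concrete illustration of the random-permutation idea behind Perm$K$, here is a minimal sketch (assuming $d \ge n$ and $d$ divisible by $n$; this is an illustrative reconstruction, not the authors' implementation): all workers share one random permutation of the coordinates, and each worker transmits only its block of $d/n$ coordinates, rescaled so that the average of the compressed vectors remains an unbiased estimator of the average gradient.

```python
import numpy as np

def permk_compress(grads, rng):
    """Sketch of a PermK-style correlated compressor (assumes d >= n and d % n == 0).

    All workers share one random permutation of the coordinates; worker i keeps
    only its assigned block of d/n coordinates, scaled by n so that the average
    of the compressed vectors is unbiased for the average of the original ones.
    """
    n, d = grads.shape
    assert d % n == 0, "sketch assumes d is divisible by n"
    perm = rng.permutation(d)
    block = d // n
    compressed = np.zeros_like(grads)
    for i in range(n):
        idx = perm[i * block:(i + 1) * block]   # coordinates assigned to worker i
        compressed[i, idx] = n * grads[i, idx]  # rescale by n for unbiasedness
    return compressed

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))             # n = 4 workers, d = 8 parameters
c = permk_compress(g, rng)
print(np.mean(c, axis=0))               # unbiased estimate of np.mean(g, axis=0)
```

Each worker thus sends $d/n$ floats per round instead of $d$, which is the source of the communication savings; the correlation across workers comes from the shared permutation.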
A polynomial Turing kernel for a parameterized problem $P$ is a polynomial-time algorithm that solves $P$ using queries to an oracle for $P$, where the sizes of the queries are upper-bounded by some polynomial in the parameter. Here the term "polynomial" refers to the bound on the query sizes, as the running time of any kernel is required to be polynomial. One of the most important open goals in parameterized complexity is to understand the applicability and limitations of polynomial Turing kernels. As any fixed-parameter tractable problem admits a Turing kernel of some size, the focus has mostly been on determining which problems admit such kernels whose query sizes can indeed be bounded by some polynomial. In this paper we take a different approach, and instead focus on the number of queries that a Turing kernel uses, assuming it is restricted to polynomial-sized queries. Our study focuses on one of the main problems studied in parameterized complexity, the Clique problem: Given a graph $G$ and an integer $k$, determine whether there are $k$ pairwise adjacent vertices in $G$. We show that Clique parameterized by several structural parameters exhibits the following phenomena: - It admits polynomial Turing kernels which use a sublinear number of queries, namely $O(n/\log^c n)$ queries, where $n$ is the total size of the graph and $c$ is any constant. This holds even for a very restrictive type of Turing kernels which we call OR-kernels. - It does not admit polynomial Turing kernels which use $O(n^{1-\epsilon})$ queries, unless NP$\subseteq$coNP/poly. For proving the second item above, we develop a new framework for bounding the number of queries needed by polynomial Turing kernels. This framework is inspired by the standard lower-bound framework for Karp kernels, and while it is quite similar, it still requires some novel ideas to extend it to the Turing setting.
Distributed frameworks are widely used to handle massive data, where the sample size $n$ is very large and the data are often stored on $k$ different machines. For a random vector $X\in \mathbb{R}^p$ with expectation $\mu$, testing the mean vector $H_0: \mu=\mu_0$ vs $H_1: \mu\ne \mu_0$ for a given vector $\mu_0$ is a basic problem in statistics. Centralized test statistics incur heavy communication costs, which can be a burden when $p$ or $k$ is large. To reduce the communication cost, we propose distributed test statistics for this problem based on the divide-and-conquer technique, a commonly used approach for distributed statistical inference. Specifically, we extend two commonly used centralized test statistics to distributed ones that apply to the low- and high-dimensional cases, respectively. Comparing the power of the centralized test statistics with that of the distributed ones, we observe a fundamental tradeoff between communication costs and the power of the tests. This is quite different from the application of the divide-and-conquer technique in many other problems, such as estimation, where the associated distributed statistics can be as good as the centralized ones. Numerical results confirm the theoretical findings.
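To make the communication saving of the divide-and-conquer approach concrete, the following is a minimal sketch for the low-dimensional case in which every machine communicates a single scalar (a local Hotelling-type statistic) rather than a $p$-vector and a $p\times p$ covariance matrix. The aggregation rule and its chi-square calibration below are illustrative assumptions, not the exact statistics proposed in the paper.

```python
import numpy as np
from scipy import stats

def local_hotelling(x, mu0):
    """One machine's Hotelling-type statistic for H0: mu = mu0 (assumes p < n_j)."""
    n_j, p = x.shape
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)                    # local sample covariance
    diff = xbar - mu0
    return n_j * diff @ np.linalg.solve(S, diff)   # a single scalar is communicated

def distributed_test(machines, mu0, alpha=0.05):
    """Divide-and-conquer aggregation: sum of k local statistics, chi-square calibration.

    Purely illustrative of the communication saving; not the paper's exact procedure.
    """
    k = len(machines)
    p = machines[0].shape[1]
    T = sum(local_hotelling(x, mu0) for x in machines)
    threshold = stats.chi2.ppf(1 - alpha, df=k * p)  # asymptotic null distribution
    return T, T > threshold

rng = np.random.default_rng(1)
data = [rng.normal(loc=0.0, size=(500, 5)) for _ in range(10)]  # k = 10 machines, p = 5
print(distributed_test(data, mu0=np.zeros(5)))
```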
We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both $v^\star$ and $q^\star$ lie in the linear span when restricted to the states reachable from the starting state. Recently, Weisz et al. (2021b) showed that under (ii) the minimax query complexity of any planning algorithm is at least exponential in the horizon $H$ or in the feature dimension $d$ when the size $A$ of the action set can be chosen to be exponential in $\min(d,H)$. On the other hand, for the setting (i), Weisz et al. (2021a) introduced TensorPlan, a planner whose query cost is polynomial in all relevant quantities when the number of actions is fixed. Among other things, these two works left open the question of whether polynomial query complexity is possible when $A$ is subexponential in $\min(d,H)$. In this paper we answer this question in the negative: we show that an exponentially large lower bound holds when $A=\Omega(\min(d^{1/4},H^{1/2}))$, under either (i), (ii) or (iii). In particular, this implies a perhaps surprising exponential separation of query complexity compared to the work of Du et al. (2021), who prove a polynomial upper bound when (iii) holds for all states. Furthermore, we show that the upper bound of TensorPlan can be extended to hold under (iii) and, for MDPs with deterministic transitions and stochastic rewards, also under (ii).
We present a probabilistic algorithm to test whether a homogeneous polynomial ideal $I$ defining a scheme $X$ in $\mathbb{P}^n$ is radical, using Segre classes and other geometric notions from intersection theory. Its worst-case complexity depends on the geometry of $X$. If the scheme $X$ has reduced isolated primary components and no embedded components supported on the singular locus of $X_{\rm red}=V(\sqrt{I})$, then the worst-case complexity is doubly exponential in $n$; in all other cases the complexity is singly exponential. The realm of ideals for which our radical testing procedure requires only single exponential time includes examples which are often considered pathological, such as the ones drawn from the famous Mayr-Meyer set of ideals, which exhibit doubly exponential complexity for the ideal membership problem.
We characterize the first-order sensitivity of approximately recovering a low-rank matrix from linear measurements, a standard problem in compressed sensing. A special case covered by our analysis is approximating an incomplete matrix by a low-rank matrix. We give an algorithm for computing the associated condition number and demonstrate experimentally how the number of linear measurements affects it. In addition, we study the condition number of the rank-$r$ matrix approximation problem. It measures, in the Frobenius norm, by how much an infinitesimal perturbation to an arbitrary input matrix is amplified in the movement of its best rank-$r$ approximation. We give an explicit formula for this condition number, which shows that it depends on the relative singular value gap between the $r$th and $(r+1)$th singular values of the input matrix.
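The dependence on the singular value gap can also be observed numerically. The sketch below estimates the amplification factor by sampling random perturbations of a fixed size and tracking the movement of the best rank-$r$ approximation; this is a crude Monte Carlo proxy for the condition number, not the explicit formula derived in the paper.

```python
import numpy as np

def best_rank_r(A, r):
    """Best rank-r approximation in the Frobenius norm via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def sensitivity_estimate(A, r, eps=1e-6, trials=200, seed=0):
    """Worst observed ratio ||P_r(A + E) - P_r(A)||_F / ||E||_F over random E with ||E||_F = eps."""
    rng = np.random.default_rng(seed)
    base = best_rank_r(A, r)
    worst = 0.0
    for _ in range(trials):
        E = rng.normal(size=A.shape)
        E *= eps / np.linalg.norm(E)
        worst = max(worst, np.linalg.norm(best_rank_r(A + E, r) - base) / eps)
    return worst

# Shrinking the gap between the r-th and (r+1)-th singular values inflates the sensitivity.
for gap in (1.0, 0.1, 0.01):
    A = np.diag([3.0, 2.0, 1.0 + gap, 1.0])
    print(gap, sensitivity_estimate(A, r=3))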
We study efficient PAC learning of homogeneous halfspaces in $\mathbb{R}^d$ in the presence of the malicious noise of Valiant (1985). This is a challenging noise model, and only recently has a near-optimal noise-tolerance bound been established, under the mild condition that the unlabeled data distribution is isotropic log-concave. However, it remains unsettled how to simultaneously obtain the optimal sample complexity. In this work, we present a new analysis of the algorithm of Awasthi et al. (2017) and show that it essentially achieves the near-optimal sample complexity bound of $\tilde{O}(d)$, improving upon the best known result of $\tilde{O}(d^2)$. Our main ingredient is a novel incorporation of a matrix Chernoff-type inequality to bound the spectrum of an empirical covariance matrix for well-behaved distributions, in conjunction with a careful exploration of the localization schemes of Awasthi et al. (2017). We further extend the algorithm and analysis to the more general and stronger nasty noise model of Bshouty et al. (2002), showing that it is still possible to achieve near-optimal noise tolerance and sample complexity in polynomial time.
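The covariance-concentration phenomenon that a matrix Chernoff-type argument captures can be illustrated with a small experiment; here a standard Gaussian serves as a stand-in for an isotropic log-concave distribution, and the snippet is purely illustrative rather than part of the paper's analysis.

```python
import numpy as np

# With roughly O~(d) samples from an isotropic, well-behaved distribution, the
# empirical covariance is already spectrally close to the identity, and the
# deviation shrinks roughly like sqrt(d/m) as the sample size m grows.
rng = np.random.default_rng(0)
d = 200
for m in (2 * d, 8 * d, 32 * d):
    X = rng.normal(size=(m, d))
    emp_cov = X.T @ X / m
    dev = np.linalg.norm(emp_cov - np.eye(d), ord=2)   # spectral-norm deviation
    print(m, round(dev, 3))
```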
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model, in which query access comes in the form of $\langle X_i, A\rangle:=\mathrm{tr}(X_i^\top A)$; perhaps surprisingly, these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.
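For concreteness, a sensing-model query $\langle X, A\rangle = \mathrm{tr}(X^\top A)$ is simply the entrywise inner product between the query matrix and $A$, so a single-entry query is the special case of an indicator query matrix. The snippet below only illustrates this access model, not the testing algorithms themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))

def sensing_query(X, A):
    """Sensing-model query <X, A> := tr(X^T A), i.e. the entrywise inner product."""
    return np.trace(X.T @ A)            # equivalently np.sum(X * A)

i, j = 2, 4
X = np.zeros_like(A)
X[i, j] = 1.0                           # indicator matrix e_i e_j^T
print(sensing_query(X, A), A[i, j])     # the two values coincide
```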
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
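For intuition about the global-regularity result, a generic Gaussian-smoothing estimator of the kind DRS builds on can be sketched as follows. This is a stand-alone Monte Carlo sketch of local smoothing with the standard two-point gradient estimator, not the full distributed DRS protocol.

```python
import numpy as np

def smoothed_value_and_grad(f, x, gamma=0.1, samples=100, rng=None):
    """Monte Carlo estimate of the Gaussian smoothing f_gamma(x) = E[f(x + gamma*Z)]
    and of its gradient, using the antithetic two-point estimator."""
    if rng is None:
        rng = np.random.default_rng()
    d = x.shape[0]
    vals, grads = [], []
    for _ in range(samples):
        z = rng.normal(size=d)
        fp, fm = f(x + gamma * z), f(x - gamma * z)
        vals.append(0.5 * (fp + fm))
        grads.append((fp - fm) / (2.0 * gamma) * z)   # unbiased for grad f_gamma(x)
    return np.mean(vals), np.mean(grads, axis=0)

# Example: the non-smooth objective f(x) = ||x||_1 becomes differentiable after smoothing.
f = lambda x: np.abs(x).sum()
val, grad = smoothed_value_and_grad(f, x=np.array([1.0, -0.5, 0.0]), gamma=0.05,
                                    rng=np.random.default_rng(0))
print(val, grad)
```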
Methods that align distributions by minimizing an adversarial distance between them have recently achieved impressive results. However, these approaches are difficult to optimize with gradient descent, and they often do not converge well without careful hyperparameter tuning and proper initialization. We investigate whether turning the adversarial min-max problem into a plain minimization problem by replacing the maximization part with its dual improves the quality of the resulting alignment, and we explore its connections to Maximum Mean Discrepancy. Our empirical results suggest that using the dual formulation for the restricted family of linear discriminators results in more stable convergence to a desirable solution when compared with the performance of a primal min-max GAN-like objective and an MMD objective under the same restrictions. We test our hypothesis on the problem of aligning two synthetic point clouds on a plane and on a real-image domain adaptation problem on digits. In both cases, the dual formulation yields an iterative procedure with more stable and monotonic improvement over time.
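As a toy illustration of why the dual helps for linear discriminators: for a norm-bounded linear discriminator $f_w(x) = w^\top x$, the inner maximization has the closed-form value $\|\mathbb{E}_P[x] - \mathbb{E}_Q[x]\|_2$ (the linear-kernel MMD), so alignment reduces to a plain minimization with no adversary to train. The translation-only alignment map below is an illustrative assumption, far simpler than the maps used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))   # target point cloud
Q = rng.normal(loc=[0.0,  0.0], scale=0.5, size=(500, 2))   # cloud to be aligned

shift = np.zeros(2)                       # alignment parameters (just a translation here)
lr = 0.5
for step in range(200):
    # closed-form "dual" objective: 0.5 * || mean(P) - mean(Q + shift) ||^2
    diff = P.mean(axis=0) - (Q + shift).mean(axis=0)
    shift += lr * diff                    # gradient descent step (gradient is -diff)

print(shift, P.mean(axis=0) - Q.mean(axis=0))   # learned shift matches the mean gap
```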