Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with competing forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
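As a rough illustration of the cost contrast described above, the following minimal sketch differentiates the toy nonsmooth program $f(x) = \sum_j \mathrm{relu}(Wx)_j$ in two ways: one backward sweep, and $n$ forward sweeps (one per coordinate direction). The program, the helper names, and the choice $\mathrm{relu}'(0) = 0$ are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def relu_prime(z):
    # Assumed selection at the nonsmooth point z = 0: use the value 0 there.
    return (z > 0).astype(float)

def f(W, x):
    # Toy nonsmooth program: sum of ReLU activations of a linear map.
    return np.maximum(W @ x, 0.0).sum()

def backward_gradient(W, x):
    # One backward sweep: a single transposed matrix-vector product.
    z = W @ x
    return W.T @ relu_prime(z)

def forward_gradient(W, x):
    # n forward sweeps: one directional derivative per coordinate direction.
    z = W @ x
    mask = relu_prime(z)
    n = x.size
    grad = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        grad[i] = mask @ (W @ e)  # tangent e pushed through the program
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))
x = rng.standard_normal(4)
g_back = backward_gradient(W, x)
g_fwd = forward_gradient(W, x)
```

Both routines return the same vector here, but the forward version performs a number of sweeps proportional to the input dimension, while the backward version performs one.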
In this work, we adopt the Wyner common information framework for unsupervised multi-view representation learning. Within this framework, we propose two novel formulations that enable the development of computationally efficient solvers based on the alternating minimization principle. The first formulation, referred to as the {\em variational form}, has complexity that grows linearly with the number of views and is based on a tight variational-inference surrogate bound coupled with a Lagrangian optimization objective function. The second formulation, i.e., the {\em representational form}, is shown to include known results as special cases. Here, we develop a tailored version of the alternating direction method of multipliers (ADMM) algorithm for solving the resulting non-convex optimization problem. In both cases, the convergence of the proposed solvers is established in certain relevant regimes. Furthermore, our empirical results demonstrate the effectiveness of the proposed methods as compared with the state-of-the-art solvers. In a nutshell, the proposed solvers offer computational efficiency, theoretical convergence guarantees, scalable complexity with the number of views, and exceptional accuracy as compared with the state-of-the-art techniques. Our focus here is on the discrete case; our results for continuous distributions are reported elsewhere.
Causal Discovery (CD) is the process of identifying the cause-effect relationships among variables from data. Over the years, several methods have been developed, primarily based on the statistical properties of data, to uncover the underlying causal mechanism. In this study, we introduce the common terminology of causal discovery and provide a comprehensive discussion of the approaches designed to identify the causal edges in different settings. We further discuss some of the benchmark datasets available for evaluating the performance of causal discovery algorithms, the tools readily available for performing causal discovery, and the common metrics used to evaluate these methods. Finally, we conclude by presenting the common challenges involved in CD and discussing its applications in multiple areas of interest.
Suppose we are given an $n$-dimensional order-3 symmetric tensor $T \in (\mathbb{R}^n)^{\otimes 3}$ that is the sum of $r$ random rank-1 terms. Recovering the rank-1 components is possible in principle when $r \lesssim n^2$, but polynomial-time algorithms are only known in the regime $r \ll n^{3/2}$. Similar "statistical-computational gaps" occur in many high-dimensional inference tasks, and in recent years there has been a flurry of work on explaining the apparent computational hardness in these problems by proving lower bounds against restricted (yet powerful) models of computation such as statistical queries (SQ), sum-of-squares (SoS), and low-degree polynomials (LDP). However, no such prior work exists for tensor decomposition, largely because its hardness does not appear to be explained by a "planted versus null" testing problem. We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest (to break symmetry), and the components are drawn uniformly from the hypercube. We resolve the computational complexity in the LDP model: $O(\log n)$-degree polynomial functions of the tensor entries can accurately estimate the largest component when $r \ll n^{3/2}$ but fail to do so when $r \gg n^{3/2}$. This provides rigorous evidence suggesting that the best known algorithms for tensor decomposition cannot be improved, at least by known approaches. A natural extension of the result holds for tensors of any fixed order $k \ge 3$, in which case the LDP threshold is $r \sim n^{k/2}$.
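To make the tensor model above concrete, here is a minimal sketch that builds an order-3 symmetric tensor as a sum of $r$ rank-1 terms with components drawn uniformly from $\{-1,+1\}^n$, where the first component is given a slightly larger weight. The function name, the `boost` parameter, and the lack of any normalization are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def random_symmetric_tensor(n, r, boost=1.1, seed=0):
    """Return (T, A, weights) with T = sum_k weights[k] * a_k (x) a_k (x) a_k.

    `boost` is a hypothetical symmetry-breaking factor applied to the
    first component only; all other components have weight 1.
    """
    rng = np.random.default_rng(seed)
    A = rng.choice([-1.0, 1.0], size=(r, n))  # rows are hypercube vectors
    weights = np.ones(r)
    weights[0] = boost
    T = np.einsum("k,ki,kj,kl->ijl", weights, A, A, A)
    return T, A, weights

T, A, weights = random_symmetric_tensor(n=6, r=4)
```

The resulting array is invariant under any permutation of its three indices, which is the symmetry assumed in the problem statement.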
We study the problem of weak recovery for the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In the HSBM, a random hypergraph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label. By analyzing contraction of a non-Shannon (symmetric-KL) information measure, we prove that for $r=3,4$, weak recovery is impossible below the Kesten-Stigum threshold. Prior work of Pal and Zhu (2021) established that weak recovery in HSBM is always possible above the Kesten-Stigum threshold. Consequently, there is no information-computation gap for these $r$, which (partially) resolves a conjecture of Angelini et al. (2015). To our knowledge this is the first impossibility result for HSBM weak recovery. As usual, we reduce the study of non-recovery of HSBM to the study of non-reconstruction in a related broadcasting on hypertrees (BOHT) model. While we show that BOHT's reconstruction threshold coincides with Kesten-Stigum for $r=3,4$, surprisingly, we demonstrate that for $r\ge 7$ reconstruction is possible even below the Kesten-Stigum threshold. This shows an interesting phase transition in the parameter $r$, and suggests that for $r\ge 7$, there might be an information-computation gap for the HSBM. For $r=5,6$ and large degree we propose an approach for showing non-reconstruction below the Kesten-Stigum threshold, suggesting that $r=7$ is the correct threshold for the onset of the new phase. We admit that our analysis of the $r=4$ case depends on a numerically-verified inequality.
In this work, we propose and study a preconditioned framework with a graphic Ginzburg-Landau functional for image segmentation and data clustering by parallel computing. Solving nonlocal models is usually challenging due to the heavy computational burden. For the nonconvex and nonlocal variational functional, we propose several damped Jacobi and generalized Richardson preconditioners for the large-scale linear systems within a difference of convex functions algorithm framework. They are well suited to parallel computing on GPUs and can reduce the computational cost. Our framework also provides flexible step sizes with a global convergence guarantee. Numerical experiments show that the proposed algorithms are very competitive compared to the singular value decomposition based spectral method.
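For readers unfamiliar with the damped Jacobi idea mentioned above, the following minimal sketch applies the textbook damped Jacobi iteration $x_{k+1} = x_k + \omega D^{-1}(b - A x_k)$ to a small linear system. It is a generic illustration, not the preconditioned scheme of the paper; the matrix, the damping factor `omega`, and the iteration count are assumptions chosen for the example.

```python
import numpy as np

def damped_jacobi(A, b, omega=0.8, iters=200):
    """Textbook damped Jacobi iteration for A x = b.

    Each update only needs the diagonal of A and one matrix-vector
    product, so all components can be updated in parallel.
    """
    d = np.diag(A)                      # diagonal entries D
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = x + omega * (b - A @ x) / d
    return x

# Small strictly diagonally dominant test system (illustrative only).
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = damped_jacobi(A, b)
```

Because every component update is independent of the others within a sweep, this kind of iteration maps naturally onto GPU hardware.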
Numerical differentiation of a function, contaminated with noise, over the unit interval $[0,1] \subset \mathbb{R}$ by inverting the simple integration operator $J:L^2([0,1]) \to L^2([0,1])$ defined as $[Jx](s):=\int_0^s x(t) dt$ is discussed extensively in the literature. The complete singular system of the compact operator $J$ is explicitly given with singular values $\sigma_n(J)$ asymptotically proportional to $1/n$, which indicates a degree {\sl one} of ill-posedness for this inverse problem. We recall the concept of the degree of ill-posedness for linear operator equations with compact forward operators in Hilbert spaces. In contrast to the one-dimensional case with operator $J$, there is little material available about the analysis of the $d$-dimensional case, where the compact integral operator $J_d:L^2([0,1]^d) \to L^2([0,1]^d)$ defined as $[J_d\,x](s_1,\ldots,s_d):=\int_0^{s_1}\ldots\int_0^{s_d} x(t_1,\ldots,t_d)\, dt_d\ldots dt_1$ over the unit $d$-cube is to be inverted. This inverse problem of mixed differentiation $x(s_1,\ldots,s_d)=\frac{\partial^d}{\partial s_1 \ldots \partial s_d} y(s_1,\ldots ,s_d)$ is of practical interest, for example in statistics, when copula densities have to be verified from empirical copulas over $[0,1]^d \subset \mathbb{R}^d$. In this note, we prove that the non-increasingly ordered singular values $\sigma_n(J_d)$ of the operator $J_d$ have asymptotics of the form $\frac{(\log n)^{d-1}}{n}$, which shows that the degree of ill-posedness stays at one, even though an additional logarithmic factor occurs. Some more discussion refers to the special case $d=2$ for characterizing the range $\mathcal{R}(J_2)$ of the operator $J_2$.
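The singular values discussed above can be explored numerically. The sketch below uses the known closed form $\sigma_n(J) = \frac{2}{(2n-1)\pi}$ for the one-dimensional integration operator and, for $d=2$, the fact that $J_2$ is the tensor product of $J$ with itself, so its singular values are the products $\sigma_i(J)\,\sigma_j(J)$. The function names and the truncation level `m` are assumptions made for this example.

```python
import numpy as np

def sigma_1d(n):
    """Singular values of J on L^2([0,1]): 2 / ((2n - 1) * pi), n = 1, 2, ..."""
    n = np.asarray(n, dtype=float)
    return 2.0 / ((2.0 * n - 1.0) * np.pi)

def sigma_2d(m):
    """Products sigma_i * sigma_j for i, j = 1..m, sorted non-increasingly.

    Only the entries not smaller than sigma_1 * sigma_m are guaranteed to
    be in their final position, because products with an index above m
    are missing from this truncated grid.
    """
    s = sigma_1d(np.arange(1, m + 1))
    return np.sort(np.outer(s, s).ravel())[::-1]

s2 = sigma_2d(200)
n = np.arange(2, 201)                   # skip n = 1, where log n = 0
ratio = s2[1:200] * n / np.log(n)       # compare with (log n) / n for d = 2
```

Inspecting `ratio` gives a quick empirical impression of the $\frac{\log n}{n}$ behaviour stated for $d=2$, although a truncated computation of this kind is of course no substitute for the proof.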
This paper contributes tail bounds of the age-of-information of a general class of parallel systems and explores their potential. Parallel systems arise in relevant cases, such as in multi-band mobile networks, multi-technology wireless access, or multi-path protocols, just to name a few. Typically, control over each communication channel is limited, and random service outages and congestion cause buffering that impairs the age-of-information. The parallel use of independent channels promises a remedy, since outages on one channel may be compensated for by another. Surprisingly, for the well-known case of M$\mid$M$\mid$1 queues we find the opposite: pooling capacity in one channel performs better than a parallel system with the same total capacity. A generalization is not possible since there are no solutions for other types of parallel queues at hand. In this work, we prove a dual representation of age-of-information in min-plus algebra that connects to queueing models known from the theory of effective bandwidth/capacity and the stochastic network calculus. Exploiting these methods, we derive tail bounds of the age-of-information of parallel G$\mid$G$\mid$1 queues. In addition to parallel classical queues, we investigate Markov channels where, depending on the memory of the channel, we show the true advantage of parallel systems. We continue to investigate this new finding and provide insight into when capacity should be pooled in one channel and when independent parallel channels perform better. We complement our analysis with simulation results and evaluate different update policies, scheduling policies, and the use of heterogeneous channels, which is most relevant for the latest multi-band networks.
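As a minimal illustration of the quantity studied above, the following sketch computes the age-of-information seen by a receiver behind a single FIFO queue: the age at time $t$ is $t$ minus the generation time of the freshest update delivered so far. The function names, the single-queue setting, and the sample arrival and service times are assumptions for the example; the paper's parallel systems and tail bounds are not reproduced here.

```python
import numpy as np

def fifo_departures(arrivals, services):
    """Departure times of a single-server FIFO queue (Lindley-type recursion)."""
    departures = np.empty_like(arrivals, dtype=float)
    prev = 0.0
    for i, (a, s) in enumerate(zip(arrivals, services)):
        prev = max(a, prev) + s         # wait for the server, then get served
        departures[i] = prev
    return departures

def age_at(t, arrivals, departures):
    """Age of information at time t; NaN if no update has been delivered yet."""
    delivered = departures <= t
    if not delivered.any():
        return float("nan")
    return t - arrivals[delivered].max()

# Illustrative input: updates generated at times 0, 1, 2, each needing 2 time units.
arrivals = np.array([0.0, 1.0, 2.0])
services = np.array([2.0, 2.0, 2.0])
departures = fifo_departures(arrivals, services)
ages = [age_at(t, arrivals, departures) for t in (3.0, 4.0, 5.0)]
```

Replacing the fixed arrays with random inter-arrival and service times would give a simple simulation of the kind that can be used to compare against analytical bounds.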
When is heterogeneity in the composition of an autonomous robotic team beneficial and when is it detrimental? We investigate and answer this question in the context of a minimally viable model that examines the role of heterogeneous speeds in perimeter defense problems, where defenders share a total allocated speed budget. We consider two distinct problem settings and develop strategies based on dynamic programming and on local interaction rules. We present a theoretical analysis of both approaches and our results are extensively validated using simulations. Interestingly, our results demonstrate that the viability of heterogeneous teams depends on the amount of information available to the defenders. Moreover, our results suggest a universality property: across a wide range of problem parameters the optimal ratio of the speeds of the defenders remains nearly constant.
Model complexity is a fundamental problem in deep learning. In this paper we conduct a systematic overview of the latest studies on model complexity in deep learning. Model complexity of deep learning can be categorized into expressive capacity and effective model complexity. We review the existing studies on those two categories along four important factors, including model framework, model size, optimization process and data complexity. We also discuss the applications of deep learning model complexity including understanding model generalization capability, model optimization, and model selection and design. We conclude by proposing several interesting future directions.
Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.