We introduce an error resilient distributed computing method based on an extension of the channel polarization phenomenon to distributed algorithms. The method leverages an algorithmic split operation that transforms two identical compute nodes to slow and fast workers, which parallels the channel split operation in Polar Codes. This operation preserves the average runtime, analogous to the conservation of Shannon capacity in channel polarization. By leveraging a recursive construction in a similar spirit to the Fast Fourier Transform, this method synthesizes virtual compute nodes with dispersed return time distributions, which we call computational polarization. We show that the runtime distributions form a functional martingale processes, identify their limiting distributions in closed-form expressions together with non-asymptotic convergence rates, and prove strong convergence results in Banach spaces. We provide an information-theoretic lower bound on the overall runtime of any coded computation method and show that the computational polarization approach asymptotically achieves the optimal runtime for computing linear functions. An important advantage is the near linear time decoding procedure, which is significantly cheaper than Maximum Distance Separable codes.
We present and implement an algorithm for computing the invariant circle and the corresponding stable manifolds for 2-dimensional maps. The algorithm is based on the parameterization method, and it is backed up by an a-posteriori theorem established in [YdlL21]. The algorithm works irrespective of whether the internal dynamics in the invariant circle is a rotation or it is phase-locked. The algorithm converges quadratically and the number of operations and memory requirements for each step of the iteration is linear with respect to the size of the discretization. We also report on the result of running the implementation in some standard models to uncover new phenomena. In particular, we explored a bundle merging scenario in which the invariant circle loses hyperbolicity because the angle between the stable directions and the tangent becomes zero even if the rates of contraction are separated. We also discuss and implement a generalization of the algorithm to 3 dimensions, and implement it on the 3-dimensional Fattened Arnold Family (3D-FAF) map with non-resonant eigenvalues and present numerical results.
Considering a probability distribution over parameters is known as an efficient strategy to learn a neural network with non-differentiable activation functions. We study the expectation of a probabilistic neural network as a predictor by itself, focusing on the aggregation of binary activated neural networks with normal distributions over real-valued weights. Our work leverages a recent analysis derived from the PAC-Bayesian framework that derives tight generalization bounds and learning procedures for the expected output value of such an aggregation, which is given by an analytical expression. While the combinatorial nature of the latter has been circumvented by approximations in previous works, we show that the exact computation remains tractable for deep but narrow neural networks, thanks to a dynamic programming approach. This leads us to a peculiar bound minimization learning algorithm for binary activated neural networks, where the forward pass propagates probabilities over representations instead of activation values. A stochastic counterpart of this new neural networks training scheme that scales to wider architectures is proposed.
In this work, we consider the optimization formulation for symmetric tensor decomposition recently introduced in the Subspace Power Method (SPM) of Kileel and Pereira. Unlike popular alternative functionals for tensor decomposition, the SPM objective function has the desirable properties that its maximal value is known in advance, and its global optima are exactly the rank-1 components of the tensor when the input is sufficiently low-rank. We analyze the non-convex optimization landscape associated with the SPM objective. Our analysis accounts for working with noisy tensors. We derive quantitative bounds such that any second-order critical point with SPM objective value exceeding the bound must equal a tensor component in the noiseless case, and must approximate a tensor component in the noisy case. For decomposing tensors of size $D^{\times m}$, we obtain a near-global guarantee up to rank $\widetilde{o}(D^{\lfloor m/2 \rfloor})$ under a random tensor model, and a global guarantee up to rank $\mathcal{O}(D)$ assuming deterministic frame conditions. This implies that SPM with suitable initialization is a provable, efficient, robust algorithm for low-rank symmetric tensor decomposition. We conclude with numerics that show a practical preferability for using the SPM functional over a more established counterpart.
The error threshold of a one-parameter family of quantum channels is defined as the largest noise level such that the quantum capacity of the channel remains positive. This in turn guarantees the existence of a quantum error correction code for noise modeled by that channel. Discretizing the single-qubit errors leads to the important family of Pauli quantum channels; curiously, multipartite entangled states can increase the threshold of these channels beyond the so-called hashing bound, an effect termed superadditivity of coherent information. In this work, we divide the simplex of Pauli channels into one-parameter families and compute numerical lower bounds on their error thresholds. We find substantial increases of error thresholds relative to the hashing bound for large regions in the Pauli simplex corresponding to biased noise, which is a realistic noise model in promising quantum computing architectures. The error thresholds are computed on the family of graph states, a special type of stabilizer state. In order to determine the coherent information of a graph state, we devise an algorithm that exploits the symmetries of the underlying graph, resulting in a substantial computational speed-up. This algorithm uses tools from computational group theory and allows us to consider symmetric graph states on a large number of vertices. Our algorithm works particularly well for repetition codes and concatenated repetition codes (or cat codes), for which our results provide the first comprehensive study of superadditivity for arbitrary Pauli channels. In addition, we identify a novel family of quantum codes based on tree graphs. The error thresholds of these tree graph states outperform repetition and cat codes in large regions of the Pauli simplex, and hence form a new code family with desirable error correction properties.
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth. We propose an adaptive version of the Condat-V\~u algorithm, which alternates between primal gradient steps and dual proximal steps. The method achieves stepsize adaptivity through a simple rule involving $\|A\|$ and the norm of recently computed gradients of $f$. Under standard assumptions, we prove an $\mathcal{O}(k^{-1})$ ergodic convergence rate. Furthermore, when $f$ is also locally strongly convex and $A$ has full row rank we show that our method converges with a linear rate. Numerical experiments are provided for illustrating the practical performance of the algorithm.
In the realm of deep learning, the Fisher information matrix (FIM) gives novel insights and useful tools to characterize the loss landscape, perform second-order optimization, and build geometric learning theories. The exact FIM is either unavailable in closed form or too expensive to compute. In practice, it is almost always estimated based on empirical samples. We investigate two such estimators based on two equivalent representations of the FIM -- both unbiased and consistent. Their estimation quality is naturally gauged by their variance given in closed form. We analyze how the parametric structure of a deep neural network can affect the variance. The meaning of this variance measure and its upper bounds are then discussed in the context of deep learning.
Much stringent reliability and processing latency requirements in ultra-reliable-low-latency-communication (URLLC) traffic make the design of linear massive multiple-input-multiple-output (M-MIMO) receivers becomes very challenging. Recently, Bayesian concept has been used to increase the detection reliability in minimum-mean-square-error (MMSE) linear receivers. However, the latency processing time is a major concern due to the exponential complexity of matrix inversion operations in MMSE schemes. This paper proposes an iterative M-MIMO receiver that is developed by using a Bayesian concept and a parallel interference cancellation (PIC) scheme, referred to as a linear Bayesian learning (LBL) receiver. PIC has a linear complexity as it uses a combination of maximum ratio combining (MRC) and decision statistic combining (DSC) schemes to avoid matrix inversion operations. Simulation results show that the bit-error-rate (BER) and latency processing performances of the proposed receiver outperform the ones of MMSE and best Bayesian-based receivers by minimum $2$ dB and $19$ times for various M-MIMO system configurations.
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, either strongly convex or smooth or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup such as proximal friendly functions, time-varying graphs, improvement of the condition numbers.