A continuous-time average consensus system is a linear dynamical system defined over a graph, where each node has its own state value that evolves according to a simultaneous linear differential equation. A node is allowed to interact with neighboring nodes. Average consensus is the phenomenon in which all the state values converge to the average of the initial state values. In this paper, we assume that a node can communicate with neighboring nodes through an additive white Gaussian noise channel. We first formulate the noisy average consensus system by using a stochastic differential equation (SDE), which allows us to use the Euler-Maruyama method, a numerical technique for solving SDEs. By studying the stochastic behavior of the residual error of the Euler-Maruyama method, we arrive at the covariance evolution equation. The analysis of the residual error leads to a compact formula for the mean squared error (MSE), which shows that the sum of the inverse eigenvalues of the Laplacian matrix is the dominant factor influencing the MSE. Furthermore, we propose optimization problems aimed at minimizing the MSE at a given target time, and introduce a deep unfolding-based optimization method to solve these problems. The quality of the solution is validated by numerical experiments.
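As a rough illustration of the Euler-Maruyama discretization mentioned above, the following sketch simulates a noisy consensus system on a small graph. It assumes the simplified model $dx = -Lx\,dt + \sigma\,dW$, with independent additive noise per node and a constant scale `sigma`. The abstract does not specify the exact noise model, and the function names and parameters (`laplacian`, `euler_maruyama_consensus`, `eta`) are hypothetical.

```python
import math
import random


def laplacian(n, edges):
    """Graph Laplacian L = D - A of an undirected graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def euler_maruyama_consensus(L, x0, sigma, eta, steps, rng):
    """Iterate x <- x - eta * L x + sigma * sqrt(eta) * z, with z ~ N(0, I).

    Assumed model: per-node additive Gaussian noise (a simplification).
    """
    n = len(x0)
    x = list(x0)
    for _ in range(steps):
        drift = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [
            x[i] - eta * drift[i] + sigma * math.sqrt(eta) * rng.gauss(0.0, 1.0)
            for i in range(n)
        ]
    return x


def mse_to_average(x, x0):
    """Mean squared deviation of the states x from the average of x0."""
    avg = sum(x0) / len(x0)
    return sum((xi - avg) ** 2 for xi in x) / len(x)


if __name__ == "__main__":
    ring = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
    x0 = [1.0, 2.0, 3.0, 4.0]
    x = euler_maruyama_consensus(ring, x0, 0.1, 0.01, 2000, random.Random(0))
    print(mse_to_average(x, x0))
```

With `sigma = 0` the iteration reduces to the noiseless consensus dynamics and the states approach the initial average. With `sigma > 0` the error no longer vanishes, which is the regime the MSE analysis addresses.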
A multiplicity queue is a concurrently-defined data type which relaxes the conditions of a linearizable FIFO queue to allow concurrent Dequeue instances to return the same value. It would seem that this should allow faster implementations, as processes should not need to wait as long to learn about concurrent operations at remote processes, and previous work has shown that multiplicity queues are computationally less complex than the unrelaxed version. Intriguingly, recent work has shown that there is, in fact, not much speedup possible versus an unrelaxed queue implementation. Seeking to understand this difference between intuition and real behavior, we extend that work, increasing the lower bound for uniform algorithms. Further, we outline a path forward toward building proofs for even higher lower bounds, allowing us to hypothesize that the worst-case time to Dequeue approaches the maximum message delay, which is similar to the time required for an unrelaxed Dequeue. We also give an upper bound for a special case to show that our bounds are tight at that point. To achieve our lower bounds, we use extended shifting arguments, which have rarely been used but allow larger lower bounds than traditional shifting arguments. We use these in a series of inductive indistinguishability proofs which allow us to extend our proofs beyond the usual limitations of shifting arguments. This proof structure is an interesting contribution independent of the main result, as developing new lower bound proof techniques may have many uses in future work.
One of the key challenges towards the deployment of over-the-air federated learning (AirFL) is the design of mechanisms that can comply with the power and bandwidth constraints of the shared channel, while causing minimum deterioration to the learning performance as compared to baseline noiseless implementations. For additive white Gaussian noise (AWGN) channels with instantaneous per-device power constraints, prior work has demonstrated the optimality of a power control mechanism based on norm clipping. This was done through the minimization of an upper bound on the optimality gap for smooth learning objectives satisfying the Polyak-{\L}ojasiewicz (PL) condition. In this paper, we make two contributions to the development of AirFL based on norm clipping, which we refer to as AirFL-Clip. First, we provide a convergence bound for AirFL-Clip that applies to general smooth and non-convex learning objectives. Unlike existing results, the derived bound is free from run-specific parameters, thus supporting an offline evaluation. Second, we extend AirFL-Clip to include Top-k sparsification and linear compression. For this generalized protocol, referred to as AirFL-Clip-Comp, we derive a convergence bound for general smooth and non-convex learning objectives. We argue, and demonstrate via experiments, that the only time-varying quantities present in the bound can be efficiently estimated offline by leveraging the well-studied properties of sparse recovery algorithms.
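The three device-side operations named above (norm clipping, Top-k sparsification, and linear compression) can be sketched as follows. This is a minimal illustration, not the protocol from the paper. The order of operations in `device_payload`, the compression matrix `A`, and all function names are assumptions made for the example.

```python
import math


def clip_by_norm(vec, max_norm):
    """Rescale vec so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [v * scale for v in vec]


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def linear_compress(A, vec):
    """Compute the matrix-vector product A @ vec (A given as a list of rows)."""
    return [sum(a * v for a, v in zip(row, vec)) for row in A]


def device_payload(grad, k, A, max_norm):
    """Assumed pipeline: sparsify, compress, then clip to the power budget."""
    return clip_by_norm(linear_compress(A, top_k(grad, k)), max_norm)


if __name__ == "__main__":
    A = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]  # hypothetical 2x3 compression matrix
    print(device_payload([0.1, -5.0, 2.0], 2, A, 1.0))
```

Clipping last keeps the transmitted vector within the norm budget regardless of how the compression matrix scales the sparsified gradient. Other orderings are possible, and the abstract does not specify one.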
We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. We define the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries. We show that for a restricted class of hierarchical queries, our approach achieves worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
Generalized approximate message passing (GAMP) is a computationally efficient algorithm for estimating an unknown signal $w_0\in\mathbb{R}^N$ from a random linear measurement $y= Xw_0 + \epsilon\in\mathbb{R}^M$, where $X\in\mathbb{R}^{M\times N}$ is a known measurement matrix and $\epsilon$ is the noise vector. The salient feature of GAMP is that it can provide an unbiased estimator $\hat{r}^{\rm G}\sim\mathcal{N}(w_0, \hat{s}^2I_N)$, which can be used for various hypothesis-testing methods. In this study, we consider the bootstrap average of an unbiased estimator of GAMP for the elastic net. By numerically analyzing the state evolution of \emph{approximate message passing with resampling}, which has been proposed for computing bootstrap statistics of the elastic net estimator, we investigate when the bootstrap averaging reduces the variance of the unbiased estimator and the effect of optimizing the size of each bootstrap sample and hyperparameter of the elastic net regularization in the asymptotic setting $M, N\to\infty, M/N\to\alpha\in(0,\infty)$. The results indicate that bootstrap averaging effectively reduces the variance of the unbiased estimator when the actual data generation process is inconsistent with the sparsity assumption of the regularization and the sample size is small. Furthermore, we find that when $w_0$ is less sparse, and the data size is small, the system undergoes a phase transition. The phase transition indicates the existence of the region where the ensemble average of unbiased estimators of GAMP for the elastic net norm minimization problem yields the unbiased estimator with the minimum variance.
The problem of designing distributed optimization algorithms that are resilient to Byzantine adversaries has received significant attention. For the Byzantine-resilient distributed optimization problem, the goal is to (approximately) minimize the average of the local cost functions held by the regular (non-adversarial) agents in the network. In this paper, we provide a general algorithmic framework for Byzantine-resilient distributed optimization which includes some state-of-the-art algorithms as special cases. We analyze the convergence of algorithms within the framework, and derive a geometric rate of convergence of all regular agents to a ball around the optimal solution (whose size we characterize). Furthermore, we show that approximate consensus can be achieved geometrically fast under some minimal conditions. Our analysis provides insights into the relationship among the convergence region, the distance between regular agents' values, the step-size, and properties of the agents' functions for Byzantine-resilient distributed optimization.
Fault-tolerant consensus is about reaching agreement on some of the input values in a limited time by non-faulty autonomous processes, despite failures of processes or of the communication medium. This problem is particularly challenging and costly against an adaptive adversary with full information. Bar-Joseph and Ben-Or (PODC'98) were the first to prove an absolute lower bound $\Omega(\sqrt{n/\log n})$ on the expected time complexity of consensus in any classic (i.e., randomized or deterministic) message-passing network with $n$ processes succeeding with probability $1$ against such a strong adaptive adversary crashing processes. The seminal work of Ben-Or and Hassidim (STOC'05) broke the $\Omega(\sqrt{n/\log n})$ barrier for consensus in classic (deterministic and randomized) networks by employing quantum computing. They showed an (expected) constant-time quantum algorithm for a linear number of crashes $t<n/3$. In this paper, we improve upon that seminal work by reducing the number of quantum and communication bits to an arbitrarily small polynomial, and even further, to a polylogarithmic number, though the latter comes at the cost of a slightly larger polylogarithmic time (still exponentially smaller than the time lower bound $\Omega(\sqrt{n/\log n})$ for classic computation).
Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
The Noisy-SGD algorithm is widely used for privately training machine learning models. Traditional privacy analyses of this algorithm assume that the internal state is publicly revealed, resulting in privacy loss bounds that increase indefinitely with the number of iterations. However, recent findings have shown that if the internal state remains hidden, then the privacy loss might remain bounded. Nevertheless, this remarkable result heavily relies on the assumption of (strong) convexity of the loss function. It remains an important open problem to further relax this condition while proving similar convergent upper bounds on the privacy loss. In this work, we address this problem for DP-SGD, a popular variant of Noisy-SGD that incorporates gradient clipping to limit the impact of individual samples on the training process. Our findings demonstrate that the privacy loss of projected DP-SGD converges exponentially fast, without requiring convexity or smoothness assumptions on the loss function. In addition, we analyze the privacy loss of regularized (unprojected) DP-SGD. To obtain these results, we directly analyze the hockey-stick divergence between coupled stochastic processes by relying on non-linear data processing inequalities.
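For concreteness, one update of projected DP-SGD can be sketched as below: per-sample gradients are clipped, averaged, and perturbed with Gaussian noise, and the iterate is projected back onto a bounded set. The choice of a Euclidean ball of radius `radius` as the projection set, the placement and scaling of the noise (`noise_std` applied to the averaged gradient), and all names are assumptions for illustration. The abstract does not fix these details.

```python
import math
import random


def scale_to_norm(vec, max_norm):
    """Shrink vec onto the Euclidean ball of radius max_norm if it lies outside."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * (max_norm / norm) for v in vec]


def projected_dp_sgd_step(theta, per_sample_grads, lr, clip_norm,
                          noise_std, radius, rng):
    """One step: clip each gradient, average, add noise, descend, project."""
    dim = len(theta)
    clipped = [scale_to_norm(g, clip_norm) for g in per_sample_grads]
    mean_grad = [sum(g[d] for g in clipped) / len(clipped) for d in range(dim)]
    noisy = [m + noise_std * rng.gauss(0.0, 1.0) for m in mean_grad]
    stepped = [t - lr * g for t, g in zip(theta, noisy)]
    return scale_to_norm(stepped, radius)  # projection onto the assumed ball


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(projected_dp_sgd_step([0.0, 0.0], grads, 0.1, 1.0, 0.5, 10.0, rng))
```

Clipping bounds the influence of any single sample on the update, and the projection keeps the hidden iterates in a bounded domain.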
Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
Federated learning is a new distributed machine learning framework, where a set of heterogeneous clients collaboratively train a model without sharing training data. In this work, we consider a practical and ubiquitous issue in federated learning: intermittent client availability, where the set of eligible clients may change during the training process. Such intermittent client availability can significantly degrade the performance of the classical Federated Averaging algorithm (FedAvg for short). We propose a simple distributed non-convex optimization algorithm, called Federated Latest Averaging (FedLaAvg for short), which leverages the latest gradients of all clients, even when the clients are not available, to jointly update the global model in each iteration. Our theoretical analysis shows that FedLaAvg attains a convergence rate of $O(1/(N^{1/4} T^{1/2}))$, achieving a sublinear speedup with respect to the total number of clients. We implement and evaluate FedLaAvg with the CIFAR-10 dataset. The evaluation results demonstrate that FedLaAvg indeed reaches a sublinear speedup and achieves 4.23% higher test accuracy than FedAvg.
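The core idea of averaging the latest known gradient of every client, including those currently unavailable, can be sketched as a single server-side step. This is only an illustration of that idea under simplifying assumptions: one gradient per client, plain gradient descent, and a cache that already holds a gradient for every client. The function name `latest_averaging_step` and its parameters are hypothetical and are not taken from the paper.

```python
def latest_averaging_step(model, latest_grads, fresh_grads, lr):
    """Update the model with the average of each client's most recent gradient.

    latest_grads: dict client_id -> last gradient received (updated in place).
    fresh_grads:  dict client_id -> gradient from clients available this round.
    """
    # Available clients overwrite their cached gradient; others keep the stale one.
    latest_grads.update(fresh_grads)
    n = len(latest_grads)
    avg = [sum(g[d] for g in latest_grads.values()) / n for d in range(len(model))]
    return [w - lr * a for w, a in zip(model, avg)]


if __name__ == "__main__":
    cache = {0: [1.0], 1: [3.0]}
    # Only client 1 is available this round; client 0 contributes its stale gradient.
    print(latest_averaging_step([0.0], cache, {1: [5.0]}, 0.1))
```

Because unavailable clients still contribute their cached gradients, every update uses information from all clients, rather than only from the clients that happen to be online in that round.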