Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices, via iterative local updates (at devices) and global aggregations (at the server). In this paper, we develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions: (i) Network, allowing decentralized cooperation among the devices via device-to-device (D2D) communications. (ii) Heterogeneity, interpreted at three levels: (ii-a) Learning: PSL considers heterogeneous number of stochastic gradient descent iterations with different mini-batch sizes at the devices; (ii-b) Data: PSL presumes a dynamic environment with data arrival and departure, where the distributions of local datasets evolve over time, captured via a new metric for model/concept drift. (ii-c) Device: PSL considers devices with different computation and communication capabilities. (iii) Proximity, where devices have different distances to each other and the access point. PSL considers the realistic scenario where global aggregations are conducted with idle times in-between them for resource efficiency improvements, and incorporates data dispersion and model dispersion with local model condensation into FedL. Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning. We then propose network-aware dynamic model tracking to optimize the model learning vs. resource efficiency tradeoff, which we show is an NP-hard signomial programming problem. We finally solve this problem through proposing a general optimization solver. Our numerical results reveal new findings on the interdependencies between the idle times in-between the global aggregations, model/concept drift, and D2D cooperation configuration.
Distributed machine learning (ML) can bring more computational resources to bear than single-machine learning, thus enabling reductions in training time. Distributed learning partitions models and data over many machines, allowing model and dataset sizes beyond the available compute power and memory of a single machine. In practice though, distributed ML is challenging when distribution is mandatory, rather than chosen by the practitioner. In such scenarios, data could unavoidably be separated among workers due to limited memory capacity per worker or even because of data privacy issues. There, existing distributed methods will utterly fail due to dominant transfer costs across workers, or do not even apply. We propose a new approach to distributed fully connected neural network learning, called independent subnet training (IST), to handle these cases. In IST, the original network is decomposed into a set of narrow subnetworks with the same depth. These subnetworks are then trained locally before parameters are exchanged to produce new subnets and the training cycle repeats. Such a naturally "model parallel" approach limits memory usage by storing only a portion of network parameters on each device. Additionally, no requirements exist for sharing data between workers (i.e., subnet training is local and independent) and communication volume and frequency are reduced by decomposing the original network into independent subnets. These properties of IST can cope with issues due to distributed data, slow interconnects, or limited device memory, making IST a suitable approach for cases of mandatory distribution. We show experimentally that IST results in training times that are much lower than common distributed learning approaches.
Split learning (SL) is a collaborative learning framework, which can train an artificial intelligence (AI) model between a device and an edge server by splitting the AI model into a device-side model and a server-side model at a cut layer. The existing SL approach conducts the training process sequentially across devices, which incurs significant training latency especially when the number of devices is large. In this paper, we design a novel SL scheme to reduce the training latency, named Cluster-based Parallel SL (CPSL) which conducts model training in a "first-parallel-then-sequential" manner. Specifically, the CPSL is to partition devices into several clusters, parallelly train device-side models in each cluster and aggregate them, and then sequentially train the whole AI model across clusters, thereby parallelizing the training process and reducing training latency. Furthermore, we propose a resource management algorithm to minimize the training latency of CPSL considering device heterogeneity and network dynamics in wireless networks. This is achieved by stochastically optimizing the cut layer selection, real-time device clustering, and radio spectrum allocation. The proposed two-timescale algorithm can jointly make the cut layer selection decision in a large timescale and device clustering and radio spectrum allocation decisions in a small timescale. Extensive simulation results on non-independent and identically distributed data demonstrate that the proposed solutions can greatly reduce the training latency as compared with the existing SL benchmarks, while adapting to network dynamics.
Stochastic optimization algorithms implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the communication overhead for exchanging information such as stochastic gradients between different workers. Sparse communication with memory and the adaptive aggregation methodology are two successful frameworks among the various techniques proposed to address this issue. In this paper, we exploit the advantages of Sparse communication and Adaptive aggregated Stochastic Gradients to design a communication-efficient distributed algorithm named SASG. Specifically, we determine the workers who need to communicate with the parameter server based on the adaptive aggregation rule and then sparsify the transmitted information. Therefore, our algorithm reduces both the overhead of communication rounds and the number of communication bits in the distributed system. We define an auxiliary sequence and provide convergence results of the algorithm with the help of Lyapunov function analysis. Experiments on training deep neural networks show that our algorithm can significantly reduce the communication overhead compared to the previous methods, with little impact on training and testing accuracy.
Multi-scale problems, where variables of interest evolve in different time-scales and live in different state-spaces. can be found in many fields of science. Here, we introduce a new recursive methodology for Bayesian inference that aims at estimating the static parameters and tracking the dynamic variables of these kind of systems. Although the proposed approach works in rather general multi-scale systems, for clarity we analyze the case of a heterogeneous multi-scale model with 3 time-scales (static parameters, slow dynamic state variables and fast dynamic state variables). The proposed scheme, based on nested filtering methodology of P\'erez-Vieites et al. (2018), combines three intertwined layers of filtering techniques that approximate recursively the joint posterior probability distribution of the parameters and both sets of dynamic state variables given a sequence of partial and noisy observations. We explore the use of sequential Monte Carlo schemes in the first and second layers while we use an unscented Kalman filter to obtain a Gaussian approximation of the posterior probability distribution of the fast variables in the third layer. Some numerical results are presented for a stochastic two-scale Lorenz 96 model with unknown parameters.
Federated Learning has promised a new approach to resolve the challenges in machine learning by bringing computation to the data. The popularity of the approach has led to rapid progress in the algorithmic aspects and the emergence of systems capable of simulating Federated Learning. State of art systems in Federated Learning support a single node aggregator that is insufficient to train a large corpus of devices or train larger-sized models. As the model size or the number of devices increase the single node aggregator incurs memory and computation burden while performing fusion tasks. It also faces communication bottlenecks when a large number of model updates are sent to a single node. We classify the workload for the aggregator into categories and propose a new aggregation service for handling each load. Our aggregation service is based on a holistic approach that chooses the best solution depending on the model update size and the number of clients. Our system provides a fault-tolerant, robust and efficient aggregation solution utilizing existing parallel and distributed frameworks. Through evaluation, we show the shortcomings of the state of art approaches and how a single solution is not suitable for all aggregation requirements. We also provide a comparison of current frameworks with our system through extensive experiments.
We demonstrate that merely analog transmissions and match filtering can realize the function of an edge server in federated learning (FL). Therefore, a network with massively distributed user equipments (UEs) can achieve large-scale FL without an edge server. We also develop a training algorithm that allows UEs to continuously perform local computing without being interrupted by the global parameter uploading, which exploits the full potential of UEs' processing power. We derive convergence rates for the proposed schemes to quantify their training efficiency. The analyses reveal that when the interference obeys a Gaussian distribution, the proposed algorithm retrieves the convergence rate of a server-based FL. But if the interference distribution is heavy-tailed, then the heavier the tail, the slower the algorithm converges. Nonetheless, the system run time can be largely reduced by enabling computation in parallel with communication, whereas the gain is particularly pronounced when communication latency is high. These findings are corroborated via excessive simulations.
In this paper we propose a Bayesian nonparametric approach to modelling sparse time-varying networks. A positive parameter is associated to each node of a network, which models the sociability of that node. Sociabilities are assumed to evolve over time, and are modelled via a dynamic point process model. The model is able to capture long term evolution of the sociabilities. Moreover, it yields sparse graphs, where the number of edges grows subquadratically with the number of nodes. The evolution of the sociabilities is described by a tractable time-varying generalised gamma process. We provide some theoretical insights into the model and apply it to three datasets: a simulated network, a network of hyperlinks between communities on Reddit, and a network of co-occurences of words in Reuters news articles after the September 11th attacks.
The adaptive processing of structured data is a long-standing research topic in machine learning that investigates how to automatically learn a mapping from a structured input to outputs of various nature. Recently, there has been an increasing interest in the adaptive processing of graphs, which led to the development of different neural network-based methodologies. In this thesis, we take a different route and develop a Bayesian Deep Learning framework for graph learning. The dissertation begins with a review of the principles over which most of the methods in the field are built, followed by a study on graph classification reproducibility issues. We then proceed to bridge the basic ideas of deep learning for graphs with the Bayesian world, by building our deep architectures in an incremental fashion. This framework allows us to consider graphs with discrete and continuous edge features, producing unsupervised embeddings rich enough to reach the state of the art on several classification tasks. Our approach is also amenable to a Bayesian nonparametric extension that automatizes the choice of almost all model's hyper-parameters. Two real-world applications demonstrate the efficacy of deep learning for graphs. The first concerns the prediction of information-theoretic quantities for molecular simulations with supervised neural models. After that, we exploit our Bayesian models to solve a malware-classification task while being robust to intra-procedural code obfuscation techniques. We conclude the dissertation with an attempt to blend the best of the neural and Bayesian worlds together. The resulting hybrid model is able to predict multimodal distributions conditioned on input graphs, with the consequent ability to model stochasticity and uncertainty better than most works. Overall, we aim to provide a Bayesian perspective into the articulated research field of deep learning for graphs.
Federated Learning (FL) is a decentralized machine-learning paradigm, in which a global server iteratively averages the model parameters of local users without accessing their data. User heterogeneity has imposed significant challenges to FL, which can incur drifted global models that are slow to converge. Knowledge Distillation has recently emerged to tackle this issue, by refining the server model using aggregated knowledge from heterogeneous users, other than directly averaging their model parameters. This approach, however, depends on a proxy dataset, making it impractical unless such a prerequisite is satisfied. Moreover, the ensemble knowledge is not fully utilized to guide local model learning, which may in turn affect the quality of the aggregated model. Inspired by the prior art, we propose a data-free knowledge distillation} approach to address heterogeneous FL, where the server learns a lightweight generator to ensemble user information in a data-free manner, which is then broadcasted to users, regulating local training using the learned knowledge as an inductive bias. Empirical studies powered by theoretical implications show that, our approach facilitates FL with better generalization performance using fewer communication rounds, compared with the state-of-the-art.
Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges belong to the same types, making them infeasible to represent heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous graphs. To model heterogeneity, we design node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To handle dynamic heterogeneous graphs, we introduce the relative temporal encoding technique into HGT, which is able to capture the dynamic structural dependency with arbitrary durations. To handle Web-scale graph data, we design the heterogeneous mini-batch graph sampling algorithm---HGSampling---for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 9%--21% on various downstream tasks.