Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade performance. Coded computation helps to mitigate the straggler effect, but the amount of redundant load and its assignment to the workers should be carefully optimized. In this work, we consider a multi-master heterogeneous-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation. The goal is to minimize the communication plus computation delay of the slowest task. We propose worker assignment, resource allocation and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks of either a single master or multiple masters, respectively. Then, the non-convex delay minimization problem is solved by employing Markov's inequality-based approximation, Karush-Kuhn-Tucker conditions, and successive convex approximation methods. Through extensive simulations, we show that the proposed algorithms can reduce the task completion delay compared to the benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of application.
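As a rough illustration of the coded computation idea mentioned in the abstract above, the sketch below splits a matrix into two row blocks, adds one parity block, and recovers the matrix-vector product from any two of three workers. It is a toy (3, 2) code chosen for brevity, not the worker assignment or load allocation algorithm proposed in the paper; the function names and the worker labels `w1`, `w2`, `w3` are hypothetical.

```python
def matvec(rows, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(A):
    """Split A into two row blocks and add a parity block A1 + A2.

    Assumption: A has an even number of rows so the blocks line up.
    """
    if len(A) % 2 != 0:
        raise ValueError("this toy code needs an even number of rows")
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return {"w1": A1, "w2": A2, "w3": parity}


def decode(results):
    """Recover A @ x from the outputs of any two of the three workers."""
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if "w1" in results and "w2" in results:
        y1, y2 = results["w1"], results["w2"]
    elif "w1" in results:
        # w2 straggled: (A1 + A2) x - A1 x = A2 x
        y1 = results["w1"]
        y2 = [p - a for p, a in zip(results["w3"], y1)]
    else:
        # w1 straggled: (A1 + A2) x - A2 x = A1 x
        y2 = results["w2"]
        y1 = [p - b for p, b in zip(results["w3"], y2)]
    return y1 + y2


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
tasks = encode(A)
# Pretend worker w2 is a straggler and never reports back.
finished = {w: matvec(tasks[w], x) for w in ("w1", "w3")}
product = decode(finished)
```

The redundancy here is fixed at 50% extra load; how much redundant load to add, and which worker should carry it, is the question the abstract treats as an optimization problem.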
Federated learning (FL) supports training models on geographically distributed devices. However, traditional FL systems adopt a centralized synchronous strategy, which imposes high communication pressure and poses model generalization challenges. Existing optimizations of FL either fail to speed up training on heterogeneous devices or suffer from poor communication efficiency. In this paper, we propose HADFL, a framework that supports decentralized asynchronous training on heterogeneous devices. The devices train the model locally with heterogeneity-aware local steps using local data. In each aggregation cycle, devices are selected probabilistically to perform model synchronization and aggregation. Compared with the traditional FL system, HADFL relieves the central server's communication pressure, efficiently utilizes heterogeneous computing power, and achieves a maximum speedup of 3.15x over decentralized-FedAvg and 4.68x over the PyTorch distributed training scheme, with almost no loss of convergence accuracy.
The demand for large-scale deep learning is increasing, and distributed training is the current mainstream solution. Ring AllReduce is widely used as a data-parallel decentralized algorithm. However, in a heterogeneous environment, each worker processes the same amount of data, so faster workers spend much of their time waiting for slower ones and the algorithm adapts poorly to heterogeneous clusters. Resources are therefore underused. In this paper, we design an implementation of a static allocation algorithm. The dataset is manually allocated to each worker, and samples are drawn proportionally for training, thereby speeding up network training in a heterogeneous environment. We verify the convergence of the network model and the effect on training speed under this algorithm in both single-machine multi-card and multi-machine multi-card settings. Having established feasibility, we propose a self-adaptive allocation algorithm that allows each machine to find the data share it needs to adapt to the current environment. The self-adaptive allocation algorithm can reduce the training time by nearly one-third to one-half compared to the same proportional allocation. To better show the applicability of the algorithm in heterogeneous clusters, we replace a poorly performing worker with a well-performing worker, or add a poorly performing worker to the heterogeneous cluster. Experimental results show that training time decreases as the overall performance improves, which indicates that resources are fully used. Further, this algorithm is suitable not only for straggler problems but also for most heterogeneous situations. It can be used as a plug-in for AllReduce and its variant algorithms.
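The proportional allocation idea in the abstract above can be sketched as follows: each worker receives a share of the samples proportional to its measured throughput. This is a minimal sketch under stated assumptions, not the paper's static or self-adaptive algorithm; the function name `proportional_allocation`, the throughput figures, and the largest-remainder rounding rule are all hypothetical choices made for illustration.

```python
import math


def proportional_allocation(num_samples, throughputs):
    """Split num_samples across workers in proportion to their throughput.

    throughputs: positive numbers, e.g. samples per second measured per worker
    (how they are measured is outside this sketch).
    Returns one integer sample count per worker, summing to num_samples.
    """
    if num_samples < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need num_samples >= 0 and positive throughputs")
    total = sum(throughputs)
    exact = [num_samples * t / total for t in throughputs]
    counts = [math.floor(share) for share in exact]
    # Hand the samples lost to rounding to the largest fractional remainders.
    leftover = max(num_samples - sum(counts), 0)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical cluster: one fast card and two slower ones.
shares = proportional_allocation(100, [2.0, 1.0, 1.0])
```

With equal shares, the two slower workers in this example would each hold a third of the data and the fast worker would idle; with proportional shares, all three finish an epoch at roughly the same time if the throughput estimates are accurate.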
Secure model aggregation across many users is a key component of federated learning systems. The state-of-the-art protocols for secure model aggregation, which are based on additive masking, require all users to quantize their model updates to the same level of quantization. This severely degrades their performance due to lack of adaptation to available bandwidth at different users. We propose three schemes that allow secure model aggregation while using heterogeneous quantization. This enables the users to adjust their quantization proportional to their available bandwidth, which can provide a substantially better trade-off between the accuracy of training and the communication time. The proposed schemes are based on a grouping strategy by partitioning the network into groups, and partitioning the local model updates of users into segments. Instead of applying aggregation protocol to the entire local model update vector, it is applied on segments with specific coordination between users. We theoretically evaluate the quantization error for our schemes, and also demonstrate how our schemes can be utilized to overcome Byzantine users.
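To make the additive masking mentioned in the abstract above concrete, here is a minimal sketch of pairwise masks that cancel in the sum. It illustrates the generic masking idea only, not the three heterogeneous-quantization schemes the paper proposes; the modulus, the function names, and the use of a single seeded generator in place of pairwise key agreement are assumptions made for brevity.

```python
import random

# Assumption: every user quantizes updates to integers in the same ring.
MODULUS = 2 ** 16


def mask_updates(updates, seed=0):
    """Add pairwise masks that cancel when all masked updates are summed.

    updates: one list of quantized integer coordinates per user.
    A real protocol would derive each pair's mask from a shared secret;
    a single seeded generator stands in for that here.
    """
    rng = random.Random(seed)
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            pair_mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                # User i adds the mask, user j subtracts the same mask.
                masked[i][k] = (masked[i][k] + pair_mask[k]) % MODULUS
                masked[j][k] = (masked[j][k] - pair_mask[k]) % MODULUS
    return masked


def aggregate(masked):
    """Server-side sum; the pairwise masks cancel modulo MODULUS."""
    dim = len(masked[0])
    return [sum(m[k] for m in masked) % MODULUS for k in range(dim)]


updates = [[1, 2], [3, 4], [5, 6]]
total = aggregate(mask_updates(updates))
```

The cancellation relies on every user working in the same ring, which is one way to see why such protocols require a common quantization level across users.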
Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication - a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to be a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives like Allgather, Alltoall, and Allreduce that are up to $3\times$ faster than NCCL. Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by $17\%$. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up to 80 GPUs in less than 3 minutes, at least two orders of magnitude faster than other synthesis-based state-of-the-art collective communication libraries.
Multi-antenna coded caching combines a global caching gain, proportional to the total cache size in the network, with an additional spatial multiplexing gain that stems from multiple transmitting antennas. However, classic centralized coded caching schemes are not suitable for dynamic networks as they require prior knowledge of the number of users to indicate what data should be cached at each user during the placement phase. On the other hand, fully decentralized schemes provide comparable gains to their centralized counterparts only when the number of users is very large. In this paper, we propose a novel multi-antenna coded caching scheme for dynamic networks, where instead of defining individual cache contents, we associate users with a limited set of predefined caching profiles. Then, during the delivery phase, we aim at achieving a combined caching and spatial multiplexing gain, comparable to a large extent with the ideal case of fully centralized schemes. The resulting scheme imposes small subpacketization and beamforming overheads, is robust under dynamic network conditions, and incurs small finite-SNR performance loss compared with centralized schemes.
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient descent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, which can be significantly larger than the average delay, especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
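The delayed-gradient setting in the abstract above can be sketched in a few lines: at step $t$ the update uses a gradient evaluated at the iterate from step $t - d_t$. The code below is plain delayed SGD on a scalar problem, shown only to make the setting concrete; it is not the algorithm the paper proposes, and the function name `delayed_sgd`, the clamping of early steps to the initial point, and the Gaussian noise model are assumptions.

```python
import random


def delayed_sgd(grad, x0, delays, lr, noise_std=0.0, seed=0):
    """Run x_{t+1} = x_t - lr * g_t, where g_t is a noisy gradient
    evaluated at the stale iterate x_{t - d_t}.

    delays: one non-negative integer d_t per step.
    Steps with t - d_t < 0 fall back to the initial point x0 (an assumption).
    """
    rng = random.Random(seed)
    history = [x0]  # history[t] is the iterate at the start of step t
    x = x0
    for t, d in enumerate(delays):
        stale_x = history[max(t - d, 0)]
        g = grad(stale_x) + rng.gauss(0.0, noise_std)
        x = x - lr * g
        history.append(x)
    return x


# Minimize f(x) = x^2 / 2, whose gradient is x, with a constant delay of 3.
x_final = delayed_sgd(lambda x: x, x0=1.0, delays=[3] * 200, lr=0.1)
```

In this toy run the step size is small relative to the delay, so the iterates still contract toward zero; with a larger step size or longer delays the same recursion can oscillate or diverge, which is why delay-aware analysis matters.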
We present a flexible public transit network design model which optimizes a social access objective while guaranteeing that the system's costs and transit times remain within a preset margin of their current levels. The purpose of the model is to find a set of minor, immediate modifications to an existing bus network that can give more communities access to the chosen services while having a minimal impact on the current network's operator costs and user costs. Design decisions consist of reallocation of existing resources in order to adjust line frequencies and capacities. We present a hybrid tabu search/simulated annealing algorithm for the solution of this optimization-based model. As a case study we apply the model to the problem of improving equity of access to primary health care facilities in the Chicago metropolitan area. The results of the model suggest that it is possible to achieve better primary care access equity through reassignment of existing buses and implementation of express runs, while leaving overall service levels relatively unaffected.
The reconfigurable intelligent surface (RIS) has become a promising technology for improving wireless communication in recent years. It steers the incident signals to create a favorable propagation environment by controlling reconfigurable passive elements, with low hardware cost and low power consumption. In this paper, we consider an RIS-aided multiuser multiple-input single-output downlink communication system. We aim to maximize the weighted sum-rate of all users by jointly optimizing the active beamforming at the access point and the passive beamforming vector of the RIS elements. Unlike most existing works, we consider the more practical situation with discrete phase shifts and imperfect channel state information (CSI). Specifically, for the case of discrete phase shifts and perfect CSI, we first develop a deep quantization neural network (DQNN) to design the active and passive beamforming simultaneously, whereas most reported works design them alternately. Then, we propose an improved structure (I-DQNN) based on DQNN to simplify the parameter decision process when each RIS element uses more than 1 control bit. Finally, we extend the two proposed DQNN-based algorithms to the case where discrete phase shifts and imperfect CSI are considered simultaneously. Our simulation results show that the two DQNN-based algorithms perform better than traditional algorithms in the perfect CSI case, and are also more robust in the imperfect CSI case.
Target-Based Sentiment Analysis aims to detect the opinion aspects (aspect extraction) and the sentiment polarities (sentiment detection) towards them. Both the previous pipeline and integrated methods fail to precisely model the innate connection between these two objectives. In this paper, we propose a novel dynamic heterogeneous graph to jointly model the two objectives in an explicit way. Both the ordinary words and sentiment labels are treated as nodes in the heterogeneous graph, so that the aspect words can interact with the sentiment information. The graph is initialized with multiple types of dependencies, and dynamically modified during real-time prediction. Experiments on the benchmark datasets show that our model outperforms the state-of-the-art models. Further analysis demonstrates that our model obtains significant performance gain on the challenging instances under multiple-opinion aspects and no-opinion aspect situations.
This paper presents an upgraded, real-world application-oriented version of gym-gazebo, the Robot Operating System (ROS) and Gazebo based Reinforcement Learning (RL) toolkit, which complies with OpenAI Gym. The paper discusses the new ROS 2 based software architecture and summarizes the results obtained using Proximal Policy Optimization (PPO). Ultimately, the outcome of this work is a benchmarking system for robotics that allows different techniques and algorithms to be compared under the same virtual conditions. We have evaluated environments of different levels of complexity for the Modular Articulated Robotic Arm (MARA), reaching accuracies at the millimeter scale. The converged results show the feasibility and usefulness of the gym-gazebo 2 toolkit, and its potential and applicability in industrial use cases using modular robots.