Partitioning a graph into balanced blocks such that few edges run between blocks is a key problem for large-scale distributed processing. A current trend for partitioning huge graphs is the use of streaming algorithms, which require only low computational resources. In this work, we present a shared-memory streaming multi-recursive partitioning scheme that performs recursive multi-sections on the fly, without knowledge of the overall input graph. Our approach has a considerably lower running-time complexity than state-of-the-art non-buffered one-pass partitioning algorithms for the standard graph partitioning case. Moreover, if the topology of a distributed system is known, communication costs can be optimized further by mapping partitions onto processing elements. Our experiments indicate that our algorithm is faster than competing tools and produces better process mappings. In the case of graph partitioning, our framework is up to two orders of magnitude faster than Fennel at the cost of 5% more cut edges.
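To make the one-pass streaming setting concrete, the sketch below assigns each arriving vertex using a Fennel-style score (Fennel is the baseline referenced above). It is only a hypothetical illustration of streaming partitioning in general, not of the multi-recursive scheme proposed in this work, and the parameter names and defaults (gamma, nu) are assumptions.

    import math

    def stream_partition(node_stream, k, n, m, gamma=1.5, nu=1.1):
        """node_stream yields (node, neighbors); k blocks, n nodes, m edges."""
        alpha = math.sqrt(k) * m / (n ** gamma)      # Fennel load-penalty weight
        max_size = nu * n / k                        # soft balance constraint
        blocks = [set() for _ in range(k)]
        assignment = {}
        for v, neighbors in node_stream:
            best, best_score = 0, -math.inf
            for i, block in enumerate(blocks):
                if len(block) >= max_size:
                    continue
                gain = sum(1 for u in neighbors if u in block)       # edges kept local
                penalty = alpha * gamma * len(block) ** (gamma - 1)  # balance penalty
                score = gain - penalty
                if score > best_score:
                    best, best_score = i, score
            blocks[best].add(v)
            assignment[v] = best
        return assignment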
The approximate uniform sampling of graph realizations with a given degree sequence is an everyday task in projects across the social sciences, computer science, engineering, and other fields. One approach is to use Markov chains. The best currently available result about the well-studied switch Markov chain is that it is rapidly mixing on P-stable degree sequences (see DOI:10.1016/j.ejc.2021.103421). The switch Markov chain never changes the degree sequence of a realization. However, there are cases where degree intervals are specified rather than a single degree sequence. (A natural scenario where this problem arises is in hypothesis testing on social networks that are only partially observed.) Rechner, Strowick, and M\"uller-Hannemann introduced in 2018 the notion of the degree interval Markov chain, which uses three (separately well-studied) local operations (switch, hinge-flip, and toggle) and operates on realizations of degree sequences where any two sequences under scrutiny have very small coordinate-wise distance. Recently, Amanatidis and Kleer published a beautiful paper (arXiv:2110.09068) showing that the degree interval Markov chain is rapidly mixing if the sequences come from a system of very thin intervals which are centered not far from a regular degree sequence. In this paper we substantially extend their result, showing that the degree interval Markov chain is rapidly mixing if the intervals are centered at P-stable degree sequences.
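For context, the sketch below shows the switch operation that underlies the switch Markov chain: two edges are replaced by two others on the same four endpoints, so every vertex degree is preserved. This is a minimal, self-contained illustration; the hinge-flip and toggle operations used by the degree interval chain (which do change degrees by one) are not shown.

    import random

    def switch_step(edges):
        """One attempted switch move on a simple graph given as a set of frozenset edges."""
        (a, b), (c, d) = (tuple(e) for e in random.sample(list(edges), 2))
        if random.random() < 0.5:
            c, d = d, c                          # pick one of the two possible re-pairings
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # reject moves that would create a loop or a parallel edge
        if len(new1) < 2 or len(new2) < 2 or new1 in edges or new2 in edges:
            return edges                         # lazy step: stay at the current realization
        return (set(edges) - {frozenset((a, b)), frozenset((c, d))}) | {new1, new2}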
The hard thresholding technique plays a vital role in the development of algorithms for sparse signal recovery. By merging this technique with the heavy-ball acceleration method, a multi-step extension of the traditional gradient descent method, we propose the so-called heavy-ball-based hard thresholding (HBHT) and heavy-ball-based hard thresholding pursuit (HBHTP) algorithms for signal recovery. It turns out that HBHT and HBHTP can successfully recover a $k$-sparse signal if the restricted isometry constant of the measurement matrix satisfies $\delta_{3k}<0.618$ and $\delta_{3k}<0.577,$ respectively. The guaranteed success of HBHT and HBHTP is also shown under the conditions $\delta_{2k}<0.356$ and $\delta_{2k}<0.377,$ respectively. Moreover, the finite convergence and stability of the two algorithms are established in this paper. Simulations on random problem instances are performed to compare the performance of the proposed algorithms with several existing ones. Empirical results indicate that HBHTP performs comparably to a few existing algorithms while taking less average time to achieve signal recovery.
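To fix ideas, a generic iteration of this type combines a heavy-ball (momentum) gradient step on $\|y-Ax\|_2^2$ with the best-$k$-term hard thresholding operator $\mathcal{H}_k$; with step size $\alpha$ and momentum weight $\beta$, one such sketch reads
$$x^{p+1}=\mathcal{H}_k\big(x^{p}+\alpha A^{\top}(y-Ax^{p})+\beta(x^{p}-x^{p-1})\big),$$
while a pursuit variant additionally solves a least-squares problem restricted to the selected support in each iteration. The exact step-size choices and update order used by HBHT and HBHTP may differ; the display above is only meant to indicate how the momentum term enters.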
Graph Convolutional Networks (GCNs) are one of the most popular architectures used to solve classification problems accompanied by graphical information. We present a rigorous theoretical understanding of the effects of graph convolutions in multi-layer networks. We study these effects through the node classification problem of a non-linearly separable Gaussian mixture model coupled with a stochastic block model. First, we show that a single graph convolution expands the regime of the distance between the means where multi-layer networks can classify the data by a factor of at least $1/\sqrt[4]{\mathbb{E}{\rm deg}}$, where $\mathbb{E}{\rm deg}$ denotes the expected degree of a node. Second, we show that with a slightly stronger graph density, two graph convolutions improve this factor to at least $1/\sqrt[4]{n}$, where $n$ is the number of nodes in the graph. Finally, we provide both theoretical and empirical insights into the performance of graph convolutions placed in different combinations among the layers of a network, concluding that the performance is similar across all placements. We present extensive experiments on both synthetic and real-world data that illustrate our results.
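As a point of reference, the sketch below applies a degree-normalized graph convolution $D^{-1}AX$ to a node feature matrix; the normalization, the treatment of self-loops, and the surrounding network are assumptions made for illustration and need not match the operator analyzed here.

    import numpy as np

    def graph_convolve(A, X, rounds=1):
        """A: (n, n) adjacency matrix, X: (n, d) node features."""
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                  # guard against isolated nodes
        for _ in range(rounds):
            X = (A @ X) / deg                # average the neighbors' features
        return X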
We study the distributed minimum spanning tree (MST) problem, a fundamental problem in distributed computing. It is well known that distributed MST can be solved in $\tilde{O}(D+\sqrt{n})$ rounds in the standard CONGEST model (where $n$ is the network size and $D$ is the network diameter), and this is essentially the best possible round complexity (up to logarithmic factors). However, in resource-constrained networks such as ad hoc wireless and sensor networks, nodes spending so many rounds can lead to significant consumption of resources such as energy. Motivated by this consideration, we study distributed algorithms for MST under the \emph{sleeping model} [Chatterjee et al., PODC 2020], a model for the design and analysis of resource-efficient distributed algorithms. In the sleeping model, a node can be in one of two modes in any round -- \emph{sleeping} or \emph{awake} (unlike the traditional model where nodes are always awake). Only the rounds in which a node is \emph{awake} are counted, while \emph{sleeping} rounds are ignored. A node spends resources only in its awake rounds, and hence the main goal is to minimize the \emph{awake complexity} of a distributed algorithm, the worst-case number of rounds any node is awake. We present deterministic and randomized distributed MST algorithms that have an \emph{optimal} awake complexity of $O(\log n)$, with a matching lower bound. We also show that our randomized awake-optimal algorithm has essentially the best possible round complexity by presenting a lower bound of $\tilde{\Omega}(n)$ on the product of the awake and round complexities of any distributed algorithm (including randomized ones) that outputs an MST, where $\tilde{\Omega}$ hides a $1/(\text{polylog } n)$ factor.
Coflow is a network abstraction used to represent communication patterns in data centers. The coflow scheduling problem in large data centers is an important $NP$-hard problem. Most previous studies on coflow scheduling focus on the single-core model. However, with the growth of data centers, this single-core model is no longer sufficient. This paper considers the coflow scheduling problem in heterogeneous parallel networks, an architecture based on multiple network cores running in parallel. In this paper, two polynomial-time approximation algorithms are developed for scheduling divisible and indivisible coflows, respectively, in heterogeneous parallel networks. Both algorithms achieve an approximation ratio of $O(\log m/ \log \log m)$ with arbitrary release times.
While neural architecture search (NAS) has enabled automated machine learning (AutoML) for well-researched areas, its application to tasks beyond computer vision is still under-explored. As less-studied domains are precisely those where we expect AutoML to have the greatest impact, in this work we study NAS for efficiently solving diverse problems. Seeking an approach that is fast, simple, and broadly applicable, we fix a standard convolutional network (CNN) topology and propose to search for the right kernel sizes and dilations its operations should take on. This dramatically expands the model's capacity to extract features at multiple resolutions for different types of data while only requiring search over the operation space. To overcome the efficiency challenges of naive weight-sharing in this search space, we introduce DASH, a differentiable NAS algorithm that computes the mixture-of-operations using the Fourier diagonalization of convolution, achieving both a better asymptotic complexity and an up-to-10x search time speedup in practice. We evaluate DASH on NAS-Bench-360, a suite of ten tasks designed for benchmarking NAS in diverse domains. DASH outperforms state-of-the-art methods in aggregate, attaining the best-known automated performance on seven tasks. Meanwhile, on six of the ten tasks, the combined search and retraining time is less than 2x slower than simply training a CNN backbone that is far less accurate.
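To illustrate why the Fourier view helps, the hypothetical one-dimensional sketch below mixes kernels of different sizes and dilations by summing them (zero-padded) before a single FFT-based circular convolution; by linearity, this equals the weighted sum of the individual convolutions, which is what lets a mixture of operations be evaluated at roughly the cost of one. DASH's actual multi-channel, multi-dimensional implementation and its padding conventions are not reproduced here.

    import numpy as np

    def dilate(kernel, d):
        """Insert d-1 zeros between taps to obtain a dilated kernel."""
        out = np.zeros((len(kernel) - 1) * d + 1)
        out[::d] = kernel
        return out

    def mixed_conv(x, kernels, dilations, weights):
        """Weighted mixture of dilated circular convolutions of x, via one FFT pass."""
        n = len(x)
        combined = np.zeros(n)
        for k, d, w in zip(kernels, dilations, weights):
            kd = dilate(np.asarray(k, dtype=float), d)
            padded = np.zeros(n)
            padded[:len(kd)] = kd               # assumes the dilated kernel fits in n samples
            combined += w * padded              # aggregate kernels before transforming
        # convolution is diagonalized by the DFT: one FFT/IFFT pair suffices
        return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(combined)))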
In this work, we develop quantization and variable-length source codecs for the feedback links in linear-quadratic-Gaussian (LQG) control systems. We prove that for any fixed control performance, the approaches we propose nearly achieve lower bounds on communication cost that have been established in prior work. In particular, we refine the analysis of a classical achievability approach with an eye towards more practical details. Notably, in the prior literature the source codecs used to demonstrate the (near) achievability of these lower bounds are often implicitly assumed to be time-varying. For single-input single-output (SISO) plants, we prove that it suffices to consider time-invariant quantization and source coding. This result follows from analyzing the long-term stochastic behavior of the system's quantized measurements and reconstruction errors. To our knowledge, this time-invariant achievability result is the first in the literature.
We present the Sequential Aggregation and Rematerialization (SAR) scheme for distributed full-batch training of Graph Neural Networks (GNNs) on large graphs. Large-scale training of GNNs has recently been dominated by sampling-based methods and methods based on non-learnable message passing. SAR, on the other hand, is a distributed technique that can train any GNN type directly on an entire large graph. The key innovation in SAR is a distributed sequential rematerialization scheme which sequentially re-constructs and then frees pieces of the prohibitively large GNN computational graph during the backward pass. This results in excellent memory scaling behavior, where the memory consumption per worker goes down linearly with the number of workers, even for densely connected graphs. Using SAR, we report the largest applications of full-batch GNN training to date, and demonstrate large memory savings as the number of workers increases. We also present a general technique based on kernel fusion and attention-matrix rematerialization to optimize both the runtime and memory efficiency of attention-based models. We show that, coupled with SAR, our optimized attention kernels lead to significant speedups and memory savings in attention-based GNNs. We have made the SAR GNN training library publicly available: \url{//github.com/IntelLabs/SAR}.
Deep learning approaches have demonstrated success in modeling analog audio effects. Nevertheless, challenges remain in modeling more complex effects that involve time-varying nonlinear elements, such as dynamic range compressors. Existing neural network approaches for modeling compression either ignore the device parameters, do not attain sufficient accuracy, or otherwise require large noncausal models that prohibit real-time operation. In this work, we propose a modification to temporal convolutional networks (TCNs) enabling greater efficiency without sacrificing performance. By utilizing very sparse convolutional kernels through rapidly growing dilations, our model attains a significant receptive field using fewer layers, reducing computation. Through a detailed evaluation, we demonstrate that our efficient and causal approach achieves state-of-the-art performance in modeling the analog LA-2A, is capable of real-time operation on CPU, and requires only 10 minutes of training data.
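As a rough illustration of the dilation-growth idea, the sketch below stacks causal dilated 1-D convolutions whose dilation grows by a fixed factor per layer, so the receptive field grows geometrically with depth ($1+\sum_{l}(K-1)d^{l}$ samples for kernel size $K$ and growth factor $d$). The channel count, kernel size, growth factor, activations, and the conditioning on compressor parameters are placeholders, not the published configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedStack(nn.Module):
        def __init__(self, channels=32, kernel_size=13, n_layers=5, growth=10):
            super().__init__()
            self.layers = nn.ModuleList()
            self.pads = []
            for l in range(n_layers):
                d = growth ** l                          # dilations 1, 10, 100, ...
                self.pads.append((kernel_size - 1) * d)  # left padding keeps the model causal
                in_ch = 1 if l == 0 else channels
                self.layers.append(nn.Conv1d(in_ch, channels, kernel_size, dilation=d))

        def forward(self, x):                            # x: (batch, 1, samples)
            for pad, conv in zip(self.pads, self.layers):
                x = torch.relu(conv(F.pad(x, (pad, 0))))
            return x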
The focus of Part I of this monograph has been on the fundamental properties, graph topologies, and spectral representations of graphs. Part II builds on these concepts to address the algorithmic and practical issues centered around data/signal processing on graphs, that is, the focus is on the analysis and estimation of both deterministic and random data on graphs. The fundamental ideas related to graph signals are introduced through a simple and intuitive, yet illustrative and general enough, case study of multisensor temperature field estimation. The concept of systems on graphs is defined using graph signal shift operators, which generalize the corresponding principles from traditional learning systems. At the core of the spectral domain representation of graph signals and systems is the Graph Discrete Fourier Transform (GDFT). The spectral domain representations are then used as the basis to introduce graph signal filtering concepts and to address their design, including Chebyshev polynomial approximation series. Ideas related to the sampling of graph signals are presented and further linked with compressive sensing. Localized graph signal analysis in the joint vertex-spectral domain is referred to as vertex-frequency analysis, since it can be considered an extension of classical time-frequency analysis to the graph domain of a signal. Important topics related to the local graph Fourier transform (LGFT) are covered, together with its various forms, including the graph spectral and vertex domain windows, and the inversion conditions and relations. A link between the LGFT with a spectrally varying window and the spectral graph wavelet transform (SGWT) is also established. Realizations of the LGFT and SGWT using polynomial (Chebyshev) approximations of the spectral functions are further considered. Finally, energy versions of the vertex-frequency representations are introduced.
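Since the GDFT is central to the spectral-domain machinery described above, the minimal numpy sketch below computes it for an undirected graph via the eigendecomposition of the combinatorial Laplacian $L=U\Lambda U^{T}$ (GDFT: $\hat{x}=U^{T}x$, inverse: $x=U\hat{x}$); the choice of the combinatorial Laplacian as the shift operator is an assumption made for illustration, since the adjacency matrix or a normalized Laplacian can serve the same role.

    import numpy as np

    def gdft_pair(A):
        """A: symmetric (n, n) adjacency matrix; returns forward and inverse GDFT maps."""
        L = np.diag(A.sum(axis=1)) - A    # combinatorial graph Laplacian
        lam, U = np.linalg.eigh(L)        # eigenvalues act as graph frequencies
        gdft = lambda x: U.T @ x          # analysis: signal -> spectrum
        igdft = lambda X: U @ X           # synthesis: spectrum -> signal
        return gdft, igdft

A graph filter with spectral response $h(\lambda)$ then acts as $U\,\mathrm{diag}(h(\lambda_0),\dots,h(\lambda_{n-1}))\,U^{T}x$, and the Chebyshev approximations mentioned above replace this eigendecomposition with polynomials in $L$ applied directly to $x$.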