Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve an Exascale performance. The main goal of application developers of GPU is to tune their code extensively to obtain optimal performance, making efficient use of different resources available. While extracting optimal performance of applications on an HPC infrastructure, developers should also ensure the applications have the least energy usage considering the massive power consumption of data centres and HPC servers. This thesis presents two models developed which can be utilized by developers in analysing the CUDA kernel's energy dissipation. The first one is a model that predicts the CUDA kernel's execution time. Here a PTX code is statically analysed to extract instruction features, control flow, and data dependence. We propose two scheduling algorithm approaches that satisfy the performance and hardware constraints. The second model is a static analysis-based power prediction built by utilizing machine learning techniques. Features used for building the model are derived using static analysis of PTX code. These features are chosen to understand the relationship between GPU power consumption and program features that can aid developers in building energy-efficient, sustainable applications. The dataset used for validating both models include kernels from different benchmarks suits, sizes, nature (e.g., compute-bound, memory-bound), and complexity (e.g., control divergence, memory access patterns). We also present a tool that has practically validated the effectiveness and ease of using the two models as design assistance tools for GPU.
The problem Power Dominating Set (PDS) is motivated by the placement of phasor measurement units to monitor electrical networks. It asks for a minimum set of vertices in a graph that observes all remaining vertices by exhaustively applying two observation rules. Our contribution is twofold. First, we determine the parameterized complexity of PDS by proving it is $W[P]$-complete when parameterized with respect to the solution size. We note that it was only known to be $W[2]$-hard before. Our second and main contribution is a new algorithm for PDS that efficiently solves practical instances. Our algorithm consists of two complementary parts. The first is a set of reduction rules for PDS that can also be used in conjunction with previously existing algorithms. The second is an algorithm for solving the remaining kernel based on the implicit hitting set approach. Our evaluation on a set of power grid instances from the literature shows that our solver outperforms previous state-of-the-art solvers for PDS by more than one order of magnitude on average. Furthermore, our algorithm can solve previously unsolved instances of continental scale within a few minutes.
Multi-access Edge Computing (MEC) is an enabling technology to leverage new network applications, such as virtual/augmented reality, by providing faster task processing at the network edge. This is done by deploying servers closer to the end users to run the network applications. These applications are often intensive in terms of task processing, memory usage, and communication; thus mobile devices may take a long time or even not be able to run them efficiently. By transferring (offloading) the execution of these applications to the servers at the network edge, it is possible to achieve a lower completion time (makespan) and meet application requirements. However, offloading multiple entire applications to the edge server can overwhelm its hardware and communication channel, as well as underutilize the mobile devices' hardware. In this paper, network applications are modeled as Directed Acyclic Graphs (DAGs) and partitioned into tasks, and only part of these tasks are offloaded to the edge server. This is the DAG application partitioning and offloading problem, which is known to be NP-hard. To approximate its solution, this paper proposes the FlexDO algorithm. FlexDO combines a greedy phase with a permutation phase to find a set of offloading decisions, and then chooses the one that achieves the shortest makespan. FlexDO is compared with a proposal from the literature and two baseline decisions, considering realistic DAG applications extracted from the Alibaba Cluster Trace Program. Results show that FlexDO is consistently only 3.9% to 8.9% above the optimal makespan in all test scenarios, which include different levels of CPU availability, a multi-user case, and different communication channel transmission rates. FlexDO outperforms both baseline solutions by a wide margin, and is three times closer to the optimal makespan than its competitor.
This paper studies the probability of error associated with the social machine learning framework, which involves an independent training phase followed by a cooperative decision-making phase over a graph. This framework addresses the problem of classifying a stream of unlabeled data in a distributed manner. We consider two kinds of classification tasks with limited observations in the prediction phase, namely, the statistical classification task and the single-sample classification task. For each task, we describe the distributed learning rule and analyze the probability of error accordingly. To do so, we first introduce a stronger consistent training condition that involves the margin distributions generated by the trained classifiers. Based on this condition, we derive an upper bound on the probability of error for both tasks, which depends on the statistical properties of the data and the combination policy used to combine the distributed classifiers. For the statistical classification problem, we employ the geometric social learning rule and conduct a non-asymptotic performance analysis. An exponential decay of the probability of error with respect to the number of unlabeled samples is observed in the upper bound. For the single-sample classification task, a distributed learning rule that functions as an ensemble classifier is constructed. An upper bound on the probability of error of this ensemble classifier is established.
Given tensors $\boldsymbol{\mathscr{A}}, \boldsymbol{\mathscr{B}}, \boldsymbol{\mathscr{C}}$ of size $m \times 1 \times n$, $m \times p \times 1$, and $1\times p \times n$, respectively, their Bhattacharya-Mesner (BM) product will result in a third order tensor of dimension $m \times p \times n$ and BM-rank of 1 (Mesner and Bhattacharya, 1990). Thus, if a third-order tensor can be written as a sum of a small number of such BM-rank 1 terms, this BM-decomposition (BMD) offers an implicitly compressed representation of the tensor. Therefore, in this paper, we give a generative model which illustrates that spatio-temporal video data can be expected to have low BM-rank. Then, we discuss non-uniqueness properties of the BMD and give an improved bound on the BM-rank of a third-order tensor. We present and study properties of an iterative algorithm for computing an approximate BMD, including convergence behavior and appropriate choices for starting guesses that allow for the decomposition of our spatial-temporal data into stationary and non-stationary components. Several numerical experiments show the impressive ability of our BMD algorithm to extract important temporal information from video data while simultaneously compressing the data. In particular, we compare our approach with dynamic mode decomposition (DMD): first, we show how the matrix-based DMD can be reinterpreted in tensor BMP form, then we explain why the low BM-rank decomposition can produce results with superior compression properties while simultaneously providing better separation of stationary and non-stationary features in the data. We conclude with a comparison of our low BM-rank decomposition to two other tensor decompositions, CP and the t-SVDM.
Voltage Overscaling (VOS) is one of the well-known techniques to increase the energy efficiency of arithmetic units. Also, it can provide significant lifetime improvements, while still meeting the accuracy requirements of inherently error-resilient applications. This paper proposes a generic accuracy-configurable multiplier that employs the VOS at a coarse-grained level (block-level) to reduce the control logic required for applying VOS and its associated overheads, thus enabling a high degree of trade-off between energy consumption and output quality. The proposed configurable Block-Level VOS-based (BL-VOS) multiplier relies on employing VOS in a multiplier composed of smaller blocks, where applying VOS in different blocks results in structures with various output accuracy levels. To evaluate the proposed concept, we implement 8-bit and 16-bit BL-VOS multipliers with various blocks width in a 15-nm FinFET technology. The results show that the proposed multiplier achieves up to 15% lower energy consumption and up to 21% higher output accuracy compared to the state-of-the-art VOS-based multipliers. Also, the effects of Process Variation (PV) and Bias Temperature Instability (BTI) induced delay on the proposed multiplier are investigated. Finally, the effectiveness of the proposed multiplier is studied for two different image processing applications, in terms of quality and energy efficiency.
Node classification on graphs is a significant task with a wide range of applications, including social analysis and anomaly detection. Even though graph neural networks (GNNs) have produced promising results on this task, current techniques often presume that label information of nodes is accurate, which may not be the case in real-world applications. To tackle this issue, we investigate the problem of learning on graphs with label noise and develop a novel approach dubbed Consistent Graph Neural Network (CGNN) to solve it. Specifically, we employ graph contrastive learning as a regularization term, which promotes two views of augmented nodes to have consistent representations. Since this regularization term cannot utilize label information, it can enhance the robustness of node representations to label noise. Moreover, to detect noisy labels on the graph, we present a sample selection technique based on the homophily assumption, which identifies noisy nodes by measuring the consistency between the labels with their neighbors. Finally, we purify these confident noisy labels to permit efficient semantic graph learning. Extensive experiments on three well-known benchmark datasets demonstrate the superiority of our CGNN over competing approaches.
A proper fusion of complex data is of interest to many researchers in diverse fields, including computational statistics, computational geometry, bioinformatics, machine learning, pattern recognition, quality management, engineering, statistics, finance, economics, etc. It plays a crucial role in: synthetic description of data processes or whole domains, creation of rule bases for approximate reasoning tasks, reaching consensus and selection of the optimal strategy in decision support systems, imputation of missing values, data deduplication and consolidation, record linkage across heterogeneous databases, and clustering. This open-access research monograph integrates the spread-out results from different domains using the methodology of the well-established classical aggregation framework, introduces researchers and practitioners to Aggregation 2.0, as well as points out the challenges and interesting directions for further research.
Game theory has by now found numerous applications in various fields, including economics, industry, jurisprudence, and artificial intelligence, where each player only cares about its own interest in a noncooperative or cooperative manner, but without obvious malice to other players. However, in many practical applications, such as poker, chess, evader pursuing, drug interdiction, coast guard, cyber-security, and national defense, players often have apparently adversarial stances, that is, selfish actions of each player inevitably or intentionally inflict loss or wreak havoc on other players. Along this line, this paper provides a systematic survey on three main game models widely employed in adversarial games, i.e., zero-sum normal-form and extensive-form games, Stackelberg (security) games, zero-sum differential games, from an array of perspectives, including basic knowledge of game models, (approximate) equilibrium concepts, problem classifications, research frontiers, (approximate) optimal strategy seeking techniques, prevailing algorithms, and practical applications. Finally, promising future research directions are also discussed for relevant adversarial games.
In the last decade or so, we have witnessed deep learning reinvigorating the machine learning field. It has solved many problems in the domains of computer vision, speech recognition, natural language processing, and various other tasks with state-of-the-art performance. The data is generally represented in the Euclidean space in these domains. Various other domains conform to non-Euclidean space, for which graph is an ideal representation. Graphs are suitable for representing the dependencies and interrelationships between various entities. Traditionally, handcrafted features for graphs are incapable of providing the necessary inference for various tasks from this complex data representation. Recently, there is an emergence of employing various advances in deep learning to graph data-based tasks. This article provides a comprehensive survey of graph neural networks (GNNs) in each learning setting: supervised, unsupervised, semi-supervised, and self-supervised learning. Taxonomy of each graph based learning setting is provided with logical divisions of methods falling in the given learning setting. The approaches for each learning task are analyzed from both theoretical as well as empirical standpoints. Further, we provide general architecture guidelines for building GNNs. Various applications and benchmark datasets are also provided, along with open challenges still plaguing the general applicability of GNNs.
Deep neural networks have revolutionized many machine learning tasks in power systems, ranging from pattern recognition to signal processing. The data in these tasks is typically represented in Euclidean domains. Nevertheless, there is an increasing number of applications in power systems, where data are collected from non-Euclidean domains and represented as the graph-structured data with high dimensional features and interdependency among nodes. The complexity of graph-structured data has brought significant challenges to the existing deep neural networks defined in Euclidean domains. Recently, many studies on extending deep neural networks for graph-structured data in power systems have emerged. In this paper, a comprehensive overview of graph neural networks (GNNs) in power systems is proposed. Specifically, several classical paradigms of GNNs structures (e.g., graph convolutional networks, graph recurrent neural networks, graph attention networks, graph generative networks, spatial-temporal graph convolutional networks, and hybrid forms of GNNs) are summarized, and key applications in power systems such as fault diagnosis, power prediction, power flow calculation, and data generation are reviewed in detail. Furthermore, main issues and some research trends about the applications of GNNs in power systems are discussed.