With the rise of AI in recent years and the increasing complexity of models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the utilized hardware. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that (1) deriving energy consumption directly from runtime is not accurate; rather, the composition of the compute node needs to be taken into account; (2) neglecting accelerator hardware on mixed nodes results in disproportionate energy inefficiency; and (3) the energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads and raising user awareness of energy consumption.
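The measurements above come from node-level instrumentation on the cluster itself; purely as an illustration of the kind of sampling such measurements involve, the following is a minimal sketch that records GPU power draw during a workload, assuming an NVIDIA device and the pynvml bindings (the sampling interval and duration are arbitrary placeholders, not the paper's setup):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
start = time.time()
while time.time() - start < 60.0:  # sample for one minute (placeholder duration)
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
    samples.append(watts)
    time.sleep(0.5)

elapsed_s = time.time() - start
avg_power_w = sum(samples) / len(samples)
energy_wh = avg_power_w * elapsed_s / 3600.0  # average W times seconds -> Wh
print(f"avg GPU power: {avg_power_w:.1f} W, energy: {energy_wh:.2f} Wh")
pynvml.nvmlShutdown()
```

Note that such a GPU-side view deliberately ignores the rest of the node, which is exactly the pitfall the paper's first finding warns about.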
In many applications that involve the inference of an unknown smooth function, the inference of its derivatives is often just as important as that of the function itself. To make joint inferences of the function and its derivatives, we consider a class of Gaussian processes called the $p^{\text{th}}$-order Integrated Wiener Process (IWP). Methods for constructing a finite element (FEM) approximation of an IWP exist but have focused only on the order $p = 2$ case, which does not allow appropriate inference for derivatives, and their computational feasibility relies on an additional approximation to the FEM itself. In this article, we propose an alternative FEM approximation, called overlapping splines (O-spline), which pursues computational feasibility directly through the choice of test functions and mirrors the construction of an IWP, as the O-spline results from repeated integration of these same test functions. The O-spline approximation applies to any order $p \in \mathbb{Z}^+$, is computationally efficient, and provides consistent inference for all derivatives up to order $p-1$. We show, both theoretically and empirically through simulation, that the O-spline approximation converges to the true IWP as the number of knots increases. We further provide a unified and interpretable way to define priors for the smoothing parameter based on the notion of predictive standard deviation (PSD), which is invariant to the order $p$ and the placement of the knots. Finally, we demonstrate the practical use of the O-spline approximation through simulation studies and an analysis of COVID death rates, where inference is carried out on both the function and its derivatives; the latter has an important interpretation in terms of the course of the pandemic.
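For reference, a standard way to write the process being approximated (our notation, not necessarily the paper's): the $p^{\text{th}}$-order IWP $W_p$ with smoothing parameter $\sigma$ is defined through the stochastic differential equation
$$ \frac{d^p}{dt^p} W_p(t) = \sigma\, \xi(t), $$
where $\xi$ denotes Gaussian white noise; equivalently, $W_p(t) = \sigma \int_0^t \frac{(t-s)^{p-1}}{(p-1)!}\, dB(s)$ for a standard Brownian motion $B$, so that the derivatives of $W_p$ up to order $p-1$ are well defined and can be inferred jointly with $W_p$ itself.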
The principles of minimum potential energy and minimum complementary energy are the most important variational principles in solid mechanics. The deep energy method (DEM), which has received much attention, is based on the principle of minimum potential energy and lacks a counterpart based on minimum complementary energy. We therefore propose a deep energy method based on the principle of minimum complementary energy (DCM). The output of DCM is a stress function that naturally satisfies the equilibrium equation. We extend the proposed algorithm to DCM-P by adding terms to the Airy stress function that naturally satisfy the biharmonic equation. Combining operator learning with physical equations, we further propose a deep complementary energy operator method (DCM-O), comprising a branch net, trunk net, basis net, and particular net. DCM-O is first trained on existing high-fidelity numerical results; the complementary energy is then used to train its branch and trunk nets. To analyze the performance of DCM, we present numerical results for the two most common stress functions, the Prandtl and Airy stress functions. We apply DCM to representative mechanical problems with different types of boundary conditions and compare it with existing PINN and DEM algorithms. The results show that DCM is particularly well suited to problems dominated by displacement boundary conditions, which is reflected both in theory and in our numerical experiments. DCM-P and DCM-O further improve the accuracy of DCM and its convergence speed. DCM thus supplies the deep energy method with its essential complementary-energy form. We believe that operator learning based on energy methods can balance data and physical equations well, giving computational mechanics broad research prospects.
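As a reminder of the underlying variational principle (standard notation, ours): for linear elasticity, the complementary energy functional over statically admissible stress fields $\sigma$ reads
$$ \Pi_c(\sigma) = \frac{1}{2} \int_\Omega \sigma : \mathbb{C}^{-1} : \sigma \, d\Omega \;-\; \int_{\Gamma_u} (\sigma \cdot n) \cdot \bar{u} \, d\Gamma, $$
where $\mathbb{C}$ is the elasticity tensor, $n$ the outward normal, and $\bar{u}$ the prescribed displacement on the boundary portion $\Gamma_u$. The true stress field minimizes $\Pi_c$ among statically admissible fields, which is why a network whose output is a stress function can be trained on this functional directly and why the approach favors displacement-dominated boundary conditions.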
Parking in large metropolitan areas is often a time-consuming task with broader implications for traffic patterns and urban land use. Reducing the premium space needed for parking has led to the development of automated mechanical parking systems. Compared with regular garages, which have one or two rows of vehicles per island, automated garages can stack multiple rows of vehicles together to support higher parking demand. Although this multi-row layout reduces parking space, it makes parking and retrieval more complicated. In this work, we propose an automated garage design that supports nearly 100% parking density. Modeling the parking and retrieval of multiple vehicles as a special class of multi-robot path planning problem, we propose algorithms for handling all common operations of the automated garage, including (1) an optimal algorithm and near-optimal methods that find feasible and efficient solutions for simultaneous parking/retrieval and (2) a novel shuffling mechanism that rearranges vehicles to facilitate scheduled retrieval at rush hours. We conduct thorough simulation studies showing that the proposed methods are promising for large, high-density real-world parking applications.
Failures with different root causes can significantly disturb multi-fault localization; therefore, dividing failures into distinct groups according to the responsible faults is highly important. In such a failure indexing task, the crux lies in failure proximity, which involves two questions: how to effectively represent failures (e.g., how to extract their signatures) and how to properly measure the distance between those representations. Existing studies have proposed a variety of failure proximities. The most prevalent extract signatures of failures from execution coverage or suspiciousness ranking lists, and accordingly employ the Euclidean or Kendall tau distances. However, such strategies may not properly reflect the essential characteristics of failures, resulting in unsatisfactory effectiveness. In this paper, we propose a new failure proximity, namely program variable-based failure proximity, and based on it we present a novel failure indexing approach. Specifically, the proposed approach utilizes the run-time values of program variables to represent failures and designs a set of rules to measure the similarity between them. Experimental results demonstrate the competitiveness of the proposed approach: compared with the state-of-the-art technique in this field, it achieves improvements of 44.12% and 27.59% in fault number estimation, as well as 47.30% and 26.93% in clustering effectiveness, in simulated and real-world environments, respectively.
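To make the general idea of variable-based failure proximity concrete, here is a toy sketch (ours, not the paper's actual rule set): each failing run is represented by a snapshot of recorded variable values, and the fraction of shared variables whose values disagree serves as a simple distance.

```python
from itertools import combinations

def signature_distance(sig_a, sig_b):
    """Fraction of shared variables whose recorded run-time values disagree."""
    shared = set(sig_a) & set(sig_b)
    if not shared:
        return 1.0  # nothing in common: treat as maximally distant
    return sum(sig_a[v] != sig_b[v] for v in shared) / len(shared)

# Hypothetical failing test runs, each represented by variable snapshots.
failures = {
    "fail_1": {"x": 0, "buf_len": 17, "flag": True},
    "fail_2": {"x": 0, "buf_len": 17, "flag": True},   # plausibly the same fault
    "fail_3": {"x": 5, "buf_len": -1, "flag": False},  # plausibly a different fault
}
for a, b in combinations(failures, 2):
    print(a, b, signature_distance(failures[a], failures[b]))
```

Failures with small pairwise distances would then be clustered together, one cluster per suspected fault.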
The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high-computational-power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact, in terms of energy consumed and CO$_2$ emissions produced during training, by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture, and of the location where the computations are hosted, on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the differences between software-based and hardware-based tools for measuring energy consumption.
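As an example of the software-based side of (iii), a tracker such as the open-source codecarbon package estimates energy and emissions around a training run; this is a generic sketch with `train_model()` as a placeholder, not necessarily the tooling used in the study:

```python
from codecarbon import EmissionsTracker

def train_model():
    pass  # placeholder for the CNN training loop under study

tracker = EmissionsTracker()   # software-based: estimates power of CPU/GPU/RAM
tracker.start()
train_model()
emissions_kg = tracker.stop()  # returns estimated kg of CO2-equivalent
print(f"estimated emissions: {emissions_kg:.6f} kg CO2eq")
```

Hardware-based tools (e.g., external power meters) measure the whole machine at the wall instead of estimating per-component draw, which is precisely the discrepancy point (iii) examines.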
In realistic compressed sensing (CS) scenarios, the obtained measurements usually have to be quantized to a finite number of bits before transmission and/or storage, which poses a challenge for recovery, especially under extremely coarse quantization such as 1-bit sign measurements. Recently, Meng and Kabashima proposed an efficient quantized compressed sensing algorithm called QCS-SGM that uses score-based generative models as an implicit prior. Thanks to the power of score-based generative models in capturing the rich structure of the prior, QCS-SGM achieves remarkably better performance than previous quantized CS methods. However, QCS-SGM is restricted to (approximately) row-orthogonal sensing matrices, since the likelihood score is otherwise intractable. To address this challenging problem, in this paper we propose an improved version of QCS-SGM, termed QCS-SGM+, that also works well for general matrices. The key idea is a Bayesian inference perspective on the likelihood score computation, whereby an expectation propagation algorithm is proposed to approximate the likelihood score. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM+ outperforms QCS-SGM by a large margin when sensing matrices are far from row-orthogonal.
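For context, a generic quantized CS model can be written (in our notation, which may differ from the paper's) as
$$ y = \mathcal{Q}(Ax + n), \qquad n \sim \mathcal{N}(0, \sigma^2 I), $$
where $\mathcal{Q}$ is an element-wise quantizer (e.g., $\mathcal{Q} = \mathrm{sign}$ in the 1-bit case). Posterior sampling with a score-based prior requires the likelihood score $\nabla_x \log p(y \mid x)$ at each diffusion step; when $A A^\top$ is (approximately) diagonal this score decouples across measurements and is tractable, which is the row-orthogonality restriction that the expectation propagation approximation in QCS-SGM+ is designed to remove.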
Autonomic computing investigates how systems can achieve (user-)specified control outcomes on their own, without the intervention of a human operator. Autonomic computing fundamentals have been substantially influenced by those of control theory for closed- and open-loop systems. In practice, complex systems may exhibit a number of concurrent and interdependent control loops. Despite research into autonomic models for managing computer resources, ranging from individual resources (e.g., web servers) to resource ensembles (e.g., multiple resources within a data center), integrating Artificial Intelligence (AI) and Machine Learning (ML) to improve resource autonomy and performance at scale remains a fundamental challenge. Such AI/ML-driven autonomic self-management of systems can be realized at different levels of granularity, from full to human-in-the-loop automation. In this article, leading academics, researchers, practitioners, engineers, and scientists in the fields of cloud computing, AI/ML, and quantum computing join to discuss current research and potential future directions for these fields. Further, we discuss challenges and opportunities for leveraging AI and ML in next-generation computing for emerging computing paradigms, including cloud, fog, edge, serverless, and quantum computing environments.
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for inference and reasoning about the behavior of stochastic systems affected by external manipulations (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effect estimators for instrumental variable settings that employ data-dependent mean squared prediction error regularization. In certain settings, our proposed estimators show mean squared error improvements over both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. Furthermore, we propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.
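For concreteness, the K-class estimators referenced above have the standard closed form (notation ours): with outcome $Y$, regressors $X$, instruments $Z$, and projection $P_Z = Z(Z^\top Z)^{-1} Z^\top$,
$$ \hat{\beta}(\kappa) = \big( X^\top (I - \kappa (I - P_Z)) X \big)^{-1} X^\top (I - \kappa (I - P_Z)) Y, $$
which recovers ordinary least squares at $\kappa = 0$ and two-stage least squares at $\kappa = 1$; the distributional robustness results concern the behavior of $\hat{\beta}(\kappa)$ across this family.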
Edge intelligence refers to a set of connected systems and devices that, based on artificial intelligence, collect, cache, process, and analyse data in locations close to where the data is captured. The aim of edge intelligence is to enhance the quality and speed of data processing and to protect the privacy and security of the data. Although this field of research emerged only recently, spanning the period from 2011 to now, it has shown explosive growth over the past five years. In this paper, we present a thorough and comprehensive survey of the literature surrounding edge intelligence. We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference, and edge offloading, based on theoretical and practical results pertaining to proposed and deployed systems. We then systematically classify the state of the solutions by examining research results and observations for each of the four components and present a taxonomy that includes practical problems, adopted techniques, and application goals. For each category, we elaborate on, compare, and analyse the literature from the perspectives of adopted techniques, objectives, performance, advantages, drawbacks, and so on. This survey article provides a comprehensive introduction to edge intelligence and its application areas. In addition, we summarise the development of this emerging research field and the current state of the art, and discuss important open issues and possible theoretical and technical solutions.
Pre-trained deep neural network language models such as ELMo, GPT, BERT, and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from those of the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
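To make the shared-projection idea concrete, here is a minimal PyTorch-style sketch (dimensions and the single projection are illustrative assumptions, not the paper's exact setup): student hidden states are mapped through a trainable projection into the teacher's space and penalized for deviating from the teacher's hidden states.

```python
import torch
import torch.nn as nn

d_teacher, d_student = 768, 192                      # illustrative hidden sizes
proj = nn.Linear(d_student, d_teacher, bias=False)   # shared trainable projection

def layer_distill_loss(h_student, h_teacher):
    """Match projected student hidden states to the teacher's hidden states."""
    return nn.functional.mse_loss(proj(h_student), h_teacher)

h_s = torch.randn(8, 128, d_student)  # (batch, seq_len, student hidden dim)
h_t = torch.randn(8, 128, d_teacher)  # (batch, seq_len, teacher hidden dim)
loss = layer_distill_loss(h_s, h_t)
loss.backward()                       # gradients flow into the shared projection
```

Sharing one projection across layers keeps the number of added parameters small, which matters when the goal is a sub-7MB student model.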