Stencil computation is one of the most used kernels in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three unique properties: (1) low arithmetic intensity, (2) limited temporal data reuse, and (3) regular and predictable data access pattern. As a result, stencil computations are typically bandwidth-bound workloads, which only experience limited benefits from the deep cache hierarchy of modern CPUs. In this work, we propose Casper, a near-cache accelerator consisting of specialized stencil compute units connected to the last-level cache (LLC) of a traditional CPU. Casper is based on two key ideas: (1) avoiding the cost of moving rarely reused data through the cache hierarchy, and (2) exploiting the regularity of the data accesses and the inherent parallelism of the stencil computation to increase the overall performance. With minimal changes in LLC address decoding logic and data placement, Casper performs stencil computations at the peak bandwidth of the LLC. We show that, by tightly coupling lightweight stencil compute units near to LLC, Casper improves the performance of stencil kernels by 1.65x on average, while reducing the energy consumption by 35% compared to a commercial high-performance multi-core processor. Moreover, Casper provides a 37x improvement in performance-per-area compared to a state-of-the-art GPU.
The Message Queuing Telemetry Transport (MQTT) protocol is one of the most widely used IoT protocol solutions. In this work, we are especially interested in open-source MQTT Broker implementations (such as Mosquitto, EMQX, RabbitMQ, VerneMQ, and HiveMQ). To this end, we engineer a network testbed to experimentally benchmark the performance of these implementations in an edge computing context with constrained devices. In more detail, we engineer an automated deployment and orchestration of the containerized MQTT broker implementations, with support for deployment across either moderately powerful AMD64 devices, or more resource constrained ARM64 devices. The proposed MQTT implementations are evaluated in terms of overhead response time and different payload sizes. Results showed that the hardware platform used as well as the message size, and the network parameters (latency, packet loss and jitter) have a significant impact on the performance differences between the brokers. All results, software tools and code are fully reproducible and free and open source.
Recently, intermittent computing (IC) has received tremendous attention due to its high potential in perpetual sensing for Internet-of-Things (IoT). By harvesting ambient energy, battery-free devices can perform sensing intermittently without maintenance, thus significantly improving IoT sustainability. To build a practical intermittently-powered sensing system, efficient routing across battery-free devices for data delivery is essential. However, the intermittency of these devices brings new challenges, rendering existing routing protocols inapplicable. In this paper, we propose RICS, the first-of-its-kind routing scheme tailored for intermittently-powered sensing systems. RICS features two major designs, with the goal of achieving low-latency data delivery on a network built with battery-free devices. First, RICS incorporates a fast topology construction protocol for each IC node to establish a path towards the sink node with the least hop count. Second, RICS employs a low-latency message forwarding protocol, which incorporates an efficient synchronization mechanism and a novel technique called pendulum-sync to avoid the time-consuming repeated node synchronization. Our evaluation based on an implementation in OMNeT++ and comprehensive experiments with varying system settings show that RICS can achieve orders of magnitude latency reduction in data delivery compared with the baselines.
In recent years, Field-Programmable Gate Arrays (FPGA) have evolved rapidly paving the way for a whole new range of computing paradigms. On the other hand, computer applications are evolving. There is a rising demand for a system that is general-purpose and yet has the processing abilities to accommodate current trends in application processing. This work proposes a design and implementation of a tightly-coupled FPGA-based dual-processor platform. We architect a platform that optimizes the utilization of FPGA resources and allows for the investigation of practical implementation issues such as cache design. The performance of the proposed prototype is then evaluated, as different configurations of a uniprocessor and a dual-processor system are studied and compared against each other and against published results for common industry-standard CPU platforms. The proposed implementation utilizes the Nios II 32-bit embedded soft-core processor architecture designed for the Altera Cyclone III family of FPGAs.
Elliptic interface boundary value problems play a major role in numerous applications involving heat, fluids, materials, and proteins, to name a few. As an example, in implicit variational solvation, for the construction of biomolecular shapes, the electrostatic contributions satisfy the Poisson-Boltzmann equation with discontinuous dielectric constants across the interface. When interface motions are involved, one often needs not only accurate solution values, but accurate derivatives as well, such as the normal derivatives at the interface. We introduce here the Compact Coupling Interface Method (CCIM), a finite difference method for the elliptic interface problem with interfacial jump conditions. The CCIM can calculate solution values and their derivatives up to second-order accuracy in arbitrary ambient space dimensions. It combines elements of Chern and Shu's Coupling Interface Method and Mayo's approach for elliptic interface boundary value problems, leading to more compact finite difference stencils that are applicable to more general situations. Numerical results on a variety of geometric interfacial shapes and on complex protein molecules in three dimensions support the efficacy of our approach and reveal advantages in accuracy and robustness.
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is $27\times$ faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is $2.8\times$ and $3.2\times$ than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
Sensing technologies deployed in the workplace can unobtrusively collect detailed data about individual activities and group interactions that are otherwise difficult to capture. A hopeful application of these technologies is that they can help businesses and workers optimize productivity and wellbeing. However, given the workplace's inherent and structural power dynamics, the prevalent approach of accepting tacit compliance to monitor work activities rather than seeking workers' meaningful consent raises privacy and ethical concerns. This paper unpacks the challenges workers face when consenting to workplace wellbeing technologies. Using a hypothetical case to prompt reflection among six multi-stakeholder focus groups involving 15 participants, we explored participants' expectations and capacity to consent to these technologies. We sketched possible interventions that could better support meaningful consent to workplace wellbeing technologies by drawing on critical computing and feminist scholarship -- which reframes consent from a purely individual choice to a structural condition experienced at the individual level that needs to be freely given, reversible, informed, enthusiastic, and specific (FRIES). The focus groups revealed how workers are vulnerable to "meaningless" consent -- as they may be subject to power dynamics that minimize their ability to withhold consent and may thus experience an erosion of autonomy, also undermining the value of data gathered in the name of "wellbeing." To meaningfully consent, participants wanted changes to the technology and to the policies and practices surrounding the technology. Our mapping of what prevents workers from meaningfully consenting to workplace wellbeing technologies (challenges) and what they require to do so (interventions) illustrates how the lack of meaningful consent is a structural problem requiring socio-technical solutions.
Within the tensor singular value decomposition (T-SVD) framework, existing robust low-rank tensor completion approaches have made great achievements in various areas of science and engineering. Nevertheless, these methods involve the T-SVD based low-rank approximation, which suffers from high computational costs when dealing with large-scale tensor data. Moreover, most of them are only applicable to third-order tensors. Against these issues, in this article, two efficient low-rank tensor approximation approaches fusing randomized techniques are first devised under the order-d (d >= 3) T-SVD framework. On this basis, we then further investigate the robust high-order tensor completion (RHTC) problem, in which a double nonconvex model along with its corresponding fast optimization algorithms with convergence guarantees are developed. To the best of our knowledge, this is the first study to incorporate the randomized low-rank approximation into the RHTC problem. Empirical studies on large-scale synthetic and real tensor data illustrate that the proposed method outperforms other state-of-the-art approaches in terms of both computational efficiency and estimated precision.
Games and simulators can be a valuable platform to execute complex multi-agent, multiplayer, imperfect information scenarios with significant parallels to military applications: multiple participants manage resources and make decisions that command assets to secure specific areas of a map or neutralize opposing forces. These characteristics have attracted the artificial intelligence (AI) community by supporting development of algorithms with complex benchmarks and the capability to rapidly iterate over new ideas. The success of artificial intelligence algorithms in real-time strategy games such as StarCraft II have also attracted the attention of the military research community aiming to explore similar techniques in military counterpart scenarios. Aiming to bridge the connection between games and military applications, this work discusses past and current efforts on how games and simulators, together with the artificial intelligence algorithms, have been adapted to simulate certain aspects of military missions and how they might impact the future battlefield. This paper also investigates how advances in virtual reality and visual augmentation systems open new possibilities in human interfaces with gaming platforms and their military parallels.
Co-evolving time series appears in a multitude of applications such as environmental monitoring, financial analysis, and smart transportation. This paper aims to address the following challenges, including (C1) how to incorporate explicit relationship networks of the time series; (C2) how to model the implicit relationship of the temporal dynamics. We propose a novel model called Network of Tensor Time Series, which is comprised of two modules, including Tensor Graph Convolutional Network (TGCN) and Tensor Recurrent Neural Network (TRNN). TGCN tackles the first challenge by generalizing Graph Convolutional Network (GCN) for flat graphs to tensor graphs, which captures the synergy between multiple graphs associated with the tensors. TRNN leverages tensor decomposition to model the implicit relationships among co-evolving time series. The experimental results on five real-world datasets demonstrate the efficacy of the proposed method.
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.