Intermittent computing systems undergo frequent power failures, hindering the capture of necessary data samples and timely on-device computation. These missed samples and deadlines limit the potential of intermittent computing systems in many time-sensitive and fault-tolerant applications. A group or swarm of intermittent nodes may, however, amalgamate to sense and process all the samples by taking turns waking up, thereby extending their collective on-time. However, coordinating a swarm of intermittent computing nodes requires frequent and power-hungry communication, which is often infeasible with limited energy. Though previous works have shown promise when all intermittent nodes have access to the same amount of harvestable energy, scenarios in which the available energy distribution differs across nodes have yet to be explored. The proposed AICS framework provides an amalgamated intermittent computing system in which each node schedules its wake-ups based on its duty cycle, without communication overhead. We propose an offline tailored duty-cycle selection method (Prime-Co-Prime), which schedules wake-up and sleep cycles for each node based on that node's measured harvestable energy and prior knowledge or an estimate of the relative energy distribution. When the available energy is variable, the problem is instead formulated as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), and each node uses a set of heuristics to solve the Dec-POMDP and schedule its wake-up cycle.
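As a loose illustration of the co-prime scheduling idea, the sketch below (Python, with hypothetical helper names) assigns pairwise co-prime wake-up periods to nodes in proportion to their relative energy budgets and measures the fraction of time slots covered by at least one awake node; it is not the AICS Prime-Co-Prime algorithm itself.

```python
# Illustrative sketch only: co-prime wake-up periods scaled to relative energy
# budgets (hypothetical helper names, not the AICS implementation). Time is
# slotted and a node with period p is awake in slots that are multiples of p.

def assign_coprime_periods(energy_budgets, candidate_primes=(2, 3, 5, 7, 11, 13)):
    """Give energy-rich nodes short (frequent) periods, energy-poor nodes long ones."""
    order = sorted(range(len(energy_budgets)), key=lambda i: -energy_budgets[i])
    periods = [None] * len(energy_budgets)
    for rank, node in enumerate(order):
        periods[node] = candidate_primes[rank % len(candidate_primes)]
    return periods

def collective_coverage(periods, horizon=210):
    """Fraction of slots in which at least one node is awake."""
    covered = sum(any(t % p == 0 for p in periods) for t in range(horizon))
    return covered / horizon

periods = assign_coprime_periods([5.0, 2.0, 1.0])   # arbitrary relative energy levels
print(periods, collective_coverage(periods))         # e.g. [2, 3, 5] 0.733...
```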
The advancement of Virtual Reality (VR) technology is focused on improving its immersiveness, supporting multiuser Virtual Experiences (VEs), and enabling users to move freely within their VEs while remaining confined within specialized VR setups through Redirected Walking (RDW). To meet their extreme data-rate and latency requirements, future VR systems will require supporting wireless networking infrastructures operating at millimeter-wave (mmWave) frequencies that leverage highly directional communication in both transmission and reception through beamforming and beamsteering. We propose the use of predictive context-awareness to optimize transmitter- and receiver-side beamforming and beamsteering. By predicting users' short-term lateral movements in multiuser VR setups with RDW, transmitter-side beamforming and beamsteering can be optimized through Line-of-Sight (LoS) "tracking" in the users' directions. At the same time, predictions of short-term orientational movements can be utilized for receiver-side beamforming for enhanced coverage flexibility. We target two open problems in predicting these two types of context information: i) predicting lateral movements in multiuser VR settings with RDW, and ii) generating synthetic head rotation datasets for training orientational movement predictors. Our experimental results demonstrate that Long Short-Term Memory (LSTM) networks offer promising accuracy in predicting lateral movements, and that context-awareness stemming from the VEs further enhances this accuracy. Additionally, we show that a TimeGAN-based approach to orientational data generation can create synthetic samples that closely match experimentally obtained ones.
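A minimal sketch of an LSTM-based short-term lateral movement predictor is shown below; the architecture, hyperparameters, and PyTorch usage are illustrative assumptions rather than the model evaluated in the paper.

```python
# Minimal sketch of an LSTM-based short-term lateral movement predictor
# (illustrative, not the authors' architecture). Assumes input sequences of
# 2D (x, y) positions sampled at a fixed rate; hyperparameters are arbitrary.
import torch
import torch.nn as nn

class LateralPredictor(nn.Module):
    def __init__(self, hidden=64, horizon=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * horizon)  # predict the next `horizon` positions
        self.horizon = horizon

    def forward(self, xy_history):                  # (batch, seq_len, 2)
        _, (h, _) = self.lstm(xy_history)
        return self.head(h[-1]).view(-1, self.horizon, 2)

model = LateralPredictor()
dummy = torch.randn(8, 20, 2)                       # 8 users, 20 past position samples each
print(model(dummy).shape)                           # torch.Size([8, 5, 2])
```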
In the next-generation Internet of Things, the overhead introduced by grant-based multiple access protocols may engulf the access network as a consequence of the unprecedented number of connected devices. Grant-free access protocols are therefore gaining increasing interest to support massive access from machine-type devices with intermittent activity. In this paper, coded random access (CRA) with massive multiple-input multiple-output (MIMO) is investigated as a solution for designing highly scalable massive multiple access protocols, taking into account stringent requirements on latency and reliability. With a focus on signal processing aspects at the physical layer and their impact on overall system performance, critical issues of successive interference cancellation (SIC) over fading channels are first analyzed. Then, SIC algorithms and a scheduler are proposed that can overcome some of the limitations of current access protocols. The effectiveness of the proposed processing algorithms is validated by Monte Carlo simulation, for different CRA protocols and by comparison with developed benchmarks.
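The following toy simulation illustrates the SIC decoding principle underlying CRA: users transmit replicas in random slots, singleton slots are decoded, and the corresponding replicas are cancelled. It assumes ideal cancellation and ignores fading and capture, which are precisely the effects the paper analyzes.

```python
# Toy successive interference cancellation (SIC) loop for a coded random access
# frame (illustrative). A slot is decodable whenever exactly one undecoded user
# remains in it; replica cancellation is assumed ideal.
import numpy as np

def sic_decode(slot_membership):
    """slot_membership[s] = set of users transmitting a replica in slot s."""
    decoded = set()
    progress = True
    while progress:
        progress = False
        for users in slot_membership:
            pending = users - decoded
            if len(pending) == 1:          # singleton slot: decode and cancel replicas
                decoded |= pending
                progress = True
    return decoded

rng = np.random.default_rng(0)
n_users, n_slots = 50, 100
slots = [set() for _ in range(n_slots)]
for u in range(n_users):                   # each user sends 2 replicas in random slots
    for s in rng.choice(n_slots, size=2, replace=False):
        slots[s].add(u)
print(len(sic_decode(slots)), "of", n_users, "users resolved")
```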
Intelligent reflecting surface (IRS) is a promising and disruptive technique to extend network coverage and improve spectral efficiency. This paper investigates an IRS-assisted Terahertz (THz) multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) system based on hybrid precoding in the presence of an eavesdropper. Two types of sparse RF-chain antenna structures are adopted, i.e., a sub-connected structure and a fully connected structure. Cluster heads are first selected for transmission, and discrete-phase analog precoding is designed for the transmit beamforming. Subsequently, based on the channel conditions, the users are grouped into multiple clusters, and each cluster is served using the NOMA technique. In addition, a low-complexity zero-forcing method is employed to design the digital precoding so as to eliminate interference between clusters. On this basis, we propose a secure transmission scheme to maximize the sum secrecy rate by jointly optimizing the power allocation and the phase shifts of the IRS under constraints on the system transmission power, the achievable rate requirement of each user, and the IRS phase shifts. Due to multiple coupled variables, the formulated problem is non-convex. We apply the Taylor series expansion and semidefinite programming to convert the original non-convex problem into a convex one. Then, an alternating optimization algorithm is developed to obtain a feasible solution to the original problem. Simulation results validate the convergence of the proposed algorithm and confirm that the deployment of IRS can significantly improve the secrecy performance.
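As a small illustration of the zero-forcing step used to suppress inter-cluster interference, the sketch below computes a ZF digital precoder from an assumed effective channel matrix (one row per cluster head after analog beamforming); the dimensions and normalization are illustrative.

```python
# Minimal sketch of a zero-forcing (ZF) digital precoder that removes
# inter-cluster interference (illustrative). H_eff stacks one effective channel
# row per cluster head after analog beamforming; columns are power-normalized.
import numpy as np

rng = np.random.default_rng(1)
n_clusters, n_rf = 4, 4
H_eff = (rng.standard_normal((n_clusters, n_rf)) +
         1j * rng.standard_normal((n_clusters, n_rf))) / np.sqrt(2)

D = H_eff.conj().T @ np.linalg.inv(H_eff @ H_eff.conj().T)  # ZF: right pseudo-inverse
D /= np.linalg.norm(D, axis=0, keepdims=True)               # unit-power columns

print(np.round(np.abs(H_eff @ D), 3))  # ~diagonal: inter-cluster interference removed
```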
Ultra-reliable low latency communications (uRLLC) is adopted in the fifth generation (5G) mobile networks to better support mission-critical applications that demand a high level of reliability and low latency. With the aid of well-established multiple-input multiple-output (MIMO) information theory, uRLLC in the future 6G is expected to provide enhanced capability towards extreme connectivity. Since the latency constraint can be represented equivalently by the blocklength, finite-blocklength channel coding theory plays an important role in the theoretical analysis of uRLLC. On the basis of Polyanskiy's and Yang's asymptotic results, we first derive exact closed-form expressions for the expectation and variance of the channel dispersion. Then, a bound on the average maximal achievable rate is given for massive MIMO systems in ideal independent and identically distributed fading channels. This study reveals the underlying connections among the fundamental parameters in MIMO transmissions in a concise and complete closed-form formula. Most importantly, the inversely proportional law observed therein implies that the latency can be further reduced at the expense of spatial degrees of freedom.
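For context, this finite-blocklength analysis builds on the standard normal approximation of Polyanskiy et al., $R^{*}(n,\epsilon) \approx C - \sqrt{V/n}\,Q^{-1}(\epsilon) + \frac{\log_2 n}{2n}$, where $C$ is the channel capacity, $V$ the channel dispersion, $n$ the blocklength, $\epsilon$ the target error probability, and $Q^{-1}(\cdot)$ the inverse Gaussian tail function; shorter blocklengths (i.e., tighter latency) thus incur a larger rate back-off governed by the dispersion.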
Distributed tensor decomposition (DTD) is a fundamental data-analytics technique that extracts latent important properties from high-dimensional multi-attribute datasets distributed over edge devices. Conventionally, its wireless implementation follows a one-shot approach that first computes local results at the devices using local data and then aggregates them at a server for global computation using communication-efficient techniques such as over-the-air computation (AirComp). Such an implementation is confronted with the issues of limited storage-and-computation capacity and link interruption, which motivates us to propose the framework of on-the-fly communication-and-computing (FlyCom$^2$) in this work. The proposed framework enables streaming computation with low complexity by leveraging a random sketching technique and achieves progressive global aggregation through the integration of progressive uploading and multiple-input multiple-output (MIMO) AirComp. To develop FlyCom$^2$, an on-the-fly subspace estimator is designed to take the real-time sketches accumulated at the server and generate online estimates for the decomposition. Its performance is evaluated by deriving both deterministic and probabilistic error bounds using perturbation theory and the concentration of measure. Both results reveal that the decomposition error is inversely proportional to the number of sketching observations received by the server. To further rein in the noise effect on the error, we propose a threshold-based scheme that selects a subset of sufficiently reliable received sketches for DTD at the server. Experimental results validate the performance gain of the proposed selection algorithm and show that, compared to its one-shot counterparts, the proposed FlyCom$^2$ achieves comparable (even better in the case of large eigen-gaps) decomposition accuracy while dramatically reducing devices' complexity costs.
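The sketch below conveys the random-sketching intuition behind FlyCom$^2$ in a simplified matrix setting: devices compress local low-rank data with a shared random sketching matrix, the server accumulates the sketches as they stream in, and a dominant subspace is estimated from the accumulation. It is an illustrative toy, not the proposed on-the-fly estimator or the MIMO AirComp aggregation.

```python
# Toy illustration of the random-sketching idea (not the FlyCom^2 estimator):
# devices send sketches of local low-rank data; the server accumulates them and
# estimates the dominant subspace from the accumulated sketch.
import numpy as np

rng = np.random.default_rng(2)
d, k, sketch_dim, n_devices = 64, 3, 16, 20

U_true = np.linalg.qr(rng.standard_normal((d, k)))[0]       # ground-truth subspace
Omega = rng.standard_normal((d, sketch_dim)) / np.sqrt(sketch_dim)

Y = np.zeros((d, sketch_dim))
for _ in range(n_devices):                                   # streaming accumulation
    local = U_true @ rng.standard_normal((k, 30))            # local low-rank data
    Y += (local @ local.T) @ Omega                           # device contributes a sketch

U_hat = np.linalg.qr(Y)[0][:, :k]                            # server-side subspace estimate
err = np.linalg.norm(U_true - U_hat @ (U_hat.T @ U_true))    # residual of projection
print("subspace residual:", round(float(err), 4))            # ~0 in this noiseless toy
```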
In conventional backscatter communication (BackCom) systems, time division multiple access (TDMA) and frequency division multiple access (FDMA) are generally adopted for multiuser backscattering due to their simplicity of implementation. However, as the number of backscatter devices (BDs) proliferates, these traditional centralized control techniques incur a high overhead, and inter-user coordination is unaffordable for the passive BDs; these issues have received little attention in existing works and remain unsolved. To this end, in this paper, we propose slotted-ALOHA-based random access for BackCom systems, in which each BD is randomly chosen and allowed to coexist with one active device for hybrid multiple access. To characterize and evaluate the performance, a resource allocation problem maximizing the minimum transmission rate is formulated, in which transmit antenna selection, receive beamforming design, reflection coefficient adjustment, power control, and access probability determination are jointly considered. To deal with this intractable problem, we first transform the max-min objective function into an equivalent linear one and then decompose the resulting problem into three sub-problems. Next, a block coordinate descent (BCD)-based greedy algorithm with a penalty function, successive convex approximation, and linear programming is designed to obtain sub-optimal solutions in a tractable manner. Simulation results demonstrate that the proposed algorithm outperforms benchmark algorithms in terms of transmission rate and fairness.
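As a side illustration of the slotted-ALOHA access component, the toy simulation below sweeps the access probability and counts singleton slots as successes, recovering the classical result that throughput peaks near an access probability of one over the number of devices; the receive beamforming, reflection coefficients, and power control of the joint problem are omitted.

```python
# Toy slotted-ALOHA access simulation for backscatter devices (illustrative).
# A BD's backscattering is assumed to succeed only when exactly one BD is active
# in a slot; the access probability q is swept to locate the throughput optimum.
import numpy as np

rng = np.random.default_rng(3)
n_bd, n_slots = 10, 100_000

for q in (0.05, 0.1, 0.2, 0.5):
    active = rng.random((n_slots, n_bd)) < q
    successes = np.sum(active.sum(axis=1) == 1)      # singleton slots succeed
    print(f"q={q:.2f}  throughput={successes / n_slots:.3f}")
# Peak throughput occurs near q = 1/n_bd, matching the classical slotted-ALOHA result.
```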
Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. Temporal blocking is an optimization that improves data locality by combining a batch of time steps and processing them together. Observing that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features of recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile by tile. We compare EBISU with state-of-the-art temporal blocking libraries, STENCILGEN and AN5D, and with state-of-the-art stencil auto-tuning tools equipped with temporal blocking optimizations, ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups of up to $2.53$x and a geometric-mean speedup of $1.49$x over the best state-of-the-art performance in each stencil benchmark.
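To make the temporal blocking idea concrete, the sketch below applies overlapped (redundant-computation) temporal blocking to a 1-D Jacobi stencil in plain Python: each tile is loaded with a halo, advanced several time steps locally, and only its interior is written back. It illustrates the generic technique, not EBISU's GPU implementation.

```python
# Toy 1-D Jacobi stencil with overlapped temporal blocking (generic technique,
# not EBISU). Each tile keeps a halo of width t_block so that t_block time steps
# can be advanced locally before only the tile interior is written back.
import numpy as np

def jacobi_step(a):
    out = a.copy()
    out[1:-1] = 0.5 * a[1:-1] + 0.25 * (a[:-2] + a[2:])
    return out

def temporal_blocked(a, t_block, tile):
    n = len(a)
    result = a.copy()
    for start in range(0, n, tile):
        lo, hi = max(0, start - t_block), min(n, start + tile + t_block)
        local = a[lo:hi].copy()
        for _ in range(t_block):          # advance the tile t_block steps "in cache"
            local = jacobi_step(local)
        result[start:min(n, start + tile)] = local[start - lo:start - lo + min(tile, n - start)]
    return result

x = np.random.default_rng(4).random(64)
ref = x.copy()
for _ in range(3):                         # reference: three global time steps
    ref = jacobi_step(ref)
print(np.allclose(ref, temporal_blocked(x, t_block=3, tile=16)))  # True
```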
Backscatter communication (BackCom), one of the core technologies to realize zero-power communication, is expected to be a pivotal paradigm for the next generation of the Internet of Things (IoT). However, the "strong" direct-link (DL) interference (DLI) is traditionally regarded as harmful, since it generally drowns out the "weak" backscattered signals and thus deteriorates the performance of BackCom. In contrast to previous efforts to eliminate the DLI, in this paper we exploit constructive interference (CI), in which the DLI contributes to the backscattered signal. Specifically, our objective is to maximize the received signal power by jointly optimizing the receive beamforming vectors and tag selection factors, which is, however, non-convex and difficult to solve due to constraints on the Kullback-Leibler (KL) divergence. To solve this problem, we first decompose the original problem and then propose two algorithms: one solves the beamforming sub-problem via a change of variables and semi-definite programming (SDP), and a greedy algorithm solves the tag selection sub-problem. To gain insight into the CI, we consider a special case with a single-antenna reader and reveal the channel angle between the backscattering link (BL) and the DL at which the DLI becomes constructive. Simulation results show that the proposed algorithms always achieve a significant gain in received signal strength compared with traditional algorithms that do not exploit the DL. The derived constructive channel angle for the BackCom system with a single-antenna reader is also confirmed by the simulation results.
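The toy single-antenna example below illustrates when the DLI is constructive: the combined amplitude of the direct link and the composite backscatter link is swept over the phase angle between them, peaking when the two are in phase. The gains and angles are arbitrary illustrative values, not the derived constructive angle condition.

```python
# Toy single-antenna illustration of constructive vs. destructive direct-link
# interference (illustrative values only). The received amplitude |h_d + g*e^{j*theta}|
# is swept over the phase angle theta between the direct link and the backscatter link.
import numpy as np

h_d = 1.0                                    # direct-link gain (reference phase 0)
bl_gain = 0.3                                # magnitude of the composite backscatter link
for theta in np.linspace(0, 2 * np.pi, 9):
    combined = abs(h_d + bl_gain * np.exp(1j * theta))
    print(f"angle={np.degrees(theta):6.1f} deg  |combined|={combined:.3f}")
# The DL adds constructively when the angle is near 0 (in phase) and destructively near 180 deg.
```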
In order to satisfy their ever-increasing capacity and compute requirements, many machine learning models are distributed across multiple nodes using space-efficient parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with communication using GPU-initiated networking, and we leverage GPUs' massive parallelism to enable fine-grained overlap of the fused operations. We have developed a single, self-contained GPU kernel in which workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation, while other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. Furthermore, we propose zero-copy optimizations for peer-to-peer GPU communication in which the data computed by one GPU is written directly to the destination buffers on the peer GPUs, eliminating intermediate stores and extra buffering. Our approach leverages the emerging multi-node GPU system trend in which GPUs are physically close to the network through direct GPU-NIC interconnects. We demonstrate our approach by creating a fused embedding + All-to-All kernel that overlaps embedding operations and the dependent all-to-all collective in DLRM models. We evaluate our approach using both simulation and real hardware. Our evaluations show that our approach can effectively overlap All-to-All communication with embedding computations, reducing their combined execution time by 31% on average (up to 58%) for inter-node configurations and by 25% (up to 35%) for intra-node configurations. Scale-out simulations indicate that our approach reduces DLRM execution time by ~10% for a 128-node system.
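A conceptual sketch of the fine-grained overlap is given below using Python threads: as soon as a "workgroup" finishes its chunk, the result is handed to a sender while the remaining chunks keep computing, instead of communicating only after all computation completes. It mimics the scheduling idea only; the actual design uses GPU-initiated networking inside a single fused kernel.

```python
# Conceptual sketch of fine-grained compute/communication overlap (not the
# GPU-initiated networking implementation): completed chunks are communicated
# immediately while later chunks are still being computed.
import queue, threading, time

def compute_chunks(n_chunks, out_q, compute_ms=5):
    for i in range(n_chunks):
        time.sleep(compute_ms / 1000)        # stand-in for embedding computation
        out_q.put(i)                         # hand the chunk off for communication right away
    out_q.put(None)                          # sentinel: no more chunks

def send_chunks(in_q, send_ms=5):
    while (chunk := in_q.get()) is not None:
        time.sleep(send_ms / 1000)           # stand-in for the all-to-all transfer

q = queue.Queue()
start = time.perf_counter()
sender = threading.Thread(target=send_chunks, args=(q,))
sender.start()
compute_chunks(8, q)
sender.join()
print(f"overlapped: {time.perf_counter() - start:.3f}s vs ~{8 * 0.01:.3f}s sequential")
```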
Driven by the visions of the Internet of Things and 5G communications, edge computing systems integrate computing, storage, and network resources at the edge of the network to provide a computing infrastructure, enabling developers to quickly develop and deploy edge applications. Edge computing systems have now received widespread attention in both industry and academia. To explore new research opportunities and assist users in selecting suitable edge computing systems for specific applications, this survey provides a comprehensive overview of existing edge computing systems and introduces representative projects. A comparison of open-source tools is presented according to their applicability. Finally, we highlight energy efficiency and deep learning optimization in edge computing systems. Open issues for analyzing and designing an edge computing system are also studied in this survey.