We present the design of a new passive communication method that does not rely on ambient or generated RF sources. Instead, we exploit the Johnson (thermal) noise generated by a resistor to transmit information bits wirelessly. By switching the load connected to an antenna between a resistor and an open circuit, we achieve data rates of up to 26 bps and distances of up to 7.3 meters. This communication method consumes orders of magnitude less power than conventional communication schemes and presents the opportunity to enable wireless communication in areas with a complete lack of connectivity.
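As a rough illustration of the signal levels involved, the available Johnson noise power of a resistor delivered to a matched load is $P = kTB$. The sketch below evaluates this textbook formula; the temperature (290 K) and bandwidth (1 MHz) are assumed values chosen for illustration and are not taken from the paper.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def thermal_noise_power_watts(temperature_k, bandwidth_hz):
    """Available Johnson noise power P = k*T*B delivered to a matched load."""
    return BOLTZMANN * temperature_k * bandwidth_hz


def watts_to_dbm(power_w):
    """Convert a power in watts to dBm (decibels relative to 1 mW)."""
    return 10.0 * math.log10(power_w / 1e-3)


# Assumed illustrative values: room temperature and a 1 MHz bandwidth.
noise_dbm = watts_to_dbm(thermal_noise_power_watts(290.0, 1e6))
print(f"Available thermal noise power: {noise_dbm:.2f} dBm")
```

Under these assumptions the available noise power is about -114 dBm, which suggests why such a link is limited to low data rates and short distances.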
Empirical results in software engineering have long shown that findings are unlikely to apply to all software systems or to every domain: results need to be evaluated in specified contexts and limited to the type of systems from which they were extracted. This is a known issue, and it requires the establishment of a classification of software types. This paper makes two contributions. The first is to evaluate the quality of the current landscape of software classifications. The second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts are doomed to fail because of one or more issues, which we call the `antipatterns' of software classification tasks. We collected 7 of these antipatterns, which emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.
Improved five-point low-dissipation nonlinear schemes are proposed in this paper within the framework of weighted compact nonlinear schemes (WCNSs) \cite{Deng2000}. In particular, we follow the work of Li and Du \cite{Li2016} on the two-stage fourth-order time-accurate discretization scheme, which is developed based on the Lax-Wendroff method.
Integrated sensing and communication (ISAC) creates a platform to exploit the synergy between two powerful functionalities that have been developing separately. However, the interference management and resource allocation between sensing and communication have not been fully studied. In this paper, we consider the design of perceptive mobile networks (PMNs) by adding sensing capability to current cellular networks. To avoid full-duplex operation, we propose a PMN with distributed target monitoring terminals (TMTs), where passive TMTs are deployed over wireless networks to locate the sensing target (ST). We jointly optimize the transmit and receive beamformers towards the communication user terminals (UEs) and the ST by alternating optimization (AO) and prove its convergence. To reduce computational complexity and obtain physical insights, we further investigate the use of linear transceivers, including zero forcing and beam synthesis (B-syn). Our analysis reveals interesting physical insights regarding interference management and resource allocation between sensing and communication: 1) instead of forming dedicated sensing signals, it is more efficient to redesign the communication signals for both communication and sensing purposes and "leak" communication energy for sensing; 2) the amount of energy leakage from one UE to the ST depends on their relative locations.
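To make the zero-forcing option concrete, the following is a minimal sketch of a textbook zero-forcing precoder $W = H^H (H H^H)^{-1}$ for two single-antenna users and a multi-antenna transmitter. It covers only the communication users, not the sensing target, and is not the paper's joint design; the channel matrix `H` and all function names are hypothetical values chosen for illustration.

```python
def hermitian_transpose(m):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[m[r][c].conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is numerically singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def zero_forcing_precoder(h):
    """Return W = H^H (H H^H)^-1 for a 2 x N channel matrix (rows = users)."""
    hh = hermitian_transpose(h)
    gram = matmul(h, hh)  # 2 x 2 Gram matrix H H^H
    return matmul(hh, inverse_2x2(gram))


# Hypothetical channel: 2 users, 3 transmit antennas.
H = [[1 + 1j, 0.5 - 0.2j, -0.3 + 0.8j],
     [0.2 - 0.7j, 1 - 0.1j, 0.6 + 0.4j]]
W = zero_forcing_precoder(H)
effective = matmul(H, W)  # close to the 2 x 2 identity matrix
```

The effective channel `H W` is the identity up to rounding, so each user receives its own stream with no inter-user interference. A practical precoder would also normalize `W` to meet a transmit power constraint.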
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than performing arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new lower bounds on data movement for mixed-precision convolutions in both single-processor and parallel distributed-memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor-supplied algorithm.
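For readers unfamiliar with the Im2Col baseline mentioned above, here is a minimal single-channel sketch (stride 1, no padding) that unrolls every kernel-sized patch into a row and then applies the flattened kernel as a matrix-vector product. It illustrates the general technique only, not the paper's tiling or GEMMINI's implementation; all names and inputs are illustrative.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row (stride 1, no padding)."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            patches.append([image[i + di][j + dj]
                            for di in range(kh) for dj in range(kw)])
    return patches, out_h, out_w


def conv2d_im2col(image, kernel):
    """2D cross-correlation computed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches, out_h, out_w = im2col(image, kh, kw)
    flat = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Reshape the flat result back into an out_h x out_w grid.
    return [flat[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
patches, _, _ = im2col(image, 2, 2)
result = conv2d_im2col(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

In this example the patch matrix holds 16 values for a 9-element input, which shows how Im2Col duplicates overlapping data before the multiply.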
In this paper, we introduce $\mathsf{CO}_3$, an algorithm for communication-efficient federated Deep Neural Network (DNN) training. $\mathsf{CO}_3$ takes its name from the three processing steps applied to reduce the communication load when transmitting the local gradients from the remote users to the Parameter Server, namely: (i) gradient quantization through floating-point conversion, (ii) lossless compression of the quantized gradient, and (iii) quantization error correction. We carefully design each of the steps above so as to minimize the loss in the distributed DNN training when the communication overhead is fixed. In particular, in the design of steps (i) and (ii), we adopt the assumption that DNN gradients are distributed according to a generalized normal distribution. This assumption is validated numerically in the paper. For step (iii), we utilize an error feedback with memory decay mechanism to correct the quantization error introduced in step (i). We argue that the memory decay coefficient, similarly to the learning rate, can be optimally tuned to improve convergence. The performance of $\mathsf{CO}_3$ is validated through numerical simulations and is shown to achieve better accuracy and improved stability at a reduced communication payload.
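The error-feedback idea in step (iii) can be sketched as follows. This is a minimal illustration under stated assumptions, not the $\mathsf{CO}_3$ implementation: a uniform rounding quantizer stands in for the floating-point conversion of step (i), the lossless compression of step (ii) is omitted, and the names `step` and `decay` are hypothetical.

```python
def quantize(values, step):
    """Round each value to the nearest multiple of `step`.

    Assumption: a simple uniform quantizer used as a stand-in for the
    floating-point conversion described in the abstract.
    """
    return [step * round(v / step) for v in values]


def compress_with_error_feedback(gradient, memory, step, decay):
    """Quantize a gradient after adding back the decayed past quantization error.

    Returns the quantized gradient to transmit and the updated error memory.
    """
    corrected = [g + decay * m for g, m in zip(gradient, memory)]
    quantized = quantize(corrected, step)
    new_memory = [c - q for c, q in zip(corrected, quantized)]
    return quantized, new_memory


# Two rounds with the same (hypothetical) local gradient.
gradient = [0.30, -0.12]
memory = [0.0, 0.0]
q1, memory = compress_with_error_feedback(gradient, memory, step=0.25, decay=0.9)
q2, memory = compress_with_error_feedback(gradient, memory, step=0.25, decay=0.9)
```

In the first round the second coordinate is rounded to zero. The remembered error pushes it over the rounding threshold in the second round, so the information is delayed rather than lost. Setting `decay` below 1 limits how long stale errors persist.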
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving the sampling and non-convex optimization problems that appear in several machine learning applications. In particular, its variance-reduced versions have recently gained attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and a Log-Sobolev inequality, which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which improves on all previous analyses. We also show some essential applications of our result to non-convex optimization.
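For orientation, the basic stochastic gradient Langevin update that both variants build on is $x_{k+1} = x_k - \eta\, \hat{g}(x_k) + \sqrt{2\eta}\, \xi_k$, where $\hat{g}$ is a minibatch gradient estimate and $\xi_k$ is standard Gaussian noise. The sketch below applies the plain update, without the variance reduction studied in the paper, to a one-dimensional quadratic potential. The data, step size, and batch size are assumptions chosen for illustration.

```python
import math
import random


def sgld_samples(data, step_size, num_steps, batch_size, seed=0):
    """Plain SGLD on F(x) = (1/n) * sum_i 0.5 * (x - a_i)^2.

    The target density is proportional to exp(-F(x)), i.e. a unit-variance
    Gaussian centered at the mean of `data`.
    """
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        # Minibatch estimate of the gradient F'(x) = x - mean(data).
        grad = sum(x - a for a in batch) / batch_size
        noise = rng.gauss(0.0, 1.0)
        x = x - step_size * grad + math.sqrt(2.0 * step_size) * noise
        samples.append(x)
    return samples


data = [-1.0, 0.0, 1.0, 2.0]  # hypothetical data with mean 0.5
samples = sgld_samples(data, step_size=0.05, num_steps=20000, batch_size=2)
kept = samples[2000:]  # discard an initial burn-in segment
sample_mean = sum(kept) / len(kept)
```

With these settings the empirical mean of the retained samples should land near 0.5. The variance-reduced variants replace `grad` with estimators that have lower variance than a fresh minibatch gradient.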
Multi-camera vehicle tracking is one of the most complicated tasks in Computer Vision, as it involves distinct tasks including Vehicle Detection, Tracking, and Re-identification. Despite the challenges, multi-camera vehicle tracking has immense potential in transportation applications, including speed, volume, origin-destination (O-D), and routing data generation. Several recent works have addressed the multi-camera tracking problem. However, most of the effort has gone towards improving accuracy on high-quality benchmark datasets while disregarding lower camera resolutions, compression artifacts, and the overwhelming amount of computational power and time needed to carry out this task at the edge, which makes it prohibitive for large-scale and real-time deployment. Therefore, in this work we shed light on practical issues that should be addressed in the design of a multi-camera tracking system to provide actionable and timely insights. Moreover, we propose a real-time city-scale multi-camera vehicle tracking system that compares favorably to computationally intensive alternatives and handles real-world, low-resolution CCTV instead of idealized and curated video streams. To show its effectiveness, in addition to integrating it into the Regional Integrated Transportation Information System (RITIS), we participated in the 2021 NVIDIA AI City multi-camera tracking challenge, where our method ranked among the top five performers on the public leaderboard.
This paper presents a novel circuit (AID) to improve the accuracy of an energy-efficient in-memory multiplier using a standard 6T-SRAM. State-of-the-art discharge-based in-SRAM multiplication accelerators suffer from non-linear behavior in their bit-lines (BL, BLB) due to the quadratic nature of the access transistor, which leads to a poor signal-to-noise ratio (SNR). In order to achieve linearity in the BLB voltage, we propose a novel root function technique on the access transistor's gate that results in an average SNR improvement of 10.77 dB compared to state-of-the-art discharge-based topologies. Our analytical methods and a circuit simulation in a 65 nm CMOS technology verify that the proposed technique consumes 0.523 pJ per computation (multiplication, accumulation, and preset) from a power supply of 1 V, which is 51.18% lower than other state-of-the-art techniques. We have performed an extensive Monte Carlo based simulation for a 4x4 multiplication operation, and our novel technique presents less than 0.086 standard deviations for the worst-case incorrect output scenario.
Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed both on HPC clusters and in the cloud. An important aspect of executing such graphs is the scheduling algorithm used. Many scheduling heuristics have been proposed in existing works; nevertheless, they are often tested in oversimplified environments. We provide an extensible simulation environment designed for prototyping and benchmarking task schedulers, which contains implementations of various scheduling algorithms and is open-sourced so that our results are fully reproducible. We use this environment to perform a comprehensive analysis of workflow scheduling algorithms, with a focus on quantifying the effect of scheduling challenges that have so far been mostly neglected, such as delays between scheduler invocations or partially unknown task durations. Our results indicate that network models used by many previous works might produce results that are off by an order of magnitude in comparison to a more realistic model. Additionally, we show that certain implementation details of scheduling algorithms, which are often neglected, can have a large effect on the scheduler's performance, and they should thus be described in great detail to enable proper evaluation.
Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that first uses a 3D FCN to roughly define a candidate region, which is then used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5% to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.
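The Dice score reported above measures the overlap between a predicted and a reference segmentation: $\mathrm{Dice} = 2|P \cap T| / (|P| + |T|)$. The sketch below computes it for one class, with each mask represented as a set of voxel coordinates. This representation and the example voxels are assumptions made for illustration and are not taken from the authors' code.

```python
def dice_score(predicted_voxels, reference_voxels):
    """Dice overlap between two binary masks given as collections of voxel coordinates."""
    pred = set(predicted_voxels)
    ref = set(reference_voxels)
    if not pred and not ref:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


# Hypothetical masks: three voxels each, two of them shared.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice_score(predicted, reference)
print(f"Dice: {score:.3f}")  # Dice: 0.667
```

A score of 1 means the masks coincide exactly and 0 means they do not overlap. The percentages quoted in the abstract correspond to this quantity multiplied by 100 and averaged over test scans.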