Massively parallel Fourier transforms are widely used in the computational sciences, and specifically in computational fluid dynamics, which involves unbounded Poisson problems. In practice, the latter is usually the most time-consuming operation due to its inescapable all-to-all communication pattern. The original flups library tackles that issue with an implementation of the distributed Fourier transform tailor-made for successive resolutions of unbounded Poisson problems. However, the proposed implementation lacks flexibility, as it only supports a cell-centered data layout and features a single, plain communication strategy. This work extends the library along two directions. First, the flups implementation is generalized to support a node-centered data layout. Second, three distinct approaches are provided to handle the communications: one all-to-all implementation and two non-blocking implementations relying on manual packing and MPI_Datatypes, respectively, to communicate over the network. The proposed software is validated against analytical solutions for unbounded, semi-unbounded, and periodic domains. The performance of the approaches is then compared against accFFT, another distributed FFT implementation, using a periodic case. Finally, the performance metrics of each implementation are analyzed and detailed on various top-tier European facilities, up to 49,152 cores. This work brings flups up to a fully production-ready and performant distributed FFT library, featuring all possible types of FFTs and flexibility in the data layout. The code is available under a BSD-3 license at github.com/vortexlab-uclouvain/flups.
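For readers unfamiliar with the contrast between the communication strategies mentioned above, the following sketch illustrates the two generic patterns (a single collective all-to-all versus non-blocking point-to-point exchanges with manually packed buffers) using mpi4py. It is an illustration only, not the flups implementation, and the buffer layout is an assumption.

```python
# Illustrative sketch (not flups code): the two communication patterns the
# abstract contrasts, expressed with mpi4py. Buffer sizes and layout are
# hypothetical.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
chunk = 4  # number of values exchanged with every other rank (assumption)

send = np.full(size * chunk, rank, dtype=np.double)
recv = np.empty(size * chunk, dtype=np.double)

# 1) Collective all-to-all: one call, library-managed exchange.
comm.Alltoall(send, recv)

# 2) Non-blocking point-to-point with manually packed slices: overlap is
#    possible because work can be done between posting and completing requests.
reqs = []
for peer in range(size):
    reqs.append(comm.Isend(send[peer*chunk:(peer+1)*chunk], dest=peer, tag=0))
    reqs.append(comm.Irecv(recv[peer*chunk:(peer+1)*chunk], source=peer, tag=0))
MPI.Request.Waitall(reqs)
```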
Automated hand gesture recognition has been a focus of the AI community for decades. Traditionally, work in this domain has revolved largely around scenarios that assume the availability of a stream of images of the user's hands. This has partly been due to the prevalence of camera-based devices and the wide availability of image data. However, there is growing demand for gesture recognition technology that can be implemented on low-power devices using limited sensor data instead of high-dimensional inputs like hand images. In this work, we demonstrate a hand gesture recognition system and method that uses signals from capacitive sensors embedded in the etee hand controller. The controller generates real-time signals from each of the wearer's five fingers. We use a machine learning technique to analyse the time-series signals and identify three features that can represent the five fingers within 500 ms. The analysis is composed of a two-stage training strategy, combining dimension reduction through principal component analysis with classification by k-nearest neighbours. Remarkably, we found that this combination showed a level of performance comparable to more advanced methods such as a supervised variational autoencoder. The base system can also be equipped with the capability to learn from occasional errors by providing it with an additional adaptive error-correction mechanism. The results show that the error corrector improves classification accuracy without degrading the performance of the base system. The system requires no more than 1 ms of computing time per input sample and has a smaller footprint than deep neural networks, demonstrating the feasibility of agile gesture recognition systems based on this technology.
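As an illustration of the two-stage strategy described above (dimension reduction by PCA followed by k-nearest-neighbour classification), a minimal scikit-learn sketch is shown below. The window length, number of components, and value of k are assumptions, and random placeholder data stands in for the capacitive signals.

```python
# Minimal sketch of a PCA + k-nearest-neighbour pipeline with scikit-learn.
# Data shapes and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# X: one row per 500 ms window of capacitive signals from the five fingers
# (placeholder data), y: gesture labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5 * 50))   # 5 fingers x 50 samples per window (assumption)
y = rng.integers(0, 3, size=1000)     # e.g. three gesture classes (assumption)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(PCA(n_components=3), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```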
Parallel I/O refers to the ability of scientific programs to concurrently read/write from/to a single file from multiple processes executing on distributed-memory platforms like compute clusters. In the HPC world, I/O is a significant bottleneck for many real-world scientific applications. In the last two decades, there has been significant research into improving the performance of I/O operations in scientific computing for traditional languages, including C, C++, and Fortran. As a result, several mature and high-performance libraries, including ROMIO (an implementation of MPI-IO), parallel HDF5, Parallel I/O (PIO), and parallel netCDF, are available today that provide efficient I/O for scientific applications. However, very little research has been done to evaluate and improve the I/O performance of Java-based HPC applications. The main hindrance to the development of efficient parallel I/O Java libraries is the lack of a standard API (something equivalent to MPI-IO). Some ad hoc solutions have been developed and used in proprietary applications, but there is no general-purpose solution that can be used by performance-hungry applications. As part of this project, we plan to develop a Java-based parallel I/O API inspired by the MPI-IO bindings (MPI 2.0 standard document) for C, C++, and Fortran. Once the Java equivalent of the MPI-IO API has been developed, we will develop a reference implementation on top of existing Java messaging libraries. Later, we will evaluate and compare the performance of our reference Java parallel I/O library with its C/C++ counterparts using benchmarks and real-world applications.
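To make the targeted programming model concrete, the sketch below shows the MPI-IO usage pattern such a Java API would mirror, expressed here with the existing mpi4py bindings (Python) rather than Java; the file name and data layout are hypothetical.

```python
# Sketch of the MPI-IO access pattern (collective write to a single shared
# file) that the proposed Java API would mirror, shown with mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.arange(4, dtype=np.int32) + 4 * rank   # each rank owns 4 values

fh = MPI.File.Open(comm, "output.bin", MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Collective write: every rank writes its block at its own byte offset,
# so all processes target one shared file concurrently.
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```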
In conventional backscatter communication (BackCom) systems, time division multiple access (TDMA) and frequency division multiple access (FDMA) are generally adopted for multiuser backscattering due to their simplicity of implementation. However, as the number of backscatter devices (BDs) proliferates, traditional centralized control techniques incur a high overhead, and inter-user coordination is unaffordable for the passive BDs; these issues have received little attention in existing works and remain unsolved. To this end, in this paper, we propose a slotted ALOHA-based random access scheme for BackCom systems, in which each BD is randomly chosen and is allowed to coexist with one active device for hybrid multiple access. To characterize and evaluate the performance, a resource allocation problem maximizing the minimum transmission rate is formulated, where transmit antenna selection, receive beamforming design, reflection coefficient adjustment, power control, and access probability determination are jointly considered. To deal with this intractable problem, we first transform the max-min objective function into an equivalent linear one, and then decompose the resulting problem into three sub-problems. Next, a block coordinate descent (BCD)-based greedy algorithm with a penalty function, successive convex approximation, and linear programming are designed to obtain sub-optimal solutions for tractable analysis. Simulation results demonstrate that the proposed algorithm outperforms benchmark algorithms in terms of transmission rate and fairness.
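The transformation of the max-min objective into a linear one referred to above is the standard epigraph reformulation, sketched here in generic notation (R_k for the rate of device k, x for the joint optimization variables, X for the feasible set; these are placeholders, not the paper's symbols):

```latex
% Epigraph reformulation: introduce an auxiliary variable t so that the
% objective becomes linear while the max-min structure moves into constraints.
\max_{\mathbf{x} \in \mathcal{X}} \; \min_{k} \, R_k(\mathbf{x})
\;\;\Longleftrightarrow\;\;
\max_{\mathbf{x} \in \mathcal{X},\, t} \; t
\quad \text{s.t.} \quad R_k(\mathbf{x}) \ge t \;\; \forall k .
```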
Emerging applications in the IoT domain require ultra-low-power and high-performance end-nodes to deal with complex near-sensor data analytics. Domains such as audio, radar, and structural health monitoring require many computations to be performed in the frequency domain rather than in the time domain. We present ECHOES, a System-on-a-Chip (SoC) composed of a RISC-V core enhanced with fixed- and floating-point digital signal processing (DSP) extensions and a Fast Fourier Transform (FFT) hardware accelerator targeting emerging frequency-domain applications. The proposed SoC features an autonomous I/O engine supporting a wide set of peripherals, including ultra-low-power radars, MEMS, and digital microphones over the I2S protocol with a full-duplex time-division-multiplexing DSP mode, making ECHOES the first open-source SoC offering this functionality and enabling simultaneous communication with up to 16 I/O devices. ECHOES, fabricated in 65nm CMOS technology, reaches a peak performance of 0.16 GFLOPS and a peak energy efficiency of 9.68 GFLOPS/W on a wide range of floating- and fixed-point general-purpose DSP kernels. The FFT accelerator achieves performance up to 10.16 GOPS with an efficiency of 199.8 GOPS/W, improving performance and efficiency by up to 41.1x and 11.2x, respectively, over the software implementation of this critical frequency-domain processing task.
Artificial intelligence (AI) is envisioned to play a key role in future wireless technologies, with deep neural networks (DNNs) enabling digital receivers to learn to operate in challenging communication scenarios. However, wireless receiver design poses unique challenges that fundamentally differ from those encountered in traditional deep learning domains. The main challenges arise from the limited power and computational resources of wireless devices, as well as from the dynamic nature of wireless communications, which causes continual changes to the data distribution. These challenges impair conventional AI based on highly parameterized DNNs, motivating the development of adaptive, flexible, and lightweight AI for wireless communications, which is the focus of this article. Here, we propose that the AI-based design of wireless receivers requires rethinking the three main pillars of AI: architecture, data, and training algorithms. In terms of architecture, we review how to design compact DNNs via model-based deep learning. Then, we discuss how to acquire training data for deep receivers without compromising spectral efficiency. Finally, we review efficient, reliable, and robust training algorithms via meta-learning and generalized Bayesian learning. Numerical results are presented to demonstrate the complementary effectiveness of each of the surveyed methods. We conclude by presenting opportunities for future research on the development of practical deep receivers.
Effective communication is crucial when deploying robots in mission-specific tasks, but inadequate or unreliable communication can greatly reduce mission efficacy, for example in search-and-rescue missions where communication-denied conditions may occur. In such missions, robots are deployed to locate targets, such as human survivors, but they might get trapped at hazardous locations, such as in a trapping pit or by debris. Thus, the information the robot has collected is lost owing to the lack of communication. In our prior work, we developed the notion of a path-based sensor. A path-based sensor detects whether or not an event has occurred along a particular path, but it does not provide the exact location of the event. Such path-based sensor observations are well suited to communication-denied environments, and various studies have explored methods to improve information gathering in such settings. In some missions it is typical for target elements to be in close proximity to hazardous factors that hinder the information-gathering process. In this study, we examine such a scenario and conduct experiments to determine whether additional knowledge about the correlation between hazards and targets improves the efficiency of information gathering. To incorporate this knowledge, we utilize a Bayesian network representation of domain knowledge and develop an algorithm based on this representation. Our empirical investigation reveals that such additional information on correlation is beneficial only in environments with moderate hazard lethality, suggesting that while knowledge of the correlation helps, further research and development is necessary for optimal outcomes.
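As a toy illustration of how a hazard-target correlation can be encoded as a small Bayesian network and used to update beliefs, consider the sketch below; all probabilities and variable names are invented for the example and it does not reproduce the paper's algorithm.

```python
# Toy example (not the paper's algorithm): encode a correlation between a
# hazard H and a target T, then apply Bayes' rule. All numbers are made up.
import numpy as np

p_h = np.array([0.7, 0.3])                # prior: P(H=0), P(H=1)
p_t_given_h = np.array([[0.9, 0.1],       # P(T | H=0): targets rare away from hazards
                        [0.4, 0.6]])      # P(T | H=1): targets likelier near hazards

# Joint P(H, T) and posterior P(H | T=1) by Bayes' rule.
joint = p_h[:, None] * p_t_given_h        # shape (H, T)
p_t = joint.sum(axis=0)
p_h_given_t1 = joint[:, 1] / p_t[1]
print("P(hazard present | target observed) =", p_h_given_t1[1])
```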
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports the versioning and documentation of audio datasets. It aims to provide a standardized and simple user interface to publish, maintain, and access the annotations and audio files of a dataset. To store the data efficiently on a server, audb automatically resolves dependencies between versions of a dataset and only uploads newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access. audb is a lightweight library and can be interfaced from any machine learning library. It supports the management of datasets on a single PC, within a university or company, or within a whole research community.
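A minimal usage sketch of the access interface is shown below; the dataset name and version are placeholders, and the exact keyword arguments should be checked against the audb documentation.

```python
# Minimal access sketch for audb; dataset name and version are placeholders.
import audb

# Load only the annotations (header and tables) of a published dataset version
# into the local cache, without downloading the media files.
db = audb.load("emodb", version="1.4.1", only_metadata=True)
print(db.tables)
```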
Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can pose a significant challenge in estimating correlation coefficients. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two common missing patterns: random and monotone. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data to plot the correlation matrix may lead to significantly misleading inferences about the relationships between features. We recommend using DPER, a direct parameter estimation approach, for plotting the correlation matrix, based on its performance in the experiments.
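The pitfall can be reproduced in a few lines of pandas; the sketch below (which does not show DPER itself) compares a pairwise-complete correlation estimate with one computed after mean imputation on synthetic data with values missing at random.

```python
# Small experiment illustrating the imputation pitfall: mean imputation shrinks
# the estimated correlation compared with using only the observed pairs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)      # true correlation = 0.8
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.4, "y"] = np.nan        # 40% of y missing at random

print("pairwise-complete:", df.corr().loc["x", "y"])                    # ~0.8
print("mean-imputed:     ", df.fillna(df.mean()).corr().loc["x", "y"])  # biased toward 0
```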
When estimating quantities and fields that are difficult to measure directly, such as the fluidity of ice, from point data sources, such as satellite altimetry, it is important to solve a numerical inverse problem that is formulated with Bayesian consistency. Otherwise, the resultant probability density function for the difficult-to-measure quantity or field will not be appropriately clustered around the truth. In particular, the inverse problem should be formulated by evaluating the numerical solution at the true point locations for direct comparison with the point data source. If the data are first fitted to a gridded or meshed field on the computational grid or mesh, and the inverse problem is formulated by comparing the numerical solution to the fitted field, the benefits of additional point data at densities finer than the grid will be lost. We demonstrate, with examples in the fields of groundwater hydrology and glaciology, that a consistent formulation can increase the accuracy of results and aid discourse between modellers and observationalists. To do this, we bring point data into the finite element method ecosystem as discontinuous fields on meshes of disconnected vertices. Point evaluation can then be formulated as a finite element interpolation operation (dual evaluation). This new abstraction is well suited to automation, including automatic differentiation. We demonstrate this through an implementation in Firedrake, which generates highly optimised code for solving PDEs with the finite element method. Our solution integrates with dolfin-adjoint/pyadjoint, allowing PDE-constrained optimisation problems, such as data assimilation, to be solved through forward- and adjoint-mode automatic differentiation.
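A hedged sketch of this abstraction, using Firedrake's vertex-only meshes, is given below; the field and point locations are arbitrary, and the exact interpolation idiom may vary between Firedrake versions.

```python
# Sketch: point evaluation as interpolation onto a mesh of disconnected
# vertices in Firedrake. Field and point locations are arbitrary examples.
from firedrake import (UnitSquareMesh, FunctionSpace, SpatialCoordinate,
                       Function, VertexOnlyMesh)

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
x, y = SpatialCoordinate(mesh)
u = Function(V).interpolate(x * y)          # stand-in for a PDE solution

# Point data locations become a mesh of disconnected vertices ...
points = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.3]]
vom = VertexOnlyMesh(mesh, points)

# ... and point evaluation becomes interpolation into a P0DG space on it.
P0DG = FunctionSpace(vom, "DG", 0)
u_at_points = Function(P0DG).interpolate(u)
print(u_at_points.dat.data_ro)
```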
Real-time traffic and sensor data from connected vehicles have the potential to provide insights that lead to the immediate benefit of efficient management of the transportation infrastructure and related adjacent services. However, the growth of electric vehicles (EVs) and connected vehicles (CVs) has generated an abundance of CV data and sensor data that has put a strain on the processing capabilities of existing data center infrastructure. As a result, the benefits are either delayed or not fully realized. To address this issue, we propose a solution for processing state-wide CV traffic and sensor data on GPUs that provides real-time micro-scale insights in both the temporal and spatial dimensions. This is achieved through the use of the Nvidia RAPIDS framework and the Dask parallel cluster in Python. Our findings demonstrate a 70x acceleration in the extraction, transformation, and loading (ETL) of CV data for the State of Missouri for a full day of all unique CV journeys, reducing the processing time from approximately 48 hours to just 25 minutes. Given that these results cover thousands of CVs and several thousand individual journeys with sub-second sensor data, this implies that we can model and obtain actionable insights for the management of the transportation infrastructure.
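A hedged sketch of the GPU ETL pattern (RAPIDS dask_cudf on a Dask CUDA cluster) is shown below; the file path and column names are hypothetical placeholders rather than the actual CV data schema.

```python
# Sketch of GPU-accelerated ETL with RAPIDS and Dask. Paths and column names
# ("journey_id", "speed") are hypothetical placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()          # one Dask worker per visible GPU
client = Client(cluster)

# Lazily read a day of CV sensor records spread over many files.
df = dask_cudf.read_parquet("cv_data/2023-01-01/*.parquet")

# Example micro-scale aggregation: mean speed per unique journey, computed
# entirely on the GPUs and gathered at the end.
summary = df.groupby("journey_id")["speed"].mean().compute()
print(summary.head())
```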