General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), and it often requires binary128 or higher precision arithmetic to solve problems involving SDP stably. However, only some processors support binary128 arithmetic, which makes SDP solvers generally slow. In this study, we focused on accelerating GEMM with binary128 arithmetic on field-programmable gate arrays (FPGAs) to enable the flexible design of accelerators for the desired computations. Our binary128 GEMM designs on a recent high-performance FPGA achieved approximately 90GFlops, 147x faster than the computation executed on a recent CPU with 20 threads for large matrices. Using our binary128 GEMM design on the FPGA, we successfully accelerated two numerical applications: LU decomposition and SDP problems, for the first time.
In recent years, communication engineers put strong emphasis on artificial neural network (ANN)-based algorithms with the aim of increasing the flexibility and autonomy of the system and its components. In this context, unsupervised training is of special interest as it enables adaptation without the overhead of transmitting pilot symbols. In this work, we present a novel ANN-based, unsupervised equalizer and its trainable field programmable gate array (FPGA) implementation. We demonstrate that our custom loss function allows the ANN to adapt for varying channel conditions, approaching the performance of a supervised baseline. Furthermore, as a first step towards a practical communication system, we design an efficient FPGA implementation of our proposed algorithm, which achieves a throughput in the order of Gbit/s, outperforming a high-performance GPU by a large margin.
This paper investigates the problem of efficient constrained global optimization of hybrid models that are a composition of a known white-box function and an expensive multi-output black-box function subject to noisy observations, which often arises in real-world science and engineering applications. We propose a novel method, Constrained Upper Quantile Bound (CUQB), to solve such problems that directly exploits the composite structure of the objective and constraint functions that we show leads substantially improved sampling efficiency. CUQB is a conceptually simple, deterministic approach that avoid constraint approximations used by previous methods. Although the CUQB acquisition function is not available in closed form, we propose a novel differentiable sample average approximation that enables it to be efficiently maximized. We further derive bounds on the cumulative regret and constraint violation under a non-parametric Bayesian representation of the black-box function. Since these bounds depend sublinearly on the number of iterations under some regularity assumptions, we establis bounds on the convergence rate to the optimal solution of the original constrained problem. In contrast to most existing methods, CUQB further incorporates a simple infeasibility detection scheme, which we prove triggers in a finite number of iterations when the original problem is infeasible (with high probability given the Bayesian model). Numerical experiments on several test problems, including environmental model calibration and real-time optimization of a reactor system, show that CUQB significantly outperforms traditional Bayesian optimization in both constrained and unconstrained cases. Furthermore, compared to other state-of-the-art methods that exploit composite structure, CUQB achieves competitive empirical performance while also providing substantially improved theoretical guarantees.
Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data. Because of that, these patterns are effectively rest in thin air: their application scope is rather limited, they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
Quantum computer simulators are crucial for the development of quantum computing. In this work, we investigate the suitability and performance impact of GPU and multi-GPU systems on a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GPUs. We provide a benchmark suite of representative quantum applications for characterization. For simulations with a large number of qubits, the two GPU backends can provide up to 14x speedup over the CPU backend, with Nvidia cuQuantum providing further 1.5-3x speedup over the default Thrust backend. Our evaluation on a single GPU identifies the most important functions in Nvidia Thrust and cuQuantum for different quantum applications and their compute and memory bottlenecks. We also evaluate the gate fusion and cache-blocking optimizations on different quantum applications. Finally, we evaluate large-number qubit quantum applications on multi-GPU and identify data movement between host and GPU as the limiting factor for the performance.
Next-generation internet-of-things (IoT) networks require extremely low latency, complexity, and collision probability. We introduce the novel partial-information multiple access (PIMA) scheme, a semi-grant-free (GF) coordinated random access (RA) protocol for short packet transmission, with the aim of reducing the latency and packet loss of traditional multiple access schemes, as well as more recent preamble-based schemes. With PIMA, the base station (BS) acquires partial information on instantaneous traffic conditions in the partial information acquisition (PIA) sub-frame, estimating the number of active devices, i.e., having packets waiting for transmission in their queue. Based on this estimate, the BS chooses both the total number of slots to be allocated in the data transmission (DT) sub-frame and the respective user-to-slot assignment. Although collisions may still occur due to multiple users assigned to the same slot, they are drastically reduced with respect to the slotted ALOHA (SALOHA) scheme, while achieving lower latency than both time-division multiple-access (TDMA) and preamble-based protocols, due to the extremely reduced overhead of the PIA sub-frame. Finally, we analyze and assess the performance of PIMA under various activation statistics, proving the robustness of the proposed solution to the intensity of traffic, also with burst traffic.
Lattice-based cryptographic algorithms built on ring learning with error theory are gaining importance due to their potential for providing post-quantum security. However, these algorithms involve complex polynomial operations, such as polynomial modular multiplication (PMM), which is the most time-consuming part of these algorithms. Accelerating PMM is crucial to make lattice-based cryptographic algorithms widely adopted by more applications. This work introduces a novel high-throughput and compact PMM accelerator, X-Poly, based on the crossbar (XB)-type compute-in-memory (CIM). We identify the most appropriate PMM algorithm for XB-CIM. We then propose a novel bit-mapping technique to reduce the area and energy of the XB-CIM fabric, and conduct processing engine (PE)-level optimization to increase memory utilization and support different problem sizes with a fixed number of XB arrays. X-Poly design achieves 3.1X10^6 PMM operations/s throughput and offers 200X latency improvement compared to the CPU-based implementation. It also achieves 3.9X throughput per area improvement compared with the state-of-the-art CIM accelerators.
The utilization of finite field multipliers is pervasive in contemporary digital systems, with hardware implementation for bit parallel operation often necessitating millions of logic gates. However, various digital design issues, whether inherent or stemming from soft errors, can result in gate malfunction, ultimately can cause gates to malfunction, which in turn results in incorrect multiplier outputs. Thus, to prevent susceptibility to error, it is imperative to employ a reliable finite field multiplier implementation that boasts a robust fault detection capability. In order to achieve the best fault detection performance for finite field detection performance for finite field multipliers while maintaining a low-complexity implementation, this study proposes a novel fault detection scheme for a recent bit-parallel polynomial basis over GF(2m). The primary concept behind the proposed approach is centered on the implementation of an efficient BCH decoder that utilize Berlekamp-Rumsey-Solomon (BRS) algorithm and Chien-search method to effectively locate errors with minimal delay. The results of our synthesis indicate that our proposed error detection and correction architecture for a 45-bit multiplier with 5-bit errors achieves a 37% and 49% reduction in critical path delay compared to existing designs. Furthermore, a 45-bit multiplicand with five errors has hardware complexity that is only 80%, which is significantly less complex than the most advanced BCH-based fault recognition techniques, such as TMR, Hamming's single error correction, and LDPC-based methods for finite field multiplication which is desirable in constrained applications, such as smart cards, IoT devices, and implantable medical devices.
While anatomy learning is an essential part of medical education, there remain significant challenges in traditional learning methods, In this paper, we introduce two in-house anatomy training solutions that can visualize and superimpose 3D virtual anatomy models with informative labels using a hand-held tablet or a wide-screen AR. To investigate the feasibility and effectiveness of the proposed tablet-based 3D visualization and AR tools, we conducted a large-scale study with 236 students enrolled in undergraduate premedical programs (95 M, 141F in 118 dyadic teams). In this study, participant students were split into three groups to use one of the following learning tools in a team-based anatomy painting activity: (1) conventional textbook, (2) hand-held tablet-based 3D visualization, and (3) screen-based AR. The results showed that students who used the tablet-based visualization tool or the AR learning tool reported significantly higher (more positive) learning experience scores than those who used a textbook. Though we did not observe a significant difference in knowledge retention among the three learning tools, our further analysis of gender effects revealed that male participants generally reported more positive learning experience scores than female participants. Also, the overall experience of mixed-gender dyads was reported to be significantly lower than others in most of the learning experience and performance measures. While discussing the implications of our results in the context of anatomy and medical education, we highlight the potential of our learning tools with additional considerations related to gender and team dynamics in body painting anatomy learning interventions.
The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
The use of vehicle-to-everything (V2X) communication is expected to significantly improve road safety and traffic management. We present an efficient protocol, called the AEE protocol, for protecting data authenticity and user privacy in V2X applications. Our protocol provides event-based likability, which enables messages from a subject vehicle to be linked to a specific event in order to prevent Sybil attacks. Messages on different events are unlinkable to preserve the long-term privacy of vehicles. Moreover, our protocol introduces a new method for generating temporary public keys to reduce computing and transmission overheads. Such a temporary public key is bound with a certain event and is automatically revoked when the event is over. We describe how to apply our protocol in vehicular communications using two exemplar use cases. To further reduce the real-time computational complexity, our protocol enables us to decompose the cryptographic operations into offline processes for complex operations and real-time processes for fast computations.