Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV.Serpens features (1) a general-purpose design, (2) memory-centric processing engines, and (3) index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is 1.91x and 1.76x better in terms of geomean throughput than the latest accelerators GraphLiLy and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves 2.10x higher throughput than a K80 GPU. For the energy efficiency, Serpens is 1.71x, 1.90x, and 42.7x better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 30,204MTEPS and up to 3.79x over GraphLily.
This paper presents two novel hybrid beamforming (HYBF) designs for a multi-cell massive multiple-input-multiple-output (mMIMO) millimeter wave (mmWave) full duplex (FD) system under limited dynamic range (LDR). Firstly, we present a novel centralized HYBF (C-HYBF) scheme based on alternating optimization. In general, the complexity of C-HYBF schemes scales quadratically as a function of the number of users and cells, which may limit their scalability. Moreover, they require significant communication overhead to transfer complete channel state information (CSI) to the central node every channel coherence time for optimization. The central node also requires very high computational power to jointly optimize many variables for the uplink (UL) and downlink (DL) users in FD systems. To overcome these drawbacks, we propose a very low-complexity and scalable cooperative per-link parallel and distributed (P$\&$D)-HYBF scheme. It allows each mmWave FD base station (BS) to update the beamformers for its users in a distributed fashion and independently in parallel on different computational processors. The complexity of P$\&$D-HYBF scales only linearly as the network size grows, making it desirable for the next generation of large and dense mmWave FD networks. Simulation results show that both designs significantly outperform the fully digital half duplex (HD) system with only a few radio-frequency (RF) chains and achieve similar performance.
Energy efficiency (EE) plays a key role in future wireless communication network and it is easily to achieve high EE performance in low SNR regime. In this paper, a new high EE scheme is proposed for a MIMO wireless communication system working in the low SNR regime by using two dimension resource allocation. First, we define the high EE area based on the relationship between the transmission power and the SNR. To meet the constraint of the high EE area, both frequency and space dimension are needed. Besides analysing them separately, we decided to consider frequency and space dimensions as a unit and proposed a two-dimension scheme. Furthermore, considering communication in the high EE area may cause decline of the communication quality, we add quality-of-service(QoS) constraint into the consideration and derive the corresponding EE performance based on the effective capacity. We also derive an approximate expression to simplify the complex EE performance. Finally, our numerical results demonstrate the effectiveness of the proposed scheme.
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the widely-used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strive a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to state-of-the-art CPU and GPU systems to study the performance and energy efficiency of various devices. SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, and a wide range of data types. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate SpMV on real PIM systems.
FPGA-based accelerators are becoming more popular for deep neural network due to the ability to scale performance with increasing degree of specialization with dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide higher abstraction compared to register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability and more flexibility in code exploration, when evaluating options for multi-dimension tensors, convolutional layers or parallelism. Thus, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend RTL library for FINN. We investigate and evaluate, across a spectrum of design dimensions, an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits. For larger circuits, however, the look-up table (LUT) count of RTL-based design is slightly higher, up to around $15\%$. On the other hand, HLS consistently requires more flip-flops (FFs) (orders-of-magnitude increase) and block RAMs (BRAMs) ($2\times$ more). This also impacts the critical path delay, with RTL producing significantly faster circuits, up to $80\%$. Furthermore, RTL also benefits from at-least a $10\times$ reduction in synthesis time. Finally the results were practically validated using a real-world use case of a multi-layer perceptron (MLP) network used in network intrusion detection. Overall, since HLS frameworks code-generate the hardware design, the benefits of the ease in the design entry is less important as compared to synthesis time reduction togther with resource benefits, this might make the RTL abstraction an attractive alternative.
The US Department of Energy launched the Exascale Computing Project (ECP) in 2016 as part of a coordinated effort to achieve the next generation of high-performance computing (HPC) and to accelerate scientific discovery. The Exascale Proxy Applications Project began within the ECP to: (1) improve the quality of proxies created by the ECP (2) provide small, simplified codes which share important features of large applications and (3) capture programming methods and styles that drive requirements for compilers and other elements of the tool chain. This article describes one Proxy Application (or "proxy app") suite called IMEXLBM which is an open-source, self-contained code unit, with minimal dependencies, that is capable of running on heterogeneous platforms like those with graphic processing units (GPU) for accelerating the calculation. In particular, we demonstrate functionality by solving a benchmark problem in computational fluid dynamics (CFD) on the ThetaGPU machine at the Argonne Leadership Computing Facility (ALCF). Our method makes use of a domain-decomposition technique in conjunction with the message-passing interface (MPI) standard for distributed memory systems. The OpenMP application programming interface (API) is employed for shared-memory multi-processing and offloading critical kernels to the device (i.e. GPU). We also verify our effort by comparing data generated via CPU-only calculations with data generated with CPU+GPU calculations. While we demonstrate efficacy for single-phase fluid problems, the code-unit is designed to be versatile and enable new physical models that can capture complex phenomena such as two-phase flow with interface capture.
Kernel matrices, which arise from discretizing a kernel function $k(x,x')$, have a variety of applications in mathematics and engineering. Classically, the celebrated fast multipole method was designed to perform matrix multiplication on kernel matrices of dimension $N$ in time almost linear in $N$ by using techniques later generalized into the linear algebraic framework of hierarchical matrices. In light of this success, we propose a quantum algorithm for efficiently performing matrix operations on hierarchical matrices by implementing a quantum block-encoding of the hierarchical matrix structure. When applied to many kernel matrices, our quantum algorithm can solve quantum linear systems of dimension $N$ in time $O(\kappa \operatorname{polylog}(\frac{N}{\varepsilon}))$, where $\kappa$ and $\varepsilon$ are the condition number and error bound of the matrix operation. This runtime is exponentially faster than any existing quantum algorithms for implementing dense kernel matrices. Finally, we discuss possible applications of our methodology in solving integral equations or accelerating computations in N-body problems.
In 2018 Bornmann and Haunschild (2018a) introduced a new indicator called the Mantel-Haenszel quotient (MHq) to measure alternative metrics (or altmetrics) of scientometric data. In this article we review the Mantel-Haenszel statistics, point out two errors in the literature, and introduce a new indicator. First, we correct the interpretation of MHq and mention that it is still a meaningful indicator. Second, we correct the variance formula for MHq, which leads to narrower confidence intervals. A simulation study shows the superior performance of our variance estimator and confidence intervals. Since MHq does not match its original description in the literature, we propose a new indicator, the Mantel-Haenszel row risk ratio (MHRR), to meet that need. Interpretation and statistical inference for MHRR are discussed. For both MHRR and MHq, a value greater (less) than one means performance is better (worse) than in the reference set called the world.
The acquisition of channel state information (CSI) in Frequency Division Duplex (FDD) massive MIMO has been a formidable challenge. In this paper, we address this problem with a novel CSI feedback framework enabled by the partial reciprocity of uplink and downlink channels in the wideband regime. We first derive the closed-form expression of the rank of the wideband massive MIMO channel covariance matrix for a given angle-delay distribution. A low-rankness property is identified, which generalizes the well-known result of the narrow-band uniform linear array setting. Then we propose a partial channel reciprocity (PCR) codebook, inspired by the low-rankness behavior and the fact that the uplink and downlink channels have similar angle-delay distributions. Compared to the latest codebook in 5G, the proposed PCR codebook scheme achieves higher performance, lower complexity at the user side, and requires a smaller amount of feedback. We derive the feedback overhead necessary to achieve asymptotically error-free CSI feedback. Two low-complexity alternatives are also proposed to further reduce the complexity at the base station side. Simulations with the practical 3GPP channel model show the significant gains over the latest 5G codebook, which prove that our proposed methods are practical solutions for 5G and beyond.
Recently, Information Retrieval community has witnessed fast-paced advances in Dense Retrieval (DR), which performs first-stage retrieval by encoding documents in a low-dimensional embedding space and querying them with embedding-based search. Despite the impressive ranking performance, previous studies usually adopt brute-force search to acquire candidates, which is prohibitive in practical Web search scenarios due to its tremendous memory usage and time cost. To overcome these problems, vector compression methods, a branch of Approximate Nearest Neighbor Search (ANNS), have been adopted in many practical embedding-based retrieval applications. One of the most popular methods is Product Quantization (PQ). However, although existing vector compression methods including PQ can help improve the efficiency of DR, they incur severely decayed retrieval performance due to the separation between encoding and compression. To tackle this problem, we present JPQ, which stands for Joint optimization of query encoding and Product Quantization. It trains the query encoder and PQ index jointly in an end-to-end manner based on three optimization strategies, namely ranking-oriented loss, PQ centroid optimization, and end-to-end negative sampling. We evaluate JPQ on two publicly available retrieval benchmarks. Experimental results show that JPQ significantly outperforms existing popular vector compression methods in terms of different trade-off settings. Compared with previous DR models that use brute-force search, JPQ almost matches the best retrieval performance with 30x compression on index size. The compressed index further brings 10x speedup on CPU and 2x speedup on GPU in query latency.
This paper studies the single image super-resolution problem using adder neural networks (AdderNet). Compared with convolutional neural networks, AdderNet utilizing additions to calculate the output features thus avoid massive energy consumptions of conventional multiplications. However, it is very hard to directly inherit the existing success of AdderNet on large-scale image classification to the image super-resolution task due to the different calculation paradigm. Specifically, the adder operation cannot easily learn the identity mapping, which is essential for image processing tasks. In addition, the functionality of high-pass filters cannot be ensured by AdderNet. To this end, we thoroughly analyze the relationship between an adder operation and the identity mapping and insert shortcuts to enhance the performance of SR models using adder networks. Then, we develop a learnable power activation for adjusting the feature distribution and refining details. Experiments conducted on several benchmark models and datasets demonstrate that, our image super-resolution models using AdderNet can achieve comparable performance and visual quality to that of their CNN baselines with an about 2$\times$ reduction on the energy consumption.