The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
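For readers unfamiliar with the two operations named above, the following is a minimal reference sketch of their semantics in plain Python. It is not Magicube's API and does not model Tensor cores or low-precision packing; the CSR layout (`indptr`, `indices`, `data`) and the function names `spmm` and `sddmm` are assumptions chosen for illustration.

```python
# Reference semantics only: plain Python over a CSR layout (assumed here).

def spmm(indptr, indices, data, dense):
    """SpMM: multiply a CSR sparse matrix by a dense matrix (list of rows)."""
    n_cols = len(dense[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0] * n_cols
        # Only the stored nonzeros of this sparse row contribute.
        for p in range(indptr[row], indptr[row + 1]):
            k, a = indices[p], data[p]
            for j in range(n_cols):
                acc[j] += a * dense[k][j]
        out.append(acc)
    return out


def sddmm(indptr, indices, lhs, rhs):
    """SDDMM: evaluate (lhs @ rhs) only at positions stored in a CSR mask.

    Returns the sampled values in the same order as `indices`.
    """
    inner = len(rhs)
    values = []
    for row in range(len(indptr) - 1):
        for p in range(indptr[row], indptr[row + 1]):
            col = indices[p]
            values.append(sum(lhs[row][k] * rhs[k][col] for k in range(inner)))
    return values


# Sparse 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1, 2, 3]
spmm_result = spmm(indptr, indices, data, [[1, 2], [3, 4], [5, 6]])
sddmm_result = sddmm(indptr, indices, [[1, 2], [3, 4]], [[1, 0, 2], [0, 1, 3]])
```

In SpMM the sparsity lives in an input operand, whereas in SDDMM it lives in the output pattern, which is why the two kernels typically need different data layouts.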
The effect of higher-order continuity in the solution field by using NURBS basis functions in isogeometric analysis (IGA) is investigated for an efficient mixed finite element formulation for elastostatic beams. It is based on the Hu-Washizu variational principle considering geometrical and material nonlinearities. Here we present a reduced degree of basis functions for the additional fields of the stress resultants and strains of the beam, which are allowed to be discontinuous across elements. This approach turns out to significantly improve the computational efficiency and the accuracy of the results. We consider a beam formulation with extensible directors, where cross-sectional strains are enriched to avoid Poisson locking by an enhanced assumed strain method. In numerical examples, we show the superior per degree-of-freedom accuracy of IGA over conventional finite element analysis, due to the higher-order continuity in the displacement field. We further verify the efficient rotational coupling between beams, as well as the path-independence of the results.
Neural Network designs are quite diverse, from VGG-style to ResNet-style, and from Convolutional Neural Networks to Transformers. Towards the design of efficient accelerators, many works have adopted a dataflow-based, inter-layer pipelined architecture with customised hardware for each layer, achieving ultra-high throughput and low latency. The deployment of neural networks to such dataflow architecture accelerators is usually hindered by the available on-chip memory, as it is desirable to preload the weights of neural networks on-chip to maximise the system performance. To address this, networks are usually compressed before deployment through methods such as pruning, quantization and tensor decomposition. In this paper, a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD is proposed. The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP to state-of-the-art CNNs. Our work is open-sourced: //github.com/Yu-Zhewen/Mixed-TD
We present an implicit-explicit finite volume scheme for two-fluid single-temperature flow in all Mach number regimes which is based on a symmetric hyperbolic thermodynamically compatible description of the fluid flow. The scheme is stable for large time steps controlled by the interface transport and is computationally efficient due to its linearly implicit character. The latter is achieved by linearizing along constant reference states given by the asymptotic analysis of the single-temperature model. Thus, the use of a stiffly accurate IMEX Runge-Kutta time integration and the centered treatment of pressure-based quantities provably guarantee the asymptotic preserving property of the scheme for weakly compressible Euler equations with variable volume fraction. The properties of the first- and second-order schemes are validated by several numerical test cases.
Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMU). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication by using lower-precision computing units, and show the advantages and disadvantages of using IMMU. The experiment using integer Tensor Cores shows that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by up to 4.33x while maintaining the FP64 accuracy.
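The core idea of building a wide product out of narrow integer products can be illustrated with a simplified fixed-point sketch. This is not the Ozaki scheme as implemented in the paper: the real scheme handles floating-point inputs with exponent-aware splitting, which is omitted here. The sketch assumes integer inputs, 7-bit signed slices (so each slice fits in int8), and an int32 accumulator; the names `split_slices`, `int_matmul`, and `sliced_matmul` are hypothetical.

```python
def split_slices(matrix, bits, num_slices):
    """Split each integer into signed base-2**bits digits, least significant first."""
    mask = (1 << bits) - 1
    limit = 1 << (bits * num_slices)
    slices = []
    for s in range(num_slices):
        layer = []
        for row in matrix:
            assert all(abs(x) < limit for x in row), "value needs more slices"
            # The sign is carried by every digit, so x == sum(digit_s << (bits * s)).
            layer.append([(1 if x >= 0 else -1) * ((abs(x) >> (bits * s)) & mask)
                          for x in row])
        slices.append(layer)
    return slices


def int_matmul(a, b):
    """Narrow integer product; stands in for an int8-input, int32-accumulate unit."""
    inner = len(b)
    out = [[sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(len(b[0]))]
           for i in range(len(a))]
    assert all(abs(v) < 2**31 for row in out for v in row), "int32 overflow"
    return out


def sliced_matmul(a, b, bits=7, num_slices=4):
    """Recombine all pairwise slice products with the matching power-of-two shift."""
    a_slices = split_slices(a, bits, num_slices)
    b_slices = split_slices(b, bits, num_slices)
    rows, cols = len(a), len(b[0])
    total = [[0] * cols for _ in range(rows)]
    for s, a_s in enumerate(a_slices):
        for t, b_t in enumerate(b_slices):
            partial = int_matmul(a_s, b_t)
            for i in range(rows):
                for j in range(cols):
                    total[i][j] += partial[i][j] << (bits * (s + t))
    return total


a = [[123456, -7890], [42, 1000000]]
b = [[-31415, 27182], [16180, -99999]]
product = sliced_matmul(a, b)
```

Each slice product is exact because its operands are small, so the only cost of the wider result is the number of slice pairs, here `num_slices ** 2` narrow multiplications.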
Matrix diagonalization is a cornerstone of numerous fields of scientific computing. Diagonalizing a matrix to solve an eigenvalue problem requires a sequential path of iterations that eventually reaches a sufficiently converged and accurate solution for all the eigenvalues and eigenvectors. This typically translates into a high computational cost. Here we demonstrate how reinforcement learning, using the AlphaZero framework, can accelerate Jacobi matrix diagonalizations by viewing the selection of the fastest path to solution as a board game. To demonstrate the viability of our approach we apply the Jacobi diagonalization algorithm to symmetric Hamiltonian matrices that appear in quantum chemistry calculations. We find that a significant acceleration can often be achieved. Our findings highlight the opportunity to use machine learning as a promising tool to improve the performance of numerical linear algebra.
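As background, the classical Jacobi eigenvalue iteration for a real symmetric matrix can be sketched as below. The pivot rule used here (always rotate away the largest off-diagonal entry) is the textbook greedy choice, not the learned AlphaZero policy from the abstract; the pivot choice is the step such a policy would replace. The function name, tolerance, and rotation cap are assumptions, and the input is assumed to be at least 2x2.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_rotations=10_000):
    """Return (sorted eigenvalues, rotations used) for a real symmetric matrix."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    rotations = 0
    while rotations < max_rotations:
        # Greedy pivot: the off-diagonal entry of largest magnitude.
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < tol:
            break
        # Angle chosen so that the rotated entry a[p][q] becomes zero.
        theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        # Apply A <- A J (update columns p and q).
        for k in range(n):
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp - s * akq
            a[k][q] = s * akp + c * akq
        # Apply A <- J^T A (update rows p and q).
        for k in range(n):
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
        rotations += 1
    return sorted(a[i][i] for i in range(n)), rotations


eigenvalues, used = jacobi_eigenvalues([[4, 1, 0], [1, 3, 1], [0, 1, 2]])
```

The returned rotation count is the quantity a pivot-selection strategy would try to minimize for a given tolerance.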
Betweenness centrality is essential in complex network analysis; it characterizes the importance of nodes and edges in networks. Computing exact betweenness centrality faster in large networks is a crucial problem that urgently needs to be solved. We propose a novel algorithm for betweenness centrality based on the parallel computing of adjacency matrices, which is faster than the existing algorithms for large networks. The time complexity of the algorithm is only related to the number of nodes in the network, not the number of edges. Experimental evidence shows that the algorithm is effective and efficient. This novel algorithm is faster than Brandes' algorithm on small and dense networks and offers excellent solutions for betweenness centrality index computing on large-scale complex networks.
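For context, the Brandes baseline mentioned above can be sketched for unweighted, undirected graphs as follows. This is the reference algorithm being compared against, not the adjacency-matrix method proposed in the abstract. The adjacency-dictionary input format and the function name are assumptions made for this illustration.

```python
from collections import deque


def brandes_betweenness(adj):
    """Exact node betweenness for an unweighted, undirected graph.

    `adj` maps each node to its neighbours, with every edge listed in both
    directions. Scores are unnormalized.
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2.0 for v, score in centrality.items()}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = brandes_betweenness(path)
```

This baseline runs one breadth-first search per node, so its cost grows with the number of edges, which is the dependence the abstract's algorithm claims to avoid.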
We present a randomized, inverse-free algorithm for producing an approximate diagonalization of any $n \times n$ matrix pencil $(A,B)$. The bulk of the algorithm rests on a randomized divide-and-conquer eigensolver for the generalized eigenvalue problem originally proposed by Ballard, Demmel, and Dumitriu [Technical Report 2010]. We demonstrate that this divide-and-conquer approach can be formulated to succeed with high probability as long as the input pencil is sufficiently well-behaved, which is accomplished by generalizing the recent pseudospectral shattering work of Banks, Garza-Vargas, Kulkarni, and Srivastava [Foundations of Computational Mathematics 2022]. In particular, we show that perturbing and scaling $(A,B)$ regularizes its pseudospectra, allowing divide-and-conquer to run over a simple random grid and in turn producing an accurate diagonalization of $(A,B)$ in the backward error sense. The main result of the paper states the existence of a randomized algorithm that with high probability (and in exact arithmetic) produces invertible $S,T$ and diagonal $D$ such that $||A - SDT^{-1}||_2 \leq \varepsilon$ and $||B - SIT^{-1}||_2 \leq \varepsilon$ in at most $O \left( \log(n) \log^2 \left( \frac{n}{\varepsilon} \right) T_{\text{MM}}(n) \right)$ operations, where $T_{\text{MM}}(n)$ is the asymptotic complexity of matrix multiplication. This not only provides a new set of guarantees for highly parallel generalized eigenvalue solvers but also establishes nearly matrix multiplication time as an upper bound on the complexity of exact arithmetic matrix pencil diagonalization.
Partitioning a polygonal mesh into meaningful parts can be challenging. Many applications require decomposing such structures for further processing in computer graphics. In the last decade, several methods were proposed to tackle this problem, at the cost of intensive computational times. Recently, machine learning has proven to be effective for the segmentation task on 3D structures. Nevertheless, these state-of-the-art methods often generalize poorly and require dividing the learned model into several specific classes of objects to avoid overfitting. We present a data-driven approach leveraging deep learning to encode a mapping function prior to mesh segmentation for multiple applications. Our network reproduces a neighborhood map using our knowledge of the Shape Diameter Function (SDF) method using similarities among vertex neighborhoods. Our approach is resolution-agnostic as we downsample the input meshes and query the full-resolution structure solely for neighborhood contributions. Using our predicted SDF values, we can inject the resulting structure into a graph-cut algorithm to generate an efficient and robust mesh segmentation while considerably reducing the required computation times.
Graph Convolutional Network (GCN) has achieved extraordinary success in learning effective task-specific representations of nodes in graphs. However, regarding Heterogeneous Information Network (HIN), existing HIN-oriented GCN methods still suffer from two deficiencies: (1) they cannot flexibly explore all possible meta-paths and extract the most useful ones for a target object, which hinders both effectiveness and interpretability; (2) they often need to generate intermediate meta-path based dense graphs, which leads to high computational complexity. To address the above issues, we propose an interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn the representations of objects in HINs. It is designed as a hierarchical aggregation architecture, i.e., object-level aggregation first, followed by type-level aggregation. The novel architecture can automatically extract useful meta-paths for each object from all possible meta-paths (within a length limit), which brings good model interpretability. It can also reduce the computational cost by avoiding intermediate HIN transformation and neighborhood attention. We provide theoretical analysis about the proposed ie-HGCN in terms of evaluating the usefulness of all possible meta-paths, its connection to the spectral graph convolution on HINs, and its quasi-linear time complexity. Extensive experiments on three real network datasets demonstrate the superiority of ie-HGCN over the state-of-the-art methods.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
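To make the quantization question concrete, here is a minimal sketch of symmetric uniform quantization, one of the simplest schemes in this family. It is an illustration only, not a specific method from the survey; the function names, the per-tensor scale, and the default of 4 bits are assumptions.

```python
def quantize(values, bits=4):
    """Map floats to signed `bits`-bit integers using one shared scale."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude onto the largest integer level.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.62, 0.0, 0.31, 1.0]
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)
```

With a shared scale, the reconstruction error of each in-range value is bounded by half the scale, so fewer bits trade memory for a coarser grid.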