
Nowadays, memory-hungry applications such as machine learning algorithms are running up against the "memory wall". Emerging memories featuring computational capacity are foreseen as a promising solution: they perform data processing inside the memory itself, so-called computation-in-memory, eliminating the need for costly data movement. Recent research shows that using a custom extension of the RISC-V instruction set architecture to support computation-in-memory operations is effective. To evaluate the applicability of such methods further, this work enhances the standard GNU binary utilities to generate RISC-V executables with Logic-in-Memory (LiM) operations and develops a new gem5 simulation environment, which simulates the entire system (CPU, peripherals, etc.) in a cycle-accurate manner together with a user-defined LiM module integrated into the system. This work provides a modular testbed for the research community to evaluate potential LiM solutions and hardware-software co-designs.
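
As a concrete illustration of how a toolchain can emit such instructions, the sketch below encodes a hypothetical R-type LiM operation through the GNU assembler's `.insn` directive. The opcode (0x0b, the custom-0 space), the funct3/funct7 values, and the `lim_and` semantics are assumptions for illustration only; the actual encodings are defined by the modified binutils described in the abstract, and the snippet requires a RISC-V cross-compiler.

```cpp
// Hypothetical sketch: emitting a custom R-type "LiM" instruction with the
// GNU assembler's .insn directive (requires a riscv64 cross-compiler, e.g.
// riscv64-linux-gnu-g++). Opcode 0x0b (the custom-0 space), funct3 = 0x0,
// and funct7 = 0x00 are illustrative assumptions, not the paper's encodings.
#include <cstdint>
#include <cstdio>

// Notionally computes rd = rs1 AND rs2 inside the memory array (LiM) rather
// than in the core's ALU; on a plain core this opcode would simply trap.
static inline uint64_t lim_and(uint64_t a, uint64_t b) {
    uint64_t r;
    asm volatile(".insn r 0x0b, 0x0, 0x00, %0, %1, %2"
                 : "=r"(r)
                 : "r"(a), "r"(b));
    return r;
}

int main() {
    std::printf("%llx\n",
                static_cast<unsigned long long>(lim_and(0xf0f0, 0x0ff0)));
    return 0;
}
```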

Related content

To reduce the complexity of the hardware implementation of neural network-based optical channel equalizers, we demonstrate that the performance of the biLSTM equalizer with approximated activation functions is close to that of the original model.
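
To make the idea of activation-function approximation concrete, here is a minimal sketch that replaces tanh (used inside LSTM gates) with a piecewise-linear "hard tanh" and measures the approximation gap. The three-segment clamp is an illustrative choice, not the approximation used in the paper.

```cpp
// Hardware-friendly activation approximation: replace tanh with a
// three-segment piecewise-linear "hard tanh" that needs only comparisons,
// no exp(). The breakpoints here are illustrative, not the paper's choice.
#include <algorithm>
#include <cmath>
#include <cstdio>

double hard_tanh(double x) {
    // Clamp to [-1, 1]; identity in between.
    return std::max(-1.0, std::min(1.0, x));
}

int main() {
    double max_err = 0.0;
    for (double x = -4.0; x <= 4.0; x += 0.01)
        max_err = std::max(max_err, std::fabs(std::tanh(x) - hard_tanh(x)));
    std::printf("max |tanh - hard_tanh| on [-4, 4]: %.4f\n", max_err);
    return 0;
}
```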

Signal Temporal Logic (STL) is capable of expressing a broad range of temporal properties that controlled dynamical systems must satisfy. In the literature, both mixed-integer programming (MIP) and nonlinear programming (NLP) methods have been applied to solve optimal control problems with STL specifications. However, neither approach has succeeded in solving problems with complex long-horizon STL specifications within a realistic timeframe. This study proposes a new optimization framework, called \textit{STLCCP}, which explicitly incorporates several structures of STL to mitigate this issue. The core of our framework is a structure-aware decomposition of STL formulas, which converts the original program into a difference of convex (DC) programs. This program is then solved as a sequence of convex quadratic programs based on the convex-concave procedure (CCP). Our numerical experiments on several commonly used benchmarks demonstrate that this framework can effectively handle complex scenarios over long horizons, which have been challenging to address even using state-of-the-art optimization methods.
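
For reference, the generic CCP iteration in its standard form is shown below; the paper's actual contribution, the STL-specific decomposition into the convex parts $g$ and $h$, is not reproduced here.

```latex
% Generic CCP step for a DC program min_x f(x) = g(x) - h(x), with g and h
% convex: the concave part -h is linearized around the current iterate x_k.
x_{k+1} \in \arg\min_{x} \; g(x) - \bigl( h(x_k) + \nabla h(x_k)^{\top} (x - x_k) \bigr)
```

Each linearized subproblem is convex, which is what allows the program to be solved as a sequence of convex quadratic programs.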

The emergence of a new, open, and free instruction set architecture, RISC-V, has heralded a new era in microprocessor architectures. Starting with low-power, low-performance prototypes, the RISC-V community has a good chance of moving towards fully functional high-end microprocessors suitable for high-performance computing. Achieving progress in this direction requires comprehensive development of the software environment, namely operating systems, compilers, mathematical libraries, and approaches to performance analysis and optimization. In this paper, we analyze the performance of two available RISC-V devices when executing three memory-bound applications: the widely used STREAM benchmark, an in-place dense matrix transposition algorithm, and a Gaussian Blur algorithm. We show that, compared to x86 and ARM CPUs, RISC-V devices are, as expected, still inferior in terms of computation time but are very good at resource utilization. We also demonstrate that well-developed memory optimization techniques for x86 CPUs also improve performance on RISC-V CPUs. Overall, the paper shows the potential of RISC-V as an alternative architecture for high-performance computing.
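
The core of the STREAM benchmark mentioned above is the "triad" kernel, whose per-iteration arithmetic is trivial compared to its memory traffic. Below is a minimal sketch with an illustrative array size and timing harness rather than the official benchmark's.

```cpp
// Minimal STREAM-style "triad" kernel: a[i] = b[i] + s * c[i]. The kernel
// does almost no arithmetic per byte moved, so its runtime is dominated by
// memory bandwidth. Array size and timing harness are illustrative; the
// official benchmark sizes arrays to defeat caches and repeats measurements.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 24;  // 16M doubles per array (~128 MiB each)
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];      // 2 loads + 1 store per iteration
    const auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    const double gib = 3.0 * n * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
    std::printf("triad bandwidth: %.2f GiB/s\n", gib / secs);
    return 0;
}
```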

Degraded broadcast channels (DBC) are a typical multi-user communications scenario. There exist classic transmission methods, such as superposition coding with successive interference cancellation, to achieve the DBC capacity region. However, semantic communication over DBC still lacks in-depth research. To address this, we design a fusion-based multi-user semantic communications system for wireless image transmission over DBC in this paper. The proposed architecture supports a transmitter extracting semantic features for two users separately, and learns to dynamically fuse these semantic features into a joint latent representation for broadcasting. The key here is to design a flexible image semantic fusion (FISF) module to fuse the semantic features of the two users, and to use a multi-layer perceptron (MLP) based neural network to adjust the weights of the different users' semantic features for flexible adaptability to different users' channels. Experiments present the semantic performance region based on the peak signal-to-noise ratio (PSNR) of both users, and show that the proposed system dominates the traditional methods.
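
A toy stand-in for the weighted-fusion idea (not the paper's actual FISF architecture): two users' semantic feature vectors are combined into one broadcast representation, with weights derived from channel quality. The softmax weighting, SNR inputs, and feature values are all assumptions for illustration.

```cpp
// Toy illustration of channel-adaptive semantic feature fusion: fuse two
// users' feature vectors into one joint latent for broadcasting, weighting
// each user's features by channel quality. A conceptual stand-in for the
// FISF module, not the paper's architecture; all values are assumptions.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Maps the two users' SNRs (in dB) to fusion weights that sum to one.
std::pair<double, double> fusion_weights(double snr1, double snr2) {
    const double e1 = std::exp(0.1 * snr1), e2 = std::exp(0.1 * snr2);
    return {e1 / (e1 + e2), e2 / (e1 + e2)};
}

int main() {
    const std::vector<double> f1 = {0.2, 0.8, 0.5};  // user 1 semantic features
    const std::vector<double> f2 = {0.9, 0.1, 0.4};  // user 2 semantic features
    const auto [w1, w2] = fusion_weights(20.0, 5.0); // user 1: better channel

    std::vector<double> fused(f1.size());
    for (std::size_t i = 0; i < fused.size(); ++i)
        fused[i] = w1 * f1[i] + w2 * f2[i];          // joint latent

    std::printf("w1 = %.2f, w2 = %.2f, fused[0] = %.2f\n", w1, w2, fused[0]);
    return 0;
}
```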

Ultra-wideband (UWB) positioning has emerged as a low-cost and dependable localization solution for multiple use cases, from mobile robots to asset tracking within the Industrial IoT. The technology is mature, and the scientific literature contains multiple datasets and methods for localization based on fixed UWB nodes. At the same time, research in UWB-based relative localization and infrastructure-free localization is gaining traction; however, tools and datasets in this domain are scarce. Therefore, we introduce in this paper a novel dataset for benchmarking infrastructure-free relative localization, targeting the domain of multi-robot systems. Compared to previous datasets, we analyze the performance of different relative localization approaches for a much wider variety of scenarios with varying numbers of fixed and mobile nodes. A motion capture system provides ground truth, and the recordings are multi-modal, including inertial and odometry measurements for benchmarking sensor fusion methods. Additionally, the dataset contains measurements of ranging accuracy as a function of the relative orientation of antennas, as well as a comprehensive set of measurements for ranging between a single pair of nodes. Our experimental analysis shows that high localization accuracy can be achieved, but the variability of the ranging error is significant across different settings and setups.

Directional tests to compare incomplete undirected graphs are developed in the general context of covariance selection for Gaussian graphical models. The exactness of the underlying saddlepoint approximation is proved for chordal graphs and leads to exact control of the size of the tests, given that the only approximation error involved is due to the numerical calculation of two scalar integrals. Although exactness is not guaranteed for non-chordal graphs, the ability of the saddlepoint approximation to control the relative error leads the directional test to outperform its competitors even in these cases. The accuracy of our proposal is verified by simulation experiments under challenging scenarios, where inference via standard asymptotic approximations to the likelihood ratio test and some of its higher-order modifications fails. The directional approach is used to illustrate the assessment of Markovian dependencies in a dataset from a veterinary trial on cattle. A second example with microarray data shows how to select the graph structure related to genetic anomalies due to acute lymphocytic leukemia.
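
For context, the directional p-value takes the following standard form in the directional-testing literature: a ratio of two one-dimensional integrals along the line connecting the constrained expected value of the sufficient statistic to its observed value. These are presumably the "two scalar integrals" referred to above.

```latex
% Standard form of the directional p-value; h(t; psi) denotes the saddlepoint
% approximation to the density of the sufficient statistic along the directed
% line, and d is the dimension of the tested parameter.
p(\psi) \;=\;
\frac{\int_{1}^{t_{\max}} t^{\,d-1}\, h(t;\psi)\,\mathrm{d}t}
     {\int_{0}^{t_{\max}} t^{\,d-1}\, h(t;\psi)\,\mathrm{d}t}
```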

Parallel I/O refers to the ability of scientific programs to concurrently read/write from/to a single file from multiple processes executing on distributed memory platforms like compute clusters. In the HPC world, I/O becomes a significant bottleneck for many real-world scientific applications. In the last two decades, there has been significant research in improving the performance of I/O operations in scientific computing for traditional languages including C, C++, and Fortran. As a result of this, several mature and high-performance libraries including ROMIO (an implementation of MPI-IO), parallel HDF5, Parallel I/O (PIO), and parallel netCDF are available today that provide efficient I/O for scientific applications. However, there is very little research done to evaluate and improve the I/O performance of Java-based HPC applications. The main hindrance in the development of efficient parallel I/O Java libraries is the lack of a standard API (something equivalent to MPI-IO). Some ad-hoc solutions have been developed and used in proprietary applications, but there is no general-purpose solution that can be used by performance-hungry applications. As part of this project, we plan to develop a Java-based parallel I/O API inspired by the MPI-IO bindings (MPI 2.0 standard document) for C, C++, and Fortran. Once the Java equivalent API of MPI-IO has been developed, we will develop a reference implementation on top of existing Java messaging libraries. Later, we will evaluate and compare the performance of our reference Java Parallel I/O library with its C/C++ counterparts using benchmarks and real-world applications.
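
For a sense of what a Java equivalent would mirror, below is the canonical MPI-IO pattern (using the MPI C API, which is what the MPI 2.0 bindings define): each rank writes its block of a single shared file at a rank-dependent offset. The file name is illustrative.

```cpp
// Canonical MPI-IO pattern: all ranks write concurrently to disjoint
// regions of one shared file. Build and run with an MPI toolchain, e.g.
// `mpicxx io.cpp -o io && mpirun -np 4 ./io`.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;
    std::vector<int> block(count, rank);   // each rank's payload

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    // Rank-dependent offset keeps the writes non-overlapping.
    const MPI_Offset offset =
        static_cast<MPI_Offset>(rank) * count * sizeof(int);
    MPI_File_write_at(fh, offset, block.data(), count, MPI_INT,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```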

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principles of PERKS and demonstrate their effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x on smaller SpMV datasets from SuiteSparse and $1.43$x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: //github.com/neozhang307/PERKS.
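
The restructuring PERKS performs can be illustrated with a CPU analogue: instead of relaunching workers once per time step, the workers persist and the time loop runs inside them, with an explicit barrier replacing the implicit end-of-kernel barrier. This is a conceptual sketch using C++20 threads, not the paper's GPU implementation, which uses a persistent kernel with device-wide synchronization.

```cpp
// CPU analogue of the persistent-kernel pattern: the time loop lives inside
// long-lived workers and a barrier replaces the per-step relaunch. On a GPU,
// PERKS does the same with a persistent kernel and a device-wide barrier.
// Requires C++20 (std::barrier, std::jthread).
#include <barrier>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

int main() {
    const int nthreads = 4, steps = 100, n = 1 << 16;
    std::vector<double> cur(n, 1.0), nxt(n, 0.0);
    std::barrier sync(nthreads);

    auto worker = [&](int tid) {
        const int lo = tid * n / nthreads, hi = (tid + 1) * n / nthreads;
        for (int t = 0; t < steps; ++t) {   // time loop inside the worker
            for (int i = lo; i < hi; ++i)   // toy 1D stencil update
                nxt[i] = 0.5 * (cur[(i + n - 1) % n] + cur[(i + 1) % n]);
            sync.arrive_and_wait();         // replaces per-step kernel relaunch
            if (tid == 0) std::swap(cur, nxt);
            sync.arrive_and_wait();         // all threads see swapped buffers
        }
    };

    std::vector<std::jthread> pool;
    for (int tid = 0; tid < nthreads; ++tid) pool.emplace_back(worker, tid);
    pool.clear();                            // join all workers
    std::printf("cur[0] = %f\n", cur[0]);
    return 0;
}
```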

Over the years, several memory models have been proposed to capture the subtle concurrency semantics of C/C++. One of the most fundamental problems associated with a memory model M is consistency checking: given an execution X, is X consistent with M? This problem lies at the heart of numerous applications, including specification testing and litmus tests, stateless model checking, and dynamic analyses. As such, it has been explored extensively and its complexity is well-understood for traditional models like SC and TSO. However, less is known for the numerous model variants of C/C++, for which the problem becomes challenging due to the intricacies of their concurrency primitives. In this work we study the problem of consistency checking for popular variants of the C11 memory model, in particular, the RC20 model, its release-acquire (RA) fragment, the strong and weak variants of RA (SRA and WRA), as well as the Relaxed fragment of RC20. Motivated by applications in testing and model checking, we focus on reads-from consistency checking. The input is an execution X specifying a set of events, their program order and their reads-from relation, and the task is to decide the existence of a modification order on the writes of X that makes X consistent in a memory model. We draw a rich complexity landscape for this problem; our results include (i)~nearly-linear-time algorithms for certain variants, which improve over prior results, (ii)~fine-grained optimality results, as well as (iii)~matching upper and lower bounds (NP-hardness) for other variants. To our knowledge, this is the first work to characterize the complexity of consistency checking for C11 memory models. We have implemented our algorithms inside the TruSt model checker and the C11Tester testing tool. Experiments on standard benchmarks show that our new algorithms improve consistency checking, often by a significant margin.
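
As a concrete example of the executions such a checker reasons about, below is the classic message-passing litmus test under release/acquire: if the load of `flag` reads-from the release store (r1 == 1), the acquire/release pairing forces the load of `data` to read-from the store of 42, so the outcome r1 == 1 && r2 == 0 is inconsistent with the RA model.

```cpp
// Message-passing litmus test under C++11 release/acquire: the kind of
// execution (events + program order + reads-from) that a reads-from
// consistency checker takes as input.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> data{0}, flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);   // releases the write to data
}

void consumer() {
    const int r1 = flag.load(std::memory_order_acquire);
    const int r2 = data.load(std::memory_order_relaxed);
    std::printf("r1=%d r2=%d\n", r1, r2);       // under RA, r1==1 implies r2==42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```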

The single-source shortest path (SSSP) problem is a well-studied problem that is used in many applications. In the parallel setting, a work-efficient algorithm that additionally attains $o(n)$ parallel depth has been elusive. Alternatively, various approaches have been developed that take advantage of specific properties of a particular class of graphs. On a graphics processing unit (GPU), the current state-of-the-art SSSP algorithms are implementations of the Delta-stepping algorithm, which does not perform well for graphs with large diameters. The main contribution of this work is to provide an algorithm designed for GPUs that runs efficiently for such graphs. We present the parallel bucket heap, a parallel cache-efficient data structure adapted for modern GPU architectures that supports standard priority queue operations, as well as bulk update. We analyze the structure in several well-known computational models and show that it provides both optimal parallelism and is cache-efficient. We implement the parallel bucket heap and use it in a parallel variant of Dijkstra's algorithm to solve the SSSP problem. Experimental results indicate that, for sufficiently large, dense graphs with high diameter, we outperform the current state-of-the-art SSSP implementations on an NVIDIA RTX 2080 Ti and Quadro M4000 by up to a factor of 2.8 and 5.4, respectively.
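
For orientation, the textbook sequential baseline is Dijkstra's algorithm with a binary-heap priority queue, shown below on a small illustrative graph; the paper's contribution replaces this queue with a parallel, cache-efficient bucket heap that additionally supports bulk updates.

```cpp
// Textbook Dijkstra with a binary-heap priority queue and lazy deletion; the
// paper's GPU variant replaces this queue with a parallel bucket heap that
// performs such updates in bulk. The 4-vertex graph is illustrative.
#include <cstdio>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Adjacency list: adj[u] holds {neighbor, edge weight} pairs.
    const std::vector<std::vector<std::pair<int, int>>> adj = {
        {{1, 4}, {2, 1}}, {{3, 1}}, {{1, 2}, {3, 5}}, {}};
    const int n = static_cast<int>(adj.size()), src = 0;
    std::vector<int> dist(n, std::numeric_limits<int>::max());

    // Min-heap of {tentative distance, vertex}.
    using Entry = std::pair<int, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> pq;
    dist[src] = 0;
    pq.push({0, src});

    while (!pq.empty()) {
        const auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;          // stale entry: skip (lazy deletion)
        for (const auto& [v, w] : adj[u])
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.push({dist[v], v});
            }
    }
    for (int v = 0; v < n; ++v) std::printf("dist[%d] = %d\n", v, dist[v]);
    return 0;
}
```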
