
The use of quantum processing units (QPUs) promises speed-ups for solving computational problems, but the quantum devices currently available possess only a very limited number of qubits and suffer from considerable imperfections. One possibility to progress towards practical utility is a co-design approach: the problem formulation and algorithm, as well as the physical properties of the QPU, are tailored to the specific application. Since QPUs will likely be used as accelerators for classical computers, the details of system integration into existing architectures are another lever for improving the practical utility of QPUs. In this work, we investigate the influence of different parameters on the runtime of quantum programs on tailored hybrid CPU-QPU systems. We study the influence of communication times between CPU and QPU, how adapting the QPU design influences quantum and overall execution performance, and how these factors interact. Using a simple model that allows us to estimate which design choices should be subjected to optimisation for a given task, we provide the HPC community with an intuition for the potential and limitations of co-design approaches. We also discuss physical limitations to implementing the proposed changes on real quantum hardware devices.
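
Below is a minimal sketch of the kind of runtime model hinted at above, assuming a simple iterative hybrid loop; all parameter names and default values (gate time, link latency, shot count) are illustrative assumptions rather than the authors' actual model.

```python
# Minimal sketch of a hybrid CPU-QPU runtime model. All parameter names and
# default values are illustrative assumptions, not the authors' actual model.

def circuit_time(depth, gate_time_ns=50, readout_time_ns=300):
    """Wall-clock time of one circuit execution (a single shot), in seconds."""
    return (depth * gate_time_ns + readout_time_ns) * 1e-9

def hybrid_runtime(iterations, shots, depth,
                   t_classical=1e-3,      # CPU work per iteration (s)
                   t_comm=0.5e-3,         # one-way CPU<->QPU latency (s)
                   gate_time_ns=50):
    """Estimate total runtime of an iterative hybrid algorithm (e.g. a VQE-like loop)."""
    t_quantum = shots * circuit_time(depth, gate_time_ns)
    # one communication round trip per iteration
    return iterations * (t_classical + 2 * t_comm + t_quantum)

if __name__ == "__main__":
    # Which lever dominates? Compare faster gates vs. a lower-latency link.
    base = hybrid_runtime(iterations=500, shots=1000, depth=200)
    fast_gates = hybrid_runtime(iterations=500, shots=1000, depth=200, gate_time_ns=25)
    fast_link = hybrid_runtime(iterations=500, shots=1000, depth=200, t_comm=0.05e-3)
    print(f"baseline: {base:.2f}s, faster gates: {fast_gates:.2f}s, faster link: {fast_link:.2f}s")
```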

Related Content

The IFIP TC13 Conference on Human-Computer Interaction is an important venue for researchers and practitioners in the field of human-computer interaction to present their work. Over the years, these conferences have attracted researchers from many countries and cultures.
19 October 2022

Optical computing has recently been proposed as a new compute paradigm to meet the demands of future AI/ML workloads in datacenters and supercomputers. However, implementations proposed so far suffer from a lack of scalability, large footprints, high power consumption, and incomplete system-level architectures that prevent integration into existing datacenter infrastructure for real-world applications. In this work, we present a truly scalable optical AI accelerator based on a crossbar architecture. We consider all major roadblocks and address them in this design. Weights are stored on chip using phase change material (PCM) that can be monolithically integrated in silicon photonic processes. All electro-optical components and circuit blocks are modeled based on measured performance metrics in a 45nm monolithic silicon photonic process, which can be co-packaged with advanced CPUs/GPUs and HBM memories. We also present system-level modeling and analysis of our chip's performance for ResNet-50 v1.5, considering all critical parameters, including memory size, array size, photonic losses, and energy consumption of peripheral electronics. Both on-chip SRAM and off-chip DRAM energy overheads are considered in this modeling. We additionally show how a dual-core crossbar design can eliminate programming time overhead at practical SRAM block sizes and batch sizes. Our results show that the proposed 128 x 128 architecture can achieve inferences per second (IPS) similar to an Nvidia A100 GPU at 15.4 times lower power and 7.24 times lower area.
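
The following toy sketch illustrates one way such a system-level throughput estimate could be structured; the tiling rule, clock rate, and layer shapes are assumptions and not the measured figures from the 45nm process described above.

```python
# A toy system-level throughput model for a crossbar accelerator. All numbers
# and formulas here are illustrative assumptions, not measured results.
import math

def layer_mvms(in_features, out_features, array_size=128):
    """Number of array-sized tiles needed to cover one matrix-vector multiply."""
    return math.ceil(in_features / array_size) * math.ceil(out_features / array_size)

def inferences_per_second(layers, clock_hz=5e9, array_size=128):
    """layers: list of (in_features, out_features) pairs for the model's MVM layers."""
    cycles = sum(layer_mvms(i, o, array_size) for i, o in layers)
    return clock_hz / cycles   # assumes one tile MVM per clock, fully pipelined

# Example: a crude stand-in for a few network layers flattened to MVMs.
toy_layers = [(25088, 4096), (4096, 4096), (4096, 1000)]
print(f"~{inferences_per_second(toy_layers):,.0f} IPS (toy model)")
```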

The increased use of Internet of Things (IoT) devices -- from basic sensors to robust embedded computers -- has boosted the demand for information processing and storage solutions closer to these devices. Edge computing has been established as a standard architecture for developing IoT solutions, since it can optimize the workload and capacity of systems that depend on cloud services by deploying the necessary computing power close to where the information is produced and consumed. However, as networks scale in size, reaching consensus becomes an increasingly challenging task. Distributed ledger technologies (DLTs), which can be described as networks of distributed databases that incorporate cryptography, can be leveraged to achieve consensus among participants. In recent years DLTs have gained traction due to the popularity of blockchains, the best-known type of implementation. The reliability and trust that can be achieved through transparent and traceable transactions are further key concepts that bring IoT and DLT together. We present the design, development, and experimental evaluation of a proof-of-concept system that uses DLT smart contracts to efficiently select edge nodes for offloading computational tasks. In particular, we integrate network performance indicators in smart contracts on a Hyperledger blockchain to optimize the offloading of computation under dynamic connectivity conditions. The proposed method can be applied to networks with varied topologies and different means of connectivity. Our results show the applicability of blockchain smart contracts to a variety of industrial use cases.
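
As a rough illustration, the node-selection logic that such a smart contract could encode might look like the plain-Python stand-in below; the field names and weighted scoring rule are assumptions, not the actual Hyperledger chaincode.

```python
# Plain-Python stand-in for smart-contract node selection based on network
# performance indicators. Field names and scoring rule are assumptions.
from dataclasses import dataclass

@dataclass
class EdgeNode:
    node_id: str
    latency_ms: float     # recent round-trip time reported to the ledger
    cpu_load: float       # 0.0 (idle) .. 1.0 (saturated)
    reachable: bool

def select_offload_target(nodes, w_latency=0.7, w_load=0.3):
    """Pick the reachable node with the best weighted latency/load score."""
    candidates = [n for n in nodes if n.reachable]
    if not candidates:
        return None
    # Lower score is better: normalise latency against the worst candidate.
    worst = max(n.latency_ms for n in candidates)
    score = lambda n: w_latency * (n.latency_ms / worst) + w_load * n.cpu_load
    return min(candidates, key=score)

nodes = [EdgeNode("edge-a", 12.0, 0.8, True),
         EdgeNode("edge-b", 35.0, 0.1, True),
         EdgeNode("edge-c", 5.0, 0.2, False)]
print(select_offload_target(nodes).node_id)   # -> "edge-a" with these weights
```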

The rapid advancement of artificial intelligence technologies has given rise to diversified intelligent services, which place unprecedented demands on massive connectivity and large-scale data aggregation. However, scarce radio resources and stringent latency requirements make it challenging to meet these demands. To tackle these challenges, over-the-air computation (AirComp) emerges as a potential technology. Specifically, AirComp seamlessly integrates the communication and computation procedures through the superposition property of multiple-access channels, which yields a revolutionary multiple-access paradigm shift from "compute-after-communicate" to "compute-when-communicate". Meanwhile, low-latency and spectral-efficient wireless data aggregation can be achieved via AirComp by allowing multiple devices to access the wireless channels non-orthogonally. In this paper, we present recent advances in AirComp in terms of foundations, technologies, and applications. The mathematical form and communication design are introduced as the foundations of AirComp, and the critical issues of AirComp over different network architectures are then discussed along with a review of the existing literature. The technologies employed for the analysis and optimization of AirComp are reviewed from the information theory and signal processing perspectives. Moreover, we present existing studies that tackle practical implementation issues in AirComp systems, and elaborate on the applications of AirComp in the Internet of Things and edge intelligence networks. Finally, potential research directions are highlighted to motivate the future development of AirComp.
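
A minimal simulation of the basic AirComp idea, computing a sum over a fading multiple-access channel, could look like the following sketch; the channel model and power-control rule are standard textbook assumptions rather than details from this survey.

```python
# Minimal simulation of over-the-air computation of a sum: all devices transmit
# simultaneously and their signals superpose at the receiver. Channel model and
# power values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K = 20                                 # number of devices
x = rng.uniform(0, 1, K)               # each device's local value
h = rng.rayleigh(scale=1.0, size=K)    # (real-valued) channel gains, known at devices
eta = 0.5 * np.min(h) ** 2             # denoising factor limited by the weakest channel

# Each device pre-equalises its signal; all transmit at once and the signals
# superpose at the receiver over the multiple-access channel.
tx = (np.sqrt(eta) / h) * x
y = np.sum(h * tx) + rng.normal(0, 0.05)   # superposed signal + receiver noise

estimate = y / np.sqrt(eta)
print(f"true sum = {np.sum(x):.3f}, AirComp estimate = {estimate:.3f}")
```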

Efficient quantum control is necessary for practical quantum computing implementations with current technologies. Conventional algorithms for determining optimal control parameters are computationally expensive, largely excluding them from use outside of simulation. Existing hardware solutions structured as lookup tables are imprecise and costly. By designing a machine learning model to approximate the results of traditional tools, a more efficient method can be produced. Such a model can then be synthesized into a hardware accelerator for use in quantum systems. In this study, we demonstrate a machine learning algorithm for predicting optimal pulse parameters. This algorithm is lightweight enough to fit on a low-resource FPGA and performs inference with a latency of 175 ns and a pipeline interval of 5 ns while maintaining >0.99 gate fidelity. In the long term, such an accelerator could be used near quantum computing hardware where traditional computers cannot operate, enabling quantum control at reasonable cost and low latency without incurring large data bandwidths outside of the cryogenic environment.
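
A toy stand-in for such a lightweight regressor is sketched below; the synthetic calibration data, network size, and choice of input/output are assumptions, not the model that was synthesized to the FPGA.

```python
# Toy regressor mapping a target gate parameter to a pulse amplitude, trained on
# synthetic data standing in for an expensive pulse optimiser. All choices here
# (mapping, noise level, network size) are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Synthetic "expensive optimiser" ground truth: amplitude needed for a rotation angle.
theta = rng.uniform(0, np.pi, 2000).reshape(-1, 1)      # target rotation angles
amplitude = 0.8 * np.sin(theta / 2) + 0.02 * rng.normal(size=theta.shape)

# Small network so that it could plausibly be synthesised into fixed-point hardware.
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
model.fit(theta, amplitude.ravel())

test = np.array([[np.pi / 2]])
print(f"predicted amplitude for a pi/2 rotation: {model.predict(test)[0]:.3f}")
```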

Quantum software systems are an emerging genre of software engineering (SE) that exploits the principles of quantum bits (qubits) and quantum gates (Qgates) to solve complex computing problems that today's classical computers cannot solve effectively in a reasonable time. According to its proponents, agile software development practices have the potential to address many of the problems endemic to the development of quantum software. However, there is a dearth of evidence confirming whether agile practices suit, and can be adopted by, software teams as they are in the context of quantum software development. To address this gap, we conducted an empirical study to investigate the needs and challenges of using agile practices to develop quantum software. While our semi-structured interviews with 26 practitioners across 10 countries highlighted the applicability of agile practices in this domain, the interview findings also revealed new challenges impeding the effective incorporation of these practices. Our research findings provide a springboard for further contextualization and seamless integration of agile practices into the development of the next generation of quantum software.

We consider frugal splitting operators for finite sum monotone inclusion problems, i.e., splitting operators that use exactly one direct or resolvent evaluation of each operator of the sum. A novel representation of these operators in terms of what we call a generalized primal-dual resolvent is presented. This representation reveals a number of new results regarding lifting numbers, existence of solution maps, and parallelizability of the forward and backward evaluations. We show that the minimal lifting is $n-1-f$ where $n$ is the number of monotone operators and $f$ is the number of direct evaluations in the splitting. Furthermore, we show that this lifting number is only achievable as long as the first and last evaluations are resolvent evaluations. In the case of frugal resolvent splitting operators, these results are the same as the results of Ryu and Malitsky--Tam. The representation also enables a unified convergence analysis and we present a generally applicable theorem for the convergence and Fej\'er monotonicity of fixed point iterations of frugal splitting operators with cocoercive direct evaluations. We conclude by constructing a new convergent and parallelizable frugal splitting operator with minimal lifting.
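
As a concrete reference point, Douglas-Rachford splitting is a classical frugal resolvent splitting for n = 2 and f = 0 that attains the minimal lifting n - 1 = 1; the sketch below applies it to a toy feasibility problem (the problem instance is an illustrative assumption).

```python
# Douglas-Rachford splitting: one resolvent evaluation of each of the two
# operators per iteration and a single lifted variable (lifting 1 = n - 1).
# Here the resolvents are projections onto two intervals.

def proj_interval(x, lo, hi):
    return min(max(x, lo), hi)

def douglas_rachford(proj_a, proj_b, z0=0.0, iters=100):
    z = z0
    for _ in range(iters):
        xa = proj_a(z)                      # one resolvent of the first operator
        xb = proj_b(2 * xa - z)             # one resolvent of the second operator
        z = z + xb - xa                     # fixed-point update on the lifted variable
    return proj_a(z)                        # solution map applied to the fixed point

# Find a point in [1, 3] ∩ [2, 5]  (expected: a point in [2, 3]).
x = douglas_rachford(lambda z: proj_interval(z, 1, 3),
                     lambda z: proj_interval(z, 2, 5))
print(x)
```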

Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general-purpose computing platforms (CPUs, EdgeGPUs, and GPUs) and the prior-art Transformer accelerators SpAtten and Sanger, respectively, under an attention sparsity of 90%.
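
To make the fixed-pattern idea concrete, the toy sketch below derives a single shared sparsity mask from averaged attention maps and applies it to every input; the actual ViTCoD pruning-and-polarization procedure is more involved, and the mask construction here is an assumption.

```python
# Toy illustration of fixed-pattern attention pruning: one mask, computed
# offline from averaged attention maps, is reused for every input.
import numpy as np

def fixed_sparse_mask(attn_maps, sparsity=0.9):
    """Derive one fixed 0/1 mask shared by all inputs from averaged attention maps."""
    mean_attn = attn_maps.mean(axis=0)                     # (tokens, tokens)
    k = int(round((1 - sparsity) * mean_attn.size))        # number of entries to keep
    thresh = np.sort(mean_attn, axis=None)[-k]
    return (mean_attn >= thresh).astype(np.float32)

rng = np.random.default_rng(0)
attn = rng.random((32, 197, 197))                          # 32 samples, 197 ViT tokens
mask = fixed_sparse_mask(attn, sparsity=0.9)
pruned = attn * mask                                       # same mask for every input
print(f"kept fraction: {mask.mean():.3f}")
```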

Estimation of the rigid transformation between two point clouds is a computationally challenging problem in vision-based relative navigation. Targeting a real-time navigation solution utilizing point-cloud and image registration algorithms, this paper develops high-performance avionics for a power- and resource-constrained pose estimation framework. A Field-Programmable Gate Array (FPGA) based embedded architecture is developed to accelerate estimation of the relative pose between the point clouds, aided by image features that correspond to the individual point sets. At the algorithmic level, the pose estimation method is an adaptation of the Optimal Linear Attitude and Translation Estimator (OLTAE) for relative attitude and translation estimation. At the architecture level, the proposed embedded solution is a hardware/software co-design that evaluates the OLTAE computations on bare-metal hardware for high-speed state estimation. The finite-precision FPGA evaluation of the OLTAE algorithm is compared with a double-precision evaluation in MATLAB for performance analysis and error quantification. Implementation results of the proposed finite-precision OLTAE accelerator demonstrate the high-performance compute capabilities of the FPGA-based pose estimation while keeping relative numerical errors below 7%.
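
For readers unfamiliar with the underlying estimation task, the sketch below solves it with the classical SVD-based (Kabsch/Procrustes) solution as a generic stand-in; it is not the OLTAE formulation used in the paper.

```python
# Generic SVD-based rigid-transform estimation between two point clouds with
# known correspondences (Kabsch/Procrustes). Shown only as a stand-in for the
# estimation task; it is not the OLTAE algorithm.
import numpy as np

def estimate_rigid_transform(P, Q):
    """Find R, t minimising ||R @ P_i + t - Q_i|| over corresponding points."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = cQ - R @ cP
    return R, t

rng = np.random.default_rng(0)
P = rng.random((100, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
Q = P @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform(P, Q)
print(np.allclose(R_est, R_true, atol=1e-6), np.allclose(t_est, t_true, atol=1e-6))
```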

Compiling a high-level quantum circuit down to a low-level description that can be executed on state-of-the-art quantum computers is a crucial part of the software stack for quantum computing. One step in compiling a quantum circuit to some device is quantum circuit mapping, where the circuit is transformed such that it complies with the architecture's limited qubit connectivity. Because the search space in quantum circuit mapping grows exponentially in the number of qubits, it is desirable to consider as few of the device's physical qubits as possible in the process. Previous work conjectured that it suffices to consider only subarchitectures of a quantum computer composed of as many qubits as used in the circuit. In this work, we refute this conjecture and establish criteria for judging whether considering larger parts of the architecture might yield better solutions to the mapping problem. Through rigorous analysis, we show that determining subarchitectures that are of minimal size, i.e., of which no physical qubit can be removed without losing the optimal mapping solution for some quantum circuit, is a very hard problem. Based on a relaxation of the criteria for optimality, we introduce a relaxed consideration that still maintains optimality for practically relevant quantum circuits. Eventually, this results in two methods for computing near-optimal sets of subarchitectures, providing the basis for efficient quantum circuit mapping solutions. We demonstrate the benefits of this novel method for state-of-the-art quantum computers by IBM, Google and Rigetti.
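
A toy version of the underlying feasibility question (can a circuit's two-qubit-interaction graph be mapped SWAP-free onto some k-qubit induced subarchitecture?) is sketched below; the paper's methods for selecting near-optimal subarchitecture sets go well beyond this brute-force check, which is included only to fix ideas.

```python
# Brute-force check: does the circuit's interaction graph embed (SWAP-free)
# into some induced k-qubit subarchitecture of the device coupling map?
import itertools
import networkx as nx
from networkx.algorithms import isomorphism

def embeds_in_some_subarchitecture(coupling: nx.Graph, interaction: nx.Graph, k: int) -> bool:
    for qubits in itertools.combinations(coupling.nodes, k):
        sub = coupling.subgraph(qubits)
        gm = isomorphism.GraphMatcher(sub, interaction)
        if gm.subgraph_is_monomorphic():      # every interaction edge lands on a coupling
            return True
    return False

# 5-qubit line architecture vs. a 4-qubit circuit whose interaction graph is a star.
line = nx.path_graph(5)
star = nx.star_graph(3)                       # one centre qubit talking to three others
print(embeds_in_some_subarchitecture(line, star, k=4))   # False: a line has max degree 2
```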

Deep convolutional neural networks (CNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. A natural thought is therefore to perform model compression and acceleration in deep networks without significantly decreasing model performance. During the past few years, tremendous progress has been made in this area. In this paper, we survey recently developed advanced techniques for compacting and accelerating CNN models. These techniques are roughly categorized into four schemes: parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and sharing are described first, followed by the other techniques. For each scheme, we provide insightful analysis regarding the performance, related applications, advantages, and drawbacks. We then go through a few very recent additional successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmarking efforts. Finally, we conclude this paper and discuss remaining challenges and possible directions on this topic.
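
As a concrete instance of the parameter-pruning scheme mentioned first, the sketch below applies simple magnitude-based pruning to a convolution kernel; the layer shape and sparsity target are illustrative assumptions.

```python
# Minimal example of magnitude-based parameter pruning: weights below a
# percentile threshold are zeroed out. Shapes and sparsity are assumptions.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.8) -> np.ndarray:
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(0)
conv_kernel = rng.normal(size=(64, 3, 3, 3))          # (out_ch, in_ch, kH, kW)
pruned = magnitude_prune(conv_kernel, sparsity=0.8)
print(f"non-zero fraction after pruning: {np.count_nonzero(pruned) / pruned.size:.2f}")
```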
