Trends in hardware, the prevalence of the cloud, and the rise of highly demanding applications have ushered in an era of specialization that is quickly changing how data is processed at scale. These changes are likely to continue and accelerate in the coming years as new technologies are adopted and deployed: smart NICs, smart storage, smart memory, disaggregated storage, disaggregated memory, specialized accelerators (GPUs, TPUs, FPGAs), and a wealth of ASICs specifically created to deal with computationally expensive tasks (e.g., cryptography or compression). In this tutorial, we focus on data processing on FPGAs, a technology that has received less attention than, e.g., TPUs or GPUs but that is nevertheless increasingly being deployed in the cloud for data processing tasks, owing to the architectural flexibility of FPGAs and their ability to process data at line rate, something not possible with other types of processors or accelerators. In the tutorial, we will cover what FPGAs are, their characteristics, their advantages and disadvantages, as well as examples from deployments in the industry and how they are used in various data processing tasks. We will introduce FPGA programming with high-level languages and describe hardware and software resources available to researchers. The tutorial includes case studies borrowed from research done in collaboration with companies that illustrate the potential of FPGAs in data processing and how software and hardware are evolving to take advantage of the possibilities offered by FPGAs. The use cases include: (1) approximate nearest neighbor search, which is relevant to databases and machine learning, (2) remote disaggregated memory, showing how the cloud architecture is evolving and demonstrating the potential for operator offloading and line-rate data processing, and (3) recommendation systems as an application with tight latency constraints.

Related content

FPGA

FPGA: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Publisher: ACM. SIT:

MoDELS · Language modeling · TOOLS · API · GPT-4 ·

May 24, 2023

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil,Tianjun Zhang,Xin Wang,Joseph E. Gonzalez

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at //gorilla.cs.berkeley.edu

Networking · Path · Exchangeable · Data acquisition · Buffer (company) ·

May 22, 2023

A novel approach for FPGA-to-server data transmission over an Ethernet-based network using the eXpress Data Path technology

Carsten Dülsen, Tobias Flick, Timo Göhring, Wolfgang Wagner, Marius Wensing

from arxiv, 12 pages, 9 figures

In the context of the upgrade of the Large Hadron Collider at CERN for high-luminosity operation, the particle detectors have to cope with much higher data rates and therefore need to upgrade their data acquisition systems. This upgrade is taken as an opportunity to replace the currently used highly customized hardware with commercial solutions. Nevertheless, some of the data processing still needs to be done within Field Programmable Gate Arrays (FPGAs), requiring the transfer of data between the FPGAs and the commercial servers. This paper reports on a study of direct data transmission from FPGAs to servers via a commercial network. The large data buffers required by reliable data-transmission protocols are avoided by using an emerging technique named eXpress Data Path (XDP). Based on XDP, the transmission of 5.2 PB (i.e. 2.92 * 10^{12} packets) was achieved within 168 h without a single missing packet.
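As a quick sanity check on the figures quoted in this abstract, the mean packet size and sustained rate follow from simple arithmetic. The sketch below assumes decimal petabytes (1 PB = 10^15 bytes) and a perfectly steady rate over the full 168 h; neither assumption is stated in the abstract, so the results are only indicative.

```python
# Back-of-the-envelope check of the numbers quoted in the abstract.
# Assumption (not in the source): 1 PB = 10**15 bytes and a constant rate.
total_bytes = 5.2e15        # 5.2 PB transferred
total_packets = 2.92e12     # packets transferred
duration_s = 168 * 3600     # 168 hours in seconds

mean_packet_bytes = total_bytes / total_packets
mean_rate_gbit_s = total_bytes * 8 / duration_s / 1e9
mean_packet_rate_mpps = total_packets / duration_s / 1e6

print(f"mean packet size : {mean_packet_bytes:.0f} bytes")
print(f"mean throughput  : {mean_rate_gbit_s:.1f} Gbit/s")
print(f"mean packet rate : {mean_packet_rate_mpps:.2f} Mpackets/s")
```

Under these assumptions the averages come out at roughly 1.8 kB per packet and a little under 70 Gbit/s sustained for a week.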

Performer · Inference · Processing (programming language) · ML · Integration ·

May 22, 2023

IMBUE: In-Memory Boolean-to-CUrrent Inference ArchitecturE for Tsetlin Machines

Omar Ghazal,Simranjeet Singh,Tousif Rahman,Shengqi Yu,Yujin Zheng,Domenico Balsamo,Sachin Patkar,Farhad Merchant,Fei Xia,Alex Yakovlev,Rishad Shafik

from arxiv, Accepted at ACM/IEEE International Symposium on Low Power Electronics and Design 2023 (ISLPED 2023)

In-memory computing for Machine Learning (ML) applications remedies the von Neumann bottlenecks by organizing computation to exploit parallelism and locality. Non-volatile memory devices such as Resistive RAM (ReRAM) offer integrated switching and storage capabilities showing promising performance for ML applications. However, ReRAM devices have design challenges, such as non-linear digital-analog conversion and circuit overheads. This paper proposes an In-Memory Boolean-to-Current Inference Architecture (IMBUE) that uses ReRAM-transistor cells to eliminate the need for such conversions. IMBUE processes Boolean feature inputs expressed as digital voltages and generates parallel current paths based on resistive memory states. The proportional column current is then translated back to the Boolean domain for further digital processing. The IMBUE architecture is inspired by the Tsetlin Machine (TM), an emerging ML algorithm based on intrinsically Boolean logic. The IMBUE architecture demonstrates significant performance improvements over binarized convolutional neural networks and digital TM in-memory implementations, achieving up to a 12.99x and 5.28x increase, respectively.
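The "intrinsically Boolean logic" of a Tsetlin Machine can be illustrated in software: each clause is an AND over selected literals (a feature or its negation), and a class score is the number of firing positive-polarity clauses minus the number of firing negative-polarity ones. The following is a minimal sketch of that inference step only, with hand-picked clauses rather than learned ones. The names `Clause`, `evaluate_clause`, and `class_sum` are hypothetical, and nothing here models IMBUE's ReRAM cells or current paths.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clause:
    include_pos: List[int]  # feature indices that must be 1
    include_neg: List[int]  # feature indices that must be 0
    polarity: int           # +1 votes for the class, -1 votes against it


def evaluate_clause(clause: Clause, x: Sequence[int]) -> int:
    """Return 1 if every included literal is satisfied by x, else 0."""
    pos_ok = all(x[i] == 1 for i in clause.include_pos)
    neg_ok = all(x[i] == 0 for i in clause.include_neg)
    return int(pos_ok and neg_ok)


def class_sum(clauses: Sequence[Clause], x: Sequence[int]) -> int:
    """Signed vote count over all clauses for input x."""
    return sum(c.polarity * evaluate_clause(c, x) for c in clauses)


# Hand-picked (not learned) clauses that together recognise XOR.
xor_clauses = [
    Clause([0], [1], +1),     # x0 AND NOT x1
    Clause([1], [0], +1),     # x1 AND NOT x0
    Clause([0, 1], [], -1),   # x0 AND x1
    Clause([], [0, 1], -1),   # NOT x0 AND NOT x1
]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [int(class_sum(xor_clauses, x) >= 0) for x in inputs]
for x, y in zip(inputs, predictions):
    print(x, "->", y)
```

Because every step is an AND, a NOT, or a vote count, inference needs no multiplications, which is the property an in-memory Boolean design can exploit.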

Language modeling · Readability · MoDELS · Code · Optimizer ·

May 21, 2023

SLaDe: A Portable Small Language Model Decompiler for Optimized Assembler

Jordi Armengol-Estapé,Jackson Woodruff,Chris Cummins,Michael F. P. O'Boyle

Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. However, to date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code. We develop a novel tokenizer and exploit no-dropout training to produce high-quality code. We utilize type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types, and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 functions from AnghaBench on two ISAs and at two optimization levels. SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler, and up to 4 times more accurate than the large language model ChatGPT, and it generates significantly more readable code than both.

Reducible · INFORMS · Things · Stream · Storage ·

May 20, 2023

Post-Quantum Hybrid Digital Signatures with Hardware-Support for Digital Twins

Saif E. Nouma,Attila A. Yavuz

from arxiv, 20 pages, 7 figures

Digital Twins (DT) virtually model cyber-physical objects using Internet of Things (IoT) components (e.g., sensors) to gather and process sensitive information stored in the cloud. Trustworthiness of the streamed data is crucial, which requires quantum safety and breach resiliency. Digital signatures are essential for scalable authentication and non-repudiation. Yet, NIST PQC signature standards are exorbitantly costly for low-end IoT without considering forward security. Moreover, Post-Quantum (PQ) signatures lack aggregation, which is highly desirable to reduce the transmission and storage burdens in DTs. Hence, there is an urgent need for lightweight digital signatures that offer compromise resiliency and compactness while permitting an effective transition into the PQ era for DTs. We create a series of highly lightweight digital signatures called Hardware-ASsisted Efficient Signature (HASES) that meets the above requirements. The core of HASES is a hardware-assisted cryptographic commitment construct oracle (CCO) that permits verifiers to obtain expensive commitments without signer interaction. We created three HASES schemes: PQ-HASES is a forward-secure PQ signature, LA-HASES is an efficient aggregate Elliptic-Curve signature, and HY-HASES is a novel hybrid scheme that combines PQ-HASES and LA-HASES with novel strong nesting and sequential aggregation. HASES does not require secure hardware on the signer. We proved that HASES schemes are secure and implemented them on commodity hardware and an 8-bit AVR ATmega2560. Our experiments confirm that PQ-HASES and LA-HASES are two orders of magnitude more signer-efficient than their PQ and conventional-secure counterparts, respectively. HY-HASES outperforms NIST PQC and conventional signature combinations, offering a standard-compliant transitional solution for emerging DTs. We open-source HASES schemes for public testing and adaptation.

Erasure code · Storage · Total return · Server · Operation ·

May 20, 2023

CausalEC: A Causally Consistent Data Storage Algorithm based on Cross-Object Erasure Coding

Viveck R. Cadambe,Shihang Lyu

from arxiv, Extended version of a brief announcement at ACM PODC 2023

Current causally consistent data storage algorithms use partial or full replication to ensure data access to clients over a distributed setting. We develop, for the first time, an erasure coding-based algorithm called CausalEC that ensures causal consistency for a collection of read-write objects stored in a distributed set of nodes over an asynchronous message-passing system. CausalEC can use an arbitrary linear erasure code for data storage and ensures liveness, fault-tolerance, and storage properties prescribed by the erasure code. CausalEC retains a key benefit of previous replication-based algorithms - every write operation is "local", that is, a server performs only local actions before returning to a client that issued a write operation. For servers that store certain objects in an uncoded manner, read operations to those objects also return locally. In general, a read operation to an object can be returned by a server on contacting a small subset of other servers so long as the underlying erasure code allows for the object to be decoded from that subset. Notably, unlike previous consistent erasure coding-based algorithms, CausalEC is compatible with cross-object erasure coding, where nodes encode values across multiple objects. CausalEC navigates the technical challenges of cross-object erasure coding, in particular, pertaining to re-encoding when writes update the values and ensuring that concurrent reads are served in a non-blocking manner during the transition to storing codeword symbols corresponding to the updated values.
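To make "cross-object erasure coding" concrete, here is a toy sketch in which two objects are spread over three nodes and the third node stores their bytewise XOR, so both objects can be decoded from any two nodes. It illustrates only the coding idea with a simple XOR parity code, plus the re-encoding a write forces on the parity node. It is not the CausalEC protocol: there is no causal-consistency metadata, no concurrency handling, and all function and node names are hypothetical.

```python
from typing import Dict, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length values."""
    if len(a) != len(b):
        raise ValueError("values must have equal length")
    return bytes(p ^ q for p, q in zip(a, b))


def encode(x1: bytes, x2: bytes) -> Dict[str, bytes]:
    """node1 and node2 store the objects uncoded; node3 stores a cross-object symbol."""
    return {"node1": x1, "node2": x2, "node3": xor_bytes(x1, x2)}


def decode(available: Dict[str, bytes]) -> Tuple[bytes, bytes]:
    """Recover (x1, x2) from the symbols of any two of the three nodes."""
    if len(available) < 2:
        raise ValueError("need symbols from at least two nodes")
    if "node1" in available and "node2" in available:
        return available["node1"], available["node2"]
    parity = available["node3"]
    if "node1" in available:
        return available["node1"], xor_bytes(available["node1"], parity)
    return xor_bytes(available["node2"], parity), available["node2"]


def reencode_parity(parity: bytes, old_x1: bytes, new_x1: bytes) -> bytes:
    """Update node3's symbol after a write to x1, without reading x2."""
    return xor_bytes(parity, xor_bytes(old_x1, new_x1))


stored = encode(b"abcd", b"wxyz")
# node1 is unreachable: x1 is still decodable from node2 and node3.
print(decode({"node2": stored["node2"], "node3": stored["node3"]}))

# A write to x1 changes the symbol held by node3 as well.
stored["node3"] = reencode_parity(stored["node3"], b"abcd", b"ABCD")
stored["node1"] = b"ABCD"
print(decode({"node2": stored["node2"], "node3": stored["node3"]}))
```

The second half hints at why cross-object coding is harder than replication: a write to one object also changes a symbol that encodes another object, so reads arriving during that transition need care.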

state-of-the-art · ML · Machine Learning · Central processing unit (CPU) · Learning ·

May 19, 2023

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Juan Gómez-Luna,Yuxin Guo,Sylvan Brocard,Julien Legriel,Remy Cimadomo,Geraldo F. Oliveira,Gagandeep Singh,Onur Mutlu

from arxiv, Our open-source software is available at //github.com/CMU-SAFARI/pim-ml

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is $27\times$ faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is $2.8\times$ and $3.2\times$ faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.

Processing (programming language) · Deep reinforcement learning · Learned · Reinforcement learning · INTERACT ·

February 4, 2022

A Survey on Deep Reinforcement Learning for Data Processing and Analytics

Qingpeng Cai,Can Cui,Yiyuan Xiong,Wei Wang,Zhongle Xie,Meihui Zhang

from arxiv, 39 pages, 3 figures and 3 tables

Data processing and analytics are fundamental and pervasive. Algorithms play a vital role in data processing and analytics where many algorithm designs have incorporated heuristics and general rules from human knowledge and experience to improve their effectiveness. Recently, reinforcement learning, deep reinforcement learning (DRL) in particular, is increasingly explored and exploited in many areas because it can learn better strategies in complicated environments it is interacting with than statically designed algorithms. Motivated by this trend, we provide a comprehensive review of recent works focusing on utilizing DRL to improve data processing and analytics. First, we present an introduction to key concepts, theories, and methods in DRL. Next, we discuss DRL deployment on database systems, facilitating data processing and analytics in various aspects, including data organization, scheduling, tuning, and indexing. Then, we survey the application of DRL in data processing and analytics, ranging from data preparation, natural language processing to healthcare, fintech, etc. Finally, we discuss important open challenges and future research directions of using DRL in data processing and analytics.

Networking · SimPLe · Automator · INFORMS · Prompt ·

June 11, 2021

Neural Architecture Search without Training

Joseph Mellor,Jack Turner,Amos Storkey,Elliot J. Crowley

from arxiv, Accepted at ICML 2021 for a long presentation

The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be alleviated if we could partially predict a network's trained accuracy from its initial state. In this work, we examine the overlap of activations between datapoints in untrained networks and motivate how this can give a measure which is usefully indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces. Our approach can be readily combined with more expensive search methods; we examine a simple adaptation of regularised evolutionary search. Code for reproducing our experiments is available at //github.com/BayesWatch/nas-without-training.
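One way to turn "overlap of activations between datapoints" into a number is sketched below. Each datapoint is summarised by a binary code marking which ReLU units it activates in the untrained network, a similarity kernel is built from pairwise Hamming distances, and the log-determinant of that kernel serves as the score. The abstract does not spell out the exact measure, so treat the log-determinant choice, the hand-written codes, and all function names as assumptions made for illustration.

```python
import math
from typing import List, Sequence


def hamming_kernel(codes: Sequence[Sequence[int]]) -> List[List[float]]:
    """K[i][j] = number of units on which datapoints i and j agree."""
    n_units = len(codes[0])
    return [
        [float(n_units - sum(a != b for a, b in zip(ci, cj))) for cj in codes]
        for ci in codes
    ]


def log_abs_det(matrix: Sequence[Sequence[float]]) -> float:
    """log|det| via Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return float("-inf")  # singular: some datapoints are indistinguishable
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def score(codes: Sequence[Sequence[int]]) -> float:
    """Higher when datapoints produce more distinct activation patterns."""
    return log_abs_det(hamming_kernel(codes))


# Illustrative activation codes for four datapoints in two hypothetical networks.
distinct_codes = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
overlapping_codes = [(1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1)]

print("distinct   :", score(distinct_codes))
print("overlapping:", score(overlapping_codes))
```

In this toy setting the network whose datapoints share fewer active units gets the higher score, and no weights are trained at any point.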

Convolutional neural network · Neural Networks · Performer · Seven · Processing (programming language) ·

January 17, 2019

A Survey of the Recent Architectures of Deep Convolutional Neural Networks

Asifullah Khan,Anabia Sohail,Umme Zahoora,Aqsa Saeed Qureshi

from arxiv, 60 pages, 11 figures, 1 table

Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art results on various competitive benchmarks. The powerful learning ability of deep CNN is largely achieved with the use of multiple non-linear feature extraction stages that can automatically learn hierarchical representation from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs and recently very interesting deep CNN architectures are reported. The recent race in deep CNN architectures for achieving high performance on the challenging benchmarks has shown that the innovative architectural ideas, as well as parameter optimization, can improve the CNN performance on various vision-related tasks. In this regard, different ideas in the CNN design have been explored such as use of different activation and loss functions, parameter optimization, regularization, and restructuring of processing units. However, the major improvement in representational capacity is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is gaining substantial appreciation. This survey thus focuses on the intrinsic taxonomy present in the recently reported CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting and attention. Additionally, it covers the elementary understanding of the CNN components and sheds light on the current challenges and applications of CNNs.