
As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe increases in bandwidth of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.
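
The abstract contains no code, but the core idea of GPU-aware communication can be illustrated outside the Charm++ stack. The sketch below uses mpi4py with CuPy under a CUDA-aware MPI build; these are stand-ins chosen for brevity rather than the paper's UCX-based Charm++/AMPI/Charm4py layer. The point is only that the device buffer is handed directly to the communication call, with no explicit host staging.

```python
# Minimal sketch of GPU-aware point-to-point communication: the GPU buffer is
# passed directly to the communication call. Assumes a CUDA-aware MPI build,
# mpi4py >= 3.1, and CuPy; this illustrates the general pattern the paper
# targets, not the Charm++/AMPI/Charm4py internals described above.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1 << 20

if rank == 0:
    sendbuf = cp.arange(n, dtype=cp.float64)        # message lives in GPU memory
    cp.cuda.get_current_stream().synchronize()      # ensure the data is ready
    comm.Send(sendbuf, dest=1, tag=7)               # device buffer passed directly
elif rank == 1:
    recvbuf = cp.empty(n, dtype=cp.float64)
    comm.Recv(recvbuf, source=0, tag=7)
    assert float(recvbuf[-1]) == n - 1              # data arrived in device memory
```

With a non-GPU-aware stack, the same exchange would require explicit staging through host buffers on both ranks, which is the overhead a GPU-aware communication layer like the one described above is designed to eliminate.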

Related Content

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e., their outputs range from the smallest character pieces to the surface forms of words, and include a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer delivers performance competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
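
As a concrete, hedged illustration of the granularity comparison described above, the sketch below trains a plain BPE tokenizer with the HuggingFace tokenizers library; the corpus file name and vocabulary size are placeholders, and the paper's Morphological-level tokenizer is not reproduced here.

```python
# Minimal sketch: train a BPE tokenizer on a Turkish text file and inspect how
# it segments a morphologically complex word. The file path and vocab size are
# placeholders, not values from the paper.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32000,   # the paper varies vocabulary size; this is illustrative
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["oscar_tr.txt"], trainer=trainer)

# e.g. "evlerimizden" ("from our houses") may split into root + suffix pieces
print(tokenizer.encode("evlerimizden").tokens)
```

A Morphological-level tokenizer would instead split along analyzer-produced morpheme boundaries rather than frequency-derived merges.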

Numerical solution of heterogeneous Helmholtz problems presents various computational challenges, with descriptive theory remaining out of reach for many popular approaches. Robustness and scalability are key for practical and reliable solvers in large-scale applications, especially for large wave number problems. In this work we explore the use of a GenEO-type coarse space to build a two-level additive Schwarz method applicable to highly indefinite Helmholtz problems. Through a range of numerical tests on a 2D model problem, discretised by finite elements on pollution-free meshes, we observe robust convergence, iteration counts that do not increase with the wave number, and good scalability of our approach. We further provide results showing a favourable comparison with the DtN coarse space. Our numerical study shows promise that our solver methodology can be effective for challenging heterogeneous applications.
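
For readers unfamiliar with the method class, a standard two-level additive Schwarz preconditioner has the form below; the notation is generic and assumed here, not taken from the paper.

```latex
% Two-level additive Schwarz preconditioner (standard form; notation assumed).
% R_j restricts to the j-th overlapping subdomain, A_j = R_j A R_j^T are the
% local problems, and R_0 spans the coarse space (here built from GenEO-type
% local eigenvectors), with A_0 = R_0 A R_0^T.
\[
  M_{\mathrm{AS},2}^{-1}
    \;=\; R_0^{T} A_0^{-1} R_0
    \;+\; \sum_{j=1}^{N} R_j^{T} A_j^{-1} R_j .
\]
```

The coarse solve involving $A_0$ is what allows iteration counts to stay flat as the wave number grows, provided the coarse space is rich enough.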

Approximately 50% of development resources are devoted to UI development tasks [9]. Within this large share, developing icons can be a time-consuming task, because developers need to consider not only effective implementation methods but also easy-to-understand descriptions. In this paper, we present Auto-Icon+, an approach for automatically generating readable and efficient code for icons from design artifacts. Based on our interviews, which revealed a gap between designers (who assemble icons from multiple components) and developers (who treat icons as single images), we apply a heuristic clustering algorithm to compose the components into an icon image. We then propose an approach based on a deep learning model and computer vision methods to convert the composed icon image to fonts with descriptive labels, thereby reducing the laborious manual effort for developers and facilitating UI development. We quantitatively evaluate the quality of our method in a real-world UI development environment and demonstrate that it offers developers accurate, efficient, readable, and usable code for icon designs, saving 65.2% of implementation time.
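
The abstract does not spell out the clustering heuristic, so the following is a hypothetical sketch of one simple variant: group design components whose bounding boxes overlap or nearly touch, so that each group can be rendered as a single icon image. The function names, the box format, and the gap threshold are all illustrative assumptions, not Auto-Icon+'s algorithm.

```python
# Hypothetical sketch: greedily group design components whose bounding boxes
# overlap or are within a small gap, so each group becomes one icon image.
# Box format (x1, y1, x2, y2), the gap threshold, and all names are illustrative.
def near(a, b, gap=4):
    return not (a[2] + gap < b[0] or b[2] + gap < a[0] or
                a[3] + gap < b[1] or b[3] + gap < a[1])

def cluster_components(boxes, gap=4):
    parent = list(range(len(boxes)))            # union-find over component ids
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if near(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)       # merge the two groups
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Three touching shapes form one icon; the fourth sits elsewhere on the canvas.
print(cluster_components([(0, 0, 10, 10), (9, 0, 20, 10),
                          (0, 9, 10, 20), (100, 100, 110, 110)]))
```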

Video-based programming tutorials are a popular form of tutorial used by authors to guide learners to code. However, the interactivity of these videos is largely limited to controlling video flow. Existing works that add interactivity have been shown to improve the learning experience, but these solutions require setting up a custom recording environment and are not well integrated with the playback environment. This paper describes our integrated ITSS environment and evaluates the ease of authoring and playback of our interactive programming tutorials. Our environment is designed to run within the browser sandbox and records interactivity actions less intrusively. We develop a recording approach that tracks the author's interactivity actions (e.g., typing code, highlighting words, scrolling panels) in the browser and stores them in text and audio formats. We replay these actions using the recorded artefacts so that learners get a more interactive, integrated and realistic playback of the author's actions instead of watching video frames. Our design goals are 1) efficient recording and playback, 2) extensible interactivity features to help students learn better, and 3) a scalable web-based environment. In our first user study, 20 participants who carried out authoring tasks agreed that it is efficient and easy to author interactive videos in our environment with no additional software needed. In our second user study, 84 students using the environment agreed that the increased interactivity can help them learn better than a video-based tutorial. Our performance test shows that the environment can scale to support up to 500 concurrent users. We hope our open-source environment enables more educators to create interactive programming tutorials.
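
As a rough, hypothetical sketch of the text-based record-and-replay idea (the real recorder runs in the browser and also captures audio), the snippet below logs timestamped action events to a JSON-lines file and replays them with their original relative timing; the event names and fields are illustrative only, not the ITSS format.

```python
# Hypothetical record-and-replay sketch; event names and fields are illustrative.
import json, time

class ActionRecorder:
    """Append timestamped interactivity events to a JSON-lines text log."""
    def __init__(self, path="session.jsonl"):
        self.path, self.t0 = path, time.monotonic()
    def log(self, action, payload):
        with open(self.path, "a") as f:
            f.write(json.dumps({"t": time.monotonic() - self.t0,
                                "action": action, "payload": payload}) + "\n")

def replay(path="session.jsonl", apply=print):
    """Re-issue each recorded action after its original relative delay."""
    last = 0.0
    with open(path) as f:
        for line in f:
            ev = json.loads(line)
            time.sleep(max(0.0, ev["t"] - last))
            last = ev["t"]
            apply(ev["action"], ev["payload"])

rec = ActionRecorder()
rec.log("type_code", {"text": "print('hello')"})
rec.log("highlight", {"word": "print"})
replay()
```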

The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Two recent trends stand out: 1) recommender systems must be trained in a timely manner to cope with ever-growing new products and ever-changing user interests from online marketing and social networks; 2) state-of-the-art recommendation models introduce DNN modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot keep up with these two trends, and GPU-centric training has become a trending approach. However, we observe that GPU devices are underutilized when training recommender systems, and they do not attain the throughput improvements seen in CV and NLP workloads. This issue can be explained by two characteristics of these recommendation models: first, they contain up to a thousand input feature fields, introducing fragmentary and memory-intensive operations; second, the multiple constituent feature-interaction submodules introduce a substantial number of small-sized compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We leverage the model structure and data distribution to unleash the potential of the hardware through our packing, interleaving, and caching optimizations. Experiments show that PICASSO increases hardware utilization by an order of magnitude over SOTA baselines and brings up to a 6x throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the walltime of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.
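
The packing optimization can be sketched generically: instead of issuing one tiny lookup kernel per feature field, the indices are offset into a single concatenated table and gathered once. The NumPy sketch below illustrates only this idea; the shapes, names, and the use of NumPy are assumptions and do not reflect PICASSO's implementation.

```python
# Illustrative "packing" sketch: fuse many tiny per-field embedding lookups
# (fragmentary, memory-bound operations) into one gather over a packed table.
import numpy as np

num_fields, batch, dim, rows = 1000, 256, 16, 1000
tables = [np.random.rand(rows, dim).astype(np.float32) for _ in range(num_fields)]
ids = np.random.randint(0, rows, size=(num_fields, batch))

# Unpacked: one small lookup per feature field (here, 1000 separate operations).
out_unpacked = np.stack([tables[f][ids[f]] for f in range(num_fields)])

# Packed: concatenate the tables once, offset the ids per field, gather once.
packed_table = np.concatenate(tables, axis=0)          # (num_fields * rows, dim)
offsets = (np.arange(num_fields) * rows)[:, None]      # shift ids into the packed table
out_packed = packed_table[(ids + offsets).ravel()].reshape(num_fields, batch, dim)

assert np.allclose(out_unpacked, out_packed)
```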

Federated Learning promises a new approach to resolving challenges in machine learning by bringing computation to the data. The popularity of the approach has led to rapid progress in the algorithmic aspects and the emergence of systems capable of simulating Federated Learning. State-of-the-art Federated Learning systems rely on a single-node aggregator, which is insufficient for training across a large corpus of devices or for training larger models. As the model size or the number of devices increases, the single-node aggregator incurs memory and computation burdens while performing fusion tasks. It also faces communication bottlenecks when a large number of model updates are sent to a single node. We classify the workload for the aggregator into categories and propose a new aggregation service for handling each load. Our aggregation service is based on a holistic approach that chooses the best solution depending on the model update size and the number of clients. Our system provides a fault-tolerant, robust and efficient aggregation solution utilizing existing parallel and distributed frameworks. Through evaluation, we show the shortcomings of state-of-the-art approaches and how a single solution is not suitable for all aggregation requirements. We also provide a comparison of current frameworks with our system through extensive experiments.
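
The abstract leaves the fusion operation itself abstract; a common baseline that any aggregator, single-node or distributed, has to perform is example-count-weighted averaging of client updates (FedAvg-style). The sketch below is generic and hypothetical, not the paper's aggregation service.

```python
# Generic FedAvg-style fusion: weighted average of client model updates,
# weighted by each client's number of training examples. Illustrative only.
import numpy as np

def aggregate(updates):
    """updates: list of (weights_dict, num_examples) from participating clients."""
    total = sum(n for _, n in updates)
    fused = {}
    for name in updates[0][0]:
        fused[name] = sum(w[name] * (n / total) for w, n in updates)
    return fused

clients = [({"layer0": np.ones((4, 4)) * i}, 10 * (i + 1)) for i in range(3)]
print(aggregate(clients)["layer0"][0, 0])   # weighted mean of 0, 1, 2 -> ~1.333
```

The scaling issue described above arises because this reduction, and the network traffic feeding it, concentrates on one node as the number of clients or the model size grows.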

In this work, we present CEDR, a Compiler-integrated, Extensible Domain-Specific System-on-Chip Runtime ecosystem that facilitates research on the challenges of architecture, system software, and application development, with distinct plug-and-play integration points in a unified compile-time and run-time workflow. We demonstrate the utility of CEDR on the Xilinx Zynq MPSoC-ZCU102 for evaluating the performance of pre-silicon hardware in the trade space of SoC configuration, scheduling policy and workload complexity, based on dynamically arriving workload scenarios composed of real-life signal processing applications scaling to thousands of application instances with FFT and matrix-multiply accelerators. We provide insights into the trade-offs present in this design space through a number of distinct case studies. CEDR is portable and has been deployed and validated on Odroid-XU3, X86 and Nvidia Jetson Xavier based SoC platforms. Taken together, CEDR is a capable environment for enabling research that explores the boundaries of productive application development, resource management heuristic development, and hardware configuration analysis for heterogeneous architectures.

The end of Moore's Law has ushered in a diversity of hardware not seen in decades. Operating system (and system software) portability is accordingly becoming increasingly critical. Simultaneously, there has been tremendous progress in program synthesis. We set out to explore the feasibility of using modern program synthesis to generate the machine-dependent parts of an operating system. Our ultimate goal is to generate new ports automatically from descriptions of new machines. One of the issues involved is writing specifications, both for machine-dependent operating system functionality and for instruction set architectures. We designed two domain-specific languages: Alewife for machine-independent specifications of machine-dependent operating system functionality, and Cassiopea for describing instruction set architecture semantics. Automated porting also requires an implementation. We developed a toolchain that, given an Alewife specification and a Cassiopea machine description, specializes the machine-independent specification to the target instruction set architecture and synthesizes an implementation in assembly language. Using this approach, we demonstrate successful synthesis of a total of 140 OS components from two pre-existing OSes for four real hardware platforms. We also developed several optimization methods for OS-related assembly synthesis to improve scalability. The effectiveness of our languages and ability to synthesize code for all 140 specifications is evidence of the feasibility of program synthesis for machine-dependent OS code. However, many research challenges remain; we also discuss the benefits and limitations of our synthesis-based approach to automated OS porting.

We introduce FRAT, a new proof format for unsatisfiable SAT problems, and its associated toolchain. Compared to DRAT, the FRAT format allows solvers to include more information in proofs to reduce the computational cost of subsequent elaboration to LRAT. The format is easy to parse forward and backward, and it is extensible to future proof methods. The provision of optional proof steps allows SAT solver developers to balance implementation effort against elaboration time, with little to no overhead on solver time. We benchmark our FRAT toolchain against a comparable DRAT toolchain and confirm >84% median reduction in elaboration time and >94% median decrease in peak memory usage.

Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed both on HPC clusters and in the cloud. An important aspect of executing such graphs is the scheduling algorithm that is used. Many scheduling heuristics have been proposed in existing works; nevertheless, they are often tested in oversimplified environments. We provide an extensible simulation environment designed for prototyping and benchmarking task schedulers, which contains implementations of various scheduling algorithms and is open source, so that our results are fully reproducible. We use this environment to perform a comprehensive analysis of workflow scheduling algorithms with a focus on quantifying the effect of scheduling challenges that have so far been mostly neglected, such as delays between scheduler invocations or partially unknown task durations. Our results indicate that the network models used by many previous works can produce results that are off by an order of magnitude in comparison to a more realistic model. Additionally, we show that certain implementation details of scheduling algorithms, which are often neglected, can have a large effect on the scheduler's performance, and they should thus be described in great detail to enable proper evaluation.
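
As a minimal example of the kind of heuristic such a simulator benchmarks, the sketch below implements a simple list scheduler: tasks are assigned in topological order to the worker that gives the earliest finish time. It deliberately ignores network transfer costs and scheduler invocation delays, i.e., exactly the simplifications whose impact the paper quantifies; all names and the toy graph are illustrative.

```python
# Minimal list-scheduling sketch on a task graph; ignores network costs and
# scheduler delays (the simplifications the paper examines). Illustrative only.
from collections import defaultdict

def schedule(durations, deps, num_workers):
    """durations: {task: runtime}; deps: {task: set of predecessor tasks}."""
    indeg = {t: len(deps.get(t, ())) for t in durations}
    children = defaultdict(list)
    for t, preds in deps.items():
        for p in preds:
            children[p].append(t)
    ready = [t for t, d in indeg.items() if d == 0]
    finish, worker_free, plan = {}, [0.0] * num_workers, []
    while ready:
        t = ready.pop(0)
        est = max((finish[p] for p in deps.get(t, ())), default=0.0)
        w = min(range(num_workers), key=lambda i: max(worker_free[i], est))
        start = max(worker_free[w], est)
        finish[t] = start + durations[t]
        worker_free[w] = finish[t]
        plan.append((t, w, start))
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return plan, max(finish.values())

# Toy diamond-shaped graph: a -> {b, c} -> d, scheduled on two workers.
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(schedule({"a": 1, "b": 2, "c": 2, "d": 1}, deps, num_workers=2))
```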
