
Spanning two decades, the Encyclopedia of DNA Elements (ENCODE) is a collaborative research project that aims to identify all the functional elements in the human and mouse genomes. To best serve the scientific community, all data generated by the consortium are shared through a web portal (//www.encodeproject.org/) with no access restrictions. The fourth and final phase of the project added a diverse set of new samples (including those associated with human disease) and a wide range of new assays aimed at the detection, characterization and validation of functional genomic elements. The ENCODE data portal hosts results from over 23,000 functional genomics experiments and over 800 functional element characterization experiments (including in vivo transgenic enhancer assays, reporter assays and CRISPR screens), along with over 60,000 results of computational and integrative analyses (including imputations, predictions and genome annotations). The ENCODE Data Coordination Center (DCC) is responsible for the development and maintenance of the data portal, along with the implementation and use of the ENCODE uniform processing pipelines to generate uniformly processed data. Here we report recent updates to the data portal. Specifically, we have completely redesigned the home page, improved the search interface, added several new pages that highlight collections of biologically related data (deeply profiled cell lines, immune cells, Alzheimer's disease, RNA-protein interactions, a degron matrix and a matrix of experiments organized by human donors), added single-cell experiments, and enhanced the cart interface for visualization and download of user-selected datasets.
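
The portal also exposes its metadata programmatically. As a rough illustration only (the query parameters and response fields below are assumptions about the portal's public REST interface, not details taken from the abstract), a metadata search for experiments might look like this:

```python
# A minimal sketch of querying the ENCODE portal's search endpoint for experiment
# metadata. Query parameters and response fields are assumptions based on the
# portal's public REST interface, not details given in the abstract.
import requests

BASE_URL = "https://www.encodeproject.org"

def search_experiments(assay_title, limit=5):
    """Return a few experiment records matching an assay title."""
    params = {
        "type": "Experiment",
        "assay_title": assay_title,
        "format": "json",
        "limit": limit,
    }
    response = requests.get(f"{BASE_URL}/search/", params=params,
                            headers={"Accept": "application/json"})
    response.raise_for_status()
    # Search results are returned under the "@graph" key (assumed field name).
    return response.json().get("@graph", [])

if __name__ == "__main__":
    for experiment in search_experiments("ATAC-seq"):
        print(experiment.get("accession"), experiment.get("assay_title"))
```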

Related content

The U.S. Department of Energy (DOE) Office of Integrated Waste Management is planning for the eventual transportation, storage, and disposal of spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from nuclear power plant and DOE sites. The Stakeholder Tool for Assessing Radioactive Transportation (START) is a web-based, geospatial decision-support tool developed for evaluating routing options and other aspects of transporting SNF and HLW, covering rail, truck, barge, and intermodal infrastructure and operations in the continental United States. The verification and validation (V&V) process is intended to independently assess START and provide confidence in its ability to accurately produce the intended results. The V&V process checks the START tool using a variety of methods, ranging from independent hand calculations to comparison of START performance and results against those of other codes. The V&V activity was conducted independently of the START development team, with opportunities to provide feedback and collaborate throughout the process. The V&V analyzed attributes of transportation routes produced by START, including route distance and both the population and the population density captured within buffer zones around routes. Population in the buffer zone, population density in the buffer zone, and route distance were all identified as crucial outputs of the START code and were subject to V&V tasks. Improvements identified through the V&V process include standardizing the underlying population data in START, changing the projection of the population raster data, and revising the methodology used for population density to improve its applicability for expected users. The collaboration also led to suggested improvements to some of the underlying shapefile segments within START.
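
To make the buffer-zone outputs concrete, the sketch below shows the kind of calculation being checked: total population and population density inside a fixed-width corridor around a route. This is not the START methodology itself; the coordinates, buffer width, and population points are illustrative assumptions.

```python
# A minimal sketch (not the START methodology itself) of a buffer-zone check:
# population and population density within a corridor around a route.
from shapely.geometry import LineString, Point

# Route and population centroids in a projected coordinate system (metres).
route = LineString([(0, 0), (5_000, 2_000), (12_000, 2_500)])
population_points = [
    (Point(1_000, 300), 4_200),      # (centroid, population) pairs
    (Point(6_000, 2_100), 9_800),
    (Point(11_500, 6_000), 15_000),  # outside the corridor
]

buffer_m = 800.0                     # half-width of the corridor
corridor = route.buffer(buffer_m)    # polygon around the route

population = sum(pop for pt, pop in population_points if corridor.contains(pt))
density_per_km2 = population / (corridor.area / 1e6)

print(f"Route length: {route.length / 1000:.1f} km")
print(f"Population in buffer: {population}")
print(f"Population density: {density_per_km2:.1f} people/km^2")
```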

Diversity-aware data are essential for robust modeling of human behavior in context. In addition, because human behavior is of interest for numerous applications, the data must also be reusable across domains to ensure diversity of interpretations. Current data collection techniques allow only a partial representation of the diversity of people and often generate data that are difficult to reuse. To fill this gap, we propose a data collection methodology, embedded in a hybrid machine-artificial intelligence approach, and its related dataset, based on a comprehensive ontological notion of context that enables data reusability. The dataset covers a sample of 158 participants and was collected via the iLog smartphone application. It contains more than 170 GB of subjective and objective data coming from 27 smartphone sensors, associated with 168,095 self-reported annotations of the participants' context. The dataset is highly reusable, as demonstrated by its diverse applications.

Relation extraction (RE) aims to extract relations from sentences and documents. Existing relation extraction models typically rely on supervised machine learning. However, recent studies have shown that many RE datasets are incompletely annotated. This is known as the false negative problem, in which valid relations are falsely annotated as 'no_relation'. Models trained on such data inevitably make similar mistakes during inference. Self-training has proven effective in alleviating the false negative problem. However, traditional self-training is vulnerable to confirmation bias and exhibits poor performance on minority classes. To overcome this limitation, we propose a novel class-adaptive re-sampling self-training framework. Specifically, we re-sample the pseudo-labels for each class according to precision and recall scores. Our re-sampling strategy favors the pseudo-labels of classes with high precision and low recall, which improves the overall recall without significantly compromising precision. We conducted experiments on document-level and biomedical relation extraction datasets, and the results show that our proposed self-training framework consistently outperforms existing competitive methods on the Re-DocRED and ChemDisgene datasets when the training data are incompletely annotated. Our code is released at //github.com/DAMO-NLP-SG/CAST.
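
A toy sketch of class-adaptive re-sampling is given below. The weighting formula (precision times one minus recall) is an illustrative assumption and may differ from the paper's exact scheme, but it captures the stated intent: keep more pseudo-labels for classes with high precision and low recall.

```python
# Toy class-adaptive re-sampling of pseudo-labels (illustrative weighting only).
import random
from collections import defaultdict

def resample_pseudo_labels(pseudo_labels, precision, recall, seed=0):
    """pseudo_labels: list of (example_id, predicted_class) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example_id, cls in pseudo_labels:
        by_class[cls].append(example_id)

    kept = []
    for cls, examples in by_class.items():
        # Classes predicted precisely but missed often get a higher keep-rate,
        # so recall improves without flooding in noisy labels.
        keep_rate = precision.get(cls, 0.0) * (1.0 - recall.get(cls, 1.0))
        n_keep = round(keep_rate * len(examples))
        kept.extend((ex, cls) for ex in rng.sample(examples, n_keep))
    return kept

precision = {"treats": 0.9, "causes": 0.6}
recall = {"treats": 0.4, "causes": 0.8}
pseudo = [(i, "treats") for i in range(10)] + [(i, "causes") for i in range(10, 20)]
print(resample_pseudo_labels(pseudo, precision, recall))
```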

The United States Department of Energy (U.S. DOE) is planning for the transportation, storage, and disposal of spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from commercial nuclear power plants and other U.S. DOE sites. The Stakeholder Tool for Assessing Radioactive Transportation (START) is a web-based, geospatial decision-support tool developed for evaluating routing options and other aspects of transporting SNF and HLW via barge, train, truck, and intermodal surface transport in the continental United States. The verification and validation (V&V) effort is intended to independently assess START and provide confidence in the tool's ability to accurately produce its intended outputs. The results selected for the V&V effort of the START code include those identified as crucial outputs by subject matter experts. Outputs from START such as shapefiles and keyhole markup language (KML) files are analyzed using a geodesic computation on the WGS-84 ellipsoid model. Most of the V&V effort is aimed at examining and comparing the total lengths reported in the various files produced by the START tool. This work also focuses on the development of V&V methodologies for various outputs that could be replicated by the end user on a set of user-defined routes. Over 150 origin-destination pairs were run as part of this effort to test the functionality of the START tool. In addition to presenting results using an independent geodesic computation, this work provides a comparison of the total route lengths between START version 3.3 and the previous release of START (version 3.2.2).
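
As a minimal sketch of such an independent check, a route's geodesic length on the WGS-84 ellipsoid can be recomputed from its ordered waypoints; the waypoints below are illustrative, not an actual START route.

```python
# Independent geodesic length check on the WGS-84 ellipsoid (illustrative waypoints).
from pyproj import Geod

geod = Geod(ellps="WGS84")

# (longitude, latitude) waypoints, e.g. extracted from a shapefile or KML file.
waypoints = [
    (-86.78, 36.16),   # Nashville, TN (approximate)
    (-84.39, 33.75),   # Atlanta, GA (approximate)
    (-81.66, 30.33),   # Jacksonville, FL (approximate)
]

lons, lats = zip(*waypoints)
length_m = geod.line_length(lons, lats)   # geodesic length in metres
print(f"Total route length: {length_m / 1000:.1f} km")
```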

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, eliminating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior art by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets.
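
For readers unfamiliar with the linear-probing protocol used in that evaluation, here is a generic sketch (not Seal's actual code): the pretrained backbone is frozen and only a linear head is trained. The backbone, feature size, and class count are placeholder assumptions.

```python
# Generic linear-probing sketch: freeze the pretrained encoder, train a linear head.
import torch
import torch.nn as nn

NUM_CLASSES = 16      # semantic classes in a driving dataset (assumed)
FEATURE_DIM = 96      # per-point feature size produced by the backbone (assumed)

backbone = nn.Sequential(       # stand-in for a pretrained point-cloud encoder
    nn.Linear(4, FEATURE_DIM), nn.ReLU(), nn.Linear(FEATURE_DIM, FEATURE_DIM)
)
for param in backbone.parameters():
    param.requires_grad = False          # freeze the pretrained representation

linear_head = nn.Linear(FEATURE_DIM, NUM_CLASSES)   # the only trainable part
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

points = torch.randn(2048, 4)            # (x, y, z, intensity) for 2048 points
labels = torch.randint(0, NUM_CLASSES, (2048,))

with torch.no_grad():
    features = backbone(points)          # frozen features
logits = linear_head(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(f"linear-probe loss: {loss.item():.3f}")
```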

Lectures are a learning experience for both students and teachers. Students learn from teachers about the subject material, while teachers learn from students about how to refine their instruction. However, online student feedback is unstructured and abundant, making it challenging for teachers to learn from it and improve. We take a step towards tackling this challenge. First, we contribute a dataset for studying this problem: SIGHT is a large dataset of 288 math lecture transcripts and 15,784 comments collected from the Massachusetts Institute of Technology OpenCourseWare (MIT OCW) YouTube channel. Second, we develop a rubric for categorizing feedback types using qualitative analysis. Qualitative analysis methods are powerful in uncovering domain-specific insights; however, they are costly to apply to large data sources. To overcome this challenge, we propose a set of best practices for using large language models (LLMs) to cheaply classify the comments at scale. We observe a striking correlation between the model's and humans' annotations: categories with consistent human annotations (>$0.9$ inter-rater reliability, IRR) also display higher human-model agreement (>$0.7$), while categories with less consistent human annotations ($0.7$-$0.8$ IRR) correspondingly demonstrate lower human-model agreement ($0.3$-$0.5$). These techniques uncover useful student feedback from thousands of comments, costing around $\$0.002$ per comment. We conclude by discussing exciting future directions for using online student feedback and improving automated annotation techniques for qualitative research.
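
The sketch below illustrates the general shape of LLM-based comment classification in the spirit of this approach, not the paper's actual prompts or rubric. The category list and prompt wording are assumptions; the call uses the OpenAI chat completions client (openai>=1.0) and requires an OPENAI_API_KEY in the environment.

```python
# Hedged sketch of classifying student comments with an LLM (assumed rubric/prompt).
from openai import OpenAI

CATEGORIES = ["teaching style", "clarity", "pacing", "gratitude", "other"]  # assumed rubric

client = OpenAI()

def classify_comment(comment: str) -> str:
    prompt = (
        "Classify the following student comment on a math lecture into exactly one "
        f"of these categories: {', '.join(CATEGORIES)}.\n\n"
        f"Comment: {comment}\n\nAnswer with only the category name."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_comment("The professor explains proofs so clearly, thank you!"))
```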

Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and to improve its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset show performance competitive with models several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. GitHub repository: //github.com/ymcui/Chinese-LLaMA-Alpaca
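
To make the vocabulary-extension step concrete, here is a simplified stand-in (the project itself merges a Chinese SentencePiece vocabulary into LLaMA's tokenizer; this sketch only adds a handful of tokens via the transformers API to show the mechanics). The model name and token list are illustrative assumptions.

```python
# Simplified stand-in for extending a tokenizer and resizing embeddings.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "huggyllama/llama-7b"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_chinese_tokens = ["你好", "世界", "模型", "指令"]   # tiny illustrative subset
num_added = tokenizer.add_tokens(new_chinese_tokens)

# Grow the embedding matrix so the new token ids have trainable vectors,
# then continue pre-training on Chinese text and fine-tune on instructions.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```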

The confluence of Search and Recommendation (S&R) services is a vital aspect of online content platforms like Kuaishou and TikTok. Integrated S&R modeling is a highly intuitive approach adopted by industry practitioners. However, there is a noticeable lack of research in this area within academia, primarily due to the absence of publicly available datasets. Consequently, a substantial gap has emerged between academia and industry regarding research in this field. To bridge this gap, we introduce KuaiSAR, the first large-scale, real-world dataset of integrated Search And Recommendation behaviors, collected from Kuaishou, a leading short-video app in China with over 300 million daily active users. Previous research in this field has predominantly employed publicly available datasets that are semi-synthetic and simulated, with artificially fabricated search behaviors. Distinct from previous datasets, KuaiSAR records genuine user behaviors, the occurrence of each interaction within either the search or the recommendation service, and users' transitions between the two services. This work aids joint modeling of S&R and the utilization of search data for recommenders (and recommendation data for search engines). Additionally, owing to the diverse feedback labels of user-video interactions, KuaiSAR also supports a wide range of other tasks, including intent recommendation, multi-task learning, and long sequential multi-behavior modeling. We believe this dataset will facilitate innovative research and enrich our understanding of S&R service integration in real-world applications.
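
As a hedged sketch of how one might work with such an integrated log, the snippet below builds per-user chronological sequences that preserve the service tag, so transitions between search and recommendation are visible. The column names ("user_id", "timestamp", "service") are assumptions for illustration, not the dataset's actual schema.

```python
# Build per-user search/recommendation sequences from an integrated interaction log.
import pandas as pd

interactions = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2],
    "timestamp": [10, 12, 5, 7, 9],
    "service":   ["search", "rec", "rec", "search", "rec"],
})

# Per-user chronological sequences, keeping the service tag so that models can
# learn transitions between search and recommendation.
sequences = (
    interactions.sort_values("timestamp")
    .groupby("user_id")["service"]
    .apply(list)
)
print(sequences.to_dict())   # {1: ['search', 'rec'], 2: ['rec', 'search', 'rec']}
```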

We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at //bit.ly/2EPbrJs.
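
The triple classification task with verified hard negatives can be illustrated with a toy example: score candidate (head, relation, tail) triples and compare a true triple against a plausible-but-false one. The TransE-style scoring function below is a trivial stand-in for a trained embedding model, and the entities, threshold, and embeddings are illustrative assumptions.

```python
# Toy triple classification with a hard negative (stand-in scoring, not a trained model).
import numpy as np

rng = np.random.default_rng(0)
dim = 8
entity_emb = {e: rng.normal(size=dim) for e in ["Marie_Curie", "physicist", "painter"]}
relation_emb = {"occupation": rng.normal(size=dim)}

def score(head, relation, tail):
    # TransE-style score: higher (less negative) means more plausible.
    return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

positive = ("Marie_Curie", "occupation", "physicist")
hard_negative = ("Marie_Curie", "occupation", "painter")   # plausible but false

threshold = -3.0   # in practice tuned per relation on a validation split
for triple in (positive, hard_negative):
    label = "true" if score(*triple) > threshold else "false"
    print(triple, "->", label)
```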

Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks. Recently, an upgraded version of BERT was released with Whole Word Masking (WWM), which mitigates the drawbacks of masking partial WordPiece tokens when pre-training BERT. In this technical report, we adapt whole word masking to Chinese text, masking whole words instead of individual Chinese characters, which brings another challenge to the Masked Language Model (MLM) pre-training task. The model was trained on the latest Chinese Wikipedia dump. We aim to provide easy extensibility and better performance for Chinese BERT without changing the neural architecture or even the hyper-parameters. The model is verified on various NLP tasks, from sentence level to document level, including sentiment classification (ChnSentiCorp, Sina Weibo), named entity recognition (People Daily, MSRA-NER), natural language inference (XNLI), sentence pair matching (LCQMC, BQ Corpus), and machine reading comprehension (CMRC 2018, DRCD, CAIL RC). Experimental results on these datasets show that whole word masking brings another significant gain. Moreover, we also examine the effectiveness of Chinese pre-trained models: BERT, ERNIE, and BERT-wwm. We release the pre-trained models (both TensorFlow and PyTorch) on GitHub: //github.com/ymcui/Chinese-BERT-wwm
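
A toy illustration of whole word masking for Chinese follows (not the repository's training code). The sentence is pre-segmented into words; under WWM, every character of a sampled word is masked together rather than characters being masked independently. The segmentation and mask rate are illustrative assumptions.

```python
# Toy whole word masking: mask every character of a sampled Chinese word together.
import random

segmented = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词"]
mask_rate = 0.25
rng = random.Random(42)

masked = []
for word in segmented:
    if rng.random() < mask_rate:
        # Whole word masking: replace every character of the word with [MASK].
        masked.extend(["[MASK]"] * len(word))
    else:
        masked.extend(list(word))    # character-level tokens, as in Chinese BERT

print(" ".join(masked))
```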
