Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. One of the most pressing questions facing the AI community concerns the capabilities and limits of these models, and evaluating generative AI remains challenging. Most studies of generative Large Language Models (LLMs) are restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages. We present MEGA, the first comprehensive benchmarking of generative LLMs, which evaluates models on standard NLP benchmarks covering 8 diverse tasks and 33 typologically diverse languages. We also compare generative LLMs to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform relative to the previous generation of LLMs. We present a thorough analysis of model performance across languages and discuss some of the reasons why generative LLMs are currently not optimal for all languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.
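In practice, a benchmark of this shape reduces to a loop over (task, language, prompt) triples. Below is a minimal sketch of such a prompting-based multilingual evaluation harness; the `query_llm` stub, the task mapping, and exact-match scoring are illustrative assumptions, not MEGA's actual implementation.

```python
# Minimal sketch of a prompting-based multilingual evaluation harness.
# `query_llm` is a hypothetical stub standing in for any generative LLM API.
from collections import defaultdict

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a generative LLM."""
    raise NotImplementedError

def accuracy(dataset, template):
    """Exact-match accuracy on one (task, language) dataset."""
    hits = sum(
        query_llm(template.format(**ex)).strip() == ex["answer"].strip()
        for ex in dataset
    )
    return hits / len(dataset)

def run_benchmark(tasks):
    """`tasks` maps (task_name, language) -> (dataset, prompt_template)."""
    scores = defaultdict(dict)
    for (task, lang), (dataset, template) in tasks.items():
        scores[lang][task] = accuracy(dataset, template)
    return scores  # per-language scores expose cross-lingual gaps
```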
This research article critically examines the potential risks and implications arising from the malicious use of large language models (LLMs), focusing specifically on ChatGPT and Google's Bard. Although these large language models have numerous beneficial applications, their misuse by cybercriminals for creating offensive payloads and tools is a significant concern. In this study, we systematically generated implementable code for the top-10 MITRE Techniques prevalent in 2022 using ChatGPT, and conducted a comparative analysis of its performance against Google's Bard. Our experimentation reveals that ChatGPT has the potential to enable attackers to accelerate the operation of more targeted and sophisticated attacks. Additionally, the technology provides amateur attackers with more capabilities to perform a wide range of attacks and empowers script kiddies to develop customized tools that contribute to the acceleration of cybercrime. Furthermore, LLMs significantly benefit malware authors, particularly ransomware gangs, in generating sophisticated variants of wiper and ransomware attacks with ease. On a positive note, our study also highlights how offensive security researchers and pentesters can make use of LLMs to simulate realistic attack scenarios, identify potential vulnerabilities, and better protect organizations. Overall, we conclude by emphasizing the need for increased vigilance in mitigating the risks associated with LLMs, including implementing robust security measures, increasing awareness and education around the potential risks of this technology, and collaborating with security experts to stay ahead of emerging threats.
Recently, Pretrained Language Models (PLMs) have been serving as general-purpose interfaces, posing a significant demand for comprehensive visual knowledge. However, it remains unclear how well current PLMs and their visually augmented counterparts (VaLMs) can master visual commonsense knowledge. To investigate this, we propose ImageNetVC, a fine-grained, human-annotated dataset specifically designed for zero-shot visual commonsense evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve into the fundamental visual commonsense knowledge of both unimodal PLMs and VaLMs, uncovering the scaling law and the influence of the backbone model on VaLMs. Furthermore, we investigate the factors affecting the visual commonsense knowledge of large-scale models, providing insights into the development of language models enriched with visual commonsense knowledge. Our code and dataset are available at https://github.com/hemingkx/ImageNetVC.
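Zero-shot commonsense evaluation of unimodal PLMs is commonly run by ranking candidate answers under the language model. The sketch below illustrates that recipe with GPT-2 via Hugging Face Transformers; the question format and candidate set are illustrative assumptions, not the ImageNetVC protocol itself.

```python
# Sketch: score candidate answers to a visual-commonsense question by LM
# log-likelihood, a common recipe for zero-shot evaluation of unimodal PLMs.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Total log-probability of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean NLL per predicted token; scale back to a total.
    return -out.loss.item() * (ids.shape[1] - 1)

question = "Q: What is the typical color of a banana? A:"
candidates = ["yellow", "blue", "red"]
best = max(candidates, key=lambda c: sequence_log_prob(f"{question} {c}"))
print(best)  # a model with this piece of visual commonsense answers "yellow"
```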
Developers spend a significant amount of time editing code for a variety of reasons, such as fixing bugs or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research, due to the diversity of code edits and the difficulty of capturing developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) of code with the knowledge of prior, relevant edits. The generative capability of the LLMs helps address the diversity of code changes, and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings, respectively. In our experiments with two datasets, the knowledge of prior edits boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.
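Conditioning generation on prior edits amounts to serializing relevant before/after pairs into the prompt ahead of the code to be edited. The following is a minimal sketch of one plausible prompt layout; the tags and the `complete` stub are assumptions for illustration, not the paper's exact prompt format.

```python
# Sketch: condition an LLM's edit prediction on prior, relevant edits by
# serializing them into the prompt. `complete` is a hypothetical stub for
# a code-generation LLM (e.g., a Codex-style completion endpoint).

def complete(prompt: str) -> str:
    """Placeholder for a call to a code LLM."""
    raise NotImplementedError

def build_edit_prompt(prior_edits, target_before):
    """Pack (before, after) pairs ahead of the snippet to be edited."""
    parts = []
    for before, after in prior_edits:
        parts.append(f"<before>\n{before}\n<after>\n{after}\n")
    parts.append(f"<before>\n{target_before}\n<after>\n")
    return "\n".join(parts)

prior = [("def add(a, b):\n    return a - b",
          "def add(a, b):\n    return a + b")]
prompt = build_edit_prompt(prior, "def mul(a, b):\n    return a + b")
# predicted_after = complete(prompt)
```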
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FActScore (Factual precision in Atomicity Score), a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FActScores of biographies of people generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT achieves only 58%). Since human evaluation is costly, we also introduce an automated model that estimates FActScore using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs, an evaluation that would have cost $26K if performed by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are among the best public models.
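At its core, the metric is the fraction of atomic facts supported by a knowledge source, averaged over generations. A minimal sketch follows, assuming hypothetical `extract_atomic_facts` and `is_supported` components (in the paper these roles are played by an LM-based fact decomposer and a retrieval-plus-LM verifier).

```python
# Sketch of the FActScore computation: decompose each generation into atomic
# facts and report the fraction supported by a reliable knowledge source.
# `extract_atomic_facts` and `is_supported` are hypothetical components.
from typing import Callable, List

def factscore(
    generations: List[str],
    extract_atomic_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    per_generation = []
    for text in generations:
        facts = extract_atomic_facts(text)
        if facts:
            supported = sum(is_supported(f) for f in facts)
            per_generation.append(supported / len(facts))
    return sum(per_generation) / len(per_generation)

# E.g., a biography decomposed into 10 atomic facts, 6 of which are verifiable
# against the knowledge source, contributes a score of 0.6.
```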
To fully evaluate the overall performance of different NLP models in a given domain, many evaluation benchmarks have been proposed, such as GLUE, SuperGLUE, and CLUE. The field of natural language understanding has traditionally focused on benchmarks for various tasks in languages such as Chinese and English, as well as multilingual settings; however, little attention has been given to classical Chinese, also known as "wen yan wen", which has a rich history spanning thousands of years and holds significant cultural and academic value. For the prosperity of the NLP community, in this paper we introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese, covering sentence classification, sequence labeling, reading comprehension, and machine translation. We evaluate existing pre-trained language models, all of which struggle with this benchmark. We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on classical Chinese NLU. The GitHub repository is https://github.com/baudzhou/WYWEB.
Hate speech is a severe issue that affects many online platforms. So far, several studies have been performed to develop robust hate speech detection systems. Large language models like ChatGPT have recently shown great potential in performing several tasks, including hate speech detection. However, it is crucial to comprehend the limitations of these models in order to build more robust detection systems. Thus, to bridge this gap, our study aims to evaluate the weaknesses of the ChatGPT model in detecting hate speech at a granular level across 11 languages. In addition, we investigate how complex emotional expression, such as the use of emojis in hate speech, influences the performance of the ChatGPT model. Through our analysis, we examine the errors made by the model, shedding light on its shortcomings in detecting certain types of hate speech and highlighting the need for further research and improvements in hate speech detection.
Assessing student answers and providing valuable feedback is crucial for effective learning, but it can be a time-consuming task. Traditional methods of automating student answer assessment through text classification often suffer from issues such as a lack of trustworthiness and transparency, and an inability to provide a rationale for the automated assessment process. These limitations hinder their usefulness in practice. In this paper, we explore using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation under both zero-shot and few-shot settings. We introduce a critic module that automatically filters incorrect outputs from ChatGPT and utilizes the remaining ChatGPT outputs as noisy labelled data to fine-tune a smaller language model, enabling it to perform student answer scoring and rationale generation. Moreover, by drawing multiple samples from ChatGPT outputs, we are able to compute predictive confidence scores, which in turn can be used to identify corrupted data and human label errors in the training set. Our experimental results demonstrate that, despite being a few orders of magnitude smaller than ChatGPT, the fine-tuned language model achieves better performance in student answer scoring. Furthermore, it generates more detailed and comprehensible assessments than traditional text classification methods. Our approach provides a viable solution for explainable automated assessment in education.
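The predictive confidence in such a pipeline can be read off the agreement among multiple sampled outputs. The sketch below shows a majority-vote version of this idea under simplified assumptions: scores are treated as discrete labels, and `sample_chatgpt` is a hypothetical stub for a temperature-sampled API call.

```python
# Sketch: estimate predictive confidence by sampling ChatGPT several times
# and measuring agreement; low-confidence items are flagged as potentially
# corrupted data or human label errors. `sample_chatgpt` is a hypothetical
# stub for a temperature>0 call to the ChatGPT API.
from collections import Counter

def sample_chatgpt(prompt: str) -> str:
    """Placeholder returning one sampled score label."""
    raise NotImplementedError

def score_with_confidence(prompt: str, n_samples: int = 10):
    labels = [sample_chatgpt(prompt) for _ in range(n_samples)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / n_samples  # majority label and its vote share

# Items whose confidence falls below a threshold (say 0.6) can be dropped
# from the noisy training set before fine-tuning the smaller model.
```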
Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale lidar data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, contemporary benchmarks focus on only a few common classes (e.g., pedestrian and car) and neglect many rare classes in-the-tail (e.g., debris and stroller). However, AVs must still detect rare classes to ensure safe operation. Moreover, semantic classes are often organized within a hierarchy; for example, tail classes such as child and construction-worker are arguably subclasses of pedestrian. Such hierarchical relationships are often ignored, which may lead to misleading estimates of performance and missed opportunities for algorithmic innovation. We address these challenges by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates on all classes, including those in-the-tail. We evaluate and innovate upon popular 3D detection codebases, such as CenterPoint and PointPillars, adapting them for LT3D. We develop hierarchical losses that promote feature sharing across common-vs-rare classes, as well as improved detection metrics that award partial credit to "reasonable" mistakes respecting the hierarchy (e.g., mistaking a child for an adult). Finally, we point out that fine-grained tail-class accuracy is particularly improved via multimodal fusion of RGB images with lidar; simply put, small fine-grained classes are challenging to identify from sparse lidar geometry alone, suggesting that multimodal cues are crucial to long-tailed 3D detection. Our modifications improve accuracy by 5% AP on average for all classes, and dramatically improve AP for rare classes (e.g., stroller AP improves from 3.6 to 31.6)! Our code is available at https://github.com/neeharperi/LT3D.
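The partial-credit idea can be made concrete with a small class hierarchy: a detection labeled with a sibling of the true class under the same superclass earns a reduced reward rather than zero. The sketch below illustrates that scoring rule; the hierarchy and the 0.5 weight are illustrative assumptions, not the paper's exact metric.

```python
# Sketch: award partial credit to mistakes that respect a class hierarchy,
# e.g., predicting "adult" for a "child" is better than predicting "car".
# The hierarchy and the 0.5 weight are illustrative, not the LT3D metric.
HIERARCHY = {
    "child": "pedestrian",
    "adult": "pedestrian",
    "construction-worker": "pedestrian",
    "car": "vehicle",
    "truck": "vehicle",
}

def detection_credit(predicted: str, true: str) -> float:
    if predicted == true:
        return 1.0  # exact match: full credit
    parent_p, parent_t = HIERARCHY.get(predicted), HIERARCHY.get(true)
    if parent_p is not None and parent_p == parent_t:
        return 0.5  # sibling under the same superclass: partial credit
    return 0.0      # unrelated class: no credit

assert detection_credit("adult", "child") == 0.5
assert detection_credit("car", "child") == 0.0
```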
Large language models (LLMs), such as ChatGPT, are prone to generating hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge. To understand what types of content, and to what extent, LLMs are apt to hallucinate, we introduce the Hallucination Evaluation for Large Language Models (HELMA) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing and alleviating hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. Specifically, we first adopt two different sampling methods to generate hallucinated samples based on instructions, and then use an example-enhanced filtering method to select the best one. Furthermore, we also engage human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is indeed likely to generate hallucinations and that existing LLMs face great challenges in recognizing hallucinations in text. In addition, performance can be improved by providing external knowledge or adding reasoning steps. Our benchmark can be accessed at https://github.com/RUCAIBox/HELMA.
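The two-step framework amounts to over-generating candidate hallucinations and keeping the most plausible one. The sketch below shows that control flow, with `sample_hallucination` and `filter_score` as hypothetical stand-ins for the paper's sampling methods and example-enhanced filter.

```python
# Sketch of the sampling-then-filtering framework: over-generate candidate
# hallucinated samples per instruction, then keep the highest-scoring one.
# Both stubs are hypothetical stand-ins for the ChatGPT-based components.
from typing import List

def sample_hallucination(instruction: str, method: str) -> str:
    """Placeholder: prompt ChatGPT to produce one hallucinated sample."""
    raise NotImplementedError

def filter_score(instruction: str, candidate: str) -> float:
    """Placeholder: example-enhanced filter rating a candidate's quality."""
    raise NotImplementedError

def generate_sample(instruction: str, methods: List[str]) -> str:
    candidates = [sample_hallucination(instruction, m) for m in methods]
    return max(candidates, key=lambda c: filter_score(instruction, c))

# e.g., generate_sample(instruction, methods=["method_a", "method_b"])
```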
We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video-language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components, including two single-modal encoders, a cross encoder, and a decoder, all with the Transformer backbone. We first pre-train our model to learn universal representations for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves the performance of both understanding and generation tasks and achieves state-of-the-art results.
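The four-component layout maps directly onto standard Transformer modules. A minimal PyTorch skeleton is sketched below; the layer counts and dimensions are arbitrary placeholders rather than UniViLM's actual configuration.

```python
# Minimal PyTorch skeleton of the four-component layout: two single-modal
# encoders, a cross encoder, and a decoder. Dimensions and depths are
# illustrative placeholders, not UniViLM's actual configuration.
import torch
import torch.nn as nn

class UniViLMSkeleton(nn.Module):
    def __init__(self, d_model=512, nhead=8, depth=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), depth)
        self.video_encoder = nn.TransformerEncoder(layer(), depth)
        self.cross_encoder = nn.TransformerEncoder(layer(), depth)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), depth)

    def forward(self, text_emb, video_emb, target_emb):
        t = self.text_encoder(text_emb)    # single-modal text encoder
        v = self.video_encoder(video_emb)  # single-modal video encoder
        fused = self.cross_encoder(torch.cat([t, v], dim=1))  # cross encoder
        return self.decoder(target_emb, fused)  # decoder for generation

model = UniViLMSkeleton()
out = model(torch.randn(2, 16, 512), torch.randn(2, 32, 512),
            torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```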