亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.

相關內容

ACM/IEEE第23屆模型驅動工程語言和系統國際會議,是模型驅動軟件和系統工程的首要會議系列,由ACM-SIGSOFT和IEEE-TCSE支持組織。自1998年以來,模型涵蓋了建模的各個方面,從語言和方法到工具和應用程序。模特的參加者來自不同的背景,包括研究人員、學者、工程師和工業專業人士。MODELS 2019是一個論壇,參與者可以圍繞建模和模型驅動的軟件和系統交流前沿研究成果和創新實踐經驗。今年的版本將為建模社區提供進一步推進建模基礎的機會,并在網絡物理系統、嵌入式系統、社會技術系統、云計算、大數據、機器學習、安全、開源等新興領域提出建模的創新應用以及可持續性。 官網鏈接: · 泛函 · 近似 · 總回報 · 噪聲 ·
2024 年 7 月 16 日

Consider the model where we can access a parity function through random uniform labeled examples in the presence of random classification noise. In this paper, we show that approximating the number of relevant variables in the parity function is as hard as properly learning parities. More specifically, let $\gamma:{\mathbb R}^+\to {\mathbb R}^+$, where $\gamma(x) \ge x$, be any strictly increasing function. In our first result, we show that from any polynomial-time algorithm that returns a $\gamma$-approximation, $D$ (i.e., $\gamma^{-1}(d(f)) \leq D \leq \gamma(d(f))$), of the number of relevant variables~$d(f)$ for any parity $f$, we can, in polynomial time, construct a solution to the long-standing open problem of polynomial-time learning $k(n)$-sparse parities (parities with $k(n)\le n$ relevant variables), where $k(n) = \omega_n(1)$. In our second result, we show that from any $T(n)$-time algorithm that, for any parity $f$, returns a $\gamma$-approximation of the number of relevant variables $d(f)$ of $f$, we can, in polynomial time, construct a $poly(\Gamma(n))T(\Gamma(n)^2)$-time algorithm that properly learns parities, where $\Gamma(x)=\gamma(\gamma(x))$. If $T(\Gamma(n)^2)=\exp({o(n/\log n)})$, this would resolve another long-standing open problem of properly learning parities in the presence of random classification noise in time $\exp({o(n/\log n)})$.

The ongoing research scenario for automatic speech recognition (ASR) envisions a clear division between end-to-end approaches and classic modular systems. Even though a high-level comparison between the two approaches in terms of their requirements and (dis)advantages is commonly addressed, a closer comparison under similar conditions is not readily available in the literature. In this work, we present a comparison focused on the label topology and training criterion. We compare two discriminative alignment models with hidden Markov model (HMM) and connectionist temporal classification topology, and two first-order label context ASR models utilizing factored HMM and strictly monotonic recurrent neural network transducer, respectively. We use different measurements for the evaluation of the alignment quality, and compare word error rate and real time factor of our best systems. Experiments are conducted on the LibriSpeech 960h and Switchboard 300h tasks.

Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM to a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily result in a superior student compared to one distilled from a smaller teacher. In other words, there is likely an optimal teacher yielding the best student along the scaling course of the teacher. Even worse, the curse of capacity gap can not be lifted without additional compute, as indicated in previous studies. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is impossible to distill a large teacher to a good student without notably additional compute. However, the tale is not ever one-sided. It is always not late to acquire that using a large teacher is resource-demanding. Consequently, instead of sticking to lifting the curse, leaving the curse as is and using a small yet adequate teacher should be arguably fine. Even better, in this paper, we take the spirits of scaling law and reveal that the optimal teacher scale is almost consistently and linearly correlated to the student scale across different model architectures and data scales, fortunately turning the curse into a \textit{law} of capacity gap. The law later guides us to distil a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.

The SETH is a hypothesis of fundamental importance to (fine-grained) parameterized complexity theory and many important tight lower bounds are based on it. This situation is somewhat problematic, because the validity of the SETH is not universally believed and because in some senses the SETH seems to be "too strong" a hypothesis for the considered lower bounds. Motivated by this, we consider a number of reasonable weakenings of the SETH that render it more plausible, with sources ranging from circuit complexity, to backdoors for SAT-solving, to graph width parameters, to weighted satisfiability problems. Despite the diversity of the different formulations, we are able to uncover several non-obvious connections using tools from classical complexity theory. This leads us to a hierarchy of five main equivalence classes of hypotheses, with some of the highlights being the following: We show that beating brute force search for SAT parameterized by a modulator to a graph of bounded pathwidth, or bounded treewidth, or logarithmic tree-depth, is actually the same question, and is in fact equivalent to beating brute force for circuits of depth $\epsilon n$; we show that beating brute force search for a strong 2-SAT backdoor is equivalent to beating brute force search for a modulator to logarithmic pathwidth; we show that beting brute force search for a strong Horn backdoor is equivalent to beating brute force search for arbitrary circuit SAT.

Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to research, we build an empirical testbed comprising natural shifts across 5 tabular datasets and 60,000 method configurations encompassing imbalanced learning and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent on our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature. The performance of robust algorithms varies significantly over shift types, and is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that although often neglected by researchers, implementation details -- such as the choice of underlying model class (e.g., XGBoost) and hyperparameter selection -- have a bigger impact on performance than the ambiguity set or its radius. To further bridge that gap between methodological research and practice, we design case studies that illustrate how such a data-driven, inductive understanding of distribution shifts can enhance both data-centric and algorithmic interventions.

Since the work of Polyanskiy, Poor and Verd\'u on the finite blocklength performance of capacity-achieving codes for discrete memoryless channels, many papers have attempted to find further results for more practically relevant channels. However, it seems that the complexity of computing capacity-achieving codes has not been investigated until now. We study this question for the simplest non-trivial Gaussian channels, i.e., the additive colored Gaussian noise channel. To assess the computational complexity, we consider the classes $\mathrm{FP}_1$ and $\#\mathrm{P}_1$. $\mathrm{FP}_1$ includes functions computable by a deterministic Turing machine in polynomial time, whereas $\#\mathrm{P}_1$ encompasses functions that count the number of solutions verifiable in polynomial time. It is widely assumed that $\mathrm{FP}_1\neq\#\mathrm{P}_1$. It is of interest to determine the conditions under which, for a given $M \in \mathbb{N}$, where $M$ describes the precision of the deviation of $C(P,N)$, for a certain blocklength $n_M$ and a decoding error $\epsilon > 0$ with $\epsilon\in\mathbb{Q}$, the following holds: $R_{n_M}(\epsilon)>C(P,N)-\frac{1}{2^M}$. It is shown that there is a polynomial-time computable $N_*$ such that for sufficiently large $P_*\in\mathbb{Q}$, the sequences $\{R_{n_M}(\epsilon)\}_{{n_M}\in\mathbb{N}}$, where each $R_{n_M}(\epsilon)$ satisfies the previous condition, cannot be computed in polynomial time if $\mathrm{FP}_1\neq\#\mathrm{P}_1$. Hence, the complexity of computing the sequence $\{R_{n_M}(\epsilon)\}_{n_M\in\mathbb{N}}$ grows faster than any polynomial as $M$ increases. Consequently, it is shown that either the sequence of achievable rates $\{R_{n_M}(\epsilon)\}_{n_M\in\mathbb{N}}$ as a function of the blocklength, or the sequence of blocklengths $\{n_M\}_{M\in\mathbb{N}}$ corresponding to the achievable rates, is not a polynomial-time computable sequence.

In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an innovative tokenizer design and propose a corresponding MIM method named ClusterMIM. It demonstrates superior performance on a variety of benchmark datasets and ViT backbones. Code is available at //github.com/PKU-ML/ClusterMIM.

Digital quantum simulation has broad applications in approximating unitary evolution of Hamiltonians. In practice, many simulation tasks for quantum systems focus on quantum states in the low-energy subspace instead of the entire Hilbert space. In this paper, we systematically investigate the complexity of digital quantum simulation based on product formulas in the low-energy subspace. We show that the simulation error depends on the effective low-energy norm of the Hamiltonian for a variety of digital quantum simulation algorithms and quantum systems, allowing improvements over the previous complexities for full unitary simulations even for imperfect state preparations due to thermalization. In particular, for simulating spin models in the low-energy subspace, we prove that randomized product formulas such as qDRIFT and random permutation require smaller Trotter numbers. Such improvement also persists in symmetry-protected digital quantum simulations. We prove a similar improvement in simulating the dynamics of power-law quantum interactions. We also provide a query lower bound for general digital quantum simulations in the low-energy subspace.

The fusion of causal models with deep learning introducing increasingly intricate data sets, such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions of causal data into three distinct categories from the standpoint of causal structure and representation: definite data, semi-definite data, and indefinite data. Definite data chiefly pertains to statistical data used in conventional causal scenarios, while semi-definite data refers to a spectrum of data formats germane to deep learning, including time-series, images, text, and others. Indefinite data is an emergent research sphere inferred from the progression of data forms by us. To comprehensively present these three data paradigms, we elaborate on their formal definitions, differences manifested in datasets, resolution pathways, and development of research. We summarize key tasks and achievements pertaining to definite and semi-definite data from myriad research undertakings, present a roadmap for indefinite data, beginning with its current research conundrums. Lastly, we classify and scrutinize the key datasets presently utilized within these three paradigms.

Rishi Bommasani,Drew A. Hudson,Ehsan Adeli,Russ Altman,Simran Arora,Sydney von Arx,Michael S. Bernstein,Jeannette Bohg,Antoine Bosselut,Emma Brunskill,Erik Brynjolfsson,Shyamal Buch,Dallas Card,Rodrigo Castellon,Niladri Chatterji,Annie Chen,Kathleen Creel,Jared Quincy Davis,Dora Demszky,Chris Donahue,Moussa Doumbouya,Esin Durmus,Stefano Ermon,John Etchemendy,Kawin Ethayarajh,Li Fei-Fei,Chelsea Finn,Trevor Gale,Lauren Gillespie,Karan Goel,Noah Goodman,Shelby Grossman,Neel Guha,Tatsunori Hashimoto,Peter Henderson,John Hewitt,Daniel E. Ho,Jenny Hong,Kyle Hsu,Jing Huang,Thomas Icard,Saahil Jain,Dan Jurafsky,Pratyusha Kalluri,Siddharth Karamcheti,Geoff Keeling,Fereshte Khani,Omar Khattab,Pang Wei Kohd,Mark Krass,Ranjay Krishna,Rohith Kuditipudi,Ananya Kumar,Faisal Ladhak,Mina Lee,Tony Lee,Jure Leskovec,Isabelle Levent,Xiang Lisa Li,Xuechen Li,Tengyu Ma,Ali Malik,Christopher D. Manning,Suvir Mirchandani,Eric Mitchell,Zanele Munyikwa,Suraj Nair,Avanika Narayan,Deepak Narayanan,Ben Newman,Allen Nie,Juan Carlos Niebles,Hamed Nilforoshan,Julian Nyarko,Giray Ogut,Laurel Orr,Isabel Papadimitriou,Joon Sung Park,Chris Piech,Eva Portelance,Christopher Potts,Aditi Raghunathan,Rob Reich,Hongyu Ren,Frieda Rong,Yusuf Roohani,Camilo Ruiz,Jack Ryan,Christopher Ré,Dorsa Sadigh,Shiori Sagawa,Keshav Santhanam,Andy Shih,Krishnan Srinivasan,Alex Tamkin,Rohan Taori,Armin W. Thomas,Florian Tramèr,Rose E. Wang,William Wang,Bohan Wu,Jiajun Wu,Yuhuai Wu,Sang Michael Xie,Michihiro Yasunaga,Jiaxuan You,Matei Zaharia,Michael Zhang,Tianyi Zhang,Xikun Zhang,Yuhui Zhang,Lucia Zheng,Kaitlyn Zhou,Percy Liang
Rishi Bommasani,Drew A. Hudson,Ehsan Adeli,Russ Altman,Simran Arora,Sydney von Arx,Michael S. Bernstein,Jeannette Bohg,Antoine Bosselut,Emma Brunskill,Erik Brynjolfsson,Shyamal Buch,Dallas Card,Rodrigo Castellon,Niladri Chatterji,Annie Chen,Kathleen Creel,Jared Quincy Davis,Dora Demszky,Chris Donahue,Moussa Doumbouya,Esin Durmus,Stefano Ermon,John Etchemendy,Kawin Ethayarajh,Li Fei-Fei,Chelsea Finn,Trevor Gale,Lauren Gillespie,Karan Goel,Noah Goodman,Shelby Grossman,Neel Guha,Tatsunori Hashimoto,Peter Henderson,John Hewitt,Daniel E. Ho,Jenny Hong,Kyle Hsu,Jing Huang,Thomas Icard,Saahil Jain,Dan Jurafsky,Pratyusha Kalluri,Siddharth Karamcheti,Geoff Keeling,Fereshte Khani,Omar Khattab,Pang Wei Kohd,Mark Krass,Ranjay Krishna,Rohith Kuditipudi,Ananya Kumar,Faisal Ladhak,Mina Lee,Tony Lee,Jure Leskovec,Isabelle Levent,Xiang Lisa Li,Xuechen Li,Tengyu Ma,Ali Malik,Christopher D. Manning,Suvir Mirchandani,Eric Mitchell,Zanele Munyikwa,Suraj Nair,Avanika Narayan,Deepak Narayanan,Ben Newman,Allen Nie,Juan Carlos Niebles,Hamed Nilforoshan,Julian Nyarko,Giray Ogut,Laurel Orr,Isabel Papadimitriou,Joon Sung Park,Chris Piech,Eva Portelance,Christopher Potts,Aditi Raghunathan,Rob Reich,Hongyu Ren,Frieda Rong,Yusuf Roohani,Camilo Ruiz,Jack Ryan,Christopher Ré,Dorsa Sadigh,Shiori Sagawa,Keshav Santhanam,Andy Shih,Krishnan Srinivasan,Alex Tamkin,Rohan Taori,Armin W. Thomas,Florian Tramèr,Rose E. Wang,William Wang,Bohan Wu,Jiajun Wu,Yuhuai Wu,Sang Michael Xie,Michihiro Yasunaga,Jiaxuan You,Matei Zaharia,Michael Zhang,Tianyi Zhang,Xikun Zhang,Yuhui Zhang,Lucia Zheng,Kaitlyn Zhou,Percy Liang

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles(e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities,and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

北京阿比特科技有限公司