
LLM benchmarking tasks span an extremely wide array, yet a single number is often the most actionable for decision-making, especially for non-experts. The only existing aggregation schemes are Elo-based, which can be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness), a name that also references a metric widely understood to be an important yet inaccurate and crude measure of car performance. We create two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present a ranking according to our general metric as well as by subdomain. We find significant agreement, measured by raw Pearson correlation, between our scores and those of Chatbot Arena, even improving on the correlation of the MMLU leaderboard to Chatbot Arena.
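The agreement check mentioned in the abstract can be illustrated with a minimal sketch of the Pearson correlation between two score lists. The function name `pearson` and all numbers below are made up for illustration; they are not the paper's data or code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalised standard deviations; the 1/n factors cancel in the ratio.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (dev_x * dev_y)

# Hypothetical per-model values: an aggregate score and an arena-style rating.
aggregate_scores = [0.62, 0.71, 0.55, 0.80]
arena_ratings = [1105.0, 1180.0, 1060.0, 1250.0]
r = pearson(aggregate_scores, arena_ratings)
```

A value of `r` near 1 would indicate that the two leaderboards order and space the models similarly; Pearson correlation is sensitive to the scale of the gaps, not only to rank order.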

Related content


Sample · Feasible · Optimizer · Missing values · Motivation

December 12, 2024

Tests of Missing Completely At Random based on sample covariance matrices

Alberto Bordino, Thomas B. Berrett

from arxiv, 96 pages, 16 figures

We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). We relax the problem of testing MCAR to the problem of testing the compatibility of a collection of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Our first contributions are to define a natural measure of the incompatibility of a collection of correlation matrices, which can be characterised as the optimal value of a Semi-definite Programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By analysing the concentration properties of the natural plug-in estimator for this measure, we propose a novel hypothesis test, which is calibrated via a bootstrap procedure and demonstrates power against any distribution with incompatible covariance matrices. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy tailed. Furthermore, tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest.

Statistic · Identical · Origin · Estimation/Estimator · Performer

December 12, 2024

Assessing the replicability of RCTs in RWE emulations

Jeanette Köppe, Charlotte Micheloud, Stella Erdmann, Rachel Heyard, Leonhard Held

Background: The standard regulatory approach to assess replication success is the two-trials rule, requiring both the original and the replication study to be significant with effect estimates in the same direction. The sceptical p-value was recently presented as an alternative method for the statistical assessment of the replicability of study results. Methods: We compare the statistical properties of the sceptical p-value and the two-trials rule. We illustrate the performance of the different methods using real-world evidence emulations of randomized, controlled trials (RCTs) conducted within the RCT DUPLICATE initiative. Results: The sceptical p-value depends not only on the two p-values, but also on the sample size and effect size of the two studies. It can be calibrated to have the same Type-I error rate as the two-trials rule, but has larger power to detect an existing effect. In the application to the results from the RCT DUPLICATE initiative, the sceptical p-value leads to results qualitatively similar to those of the two-trials rule, but tends to show more evidence for treatment effects. Conclusion: The sceptical p-value represents a valid statistical measure to assess the replicability of study results and is especially useful in the context of real-world evidence emulations.
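The two-trials rule described above can be sketched as follows. This is a minimal illustration that assumes normally distributed effect estimates with known standard errors and the conventional two-sided threshold of 0.05; the function names and the example numbers are hypothetical, and the sceptical p-value itself is not implemented here.

```python
import math

def two_sided_p(estimate, std_err):
    """Two-sided p-value of a z-test for the null hypothesis of zero effect."""
    z = estimate / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def two_trials_rule(est_orig, se_orig, est_rep, se_rep, alpha=0.05):
    """Replication success: both studies significant, effects in the same direction."""
    same_direction = est_orig * est_rep > 0
    both_significant = (
        two_sided_p(est_orig, se_orig) < alpha
        and two_sided_p(est_rep, se_rep) < alpha
    )
    return same_direction and both_significant

# Hypothetical effect estimates (e.g. log hazard ratios) with standard errors.
success = two_trials_rule(est_orig=0.5, se_orig=0.1, est_rep=0.4, se_rep=0.15)
```

Note that the rule only looks at the two significance decisions and the signs, which is the property the abstract contrasts with the sceptical p-value's dependence on sample size and effect size.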

Ensemble · MoDELS · Model evaluation · Understandability · Performer

December 12, 2024

Beyond forecast leaderboards: Measuring individual model importance based on contribution to ensemble accuracy

Minsu Kim, Evan L. Ray, Nicholas G. Reich

from arxiv, 28 pages, 8 figures in the main text; includes supplementary material

Ensemble forecasts often outperform forecasts from individual standalone models, and have been used to support decision-making and policy planning in various fields. As collaborative forecasting efforts to create effective ensembles grow, so does interest in understanding individual models' relative importance in the ensemble. To this end, we propose two practical methods that measure the difference between ensemble performance when a given model is or is not included in the ensemble: a leave-one-model-out algorithm and a leave-all-subsets-of-models-out algorithm, which is based on the Shapley value. We explore the relationship between these metrics, forecast accuracy, and the similarity of errors, both analytically and through simulations. We illustrate this measure of the value a component model adds to an ensemble in the presence of other models using US COVID-19 death forecasts. This study offers valuable insight into individual models' unique features within an ensemble, which standard accuracy metrics alone cannot reveal.
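The leave-one-model-out idea can be sketched under simplifying assumptions: point forecasts, an equal-weight mean ensemble, and mean absolute error as the accuracy measure. These choices, the function names, and the toy data are all illustrative and are not taken from the paper; the Shapley-based variant is not sketched here.

```python
def ensemble_mae(forecasts, truth, members):
    """Mean absolute error of an equal-weight mean ensemble of `members`."""
    errors = []
    for t, observed in enumerate(truth):
        mean_pred = sum(forecasts[m][t] for m in members) / len(members)
        errors.append(abs(mean_pred - observed))
    return sum(errors) / len(errors)

def leave_one_out_importance(forecasts, truth):
    """Importance of each model: ensemble error without it minus error with it.

    Positive values mean the ensemble gets worse when the model is removed.
    Requires at least two models so that the reduced ensemble is non-empty.
    """
    models = sorted(forecasts)
    if len(models) < 2:
        raise ValueError("need at least two models")
    full_error = ensemble_mae(forecasts, truth, models)
    importance = {}
    for m in models:
        others = [k for k in models if k != m]
        importance[m] = ensemble_mae(forecasts, truth, others) - full_error
    return importance

# Toy data: three hypothetical models forecasting three time points.
truth = [10, 12, 14]
forecasts = {
    "A": [10, 12, 14],  # perfectly accurate on its own
    "B": [12, 14, 16],  # biased upward
    "C": [8, 11, 15],   # mixed errors that partly offset B's bias
}
importance = leave_one_out_importance(forecasts, truth)
```

In this toy data, model A is the most accurate standalone model, yet C has the largest leave-one-out importance because its errors offset those of B. This is the kind of effect that a standalone accuracy ranking does not show.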

Factorized · Neural Networks · Learning · Networking · Output

December 11, 2024

Learning incomplete factorization preconditioners for GMRES

Paul Häusner, Aleix Nieto Juscafresa, Jens Sjölund

from arxiv, The first two authors contributed equally, Northern Lights Deep Learning Conference, 15 pages

Incomplete LU factorizations of sparse matrices are widely used as preconditioners in Krylov subspace methods to speed up solving linear systems. Unfortunately, computing the preconditioner itself can be time-consuming and sensitive to hyper-parameters. Instead, we replace the hand-engineered algorithm with a graph neural network that is trained to approximate the matrix factorization directly. To apply the output of the neural network as a preconditioner, we propose an output activation function that guarantees that the predicted factorization is invertible. Further, applying a graph neural network architecture allows us to ensure that the output itself is sparse, which is desirable from a computational standpoint. We theoretically analyze and empirically evaluate different loss functions to train the learned preconditioners and show their effectiveness in decreasing the number of GMRES iterations and improving the spectral properties on synthetic data. The code is available at https://github.com/paulhausner/neural-incomplete-factorization.

Column · Reducible · Factorized · Orthogonal · Reviewer

December 10, 2024

An improved Shifted CholeskyQR based on columns

Yuwei Fan, Haoran Guan, Zhonghua Qiao

Among all the deterministic CholeskyQR-type algorithms, Shifted CholeskyQR3 is specifically designed to address the QR factorization of ill-conditioned matrices. This algorithm introduces a shift parameter $s$ to prevent failure during the initial Cholesky factorization step, making the choice of this parameter critical for the algorithm's effectiveness. Our goal is to identify a smaller $s$ compared to the traditional selection based on $\|X\|_{2}$. In this research, we propose a new matrix norm called the $g$-norm, which is based on the column properties of $X$. This norm allows us to obtain a reduced shift parameter $s$ for the Shifted CholeskyQR3 algorithm, thereby improving the sufficient condition of $\kappa_{2}(X)$ for this method. We provide rigorous proofs of orthogonality and residuals for the improved algorithm using our proposed $s$. Numerical experiments confirm the enhanced numerical stability of orthogonality and residuals with the reduced $s$. We find that Shifted CholeskyQR3 can effectively handle ill-conditioned $X$ with a larger $\kappa_{2}(X)$ when using our reduced $s$ compared to the original $s$. Furthermore, we compare CPU times with other algorithms to assess performance improvements.
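For orientation, the structure of Shifted CholeskyQR3 as it is commonly stated in the literature can be written out as below. This is a sketch of the general scheme, not of this paper's contribution: the abstract does not give the formula for the reduced $s$ or the $g$-norm, so neither is shown. Here $\operatorname{chol}(A)$ is assumed to denote the upper-triangular factor $R$ with $R^{\top} R = A$.

```latex
\begin{align*}
G   &= X^{\top} X, \\
R_1 &= \operatorname{chol}(G + sI), \qquad Q_1 = X R_1^{-1}, \\
R_2 &= \operatorname{chol}(Q_1^{\top} Q_1), \qquad Q_2 = Q_1 R_2^{-1}, \\
R_3 &= \operatorname{chol}(Q_2^{\top} Q_2), \qquad Q = Q_2 R_3^{-1}, \\
R   &= R_3 R_2 R_1, \qquad X = Q R .
\end{align*}
```

The shift only enters the first step, where it keeps $G + sI$ numerically positive definite; a smaller $s$ perturbs the first factor less, which is why the choice of $s$ governs how ill-conditioned an $X$ the method can handle.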

Robustness · Networking · Generalization theory · Neural Networks · SimPLe

December 8, 2024

Revisiting DeepFool: generalization and improvement

Alireza Abdollahpoorrostam, Mahed Abroshan, Seyed-Mohsen Moosavi-Dezfooli

Deep neural networks have been known to be vulnerable to adversarial examples, which are inputs that are modified slightly to fool the network into making incorrect predictions. This has led to a significant amount of research on evaluating the robustness of these networks against such perturbations. One particularly important robustness metric is the robustness to minimal $\ell_2$ adversarial perturbations. However, existing methods for evaluating this robustness metric are either computationally expensive or not very accurate. In this paper, we introduce a new family of adversarial attacks that strike a balance between effectiveness and computational efficiency. Our proposed attacks are generalizations of the well-known DeepFool (DF) attack, while they remain simple to understand and implement. We demonstrate that our attacks outperform existing methods in terms of both effectiveness and computational efficiency. Our proposed attacks are also suitable for evaluating the robustness of large models and can be used to perform adversarial training (AT) to achieve state-of-the-art robustness to minimal $\ell_2$ adversarial perturbations.
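To make "minimal $\ell_2$ adversarial perturbation" concrete, the sketch below uses the closed-form case that the classic DeepFool attack is built on: for an affine binary classifier $f(x) = w \cdot x + b$, the smallest $\ell_2$ perturbation that reaches the decision boundary is $r = -\frac{f(x)}{\|w\|_2^2} w$. This is not the new family of attacks proposed in the paper; the function names, the overshoot value, and the example numbers are illustrative assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def minimal_l2_perturbation(w, b, x, overshoot=0.0):
    """Perturbation moving x onto (or slightly past) the boundary w.x + b = 0.

    With overshoot = 0 the result is the orthogonal projection step onto the
    hyperplane; a small positive overshoot pushes x just across it.
    """
    norm_sq = dot(w, w)
    if norm_sq == 0.0:
        raise ValueError("w must be non-zero")
    scale = -(1.0 + overshoot) * (dot(w, x) + b) / norm_sq
    return [scale * wi for wi in w]

# Illustrative classifier and input.
w, b, x = [3.0, 4.0], -1.0, [2.0, 1.0]
r_exact = minimal_l2_perturbation(w, b, x)                  # lands on the boundary
r_cross = minimal_l2_perturbation(w, b, x, overshoot=0.02)  # crosses it
x_adv = [xi + ri for xi, ri in zip(x, r_cross)]
perturbation_norm = math.sqrt(dot(r_exact, r_exact))        # equals |f(x)| / ||w||
```

For a deep network the classifier is not affine, so DeepFool-style methods apply this step iteratively to a local linearisation of the network.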

Speech recognition · Regularization term · Performance · Automatic speech recognition · Processing (programming language)

December 8, 2024

CR-CTC: Consistency regularization on CTC for improved speech recognition

Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.

Operation · Performer · Confidence · Computational cost · Cost

December 7, 2024

Timely reliable Bayesian decision-making enabled using memristors

Lekai Song, Pengyu Liu, Yang Liu, Jingfang Pei, Wenyu Cui, Songwei Liu, Yingyi Wen, Teng Ma, Kong-Pang Pun, Guohua Hu

Brains perform timely, reliable decision-making via Bayes' theorem. Bayes' theorem quantifies events as probabilities and, through probability rules, renders decisions. Learning from this, applying Bayes' theorem to practical problems can visualize the potential risks and decision confidence, thereby enabling efficient user-scene interactions. However, given its probabilistic nature, implementing Bayes' theorem with conventional deterministic computing can induce excessive computational cost and decision latency. Herein, we propose a probabilistic computing approach using memristors to implement Bayes' theorem. We integrate volatile memristors with Boolean logic and, by exploiting the volatile stochastic switching of the memristors, realize Boolean operations with statistical probabilities and correlations, which are key to enabling Bayes' theorem. To practically demonstrate the effectiveness of our memristor-enabled approach in user-scene interactions, we design lightweight Bayesian inference and fusion operators using our probabilistic logic and apply the operators in road scene parsing for self-driving, including route planning and obstacle detection. The results show that our operators can achieve reliable decisions at a rate of over 2,500 frames per second, outperforming human decision-making and existing driving assistance systems.
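As a reminder of the computation being accelerated, the sketch below evaluates Bayes' theorem for a single binary hypothesis in ordinary deterministic code. It says nothing about the memristor implementation; the obstacle-detection framing, the probabilities, and the 0.5 decision threshold are made-up assumptions.

```python
def posterior(prior, likelihood, false_alarm):
    """P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' theorem."""
    for p in (prior, likelihood, false_alarm):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Total probability of observing the evidence E.
    evidence = likelihood * prior + false_alarm * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return likelihood * prior / evidence

# Hypothetical numbers: an obstacle is present 10% of the time, the detector
# fires for 90% of real obstacles and for 20% of empty scenes.
p_obstacle = posterior(prior=0.1, likelihood=0.9, false_alarm=0.2)
decision = "brake" if p_obstacle > 0.5 else "continue"
```

The posterior doubles as the decision confidence the abstract refers to: the decision comes with a probability attached rather than a bare label.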

Weight · Linear · Analysis · Approximation · Functional

December 6, 2024

On one dimensional weighted Poincare inequalities for Global Sensitivity Analysis

David Heredia, Aldéric Joulin, Olivier Roustant

One-dimensional Poincare inequalities are used in Global Sensitivity Analysis (GSA) to provide derivative-based upper bounds and approximations of Sobol indices. We add new perspectives by investigating weighted Poincare inequalities. Our contributions are twofold. In a first part, we provide new theoretical results for weighted Poincare inequalities, guided by GSA needs. We revisit the construction of weights from monotonic functions, providing a new proof from a spectral point of view. In this approach, given a monotonic function g, the weight is built such that g is the first non-trivial eigenfunction of a convenient diffusion operator. This allows us to reconsider the linear standard, i.e. the weight associated to a linear g. In particular, we construct weights that guarantee the existence of an orthonormal basis of eigenfunctions, leading to approximation of Sobol indices with Parseval formulas. In a second part, we develop specific methods for GSA. We study the equality case of the upper bound of a total Sobol index, and link the sharpness of the inequality to the proximity of the main effect to the eigenfunction. This leads us to theoretically investigate the construction of data-driven weights from estimators of the main effects when they are monotonic, another extension of the linear standard. Finally, we illustrate the benefits of using weights on a GSA study of two toy models and a real flooding application, involving the Poincare constant and/or the whole eigenbasis.
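The inequalities referred to above can be written out in their commonly stated form, as a sketch under the usual assumptions (a probability measure $\mu$ on an interval, sufficiently smooth $f$, independent inputs $X_1, \dots, X_d$ with marginals $\mu_i$). The notation $C_P$, $w$, and $D_i^{\mathrm{tot}}$ is chosen here for illustration and may differ from the paper's.

```latex
\begin{align*}
\text{(Poincar\'e)} \qquad
  \operatorname{Var}_{\mu}(f) &\le C_P(\mu) \int f'(x)^2 \, d\mu(x), \\
\text{(weighted)} \qquad
  \operatorname{Var}_{\mu}(f) &\le C_P(\mu, w) \int f'(x)^2 \, w(x) \, d\mu(x), \\
\text{(total Sobol index)} \qquad
  D_i^{\mathrm{tot}} = \mathbb{E}\!\left[\operatorname{Var}\!\left(f(X) \mid X_{-i}\right)\right]
  &\le C_P(\mu_i) \int \left(\frac{\partial f}{\partial x_i}(x)\right)^{2} d\mu(x).
\end{align*}
```

The third line follows by applying the one-dimensional inequality to $x_i \mapsto f(x)$ with the other coordinates held fixed and then integrating over them, which is what makes the bound derivative-based.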

MoDELS · Model evaluation · NLP · Extensibility · Identifiable

May 8, 2020

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
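The flavour of behavioral testing can be sketched with an invariance-style check: a label-preserving change to the input (here, swapping a city name in a template) should not change the prediction. This does not use the CheckList tool's API; `toy_sentiment` is a deliberately naive, hypothetical stand-in model, and the template and fillers are made up.

```python
def toy_sentiment(text):
    """Hypothetical stand-in model: naive keyword lookup, negative words win."""
    words = set(text.lower().replace(".", "").split())
    if words & {"terrible", "awful", "bad"}:
        return "negative"
    if words & {"great", "good", "excellent"}:
        return "positive"
    return "neutral"

def invariance_failures(model, template, fillers):
    """Return (text, prediction) pairs whose prediction differs from the first case."""
    texts = [template.format(city=city) for city in fillers]
    predictions = [model(text) for text in texts]
    reference = predictions[0]
    return [(t, p) for t, p in zip(texts, predictions) if p != reference]

# Generate test cases from one template; the sentiment should not depend on the city.
failures = invariance_failures(
    toy_sentiment,
    template="The flight to {city} was great.",
    fillers=["Paris", "Chicago", "Bad Homburg"],
)
```

The keyword model flips to "negative" for the city name "Bad Homburg", a failure that held-out accuracy on ordinary reviews would be unlikely to reveal. Templates make it cheap to generate many such cases from a single pattern.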