
We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the evaluation process for the user. The tool also makes it easier for researchers to use recently-proposed measures (such as those from the C/W/L framework) alongside traditional measures, potentially encouraging their adoption.
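The "common interface over several evaluation tools" idea can be illustrated with a small, self-contained sketch. It is not the ir-measures API: the provider names (`tool_a`, `tool_b`), the `calc_all` function, and the toy data are all hypothetical, and the two measures are implemented inline only so the example runs without external tools.

```python
from typing import Callable, Dict, List

Qrels = Dict[str, Dict[str, int]]  # query id -> doc id -> relevance label
Run = Dict[str, List[str]]         # query id -> doc ids, best first


def mean_precision_at(k: int) -> Callable[[Qrels, Run], float]:
    """Build a function that averages P@k over the queries in the run."""
    def measure(qrels: Qrels, run: Run) -> float:
        scores = []
        for qid, ranking in run.items():
            rel = qrels.get(qid, {})
            hits = sum(1 for doc in ranking[:k] if rel.get(doc, 0) > 0)
            scores.append(hits / k)
        return sum(scores) / len(scores)
    return measure


def mean_reciprocal_rank(qrels: Qrels, run: Run) -> float:
    """Average of 1/rank of the first relevant document (0 if none)."""
    scores = []
    for qid, ranking in run.items():
        rel = qrels.get(qid, {})
        rr = 0.0
        for rank, doc in enumerate(ranking, start=1):
            if rel.get(doc, 0) > 0:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores)


# Hypothetical providers: each stands in for one external evaluation tool
# and advertises the measures it knows how to compute.
PROVIDERS = {
    "tool_a": {"P@1": mean_precision_at(1), "P@2": mean_precision_at(2)},
    "tool_b": {"RR": mean_reciprocal_rank},
}


def calc_all(measures: List[str], qrels: Qrels, run: Run) -> Dict[str, float]:
    """Route each requested measure to the first provider that supports it."""
    results = {}
    for name in measures:
        for provider in PROVIDERS.values():
            if name in provider:
                results[name] = provider[name](qrels, run)
                break
        else:
            raise ValueError(f"no provider supports measure {name!r}")
    return results


# Toy data, made up for illustration.
qrels = {"q1": {"d1": 1, "d2": 0}, "q2": {"d3": 1}}
run = {"q1": ["d2", "d1"], "q2": ["d3", "d4"]}
results = calc_all(["P@2", "RR"], qrels, run)
```

The caller asks for measures by name and never needs to know which provider handled each one, which is the convenience the abstract describes.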

Related content

TOOLS


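This new edition of the TOOLS conference series revives the tradition of the 50 conferences held from 1989 to 2012. TOOLS originally stood for "Technology of Object-Oriented Languages and Systems" and later grew to cover all innovative aspects of software technology. Many of today's most important software concepts were first introduced there. TOOLS 50+1, held near Kazan, Russia, in 2019, continued the series in the same spirit of innovation, with enthusiasm for all things software, a combination of scientific soundness and industrial applicability, and openness to all trends and communities in the field.

Machine Learning · Learned · AIM · MoDELS

January 28, 2022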

Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark

Yaobin Ling,Pulakesh Upadhyaya,Luyao Chen,Xiaoqian Jiang,Yejin Kim

from arxiv, 52 pages, 8 figures

Developing new drugs for target diseases is a time-consuming and expensive task, so drug repurposing has become a popular topic in the drug development field. As more health claims data become available, many studies have been conducted on these data. Real-world data are noisy and sparse, and have many confounding factors. In addition, many studies have shown that drug effects are heterogeneous across the population. Many advanced machine learning models for estimating heterogeneous treatment effects (HTE) have emerged in recent years and have been applied in the econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility considerations for translating the methodology, using benchmark experiments on healthcare administrative claims data. We also use the benchmark experiments to show how to interpret and evaluate the model when it is applied to healthcare research. By introducing recent HTE techniques to a broad readership in the biomedical informatics community, we expect to promote the wide adoption of causal inference using machine learning. We also expect to demonstrate the feasibility of HTE for personalized drug effectiveness.
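To make "heterogeneous treatment effect" concrete, here is a deliberately simple sketch that estimates a separate effect per subgroup as a difference in mean outcomes. It is not one of the machine learning estimators the tutorial benchmarks, and it is only meaningful under the strong assumption that treatment is unconfounded within each subgroup, which real claims data rarely satisfy. The function name, strata, and numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, float]  # (stratum, treated?, outcome)


def cate_by_stratum(records: Iterable[Record]) -> Dict[str, float]:
    """Difference in mean outcome (treated minus control) per stratum."""
    # For each stratum keep [sum of outcomes, count] for treated and control.
    cells = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for stratum, treated, outcome in records:
        cell = cells[stratum][treated]
        cell[0] += outcome
        cell[1] += 1

    effects = {}
    for stratum, arms in cells.items():
        (t_sum, t_n), (c_sum, c_n) = arms[True], arms[False]
        if t_n and c_n:  # skip strata that lack one of the two arms
            effects[stratum] = t_sum / t_n - c_sum / c_n
    return effects


# Made-up records for illustration only.
records = [
    ("under_65", True, 1.0), ("under_65", True, 3.0),
    ("under_65", False, 1.0), ("under_65", False, 1.0),
    ("65_plus", True, 1.0),
    ("65_plus", False, 2.0), ("65_plus", False, 2.0),
]
effects = cate_by_stratum(records)
```

In this toy data the estimated effect has opposite signs in the two strata, so a single population-wide average would hide the heterogeneity.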

CASES · SimPLe · HTTPS · GROUP · Attention mechanism

January 28, 2022

Comparison of Evaluation Metrics for Landmark Detection in CMR Images

Sven Koehler,Lalith Sharan,Julian Kuhm,Arman Ghanaat,Jelizaveta Gordejeva,Nike K. Simon,Niko M. Grell,Florian André,Sandy Engelhardt

from arxiv, Accepted at Bildverarbeitung für die Medizin (BVM), Informatik aktuell. Springer Vieweg, Wiesbaden 2022

Cardiac Magnetic Resonance (CMR) images are widely used for cardiac diagnosis and ventricular assessment. Extracting specific landmarks like the right ventricular insertion points is important for spatial alignment and 3D modeling. The automatic detection of such landmarks has been tackled by multiple groups using Deep Learning, but relatively little attention has been paid to the failure cases of evaluation metrics in this field. In this work, we extend the public ACDC dataset with additional labels of the right ventricular insertion points and compare different variants of a heatmap-based landmark detection pipeline. In this comparison, we demonstrate very likely pitfalls of apparently simple detection and localisation metrics, which highlights the importance of a clear detection strategy and the definition of an upper limit for localisation-based metrics. Our preliminary results indicate that a combination of different metrics is necessary, as they yield different winners for method comparison. Additionally, they highlight the need for a comprehensive metric description and evaluation standardisation, especially for the error cases where no metrics could be computed or where no lower/upper boundary of a metric exists. Code and labels: //github.com/Cardio-AI/rvip_landmark_detection
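One way such a pitfall can arise is sketched below. This is an illustration under stated assumptions, not the authors' evaluation code: the function names, the cap of 10 units, and the coordinates are hypothetical. A missed landmark is represented as `None`, and the example contrasts a mean localisation error computed over detected landmarks only with one that assigns an upper limit to misses.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Prediction = Optional[Point]  # None means no landmark was reported


def detection_rate(preds: List[Prediction]) -> float:
    """Fraction of cases in which a landmark was reported at all."""
    return sum(p is not None for p in preds) / len(preds)


def mean_error_detected_only(preds: List[Prediction],
                             truths: List[Point]) -> Optional[float]:
    """Mean Euclidean error, silently ignoring missed detections."""
    errors = [math.dist(p, t) for p, t in zip(preds, truths) if p is not None]
    return sum(errors) / len(errors) if errors else None


def mean_error_capped(preds: List[Prediction], truths: List[Point],
                      cap: float) -> float:
    """Mean error where misses and errors above `cap` both count as `cap`."""
    errors = [cap if p is None else min(math.dist(p, t), cap)
              for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)


# Made-up coordinates for illustration only.
truths = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
method_a = [(10.0, 13.0), (20.0, 24.0), (30.0, 35.0)]  # always detects
method_b = [(10.0, 11.0), None, None]                   # detects once

for name, preds in [("A", method_a), ("B", method_b)]:
    print(name,
          detection_rate(preds),
          mean_error_detected_only(preds, truths),
          mean_error_capped(preds, truths, cap=10.0))
```

Method B looks better on the detected-only error because its two misses are excluded, while the capped error and the detection rate favour method A, so the choice of metric decides the winner.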

Automator · MoDELS · Data points · Processing (programming language) · Learned

January 27, 2022

Using Shape Metrics to Describe 2D Data Points

William Franz Lamberti

Traditional machine learning (ML) algorithms, such as multiple regression, require human analysts to make decisions on how to treat the data. These decisions can make the model building process subjective and difficult to replicate for those who did not build the model. Deep learning approaches benefit by allowing the model to learn what features are important once the human analyst builds the architecture. Thus, a method for automating certain human decisions for traditional ML modeling would help to improve the reproducibility and remove subjective aspects of the model building process. To that end, we propose to use shape metrics to describe 2D data to help make analyses more explainable and interpretable. The proposed approach provides a foundation to help automate various aspects of model building in an interpretable and explainable fashion. This is particularly important in applications in the medical community where the 'right to explainability' is crucial. We provide various simulated data sets ranging from probability distributions, functions, and model quality control checks (such as QQ-Plots and residual analyses from ordinary least squares) to showcase the breadth of this approach.

Image captioning · Automator · Pairwise · Better · Correlation coefficient

January 27, 2022

Can Audio Captions Be Evaluated with Image Caption Metrics?

Zelin Zhou,Zhiling Zhang,Xuenan Xu,Zeyu Xie,Mengyue Wu,Kenny Q. Zhu

from arxiv, ICASSP 2022

Automated audio captioning aims at generating textual descriptions for an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics like SPICE and CIDEr, without justifying their suitability in this new domain, which may mislead the development of advanced models. This problem is still unstudied due to the lack of human judgment datasets on caption quality. Therefore, we first construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are established with pairwise comparison instead of absolute rating to achieve better inter-annotator agreement. Current metrics are found to correlate poorly with human annotations on these datasets. To overcome their limitations, we propose a metric named FENSE, where we combine the strength of Sentence-BERT in capturing similarity, and a novel Error Detector to penalize erroneous sentences for robustness. On the newly established benchmarks, FENSE outperforms current metrics by 14-25% accuracy. Code, data and web demo available at: //github.com/blmoistawinde/fense
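A benchmark built from pairwise comparisons naturally scores a metric by how often it prefers the same caption as the human annotators. The sketch below shows that accuracy computation under stated assumptions: `word_overlap` is a toy stand-in metric (not FENSE, SPICE, or CIDEr), the tie-handling rule is a choice made here, and the judgments are invented for illustration.

```python
from typing import Callable, List, Tuple

# (reference caption, candidate A, candidate B, did humans prefer A?)
Judgment = Tuple[str, str, str, bool]


def word_overlap(reference: str, candidate: str) -> float:
    """Toy metric: Jaccard overlap between the two captions' word sets."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    union = ref | cand
    return len(ref & cand) / len(union) if union else 0.0


def pairwise_accuracy(metric: Callable[[str, str], float],
                      judgments: List[Judgment]) -> float:
    """Share of pairs where the metric and the humans prefer the same caption.

    A tie in metric scores is treated as the metric preferring candidate B.
    """
    agree = 0
    for reference, cand_a, cand_b, human_prefers_a in judgments:
        metric_prefers_a = metric(reference, cand_a) > metric(reference, cand_b)
        agree += metric_prefers_a == human_prefers_a
    return agree / len(judgments)


# Invented judgments for illustration only.
judgments = [
    ("a dog barks loudly", "a dog barks", "a car engine starts", True),
    ("rain falls on a roof", "water is dripping on a surface",
     "rain falls on a dog", True),
]
accuracy = pairwise_accuracy(word_overlap, judgments)
```

In the second pair the overlap metric rewards shared words over meaning and disagrees with the assumed human preference, the kind of mismatch that motivates a similarity model combined with an error penalty.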

Design · INFORMS · state-of-the-art · AIM · Example

January 27, 2022

CacheFX: A Framework for Evaluating Cache Security

Daniel Genkin,William Kosasih,Fangfei Liu,Anna Trikalinou,Thomas Unterluggauer,Yuval Yarom

Over the last two decades, the danger of sharing resources between programs has been repeatedly highlighted. Multiple side-channel attacks, which seek to exploit shared components for leaking information, have been devised, mostly targeting shared caching components. In response, the research community has proposed multiple cache designs that aim at curbing the source of side channels. With multiple competing designs, there is a need for assessing the level of security against side-channel attacks that each design offers. In this work we propose CacheFX, a flexible framework for assessing and evaluating the resilience of cache designs to side-channel attacks. CacheFX allows the evaluator to implement various cache designs, victims, and attackers, as well as to exercise them for assessing the leakage of information via the cache. To demonstrate the power of CacheFX, we implement multiple cache designs and replacement algorithms, and devise three evaluation metrics that measure different aspects of the caches: (1) the entropy induced by a memory access; (2) the complexity of building an eviction set; and (3) protection against cryptographic attacks. Our experiments highlight that different security metrics give different insights into designs, making a comprehensive analysis mandatory. For instance, while eviction-set building was fastest for randomized skewed caches, these caches featured lower eviction entropy and higher practical attack complexity. Our experiments show that all non-partitioned designs allow for effective cryptographic attacks. However, in state-of-the-art secure caches, eviction-based attacks are more difficult to mount than occupancy-based attacks, highlighting the need to consider the latter in cache design.

Score · Continuity · Weight · Random variable · Rank

January 26, 2022

Extreme events evaluation using CRPS distributions

Maxime Taillardat,Anne-Laure Fougères,Philippe Naveau,Raphaël de Fondeville

Verification of probabilistic forecasts for extreme events has been a very active field of research, stirred by media and public opinion, which naturally focus their attention on extreme events and easily draw biased conclusions. In this context, classical verification methodologies tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated; the well-known Continuous Ranked Probability Score (CRPS) is no exception. In this paper, we define a formal framework to assess the behavior of forecast evaluation procedures with respect to extreme events, which we use to point out that assessment based on the expectation of a proper score is not suitable for extremes. As an alternative, we propose to study the properties of the CRPS as a random variable using extreme value theory to address extreme events verification. To compare calibrated forecasts, an index is introduced that summarizes the ability of probabilistic forecasts to predict extremes. Its strengths and limitations are discussed using both theoretical arguments and simulations.
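For readers unfamiliar with the score, the CRPS of a forecast distribution F for an observation y can be written as CRPS(F, y) = E|X - y| - (1/2) E|X - X'|, where X and X' are independent draws from F. The sketch below evaluates this standard identity for an ensemble (sample-based) forecast. It illustrates the score itself, not the index proposed in the paper, and the function name and sample values are chosen here for illustration.

```python
from typing import Sequence


def crps_ensemble(samples: Sequence[float], observation: float) -> float:
    """CRPS of the empirical distribution of `samples` at `observation`.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'| with both expectations taken
    over the ensemble members (all ordered pairs, including i == j).
    """
    n = len(samples)
    mean_abs_error = sum(abs(x - observation) for x in samples) / n
    mean_spread = sum(abs(xi - xj) for xi in samples for xj in samples) / (n * n)
    return mean_abs_error - 0.5 * mean_spread


# A three-member ensemble centred on the observation.
score = crps_ensemble([0.0, 1.0, 2.0], observation=1.0)

# A single-member ensemble reduces to the absolute error.
point_score = crps_ensemble([3.0], observation=5.0)
```

Evaluating this score for each forecast-observation pair, rather than only averaging it, gives the sample of CRPS values whose distribution the paper proposes to study.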

MoDELS · Processing (programming language) · Unsupervised learning · Robustness · Unsupervised

January 25, 2022

Evaluating Sensitivity to the Stick-Breaking Prior in Bayesian Nonparametrics

Ryan Giordano,Runjing Liu,Michael I. Jordan,Tamara Broderick

from arxiv, 69 pages. Accepted for submission at Bayesian Analysis

Bayesian models based on the Dirichlet process and other stick-breaking priors have been proposed as core ingredients for clustering, topic modeling, and other unsupervised learning tasks. However, due to the flexibility of these models, the consequences of prior choices can be opaque, and so prior specification can be relatively difficult. At the same time, prior choice can have a substantial effect on posterior inferences. Thus, considerations of robustness need to go hand in hand with nonparametric modeling. In the current paper, we tackle this challenge by exploiting the fact that variational Bayesian methods, in addition to having computational advantages in fitting complex nonparametric models, also yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models. In particular, we demonstrate how to assess the sensitivity of conclusions to the choice of concentration parameter and stick-breaking distribution for inferences under Dirichlet process mixtures and related mixture models. We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.
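As background on the prior being perturbed, the standard stick-breaking construction for a Dirichlet process draws v_k from Beta(1, alpha) and sets the k-th mixture weight to v_k times the stick length left over by earlier breaks. The sketch below samples truncated weights this way. It does not implement the paper's variational sensitivity analysis, and the function name, truncation level, and alpha values are assumptions made for illustration.

```python
import random
from typing import List


def stick_breaking_weights(alpha: float, num_sticks: int,
                           rng: random.Random) -> List[float]:
    """Sample truncated stick-breaking weights with Beta(1, alpha) breaks.

    The last component takes whatever stick is left, so the weights of the
    truncated approximation sum to one.
    """
    weights = []
    remaining = 1.0
    for _ in range(num_sticks - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remainder
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


rng = random.Random(0)
# A smaller concentration parameter tends to put more mass on early sticks.
weights_small_alpha = stick_breaking_weights(alpha=0.5, num_sticks=10, rng=rng)
weights_large_alpha = stick_breaking_weights(alpha=5.0, num_sticks=10, rng=rng)
```

Replacing Beta(1, alpha) with another distribution on (0, 1) gives a different stick-breaking prior, which is the kind of change whose effect on posterior conclusions the paper sets out to quantify.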

MoDELS · Model evaluation · NLP · Extensibility · Identifiable

May 8, 2020

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Marco Tulio Ribeiro,Tongshuang Wu,Carlos Guestrin,Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
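The flavour of a behavioral test can be shown with a small sketch: generate many cases from a template, state the expected behavior, and report the failure rate. This does not use the CheckList software tool. The keyword-based `toy_sentiment` model, the helper names, and the template are all hypothetical and exist only so the example is self-contained.

```python
from itertools import product
from typing import Callable, List, Tuple

POSITIVE = {"love", "like", "enjoy"}
NEGATIVE = {"hate", "dislike"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in model: counts sentiment keywords, ignores negation."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"


def fill_template(template: str, **slots: List[str]) -> List[str]:
    """Expand a template with every combination of the slot values."""
    keys = list(slots)
    combos = product(*(slots[k] for k in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]


def failure_rate(model: Callable[[str], str], cases: List[str],
                 expected: str) -> Tuple[float, List[str]]:
    """Fraction of cases where the model's label differs from `expected`."""
    failures = [case for case in cases if model(case) != expected]
    return len(failures) / len(cases), failures


# Negation capability: a negated positive verb should read as negative.
cases = fill_template("I do not {verb} the {thing}.",
                      verb=["love", "like", "enjoy"],
                      thing=["food", "service"])
rate, failures = failure_rate(toy_sentiment, cases, expected="negative")
```

The toy model fails every negation case even though it would score well on sentences without negation, which is the sort of gap held-out accuracy alone can hide.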

Topic model · MoDELS · Topic · Performer · Correlation coefficient

April 26, 2018

Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation

Shudong Hao,Jordan Boyd-Graber,Michael J. Paul

from arxiv, North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana. June 2018

Multilingual topic models enable document analysis across languages through coherent multilingual summaries of the data. However, there is no standard and effective metric to evaluate the quality of multilingual topics. We introduce a new intrinsic evaluation of multilingual topic models that correlates well with human judgments of multilingual topic coherence as well as performance in downstream applications. Importantly, we also study evaluation for low-resource languages. Because standard metrics fail to accurately measure topic quality when robust external resources are unavailable, we propose an adaptation model that improves the accuracy and reliability of these metrics in low-resource settings.

Re-ID · Metric learning · Person re-identification · Extensibility · Feature extraction

February 14, 2018

A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets

Srikrishna Karanam,Mengran Gou,Ziyan Wu,Angels Rates-Borras,Octavia Camps,Richard J. Radke

from arxiv, Preliminary work on person Re-Id benchmark. S. Karanam and M. Gou contributed equally. 14 pages, 6 figures, 4 tables. For supplementary material, see //robustsystems.coe.neu.edu/sites/robustsystems.coe.neu.edu/files/systems/supmat/ReID_benchmark_supp.zip

Person re-identification (re-id) is a critical problem in video analytics applications such as security and surveillance. The public release of several datasets and code for vision algorithms has facilitated rapid progress in this area over the last few years. However, directly comparing re-id algorithms reported in the literature has become difficult since a wide variety of features, experimental protocols, and evaluation metrics are employed. In order to address this need, we present an extensive review and performance evaluation of single- and multi-shot re-id algorithms. The experimental protocol incorporates the most recent advances in both feature extraction and metric learning. To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques. All approaches were evaluated using a new large-scale dataset that closely mimics a real-world problem setting, in addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR, DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CUHK02, CUHK03, RAiD, iLIDSVID, HDA+ and Market1501. The evaluation codebase and results will be made publicly available for community use.