
Regression testing is an important phase to deliver software with quality. However, flaky tests hamper the evaluation of test results and can increase costs, because a flaky test may pass or fail non-deterministically, and properly identifying the flakiness of a test requires rerunning the test suite multiple times. To cope with this challenge, approaches based on prediction models and machine learning have been proposed. Existing approaches based on the test case vocabulary may be context-sensitive and prone to overfitting, presenting low performance when executed in a cross-project scenario. To overcome these limitations, we investigate the use of test smells as predictors of flaky tests. We conducted an empirical study to understand whether test smells perform well as a classifier to predict flakiness in the cross-project context, and analyzed the information gain of each test smell. We also compared the test smell-based approach with the vocabulary-based one. As a result, we obtained a classifier with reasonable performance (Random Forest, 0.83) for predicting flakiness during the testing phase. This classifier performed better than the vocabulary-based model for cross-project prediction. The Assertion Roulette and Sleepy Test smell types are the ones associated with the best information gain values.
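For illustration, here is a minimal sketch of the kind of smell-based flakiness classifier described above, assuming hypothetical smell-count features and labels obtained by rerunning the suite; the data and feature layout are placeholders, not the study's dataset:

```python
# Minimal sketch (not the paper's pipeline): training a Random Forest to
# predict test flakiness from test-smell features. Features and labels
# below are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per test case, one column per smell
# (e.g., counts for Assertion Roulette, Sleepy Test, and other smell types).
smell_features = np.array([
    [2, 1, 0, 0],   # test with Assertion Roulette and Sleepy Test symptoms
    [0, 0, 1, 0],
    [3, 0, 0, 1],
    [0, 0, 0, 0],
])
is_flaky = np.array([1, 0, 1, 0])  # labels obtained by rerunning the suite

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, smell_features, is_flaky, cv=2, scoring="f1")
print("cross-validated F1:", scores.mean())
```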

Related Content

We expand our effective framework for weak convergence of measures on the real line by showing that effective convergence in the Prokhorov metric is equivalent to effective weak convergence. In addition, we establish a framework for the study of the effective theory of vague convergence of measures. We introduce a uniform notion and a non-uniform notion of vague convergence, and we show that both these notions are equivalent. However, limits under effective vague convergence may not be computable even when they are finite. We give an example of a finite incomputable effective vague limit measure, and we provide a necessary and sufficient condition so that effective vague convergence produces a computable limit. Finally, we determine a sufficient condition for which effective weak and vague convergence of measures coincide. As a corollary, we obtain an effective version of the equivalence between classical weak and vague convergence of sequences of probability measures.
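For reference, the classical (non-effective) notions that this effective framework refines are, roughly, the following; the effective versions additionally impose computability and uniformity requirements:

```latex
% Classical definitions for reference (requires amsmath).
\begin{align*}
  &\text{weak convergence:}  && \mu_n \Rightarrow \mu
     \iff \textstyle\int f\,d\mu_n \to \int f\,d\mu
     \ \text{for every bounded continuous } f,\\
  &\text{vague convergence:} && \text{the same, restricted to continuous } f
     \text{ with compact support},\\
  &\text{Prokhorov metric:}  && \pi(\mu,\nu)
     = \inf\bigl\{\varepsilon > 0 : \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon
       \ \text{and}\ \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon
       \ \text{for all Borel } A \bigr\},
\end{align*}
```

where $A^{\varepsilon}$ denotes the open $\varepsilon$-neighborhood of $A$. Classically, for probability measures on the real line, convergence in the Prokhorov metric is equivalent to weak convergence; the first result above is an effective analogue of that equivalence.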

We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t,i}^2}\right)$ simultaneously for every expert $i$ in a $T$-round $d$-expert problem, where $\ell_{t,i}$ is the loss for expert $i$ in round $t$. Our algorithm is based on the Mirror Descent framework with a correction term and a weighted entropy regularizer. While natural, the algorithm has not been studied before and requires a careful analysis. We also generalize the bound to $O\left(\sqrt{(\ln d)\sum_t (\ell_{t,i}-m_{t,i})^2}\right)$ for any prediction vector $m_t$ that the learner receives, and recover or improve many existing results by choosing different $m_t$. Furthermore, we use the same framework to create a master algorithm that combines a set of base algorithms and learns the best one with little overhead. The new guarantee of our master algorithm allows us to derive many new results for both the expert problem and, more generally, Online Linear Optimization.
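As a purely illustrative sketch (not the paper's algorithm), the following shows an exponential-weights style update with per-expert learning rates and a second-order correction term of the form $\eta_i \ell_{t,i}^2$, the kind of correction alluded to above; the constants and simulated losses are placeholders:

```python
# Illustrative sketch only: exponential weights with per-expert learning
# rates and a second-order correction term. Constants are placeholders.
import numpy as np

d, T = 5, 1000
rng = np.random.default_rng(0)
eta = np.full(d, 0.1)            # per-expert learning rates (placeholder)
log_w = np.zeros(d)              # log-weights over experts

for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                 # play the normalized weight distribution
    loss = rng.uniform(0, 1, d)  # adversary's losses (simulated here)
    # correction term eta_i * loss_i^2 penalizes high-variance experts
    log_w -= eta * (loss + eta * loss ** 2)

print("final distribution over experts:", np.round(p, 3))
```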

In settings ranging from weather forecasts to political prognostications to financial projections, probability estimates of future binary outcomes often evolve over time. For example, the estimated likelihood of rain on a specific day changes by the hour as new information becomes available. Given a collection of such probability paths, we introduce a Bayesian framework -- which we call the Gaussian latent information martingale, or GLIM -- for modeling the structure of dynamic predictions over time. Suppose, for example, that the likelihood of rain in a week is 50%, and consider two hypothetical scenarios. In the first, one expects the forecast to be equally likely to become either 25% or 75% tomorrow; in the second, one expects the forecast to stay constant for the next several days. A time-sensitive decision-maker might select a course of action immediately in the latter scenario, but may postpone their decision in the former, knowing that new information is imminent. We model these trajectories by assuming predictions update according to a latent process of information flow, which is inferred from historical data. In contrast to general methods for time series analysis, this approach preserves important properties of probability paths, such as the martingale structure and an appropriate amount of volatility, and better quantifies future uncertainty around probability paths. We show that GLIM outperforms three popular baseline methods, producing better estimated posterior probability path distributions measured by three different metrics. By elucidating the dynamic structure of predictions over time, we hope to help individuals make more informed choices.
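As a toy illustration of the martingale structure mentioned above (this is not GLIM), note that a forecast defined as the conditional probability of a future binary event is automatically a martingale; the sketch below simulates such a probability path, assuming a Brownian latent signal and the outcome $1\{W_T > 0\}$:

```python
# Toy illustration: the forecast p_t = P(W_T > 0 | W_t) for a standard
# Brownian motion W is a martingale probability path by construction.
import numpy as np
from scipy.stats import norm

T, n_steps = 1.0, 100
dt = T / n_steps
rng = np.random.default_rng(1)

W = np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))   # latent Brownian path
t = np.arange(1, n_steps + 1) * dt
p = norm.cdf(W[:-1] / np.sqrt(T - t[:-1]))              # forecasts before time T

print("initial forecast:", round(p[0], 3), "final forecast:", round(p[-1], 3))
```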

Test smells are known as bad development practices that reflect poor design and implementation choices in software tests. Over the last decade, test smells have been heavily studied to measure their prevalence and impact on test maintainability. However, these studies focused mainly on the unit level, and to this day, the work on system tests that interact with the System Under Test through a Graphical User Interface remains limited. To fill the gap, we conduct an exploratory analysis of test smells occurring in System User Interactive Tests (SUIT). First, based on a multi-vocal literature review, we propose a catalog of 35 SUIT-specific smells. Then, we conduct an empirical analysis to assess the prevalence and refactoring of these smells in 48 industrial test suites and 12 open-source projects. We show that the same types of smells tend to appear in industrial and open-source projects, but the symptoms are not addressed in the same way. Smells such as Obscure Test, Sneaky Checking, and Over Checking show symptoms in more than 70% of the tests. Yet, refactoring actions are much less frequent, with less than 50% of the affected tests ever undergoing refactoring. Interestingly, while refactoring actions are rare, some smells, such as Narcissistic, disappear through the removal of old symptomatic tests and the introduction of new tests not presenting such symptoms.
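To make the notion of a smell symptom concrete, here is a hypothetical heuristic detector (not the catalog's actual detection rules) that flags a potential Over Checking symptom when a SUIT test contains an unusually large number of verification calls; the patterns and threshold are illustrative only:

```python
# Hypothetical heuristic: flag a SUIT test as a potential "Over Checking"
# symptom when it contains many verification calls. Illustrative only.
import re

ASSERTION_PATTERN = re.compile(r"\b(assert\w*|verify\w*|expect\w*)\s*\(")
OVER_CHECKING_THRESHOLD = 10  # illustrative cut-off, not from the paper

def looks_over_checking(test_source: str) -> bool:
    """Return True if the test body contains many verification calls."""
    return len(ASSERTION_PATTERN.findall(test_source)) > OVER_CHECKING_THRESHOLD

example_test = "\n".join(f"verifyText(field_{i});" for i in range(12))
print(looks_over_checking(example_test))  # True for this synthetic example
```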

Background: Due to the finite size of the development sample, predicted probabilities from a risk prediction model are inevitably uncertain. We apply Value of Information methodology to evaluate the decision-theoretic implications of prediction uncertainty. Methods: Adopting a Bayesian perspective, we extend the definition of the Expected Value of Perfect Information (EVPI) from decision analysis to net benefit calculations in risk prediction. In the context of model development, EVPI is the expected gain in net benefit by using the correct predictions as opposed to predictions from a proposed model. We suggest bootstrap methods for sampling from the posterior distribution of predictions for EVPI calculation using Monte Carlo simulations. In a case study, we used subsets of data of various sizes from a clinical trial for predicting mortality after myocardial infarction to show how EVPI changes with sample size. Results: With a sample size of 1,000 and at the pre-specified threshold of 2% on predicted risks, the gains in net benefit from using the proposed and the correct models were 0.0006 and 0.0011, respectively, resulting in an EVPI of 0.0005 and a relative EVPI of 87%. EVPI was zero only at unrealistically high thresholds (>85%). As expected, EVPI declined with larger samples. We summarize an algorithm for incorporating EVPI calculations into the commonly used bootstrap method for optimism correction. Conclusion: Value of Information methods can be applied to explore decision-theoretic consequences of uncertainty in risk prediction and can complement inferential methods when developing risk prediction models. R code for implementing this method is provided.
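A rough sketch of a bootstrap-based EVPI computation under simplifying assumptions (synthetic data, a logistic model, a single fixed threshold; this is not the paper's full algorithm): each bootstrap model's predictions are treated as one posterior draw of the correct risks, and EVPI is the average gain in net benefit from acting on those correct risks instead of the proposed model's predictions.

```python
# Rough sketch under simplifying assumptions; data, model, and helper names
# are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, z = 1000, 0.02                       # sample size and risk threshold
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.5, -0.3, 0.2] - 3.5))))

def net_benefit(correct_risk, treat):
    """Expected net benefit of a treat/no-treat rule given correct risks."""
    return np.mean(treat * (correct_risk - (1 - correct_risk) * z / (1 - z)))

proposed = LogisticRegression().fit(X, y)
p_hat = proposed.predict_proba(X)[:, 1]

gains = []
for _ in range(200):                    # bootstrap draws from the "posterior"
    idx = rng.integers(0, n, n)
    draw = LogisticRegression().fit(X[idx], y[idx]).predict_proba(X)[:, 1]
    nb_correct = net_benefit(draw, draw >= z)     # act on the correct risks
    nb_proposed = net_benefit(draw, p_hat >= z)   # act on the proposed model
    gains.append(nb_correct - nb_proposed)

print("EVPI estimate:", np.mean(gains))
```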

In many applications, such as recommender systems, online advertising, and product search, click-through rate (CTR) prediction is a critical task, because its accuracy has a direct impact on both platform revenue and user experience. In recent years, with the prevalence of deep learning, CTR prediction has been widely studied in both academia and industry, resulting in an abundance of deep CTR models. Unfortunately, there is still a lack of a standardized benchmark and uniform evaluation protocols for CTR prediction. This leads to non-reproducible and even inconsistent experimental results among these studies. In this paper, we present an open benchmark (namely FuxiCTR) for reproducible research and provide a rigorous comparison of different models for CTR prediction. Specifically, we ran over 4,600 experiments for a total of more than 12,000 GPU hours in a uniform framework to re-evaluate 24 existing models on two widely-used datasets, Criteo and Avazu. Surprisingly, our experiments show that many models have smaller differences than expected and are sometimes even inconsistent with what is reported in the literature. We believe that our benchmark could not only allow researchers to gauge the effectiveness of new models conveniently, but also share good practices for fair comparison with the state of the art. We will release all the code and benchmark settings.
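As a generic illustration of what a uniform evaluation protocol looks like (this is not the FuxiCTR API), the snippet below scores hypothetical model outputs against the same labels with a fixed seed and the same metrics (AUC and log loss):

```python
# Generic illustration: every model is scored with the same labels, seed,
# and metrics. Model outputs here are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(42)          # fixed seed shared across models
y_true = rng.binomial(1, 0.2, size=10_000)

def evaluate(name, predicted_ctr):
    print(f"{name}: AUC={roc_auc_score(y_true, predicted_ctr):.4f}, "
          f"LogLoss={log_loss(y_true, predicted_ctr):.4f}")

# Hypothetical model outputs; in practice these come from trained CTR models.
noisy = 0.2 + 0.1 * (y_true - 0.2) + rng.normal(0, 0.05, y_true.size)
evaluate("model_a", np.clip(noisy, 1e-6, 1 - 1e-6))
evaluate("model_b", np.full(y_true.size, 0.2))
```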

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to follow an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skill for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.
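For context, here is a minimal sketch of the standard trajectory-preference objective used in this line of work (the architecture, tensor shapes, and data are placeholders, not the paper's setup): the reward model is trained so that the human-preferred segment receives higher total predicted reward, via a Bradley-Terry style cross-entropy loss.

```python
# Minimal sketch of a preference-based reward-model update; shapes and data
# are placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, segment):          # segment: (batch, steps, obs_dim)
        return self.net(segment).sum(dim=(1, 2))   # total reward per segment

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

seg_a = torch.randn(8, 25, 64)           # placeholder trajectory segments
seg_b = torch.randn(8, 25, 64)
prefers_a = torch.ones(8)                # human labels: segment A preferred

# P(A preferred) = exp(R(A)) / (exp(R(A)) + exp(R(B)))
logits = model(seg_a) - model(seg_b)
loss = nn.functional.binary_cross_entropy_with_logits(logits, prefers_a)
loss.backward()
opt.step()
print("preference loss:", loss.item())
```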

Predictive models of student success in Massive Open Online Courses (MOOCs) are a critical component of effective content personalization and adaptive interventions. In this article we review the state of the art in predictive models of student success in MOOCs and present a categorization of MOOC research according to the predictors (features), prediction (outcomes), and underlying theoretical model. We critically survey work across each category, providing data on the raw data source, feature engineering, statistical model, evaluation method, prediction architecture, and other aspects of these experiments. Such a review is particularly useful given the rapid expansion of predictive modeling research in MOOCs since the emergence of major MOOC platforms in 2012. This survey reveals several key methodological gaps, which include extensive filtering of experimental subpopulations, ineffective student model evaluation, and the use of experimental data which would be unavailable for real-world student success prediction and intervention, which is the ultimate goal of such models. Finally, we highlight opportunities for future research, which include temporal modeling, research bridging predictive and explanatory student models, work which contributes to learning theory, and evaluating long-term learner success in MOOCs.

Neural machine translation (NMT) suffers a performance deficiency when a limited vocabulary fails to cover the source or target side adequately, which happens frequently when dealing with morphologically rich languages. To address this problem, previous work focused on adjusting translation granularity or expanding the vocabulary size. However, morphological information, which could further improve translation quality, remains relatively under-exploited in NMT architectures. We propose a novel method that can not only reduce data sparsity but also model morphology through a simple but effective mechanism. By predicting the stem and suffix separately during decoding, our system achieves an improvement of up to 1.98 BLEU compared with previous work on English to Russian translation. Our method is orthogonal to different NMT architectures and consistently yields improvements across various domains.
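As an illustrative sketch of the idea (not the paper's exact architecture), a decoder can predict the stem and the suffix with two separate output heads instead of a single softmax over full surface forms; the vocabulary sizes and the decoder state below are placeholders:

```python
# Illustrative sketch: two output heads over a stem vocabulary and a suffix
# vocabulary. Sizes and the decoder state are placeholders.
import torch
import torch.nn as nn

hidden_size, n_stems, n_suffixes = 512, 30_000, 1_000

stem_head = nn.Linear(hidden_size, n_stems)
suffix_head = nn.Linear(hidden_size, n_suffixes)

decoder_state = torch.randn(1, hidden_size)   # placeholder decoder output
stem_id = stem_head(decoder_state).argmax(dim=-1)
suffix_id = suffix_head(decoder_state).argmax(dim=-1)
print("predicted (stem id, suffix id):", stem_id.item(), suffix_id.item())
# The surface word is then recovered by concatenating the stem string with
# its suffix string, e.g. a Russian stem plus an inflectional ending.
```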
