Background: Machine Learning (ML) methods are increasingly used to automate different activities of Continuous Integration (CI), e.g., Test Case Prioritization (TCP). However, ML models need frequent retraining as a result of changes in the CI environment, commonly known as data drift, and continuously retraining ML models consumes considerable time and effort. Hence, there is an urgent need to identify and evaluate suitable approaches that can reduce the retraining effort and time for ML models used for TCP in CI environments. Aims: This study aims to investigate the performance of data drift detection techniques for automatically detecting the retraining points of ML models for TCP in CI environments, without requiring detailed knowledge of the software projects. Method: We employed the Hellinger distance to identify changes in both the values and the distribution of input data, and leveraged these changes as retraining points for the ML model. We evaluated the efficacy of this method on multiple datasets and compared the APFDc and NAPFD evaluation metrics against models that were regularly retrained, with careful consideration of the statistical methods. Results: Our experimental evaluation of the Hellinger distance-based method demonstrated its efficacy and efficiency in detecting retraining points and reducing the associated costs, although its performance may vary across datasets. Conclusions: Our findings suggest that data drift detection methods can assist in identifying retraining points for ML models in CI environments while significantly reducing the required retraining time. These methods can help practitioners who lack specialized knowledge of the software projects to maintain ML model accuracy.
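For intuition, a minimal sketch of a Hellinger distance-based drift trigger is shown below. The binning scheme, the threshold of 0.1, and the per-feature comparison are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

def drift_detected(reference, window, bins=20, threshold=0.1):
    """Compare a new window of a feature against a reference sample.

    Both samples are binned over a common range so the two histograms are
    directly comparable; `bins` and `threshold` are illustrative choices.
    """
    lo = min(reference.min(), window.min())
    hi = max(reference.max(), window.max())
    p, _ = np.histogram(reference, bins=bins, range=(lo, hi))
    q, _ = np.histogram(window, bins=bins, range=(lo, hi))
    return hellinger(p / p.sum(), q / q.sum()) > threshold

# Usage: retrain the TCP model only at detected drift points.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=5000)   # training-time feature sample
new = rng.normal(0.8, 1.2, size=1000)   # shifted CI data
if drift_detected(ref, new):
    print("Drift detected: schedule model retraining")
```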
There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations -- instead of potentially more insightful and robust causal relations. To support analyzing purely observational data for causal relations, and to assess any differences between purely predictive and causal models of the same data, this paper discusses some novel techniques based on structural causal models (such as directed acyclic graphs of causal Bayesian networks). Using these techniques, one can rigorously express, and partially validate, causal hypotheses; and then use the causal information to guide the construction of a statistical model that captures genuine causal relations -- such that correlation does imply causation. We apply these ideas to analyzing public data about programmer performance in Code Jam, a large world-wide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant's performance in the contest. While the overall effect associated with programming languages is weak compared to other variables -- regardless of whether we consider correlational or causal links -- we found considerable differences between a purely associational and a causal analysis of the very same data. The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than with just purely predictive techniques -- where genuine causal effects may be confounded.
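As a minimal illustration of turning a causal graph into an estimate, the sketch below applies the backdoor adjustment formula to a single hypothetical confounder; the variables (`language`, `skill`, `score`) and the simulated data are illustrative assumptions, not the paper's actual model:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: `skill` confounds both language choice and score.
rng = np.random.default_rng(0)
n = 10000
skill = rng.integers(0, 3, size=n)  # confounder
language = np.where(rng.uniform(size=n) < 0.3 + 0.2 * skill, "C++", "Python")
score = 10 * skill + (language == "C++") * 1.0 + rng.normal(size=n)
df = pd.DataFrame({"skill": skill, "language": language, "score": score})

def backdoor_adjusted_mean(df, lang):
    """E[score | do(language = lang)] via the backdoor criterion:
    sum over skill strata of E[score | language, skill] * P(skill)."""
    p_skill = df["skill"].value_counts(normalize=True)
    cond = df[df["language"] == lang].groupby("skill")["score"].mean()
    return sum(cond.get(s, 0.0) * p for s, p in p_skill.items())

naive = df.groupby("language")["score"].mean()
causal = {l: backdoor_adjusted_mean(df, l) for l in ("C++", "Python")}
print("naive gap:", naive["C++"] - naive["Python"])       # confounded
print("adjusted gap:", causal["C++"] - causal["Python"])  # close to 1.0
```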
This research proposes a machine learning-based attack detection model for power systems, specifically targeting smart grids. By utilizing data and logs collected from Phasor Measurement Units (PMUs), the model aims to learn system behaviors and effectively identify potential security boundaries. The proposed approach involves the crucial stages of dataset pre-processing, feature selection, model creation, and evaluation. To validate our approach, we used a dataset consisting of 15 separate datasets obtained from different PMUs, together with relay logs and Snort alarms. Three machine learning models (Random Forest, Logistic Regression, and K-Nearest Neighbour) were built and evaluated using various performance metrics. The findings indicate that the Random Forest model achieves the highest performance, with an accuracy of 90.56% in detecting power system disturbances, and has the potential to assist operators in decision-making processes.
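A minimal sketch of the model-creation and evaluation stage follows, assuming the 15 PMU datasets have been merged into one labeled table; the file name `pmu_events.csv`, the `label` column, and the hyperparameters are placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical loading step: merged PMU measurements plus relay/Snort features.
df = pd.read_csv("pmu_events.csv")
X = df.drop(columns=["label"])
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(f"accuracy: {accuracy_score(y_te, pred):.4f}")
print(classification_report(y_te, pred))
```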
We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under moderate sample size. Estimators based on the minimum distance and non-parametric maximum likelihood (NPMLE) methods correct these issues, but are computationally expensive with complexity growing exponentially with dimension. Extending the approach of Barbehenn and Zhao (2022), in this work we construct monotone estimators based on empirical risk minimization (ERM) that retain similar theoretical guarantees and can be computed much more efficiently. Adapting the idea of offset Rademacher complexity (Liang et al., 2015) to the non-standard loss and function class in empirical Bayes, we show that the shape-constrained ERM estimator attains the minimax regret within constant factors in one dimension and within logarithmic factors in multiple dimensions.
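For concreteness, the classical Robbins estimator criticized above can be written in a few lines as $\hat\theta(x) = (x+1)\,N(x+1)/N(x)$, where $N(x)$ counts observations equal to $x$; the sketch below (with illustrative simulated data, not the paper's ERM estimator) shows that its output need not be monotone in $x$:

```python
from collections import Counter

import numpy as np

def robbins(samples):
    """Classical Robbins estimator for Poisson means:
    theta_hat(x) = (x + 1) * N(x + 1) / N(x).
    The result is often non-monotone in x, the shortcoming noted above."""
    counts = Counter(samples)
    return {x: (x + 1) * counts.get(x + 1, 0) / counts[x] for x in counts}

rng = np.random.default_rng(1)
theta = rng.gamma(2.0, 2.0, size=2000)  # latent means drawn from a prior
obs = rng.poisson(theta)                # observed Poisson counts
print(sorted(robbins(obs).items())[:8])  # inspect for non-monotonicity
```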
Speeding has been acknowledged as a critical determinant in increasing the risk of crashes and their resulting injury severities. This paper demonstrates that severe speeding-related crashes within the state of Pennsylvania exhibit a spatial clustering trend; accordingly, four crash datasets are extracted from four hotspot districts. Two log-likelihood ratio (LR) tests were conducted to determine whether speeding-related crashes classified by hotspot district should be modeled separately. The results suggest that separate modeling is necessary. To capture unobserved heterogeneity, four correlated random parameter ordered models with heterogeneity in means are employed to explore the factors contributing to the severity of crashes involving at least one speeding vehicle. Overall, the findings show that some indicators exhibit spatial instability, including hit-pedestrian crashes, head-on crashes, speed limits, work zones, light conditions (dark), rural areas, older drivers, running stop signs, and running red lights. Moreover, drunk driving, exceeding the speed limit, and being unbelted exhibit relative spatial stability across the four district models. This paper provides insights into preventing speeding-related crashes and can potentially facilitate the development of corresponding crash injury mitigation policies.
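A minimal sketch of the pooled-versus-separate LR test described above is given below; the log-likelihood and parameter-count values are placeholders, not results from the paper:

```python
from scipy.stats import chi2

def lr_test_separate_models(ll_pooled, ll_by_district, k_pooled, k_by_district):
    """Log-likelihood ratio test of a pooled model against district-specific
    models: LR = -2 * [LL(pooled) - sum of LL(district)], chi-square with
    df equal to the difference in parameter counts."""
    stat = -2.0 * (ll_pooled - sum(ll_by_district))
    dof = sum(k_by_district) - k_pooled
    return stat, dof, chi2.sf(stat, dof)

# Placeholder likelihoods and parameter counts, for illustration only.
stat, dof, p = lr_test_separate_models(
    ll_pooled=-5410.2,
    ll_by_district=[-1290.5, -1331.8, -1355.1, -1370.4],
    k_pooled=18, k_by_district=[18, 18, 18, 18])
print(f"LR={stat:.1f}, df={dof}, p={p:.4f}")  # small p favors separate models
```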
The automatic detection of hate speech online is an active research area in NLP. Most studies to date are based on social media datasets that serve as training data for hate speech detection models. However, data creation processes contain their own biases, and models inherently learn these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison in which we fine-tune language models on different hate speech detection datasets. This analysis shows that some datasets are more generalisable than others when used as training data. Crucially, our experiments show that combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling for data size and comparing against the best individual datasets.
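A rough sketch of the fine-tuning setup is shown below; the model name and the `tweet_eval` corpus are placeholder choices, not necessarily the resources used in the paper, and the cross-dataset experiment would swap the evaluation split for a corpus from a different source:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # placeholder backbone
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

train_ds = load_dataset("tweet_eval", "hate", split="train")
# For the cross-dataset comparison, replace this with a corpus from a
# different source; here we simply hold out the same corpus's test split.
eval_ds = load_dataset("tweet_eval", "hate", split="test")

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length",
               max_length=128)

train_ds = train_ds.map(encode, batched=True)
eval_ds = eval_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
print(trainer.evaluate())
```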
Since the 1950s, machine translation (MT) has been one of the important tasks in the research and development of AI, and it has experienced several periods and stages of development, including rule-based methods, statistical methods, and the recently proposed neural network-based learning methods. Accompanying these staged leaps has been research on the evaluation of MT, which has played an especially important role in statistical and neural translation research. The task of MT evaluation is not only to assess the quality of machine translation, but also to give timely feedback to researchers on the problems of machine translation itself and on how to improve and optimise it. In some practical application fields, such as when reference translations are absent, quality estimation of machine translation plays an important role as an indicator of the credibility of automatically translated target languages. This report covers the following: a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the cutting-edge progress, including human evaluation, automatic evaluation, and the evaluation of evaluation methods (meta-evaluation). Both human and automatic evaluation include reference-translation-based and reference-translation-independent approaches; automatic evaluation methods include traditional n-gram string matching, models applying syntax and semantics, and deep learning models; the evaluation of evaluation methods includes estimating the credibility of human evaluations, the reliability of automatic evaluations, the reliability of the test set, etc. Advances in cutting-edge evaluation methods include task-based evaluation, the use of pre-trained language models based on big data, and lightweight optimisation models using distillation techniques.
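To make the "traditional n-gram string matching" family concrete, here is a minimal modified n-gram precision, the building block of BLEU-style automatic metrics; this is a generic illustration, not a metric proposed in the report:

```python
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    """Modified n-gram precision: clipped n-gram overlap between a
    hypothesis translation and a reference, divided by the number of
    hypothesis n-grams."""
    hyp = Counter(tuple(hyp_tokens[i:i + n])
                  for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in hyp.items())
    total = sum(hyp.values())
    return overlap / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "the cat is sitting on the mat".split()
for n in (1, 2):
    print(f"{n}-gram precision: {ngram_precision(hyp, ref, n):.2f}")
```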
Analyzing observational data from multiple sources can be useful for increasing statistical power to detect a treatment effect; however, practical constraints such as privacy considerations may restrict individual-level information sharing across data sets. This paper develops federated methods that only utilize summary-level information from heterogeneous data sets. Our federated methods provide doubly-robust point estimates of treatment effects as well as variance estimates. We derive the asymptotic distributions of our federated estimators, which are shown to be asymptotically equivalent to the corresponding estimators from the combined, individual-level data. We show that to achieve these properties, federated methods should be adjusted based on conditions such as whether models are correctly specified and stable across heterogeneous data sets.
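As a simplified illustration of the federated idea (a stand-in, not the paper's exact estimator), the sketch below computes a doubly-robust AIPW summary locally at each site and combines only the site-level summaries by inverse-variance weighting:

```python
import numpy as np

def aipw_site_summary(y, a, m1, m0, e):
    """Doubly-robust (AIPW) scores computed locally at one site; only the
    (mean, variance-of-mean) summary leaves the site.
    y: outcomes, a: treatment indicators, m1/m0: outcome-model predictions
    under treatment/control, e: propensity scores."""
    psi = m1 - m0 + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e)
    return psi.mean(), psi.var(ddof=1) / len(psi)

def federated_estimate(summaries):
    """Combine site-level (estimate, variance) pairs by inverse-variance
    weighting; returns the federated point estimate and its variance."""
    est = np.array([s[0] for s in summaries])
    var = np.array([s[1] for s in summaries])
    w = 1.0 / var
    return float((w * est).sum() / w.sum()), float(1.0 / w.sum())

# Each site runs aipw_site_summary on its own data (with its own fitted
# outcome models and propensities); a coordinator then calls
# federated_estimate on the collected summaries.
```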
Evaluating the quality of learned representations without relying on a downstream task remains one of the challenges in representation learning. In this work, we present the Geometric Component Analysis (GeomCA) algorithm, which evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrate its applicability by analyzing representations obtained from a variety of scenarios, such as contrastive learning models, generative models, and supervised learning models.
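For intuition only, the toy check below compares two representation sets through the connected components of an epsilon-neighborhood graph; it is a simplified stand-in inspired by the geometric flavor of this line of work, not the published GeomCA algorithm, and all parameters are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def component_consistency(R, E, epsilon):
    """Build an epsilon-graph on the union of reference (R) and evaluated
    (E) representations and report the fraction of connected components
    that mix points from both sets."""
    Z = np.vstack([R, E])
    adj = csr_matrix(cdist(Z, Z) <= epsilon)
    n_comp, labels = connected_components(adj, directed=False)
    is_ref = np.arange(len(Z)) < len(R)
    mixed = sum(1 for c in range(n_comp)
                if is_ref[labels == c].any() and (~is_ref)[labels == c].any())
    return mixed / n_comp

rng = np.random.default_rng(0)
R = rng.normal(size=(200, 8))             # e.g., reference representations
E = rng.normal(0.2, 1.0, size=(200, 8))   # e.g., model representations
print(component_consistency(R, E, epsilon=2.5))
```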
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing the estimators such that the error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is strikingly substantial when the causal effects are accounted for correctly.
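To illustrate how confounding inflates a naive estimate, the simulation below contrasts a difference-in-means estimator with inverse-propensity weighting, one standard correction used here as a stand-in for the paper's proposed estimators; the data-generating process is entirely synthetic, with no relation to the firm's proprietary dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic confounded lending data: credit score x drives both the
# approval decision a and repayment y; the true effect of approval is 1.0.
rng = np.random.default_rng(7)
n = 20000
x = rng.normal(size=(n, 1))
a = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(int)
y = 1.0 * a + 2.0 * x[:, 0] + rng.normal(size=n)

naive = y[a == 1].mean() - y[a == 0].mean()  # confounded estimate

# Inverse-propensity weighting with an estimated propensity model.
e = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
ipw = np.mean(a * y / e) - np.mean((1 - a) * y / (1 - e))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f} (truth: 1.00)")
```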
Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy, and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget required, compared with randomized controlled trials. Embracing the rapidly developing machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets and open-source codes are summarized, helping researchers and practitioners explore, evaluate, and apply causal inference methods.