Although the recent rise and uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in determining vaccine hesitancy, but can be expensive to conduct and do not provide real-time data. At the same time, the advent of social media suggests that it may be possible to get vaccine hesitancy signals at an aggregate level (such as at the level of zip codes) by using machine learning models and socioeconomic (and other) features from publicly available sources. It is an open question at present whether such an endeavor is feasible, and how it compares to baselines that only use constant priors. To our knowledge, a proper methodology and evaluation results using real data have also not been presented. In this article, we present such a methodology and experimental study, using publicly available Twitter data collected over the last year. Our goal is not to devise novel machine learning algorithms, but to evaluate existing and established models in a comparative framework. We show that the best models significantly outperform constant priors, and can be set up using open-source tools.
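As a minimal sketch of what comparing a model against a constant-prior baseline can look like, the Python snippet below predicts the training-set mean for every test zip code and compares its error with that of a model. The data values, the choice of mean absolute error as the metric, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def constant_prior_predictions(train_targets, n):
    """Predict the training-set mean for every test unit (the constant prior)."""
    prior = mean(train_targets)
    return [prior] * n


# Hypothetical zip-code-level hesitancy rates (fractions), for illustration only.
train_rates = [0.20, 0.30, 0.25, 0.35]
test_rates = [0.22, 0.31, 0.28]

# Hypothetical outputs of some fitted model on the same test zip codes.
model_preds = [0.24, 0.30, 0.27]

baseline_preds = constant_prior_predictions(train_rates, len(test_rates))
baseline_mae = mean_absolute_error(test_rates, baseline_preds)
model_mae = mean_absolute_error(test_rates, model_preds)

print(f"constant prior MAE: {baseline_mae:.4f}")
print(f"model MAE:          {model_mae:.4f}")
```

A model is only worth deploying in this setting if its error is clearly below that of the constant prior on held-out data.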
Despite an increasing reliance on fully-automated algorithmic decision-making in our day-to-day lives, human beings still make highly consequential decisions. As frequently seen in business, healthcare, and public policy, recommendations produced by algorithms are provided to human decision-makers to guide their decisions. While there exists a fast-growing literature evaluating the bias and fairness of such algorithmic recommendations, an overlooked question is whether they help humans make better decisions. We develop a statistical methodology for experimentally evaluating the causal impacts of algorithmic recommendations on human decisions. We also show how to examine whether algorithmic recommendations improve the fairness of human decisions and derive the optimal decision rules under various settings. We apply the proposed methodology to preliminary data from the first-ever randomized controlled trial that evaluates the pretrial Public Safety Assessment (PSA) in the criminal justice system. A goal of the PSA is to help judges decide which arrested individuals should be released. On the basis of the preliminary data available, we find that providing the PSA to the judge has little overall impact on the judge's decisions and subsequent arrestee behavior. However, our analysis yields some potentially suggestive evidence that the PSA may help avoid unnecessarily harsh decisions for female arrestees regardless of their risk levels while it encourages the judge to make stricter decisions for male arrestees who are deemed to be risky. In terms of fairness, the PSA appears to increase the gender bias against males while having little effect on any existing racial differences in judges' decisions. Finally, we find that the PSA's recommendations might be unnecessarily severe unless the cost of a new crime is sufficiently high.
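To make the idea of an experimentally estimated effect concrete, the sketch below computes a plain difference in means between cases where the recommendation was shown and cases where it was not. This is the textbook estimator for a randomized experiment, not the methodology developed in the paper, and the data and names are hypothetical.

```python
from statistics import mean


def difference_in_means(outcomes, treated):
    """Mean outcome among treated units minus mean outcome among controls."""
    treated_vals = [y for y, z in zip(outcomes, treated) if z]
    control_vals = [y for y, z in zip(outcomes, treated) if not z]
    return mean(treated_vals) - mean(control_vals)


# Hypothetical toy data: 1 = recommendation shown to the judge, 0 = not shown.
psa_shown = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical outcome: 1 = harsher decision, 0 = release.
harsh_decision = [1, 0, 0, 1, 1, 0, 1, 1]

effect = difference_in_means(harsh_decision, psa_shown)
print(f"estimated effect on the rate of harsh decisions: {effect:+.2f}")
```

Because assignment is randomized, the two groups are comparable on average, which is what lets a simple contrast like this be read causally.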
This paper evaluates how changes in mobility have affected the spread of Covid-19 infections throughout 2020-2021. However, identifying a "clean" causal relation is not an easy task due to a high number of non-observable (behavioral) effects. We suggest using Google Trends and News-based indexes as controls for some of these behavioral effects, and we find that a 1\% increase in residential mobility (i.e. a reduction in overall mobility) significantly reduces both Covid-19 cases (by at least 3.02\% at the one-month horizon) and deaths (by at least 2.43\% at the two-week horizon) over the 2020-2021 sample. We also evaluate the effects of mobility on Covid-19 spread in the restricted sample (2020 only), when vaccines were not available. The effects of reduced mobility on cases and deaths remain observable in the restricted sample (with similar magnitudes for residential mobility) and are cumulatively higher, as the effects of restricting workplace mobility also turn out to be significant: a 1\% decrease in workplace mobility reduces cases by around 1\% and deaths by around 2\%.
As decision-making increasingly relies on machine learning and (big) data, the issue of fairness in data-driven AI systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed, involving fairness-related interventions in the data, learning algorithms and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we provide an overview of real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships among the different attributes, particularly w.r.t. protected attributes and class attributes, using a Bayesian network. For a deeper understanding of bias and fairness in the datasets, we investigate the interesting relationships using exploratory analysis.
The transformation of food delivery businesses to online platforms has gained considerable attention in recent years. This is due to customizable ordering experiences, easy payment methods, fast delivery, and other conveniences. The competition between online food delivery providers has intensified as they seek to attract a wider range of customers. Hence, they need a better understanding of their customers' needs and should be able to predict their purchasing decisions. Machine learning has a significant impact on companies' bottom line. Its techniques are used to construct models and strategies in industries that rely on big data and need a system to evaluate it quickly and effectively. Predictive modeling is a type of machine learning that uses various regression algorithms, analytics, and statistics to estimate the probability of an occurrence. Incorporating predictive models helps online food delivery providers understand their customers. In this study, a dataset collected from 388 consumers in Bangalore, India is used to predict their purchasing decisions. Four prediction models are considered: CART and C4.5 decision trees, random forest, and rule-based classifiers, and their accuracy in assigning the correct class label is evaluated. The findings show that all models perform similarly, but C4.5 outperforms the others with an accuracy of 91.67%.
There are many ways machine learning and big data analytics are used in the fight against the COVID-19 pandemic, including predictions, risk management, diagnostics, and prevention. This study focuses on predicting COVID-19 patient shielding -- identifying and protecting patients who are clinically extremely vulnerable to coronavirus. In particular, it examines techniques used for the multi-label classification of medical text. Using the information published by the United Kingdom NHS and the World Health Organisation, we present a novel approach to predicting COVID-19 patient shielding as a multi-label classification problem. We use publicly available, de-identified ICU medical text data for our experiments. The labels are derived from the published COVID-19 patient shielding data. We present an extensive comparison across 12 multi-label classifiers, from simple binary relevance to neural networks and the most recent transformers. To the best of our knowledge, this is the first comprehensive study in which such a range of multi-label classifiers for medical text is considered. We highlight the benefits of various approaches, and argue that, for the task at hand, both predictive accuracy and processing time are essential.
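Binary relevance, the simplest of the multi-label approaches mentioned above, trains one independent binary classifier per label and returns every label whose classifier fires. The Python sketch below illustrates the decomposition only; the toy token-voting base classifier, the example notes, and the label names are all hypothetical stand-ins and are not taken from the study.

```python
from collections import Counter


class TokenVoteClassifier:
    """Toy binary text classifier: each known token votes for the class
    (positive or negative) in which it appeared in more training documents."""

    def fit(self, docs, targets):
        self.pos, self.neg = Counter(), Counter()
        for doc, is_positive in zip(docs, targets):
            bucket = self.pos if is_positive else self.neg
            bucket.update(set(doc.lower().split()))
        return self

    def predict(self, doc):
        score = 0
        for tok in set(doc.lower().split()):
            # +1 if the token leans positive, -1 if negative, 0 if tied or unseen.
            score += (self.pos[tok] > self.neg[tok]) - (self.pos[tok] < self.neg[tok])
        return score > 0


def fit_binary_relevance(docs, label_sets, labels):
    """Train one independent binary classifier per label."""
    return {
        lab: TokenVoteClassifier().fit(docs, [lab in s for s in label_sets])
        for lab in labels
    }


def predict_labels(models, doc):
    """Return every label whose binary classifier predicts positive."""
    return {lab for lab, model in models.items() if model.predict(doc)}


# Hypothetical notes and labels, for illustration only.
docs = [
    "severe asthma on steroids",
    "chemotherapy for lung cancer",
    "asthma and lung cancer chemotherapy",
    "routine check no issues",
]
label_sets = [{"respiratory"}, {"cancer"}, {"respiratory", "cancer"}, set()]

models = fit_binary_relevance(docs, label_sets, ["respiratory", "cancer"])
print(predict_labels(models, "asthma with lung cancer"))
```

Because each label is handled separately, binary relevance is cheap to train but ignores correlations between labels, which is one reason more elaborate classifiers are worth comparing against it.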
The United Nations identified gender equality as a Sustainable Development Goal in 2015, recognizing the underrepresentation of women in politics as a specific barrier to achieving gender equality. Political systems around the world experience gender inequality across all levels of elected government as fewer women run for office than men. This is due in part to online abuse, particularly on social media platforms like Twitter, where women seeking or in power tend to be targeted with more toxic maltreatment than their male counterparts. In this paper, we present reflections on ParityBOT - the first natural language processing-based intervention designed to affect online discourse for women in politics for the better, at scale. Deployed across elections in Canada, the United States and New Zealand, ParityBOT was used to analyse and classify more than 12 million tweets directed at women candidates and counter toxic tweets with supportive ones. From these elections we present three case studies highlighting the current limitations of, and future research and application opportunities for, using a natural language processing-based system to detect online toxicity, specifically with regard to contextually important microaggressions. We examine the rate of false negatives, where ParityBOT failed to pick up on insults directed at specific high-profile women that would be obvious to human users. We examine the unaddressed harms of microaggressions and the potential of as yet unseen damage they cause for women in these communities, and for progress towards gender equality overall, in light of these technological blind spots. This work concludes with a discussion on the benefits of partnerships between nonprofit social groups and technology experts to develop responsible, socially impactful approaches to addressing online hate.
We consider two or more forecasters each making a sequence of predictions over time and tackle the problem of how to compare them -- either online or post-hoc. In fields ranging from meteorology to sports, forecasters make predictions on different events or quantities over time, and this work describes how to compare them in a statistically rigorous manner. Specifically, we design a nonasymptotic sequential inference procedure for estimating the time-varying difference in forecast quality when using a relatively large class of scoring rules (bounded scores with a linear equivalent). The resulting confidence intervals can be continuously monitored and yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting recent variance-adaptive confidence sequences (CS) to our setting. In the spirit of Shafer and Vovk's game-theoretic probability, the coverage guarantees for our CSs are also distribution-free, in the sense that they make no distributional assumptions whatsoever on the forecasts or outcomes. Additionally, in contrast to a recent preprint by Henzi and Ziegel, we show how to sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time, by designing different e-processes that quantify the evidence at any stopping time. We examine the validity of our methods over their fixed-time and asymptotic counterparts in synthetic experiments and demonstrate their effectiveness in real-data settings, including comparing probability forecasts on Major League Baseball (MLB) games and comparing statistical postprocessing methods for ensemble weather forecasts.
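As a rough illustration of the quantity being tracked, the Python sketch below computes the running average difference in score between two probability forecasters. It uses the Brier score as an example of a bounded scoring rule, which is an assumption made here, and it deliberately omits the confidence sequences and e-processes that give the actual procedure its anytime-valid guarantees. All names and data are hypothetical.

```python
def brier_score(prob, outcome):
    """Squared error of a probability forecast for a binary outcome (lower is better)."""
    return (prob - outcome) ** 2


def running_mean_score_difference(probs_a, probs_b, outcomes):
    """Running average of (Brier score of B minus Brier score of A).

    A positive value at time t means forecaster A has been better on average
    over the first t rounds, since a lower Brier score is better.
    """
    running, total = [], 0.0
    for t, (a, b, y) in enumerate(zip(probs_a, probs_b, outcomes), start=1):
        total += brier_score(b, y) - brier_score(a, y)
        running.append(total / t)
    return running


# Hypothetical forecasts for three binary events, for illustration only.
probs_a = [0.8, 0.3, 0.6]
probs_b = [0.5, 0.5, 0.5]
outcomes = [1, 0, 1]

for t, d in enumerate(running_mean_score_difference(probs_a, probs_b, outcomes), start=1):
    print(f"t={t}: average score difference = {d:+.4f}")
```

A point estimate like this says nothing about uncertainty. Stopping as soon as the running average looks favourable would invalidate a fixed-time interval, which is the problem that monitoring with confidence sequences is meant to address.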
Understanding the trustworthiness of a prediction yielded by a classifier is critical for the safe and effective use of AI models. Prior efforts have been proven to be reliable on small-scale datasets. In this work, we study the problem of predicting trustworthiness on real-world large-scale datasets, where the task is more challenging due to high-dimensional features, diverse visual concepts, and large-scale samples. In such a setting, we observe that the trustworthiness predictors trained with prior-art loss functions, i.e., the cross entropy loss, focal loss, and true class probability confidence loss, are prone to view both correct predictions and incorrect predictions as trustworthy. The reasons are two-fold. Firstly, correct predictions are generally dominant over incorrect predictions. Secondly, due to the data complexity, it is challenging to differentiate the incorrect predictions from the correct ones on real-world large-scale datasets. To improve the generalizability of trustworthiness predictors, we propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other. The proposed loss is evaluated with two representative deep learning models, i.e., Vision Transformer and ResNet, as trustworthiness predictors. We conduct comprehensive experiments and analyses on ImageNet, which show that the proposed loss effectively improves the generalizability of trustworthiness predictors. The code and pre-trained trustworthiness predictors for reproducibility are available at //github.com/luoyan407/predict_trustworthiness.
There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.
We introduce DAiSEE, the largest multi-label video classification dataset, comprising over two-and-a-half million video frames (2,723,882) and 9068 video snippets (about 25 hours of recording) captured from 112 users for recognizing user affective states, including engagement, in the wild. In addition to engagement, it also includes associated affective states of boredom, confusion, and frustration, which are relevant to such applications. The dataset has four levels of labels from very low to very high for each of the affective states, collected using crowd annotators and correlated with a gold standard annotation obtained from a team of expert psychologists. We have also included benchmark results on this dataset using state-of-the-art video classification methods that are available today, and the baselines on each of the labels are included with this dataset. To the best of our knowledge, DAiSEE is the first and largest such dataset in this domain. We believe that DAiSEE will provide the research community with challenges in feature extraction, context-based inference, and development of suitable machine learning methods for related tasks, thus providing a springboard for further research.