The Bayes error rate (BER) is a fundamental concept in machine learning that quantifies the best possible accuracy any classifier can achieve on a fixed probability distribution. Despite years of research on building estimators of lower and upper bounds for the BER, these were usually compared only on synthetic datasets with known probability distributions, leaving two key questions unanswered: (1) How well do they perform on real-world datasets?, and (2) How practical are they? Answering these is not trivial. Apart from the obvious challenge of an unknown BER for real-world datasets, there are two main aspects any BER estimator needs to overcome in order to be applicable in real-world settings: (1) the computational and sample complexity, and (2) the sensitivity and selection of hyper-parameters. In this work, we propose FeeBee, the first principled framework for analyzing and comparing BER estimators on any modern real-world dataset with unknown probability distribution. We achieve this by injecting a controlled amount of label noise and performing multiple evaluations on a series of different noise levels, supported by a theoretical result which allows drawing conclusions about the evolution of the BER. By implementing and analyzing 7 multi-class BER estimators on 6 commonly used datasets of the computer vision and NLP domains, FeeBee allows a thorough study of these estimators, clearly identifying strengths and weaknesses of each, whilst being easily deployable on any future BER estimator.
An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability that implies all confounders can be measured, which is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumptions and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potentials of our approach for causal fairness analysis.
The problem of linear predictions has been extensively studied for the past century under pretty generalized frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $\epsilon$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+\epsilon)$, matching the best-known rates in literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called ``fast rates" under additional assumptions.
In the classic TARGET SAT SELECTION problem, we are asked to minimise the number of nodes to activate so that, after the application of a certain propagation process, all nodes of the graph are active. Bazgan and Chopin [Discrete Optimization}, 14:170--182, 2014] introduced the opposite problem, named HARMLESS SET, in which they ask to maximise the number of nodes to activate such that not a single additional node is activated. In this paper we investigate how sparsity impacts the tractability of HARMLESS SET. Specifically, we answer two open questions posed by the aforementioned authors, namely a) whether the problem is FPT on planar graphs and b) whether it is FPT parametrised by treewidth. The first question can be answered in the positive using existing meta-theorems on sparse classes, and we further show that HARMLESS SET not only admits a polynomial kernel, but that it can be solved in subexponential time. We then answer the second question in the negative by showing that the problem is W[1]-hard when parametrised by a parameter that upper bounds treewidth.
The Yin-Yang dataset was developed for research on biologically plausible error backpropagation and deep learning in spiking neural networks. It serves as an alternative to classic deep learning datasets, especially in early-stage prototyping scenarios for both network models and hardware platforms, for which it provides several advantages. First, it is smaller and therefore faster to learn, thereby being better suited for small-scale exploratory studies in both software simulations and hardware prototypes. Second, it exhibits a very clear gap between the accuracies achievable using shallow as compared to deep neural networks. Third, it is easily transferable between spatial and temporal input domains, making it interesting for different types of classification scenarios.
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable to characterize scaling rates thereof that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are near minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
Transformers are state-of-the-art in a wide range of NLP tasks and have also been applied to many real-world products. Understanding the reliability and certainty of transformer model predictions is crucial for building trustable machine learning applications, e.g., medical diagnosis. Although many recent transformer extensions have been proposed, the study of the uncertainty estimation of transformer models is under-explored. In this work, we propose a novel way to enable transformers to have the capability of uncertainty estimation and, meanwhile, retain the original predictive performance. This is achieved by learning a hierarchical stochastic self-attention that attends to values and a set of learnable centroids, respectively. Then new attention heads are formed with a mixture of sampled centroids using the Gumbel-Softmax trick. We theoretically show that the self-attention approximation by sampling from a Gumbel distribution is upper bounded. We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets. The experimental results demonstrate that our approach: (1) achieves the best predictive performance and uncertainty trade-off among compared methods; (2) exhibits very competitive (in most cases, improved) predictive performance on ID datasets; (3) is on par with Monte Carlo dropout and ensemble methods in uncertainty estimation on OOD datasets.
Substantial efforts have been devoted more recently to presenting various methods for object detection in optical remote sensing images. However, the current survey of datasets and deep learning based methods for object detection in optical remote sensing images is not adequate. Moreover, most of the existing datasets have some shortcomings, for example, the numbers of images and object categories are small scale, and the image diversity and variations are insufficient. These limitations greatly affect the development of deep learning based object detection methods. In the paper, we provide a comprehensive review of the recent deep learning based object detection progress in both the computer vision and earth observation communities. Then, we propose a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, which we name as DIOR. The dataset contains 23463 images and 192472 instances, covering 20 object classes. The proposed DIOR dataset 1) is large-scale on the object categories, on the object instance number, and on the total image number; 2) has a large range of object size variations, not only in terms of spatial resolutions, but also in the aspect of inter- and intra-class size variability across objects; 3) holds big variations as the images are obtained with different imaging conditions, weathers, seasons, and image quality; and 4) has high inter-class similarity and intra-class diversity. The proposed benchmark can help the researchers to develop and validate their data-driven methods. Finally, we evaluate several state-of-the-art approaches on our DIOR dataset to establish a baseline for future research.
Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the properties of the data is crucial in perfecting models. In this paper we analyse exactly which characteristics of a dataset best determine how difficult that dataset is for the task of text classification. We then propose an intuitive measure of difficulty for text classification datasets which is simple and fast to calculate. We show that this measure generalises to unseen data by comparing it to state-of-the-art datasets and results. This measure can be used to analyse the precise source of errors in a dataset and allows fast estimation of how difficult a dataset is to learn. We searched for this measure by training 12 classical and neural network based models on 78 real-world datasets, then use a genetic algorithm to discover the best measure of difficulty. Our difficulty-calculating code ( //github.com/Wluper/edm ) and datasets ( //data.wluper.com ) are publicly available.
Though quite challenging, leveraging large-scale unlabeled or partially labeled images in a cost-effective way has increasingly attracted interests for its great importance to computer vision. To tackle this problem, many Active Learning (AL) methods have been developed. However, these methods mainly define their sample selection criteria within a single image context, leading to the suboptimal robustness and impractical solution for large-scale object detection. In this paper, aiming to remedy the drawbacks of existing AL methods, we present a principled Self-supervised Sample Mining (SSM) process accounting for the real challenges in object detection. Specifically, our SSM process concentrates on automatically discovering and pseudo-labeling reliable region proposals for enhancing the object detector via the introduced cross image validation, i.e., pasting these proposals into different labeled images to comprehensively measure their values under different image contexts. By resorting to the SSM process, we propose a new AL framework for gradually incorporating unlabeled or partially labeled data into the model learning while minimizing the annotating effort of users. Extensive experiments on two public benchmarks clearly demonstrate our proposed framework can achieve the comparable performance to the state-of-the-art methods with significantly fewer annotations.
Person re-identification (re-id) is a critical problem in video analytics applications such as security and surveillance. The public release of several datasets and code for vision algorithms has facilitated rapid progress in this area over the last few years. However, directly comparing re-id algorithms reported in the literature has become difficult since a wide variety of features, experimental protocols, and evaluation metrics are employed. In order to address this need, we present an extensive review and performance evaluation of single- and multi-shot re-id algorithms. The experimental protocol incorporates the most recent advances in both feature extraction and metric learning. To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques. All approaches were evaluated using a new large-scale dataset that closely mimics a real-world problem setting, in addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR, DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CHUK02, CUHK03, RAiD, iLIDSVID, HDA+ and Market1501. The evaluation codebase and results will be made publicly available for community use.