The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
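To make the quantity being tested concrete, here is a minimal sketch of a naive binned plug-in estimate of the squared $\ell_2$ calibration error for binary predictions. It is not the debiased estimator that T-Cal uses. The equal-width binning, the function name `binned_l2_ece`, and the choice to return the squared error are assumptions made for this illustration.

```python
def binned_l2_ece(confidences, labels, n_bins=10):
    """Naive (biased) plug-in estimate of the squared l2 calibration error.

    confidences: predicted probabilities of the positive class, in [0, 1].
    labels: observed 0/1 outcomes, same length as confidences.
    Equal-width binning is an assumption of this sketch.
    """
    n = len(confidences)
    sum_conf = [0.0] * n_bins
    sum_label = [0.0] * n_bins
    counts = [0] * n_bins

    for p, y in zip(confidences, labels):
        # Map p to a bin index; p == 1.0 goes into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sum_conf[b] += p
        sum_label[b] += y
        counts[b] += 1

    ece_sq = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        # Gap between empirical frequency and mean confidence in the bin.
        gap = sum_label[b] / counts[b] - sum_conf[b] / counts[b]
        ece_sq += (counts[b] / n) * gap ** 2
    return ece_sq
```

On a finite sample this estimate is generally positive even for a calibrated model, which is one reason a debiased statistic and a formal test are needed to separate real mis-calibration from noise.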
Ensuring that a classifier gives reliable confidence scores is essential for informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.
We present a new deep unfolding network for analysis-sparsity-based Compressed Sensing. The proposed network, coined Decoding Network (DECONET), jointly learns a decoder that reconstructs vectors from their incomplete, noisy measurements and a redundant sparsifying analysis operator, which is shared across the layers of DECONET. Moreover, we formulate the hypothesis class of DECONET and estimate its associated Rademacher complexity. Then, we use this estimate to deliver meaningful upper bounds for the generalization error of DECONET. Finally, the validity of our theoretical results is assessed and comparisons to state-of-the-art unfolding networks are made, on both synthetic and real-world datasets. Experimental results indicate that our proposed network outperforms the baselines, consistently for all datasets, and its behaviour complies with our theoretical findings.
The utilization of renewable energy technologies, particularly hydrogen, has seen a boom in interest and has spread throughout the world. Ethanol steam reforming (ESR) is one of the primary methods capable of producing hydrogen efficiently and reliably. This paper provides an in-depth study of the reformulated system both theoretically and numerically, as well as a plan to explore the possibility of converting the system into its conservation form. Lastly, we offer an overview of several numerical approaches for solving the general first-order quasi-linear hyperbolic equation, applied to the particular ESR model. We conclude by presenting some results that would enable these ODE/PDE solvers to be used in non-linear model predictive control (NMPC) algorithms, and we discuss the limitations of our approach and directions for future work.
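As a rough illustration of the kind of solver surveyed for first-order hyperbolic equations, the sketch below applies a first-order upwind scheme to the constant-coefficient advection equation $u_t + a u_x = 0$ with $a > 0$. This is a deliberately simplified stand-in: the ESR model is quasi-linear, and the periodic boundary, the function names, and the choice of scheme are all assumptions of this example rather than details from the paper.

```python
def upwind_step(u, courant):
    """One first-order upwind step for u_t + a u_x = 0 with a > 0.

    u: list of cell values on a periodic grid (periodicity is an assumption).
    courant: a * dt / dx; the scheme is stable for 0 <= courant <= 1.
    """
    n = len(u)
    # For i == 0, u[i - 1] wraps around to the last cell (periodic boundary).
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(n)]


def advect(u0, courant, n_steps):
    """Advance the initial profile u0 by n_steps upwind steps."""
    u = list(u0)
    for _ in range(n_steps):
        u = upwind_step(u, courant)
    return u
```

With a Courant number of exactly 1 the profile shifts by one cell per step; smaller values remain stable but smear sharp fronts, which is the usual trade-off of a first-order scheme.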
If $A$ and $B$ are sets such that $A \subset B$, generalisation may be understood as the inference from $A$ of a hypothesis sufficient to construct $B$. One might infer any number of hypotheses from $A$, yet only some of those may generalise to $B$. How can one know which are likely to generalise? One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise (a proxy for intelligence). We examine this in the context of a mathematical formalism of enactive cognition. We show that compression is neither necessary nor sufficient to maximise performance (measured in terms of the probability of a hypothesis generalising). We formulate a proxy unrelated to length or simplicity, called weakness. We show that if tasks are uniformly distributed, then there is no choice of proxy that performs at least as well as weakness maximisation in all tasks while performing strictly better in at least one. In experiments comparing maximum weakness and minimum description length in the context of binary arithmetic, the former generalised at between $1.1$ and $5$ times the rate of the latter. We argue this demonstrates that weakness is a far better proxy, and explains why DeepMind's Apperception Engine is able to generalise effectively.
In many recommender systems and search problems, presenting a well-balanced set of results can be an important goal in addition to serving highly relevant content. For example, in a movie recommendation system, it may be helpful to achieve a certain balance of different genres; likewise, it may be important to balance highly popular shows against highly personalized ones. Such balances could be considered across many categories and may be required for enhanced user experience, business considerations, fairness objectives, etc. In this paper, we consider the problem of calibrating with respect to any given categories over items. We propose a way to balance the trade-off between relevance and calibration via a linear programming problem in which we learn a doubly stochastic matrix to achieve optimal balance in expectation. We then realize the learned policy using the Birkhoff-von Neumann decomposition of a doubly stochastic matrix. Several optimizations of the proposed basic approach are considered to make it fast. The experiments show that the proposed formulation can achieve a much better trade-off than many other baselines. This paper does not prescribe the exact categories to calibrate over (such as genres) universally for applications; this is likely dependent on the particular task or business objective. The main contribution of the paper is a framework that can be applied to a variety of problems, and we demonstrate the efficacy of the proposed method using a few use-cases.
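The realization step can be illustrated with a minimal sketch of a Birkhoff-von Neumann decomposition, which writes a doubly stochastic matrix as a weighted sum of permutation matrices. The greedy strategy, the augmenting-path matching, the tolerance `tol`, and all function names are assumptions of this example; the paper's own implementation and optimizations may differ.

```python
def find_perfect_matching(support):
    """Augmenting-path bipartite matching on a boolean n x n support matrix.

    Returns perm with perm[row] = col, or None if no perfect matching exists.
    """
    n = len(support)
    col_to_row = [-1] * n

    def try_row(r, seen):
        for c in range(n):
            if support[r][c] and not seen[c]:
                seen[c] = True
                # Take a free column, or re-route the row currently using it.
                if col_to_row[c] == -1 or try_row(col_to_row[c], seen):
                    col_to_row[c] = r
                    return True
        return False

    for r in range(n):
        if not try_row(r, [False] * n):
            return None
    perm = [0] * n
    for c, r in enumerate(col_to_row):
        perm[r] = c
    return perm


def birkhoff_von_neumann(matrix, tol=1e-9):
    """Greedily decompose a doubly stochastic matrix.

    Returns a list of (weight, perm) pairs, where perm[row] = col.
    """
    n = len(matrix)
    residual = [row[:] for row in matrix]
    terms = []
    while True:
        support = [[residual[i][j] > tol for j in range(n)] for i in range(n)]
        perm = find_perfect_matching(support)
        if perm is None:
            break  # nothing (or only numerical dust) is left
        # The smallest matched entry is the largest weight we can peel off.
        weight = min(residual[i][perm[i]] for i in range(n))
        terms.append((weight, perm))
        for i in range(n):
            residual[i][perm[i]] -= weight
    return terms
```

Given the returned terms, one ranking can be drawn by sampling a permutation with probability proportional to its weight, for example with `random.choices`, so that the expected item-to-position assignment matches the learned matrix.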
Correct radar data fusion depends on knowledge of the spatial transform between sensor pairs. Current methods for determining this transform operate by aligning identifiable features in different radar scans, or by relying on measurements from another, more accurate sensor. Feature-based alignment requires the sensors to have overlapping fields of view or necessitates the construction of an environment map. Several existing techniques require bespoke retroreflective radar targets. These requirements limit both where and how calibration can be performed. In this paper, we take a different approach: instead of attempting to track targets or features, we rely on ego-velocity estimates from each radar to perform calibration. Our method enables calibration of a subset of the transform parameters, including the yaw and the axis of translation between the radar pair, without the need for a shared field of view or for specialized targets. In general, the yaw and the axis of translation are the most important parameters for data fusion, the most likely to vary over time, and the most difficult to calibrate manually. We formulate calibration as a batch optimization problem, show that the radar-radar system is identifiable, and specify the platform excitation requirements. Through simulation studies and real-world experiments, we establish that our method is more reliable and accurate than state-of-the-art methods. Finally, we demonstrate that the full rigid body transform can be recovered if relatively coarse information about the platform rotation rate is available.
An essential problem in causal inference is estimating causal effects from observational data. The problem becomes more challenging in the presence of unobserved confounders. When there are unobserved confounders, the commonly used back-door adjustment is not applicable. Although instrumental variable (IV) methods can deal with unobserved confounders, they all assume that the treatment directly affects the outcome, and that there is no mediator between the treatment and the outcome. This paper aims to use the front-door criterion to address the challenging problem of estimating causal effects in the presence of unobserved confounders and mediators. In practice, it is often difficult to identify the set of variables used for front-door adjustment from data. By leveraging the ability of deep generative models in representation learning, we propose FDVAE to learn the representation of a Front-Door adjustment set with a Variational AutoEncoder, instead of trying to search for a set of variables for front-door adjustment. Extensive experiments on synthetic datasets validate the effectiveness of FDVAE and its superiority over existing methods. The experiments also show that the performance of FDVAE is not sensitive to the causal strength of unobserved confounders and is feasible in the case of dimensionality mismatch between learned representations and the ground truth. We further apply the method to three real-world datasets to demonstrate its potential applications.
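For reference, the standard front-door adjustment formula that the criterion licenses can be written as follows. The symbols $X$ (treatment), $Z$ (a mediator set satisfying the front-door criterion), and $Y$ (outcome) are notation assumed here for illustration, written for discrete variables; FDVAE learns a representation that plays the role of $Z$ rather than selecting it from observed variables.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```

Every term on the right-hand side involves only observed variables, which is why the effect remains identifiable even when the confounder between $X$ and $Y$ is unobserved.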
In this paper we study the type IV Knorr-Held space-time models. Such models typically apply intrinsic Markov random fields, and constraints are imposed for identifiability. INLA is an efficient inference tool for such models, where constraints are dealt with through a conditioning by kriging approach. When the number of spatial and/or temporal points becomes large, it becomes computationally expensive to fit such models, partly due to the number of constraints involved. We propose a new approach, HyMiK, which divides the constraints into two separate sets: one set is treated through a mixed effect approach, while the other is handled by the standard conditioning by kriging method, resulting in a more efficient procedure for dealing with constraints. The new approach is easy to apply based on existing implementations of INLA. We run the model on simulated data, on a real data set containing dengue fever cases in Brazil, and on another real data set of confirmed positive test cases of Covid-19 in the counties of Norway. In all cases we get very similar results when comparing the new approach with the traditional one, while at the same time obtaining a significant increase in computational speed, by a factor of 3 to 23, depending on the sizes of the data sets.
In this paper, we introduce Optimal Classification Forests, a new family of classifiers that takes advantage of an optimal ensemble of decision trees to derive accurate and interpretable classifiers. We propose a novel mathematical optimization-based methodology in which a given number of trees are simultaneously constructed, each of them providing a predicted class for the observations in the feature space. The classification rule is derived by assigning to each observation its most frequently predicted class among the trees in the forest. We provide a mixed integer linear programming formulation for the problem. We report the results of our computational experiments, from which we conclude that our proposed method has equal or superior performance compared with state-of-the-art tree-based classification methods. More importantly, it achieves high prediction accuracy with, for example, orders of magnitude fewer trees than random forests. We also present three real-world case studies showing that our methodology has very interesting implications in terms of interpretability.
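The classification rule described above, assigning each observation its most frequently predicted class among the trees, can be sketched as a simple majority vote. The function name and the tie-breaking behaviour are assumptions of this example; the abstract does not say how ties are resolved, and the sketch covers only the voting rule, not the mixed integer optimization that builds the trees.

```python
from collections import Counter


def forest_predict(tree_predictions):
    """Return the class predicted most often by the trees for one observation.

    tree_predictions: one predicted class label per tree in the forest.
    Ties are resolved by whatever Counter.most_common returns first, which
    is an arbitrary choice made for this sketch.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]
```

For example, `forest_predict(["spam", "ham", "spam"])` returns `"spam"`. Using an odd number of trees avoids ties in the two-class case.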
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects, and hence the estimation error can be large. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is substantial if the causal effects are accounted for correctly.