We propose a novel prediction interval method for uncertainty quantification in regression tasks that learns the prediction mean and the lower and upper bounds of prediction intervals from three independently trained neural networks, using only the standard mean squared error (MSE) loss. Our method requires no distributional assumptions on the data and introduces no unusual hyperparameters to either the neural network models or the loss function. Moreover, it can effectively identify out-of-distribution samples and reasonably quantify their uncertainty. Numerical experiments on benchmark regression problems show that our method outperforms state-of-the-art methods with respect to predictive uncertainty quality, robustness, and identification of out-of-distribution samples.
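As a sketch of how such a three-network construction might look, the snippet below trains a mean network on the targets and two auxiliary networks on the magnitudes of the positive and negative residuals, all with plain MSE. The specific residual targets, architectures, and the calibration step that would scale the bounds to a desired coverage level are assumptions, not the paper's exact procedure.

```python
# Three independently trained networks, all with plain MSE (a hedged sketch).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(2000)

# Network 1: the prediction mean, fit directly to the targets.
mean_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(X, y)
resid = y - mean_net.predict(X)

# Networks 2 and 3: fit to the magnitudes of positive / negative residuals.
up_mask, lo_mask = resid >= 0, resid < 0
upper_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(X[up_mask], resid[up_mask])
lower_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000).fit(X[lo_mask], -resid[lo_mask])

x_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mu = mean_net.predict(x_test)
pi = np.stack([mu - lower_net.predict(x_test), mu + upper_net.predict(x_test)], axis=1)
print(pi)  # per-point [lower, upper]; a calibration step would scale these to a target coverage
```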
Efficient and theoretically sound uncertainty quantification is crucial for building trust in deep learning models for critical real-world applications, yet it remains challenging. Useful uncertainty information is expected to have two key properties: it should be valid (guaranteeing coverage) and discriminative (more uncertain when the expected risk is high). Moreover, when combined with deep learning (DL) methods, it should be scalable and affect DL model performance minimally. Most existing Bayesian methods lack frequentist coverage guarantees and usually degrade model performance. The few available frequentist methods are rarely discriminative and/or violate coverage guarantees due to unrealistic assumptions. Moreover, many methods are expensive or require substantial modifications to the base neural network. Building upon recent advances in conformal prediction [13, 32] and leveraging the classical idea of kernel regression, we propose Locally Valid and Discriminative prediction intervals (LVD), a simple, efficient, and lightweight method to construct discriminative prediction intervals (PIs) for almost any DL model. With no assumptions on the data distribution, such PIs also offer finite-sample local coverage guarantees (in contrast to the simpler marginal coverage). We verify empirically, using diverse datasets, that besides being the only locally valid method for DL, LVD also exceeds or matches the performance (including coverage rate and prediction accuracy) of existing uncertainty quantification methods, while offering additional benefits in scalability and flexibility.
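A minimal sketch of the kernel-localized conformal idea, assuming absolute residuals as conformity scores and a Gaussian kernel over inputs; LVD's actual construction and its local validity guarantee are more involved than this.

```python
# Kernel-weighted split-conformal interval around a point prediction (a sketch).
import numpy as np

def kernel_conformal_interval(x, X_cal, scores, mu_hat, alpha=0.1, h=0.5):
    # Gaussian kernel weights between the query x and calibration inputs.
    w = np.exp(-np.sum((X_cal - x) ** 2, axis=1) / (2 * h ** 2))
    w /= w.sum()
    # Weighted (1 - alpha) quantile of the calibration conformity scores.
    order = np.argsort(scores)
    cum = np.cumsum(w[order])
    idx = min(np.searchsorted(cum, 1 - alpha), len(scores) - 1)
    q = scores[order][idx]
    return mu_hat - q, mu_hat + q

rng = np.random.default_rng(0)
X_cal = rng.uniform(-2, 2, size=(500, 1))
y_cal = X_cal[:, 0] ** 2 + (0.1 + np.abs(X_cal[:, 0])) * rng.standard_normal(500)
mu = X_cal[:, 0] ** 2                  # stand-in for the DL model's predictions
scores = np.abs(y_cal - mu)            # conformity scores on the calibration set
print(kernel_conformal_interval(np.array([0.0]), X_cal, scores, 0.0))
print(kernel_conformal_interval(np.array([1.8]), X_cal, scores, 1.8 ** 2))
```

On this toy heteroscedastic example, the interval returned near x = 1.8 (high noise) comes out wider than the one near x = 0 (low noise), illustrating the discriminative behavior that localization buys over a single marginal quantile.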
Wireless applications that use high-reliability low-latency links depend critically on the system's capability to predict link quality. This dependence is especially acute at the high carrier frequencies used by mmWave and THz systems, where links are susceptible to blockages. Predicting blockages with high reliability requires a large number of data samples to train effective machine learning models. With the aim of mitigating data requirements, we introduce a framework based on meta-learning, whereby data from distinct deployments are leveraged to optimize a shared initialization that decreases the dataset size necessary for any new deployment. Predictors of two different events are studied: (1) at least one blockage occurs in a time window, and (2) the link is blocked for the entire time window. The results show that an RNN-based predictor trained using meta-learning is able to predict blockages after observing fewer samples than predictors trained using standard methods.
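To make the shared-initialization idea concrete, here is a minimal Reptile-style sketch; the paper's exact meta-learning algorithm, RNN architecture, and data are not reproduced, so the feedforward model and toy "deployments" below are placeholders.

```python
# Reptile-style meta-learning of a shared initialization (a hedged sketch).
import copy
import torch

def sgd_steps(model, loss_fn, data, lr=1e-2, steps=5):
    """Inner-loop adaptation: a few SGD steps on one deployment's data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = data
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

meta_model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                                 torch.nn.Linear(32, 1))
loss_fn = torch.nn.BCEWithLogitsLoss()
meta_lr = 0.1

for _ in range(100):                                  # outer loop over deployments
    x = torch.randn(64, 8)                            # stand-in per-deployment features
    y = (x.sum(dim=1, keepdim=True) > 0).float()      # stand-in blockage labels
    task_model = copy.deepcopy(meta_model)
    sgd_steps(task_model, loss_fn, (x, y))            # adapt to this deployment
    with torch.no_grad():                             # Reptile update: move the shared
        for p_m, p_t in zip(meta_model.parameters(),  # init toward the adapted weights
                            task_model.parameters()):
            p_m += meta_lr * (p_t - p_m)
```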
This paper concerns the construction of confidence intervals in standard seroprevalence surveys. In particular, we discuss methods for constructing confidence intervals for the proportion of individuals in a population infected with a disease, using a sample of antibody test results and measurements of the test's false positive and false negative rates. We begin by documenting erratic behavior in the coverage probabilities of standard Wald and percentile bootstrap intervals when applied to this problem. We then consider two alternative sets of intervals constructed by test inversion. The first set of intervals is approximate, using either an asymptotic or a bootstrap approximation to the finite-sample distribution of a chosen test statistic. We consider several choices of test statistic, including maximum likelihood estimators and generalized likelihood ratio statistics. We show through simulation that, at empirically relevant parameter values and sample sizes, the coverage probabilities of these intervals are close to their nominal level and approximately equi-tailed. The second set of intervals is shown to contain the true parameter value with probability at least equal to the nominal level, but can be conservative in finite samples.
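A minimal sketch of the test-inversion construction, assuming known error rates and an exact two-sided binomial test as the inverted test (the paper studies several test statistics and bootstrap variants): the implied positive-test rate is p(π) = π(1 − FNR) + (1 − π)·FPR, and the interval collects every prevalence π that the test fails to reject.

```python
# Test-inversion confidence interval for prevalence pi (a hedged sketch).
import numpy as np
from scipy.stats import binomtest

def prevalence_ci(k, n, fpr, fnr, alpha=0.05):
    grid = np.linspace(0, 1, 2001)
    kept = []
    for pi in grid:
        p_pos = pi * (1 - fnr) + (1 - pi) * fpr     # implied positive-test rate
        if binomtest(k, n, p_pos).pvalue >= alpha:  # keep pi if not rejected
            kept.append(pi)
    return (min(kept), max(kept)) if kept else (np.nan, np.nan)

# e.g. 50 positives out of 3,000 tests, with a 0.5% FPR and a 10% FNR
print(prevalence_ci(50, 3000, fpr=0.005, fnr=0.10))
```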
The success of deep learning techniques over the last decades has opened up a new avenue of research for weather forecasting. Here, we take the novel approach of using a neural network to predict full probability density functions at each point in space and time rather than a single output value, thus producing a probabilistic weather forecast. This enables the calculation of both uncertainty and skill metrics for the neural network predictions and overcomes the common difficulty of inferring uncertainty from such predictions. The approach is data-driven, and the neural network is trained on the WeatherBench dataset (processed ERA5 data) to forecast geopotential and temperature 3 and 5 days ahead. Data exploration leads to the identification of the most important input variables, which are also found to agree with physical reasoning, thereby validating our approach. To further increase computational efficiency, each neural network is trained on a small subset of these variables. The outputs are then combined through a stacked neural network, the first time such a technique has been applied to weather data. Our approach is found to be more accurate than some numerical weather prediction models and as accurate as more complex alternative neural networks, with the added benefit of providing the key probabilistic information necessary for making informed weather forecasts.
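One way to realize per-point predictive densities is to discretize the target and train with cross-entropy, as in the minimal sketch below; whether the paper uses binned or parametric densities, and the toy inputs standing in for gridded fields, are assumptions.

```python
# Predicting a full (binned) probability density per output (a hedged sketch).
import torch

n_features, n_bins = 16, 50
bin_edges = torch.linspace(-3, 3, n_bins + 1)

net = torch.nn.Sequential(torch.nn.Linear(n_features, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, n_bins))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.randn(256, n_features)                 # stand-in for gridded inputs
y = x[:, 0] + 0.2 * torch.randn(256)             # stand-in target (e.g. temperature anomaly)
target_bin = torch.bucketize(y, bin_edges).clamp(1, n_bins) - 1

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(net(x), target_bin)
    loss.backward()
    opt.step()

pdf = torch.softmax(net(x[:1]), dim=-1)          # predicted density over bins
print(pdf.sum().item(), pdf.argmax().item())     # sums to 1; modal bin
```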
Accurate forecasting is one of the fundamental focuses of the econometric time-series literature. Practitioners and policy makers often want to predict outcomes over an entire future time horizon rather than just a single $k$-step-ahead prediction. These series, apart from their own possible non-linear dependence, are often also influenced by many external predictors. In this paper, we construct prediction intervals for time-aggregated forecasts in a high-dimensional regression setting. Our approach is based on quantiles of residuals obtained by the popular LASSO routine. We allow for general heavy-tailed, long-memory, and nonlinear stationary error processes and stochastic predictors. Through a series of systematically arranged consistency results, we provide theoretical guarantees for our proposed quantile-based method in all of these scenarios. After validating our approach using simulations, we also propose a novel bootstrap-based method that can boost the coverage of the theoretical intervals. Finally, analyzing EPEX Spot data, we construct prediction intervals for hourly electricity prices over horizons spanning 17 weeks and contrast them with selected Bayesian and bootstrap interval forecasts.
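A minimal sketch of the quantile-of-residuals construction on a toy high-dimensional regression with heavy-tailed errors; the split, tuning parameter, and i.i.d. evaluation below ignore the time-aggregation and dependence structure that the paper's theory handles.

```python
# LASSO fit, then intervals from residual quantiles (a hedged sketch).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))                      # high-dimensional predictors
beta = np.zeros(p)
beta[:5] = 1.0                                       # sparse ground truth
y = X @ beta + rng.standard_t(df=4, size=n)          # heavy-tailed errors

model = Lasso(alpha=0.1).fit(X[:400], y[:400])
resid = y[:400] - model.predict(X[:400])
lo, hi = np.quantile(resid, [0.05, 0.95])            # residual quantiles

y_hat = model.predict(X[400:])
intervals = np.stack([y_hat + lo, y_hat + hi], axis=1)
cover = np.mean((y[400:] >= intervals[:, 0]) & (y[400:] <= intervals[:, 1]))
print(f"empirical coverage on held-out points: {cover:.2f}")
```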
Cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pretraining task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform these two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on token-level tasks such as question answering and structured prediction. Moreover, the model can serve as a pretrained word aligner, achieving reasonably low error rates on alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align.
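A minimal sketch of the pointer step alone: the masked token's hidden state attends over the other language's token states, and a cross-entropy loss is taken against a self-labeled alignment. The encoder, masking procedure, and EM alternation of the actual pretraining are omitted, and all tensors below are stand-ins.

```python
# Pointer-network loss for one masked token (a hedged sketch).
import torch

d, src_len, tgt_len = 64, 7, 9
src_h = torch.randn(src_len, d, requires_grad=True)   # stand-in encoder states (masked side)
tgt_h = torch.randn(tgt_len, d)                       # stand-in encoder states (other language)

masked_pos = 3
aligned_pos = torch.tensor([5])                       # self-labeled alignment for that token

logits = tgt_h @ src_h[masked_pos]                    # pointer scores over target positions
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), aligned_pos)
loss.backward()                                       # gradients flow back into the encoder
print(loss.item())
```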
The dominant paradigm for relation prediction in knowledge graphs involves learning and operating on latent representations (i.e., embeddings) of entities and relations. However, these embedding-based methods do not explicitly capture the compositional logical rules underlying the knowledge graph, and they are limited to the transductive setting, where the full set of entities must be known during training. Here, we propose GraIL, a graph neural network based relation prediction framework that reasons over local subgraph structures and has a strong inductive bias toward learning entity-independent relational semantics. Unlike embedding-based models, GraIL is naturally inductive and can generalize to unseen entities and graphs after training. We provide theoretical proof and strong empirical evidence that GraIL can represent a useful subset of first-order logic, and we show that GraIL outperforms existing rule-induction baselines in the inductive setting. We also demonstrate significant gains from ensembling GraIL with various knowledge graph embedding methods in the transductive setting, highlighting the complementary inductive bias of our method.
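A minimal sketch of GraIL's first step: extracting the enclosing subgraph around a candidate (head, tail) pair and attaching double-radius distance labels, which a GNN would then score. The networkx toy graph is an assumption, and the GNN itself is omitted.

```python
# Enclosing-subgraph extraction around a candidate link (a hedged sketch).
import networkx as nx

def enclosing_subgraph(G, head, tail, k=2):
    near_head = nx.single_source_shortest_path_length(G, head, cutoff=k)
    near_tail = nx.single_source_shortest_path_length(G, tail, cutoff=k)
    nodes = set(near_head) & set(near_tail)           # intersection of the k-hop balls
    sub = G.subgraph(nodes).copy()
    # Entity-independent node labels: (distance to head, distance to tail).
    labels = {n: (near_head[n], near_tail[n]) for n in nodes}
    return sub, labels

G = nx.karate_club_graph()                            # stand-in for a knowledge graph
sub, labels = enclosing_subgraph(G, head=0, tail=33, k=2)
print(sub.number_of_nodes(), labels[0], labels[33])
```

Because the labels encode only distances to the candidate endpoints, nothing in the scored input depends on entity identity, which is what lets the method generalize to entirely unseen entities.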
RNN models have achieved state-of-the-art performance in a wide range of text mining tasks. However, these models are often regarded as black boxes and are criticized for their lack of interpretability. In this paper, we enhance the interpretability of RNNs by providing interpretable rationales for their predictions. Interpreting RNNs, however, is a challenging problem. First, unlike existing methods that rely on local approximation, we aim to provide rationales that are more faithful to the decision-making process of RNN models. Second, a flexible interpretation method should be able to assign contribution scores to text segments of varying lengths, instead of only to individual words. To tackle these challenges, we propose a novel attribution method, called REAT, to provide interpretations of RNN predictions. REAT decomposes the final prediction of an RNN into additive contributions of each word in the input text. This additive decomposition enables REAT to further obtain phrase-level attribution scores. In addition, REAT is generally applicable to various RNN architectures, including GRU, LSTM, and their bidirectional versions. Experimental results demonstrate the faithfulness and interpretability of the proposed attribution method. Comprehensive analysis shows that our attribution method can unveil the useful linguistic knowledge captured by RNNs. Further analysis demonstrates that our method can be utilized as a debugging tool to examine the vulnerabilities and failure modes of RNNs, which may suggest several promising directions for improving the generalization ability of RNNs.
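To illustrate what an additive decomposition of an RNN prediction looks like, the sketch below scores each word by the change it induces in the decoded hidden-state sequence, so the per-word contributions telescope to the final prediction (and sums over contiguous words give phrase-level scores). This is an illustrative scheme, not REAT's exact attribution.

```python
# Additive word-level decomposition of an RNN prediction (a hedged sketch).
import torch

d, vocab, seq_len = 32, 100, 6
emb = torch.nn.Embedding(vocab, d)
gru = torch.nn.GRU(d, d, batch_first=True)
out = torch.nn.Linear(d, 1)                          # e.g. a sentiment logit

tokens = torch.randint(0, vocab, (1, seq_len))
h_seq, _ = gru(emb(tokens))                          # hidden state after each word
logits = out(h_seq).squeeze()                        # decoded score after each prefix

# Contribution of word t = score after word t minus score after word t-1;
# these deltas sum exactly to the final prediction.
contrib = logits - torch.cat([torch.zeros(1), logits[:-1]])
print(contrib.tolist())
print(contrib.sum().item(), logits[-1].item())       # identical by construction
```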
There has been much recent work on training neural attention models at the sequence level, using either reinforcement learning-style methods or beam optimization. In this paper, we survey a range of classical objective functions that have been widely used to train linear models for structured prediction and apply them to neural sequence-to-sequence models. Our experiments show that these losses can perform surprisingly well, slightly outperforming beam search optimization in a like-for-like setup. We also report new state-of-the-art results on both IWSLT'14 German-English translation and Gigaword abstractive summarization. On the larger WMT'14 English-French translation task, sequence-level training achieves 41.5 BLEU, which is on par with the state of the art.
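One classical objective in this family is expected risk minimization over a candidate set; a minimal sketch follows, with candidate generation and the cost function (e.g., 1 − sentence BLEU) stubbed out as fixed numbers.

```python
# Expected-risk sequence-level loss over a candidate set (a hedged sketch).
import torch

model_scores = torch.tensor([2.1, 1.3, 0.2], requires_grad=True)  # log-scores of 3 candidates
costs = torch.tensor([0.1, 0.4, 0.9])                             # e.g. 1 - BLEU per candidate

probs = torch.softmax(model_scores, dim=0)        # distribution over the candidate set
expected_risk = (probs * costs).sum()             # sequence-level loss to minimize
expected_risk.backward()
print(expected_risk.item(), model_scores.grad)    # gradient shifts mass toward low-cost candidates
```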
Topic models have been widely explored as probabilistic generative models of documents. Traditional inference methods have sought closed-form derivations for updating the models; however, as the expressiveness of these models grows, so does the difficulty of performing fast and accurate inference over their parameters. This paper presents alternative neural approaches to topic modelling that provide parameterisable distributions over topics, permitting training by backpropagation in the framework of neural variational inference. In addition, with the help of a stick-breaking construction, we propose a recurrent network that is able to discover a notionally unbounded number of topics, analogous to Bayesian non-parametric topic models. Experimental results on the MXM Song Lyrics, 20NewsGroups, and Reuters News datasets demonstrate the effectiveness and efficiency of these neural topic models.
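A minimal sketch of the stick-breaking construction referenced above, with break fractions obtained by sigmoid-transforming Gaussian draws as one might in neural variational inference; the truncation level and the toy draws are assumptions.

```python
# Stick-breaking topic proportions from Gaussian draws (a hedged sketch).
import numpy as np

rng = np.random.default_rng(0)
K = 8                                             # truncation, for illustration only
eta = 1 / (1 + np.exp(-rng.standard_normal(K)))   # break fractions in (0, 1)

# pi_k = eta_k * prod_{j<k} (1 - eta_j): each topic takes a fraction of the
# remaining stick, so more topics can be added without renormalizing.
remaining = np.concatenate([[1.0], np.cumprod(1 - eta)[:-1]])
theta = eta * remaining
print(theta, theta.sum())                         # leftover mass can fund new topics
```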