Correlated proportions appear in many real-world applications and present a unique challenge in terms of finding an appropriate probabilistic model due to their constrained nature. The bivariate beta is a natural extension of the well-known beta distribution to the space of correlated quantities on $[0, 1]^2$. Its construction is not unique, however. Over the years, many bivariate beta distributions have been proposed, ranging from three to eight or more parameters, and for which the joint density and distribution moments vary in terms of mathematical tractability. In this paper, we investigate the construction proposed by Olkin & Trikalinos (2015), which strikes a balance between parameter-richness and tractability. We provide classical (frequentist) and Bayesian approaches to estimation in the form of method-of-moments and latent variable/data augmentation coupled with Hamiltonian Monte Carlo, respectively. The elicitation of bivariate beta as a prior distribution is also discussed. The development of diagnostics for checking model fit and adequacy is explored in depth with the aid of Monte Carlo experiments under both well-specified and misspecified data-generating settings. Keywords: Bayesian estimation; bivariate beta; correlated proportions; diagnostics; method of moments.
Recent approaches to causal inference have focused on causal effects defined as contrasts between the distribution of counterfactual outcomes under hypothetical interventions on the nodes of a graphical model. In this article we develop theory for causal effects defined with respect to a different type of intervention, one which alters the information propagated through the edges of the graph. These information transfer interventions may be more useful than node interventions in settings in which causes are non-manipulable, for example when considering race or genetics as a causal agent. Furthermore, information transfer interventions allow us to define path-specific decompositions which are identified in the presence of treatment-induced mediator-outcome confounding, a practical problem whose general solution remains elusive. We prove that the proposed effects provide valid statistical tests of mechanisms, unlike popular methods based on randomized interventions on the mediator. We propose efficient non-parametric estimators for a covariance version of the proposed effects, using data-adaptive regression coupled with semi-parametric efficiency theory to address model misspecification bias while retaining $\sqrt{n}$-consistency and asymptotic normality. We illustrate the use of our methods in two examples using publicly available data.
In many recommender systems and search problems, presenting a well balanced set of results can be an important goal in addition to serving highly relevant content. For example, in a movie recommendation system, it may be helpful to achieve a certain balance of different genres, likewise, it may be important to balance between highly popular versus highly personalized shows. Such balances could be thought across many categories and may be required for enhanced user experience, business considerations, fairness objectives etc. In this paper, we consider the problem of calibrating with respect to any given categories over items. We propose a way to balance a trade-off between relevance and calibration via a Linear Programming optimization problem where we learn a doubly stochastic matrix to achieve optimal balance in expectation. We then realize the learned policy using the Birkhoff-von Neumann decomposition of a doubly stochastic matrix. Several optimizations are considered over the proposed basic approach to make it fast. The experiments show that the proposed formulation can achieve a much better trade-off compared to many other baselines. This paper does not prescribe the exact categories to calibrate over (such as genres) universally for applications. This is likely dependent on the particular task or business objective. The main contribution of the paper is that it proposes a framework that can be applied to a variety of problems and demonstrates the efficacy of the proposed method using a few use-cases.
Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end-to-end. These approaches for combining partially-known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature, typically considered in the context of classifiers. The goal of this lecture note is to introduce the concepts of generative and discriminative learning for inference with a partially-known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-squared error (MSE) objective, i.e., a linear estimation setting.
Noiseless compressive sensing is a protocol that enables undersampling and later recovery of a signal without loss of information. This compression is possible because the signal is usually sufficiently sparse in a given basis. Currently, the algorithm offering the best tradeoff between compression rate, robustness, and speed for compressive sensing is the LASSO (l1-norm bias) algorithm. However, many studies have pointed out the possibility that the implementation of lp-norms biases, with p smaller than one, could give better performance while sacrificing convexity. In this work, we focus specifically on the extreme case of the l0-based reconstruction, a task that is complicated by the discontinuity of the loss. In the first part of the paper, we describe via statistical physics methods, and in particular the replica method, how the solutions to this optimization problem are arranged in a clustered structure. We observe two distinct regimes: one at low compression rate where the signal can be recovered exactly, and one at high compression rate where the signal cannot be recovered accurately. In the second part, we present two message-passing algorithms based on our first results for the l0-norm optimization problem. The proposed algorithms are able to recover the signal at compression rates higher than the ones achieved by LASSO while being computationally efficient.
Obtaining guarantees on the convergence of the minimizers of empirical risks to the ones of the true risk is a fundamental matter in statistical learning. Instead of deriving guarantees on the usual estimation error, the goal of this paper is to provide concentration inequalities on the distance between the sets of minimizers of the risks for a broad spectrum of estimation problems. In particular, the risks are defined on metric spaces through probability measures that are also supported on metric spaces. A particular attention will therefore be given to include unbounded spaces and non-convex cost functions that might also be unbounded. This work identifies a set of assumptions allowing to describe a regime that seem to govern the concentration in many estimation problems, where the empirical minimizers are stable. This stability can then be leveraged to prove parametric concentration rates in probability and in expectation. The assumptions are verified, and the bounds showcased, on a selection of estimation problems such as barycenters on metric space with positive or negative curvature, subspaces of covariance matrices, regression problems and entropic-Wasserstein barycenters.
The task of predicting the publication period of text documents, such as news articles, is an important but less studied problem in the field of natural language processing. Predicting the year of a news article can be useful in various contexts, such as historical research, sentiment analysis, and media monitoring. In this work, we investigate the problem of predicting the publication period of a text document, specifically a news article, based on its textual content. In order to do so, we created our own extensive labeled dataset of over 350,000 news articles published by The New York Times over six decades. In our approach, we use a pretrained BERT model fine-tuned for the task of text classification, specifically for time period prediction.This model exceeds our expectations and provides some very impressive results in terms of accurately classifying news articles into their respective publication decades. The results beat the performance of the baseline model for this relatively unexplored task of time prediction from text.
Many social events and policy interventions generate treatment effects that persistently spill over into neighboring areas, causing interference in both time and space. In this paper, we propose a design-based framework to identify and estimate these spillover effects in panel data, when temporal and spatial interference intertwine with each other in complex ways that are unknown to researchers. Our framework defines estimands that enable researchers to measure the influence of each type of interference, and we propose estimators that are consistent and asymptotically normal under the assumption of sequential ignorability and mild regularity conditions. We show that conventional methods in panel data analysis, such as the difference-in-differences (DID) estimator or fixed effects models, can lead to significant biases in such scenarios. We test the method's performance on both simulated datasets and the replication of an empirical study from political science.
Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference -- other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.
We present prompt distribution learning for effectively adapting a pre-trained vision-language model to address downstream recognition tasks. Our method not only learns low-bias prompts from a few samples but also captures the distribution of diverse prompts to handle the varying visual representations. In this way, we provide high-quality task-related content for facilitating recognition. This prompt distribution learning is realized by an efficient approach that learns the output embeddings of prompts instead of the input embeddings. Thus, we can employ a Gaussian distribution to model them effectively and derive a surrogate loss for efficient training. Extensive experiments on 12 datasets demonstrate that our method consistently and significantly outperforms existing methods. For example, with 1 sample per category, it relatively improves the average result by 9.1% compared to human-crafted prompts.
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.