With historic misses in the 2016 and 2020 US Presidential elections, interest in measuring polling errors has increased. The most common method for measuring directional errors and non-sampling excess variability during an election postmortem is to assess the difference between the poll result and the election result for polls conducted within a few days of election day. Analyzing such polling error data is notoriously difficult, with typical models being extremely sensitive to the time between the poll and the election. We leverage hidden Markov models traditionally used for election forecasting to flexibly capture time-varying preferences and treat the election result as a peek at the typically hidden Markovian process. Our results are much less sensitive to the choice of time window, avoid conflating shifting preferences with polling error, and are more interpretable despite a highly flexible model. We demonstrate these results with data on polls from the 2004 through 2020 US Presidential elections and 1992 through 2020 US Senate elections, concluding that previously reported estimates of bias in Presidential elections were too extreme by 10\%, estimated bias in Senatorial elections was too extreme by 25\%, and excess variability estimates were also too large.
Bayesian nonparametric hierarchical priors provide flexible models for sharing of information within and across groups. We focus on latent feature allocation models, where the data structures correspond to multisets or unbounded sparse matrices. The fundamental development in this regard is the Hierarchical Indian Buffet process (HIBP), devised by Thibaux and Jordan (2007). However, little is known in terms of explicit tractable descriptions of the joint, marginal, posterior and predictive distributions of the HIBP. We provide explicit novel descriptions of these quantities, in the Bernoulli HIBP and general spike and slab HIBP settings, which allows for exact sampling and simpler practical implementation. We then extend these results to the more complex setting of hierarchies of general HIBP (HHIBP). The generality of our framework allows one to recognize important structure that may otherwise be masked in the Bernoulli setting, and involves characterizations via dynamic mixed Poisson random count matrices. Our analysis shows that the standard choice of hierarchical Beta processes for modeling across group sharing is not ideal in the classic Bernoulli HIBP setting proposed by Thibaux and Jordan (2007), or other spike and slab HIBP settings, and we thus indicate tractable alternative priors.
In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimation of the Value function of the corresponding Markov Decision Process (MDP). Consequently, the performance of the obtained policy in the real world is bounded and possibly risky, especially when the deployment of a wrong policy can lead to catastrophic consequences. For this reason, several pathways are being followed with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty abiding by the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior among a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC with state-of-the-art approaches in different discrete, yet simple, environments offering a fair variety of MDP classes. In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners who aim to apply offline planning and reinforcement learning solvers in the real world.
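The selection step described in the abstract above (score each candidate policy by a risk-aware objective over a Bayesian posterior on models, then keep the best) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes a small tabular MDP with known rewards, an independent Dirichlet posterior over each transition row, and conditional value at risk (CVaR) of the start-state value as the risk-aware objective. All function names and parameters (`select_policy`, `prior`, `alpha`, and so on) are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_model(counts, prior, rng):
    """Sample transition probabilities from the posterior.

    counts[s][a][s2] is the number of observed transitions s -> s2 under a.
    The symmetric Dirichlet prior is an assumption of this sketch.
    """
    return [[sample_dirichlet([c + prior for c in row], rng) for row in state]
            for state in counts]


def evaluate_policy(policy, transitions, rewards, gamma, sweeps=200):
    """Iterative policy evaluation for a deterministic policy (state -> action)."""
    values = [0.0] * len(transitions)
    for _ in range(sweeps):
        values = [
            rewards[s][policy[s]]
            + gamma * sum(p * v for p, v in zip(transitions[s][policy[s]], values))
            for s in range(len(transitions))
        ]
    return values


def cvar(samples, alpha):
    """Mean of the worst alpha-fraction of the samples (lower tail)."""
    ordered = sorted(samples)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def select_policy(candidates, counts, rewards, gamma=0.9, alpha=0.1,
                  num_models=200, prior=1.0, start_state=0, seed=0):
    """Return the index of the candidate with the highest CVaR, plus all scores."""
    rng = random.Random(seed)
    models = [sample_model(counts, prior, rng) for _ in range(num_models)]
    scores = []
    for policy in candidates:
        returns = [evaluate_policy(policy, m, rewards, gamma)[start_state]
                   for m in models]
        scores.append(cvar(returns, alpha))
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores


if __name__ == "__main__":
    # Toy problem: in state 0, action 0 is well observed, action 1 was seen once.
    counts = [[[100, 0], [1, 0]], [[0, 50], [0, 50]]]
    rewards = [[1.0, 1.2], [-5.0, -5.0]]
    candidates = [[0, 0], [1, 0]]
    print(select_policy(candidates, counts, rewards))
```

In this toy problem the maximum-likelihood model would favor the second candidate, because its single observed transition stays in the rewarding state and pays slightly more. The posterior-based CVaR score instead penalizes how little is known about that action, which is the kind of caution the abstract argues for.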
Using administrative patient-care data such as Electronic Health Records and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of bias, in this paper, we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect their parameter estimates of interest. We review four easy-to-implement weighting approaches to reduce selection bias and explain through a simulation study when they can rescue us in practice when analyzing real-world data. We provide annotated R code to implement these methods.
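To make the idea of weighting against selection bias concrete, here is a minimal sketch of inverse probability of selection weighting, one widely used approach of this kind. Whether it corresponds to any of the four methods reviewed in the paper is an assumption, as is the simplification of estimating a mean rather than an association parameter and of treating selection probabilities as known. The function name `ipw_mean` is hypothetical.

```python
def ipw_mean(values, selection_probs):
    """Weighted mean in which each record counts for 1 / P(selected) people.

    Assumes every selection probability is known and strictly positive.
    """
    weights = [1.0 / p for p in selection_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    # Two equally sized strata of 10 people; stratum A (outcome 1) is
    # selected with probability 0.8, stratum B (outcome 0) with 0.2.
    values = [1.0] * 8 + [0.0] * 2
    probs = [0.8] * 8 + [0.2] * 2
    print(sum(values) / len(values))   # naive mean over selected records
    print(ipw_mean(values, probs))     # reweighted toward the population
```

The naive mean over the selected records overstates the outcome because the over-sampled stratum dominates. The weights undo that imbalance, and no increase in sample size would fix the naive estimate.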
Adhesive joints are increasingly used in industry for a wide variety of applications because of their favorable characteristics such as high strength-to-weight ratio, design flexibility, limited stress concentrations, planar force transfer, good damage tolerance, and fatigue resistance. Finding the optimal process parameters for an adhesive bonding process is challenging: the optimization is inherently multi-objective (aiming to maximize break strength while minimizing cost), constrained (the process should not result in any visual damage to the materials, and stress tests should not result in failures that are adhesion-related), and uncertain (testing the same process parameters several times may lead to different break strengths). Real-life physical experiments in the lab are expensive to perform. Traditional evolutionary approaches (such as genetic algorithms) are then ill-suited to solve the problem, due to the prohibitive number of experiments required for evaluation. Although Bayesian optimization-based algorithms are preferred to solve such expensive problems, few methods consider the optimization of more than one (noisy) objective and several constraints at the same time. In this research, we successfully applied specific machine learning techniques (Gaussian Process Regression) to emulate the objective and constraint functions based on a limited amount of experimental data. The techniques are embedded in a Bayesian optimization algorithm, which succeeds in detecting Pareto-optimal process settings in a highly efficient way (i.e., requiring a limited number of physical experiments).
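The notion of Pareto-optimal process settings in the abstract above can be illustrated with a small non-dominated filter over (break strength, cost) pairs. This sketch covers only the Pareto-optimality concept for the two stated objectives. It deliberately ignores measurement noise, the constraints, and the Gaussian process surrogate, and the function names and example numbers are hypothetical.

```python
def dominates(a, b):
    """True if setting a is no worse than b in both objectives and better in one.

    Each setting is a (break_strength, cost) pair: strength is maximized,
    cost is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the settings that no other setting dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Made-up (strength, cost) outcomes for five process settings.
    outcomes = [(10, 5), (12, 7), (9, 6), (12, 8), (8, 3)]
    print(pareto_front(outcomes))
```

With noisy objectives, as in the bonding problem, a filter like this would more plausibly be applied to predicted means or to quantities that account for uncertainty rather than to single raw measurements.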
In the literature on deep neural networks, there is considerable interest in developing activation functions that can enhance neural network performance. In recent years, there has been renewed scientific interest in proposing activation functions that can be trained throughout the learning process, as they appear to improve network performance, especially by reducing overfitting. In this paper, we propose a trainable activation function whose parameters need to be estimated. A fully Bayesian model is developed to automatically estimate from the learning data both the model weights and activation function parameters. An MCMC-based optimization scheme is developed to perform the inference. The proposed method aims to solve the aforementioned problems and improve convergence time by using an efficient sampling scheme that guarantees convergence to the global maximum. The proposed scheme is tested on three datasets with three different CNNs. Promising results demonstrate the usefulness of our proposed approach in improving model accuracy due to the proposed activation function and Bayesian estimation of the parameters.
Numerical vector aggregation plays a crucial role in privacy-sensitive applications, such as distributed gradient estimation in federated learning and statistical analysis of key-value data. In the context of local differential privacy, this study provides a tight minimax error bound of $O(\frac{ds}{n\epsilon^2})$, where $d$ represents the dimension of the numerical vector and $s$ denotes the number of non-zero entries. By converting the conditional/unconditional numerical mean estimation problem into a frequency estimation problem, we develop an optimal and efficient mechanism called Collision. In contrast, existing methods exhibit sub-optimal error rates of $O(\frac{d^2}{n\epsilon^2})$ or $O(\frac{ds^2}{n\epsilon^2})$. Specifically, for unconditional mean estimation, we leverage the negative correlation between two frequencies in each dimension and propose the CoCo mechanism, which further reduces estimation errors for mean values compared to Collision. Moreover, to surpass the error barrier in local privacy, we examine privacy amplification in the shuffle model for the proposed mechanisms and derive precisely tight amplification bounds. Our experiments validate and compare our mechanisms with existing approaches, demonstrating significant error reductions for frequency estimation and mean estimation on numerical vectors.
This paper proposes a cell-free massive multiple-input multiple-output (CF-mMIMO) architecture with joint list-based detection with soft interference cancelation (soft-IC) and access point (AP) selection. In particular, we derive a new closed-form expression for the minimum mean-square error receive filter while taking the uplink transmit powers and AP selection into account. This is achieved by optimizing the receive combining vector to minimize the mean-square error between the detected symbol estimate and the transmitted symbol, after canceling the multi-user interference (MUI). Using low-density parity-check (LDPC) codes, an iterative detection and decoding (IDD) scheme based on message passing is devised. In order to perform joint detection at the central processing unit (CPU), the access points locally estimate the channel and send their received sample data to the CPU via the fronthaul links. In order to enhance the system's bit error rate performance, the detected symbols are iteratively exchanged between the joint detector and the LDPC decoder in log-likelihood ratio form. Furthermore, we draw insights into the derived detector as the number of IDD iterations increases. Finally, the proposed list detector is compared with existing detection techniques.
In video action recognition, shortcut static features can interfere with the learning of motion features, resulting in poor out-of-distribution (OOD) generalization. The video background is clearly a source of static bias, but the video foreground, such as the clothing of the actor, can also provide static bias. In this paper, we empirically verify the existence of foreground static bias by creating test videos with conflicting signals from the static and moving portions of the video. To tackle this issue, we propose a simple yet effective technique, StillMix, to learn robust action representations. Specifically, StillMix identifies bias-inducing video frames using a 2D reference network and mixes them with videos for training, serving as effective bias suppression even when we cannot explicitly extract the source of bias within each video frame or enumerate types of bias. Finally, to precisely evaluate static bias, we synthesize two new benchmarks, SCUBA for static cues in the background, and SCUFO for static cues in the foreground. With extensive experiments, we demonstrate that StillMix mitigates both types of static bias and improves video representations for downstream applications.
Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can directly optimize the encoder instead, to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question of whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex, inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. The number of iterations required to perfectly fit the data scales superlinearly with the number of randomly flipped labels for the supervised contrastive loss. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.
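The target geometry named in the abstract above, class representations sitting at the vertices of a regular simplex inscribed in a hypersphere, can be constructed and checked numerically. The sketch below uses one standard construction (centering the standard basis vectors and rescaling to unit norm). It is an illustration of the geometry only, not the paper's proof or training procedure, and the function names are hypothetical.

```python
import math


def simplex_vertices(num_classes):
    """Unit-norm vertices of a regular simplex, one per class.

    The vertices live in R^K for K classes but span only a (K-1)-dimensional
    subspace, because they sum to the zero vector.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    k = num_classes
    norm = math.sqrt((k - 1) / k)  # length of a centered basis vector
    return [[((1.0 if i == j else 0.0) - 1.0 / k) / norm for j in range(k)]
            for i in range(k)]


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    verts = simplex_vertices(4)
    # Every vertex has unit norm, and every distinct pair has the same
    # inner product, -1 / (K - 1).
    print([round(dot(v, v), 6) for v in verts])
    print(round(dot(verts[0], verts[1]), 6), -1 / 3)
```

The equal pairwise inner product of $-1/(K-1)$ is what makes the configuration maximally separated: no pair of class representations can be pushed further apart without bringing another pair closer.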
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
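The effective-number formula $(1-\beta^{n})/(1-\beta)$ from the abstract above translates directly into per-class loss weights. The sketch below weights each class inversely to its effective number. Normalizing the weights so that they sum to the number of classes is an assumption of this example rather than something the abstract specifies, and the function names are hypothetical.

```python
def effective_number(n, beta):
    """Effective number of samples: (1 - beta**n) / (1 - beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(class_counts, beta):
    """Per-class weights inversely proportional to the effective number.

    Assumes every class has at least one sample. The weights are rescaled
    to sum to the number of classes (a normalization chosen for this sketch).
    """
    raw = [1.0 / effective_number(n, beta) for n in class_counts]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]


if __name__ == "__main__":
    counts = [5000, 500, 50]  # a made-up long-tailed class distribution
    for beta in (0.0, 0.99, 0.999):
        print(beta, class_balanced_weights(counts, beta))
```

With $\beta = 0$ every class has an effective number of 1, so the weights are uniform and no re-balancing happens. As $\beta$ approaches 1, the effective number approaches $n$ and the scheme approaches inverse-frequency weighting, so $\beta$ interpolates between the two.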