Discrete choice experiments are frequently used to quantify consumer preferences by having respondents choose between different alternatives. Choice experiments involving mixtures of ingredients have been largely overlooked in the literature, even though many products and services can be described as mixtures of ingredients. As a consequence, little research has been done on the optimal design of choice experiments involving mixtures. The only existing research has focused on D-optimal designs, which means that an estimation-based approach was adopted. However, in experiments with mixtures, it is crucial to obtain models that yield precise predictions for any combination of ingredient proportions. This is because the goal of mixture experiments generally is to find the mixture that optimizes the respondents' utility. As a result, the I-optimality criterion is more suitable for designing choice experiments with mixtures than the D-optimality criterion because the I-optimality criterion focuses on getting precise predictions with the estimated statistical model. In this paper, we study Bayesian I-optimal designs, compare them with their Bayesian D-optimal counterparts, and show that the former designs perform substantially better than the latter in terms of the variance of the predicted utility.
It is a common phenomenon that for high-dimensional and nonparametric statistical models, rate-optimal estimators balance squared bias and variance. Although this balancing is widely observed, little is known whether methods exist that could avoid the trade-off between bias and variance. We propose a general strategy to obtain lower bounds on the variance of any estimator with bias smaller than a prespecified bound. This shows to which extent the bias-variance trade-off is unavoidable and allows to quantify the loss of performance for methods that do not obey it. The approach is based on a number of abstract lower bounds for the variance involving the change of expectation with respect to different probability measures as well as information measures such as the Kullback-Leibler or chi-square-divergence. Some of these inequalities rely on a new concept of information matrices. In a second part of the article, the abstract lower bounds are applied to several statistical models including the Gaussian white noise model, a boundary estimation problem, the Gaussian sequence model and the high-dimensional linear regression model. For these specific statistical applications, different types of bias-variance trade-offs occur that vary considerably in their strength. For the trade-off between integrated squared bias and integrated variance in the Gaussian white noise model, we combine the general strategy for lower bounds with a reduction technique. This allows us to link the original problem to the bias-variance trade-off for estimators with additional symmetry properties in a simpler statistical model. In the Gaussian sequence model, different phase transitions of the bias-variance trade-off occur. Although there is a non-trivial interplay between bias and variance, the rate of the squared bias and the variance do not have to be balanced in order to achieve the minimax estimation rate.
We propose a pairs trading model that incorporates a time-varying volatility of the Constant Elasticity of Variance type. Our approach is based on stochastic control techniques; given a fixed time horizon and a portfolio of two co-integrated assets, we define the trading strategies as the portfolio weights maximizing the expected power utility from terminal wealth. We compute the optimal pairs strategies by using a Finite Difference method. Finally, we illustrate our results by conducting tests on historical market data at daily frequency. The parameters are estimated by the Generalized Method of Moments.
In settings ranging from weather forecasts to political prognostications to financial projections, probability estimates of future binary outcomes often evolve over time. For example, the estimated likelihood of rain on a specific day changes by the hour as new information becomes available. Given a collection of such probability paths, we introduce a Bayesian framework -- which we call the Gaussian latent information martingale, or GLIM -- for modeling the structure of dynamic predictions over time. Suppose, for example, that the likelihood of rain in a week is 50 %, and consider two hypothetical scenarios. In the first, one expects the forecast to be equally likely to become either 25 % or 75 % tomorrow; in the second, one expects the forecast to stay constant for the next several days. A time-sensitive decision-maker might select a course of action immediately in the latter scenario, but may postpone their decision in the former, knowing that new information is imminent. We model these trajectories by assuming predictions update according to a latent process of information flow, which is inferred from historical data. In contrast to general methods for time series analysis, this approach preserves important properties of probability paths such as the martingale structure and appropriate amount of volatility and better quantifies future uncertainties around probability paths. We show that GLIM outperforms three popular baseline methods, producing better estimated posterior probability path distributions measured by three different metrics. By elucidating the dynamic structure of predictions over time, we hope to help individuals make more informed choices.
Randomized experiments are often performed to study the causal effects of interest. Blocking is a technique to precisely estimate the causal effects when the experimental material is not homogeneous. We formalize the problem of obtaining a statistically optimal set of covariates to be used to create blocks while performing a randomized experiment. We provide a graphical test to obtain such a set for a general semi-Markovian causal model. We also propose and provide ideas towards solving a more general problem of obtaining an optimal blocking set that considers both the statistical and economic costs of blocking.
We consider an input-to-response (ItR) system characterized by (1) parameterized input with a known probability distribution and (2) stochastic ItR function with heteroscedastic randomness. Our purpose is to efficiently quantify the extreme response probability when the ItR function is expensive to evaluate. The problem setup arises often in physics and engineering problems, with randomness in ItR coming from either intrinsic uncertainties (say, as a solution to a stochastic equation) or additional (critical) uncertainties that are not incorporated in a low-dimensional input parameter space (as a result of dimension reduction applied to the original high-dimensional input space). To reduce the required sampling numbers, we develop a sequential Bayesian experimental design method leveraging the variational heteroscedastic Gaussian process regression (VHGPR) to account for the stochastic ItR, along with a new criterion to select the next-best samples sequentially. The validity of our new method is first tested in two synthetic problems with the stochastic ItR functions defined artificially. Finally, we demonstrate the application of our method to an engineering problem of estimating the extreme ship motion probability in irregular waves, where the uncertainty in ItR naturally originates from standard wave group parameterization, which reduces the original high-dimensional wave field into a two-dimensional parameter space.
Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the value of the parameters is unimportant, and only the out of sample performance is considered. On the other hand much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and thus is concerned with bias and variance of parameter estimates, which may not have any simple relationship to out of sample model performance. Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations. We introduce an objective function which we refer to as Information-Corrected Estimation (ICE) that reduces KL divergence based generalization error for supervised machine learning. ICE attempts to directly maximize a corrected likelihood function as an estimator of the KL divergence. Such an approach is proven, theoretically, to be effective for a wide class of models, with only mild regularity restrictions. Under finite sample sizes, this corrected estimation procedure is shown experimentally to lead to significant reduction in generalization error compared to maximum likelihood estimation and L2 regularization.
In this paper, we apply a Bayesian perspective to the sampling of alternatives for multinomial logit (MNL) and mixed multinomial logit (MMNL) models. A sampling of alternatives reduces the computational challenge of evaluating the denominator of the logit choice probability for large choice sets by only using a smaller subset of sampled alternatives including the chosen alternative. To correct for the resulting overestimation of the choice probability, a correction factor has to be applied. McFadden (1978) proposes a correction factor to the utility of each alternative which is based on the probability of sampling the smaller subset of alternatives and that alternative being chosen. McFadden's correction factor ensures consistency of parameter estimates under a wide range of sampling protocols. A special sampling protocol discussed by McFadden is uniform conditioning, which assigns the same sampling probability and therefore the same correction factor to each alternative in the sampled choice set. Since a constant is added to each alternative the correction factor cancels out, but consistent estimates are still obtained. Bayesian estimation is focused on describing the full posterior distributions of the parameters of interest instead of the consistency of their point estimates. We theoretically show that uniform conditioning is sufficient to minimise the loss of information from a sampling of alternatives on the parameters of interest over the full posterior distribution in Bayesian MNL models. Minimum loss of information is, however, not guaranteed for other sampling protocols. This result extends to Bayesian MMNL models estimated using the principle of data augmentation. The application of uniform conditioning, a more restrictive sampling protocol, is thus sufficient in a Bayesian estimation context to achieve finite sample properties of MNL and MMNL parameter estimates.
There is growing interest in object detection in advanced driver assistance systems and autonomous robots and vehicles. To enable such innovative systems, we need faster object detection. In this work, we investigate the trade-off between accuracy and speed with domain-specific approximations, i.e. category-aware image size scaling and proposals scaling, for two state-of-the-art deep learning-based object detection meta-architectures. We study the effectiveness of applying approximation both statically and dynamically to understand the potential and the applicability of them. By conducting experiments on the ImageNet VID dataset, we show that domain-specific approximation has great potential to improve the speed of the system without deteriorating the accuracy of object detectors, i.e. up to 7.5x speedup for dynamic domain-specific approximation. To this end, we present our insights toward harvesting domain-specific approximation as well as devise a proof-of-concept runtime, AutoFocus, that exploits dynamic domain-specific approximation.
While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent work has shown that they suffer from lack of diversity or mode collapse. The theoretical work of Arora et al.~\cite{AroraGeLiMaZh17} suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. In contrast, we show in this paper that GANs can in principle learn distributions in Wasserstein distance (or KL-divergence in many cases) with polynomial sample complexity, if the discriminator class has strong distinguishing power against the particular generator class (instead of against all possible generators). For various generator classes such as mixture of Gaussians, exponential families, and invertible neural networks generators, we design corresponding discriminators (which are often neural nets of specific architectures) such that the Integral Probability Metric (IPM) induced by the discriminators can provably approximate the Wasserstein distance and/or KL-divergence. This implies that if the training is successful, then the learned distribution is close to the true distribution in Wasserstein distance or KL divergence, and thus cannot drop modes. Our preliminary experiments show that on synthetic datasets the test IPM is well correlated with KL divergence, indicating that the lack of diversity may be caused by the sub-optimality in optimization instead of statistical inefficiency.
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.