The paper describes a new class of capture-recapture models for closed populations when individual covariates are available. The novelty lies in combining a latent class model for the distribution of the capture histories, where the class weights and the conditional distributions given the latent class may depend on covariates, with a model for the marginal distribution of the available covariates, as in \cite{Liu2017}. In addition, any general form of serial dependence is allowed when modeling capture histories conditionally on the latent class and the covariates. A Fisher-scoring algorithm for maximum likelihood estimation is proposed, and the Implicit Function Theorem is used to show that the mapping between the marginal distribution of the observed covariates and the probabilities of being never captured is one-to-one. Asymptotic results are outlined, and a procedure for constructing likelihood-based confidence intervals for the population size is presented. Two examples based on real data illustrate the proposed approach.
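In generic notation (ours, not necessarily the paper's), the latent class structure described above, with covariate-dependent class weights $\pi_c(x)$ and capture probabilities that may depend on the previous occasions (serial dependence), can be written as
\[
  \Pr(H = h \mid X = x)
    = \sum_{c=1}^{C} \pi_c(x)
      \prod_{j=1}^{J} \Pr\!\left(H_j = h_j \mid H_1 = h_1, \ldots, H_{j-1} = h_{j-1},\, U = c,\, X = x\right),
\]
where $h = (h_1, \ldots, h_J)$ is a capture history over $J$ occasions and $U$ is the latent class.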
Given many different items in a product category, each with its own fixed price point, which subset should a retailer offer to its customers? Assortment optimization describes the process of finding a subset that maximizes average revenue, based on a model for how customers choose between items in that product category. In this paper we ask whether offering an assortment is actually optimal, given the emergence of more sophisticated selling practices, such as offering certain items only through lotteries. To formalize this question, we introduce a mechanism design problem where the items have fixed prices and the seller optimizes over (randomized) allocations, given a distribution for how a buyer ranks the items. Under our formulation, deterministic mechanisms correspond to assortments, while randomized mechanisms correspond to lotteries for selling items with fixed prices. We derive a sufficient condition, based purely on the buyer's preference distribution, that guarantees assortments to be optimal within the larger class of randomized mechanisms. Our sufficient condition captures many preference distributions commonly studied in assortment optimization, including Multinomial Logit (MNL), Markov Chain, a mixture of MNL with an Independent Demand model, and simple cases of Nested Logit. When our condition does not hold, we also bound the suboptimality of assortments compared to lotteries. Finally, two results of independent interest emerge from our paper: an example showing that Nested Logit is not captured by Markov Chain choice models, and a tighter Linear Programming relaxation for assortment optimization.
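For concreteness, the following minimal Python sketch evaluates the expected revenue of an assortment under a standard MNL choice model and finds a revenue-maximizing assortment by brute force; the item names, preference weights, and the normalization of the no-purchase weight to one are illustrative assumptions, not quantities taken from the paper.
\begin{verbatim}
from itertools import combinations

def mnl_revenue(assortment, prices, weights, v0=1.0):
    """Expected revenue of an assortment under MNL; v0 is the no-purchase weight."""
    denom = v0 + sum(weights[i] for i in assortment)
    return sum(prices[i] * weights[i] for i in assortment) / denom

def best_assortment(items, prices, weights, v0=1.0):
    """Brute-force search over all subsets (fine for small item sets)."""
    best, best_rev = frozenset(), 0.0
    for k in range(1, len(items) + 1):
        for S in combinations(items, k):
            rev = mnl_revenue(S, prices, weights, v0)
            if rev > best_rev:
                best, best_rev = frozenset(S), rev
    return best, best_rev

# toy example: three items with fixed prices and MNL preference weights
prices = {"a": 10.0, "b": 8.0, "c": 5.0}
weights = {"a": 0.5, "b": 1.0, "c": 2.0}
print(best_assortment(list(prices), prices, weights))
\end{verbatim}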
This study develops an asymptotic theory for estimating the time-varying characteristics of locally stationary functional time series. We investigate a kernel-based method to estimate the time-varying covariance operator and the time-varying mean function of a locally stationary functional time series. In particular, we derive the convergence rate of the kernel estimator of the covariance operator and the associated eigenvalues and eigenfunctions, and establish a central limit theorem for the kernel-based locally weighted sample mean. As applications of our results, we discuss the prediction of locally stationary functional time series and methods for testing the equality of time-varying mean functions in two functional samples.
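To fix ideas, a minimal sketch of kernel-based local estimators of this kind is given below, for curves observed on a common discretization grid; the Epanechnikov kernel, the rescaled-time convention t/T, and the function names are assumptions made for illustration rather than the paper's exact estimators.
\begin{verbatim}
import numpy as np

def epanechnikov(z):
    return np.where(np.abs(z) <= 1, 0.75 * (1 - z**2), 0.0)

def local_mean(X, u, bandwidth):
    """Kernel-weighted sample mean of a functional time series at rescaled time u.

    X : array of shape (T, n_grid), each row one curve on a common grid.
    u : point in (0, 1) at which the time-varying mean is estimated.
    """
    T = X.shape[0]
    w = epanechnikov((np.arange(1, T + 1) / T - u) / bandwidth)
    return (w[:, None] * X).sum(axis=0) / w.sum()

def local_covariance(X, u, bandwidth):
    """Kernel estimate of the time-varying covariance operator (a matrix on the grid)."""
    T = X.shape[0]
    w = epanechnikov((np.arange(1, T + 1) / T - u) / bandwidth)
    C = X - local_mean(X, u, bandwidth)
    return np.einsum('t,ti,tj->ij', w, C, C) / w.sum()
\end{verbatim}
Eigenvalues and eigenfunctions of the estimated operator can then be obtained from the eigendecomposition of the returned matrix (scaled by the grid spacing).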
The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.
In this paper we model the spread of SARS-CoV-2 in Mexico by introducing a new stochastic approximation constructed from first principles and structured on the basis of a Latent-Infectious-(Recovered or Deceased) (LI(RD)) compartmental approximation. The number of new infections caused by a single infectious individual per unit time (a day) is a Poisson-distributed random variable whose parameter is modulated through a time-dependent weight function. The weight function introduces a time dependence into the average number of new infections and, as we show, this information can be extracted from empirical data, giving the model self-consistency and providing a tool to study periodic patterns encoded in the epidemiological dynamics.
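A minimal discrete-time simulation of the mechanism just described is sketched below; the periodic form chosen for the weight function, the latent and infectious period lengths, and all parameter values are illustrative assumptions rather than quantities estimated from the Mexican data.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def weight(t, base=0.25, period=7.0, amplitude=0.3):
    """Hypothetical weight function: mean number of new infections caused by
    one infectious individual on day t (the periodic form is illustrative)."""
    return base * (1.0 + amplitude * np.cos(2.0 * np.pi * t / period))

def simulate(days=120, latent_days=5, infectious_days=10, i0=10):
    """Stochastic Latent-Infectious-(Recovered or Deceased) sketch."""
    becomes_infectious = np.zeros(days + latent_days + 1, dtype=int)
    removed = np.zeros(days + infectious_days + 1, dtype=int)
    infectious = i0
    removed[infectious_days] += i0  # initial cases removed after their infectious period
    history = []
    for t in range(days):
        # each infectious individual seeds a Poisson(weight(t)) number of latent cases
        new_latent = int(rng.poisson(weight(t), size=infectious).sum()) if infectious else 0
        becomes_infectious[t + latent_days] += new_latent
        removed[t + infectious_days] += becomes_infectious[t]
        infectious += becomes_infectious[t] - removed[t]
        history.append((t, new_latent, infectious))
    return history

print(simulate()[-5:])  # last few days: (day, new latent cases, currently infectious)
\end{verbatim}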
Fairness-aware data mining (FADM) aims to prevent algorithms from discriminating against protected groups. The literature has come to an impasse as to what constitutes explainable variability as opposed to discrimination. This distinction hinges on a rigorous understanding of the role of proxy variables, i.e., those variables that are associated with both the protected feature and the outcome of interest. We demonstrate that fairness is achieved by ensuring impartiality with respect to sensitive characteristics and provide a framework for impartiality by accounting for different perspectives on the data-generating process. In particular, fairness can only be precisely defined in a full-data scenario in which all covariates are observed. We then analyze how these models may be conservatively estimated via regression in partial-data settings. Decomposing the regression estimates provides insights into previously unexplored distinctions between explainable variability and discrimination, which illuminate the use of proxy variables in fairness-aware data mining.
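The role of a proxy can be illustrated with a small simulated contrast between a full-data regression, in which the protected feature is observed, and a partial-data regression, in which it is not; the data-generating coefficients below are purely illustrative and are not drawn from the paper.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# illustrative full-data scenario: A is the protected feature, X a proxy
A = rng.binomial(1, 0.5, n)
X = 1.0 * A + rng.normal(size=n)             # proxy: associated with A
Y = 2.0 * X + 1.5 * A + rng.normal(size=n)   # outcome depends on both

def ols(design, y):
    return np.linalg.lstsq(design, y, rcond=None)[0]

full = ols(np.column_stack([np.ones(n), X, A]), Y)   # full-data regression
part = ols(np.column_stack([np.ones(n), X]), Y)      # partial data: A unobserved

print("proxy coefficient, full data:   ", round(full[1], 3))  # ~2.0
print("proxy coefficient, partial data:", round(part[1], 3))  # ~2.3, absorbs part of A's effect
\end{verbatim}
In the partial-data fit the proxy's coefficient absorbs part of the association with the protected feature, illustrating the kind of entanglement between explainable variability and discrimination that the regression decomposition discussed above addresses.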
A new clustering accuracy measure is proposed to determine the unknown number of clusters and to assess the quality of clustering of a data set given in any dimensional space. Our validity index applies the classical nonparametric univariate kernel density estimation method to the interpoint distances computed between the members of the data set. Being based on interpoint distances, it is free of the curse of dimensionality and therefore efficiently computable in high-dimensional situations where the number of study variables can be larger than the sample size. The proposed measure is compatible with any clustering algorithm and with every kind of data set for which the interpoint distance measure can be defined to have a density function. A simulation study demonstrates its superiority over widely used cluster validity indices such as the average silhouette width and the Dunn index, and its applicability is illustrated on a high-dimensional biostatistical study of the Alon data set and a large astrostatistical application to time series of light curves of new variable stars.
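The basic building block, classical univariate kernel density estimation applied to interpoint distances, can be sketched as follows; the combination of these densities into the proposed validity index is not reproduced here, and the toy data, bandwidth defaults, and function names are illustrative.
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import gaussian_kde

def interpoint_distance_densities(X, labels):
    """Univariate KDEs of within-cluster interpoint distances."""
    densities = {}
    for k in np.unique(labels):
        members = X[labels == k]
        if len(members) < 2:
            continue
        d = pdist(members)              # all pairwise Euclidean distances
        densities[k] = gaussian_kde(d)  # classical univariate KDE
    return densities

# toy usage: two well-separated Gaussian blobs in 50 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 50)), rng.normal(4.0, 1.0, (30, 50))])
labels = np.repeat([0, 1], 30)
densities = interpoint_distance_densities(X, labels)
print({k: float(d.evaluate([10.0])[0]) for k, d in densities.items()})
\end{verbatim}
Because the densities are always estimated on one-dimensional distances, the construction remains computable even when the number of variables exceeds the sample size.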
The entropy is a measure of uncertainty that plays a central role in information theory. When the distribution of the data is unknown, an estimate of the entropy needs to be obtained from the data sample itself. We propose a semi-parametric estimate, based on a mixture model approximation of the distribution of interest. The estimate can rely on any type of mixture, but we focus on Gaussian mixture models to demonstrate its accuracy and versatility. The performance of the proposed approach is assessed through a series of simulation studies. We also illustrate its use on two real-life data examples.
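One natural plug-in variant of such an estimator, which may differ in details from the estimator proposed here, fits a Gaussian mixture to the sample and approximates the entropy $H = -E[\log p(X)]$ by Monte Carlo under the fitted mixture:
\begin{verbatim}
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_entropy_estimate(X, n_components=3, n_mc=10_000, seed=0):
    """Fit a Gaussian mixture to X and estimate H = -E[log p] by Monte Carlo."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    Z, _ = gmm.sample(n_mc)              # draws from the fitted mixture
    return -gmm.score_samples(Z).mean()  # score_samples returns log densities

# sanity check against the known entropy of a standard bivariate normal
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
true_h = 0.5 * 2 * np.log(2 * np.pi * np.e)  # d/2 * log(2*pi*e) for d = 2
print(gmm_entropy_estimate(X), true_h)
\end{verbatim}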
Estimation of heterogeneous treatment effects is an active area of research in causal inference. Most of the existing methods, however, focus on estimating the conditional average treatment effects of a single, binary treatment given a set of pre-treatment covariates. In this paper, we propose a method to estimate the heterogeneous causal effects of high-dimensional treatments, which poses unique challenges in terms of estimation and interpretation. The proposed approach is based on a Bayesian mixture of regularized regressions to identify groups of units who exhibit similar patterns of treatment effects. By directly modeling cluster membership with covariates, the proposed methodology allows one to explore the unit characteristics that are associated with different patterns of treatment effects. Our motivating application is conjoint analysis, which is a popular survey experiment in social science and marketing research and is based on a high-dimensional factorial design. We apply the proposed methodology to the conjoint data, where survey respondents are asked to select one of two immigrant profiles with randomly selected attributes. We find that a group of respondents with a relatively high degree of prejudice appears to discriminate against immigrants from non-European countries like Iraq. An open-source software package is available for implementing the proposed methodology.
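In generic notation (ours, not necessarily the paper's), a Bayesian mixture of regularized regressions with covariate-dependent cluster membership has the form
\[
  y_i \mid z_i = k \sim \mathcal{N}\!\left(x_i^{\top}\beta_k,\ \sigma_k^2\right),
  \qquad
  \Pr(z_i = k \mid w_i) = \frac{\exp(w_i^{\top}\gamma_k)}{\sum_{k'=1}^{K}\exp(w_i^{\top}\gamma_{k'})},
\]
where $x_i$ collects the high-dimensional treatment indicators, $w_i$ the respondent characteristics used to model cluster membership, and shrinkage priors are placed on the coefficient vectors $\beta_k$ to regularize the within-cluster regressions.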
Behavioral science researchers have shown strong interest in disaggregating within-person relations from between-person differences (stable traits) using longitudinal data. In this paper, we propose a method of within-person variability score-based causal inference for estimating joint effects of time-varying continuous treatments by effectively controlling for stable traits. After explaining the assumed data-generating process and providing formal definitions of stable trait factors, within-person variability scores, and joint effects of time-varying treatments at the within-person level, we introduce the proposed method, which consists of a two-step analysis. Within-person variability scores for each person, which are disaggregated from the stable traits of that person, are first calculated using weights based on a best linear correlation preserving predictor through structural equation modeling (SEM). Causal parameters are then estimated via a potential outcome approach, either marginal structural models (MSMs) or structural nested mean models (SNMMs), using the calculated within-person variability scores. Unlike the approach that relies entirely on SEM, the present method does not assume linearity for observed time-varying confounders at the within-person level. We emphasize the use of SNMMs with G-estimation because of their double robustness to model misspecification in how observed time-varying confounders are functionally related to treatments/predictors and outcomes at the within-person level. Through simulation, we show that the proposed method can recover causal parameters well and that causal estimates might be severely biased if one does not properly account for stable traits. An empirical application using data on sleep habits and mental health status from the Tokyo Teen Cohort study is also provided.
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.