Quality of Life (QOL) outcomes are important in the management of chronic illnesses. In studies of the efficacy of treatments or intervention modalities, QOL scales, which are multi-dimensional constructs, are routinely used as primary endpoints. The standard data analysis strategy computes composite (average) overall and domain scores and conducts a mixed-model analysis to evaluate efficacy or monitor medical conditions as if these scores were measured on a continuous metric scale. However, assumptions of parametric models, such as continuity and homoscedasticity, can be violated in many cases. Furthermore, the analysis becomes even more challenging when some of the variables have missing values. In this paper, we propose a purely nonparametric approach in the sense that meaningful, yet nonparametric, effect size measures are developed. We propose estimators for these effect sizes and derive their asymptotic properties. Our methods are shown to be particularly effective in the presence of some form of clustering and/or missing values. Inferential procedures are derived from the asymptotic theory. The Asthma Randomized Trial of Indoor Wood Smoke data are used to illustrate applications of the proposed methods. The data were collected from a three-arm randomized trial that evaluated interventions targeting biomass smoke particulate matter from older-model residential wood stoves in homes with children who have asthma.
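The abstract does not spell out the effect size measure; as a point of reference, the classical nonparametric relative effect $p = P(X<Y) + \frac{1}{2}P(X=Y)$ is the usual starting point for such rank-based measures. Below is a minimal sketch of its midrank estimator, assuming two independent samples (the function name and details are ours, not the paper's):

```python
import numpy as np

def relative_effect(x, y):
    """Nonparametric relative effect p = P(X < Y) + 0.5 * P(X = Y),
    estimated from midranks of the pooled sample (handles ties)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled = np.concatenate([x, y])
    order = pooled.argsort(kind="mergesort")
    ranks = np.empty(nx + ny)
    ranks[order] = np.arange(1, nx + ny + 1)
    for v in np.unique(pooled):          # midranks: average rank over ties
        mask = pooled == v
        ranks[mask] = ranks[mask].mean()
    r_y = ranks[nx:].mean()
    return (r_y - (ny + 1) / 2) / nx     # estimate of P(X<Y) + 0.5 P(X=Y)

# Example: a value above 0.5 means y tends to take larger values than x
rng = np.random.default_rng(0)
print(relative_effect(rng.normal(0, 1, 50), rng.normal(0.5, 2, 60)))
```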
We investigate robust linear regression where data may be contaminated by an oblivious adversary, i.e., an adversary that may know the data distribution but is otherwise oblivious to the realizations of the data samples. This model has previously been analyzed only under strong assumptions. Concretely, $\textbf{(i)}$ all previous works assume that the covariance matrix of the features is positive definite; and $\textbf{(ii)}$ most of them assume that the features are centered (i.e., zero mean). Additionally, all previous works make further restrictive assumptions, e.g., that the features are Gaussian or that the corruptions are symmetrically distributed. In this work we go beyond these assumptions and investigate robust regression under a more general set of conditions: $\textbf{(i)}$ we allow the covariance matrix to be either positive definite or positive semidefinite, $\textbf{(ii)}$ we do not necessarily assume that the features are centered, and $\textbf{(iii)}$ we make no assumptions beyond boundedness (sub-Gaussianity) of the features and measurement noise. Under these assumptions we analyze a natural SGD variant for this problem and show that it enjoys a fast convergence rate when the covariance matrix is positive definite. In the positive semidefinite case we show that there are two regimes: if the features are centered, we can obtain a standard convergence rate; otherwise, the adversary can cause any learner to fail arbitrarily.
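The abstract leaves the SGD variant unspecified; as a hedged illustration of the setting only, here is plain SGD on the absolute loss under oblivious label corruption (the paper's actual update rule may differ):

```python
import numpy as np

def robust_sgd(X, y, steps=20000, lr0=0.5, seed=0):
    """Plain SGD on the absolute (L1) loss for linear regression; a
    stand-in for the paper's SGD variant, which is not specified here."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        r = X[i] @ w - y[i]
        g = np.sign(r) * X[i]           # subgradient of |x_i . w - y_i|
        w -= lr0 / np.sqrt(t) * g       # 1/sqrt(t) step size
    return w

# Oblivious contamination: corruption drawn independently of the sample
rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))             # positive definite covariance here
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)
mask = rng.random(n) < 0.3              # 30% of labels corrupted
y[mask] += rng.standard_cauchy(mask.sum())
print(np.linalg.norm(robust_sgd(X, y) - w_star))
```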
Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs, which leaves some causal directions ambiguous. The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, known as the Linear Non-Gaussian Acyclic Model (LiNGAM). Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG. We provide extensive numerical evidence that our sequential procedure scales to cases with possibly thousands of nodes and works well on high-dimensional data. We also apply our estimation procedure to a single-cell gene expression dataset.
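To make the shape of the sequential procedure concrete, here is a hypothetical sketch: we take the specified noise model to be Laplace, and score residuals by a likelihood ratio of a Laplace fit against a Gaussian fit. The paper's actual scores may differ; the names and details below are ours:

```python
import numpy as np

def laplace_vs_gauss_lr(res):
    """Log-likelihood ratio of residuals under Laplace vs Gaussian MLE fits;
    stands in for the paper's likelihood ratio score."""
    b = np.mean(np.abs(res))                  # Laplace MLE scale
    s = res.std()                             # Gaussian MLE scale
    ll_lap = -len(res) * (np.log(2 * b) + 1)
    ll_gau = -0.5 * len(res) * (np.log(2 * np.pi * s**2) + 1)
    return ll_lap - ll_gau

def sequential_order(X):
    """Greedy causal ordering: at each step, regress every remaining node
    on the already-ordered nodes and append the best-scoring one."""
    n, p = X.shape
    order, rest = [], list(range(p))
    while rest:
        scores = {}
        for j in rest:
            Z = np.column_stack([np.ones(n)] + [X[:, k] for k in order])
            beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
            scores[j] = laplace_vs_gauss_lr(X[:, j] - Z @ beta)
        nxt = max(scores, key=scores.get)
        order.append(nxt)
        rest.remove(nxt)
    return order
```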
Fairly dividing a set of indivisible resources among a set of agents is of utmost importance in some applications. However, after an allocation has been implemented, the preferences of agents might change and envy might arise. We study the following problem to cope with such situations: Given an allocation of indivisible resources to agents with additive utility-based preferences, is it possible to socially donate some of the resources (that is, to remove these resources from the allocation instance) such that the resulting modified allocation is envy-free (up to one good)? We require that the number of deleted resources and/or the resulting utilitarian welfare loss of the allocation be bounded. We conduct a thorough study of the (parameterized) computational complexity of this problem with respect to various natural and problem-specific parameters (e.g., the number of agents, the number of deleted resources, or the maximum number of resources assigned to an agent in the initial allocation) and different preference models, including unary and 0/1-valuations. We obtain a rich set of (parameterized) tractability and intractability results and discover several surprising contrasts, for instance, between the two closely related fairness concepts envy-freeness and envy-freeness up to one good, and between the influence of the parameters bounding the number and the welfare of the deleted resources.
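For concreteness, here is a small sketch of the underlying checks: an EF1 test for additive utilities and a brute-force search over sets of goods to donate. This is only viable for tiny instances; the paper's algorithms and hardness results concern exactly this problem at scale:

```python
from itertools import combinations

def ef1(allocation, utils):
    """Envy-freeness up to one good (EF1) for additive utilities.
    allocation: list of bundles (sets of good ids), one per agent.
    utils[a][g]: agent a's utility for good g."""
    n = len(allocation)
    for a in range(n):
        own = sum(utils[a][g] for g in allocation[a])
        for b in range(n):
            if a == b:
                continue
            other = sum(utils[a][g] for g in allocation[b])
            if own >= other:
                continue
            # envy must vanish after removing some single good from b's bundle
            if not any(own >= other - utils[a][g] for g in allocation[b]):
                return False
    return True

def donatable(allocation, utils, k):
    """Brute force: can deleting at most k goods make the allocation EF1?
    Exponential in k; for illustration only."""
    goods = [(a, g) for a, bundle in enumerate(allocation) for g in bundle]
    for r in range(k + 1):
        for drop in combinations(goods, r):
            mod = [set(b) for b in allocation]
            for a, g in drop:
                mod[a].discard(g)
            if ef1(mod, utils):
                return True
    return False

# Example: two agents, utilities as dicts keyed by good id
utils = [{0: 3, 1: 1, 2: 2}, {0: 1, 1: 3, 2: 2}]
print(donatable([{0}, {1, 2}], utils, k=1))
```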
Classical two-sample permutation tests for equality of distributions have exact size in finite samples, but they fail to control size for testing equality of parameters that summarize each distribution. This paper proposes permutation tests for equality of parameters that are estimated at root-$n$ or slower rates. Our general framework applies to both parametric and nonparametric models, with two samples or one sample split into two subsamples. Our tests have correct size asymptotically while preserving exact size in finite samples when distributions are equal. They have no loss in local-asymptotic power compared to tests that use asymptotic critical values. We propose confidence sets with correct coverage in large samples that also have exact coverage in finite samples if distributions are equal up to a transformation. We apply our theory to four commonly used hypothesis tests of nonparametric functions evaluated at a point. Lastly, simulations show good finite-sample properties, and two empirical examples illustrate our tests in practice.
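A minimal sketch in the spirit of the paper: a two-sample permutation test with a studentized difference of means, for which the permutation distribution is known to yield asymptotically correct size even when only the parameters, not the full distributions, are equal (function names are ours):

```python
import numpy as np

def perm_test(x, y, stat, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a two-sample statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    nx = len(x)
    t_obs = stat(x, y)
    t_perm = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled)
        t_perm[b] = stat(perm[:nx], perm[nx:])
    return np.mean(np.abs(t_perm) >= np.abs(t_obs))

def t_stat(x, y):
    # studentized (Welch-type) difference of means: the studentization is
    # what restores asymptotic size control under unequal distributions
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 80)
y = rng.normal(0, 3, 120)      # equal means, unequal variances
print(perm_test(x, y, t_stat))
```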
A connected and automated vehicle safety metric determines the performance of a subject vehicle (SV) by analyzing data on the interactions among the SV, other dynamic road users, and environmental features. When the data set contains only a finite set of samples collected from the naturalistic mixed-traffic driving environment, a metric is expected to generalize the safety assessment outcome from the observed finite samples to unobserved cases by specifying in what domain the SV is expected to be safe and how safe the SV is, statistically, in that domain. However, to the best of our knowledge, none of the existing safety metrics justify the above properties with an operational-domain-specific, guaranteed complete, and provably unbiased safety evaluation outcome. In this paper, we propose a novel safety metric that uses the $\alpha$-shape and the $\epsilon$-almost robustly forward invariant set to characterize, respectively, the SV's almost-safe operable domain and the probability that the SV remains inside the safe domain indefinitely. The empirical performance of the proposed method is demonstrated in several operational design domains through a series of cases covering a variety of fidelity levels (real-world and simulators), driving environments (highway, urban, and intersections), road users (cars, trucks, and pedestrians), and SV driving behaviors (human drivers and self-driving algorithms).
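As a rough illustration of one ingredient, a 2-D $\alpha$-shape can be approximated by keeping Delaunay triangles of small circumradius; the sketch below is our simplification, not the paper's implementation, and would stand in for characterizing the observed operable domain:

```python
import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_triangles(points, alpha):
    """Keep Delaunay triangles whose circumradius is at most alpha; their
    union approximates the 2-D alpha shape of the point cloud."""
    tri = Delaunay(points)
    kept = []
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        la = np.linalg.norm(b - c)              # side lengths
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                         - (b[1] - a[1]) * (c[0] - a[0]))
        # circumradius R = (product of side lengths) / (4 * area)
        if area > 0 and la * lb * lc / (4.0 * area) <= alpha:
            kept.append(simplex)
    return np.array(kept)

rng = np.random.default_rng(3)
pts = rng.random((200, 2))
print(len(alpha_shape_triangles(pts, alpha=0.15)))
```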
In this work, we study a random orthogonal projection based least squares estimator for the stable solution of a multivariate nonparametric regression (MNPR) problem. More precisely, given an integer $d\geq 1$ corresponding to the dimension of the MNPR problem, a positive integer $N\geq 1$ and a real parameter $\alpha\geq -\frac{1}{2},$ we show that a fairly large class of $d$-variate regression functions is well and stably approximated by its random projection over the orthonormal set of tensor product $d$-variate Jacobi polynomials with parameters $(\alpha,\alpha).$ The associated univariate Jacobi polynomials have degree at most $N$ and their tensor products are orthonormal over $\mathcal U=[0,1]^d,$ with respect to the associated multivariate Jacobi weight. In particular, if we consider $n$ random sampling points $\mathbf X_i$ following the $d$-variate Beta distribution with parameters $(\alpha+1,\alpha+1),$ then we give a relation involving $n, N, \alpha$ that ensures that the resulting $(N+1)^d\times (N+1)^d$ random projection matrix is well conditioned. Moreover, we provide the squared integrated error as well as the $L^2$-risk of this estimator. Precise estimates of these errors are given in the case where the regression function belongs to an isotropic Sobolev space $H^s(\mathcal U),$ with $s> \frac{d}{2}.$ Also, to handle the general and practical case of an unknown distribution of the $\mathbf X_i,$ we use Shepard's scattered interpolation scheme to generate fairly precise approximations of the observed data at $n$ i.i.d. sampling points $\mathbf X_i$ following a $d$-variate Beta distribution. Finally, we illustrate the performance of our proposed multivariate nonparametric estimator through numerical simulations with synthetic as well as real data.
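A sketch of the $\alpha=0$ (Legendre) case for $d=2$: build the orthonormal tensor-product design matrix on $[0,1]^2$ and solve least squares, with sampling points drawn from Beta$(1,1)$, i.e., the uniform distribution. This is a simplified stand-in for the full random projection analysis:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_design(x, N):
    """Orthonormal shifted Legendre polynomials on [0,1] up to degree N;
    the alpha = 0 case of the Jacobi(alpha, alpha) family."""
    t = 2.0 * x - 1.0                        # map [0,1] -> [-1,1]
    V = np.stack([legendre.legval(t, np.eye(N + 1)[k])
                  for k in range(N + 1)], axis=1)
    return V * np.sqrt(2.0 * np.arange(N + 1) + 1.0)

def fit_tensor(X, y, N):
    """Least-squares fit in the tensor-product basis for d = 2."""
    B = np.einsum('ij,ik->ijk',
                  legendre_design(X[:, 0], N),
                  legendre_design(X[:, 1], N)).reshape(len(y), -1)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef

rng = np.random.default_rng(4)
X = rng.beta(1.0, 1.0, size=(500, 2))        # Beta(alpha+1, alpha+1), alpha = 0
y = np.sin(2 * np.pi * X[:, 0]) * X[:, 1] + 0.05 * rng.normal(size=500)
coef = fit_tensor(X, y, N=8)
```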
We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high-dimensional robust statistics problems, which aim at estimating the unknown parameters of the true distribution from adversarially corrupted samples. Prior work focuses on robust mean and covariance estimation when the true distribution lies in the family of Gaussian or elliptical distributions, and analyzes depth- or scoring-rule-based GAN losses for these problems. Our work extends these results to robust mean estimation, second-moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which covers the broad family of sub-Gaussian, sub-exponential, and bounded-moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative distribution function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in prior work.
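To illustrate the smoothing idea in one dimension (the GAN setting also optimizes over projections and discriminator parameters, which this sketch omits), one can replace the hard indicators in the Kolmogorov-Smirnov distance by sigmoids:

```python
import numpy as np

def smoothed_ks(x, y, ts, h=0.1):
    """1-D smoothed Kolmogorov-Smirnov distance: the KS supremum is taken
    over sigmoid 'soft indicators' sigma((t - x)/h) instead of hard ones,
    which a sigmoid-activation discriminator can represent. As h -> 0 this
    recovers the classical KS distance."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    return max(abs(sig((t - x) / h).mean() - sig((t - y) / h).mean())
               for t in ts)

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1, 1000)
y = rng.normal(0.3, 1, 1000)
print(smoothed_ks(x, y, ts=np.linspace(-3, 3, 61)))
```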
A surrogate endpoint S in a clinical trial is an outcome that may be measured earlier or more easily than the true outcome of interest T. In this work, we extend causal inference approaches to validate such a surrogate using potential outcomes. The causal association paradigm assesses the relationship between the treatment effect on the surrogate and the treatment effect on the true endpoint. Using the principal surrogacy criteria, we utilize the joint conditional distribution of the potential outcomes T given the potential outcomes S. In particular, our setting of interest allows us to assume that the surrogate under placebo, S(0), is zero-valued, and we incorporate baseline covariates in the setting of normally distributed endpoints. We develop Bayesian methods to incorporate conditional independence and other modeling assumptions, and explore their impact on the assessment of surrogacy. We demonstrate our approach via simulation and with data that mimic an ongoing study of a muscular dystrophy gene therapy.
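A hedged simulation sketch of the setting described: $S(0) \equiv 0$, normally distributed endpoints, one baseline covariate, and an individual treatment effect on T driven by the effect on S. All coefficients and names are illustrative, not the paper's model:

```python
import numpy as np

def simulate_trial(n=500, beta_s=1.0, beta_t=0.8, seed=6):
    """Simulate potential outcomes with S(0) = 0 and normal endpoints.
    Since S(0) = 0, S(1) equals the effect on the surrogate, and the
    individual effect on T, T(1) - T(0), is driven by S(1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                       # baseline covariate
    s1 = 1.0 + beta_s * x + rng.normal(size=n)   # surrogate under treatment
    t0 = 0.5 * x + rng.normal(size=n)            # true endpoint, placebo
    t1 = t0 + beta_t * s1 + rng.normal(size=n, scale=0.5)
    z = rng.integers(0, 2, size=n)               # randomized assignment
    return x, z, np.where(z == 1, s1, 0.0), np.where(z == 1, t1, t0)
```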
Large margin nearest neighbor (LMNN) is a metric learner that optimizes the performance of the popular $k$NN classifier. However, its resulting metric relies on pre-selected target neighbors. In this paper, we address the feasibility of LMNN's optimization constraints regarding these target points and introduce a mathematical measure of the size of the feasible region of the optimization problem. We enhance the optimization framework of LMNN with a weighting scheme that prefers data triplets yielding a larger feasible region. This increases the chances of obtaining a good metric as the solution of LMNN's problem. We evaluate the performance of the resulting feasibility-based LMNN algorithm on synthetic and real datasets. The empirical results show improved accuracy across different types of datasets in comparison to regular LMNN.
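As a sketch of where such weights would enter, here is the standard LMNN objective with per-triplet weights added; this is our simplified rendering, and the paper's feasibility-based scheme defines how the weights are actually chosen:

```python
import numpy as np

def weighted_lmnn_loss(L, X, triplets, weights, mu=0.5, margin=1.0):
    """LMNN loss with per-triplet weights. Each triplet (i, j, k) pairs a
    target neighbor j (same class as i) with an impostor k (different
    class); larger weights would favor triplets with a large feasible
    region, per the paper."""
    XL = X @ L.T                              # metric d(a,b) = ||L a - L b||^2
    d = lambda a, b: np.sum((XL[a] - XL[b]) ** 2)
    pull = sum(w * d(i, j) for (i, j, k), w in zip(triplets, weights))
    push = sum(w * max(0.0, margin + d(i, j) - d(i, k))
               for (i, j, k), w in zip(triplets, weights))
    return (1 - mu) * pull + mu * push
```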
In this paper we introduce a covariance framework for the analysis of EEG and MEG data that takes into account observed temporal stationarity on small time scales as well as trial-to-trial variations. We formulate a model for the covariance matrix as a Kronecker product of three components corresponding to space, time, and epochs/trials, and consider maximum likelihood estimation of the unknown parameter values. An iterative algorithm that finds approximations of the maximum likelihood estimates is proposed. We perform a simulation study to assess the performance of the estimator and investigate the influence of different assumptions about the covariance factors on the estimated covariance matrix and on its components. In addition, we illustrate our method on real EEG and MEG data sets. The proposed covariance model is applicable in a variety of cases where spontaneous EEG or MEG acts as a source of noise and realistic noise covariance estimates are needed for accurate dipole localization, such as in evoked activity studies, or where the properties of spontaneous EEG or MEG are themselves the topic of interest, such as in combined EEG/fMRI experiments in which the correlation between EEG and fMRI signals is investigated.
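For the two-factor (space $\otimes$ time) special case with i.i.d. trials, the classical "flip-flop" iterations give the flavor of such an iterative maximum likelihood algorithm; the paper's three-factor model with an epoch/trial component is more general than this sketch:

```python
import numpy as np

def flip_flop(trials, n_iter=50):
    """Maximum likelihood 'flip-flop' updates for a Kronecker covariance
    Sigma = S_space kron S_time, with i.i.d. zero-mean trials.
    trials: array of shape (n_trials, n_space, n_time)."""
    n, p, q = trials.shape
    S_t = np.eye(q)                                 # temporal factor
    for _ in range(n_iter):
        inv_t = np.linalg.inv(S_t)
        S_s = sum(X @ inv_t @ X.T for X in trials) / (n * q)
        inv_s = np.linalg.inv(S_s)
        S_t = sum(X.T @ inv_s @ X for X in trials) / (n * p)
    c = S_t[0, 0]                                   # fix scale ambiguity:
    return S_s * c, S_t / c                         # (cA) kron (B/c) = A kron B

rng = np.random.default_rng(7)
S_space, S_time = flip_flop(rng.normal(size=(40, 8, 16)))
```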