Health outcomes, such as body mass index and cholesterol levels, are known to depend on age, and their associations with risk factors vary with age. In this paper, we propose a novel framework for dynamic modeling of the associations between health outcomes and risk factors using varying-coefficient (VC) regional quantile regression via K-nearest neighbors (KNN) fused Lasso, which captures the age-varying effects of the risk factors. The proposed method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To solve the resulting optimization problem efficiently, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results demonstrate the efficacy of the proposed method in capturing the complex age-dependent associations between health outcomes and their risk factors.
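As a rough illustration of the penalty structure behind this estimator (not the paper's ADMM algorithm), the sketch below fits an observation-level varying coefficient by minimizing a quantile (pinball) loss plus a fused-lasso penalty over a KNN graph on age, using the generic convex solver cvxpy; the data and tuning values are synthetic placeholders.

```python
# Minimal sketch, assuming a scalar risk factor and a KNN graph built on age;
# a generic convex-solver formulation, not the paper's ADMM solver.
import numpy as np
import cvxpy as cp
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 200
age = np.sort(rng.uniform(20, 80, n))
x = rng.normal(size=n)                       # one risk factor
beta_true = np.where(age < 50, 0.5, 1.5)     # piecewise-constant age effect
y = beta_true * x + rng.normal(scale=0.5, size=n)

tau, lam, k = 0.5, 2.0, 5                    # quantile level, penalty, neighbors
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(age.reshape(-1, 1))
_, idx = nbrs.kneighbors(age.reshape(-1, 1))
edges = {(min(i, j), max(i, j)) for i in range(n) for j in idx[i, 1:]}

beta = cp.Variable(n)                        # one coefficient per observation
resid = y - cp.multiply(x, beta)
check_loss = cp.sum(0.5 * cp.abs(resid) + (tau - 0.5) * resid)   # pinball loss
fusion = cp.sum(cp.abs(cp.hstack([beta[i] - beta[j] for i, j in edges])))
cp.Problem(cp.Minimize(check_loss + lam * fusion)).solve()
beta_hat = beta.value                        # estimated age-varying coefficient
```

Piecewise-constant stretches in `beta_hat` then correspond to clusters of ages sharing the same effect, which is the clustered-pattern recovery referred to above.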
We propose a linear algebraic method, rooted in the spectral properties of graphs, for proving lower bounds in communication complexity. Our proof technique effectively marries spectral bounds with information-theoretic inequalities. The key insight is the observation that, in specific settings, even when the data sets $X$ and $Y$ are closely correlated and have high mutual information, the owner of $X$ cannot convey a reasonably short message that retains substantial mutual information with $Y$. In essence, from the perspective of the owner of $Y$, any sufficiently brief message $m=m(X)$ would appear nearly indistinguishable from a random bit sequence. We apply this argument to several problems in communication complexity. Our main result concerns cryptographic protocols. We establish a lower bound on the communication complexity of multi-party secret key agreement with unconditional, i.e., information-theoretic, security. Specifically, for one-round protocols (the simultaneous messages model) of secret key agreement with three participants, we obtain an asymptotically tight lower bound. This bound implies the optimality of the previously known omniscience communication protocol (this result applies to non-interactive secret key agreement with three parties and input data sets with an arbitrary symmetric information profile). We consider communication problems in one-shot scenarios, where the parties' inputs are not produced by any i.i.d. sources and there are no ergodicity assumptions on the input data. In this setting, we find it natural to present our results within the framework of Kolmogorov complexity.
Auxiliary data sources have become increasingly important in epidemiological surveillance, as they are often available at a finer spatial and temporal resolution, with larger coverage and lower latency, than traditional surveillance signals. We describe the problem of spatial and temporal heterogeneity in signals derived from these data sources, i.e., the presence of spatial and/or temporal biases. We present a method that uses a ``guiding'' signal to correct for these biases and produce a more reliable signal that can be used for modeling and forecasting. The method assumes that the heterogeneity can be approximated by a low-rank matrix and that the temporal heterogeneity is smooth over time. We also present a hyperparameter selection algorithm to choose the parameters representing the matrix rank and the degree of temporal smoothness of the corrections. In the absence of ground truth, we use maps and plots to argue that this method does indeed reduce heterogeneity. Reducing heterogeneity in auxiliary data sources greatly increases their utility in modeling and forecasting epidemics.
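A minimal sketch of one plausible instance of such a correction, assuming the auxiliary and guiding signals are arranged as locations-by-times matrices on a common scale; the hard low-rank truncation and moving-average smoothing below are illustrative stand-ins for the paper's estimator and its hyperparameters.

```python
# Sketch (assumed form of the correction, not the paper's estimator):
# approximate the discrepancy between the auxiliary and guiding signals by a
# low-rank matrix with temporally smoothed factors, then subtract it.
import numpy as np
from scipy.ndimage import uniform_filter1d

def correct(aux, guide, rank=2, window=7):
    """aux, guide: (n_locations, n_times) arrays on a common scale."""
    U, s, Vt = np.linalg.svd(aux - guide, full_matrices=False)
    Vt_smooth = uniform_filter1d(Vt[:rank], size=window, axis=1)  # smooth in time
    bias = (U[:, :rank] * s[:rank]) @ Vt_smooth                   # low-rank bias
    return aux - bias
```

Here `rank` and `window` play the roles of the matrix rank and temporal smoothness hyperparameters chosen by the selection algorithm.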
Langevin dynamics are widely used for sampling from high-dimensional, non-Gaussian distributions whose densities are known up to a normalizing constant. In particular, there is strong interest in unadjusted Langevin algorithms (ULA), which directly discretize Langevin dynamics to estimate expectations under the target distribution. We study the use of transport maps that approximately normalize a target distribution as a way to precondition and accelerate the convergence of Langevin dynamics. We show that, in continuous time, applying a transport map to Langevin dynamics yields a Riemannian manifold Langevin dynamics (RMLD) whose metric is defined by the transport map. We also show that applying a transport map to an irreversibly perturbed ULA results in a geometry-informed irreversible perturbation (GiIrr) of the original dynamics. These connections suggest more systematic ways of learning metrics and perturbations, and also yield alternative discretizations of the RMLD described by the map, which we study. Under appropriate conditions, these discretized processes can be endowed with non-asymptotic bounds describing convergence to the target distribution in 2-Wasserstein distance. Illustrative numerical results complement our theoretical claims.
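To make the preconditioning concrete, here is a small sketch of ULA run in the reference space of an assumed affine transport map $T(z) = \mu + Lz$ (e.g., obtained from a Gaussian approximation of the target), with samples mapped back through $T$; `grad_log_pi` is a user-supplied score function, and this is only one of the possible discretizations, not the paper's preferred scheme.

```python
# Sketch of transport-map-preconditioned ULA under an assumed affine map
# T(z) = mu + L z: run ULA on the pulled-back target in reference space,
# then push the iterates back through T.
import numpy as np

def preconditioned_ula(grad_log_pi, mu, L, d, n_steps=5000, h=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    samples = np.empty((n_steps, d))
    for k in range(n_steps):
        x = mu + L @ z                           # map to target space
        grad_z = L.T @ grad_log_pi(x)            # chain rule: pulled-back gradient
        z = z + h * grad_z + np.sqrt(2 * h) * rng.standard_normal(d)
        samples[k] = mu + L @ z                  # approximate draws from the target
    return samples
```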
Flexible estimation of the mean outcome under a treatment regimen (i.e., the value function) is a key step toward personalized medicine. We define our target parameter as a conditional value function given a set of baseline covariates, which we refer to as a stratum-based value function. We focus on a semiparametric class of decision rules and propose a sieve-based, nonparametric, covariate-adjusted regimen-response curve estimator within that class. Our work contributes in several ways. First, we propose an inverse-probability-weighted, nonparametrically efficient estimator of the smoothed regimen-response curve function. We show that asymptotic linearity is achieved when the nuisance functions are sufficiently undersmoothed, and we propose asymptotic and finite-sample criteria for undersmoothing. Second, using Gaussian process theory, we propose simultaneous confidence intervals for the smoothed regimen-response curve function. Third, we establish consistency and a convergence rate for the optimizer of the regimen-response curve estimator, which enables us to estimate an optimal semiparametric rule. The latter is important because the optimizer corresponds to the optimal dynamic treatment regimen. Finite-sample properties are explored with simulations.
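For orientation, the following sketch shows a plain inverse-probability-weighted estimate of the value of a fixed binary rule; the sieve-based, undersmoothed regimen-response curve estimator and its simultaneous confidence intervals are not reproduced here, and `rule` and `propensity` are user-supplied placeholders.

```python
# Illustrative IPW estimate of the value of a binary decision rule
# (a simplified stand-in for the covariate-adjusted estimator above).
import numpy as np

def ipw_value(y, a, x, rule, propensity):
    """y: outcomes, a: binary treatments, x: covariates,
    rule(x) -> recommended treatment, propensity(x) -> P(A=1 | X=x)."""
    d = rule(x)
    p_a = np.where(a == 1, propensity(x), 1 - propensity(x))  # P(A = a | X)
    return np.mean((a == d) * y / p_a)                         # estimated value
```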
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish identification conditions for the various effects and non-asymptotic error bounds, and we show effective Type-I error control of asymptotic $z$-tests as the sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method, combined with the Benjamini-Hochberg procedure, controls the false discovery rate and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting for confounding effects when significant covariates are absent from the model.
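The false discovery rate control referred to above relies on the standard Benjamini-Hochberg step-up procedure; a textbook implementation applied to two-sided $z$-test p-values is sketched below.

```python
# Standard Benjamini-Hochberg step-up procedure on two-sided z-test p-values
# (the generic FDR step, not the paper's bias-corrected test statistics).
import numpy as np
from scipy.stats import norm

def bh_reject(z_scores, alpha=0.05):
    p = 2 * norm.sf(np.abs(z_scores))               # two-sided p-values
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m        # step-up thresholds
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                        # reject the k smallest p-values
    return reject
```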
The g-formula can be used to estimate the treatment effect while accounting for confounding bias in observational studies. With regard to time-to-event endpoints, possibly subject to competing risks, the construction of valid pointwise confidence intervals and time-simultaneous confidence bands for the causal risk difference is complicated, however. A convenient solution is to approximate the asymptotic distribution of the corresponding stochastic process by means of resampling approaches. In this paper, we consider three different resampling methods: the classical nonparametric bootstrap, an influence-function-based resampling approach, and a martingale-based wild bootstrap. We set up a simulation study to compare the accuracy of the different techniques, which reveals that the wild bootstrap should in general be preferred if the sample size is moderate and sufficient data on the event of interest have been accrued. For illustration, the three resampling methods are applied to data on long-term survival in patients with early-stage Hodgkin's disease.
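As a point of reference, the simplest of the three approaches, the classical nonparametric bootstrap, can be sketched as follows for a pointwise percentile interval; `risk_difference(data, t)` stands in for the g-formula estimate of the causal risk difference at time `t` and is not implemented here.

```python
# Minimal sketch of the classical nonparametric bootstrap for a pointwise CI
# of a g-formula risk difference at a fixed time t; `data` is assumed to be a
# pandas DataFrame with one row per subject.
import numpy as np

def bootstrap_ci(data, t, risk_difference, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([
        risk_difference(data.iloc[rng.integers(0, n, n)], t)   # resample subjects
        for _ in range(n_boot)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])       # percentile CI
```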
Citation metrics are the best tools for research assessment. However, current metrics may be misleading in research systems that simultaneously pursue different goals, such as the advance of science and incremental innovations, because the corresponding publications have different citation distributions. We estimate the contribution to the progress of knowledge by studying only a limited number of the most cited papers, which are dominated by publications pursuing this progress. To field-normalize the metric, we substitute the number of citations with the rank position of a country's papers in the global list of papers. Using synthetic series of lognormally distributed numbers, we develop the Rk-index, which is calculated from the global ranks of the 10 highest numbers in each series, and demonstrate its equivalence to the number of papers in top percentiles, P_top0.1% and P_top0.01%. In real cases, the Rk-index is simple and easy to calculate, and it evaluates the contribution to the progress of knowledge better than less stringent metrics. Although further research is needed, rank analysis of the most cited papers is a promising approach for research evaluation. We also demonstrate that, for this purpose, domestic and collaborative papers should be studied independently.
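The rank-extraction step behind such an index can be sketched as below: a country's 10 most-cited papers are located in the merged global citation list and their global rank positions recorded. The reciprocal-rank aggregation in the last line is only a hypothetical placeholder and is not the Rk-index formula itself.

```python
# Sketch of the rank-extraction step; the final aggregation below is a
# hypothetical placeholder, not the Rk-index definition from the paper.
import numpy as np

def global_top10_ranks(country_citations, world_citations):
    top10 = np.sort(country_citations)[::-1][:10]             # country's 10 most cited
    # global rank = 1 + number of papers worldwide with strictly more citations
    return np.array([1 + np.sum(world_citations > c) for c in top10])

def rank_index(country_citations, world_citations):
    ranks = global_top10_ranks(country_citations, world_citations)
    return float(np.sum(1.0 / ranks))                          # placeholder aggregation
```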
This note presents a refined local approximation for the logarithm of the ratio between the negative multinomial probability mass function and a multivariate normal density, both having the same mean-covariance structure. This approximation, which is derived using Stirling's formula and a meticulous treatment of Taylor expansions, yields an upper bound on the Hellinger distance between the jittered negative multinomial distribution and the corresponding multivariate normal distribution. Upper bounds on the Le Cam distance between negative multinomial and multivariate normal experiments ensue.
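For reference, the standard notions underlying these bounds (not specific to this note) are the squared Hellinger distance $H^2(P,Q) = 1 - \int \sqrt{\mathrm{d}P\,\mathrm{d}Q}$, its relation to total variation, $\mathrm{TV}(P,Q) \le H(P,Q)\sqrt{2 - H^2(P,Q)}$, and the fact that, for experiments indexed by the same parameter and defined on a common sample space, the Le Cam distance satisfies $\Delta(\mathcal{E},\mathcal{F}) \le \sup_{\theta} \mathrm{TV}(P_\theta, Q_\theta)$; the note's specific local approximation is not restated here.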
In causal inference studies, interest often lies in understanding the mechanisms through which a treatment affects an outcome. One approach is principal stratification (PS), which introduces well-defined causal effects in the presence of confounded post-treatment variables, or mediators, and clearly states the assumptions required for identification and estimation of those effects. The goal of this paper is to extend the PS framework to studies with continuous treatments and continuous post-treatment variables, which introduces a number of unique challenges in terms of both defining causal effects and performing inference. This manuscript provides three key methodological contributions: 1) we introduce novel principal estimands for continuous treatments that provide valuable insights into different causal mechanisms; 2) we utilize Bayesian nonparametric approaches, based on both Gaussian processes and Dirichlet process mixtures, to model the joint distribution of the potential mediating variables, ensuring that our approach is robust to model misspecification; and 3) we provide theoretical and numerical justification for utilizing a model for the potential outcomes to identify the joint distribution of the potential mediating variables. Lastly, we apply our methodology to a novel study of the relationship between the economy and arrest rates, and how this relationship is potentially mediated by police capacity.
We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate a model's dependence on each modality, we compute the gain in accuracy when the model has access to that modality in addition to another one; we refer to this gain as the conditional utilization rate. In our experiments, we consistently observe an imbalance in conditional utilization rates between modalities across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm that balances the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
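In code, the conditional utilization rate described above is simply an accuracy difference; in the sketch below, `evaluate` is a hypothetical helper returning validation accuracy for a model restricted to the listed modalities (e.g., with the remaining inputs masked out).

```python
# Conditional utilization rate u(m | m'): accuracy gain from adding modality m
# on top of modality m'. `evaluate(modalities)` is a hypothetical helper that
# returns validation accuracy using only the given modalities.
def conditional_utilization(evaluate, modality, other):
    return evaluate([modality, other]) - evaluate([other])

# Example usage: comparing u(audio | video) with u(video | audio) reveals
# which modality a multi-modal model actually relies on.
# u_audio = conditional_utilization(evaluate, "audio", "video")
# u_video = conditional_utilization(evaluate, "video", "audio")
```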