Irregular visit times in longitudinal studies can jeopardise marginal regression analyses of longitudinal data by introducing selection bias when the visit and outcome processes are associated. Inverse intensity weighting is a useful approach to addressing such selection bias when the visiting at random assumption is satisfied, i.e., visiting at time $t$ is independent of the longitudinal outcome at $t$, given the observed covariate and outcome histories up to $t$. However, the visiting at random assumption is unverifiable from the observed data, and informative visit times often arise in practice, e.g., when patients' visits to clinics are driven by ongoing disease activity. Therefore, it is necessary to perform sensitivity analyses for inverse intensity weighted estimators (IIWEs) when the visit times are likely informative. However, research on such sensitivity analyses is limited in the literature. In this paper, we propose a new sensitivity analysis approach to accommodating informative visit times in marginal regression analysis of irregular longitudinal data. Our sensitivity analysis is anchored at the visiting at random assumption and can be easily applied to existing IIWEs using standard software such as the coxph function of the R package survival. Moreover, we develop novel balancing weights estimators of regression coefficients by exactly balancing the covariate distributions that drive the visit and outcome processes to remove the selection bias after weighting. Simulations show that, under both correct and incorrect model specifications, our balancing weights estimators perform better than the existing IIWEs using weights estimated by maximum partial likelihood. We apply our methods to data from a clinic-based cohort study of psoriatic arthritis and provide an R Markdown tutorial to demonstrate their implementation.
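To make the idea of inverse intensity weighting concrete, the following minimal Python sketch compares an unweighted mean with an inverse-intensity-weighted mean on a toy data set. The function names, the outcome values, and the visit intensities are all hypothetical; in practice the intensities would be estimated from a visit-process model (for example with coxph) rather than supplied by hand, and this sketch does not implement the proposed sensitivity analysis or balancing weights.

```python
def naive_mean(outcomes):
    """Unweighted mean over all observed visits."""
    return sum(outcomes) / len(outcomes)


def iiw_mean(outcomes, intensities):
    """Weighted mean where each visit is weighted by 1 / (visit intensity)."""
    weights = [1.0 / lam for lam in intensities]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical toy data: a patient with high disease activity (outcome 10)
# visits three times as often as a patient with low activity (outcome 0).
outcomes = [10.0, 10.0, 10.0, 0.0]
intensities = [3.0, 3.0, 3.0, 1.0]  # assumed known here, estimated in practice

print(naive_mean(outcomes))             # over-represents the frequent visitor
print(iiw_mean(outcomes, intensities))  # down-weights the frequent visitor
```

In this toy setting, weighting each visit by the inverse of its intensity restores the equal contribution of the two patients, so the weighted mean equals the average of their outcomes.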
In the metric distortion problem there is a set of candidates and a set of voters, all residing in the same metric space. The objective is to choose a candidate with minimum social cost, defined as the total distance of the chosen candidate from all voters. The challenge is that the algorithm receives only ordinal input from each voter, in the form of a ranked list of candidates in non-decreasing order of their distances from her, whereas the objective function is cardinal. The distortion of an algorithm is its worst-case approximation factor with respect to the optimal social cost. A series of papers culminated in a 3-distortion algorithm, which is tight with respect to all deterministic algorithms. Aiming to overcome the limitations of worst-case analysis, we revisit the metric distortion problem through the learning-augmented framework, where the algorithm is provided with some prediction regarding the optimal candidate. The quality of this prediction is unknown, and the goal is to evaluate the performance of the algorithm under an accurate prediction (known as consistency), while simultaneously providing worst-case guarantees even for arbitrarily inaccurate predictions (known as robustness). For our main result, we characterize the robustness-consistency Pareto frontier for the metric distortion problem. We first identify an inevitable trade-off between robustness and consistency. We then devise a family of learning-augmented algorithms that achieves any desired robustness-consistency pair on this Pareto frontier. Furthermore, we provide a more refined analysis of the distortion bounds as a function of the prediction error (with consistency and robustness being two extremes). Finally, we also prove distortion bounds that integrate the notion of $\alpha$-decisiveness, which quantifies the extent to which a voter prefers her favorite candidate relative to the rest.
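The definitions of social cost and ordinal input can be illustrated with a small Python sketch. The instance below (points on a line, candidate names, helper names) is hypothetical, and the ratio computed is the approximation factor on this single instance only; the distortion of an algorithm is the worst case of this ratio over all instances consistent with the ballots.

```python
def social_cost(position, voters):
    """Total distance from a candidate's position to all voters (line metric)."""
    return sum(abs(position - v) for v in voters)


def ranking(voter, candidates):
    """Ordinal ballot: candidate names in non-decreasing order of distance."""
    return sorted(candidates, key=lambda name: abs(candidates[name] - voter))


def approximation_ratio(chosen, candidates, voters):
    """Social cost of the chosen candidate divided by the optimal social cost."""
    best = min(social_cost(pos, voters) for pos in candidates.values())
    return social_cost(candidates[chosen], voters) / best


# Hypothetical instance: two candidates and three voters on the real line.
candidates = {"a": 0.0, "b": 1.0}
voters = [0.4, 0.4, 1.0]

# The algorithm only sees these rankings, never the positions.
ballots = [ranking(v, candidates) for v in voters]
print(ballots)
print(approximation_ratio("a", candidates, voters))
print(approximation_ratio("b", candidates, voters))
```

Here candidate "a" is ranked first by two of the three voters, yet "b" has the lower social cost, which is the kind of gap between ordinal and cardinal information that distortion measures.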
In uncertainty quantification, variance-based global sensitivity analysis quantitatively determines the effect of each input random variable on the output by partitioning the total output variance into contributions from each input. However, computing conditional expectations can be prohibitively costly when working with expensive-to-evaluate models. Surrogate models can accelerate this, yet their accuracy depends on the quality and quantity of training data, which is expensive to generate (experimentally or computationally) for complex engineering systems. Thus, methods that work with limited data are desirable. We propose a diffeomorphic modulation under observable response preserving homotopy (D-MORPH) regression to train a polynomial dimensional decomposition surrogate of the output that minimizes the amount of training data required. The new method first computes a sparse Lasso solution and uses it to define the cost function. A subsequent D-MORPH regression minimizes the difference between the D-MORPH and Lasso solutions. The resulting D-MORPH surrogate is more robust to input variations and more accurate with limited training data. We illustrate the accuracy and computational efficiency of the new surrogate for global sensitivity analysis using mathematical functions and an expensive-to-simulate model of char combustion. The new method is highly efficient, requiring only 15% of the training data compared to conventional regression.
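As an illustration of partitioning output variance into input contributions, the sketch below estimates first-order sensitivity indices with a plain Monte Carlo pick-freeze estimator. This is a generic sampling estimator, not the D-MORPH surrogate approach described above; the function name, the test model, and the uniform input distribution are assumptions made for the example.

```python
import random


def first_order_sobol(model, dim, n_samples=20000, seed=0):
    """Estimate first-order indices for independent Uniform(0, 1) inputs."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    f_a = [model(x) for x in a]
    f_b = [model(x) for x in b]

    # Total output variance from the pooled evaluations.
    pooled = f_a + f_b
    mean = sum(pooled) / len(pooled)
    total_var = sum((y - mean) ** 2 for y in pooled) / (len(pooled) - 1)

    indices = []
    for i in range(dim):
        # Evaluate on sample A with its i-th coordinate replaced by B's.
        f_ab = [model(xa[:i] + [xb[i]] + xa[i + 1:]) for xa, xb in zip(a, b)]
        # Partial variance: mean of f(B) * (f(A_B^i) - f(A)).
        v_i = sum(fb * (fab - fa) for fa, fb, fab in zip(f_a, f_b, f_ab))
        indices.append(v_i / n_samples / total_var)
    return indices


# Hypothetical model Y = X1 + 2 * X2, whose exact indices are 0.2 and 0.8.
indices = first_order_sobol(lambda x: x[0] + 2.0 * x[1], dim=2)
print(indices)
```

The estimator needs many model evaluations per input, which is the cost that motivates replacing an expensive model with a surrogate.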
We describe a simple deterministic near-linear time approximation scheme for uncapacitated minimum cost flow in undirected graphs with real edge weights, a problem also known as transshipment. Specifically, our algorithm takes as input a (connected) undirected graph $G = (V, E)$, vertex demands $b \in \mathbb{R}^V$ such that $\sum_{v \in V} b(v) = 0$, positive edge costs $c \in \mathbb{R}_{>0}^E$, and a parameter $\varepsilon > 0$. In $O(\varepsilon^{-2} m \log^{O(1)} n)$ time, it returns a flow $f$ such that the net flow out of each vertex is equal to the vertex's demand and the cost of the flow is within a $(1 + \varepsilon)$ factor of optimal. Our algorithm is combinatorial and has no running time dependency on the demands or edge costs. With the exception of a recent result presented at STOC 2022 for polynomially bounded edge weights, all almost- and near-linear time approximation schemes for transshipment relied on randomization in two main ways: 1) to embed the problem instance into low-dimensional space and 2) to randomly pick target locations to send flow so nearby opposing demands can be satisfied. Our algorithm instead deterministically approximates the cost of routing decisions that would be made if the input were subject to a random tree embedding. To avoid computing the $\Omega(n^2)$ vertex-vertex distances that an approximation of this kind suggests, we also limit the available routing decisions using distances explicitly stored in the well-known Thorup-Zwick distance oracle.
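The feasibility and cost conditions in the problem statement can be checked directly. The following Python sketch is only a verifier for a given flow, not the approximation scheme itself; the edge orientation convention, the helper names, and the small example graph are assumptions.

```python
def net_outflow(num_vertices, edges, flow):
    """Net flow out of each vertex; flow[i] > 0 means u -> v on edges[i] = (u, v)."""
    out = [0.0] * num_vertices
    for (u, v), f in zip(edges, flow):
        out[u] += f
        out[v] -= f
    return out


def is_feasible(num_vertices, edges, flow, demands, tol=1e-9):
    """True if the net flow out of every vertex equals its demand b(v)."""
    out = net_outflow(num_vertices, edges, flow)
    return all(abs(o - b) <= tol for o, b in zip(out, demands))


def flow_cost(costs, flow):
    """Cost of an undirected flow: sum over edges of c(e) * |f(e)|."""
    return sum(c * abs(f) for c, f in zip(costs, flow))


# Hypothetical instance: a triangle with a cheap two-edge path and a costly edge.
edges = [(0, 1), (1, 2), (0, 2)]
costs = [1.0, 1.0, 3.0]
demands = [2.0, 0.0, -2.0]  # demands sum to zero

via_path = [2.0, 2.0, 0.0]  # route both units along 0 -> 1 -> 2
direct = [0.0, 0.0, 2.0]    # route both units along the edge 0 -> 2

print(is_feasible(3, edges, via_path, demands), flow_cost(costs, via_path))
print(is_feasible(3, edges, direct, demands), flow_cost(costs, direct))
```

Both flows satisfy the demands, but the direct flow costs 1.5 times the path flow, so it would count as a $(1 + \varepsilon)$-approximation only for $\varepsilon \geq 0.5$.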
The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
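A threshold rule of this kind can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration rather than the proposed strategy: it scores each incoming point by the quantity $x^\top M^{-1} x$ for a ridge-regularized information matrix $M$, queries the label when the score exceeds a fixed threshold, and updates $M^{-1}$ with the Sherman-Morrison formula. The class name, the score, the threshold value, and the stream are all hypothetical.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class StreamSelector:
    """Query a label only when the point's informativeness exceeds a threshold."""

    def __init__(self, dim, threshold, ridge=1.0):
        self.threshold = threshold
        # Inverse of the initial information matrix ridge * I.
        self.inv_info = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                         for i in range(dim)]

    def informativeness(self, x):
        """Score x^T M^{-1} x, proportional to the prediction variance at x."""
        return dot(x, mat_vec(self.inv_info, x))

    def offer(self, x):
        """Return True (query the label) or False (discard the instance)."""
        score = self.informativeness(x)
        if score <= self.threshold:
            return False
        # Sherman-Morrison update of M^{-1} after adding x x^T to M.
        mx = mat_vec(self.inv_info, x)
        denom = 1.0 + score
        n = len(x)
        self.inv_info = [[self.inv_info[i][j] - mx[i] * mx[j] / denom
                          for j in range(n)] for i in range(n)]
        return True


# Hypothetical stream: the direction [1, 0] becomes less informative over time.
selector = StreamSelector(dim=2, threshold=0.4)
stream = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
decisions = [selector.offer(x) for x in stream]
print(decisions)
```

The decision for each instance is made immediately and depends only on the labels already collected, which matches the stream-based setting in which discarded instances cannot be revisited.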
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality no longer holds and we suggest two zero-estimators that improve the naive estimator although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
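The way a zero-estimator can reduce variance without introducing bias can be seen from a short, generic calculation. The notation below ($T$ for an initial estimator, $Z$ for a single zero-estimator, $c$ for a constant) is introduced here for illustration and is not taken from the thesis, whose optimal improvement may use a richer family of zero-estimators.

```latex
% Assumed notation: T is an initial estimator, Z a zero-estimator with E[Z] = 0.
\begin{align*}
  T_c &= T - cZ, \qquad \mathbb{E}[T_c] = \mathbb{E}[T], \\
  \operatorname{Var}(T_c) &= \operatorname{Var}(T) - 2c\,\operatorname{Cov}(T, Z) + c^2 \operatorname{Var}(Z), \\
  c^* &= \frac{\operatorname{Cov}(T, Z)}{\operatorname{Var}(Z)}, \qquad
  \operatorname{Var}(T_{c^*}) = \operatorname{Var}(T) - \frac{\operatorname{Cov}(T, Z)^2}{\operatorname{Var}(Z)}.
\end{align*}
```

The variance never increases at $c^*$, and it strictly decreases whenever $T$ and $Z$ are correlated. In practice $c^*$ is unknown and must itself be estimated or approximated.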
For predictive modeling relying on Bayesian inversion, fully independent, or ``mean-field'', Gaussian distributions are often used as approximate probability density functions in variational inference since the number of variational parameters is twice the number of unknown model parameters. The resulting diagonal covariance structure coupled with unimodal behavior can be too restrictive when dealing with highly non-Gaussian behavior, including multimodality. High-fidelity surrogate posteriors in the form of Gaussian mixtures can capture any distribution to an arbitrary degree of accuracy while maintaining some analytical tractability. Variational inference with Gaussian mixtures with full-covariance structures suffers from a quadratic growth in variational parameters with the number of model parameters. Coupled with the existence of multiple local minima due to nonconvex trends in the loss functions often associated with variational inference, these challenges motivate the need for robust initialization procedures to improve the performance and scalability of variational inference with mixture models. In this work, we propose a method for constructing an initial Gaussian mixture model approximation that can be used to warm-start the iterative solvers for variational inference. The procedure begins with an optimization stage in model parameter space in which local gradient-based optimization, globalized through multistart, is used to determine a set of local maxima, which we take to approximate the mixture component centers. Around each mode, a local Gaussian approximation is constructed via the Laplace method. Finally, the mixture weights are determined through constrained least squares regression. Robustness and scalability are demonstrated using synthetic tests. The methodology is applied to an inversion problem in structural dynamics involving unknown viscous damping coefficients.
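A one-dimensional Python sketch of the initialization pipeline follows: multistart gradient ascent to locate modes, a Laplace approximation around each mode, and a weight for each component. Several parts are assumptions: the target density is a hypothetical stand-in for a posterior, derivatives are taken by finite differences, and the weights are set from the Laplace mass of each mode, which is a simpler substitute for the constrained least squares step described above.

```python
import math


def log_target(x):
    """Hypothetical unnormalized bimodal log density (stand-in for a posterior)."""
    def normal_pdf(value, mean, sd):
        return math.exp(-0.5 * ((value - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return math.log(0.3 * normal_pdf(x, -2.0, 0.5) + 0.7 * normal_pdf(x, 2.0, 0.5))


def derivative(f, x, h=1e-5):
    """Central finite-difference first derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-3):
    """Central finite-difference second derivative."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


def find_modes(f, starts, step=0.1, iters=200, tol=1e-3):
    """Multistart gradient ascent; nearly identical end points are merged."""
    modes = []
    for x in starts:
        for _ in range(iters):
            x += step * derivative(f, x)
        if all(abs(x - m) > tol for m in modes):
            modes.append(x)
    return sorted(modes)


def laplace_mixture(f, starts):
    """Return (means, standard deviations, weights) of the initial mixture."""
    means = find_modes(f, starts)
    # Laplace method: variance is the negative inverse curvature at the mode.
    sds = [math.sqrt(-1.0 / second_derivative(f, m)) for m in means]
    # Assumed weighting: Laplace estimate of the mass under each mode.
    masses = [math.exp(f(m)) * sd * math.sqrt(2 * math.pi) for m, sd in zip(means, sds)]
    weights = [mass / sum(masses) for mass in masses]
    return means, sds, weights


means, sds, weights = laplace_mixture(log_target, starts=[-3.0, -1.0, 1.0, 3.0])
print(means, sds, weights)
```

Four starting points collapse onto two distinct modes here, and the resulting mixture could serve as a warm start for a variational solver.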
Despite the progress in medical data collection, the actual burden of SARS-CoV-2 remains unknown due to under-ascertainment of cases. This was apparent in the acute phase of the pandemic, and the use of reported deaths has been pointed out as a more reliable source of information, likely less prone to under-reporting. Since daily deaths occur from past infections weighted by their probability of death, one may infer the total number of infections, accounting for their age distribution, using the data on reported deaths. We adopt this framework and assume that the dynamics generating the total number of infections can be described by a continuous-time transmission model expressed through a system of non-linear ordinary differential equations, where the transmission rate is modelled as a diffusion process, which allows us to reveal both the effect of control strategies and the changes in individuals' behavior. We develop this flexible Bayesian tool in Stan and study 3 pairs of European countries, estimating the time-varying reproduction number ($R_t$) as well as the true cumulative number of infected individuals. As we estimate the true number of infections, we offer a more accurate estimate of $R_t$. We also provide an estimate of the daily reporting ratio and discuss the effects of changes in mobility and testing on the inferred quantities.
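The link between infections and later deaths can be written as a discrete convolution, sketched below in Python. This is a simplified illustration under stated assumptions: a single infection fatality ratio is used instead of an age-specific one, the delay distribution and infection counts are made up, and the sketch only runs the forward direction (infections to expected deaths), whereas the Bayesian model infers infections from deaths.

```python
def expected_deaths(infections, ifr, delay_pmf):
    """Expected daily deaths as past infections weighted by death probability.

    infections: daily new infections
    ifr: assumed infection fatality ratio (one value for all ages)
    delay_pmf: delay_pmf[k] is the probability that a death occurs k days
               after infection, given that the infection is fatal
    """
    deaths = []
    for day in range(len(infections)):
        total = 0.0
        for lag, prob in enumerate(delay_pmf):
            if day - lag >= 0:
                total += infections[day - lag] * ifr * prob
        deaths.append(total)
    return deaths


# Hypothetical inputs chosen for easy arithmetic.
infections = [100, 200, 400, 0]
delay_pmf = [0.0, 0.5, 0.5]  # deaths occur one or two days after infection
deaths = expected_deaths(infections, ifr=0.01, delay_pmf=delay_pmf)
print(deaths)
```

Because deaths lag infections, the most recent infections contribute little to the latest death counts, which is one reason recent estimates carry more uncertainty.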
Modeling longitudinal and survival data jointly offers many advantages, such as addressing measurement error and missing data in the longitudinal processes, understanding and quantifying the association between the longitudinal markers and the survival events, and predicting the risk of events based on the longitudinal markers. A joint model involves multiple submodels (one for each longitudinal/survival outcome), usually linked together through correlated or shared random effects. Estimating such models is computationally expensive (particularly because of the multidimensional integration of the likelihood over the random effects distribution), so inference rapidly becomes intractable, which restricts applications of joint models to a small number of longitudinal markers and/or random effects. We introduce a Bayesian approximation based on the Integrated Nested Laplace Approximation algorithm implemented in the R package R-INLA to alleviate the computational burden and allow the estimation of multivariate joint models with fewer restrictions. Our simulation studies show that R-INLA substantially reduces the computation time and the variability of the parameter estimates compared to alternative estimation strategies. We further apply the methodology to analyze 5 longitudinal markers (3 continuous, 1 count, 1 binary) with 16 random effects, together with competing risks of death and transplantation, in a clinical trial on primary biliary cholangitis. R-INLA provides a fast and reliable inference technique for applying joint models to the complex multivariate data encountered in health research.
The COVID-19 pandemic has prompted countries around the world to introduce smartphone apps to support disease control efforts. Their purposes range from digital contact tracing to quarantine enforcement to vaccination passports, and their effectiveness often depends on widespread adoption. While previous work has identified factors that promote or hinder adoption, it has typically examined data collected at a single point in time or focused exclusively on digital contact tracing apps. In this work, we conduct the first representative study that examines changes in people's attitudes towards COVID-19-related smartphone apps for five different purposes over the first 1.5 years of the pandemic. In three survey rounds conducted between Summer 2020 and Summer 2021 in the United States and Germany, with approximately 1,000 participants per round and country, we investigate people's willingness to use such apps, their perceived utility, and people's attitudes towards them in different stages of the pandemic. Our results indicate that privacy is a consistent concern for participants, even in a public health crisis, and the collection of identity-related data significantly decreases acceptance of COVID-19 apps. Trust in authorities is essential to increase confidence in government-backed apps and foster citizens' willingness to contribute to crisis management. There is a need for continuous communication with app users to emphasize the benefits of health crisis apps both for individuals and society, thus counteracting decreasing willingness to use them and perceived usefulness as the pandemic evolves.
We consider error-correction coding schemes for adversarial wiretap channels (AWTCs) in which the channel can a) read a fraction of the codeword bits up to a bound $r$ and b) flip a fraction of the bits up to a bound $p$. The channel can freely choose the locations of the bit reads and bit flips via a process with unbounded computational power. Codes for the AWTC are of broad interest in the area of information security, as they can provide data resiliency in settings where an attacker has limited access to a storage or transmission medium. We investigate a family of non-linear codes known as pseudolinear codes, which were first proposed by Guruswami and Indyk (FOCS 2001) for constructing list-decodable codes independent of the AWTC setting. Unlike general non-linear codes, pseudolinear codes admit efficient encoders and have succinct representations. We focus on unique decoding and show that random pseudolinear codes can achieve rates up to the binary symmetric channel (BSC) capacity $1-H_2(p)$ for any $p,r$ in the less noisy region: $p<1/2$ and $r<1-H_2(p)$ where $H_2(\cdot)$ is the binary entropy function. Thus, pseudolinear codes are the first known optimal-rate binary code family for the less noisy AWTC that admit efficient encoders. The above result can be viewed as a derandomization result of random general codes in the AWTC setting, which in turn opens new avenues for applying derandomization techniques to randomized constructions of AWTC codes. Our proof applies a novel concentration inequality for sums of random variables with limited independence which may be of interest as an analysis tool more generally.
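The rate bound and the less noisy region stated above are easy to evaluate numerically. The Python sketch below computes the binary entropy function $H_2$, the BSC capacity $1-H_2(p)$, and a membership test for the region $p<1/2$, $r<1-H_2(p)$; the function names and the sample parameter values are chosen here for illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity 1 - H_2(p) of the binary symmetric channel."""
    return 1.0 - binary_entropy(p)


def in_less_noisy_region(p, r):
    """True if p < 1/2 and r < 1 - H_2(p)."""
    return p < 0.5 and r < bsc_capacity(p)


# Hypothetical parameters: flip fraction p = 0.11 gives a capacity near 0.5.
print(bsc_capacity(0.11))
print(in_less_noisy_region(0.11, 0.4))
print(in_less_noisy_region(0.11, 0.6))
```

For a flip fraction near 0.11 the achievable rate is about one half, and the adversary's read fraction must stay below that same value for the pair to lie in the less noisy region.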