International Classification of Diseases (ICD) codes are widely used for encoding diagnoses in electronic health records (EHR). Automated methods have been developed over the years for predicting biomedical responses using EHR that borrow information among diagnostically similar patients. Relatively less attention has been paid to developing patient similarity measures that model the structure of ICD codes and the presence of multiple chronic conditions, where a chronic condition is defined as a set of ICD codes. Motivated by this problem, we first develop a type of string kernel function for measuring similarity between a pair of subsets of ICD codes, which uses the definition of chronic conditions. Second, we extend this similarity measure to define a family of covariance functions on subsets of ICD codes. Using this family, we develop Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are free of any tuning parameters and are well-suited for automated prediction of continuous and categorical biomedical responses that depend on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method has better sensitivity and specificity than its competitors in classifying different primary cancer sites and estimates the marginal associations between chronic conditions and primary cancer sites.
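A minimal illustration of the idea of a set-based covariance between ICD code subsets used inside a GP regression; the toy condition map, the specific similarity rule, and the noise level are assumptions for the sketch, not the kernel proposed in the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's kernel): similarity between two subsets of
# ICD codes measured through a hypothetical chronic-condition grouping, then used
# as a GP covariance for regression.  Code sets, the condition map, and the noise
# level are assumptions for this example.
CONDITIONS = {"diabetes": {"E11.9", "E11.65"}, "ckd": {"N18.3", "N18.4"}}

def set_similarity(codes_a, codes_b):
    """Fraction of chronic conditions touched by both code subsets."""
    hits = sum(1 for cond in CONDITIONS.values()
               if cond & codes_a and cond & codes_b)
    return hits / len(CONDITIONS)

def gp_posterior_mean(train_sets, y, test_sets, noise=0.1):
    K = np.array([[set_similarity(a, b) for b in train_sets] for a in train_sets])
    Ks = np.array([[set_similarity(a, b) for b in train_sets] for a in test_sets])
    alpha = np.linalg.solve(K + noise * np.eye(len(y)), y)
    return Ks @ alpha

train = [{"E11.9", "N18.3"}, {"N18.4"}, {"E11.65"}]
y = np.array([2.1, 1.3, 1.8])                       # synthetic responses
print(gp_posterior_mean(train, y, [{"E11.9"}]))     # prediction for a new code set
```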
A meaningful understanding of clinical protocols and patient pathways helps improve healthcare outcomes. Electronic health records (EHR) reflect real-world treatment behaviours that are used to enhance healthcare management but present challenges: protocols and pathways are often loosely defined, and their elements are frequently not recorded in EHRs, complicating the enhancement. To address this challenge, healthcare objectives associated with healthcare management activities can be indirectly observed in EHRs as latent topics. Topic models, such as Latent Dirichlet Allocation (LDA), are used to identify latent patterns in EHR data. However, they do not examine the ordered nature of EHR sequences, nor do they appraise individual events in isolation. Our novel approach, the Categorical Sequence Encoder (CaSE), addresses these shortcomings. The sequential nature of EHRs is captured by CaSE's event-level representations, revealing latent healthcare objectives. In synthetic EHR sequences, CaSE outperforms LDA by up to 37% at identifying healthcare objectives. In the real-world MIMIC-III dataset, CaSE identifies meaningful representations that could critically enhance protocol and pathway development.
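As a point of reference for event-level, order-aware representations, the following is a generic recurrent encoder sketch in PyTorch; it is not the CaSE architecture, and the vocabulary size and dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Generic order-aware event encoder, NOT the CaSE architecture: each categorical
# event is embedded and passed through a GRU, yielding one representation per
# event that depends on the preceding sequence.  Sizes are placeholders.
class EventEncoder(nn.Module):
    def __init__(self, n_event_types=100, d_emb=32, d_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_event_types, d_emb)
        self.rnn = nn.GRU(d_emb, d_hidden, batch_first=True)

    def forward(self, events):             # events: (batch, time) integer codes
        hidden, _ = self.rnn(self.embed(events))
        return hidden                       # (batch, time, d_hidden) event-level representations

encoder = EventEncoder()
batch = torch.randint(0, 100, (4, 12))      # 4 synthetic sequences of 12 events
print(encoder(batch).shape)                 # torch.Size([4, 12, 64])
```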
Information geometry is concerned with the application of differential geometry concepts in the study of the parametric spaces of statistical models. When the random variables are independent and identically distributed, the underlying parametric space exhibits constant curvature, which makes the geometry hyperbolic (negative curvature) or spherical (positive curvature). In this paper, we derive closed-form expressions for the components of the first and second fundamental forms of pairwise isotropic Gaussian-Markov random field manifolds, allowing the computation of the Gaussian, mean and principal curvatures. Computational simulations using Markov chain Monte Carlo dynamics indicate that a change in the sign of the Gaussian curvature is related to the emergence of phase transitions in the field. Moreover, the curvatures are highly asymmetrical for positive and negative displacements in the inverse temperature parameter, suggesting the existence of irreversible geometric properties in the parametric space along the dynamics. Furthermore, these asymmetric changes in the curvature of the space induce an intrinsic notion of time in the evolution of the random field.
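For concreteness, a small sketch of how Gaussian, mean and principal curvatures follow from the two fundamental forms on a two-parameter surface; the example form matrices are arbitrary placeholders, and the paper's closed-form expressions for the GMRF manifold are not reproduced here.

```python
import numpy as np

# Minimal sketch of the differential-geometric quantities mentioned above: given
# the matrices of the first (I) and second (II) fundamental forms at a point of a
# two-parameter surface, compute Gaussian, mean and principal curvatures.  The
# example matrices below are arbitrary placeholders.
def curvatures(I, II):
    shape_op = np.linalg.solve(I, II)          # Weingarten (shape) operator
    k1, k2 = np.linalg.eigvals(shape_op)       # principal curvatures
    K = np.linalg.det(II) / np.linalg.det(I)   # Gaussian curvature
    H = 0.5 * np.trace(shape_op)               # mean curvature
    return K, H, (k1, k2)

I = np.array([[1.2, 0.1], [0.1, 0.9]])
II = np.array([[0.3, -0.05], [-0.05, -0.2]])
print(curvatures(I, II))                       # negative K here: saddle-like point
```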
Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper, we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate the three estimators and find that the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.
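As a concrete reference point, one natural plug-in estimator of expected validation performance treats the n observed scores as an empirical distribution and computes the expected maximum of k i.i.d. draws from it; this is only one of several possible estimators of the quantity (the paper compares three), and the scores below are synthetic.

```python
import numpy as np

# Plug-in estimator of expected validation performance: the expected maximum of
# k i.i.d. draws from the empirical distribution of the n observed scores, as a
# function of the budget k.  Scores are synthetic; this is one illustrative
# estimator, not necessarily any of the three analyzed in the paper.
def expected_max(scores, k):
    v = np.sort(np.asarray(scores))
    n = len(v)
    i = np.arange(1, n + 1)
    weights = (i / n) ** k - ((i - 1) / n) ** k   # P(max of k draws equals v_(i))
    return float(np.sum(weights * v))

scores = np.random.default_rng(0).uniform(0.6, 0.9, size=50)  # 50 tuning runs
for k in (1, 5, 10, 50):
    print(k, round(expected_max(scores, k), 4))    # increases with the budget k
```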
We study the problem of maximizing the probability that (i) an electric component or financial institution $X$ does not default before another component or institution $Y$ and (ii) that $X$ and $Y$ default jointly within the class of all random variables $X,Y$ with given univariate continuous distribution functions $F$ and $G$, respectively, and show that the maximization problems correspond to finding copulas maximizing the mass of the endograph $\Gamma^\leq(T)$ and the graph $\Gamma(T)$ of $T=G \circ F^-$, respectively. After providing simple, copula-based proofs for the existence of copulas attaining the two maxima $\overline{m}_T$ and $\overline{w}_T$, we generalize the obtained results to the case of general (not necessarily monotonic) transformations $T:[0,1] \rightarrow [0,1]$ and derive simple and easily calculable formulas for $\overline{m}_T$ and $\overline{w}_T$ involving the distribution function $F_T$ of $T$ (interpreted as a random variable on $[0,1]$). The latter are then used to characterize all non-decreasing transformations $T:[0,1] \rightarrow [0,1]$ for which $\overline{m}_T$ and $\overline{w}_T$ coincide. A strongly consistent estimator for the maximum probability that $X$ does not default before $Y$ is derived and proven to be asymptotically normal under very mild regularity conditions. Several examples and graphics illustrate the main results and falsify some seemingly natural conjectures.
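A small Monte Carlo illustration of the first quantity being optimized, the probability that $X$ does not default before $Y$, under two different couplings of fixed marginals; the exponential marginals are arbitrary, and the sketch does not compute the paper's formulas for $\overline{m}_T$.

```python
import numpy as np

# Monte Carlo illustration of P(X >= Y), i.e. X does not default before Y, under
# two couplings of the same continuous marginals.  Exponential marginals and the
# sample size are arbitrary choices; for these marginals the comonotone coupling
# already pushes the probability to 1, while in general the maximizing copula is
# the object characterized in the paper.
rng = np.random.default_rng(1)
n = 200_000
u, v = rng.uniform(size=n), rng.uniform(size=n)

F_inv = lambda p: -np.log1p(-p)        # quantile function of X ~ Exp(1)
G_inv = lambda p: -np.log1p(-p) / 2.0  # quantile function of Y ~ Exp(2)

independent = np.mean(F_inv(u) >= G_inv(v))   # about 2/3 under independence
comonotone = np.mean(F_inv(u) >= G_inv(u))    # equals 1 for this pair of marginals
print(independent, comonotone)
```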
We study the rank of the instantaneous or spot covariance matrix $\Sigma_X(t)$ of a multidimensional continuous semi-martingale $X(t)$. Given high-frequency observations $X(i/n)$, $i=0,\ldots,n$, we test the null hypothesis $rank(\Sigma_X(t))\le r$ for all $t$ against local alternatives where the average $(r+1)$st eigenvalue is larger than some signal detection rate $v_n$. A major problem is that the inherent averaging in local covariance statistics produces a bias that distorts the rank statistics. We show that the bias depends on the regularity and a spectral gap of $\Sigma_X(t)$. We establish explicit matrix perturbation and concentration results that provide non-asymptotic uniform critical values and optimal signal detection rates $v_n$. This leads to a rank estimation method via sequential testing. For a class of stochastic volatility models, we determine data-driven critical values via normed p-variations of estimated local covariance matrices. The methods are illustrated by simulations and an application to high-frequency data of U.S. government bonds.
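A brief sketch of the local covariance statistic that such rank tests build on: estimate the spot covariance from a window of high-frequency increments and inspect its ordered eigenvalues; the simulated two-factor process, window length, and sampling frequency are assumptions.

```python
import numpy as np

# Local (spot) covariance estimate from a window of k_n high-frequency increments,
# followed by inspection of its ordered eigenvalues.  The simulated process (a 3-D
# Brownian motion driven by only 2 factors, so rank(Sigma) = 2), the window length
# and the sampling frequency are assumptions for this sketch.
rng = np.random.default_rng(0)
n, k_n = 23_400, 200
loadings = rng.normal(size=(3, 2))                 # 3 components, 2 common factors
increments = rng.normal(scale=np.sqrt(1 / n), size=(n, 2)) @ loadings.T

def spot_cov(increments, start, k_n, n):
    window = increments[start:start + k_n]
    return (n / k_n) * window.T @ window           # local realized covariance

Sigma_hat = spot_cov(increments, start=5_000, k_n=k_n, n=n)
print(np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1])  # third eigenvalue ~ 0 (rank 2)
```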
Patients suffering from multiple diseases (multi-morbid patients) often have complex clinical pathways. They are diagnosed and treated by different specialties and undergo clinical actions related to various diagnoses. Coordination of care for these patients is often challenging, and it would be of great benefit to get better insight into how the clinical pathways develop in reality. Discovering these pathways using traditional process mining techniques and standard event logs may be difficult because the patient is involved in several highly independent clinical processes. Our objective is to explore the potential of analyzing these pathways using an event log representation reflecting the independent clinical processes. Our main research question is: How can we identify valuable insights by using a multi-entity event data representation for clinical pathways of multi-morbid patients? Our method is built on the idea of representing multiple entities in event logs as event graphs. The MIMIC-III dataset was used to evaluate the feasibility of this approach. Several clinical entities were identified and then mapped into an event graph. Finally, multi-entity directly-follows graphs were discovered by querying the event graph and visualizing the results. Our results show that paths involving multiple entities capture traditional process mining concepts not just for one clinical process but for all involved processes. In addition, the relationship between activities of different clinical processes, which was not recognizable in traditional models, is visible in the event graph representation.
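To make the multi-entity directly-follows idea concrete, here is a toy sketch in which each event carries identifiers for several entities and directly-follows edges are derived per entity rather than per case; the events and entity types are invented, and in practice a graph query over the MIMIC-III event graph would play this role.

```python
from collections import defaultdict

# Toy multi-entity directly-follows relation: each event carries identifiers for
# several entities (here: patient and diagnosis), and edges are counted per
# entity rather than per single case notion.  Events and entity types are
# invented for illustration.
events = [  # (timestamp, activity, patient, diagnosis)
    (1, "admission", "p1", None),
    (2, "lab_test",  "p1", "diabetes"),
    (3, "consult",   "p1", "ckd"),
    (4, "lab_test",  "p1", "diabetes"),
]

def directly_follows(events, entity_index):
    last_activity, edges = {}, defaultdict(int)
    for _, activity, *entities in sorted(events):
        key = entities[entity_index]
        if key is None:
            continue
        if key in last_activity:
            edges[(last_activity[key], activity)] += 1
        last_activity[key] = activity
    return dict(edges)

print(directly_follows(events, entity_index=0))  # patient-level paths
print(directly_follows(events, entity_index=1))  # diagnosis-level paths
```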
The paper proposes a supervised machine learning algorithm to uncover treatment effect heterogeneity in classical regression discontinuity (RD) designs. Extending Athey and Imbens (2016), I develop a criterion for building an honest "regression discontinuity tree", where each leaf of the tree contains the RD estimate of a treatment (assigned by a common cutoff rule) conditional on the values of some pre-treatment covariates. It is a priori unknown which covariates are relevant for capturing treatment effect heterogeneity, and it is the task of the algorithm to discover them, without invalidating inference. I study the performance of the method through Monte Carlo simulations and apply it to the data set compiled by Pop-Eleches and Urquiola (2013) to uncover various sources of heterogeneity in the impact of attending a better secondary school in Romania.
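A hedged sketch of the leaf-level quantity in such a tree: within a subgroup defined by pre-treatment covariates, fit separate local linear regressions on each side of the cutoff and take the jump at the cutoff as the conditional RD estimate; the data, cutoff, and bandwidth are simulated placeholders, and the honest splitting criterion itself is not shown.

```python
import numpy as np

# Leaf-level RD estimate: local linear fits on each side of the cutoff within a
# subgroup, with the treatment effect read off as the jump in fitted values at
# the cutoff.  Simulated data, cutoff and bandwidth are placeholders; the honest
# tree-building criterion is not part of this sketch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2_000)              # running variable, cutoff at 0
effect = 0.5
y = 1.0 + 0.8 * x + effect * (x >= 0) + rng.normal(scale=0.3, size=x.size)

def rd_estimate(x, y, cutoff=0.0, bandwidth=0.5):
    left = (x < cutoff) & (x > cutoff - bandwidth)
    right = (x >= cutoff) & (x < cutoff + bandwidth)
    b_left = np.polyfit(x[left] - cutoff, y[left], deg=1)
    b_right = np.polyfit(x[right] - cutoff, y[right], deg=1)
    return b_right[1] - b_left[1]                # difference of intercepts at the cutoff

print(rd_estimate(x, y))                         # close to the simulated effect 0.5
```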
As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g., approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to characterize the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this setting. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.
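A toy sketch of repeated risk minimization under a decision-dependent distribution, illustrating convergence to a performatively stable point; the mean-shift response model, its sensitivity, and the sample size are assumptions, not the general setting analyzed here.

```python
import numpy as np

# Repeated risk minimization with a decision-dependent distribution: the deployed
# parameter theta shifts the data mean, and the learner repeatedly refits to data
# drawn from the induced distribution.  The shift model, the sensitivity eps and
# the sample size are assumptions; with eps < 1 the iterates contract to a
# performatively stable point.
rng = np.random.default_rng(0)
base_mean, eps = 2.0, 0.4          # distribution mean responds to theta with strength eps

def sample(theta, n=50_000):
    return rng.normal(loc=base_mean + eps * theta, scale=1.0, size=n)

theta = 0.0
for t in range(10):
    data = sample(theta)
    theta = data.mean()            # squared-loss risk minimizer on the induced data
    print(t, round(theta, 4))      # approaches base_mean / (1 - eps) ~ 3.333
```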
Learning the relationships between various entities from time-series data is essential in many applications. Gaussian graphical models have been studied to infer these relationships. However, existing algorithms process data in a batch at a central location, limiting their applicability in scenarios where data is gathered by different agents. In this paper, we propose a distributed sparse inverse covariance algorithm to learn the network structure (i.e., dependencies among observed entities) in real-time from data collected by distributed agents. Our approach is built on an online graphical alternating minimization algorithm, augmented with a consensus term that allows agents to learn the desired structure cooperatively. We allow the system designer to select the number of communication rounds and optimization steps per data point. We characterize the rate of convergence of our algorithm and provide simulations on synthetic datasets.
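A simplified sketch of a consensus-augmented sparse precision update in the spirit described above: each agent takes a gradient step on its local Gaussian log-likelihood, soft-thresholds off-diagonal entries, and averages with its neighbours; this generic ISTA-plus-consensus scheme, together with the step size, penalty and complete communication network, is an assumption rather than the algorithm analyzed in the paper.

```python
import numpy as np

# Generic ISTA-plus-consensus sketch for distributed sparse precision estimation,
# not the paper's algorithm: local gradient step on the Gaussian negative
# log-likelihood, soft-thresholding of off-diagonals, then consensus averaging
# over a complete network.  All tuning constants are placeholders.
rng = np.random.default_rng(0)
p, agents, step, lam = 5, 3, 0.05, 0.05
true_prec = np.eye(p) + 0.3 * np.eye(p, k=1) + 0.3 * np.eye(p, k=-1)
cov = np.linalg.inv(true_prec)

S = [np.cov(rng.multivariate_normal(np.zeros(p), cov, size=400), rowvar=False)
     for _ in range(agents)]                      # each agent's local sample covariance
Theta = [np.eye(p) for _ in range(agents)]

def soft_threshold(M, t):
    off = np.sign(M) * np.maximum(np.abs(M) - t, 0.0)
    return off - np.diag(np.diag(off)) + np.diag(np.diag(M))   # diagonal left unpenalized

for _ in range(200):
    for i in range(agents):                       # local gradient + shrinkage step
        grad = S[i] - np.linalg.inv(Theta[i])     # d/dTheta [tr(S Theta) - logdet Theta]
        Theta[i] = soft_threshold(Theta[i] - step * grad, step * lam)
    avg = sum(Theta) / agents                     # consensus over a complete network
    Theta = [avg.copy() for _ in range(agents)]

print(np.round(Theta[0], 2))                      # sparse estimate close to true_prec
```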
In many real-world applications, we want to exploit multiple source datasets of similar tasks to learn a model for a different but related target dataset -- e.g., recognizing characters of a new font using a set of different fonts. While most recent research has considered ad-hoc combination rules to address this problem, we extend previous work on domain discrepancy minimization to develop a finite-sample generalization bound, and accordingly propose a theoretically justified optimization procedure. The algorithm we develop, Domain AggRegation Network (DARN), is able to effectively adjust the weight of each source domain during training to ensure relevant domains are given more importance for adaptation. We evaluate the proposed method on real-world sentiment analysis and digit recognition datasets and show that DARN can significantly outperform the state-of-the-art alternatives.
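A minimal numeric sketch of the underlying idea of weighting source domains by their estimated relevance to the target; the softmax rule, temperature, and numbers are illustrative assumptions, not DARN's actual weighting.

```python
import numpy as np

# Illustrative domain weighting: domains whose (estimated) discrepancy to the
# target is small receive larger weight in the aggregated training objective.
# The softmax rule, temperature and numbers are made-up assumptions, not the
# weighting scheme used by DARN.
source_losses = np.array([0.35, 0.50, 0.42])          # per-domain training loss
discrepancies = np.array([0.10, 0.80, 0.30])          # estimated distance to target
temperature = 0.2

weights = np.exp(-discrepancies / temperature)
weights /= weights.sum()                              # relevant domains dominate
aggregated_objective = float(weights @ source_losses)
print(np.round(weights, 3), round(aggregated_objective, 3))
```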