In recent years, research interest in personalised treatments has been growing. However, treatment effect heterogeneity and possibly time-varying treatment effects are still often overlooked in clinical studies. Statistical tools are needed to identify treatment response patterns while accounting for the fact that treatment response is not constant over time. We aim to provide an innovative method to obtain dynamic treatment effect phenotypes on a time-to-event outcome, conditional on a set of relevant effect modifiers. The proposed method does not require the assumption of proportional hazards for the treatment effect, which is rarely realistic. We propose a spline-based survival neural network, inspired by the Royston-Parmar survival model, to estimate time-varying conditional treatment effects. We then exploit the functional nature of the resulting estimates and apply functional clustering to the treatment effect curves in order to identify distinct patterns of treatment response. The application that motivated this work is the discontinuation of treatment with mineralocorticoid receptor antagonists (MRAs) in patients with heart failure, for which there is no clear evidence on which patients can safely discontinue treatment and for which patients discontinuation leads to a higher risk of adverse events. The data come from an electronic health record database. A simulation study was performed to assess the performance of the spline-based neural network and the stability of the treatment response phenotyping procedure. We provide a novel method to inform individualised medical decisions by characterising subject-specific treatment responses over time.
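To fix ideas, a hedged sketch of how a Royston-Parmar-type model accommodates a time-varying treatment effect (generic notation, not necessarily the paper's): the log cumulative hazard is modelled as a restricted cubic spline in log time, and the treatment indicator is interacted with a second spline,
\[
\log H(t \mid a, \mathbf{x}) = s_0(\log t;\, \boldsymbol{\gamma}_0) + \mathbf{x}^\top\boldsymbol{\beta} + a\, s_1\bigl(\log t;\, \boldsymbol{\gamma}_1(\mathbf{x})\bigr),
\]
so that the conditional treatment effect on the log cumulative hazard scale, $\tau(t \mid \mathbf{x}) = s_1(\log t;\, \boldsymbol{\gamma}_1(\mathbf{x}))$, varies over time; in a neural-network variant of this kind, the spline coefficients $\boldsymbol{\gamma}_1(\mathbf{x})$ would be produced by the network from the effect modifiers $\mathbf{x}$.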
This work is concerned with solving high-dimensional Fokker-Planck equations from the novel perspective that solving the PDE can be reduced to independent density estimation tasks based on trajectories sampled from the associated particle dynamics. With this approach, one sidesteps the error accumulation that arises from integrating the PDE dynamics on a parameterized function class. It also significantly simplifies deployment, as one avoids the challenges of implementing loss terms based on the differential equation. In particular, we introduce a novel class of high-dimensional functions called the functional hierarchical tensor (FHT). The FHT ansatz leverages a hierarchical low-rank structure, offering runtime and memory complexity that scale linearly with the dimension. We introduce a sketching-based technique that performs density estimation over particles simulated from the particle dynamics associated with the equation, thereby obtaining a representation of the Fokker-Planck solution in terms of our ansatz. We apply the proposed approach successfully to three challenging time-dependent Ginzburg-Landau models with hundreds of variables.
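For concreteness, the correspondence being exploited can be sketched in standard notation (a gradient-flow Fokker-Planck equation with potential $V$ and inverse temperature $\beta$; the paper's setting may be more general):
\[
\partial_t p(x,t) = \nabla\cdot\bigl(p(x,t)\,\nabla V(x)\bigr) + \beta^{-1}\Delta p(x,t)
\qquad\Longleftrightarrow\qquad
\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}W_t,
\]
since the law of $X_t$ solves the PDE. Simulating many independent trajectories of the stochastic differential equation therefore yields samples from $p(\cdot,t)$ at every time $t$, and each time slice can be recovered by an independent density estimation step, here carried out by sketching onto the FHT ansatz.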
The impact of outliers and anomalies on model estimation and data processing is of paramount importance, as evidenced by the extensive body of research spanning various fields over several decades: thousands of research papers have been published on the subject. As a consequence, numerous reviews, surveys, and textbooks have sought to summarize the existing literature, encompassing a wide range of methods from both the statistical and data mining communities. While these efforts to organize and summarize the research are invaluable, they face inherent challenges due to the pervasive nature of outliers and anomalies in all data-intensive applications, irrespective of the specific application field or scientific discipline. Consequently, the resulting collection of papers remains voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper presents the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Employing a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this comprehensive collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected, with a snowball search technique used to enhance field coverage. A meticulous quality assessment phase further refines the selection to a subset of 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. Furthermore, an analysis of the surveys sheds light on the survey writing practices adopted by scholars from the different communities that have contributed to this field. Finally, the paper delves into several topics where consensus has emerged from the literature. These include taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Aspects on which no consensus has been reached are also discussed, particularly the distinction between local and global outliers and the challenges of organizing detection methods into meaningful taxonomies.
Background: The detection and extraction of causality from natural language sentences have shown great potential in various fields of application. The field of requirements engineering is particularly well suited for multiple reasons: (1) requirements artifacts are primarily written in natural language, (2) causal sentences convey essential context about the subject of requirements, and (3) extracted and formalized causality relations are usable for a (semi-)automatic translation into further artifacts, such as test cases. Objective: We aim to understand the value of interactive causality extraction based on syntactic criteria in the context of requirements engineering. Method: We developed a prototype of a system for automatic causality extraction and evaluated it by applying it to a set of publicly available requirements artifacts, determining whether the automatic extraction reduces the manual effort of requirements formalization. Result: During the evaluation we analyzed 4457 natural language sentences from 18 requirements documents, 558 of which were causal (12.52%). For the best-performing requirements document, 48.57% of the cause-effect graphs were extracted automatically on average, which demonstrates the feasibility of the approach. Limitation: The feasibility of the approach has been shown in principle, but scaling it up for practical use remains unexplored. Evaluating the applicability of the automatic causality extraction for a requirements engineer is left for future research. Conclusion: A syntactic approach to causality extraction is viable in the context of requirements engineering and can support a pipeline towards the automatic generation of further artifacts from requirements artifacts.
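For illustration only, a minimal sketch of surface-pattern causality extraction (the cue patterns, function name, and example sentence are hypothetical, and a real syntactic approach would rely on dependency parsing rather than regular expressions):

import re
from typing import Optional, Tuple

# A few explicit causal cue patterns; a full syntactic pipeline would be far richer.
CUE_PATTERNS = [
    re.compile(r"^if (?P<cause>.+?),? then (?P<effect>.+)$", re.IGNORECASE),
    re.compile(r"^(?P<effect>.+?) because (?P<cause>.+)$", re.IGNORECASE),
    re.compile(r"^when (?P<cause>.+?), (?P<effect>.+)$", re.IGNORECASE),
]

def extract_cause_effect(sentence: str) -> Optional[Tuple[str, str]]:
    """Return a (cause, effect) pair if a causal cue is matched, else None."""
    s = sentence.strip().rstrip(".")
    for pattern in CUE_PATTERNS:
        match = pattern.match(s)
        if match:
            return match.group("cause"), match.group("effect")
    return None

# Example: a requirement-like sentence.
print(extract_cause_effect(
    "If the temperature exceeds 90 degrees, then the system shall shut down."))
# -> ('the temperature exceeds 90 degrees', 'the system shall shut down')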
Model reduction is the construction of simple yet predictive descriptions of the dynamics of many-body systems in terms of a few relevant variables. A prerequisite to model reduction is the identification of these relevant variables, a task for which no general method exists. Here, we develop a systematic approach based on the information bottleneck to identify the relevant variables, defined as those most predictive of the future. We elucidate analytically the relation between these relevant variables and the eigenfunctions of the transfer operator describing the dynamics. Further, we show that in the limit of high compression, the relevant variables are directly determined by the slowest-decaying eigenfunctions. Our information-based approach indicates the optimal point at which to stop increasing the complexity of the reduced model. Moreover, it provides a firm foundation for constructing interpretable deep learning tools that perform model reduction. We illustrate how these tools work on benchmark dynamical systems and deploy them on uncurated datasets, such as satellite movies of atmospheric flows downloaded directly from YouTube.
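A hedged sketch of the underlying objective, assuming the standard (past-future) information bottleneck formulation: one seeks a compressed representation $\tilde{X}$ of the present state $X_t$ by solving
\[
\min_{p(\tilde{x}\mid x_t)} \; I(X_t;\tilde{X}) - \beta\, I(\tilde{X}; X_{t+\tau}),
\]
where $X_{t+\tau}$ is the future state to be predicted and $\beta$ trades off compression against predictive power; sweeping $\beta$ traces out the compression-prediction curve along which one can decide when further complexity in the reduced model stops paying off.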
We consider the problem of chance-constrained optimization, where one seeks to optimize a function and satisfy constraints, both of which are affected by uncertainties. Real-world instances of this problem are particularly challenging because of their inherent computational cost. To tackle such problems, we propose a new Bayesian optimization method. It applies to the situation where the uncertainty comes from some of the inputs, so that an acquisition criterion can be defined in the joint space of controlled and uncontrolled inputs. The main contribution of this work is an acquisition criterion that accounts for both the average improvement in the objective function and the constraint reliability. The criterion is derived following the Stepwise Uncertainty Reduction logic, and its maximization provides both optimal controlled and uncontrolled parameters. Analytical expressions are given to efficiently calculate the criterion. Numerical studies on test functions are presented. Experimental comparisons with alternative sampling criteria show that the match between the sampling criterion and the problem contributes to the efficiency of the overall optimization. As a side result, an expression for the variance of the improvement is given.
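In generic notation (which may differ from the paper's), the class of problems considered is
\[
\min_{x\in\mathcal{X}} \; \mathbb{E}_{U}\bigl[f(x,U)\bigr]
\quad\text{subject to}\quad
\mathbb{P}_{U}\bigl(g(x,U)\le 0\bigr) \ge 1-\alpha,
\]
where $x$ collects the controlled inputs, $U$ the uncontrolled (random) inputs, and $\alpha$ the tolerated probability of constraint violation; surrogate models (typically Gaussian processes) are built in the joint $(x,u)$ space, and maximizing the Stepwise Uncertainty Reduction criterion returns both the next controlled point and the next uncontrolled point at which to evaluate.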
We adopt the integral definition of the fractional Laplace operator and study an optimal control problem on Lipschitz domains that involves a fractional elliptic partial differential equation (PDE) as state equation and a control variable that enters the state equation as a coefficient; pointwise constraints on the control variable are considered as well. We establish the existence of optimal solutions and analyze first order as well as necessary and sufficient second order optimality conditions. Regularity estimates for the optimal variables are also derived. We develop two finite element discretization strategies: a semidiscrete scheme in which the control variable is not discretized, and a fully discrete scheme in which the control variable is discretized with piecewise constant functions. For both schemes, we analyze the convergence properties of the discretizations and derive error estimates.
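The integral definition referred to above is the standard one: for $s\in(0,1)$,
\[
(-\Delta)^s u(x) = c_{n,s}\,\mathrm{P.V.}\!\int_{\mathbb{R}^n}\frac{u(x)-u(y)}{|x-y|^{n+2s}}\,\mathrm{d}y,
\]
where $c_{n,s}$ is a normalization constant and $\mathrm{P.V.}$ denotes the Cauchy principal value.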
Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data which demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and demonstrate the method's robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.
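For intuition only, consider the classical SCAR special case from positive unlabeled learning, in which every true positive is reported with the same unknown probability $c$ in every group (an assumption stronger than the covariate shift condition the method actually relies on). Writing $s\in\{0,1\}$ for the reporting indicator, $p(s=1\mid x) = c\,p(y=1\mid x)$, so for groups $A$ and $B$
\[
\frac{p_A(y=1)}{p_B(y=1)}
= \frac{\mathbb{E}_{x\sim A}\bigl[p(s=1\mid x)\bigr]/c}{\mathbb{E}_{x\sim B}\bigl[p(s=1\mid x)\bigr]/c}
= \frac{\mathbb{E}_{x\sim A}\bigl[p(s=1\mid x)\bigr]}{\mathbb{E}_{x\sim B}\bigl[p(s=1\mid x)\bigr]},
\]
i.e., the unknown reporting rate cancels from the ratio even though the absolute prevalences remain unidentified; the contribution summarized above is to obtain an analogous cancellation under weaker assumptions.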
In various applications, companies need to decide to which individuals they should best allocate treatment. To support such decisions, uplift models are applied to predict treatment effects at the individual level. Based on the predicted treatment effects, individuals can be ranked and treatment allocation can be prioritized according to this ranking. An implicit assumption, which has not been questioned in the previous uplift modeling literature, is that this treatment prioritization approach tends to bring individuals with high treatment effects to the top and individuals with low treatment effects to the bottom of the ranking. In our research, we show that heteroskedasticity in the training data can bias the ranking of an uplift model: individuals with the highest treatment effects can accumulate in large numbers at the bottom of the ranking. We explain theoretically how heteroskedasticity can bias the ranking of uplift models and demonstrate this process in a simulation and on real-world data. We argue that this problem of ranking bias due to heteroskedasticity may occur in many real-world applications and requires modifications of the treatment prioritization to achieve an efficient treatment allocation.
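As a purely illustrative probe of this phenomenon, the following toy simulation (a hypothetical setup; a T-learner with random forests is assumed here, not necessarily the estimator studied in this work, and how strongly the bias appears depends on sample size and model flexibility) generates data in which both the true uplift and the outcome noise grow with the covariate, and then inspects the mean true effect within each predicted-uplift decile; a bottom decile with a high mean true effect would be the symptom described above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2_000
x = rng.uniform(0, 1, size=(n, 1))            # single covariate
treated = rng.integers(0, 2, size=n)          # randomized treatment assignment

tau = 2.0 * x[:, 0]                           # true uplift grows with x ...
noise_sd = 0.5 + 4.0 * x[:, 0]                # ... and so does the outcome noise
y = x[:, 0] + treated * tau + rng.normal(0.0, noise_sd, size=n)

# T-learner: fit separate outcome models on treated and control units,
# then predict the uplift as the difference of the two model outputs.
m1 = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=0)
m0 = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=0)
m1.fit(x[treated == 1], y[treated == 1])
m0.fit(x[treated == 0], y[treated == 0])
uplift_hat = m1.predict(x) - m0.predict(x)

# Mean true effect per predicted-uplift decile (decile 0 = bottom of the ranking).
edges = np.quantile(uplift_hat, np.linspace(0.1, 0.9, 9))
deciles = np.digitize(uplift_hat, edges)
for d in range(10):
    print(d, round(float(tau[deciles == d].mean()), 3))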
We investigate long-term cognitive effects of an intervention in which systolic blood pressure (sBP) is monitored at more optimal levels, in a large representative sample. A limitation of previous research on the potential risk reduction of such interventions is that it does not properly account for the reduction of mortality rates. Hence, one can only speculate whether the effect results from changes in cognition or from changes in mortality. We therefore extend previous research by providing both an etiological and a prognostic effect estimate. To do this, we propose a Bayesian semi-parametric estimation approach for an incremental intervention, using the extended G-formula. We also introduce a novel sparsity-inducing Dirichlet hyperprior for longitudinal data, demonstrate the usefulness of our approach in simulations, and compare its performance to other Bayesian decision tree ensemble approaches. In our study, there were no significant prognostic or etiological effects across all ages, indicating that sBP interventions likely do not have a strong effect on memory at either the population level or the individual level.
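In standard notation (which may differ from the paper's), the longitudinal g-formula underlying this approach evaluates, for a treatment regime $\bar a_K$ over visits $0,\dots,K$,
\[
\mathbb{E}\bigl[Y^{\bar a_K}\bigr]
= \int \mathbb{E}\bigl[Y \mid \bar A_K=\bar a_K,\ \bar L_K=\bar l_K\bigr]
\prod_{t=0}^{K} p\bigl(l_t \mid \bar l_{t-1}, \bar a_{t-1}\bigr)\,\mathrm{d}\bar l_K,
\]
with $\bar A_K$ and $\bar L_K$ the treatment and covariate histories and the convention that $\bar l_{-1},\bar a_{-1}$ are empty; the extended g-formula generalizes this to interventions that may depend on the natural value of treatment, such as the incremental sBP intervention considered here, with the conditional distributions modeled semi-parametrically as described in the abstract.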
We discuss avoidance of sure loss and coherence results for semicopulas and standardized functions, i.e., for grounded, 1-increasing functions with value $1$ at $(1,1,\ldots, 1)$. We characterize the existence of a $k$-increasing $n$-variate function $C$ fulfilling $A\leq C\leq B$ for standardized $n$-variate functions $A,B$ and discuss a method for constructing such a function. Our proofs also include procedures for extending functions defined on a countably infinite mesh to functions on the unit box. We provide a characterization of when $A$, respectively $B$, coincides with the pointwise infimum, respectively supremum, of the set of all $k$-increasing $n$-variate functions $C$ fulfilling $A\leq C\leq B$.
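For reference, in the bivariate case $n=k=2$ the $k$-increasing requirement is the familiar $2$-increasing condition that every rectangle receive nonnegative $C$-volume,
\[
V_C\bigl([x_1,x_2]\times[y_1,y_2]\bigr)
= C(x_2,y_2)-C(x_1,y_2)-C(x_2,y_1)+C(x_1,y_1) \;\ge\; 0
\qquad\text{for all } x_1\le x_2,\ y_1\le y_2;
\]
in the general $n$-variate setting, $k$-increasingness analogously asks for nonnegative $C$-volumes of all $k$-dimensional boxes obtained by fixing the remaining $n-k$ coordinates (a sketch only; precise formulations vary slightly in the literature).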