In epidemiology research with cancer registry data, it is often of primary interest to make inference on cancer death, not overall survival. Since cause of death is not easy to collect or is not necessarily reliable in cancer registries, some special methodologies have been introduced and widely used by using the concepts of the relative survival ratio and the net survival. In making inference of those measures, external life tables of the general population are utilized to adjust the impact of non-cancer death on overall survival. The validity of this adjustment relies on the assumption that mortality in the external life table approximates non-cancer mortality of cancer patients. However, the population used to calculate a life table may include cancer death and cancer patients. Sensitivity analysis proposed by Talb\"{a}ck and Dickman to address it requires additional information which is often not easily available. We propose a method to make inference on the net survival accounting for potential presence of cancer patients and cancer death in the life table for the general population. The idea of adjustment is to consider correspondence of cancer mortality in the life table and that in the cancer registry. We realize a novel method to adjust cancer mortality in the cancer registry without any additional information to the standard analyses of cancer registries. Our simulation study revealed that the proposed method successfully removed the bias. We illustrate the proposed method with the cancer registry data in England.
Let us consider the deconvolution problem, that is, to recover a latent source $x(\cdot)$ from the observations $\mathbf{y} = [y_1,\ldots,y_N]$ of a convolution process $y = x\star h + \eta$, where $\eta$ is an additive noise, the observations in $\mathbf{y}$ might have missing parts with respect to $y$, and the filter $h$ could be unknown. We propose a novel strategy to address this task when $x$ is a continuous-time signal: we adopt a Gaussian process (GP) prior on the source $x$, which allows for closed-form Bayesian nonparametric deconvolution. We first analyse the direct model to establish the conditions under which the model is well defined. Then, we turn to the inverse problem, where we study i) some necessary conditions under which Bayesian deconvolution is feasible, and ii) to which extent the filter $h$ can be learnt from data or approximated for the blind deconvolution case. The proposed approach, termed Gaussian process deconvolution (GPDC) is compared to other deconvolution methods conceptually, via illustrative examples, and using real-world datasets.
We consider the problem of information aggregation in federated decision making, where a group of agents collaborate to infer the underlying state of nature without sharing their private data with the central processor or each other. We analyze the non-Bayesian social learning strategy in which agents incorporate their individual observations into their opinions (i.e., soft-decisions) with Bayes rule, and the central processor aggregates these opinions by arithmetic or geometric averaging. Building on our previous work, we establish that both pooling strategies result in asymptotic normality characterization of the system, which, for instance, can be utilized to derive approximate expressions for the error probability. We verify the theoretical findings with simulations and compare both strategies.
Information-theoretic quantities reveal dependencies among variables in the structure of joint, marginal, and conditional entropies, but leave some fundamentally different systems indistinguishable. Furthermore, there is no consensus on how to construct and interpret a higher-order generalisation of mutual information (MI). In this manuscript, we show that a recently proposed model-free definition of higher-order interactions amongst binary variables (MFIs), like mutual information, is a M\"obius inversion on a Boolean algebra, but of surprisal instead of entropy. This gives an information-theoretic interpretation to the MFIs, and by extension to Ising interactions. We study the dual objects to MI and MFIs on the order-reversed lattice, and find that dual MI is related to the previously studied differential mutual information, while dual interactions (outeractions) are interactions with respect to a different background state. Unlike mutual information, in- and outeractions uniquely identify all six 2-input logic gates, the dy- and triadic distributions, and different causal dynamics that are identical in terms of their Shannon-information content.
We study a data-driven approach to the bee identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify $M$ bees, each tagged by a unique barcode, via a set of $M$ noisy measurements. Later, Chrisnata et al. (2022) extended the model to case where one observes $N$ noisy measurements of each bee, and applied the model to address the unordered nature of DNA storage systems. In such systems, a unique address is typically prepended to each DNA data block to form a DNA strand, but the address may possibly be corrupted. While clustering is usually used to identify the address of a DNA strand, this requires $\mathcal{M}^2$ data comparisons (when $\mathcal{M}$ is the number of reads). In contrast, the approach of Chrisnata et al. (2022) avoids data comparisons completely. In this work, we study an intermediate, data-driven approach to this identification task. For the binary erasure channel, we first show that we can almost surely correctly identify all DNA strands under certain mild assumptions. Then we propose a data-driven pruning procedure and demonstrate that on average the procedure uses only a fraction of $\mathcal{M}^2$ data comparisons. Specifically, for $\mathcal{M}= 2^n$ and erasure probability $p$, the expected number of data comparisons performed by the procedure is $\kappa\mathcal{M}^2$, where $\left(\frac{1+2p-p^2}{2}\right)^n \leq \kappa \leq \left(\frac{1+p}{2}\right)^n $.
Turing's 1950 paper introduced the famed "imitation game", a test originally proposed to capture the notion of machine intelligence. Over the years, the Turing test spawned a large amount of interest, which resulted in several variants, as well as heated discussions and controversy. Here we sidestep the question of whether a particular machine can be labeled intelligent, or can be said to match human capabilities in a given context. Instead, but inspired by Turing, we draw attention to the seemingly simpler challenge of determining whether one is interacting with a human or with a machine, in the context of everyday life. We are interested in reflecting upon the importance of this Human-or-Machine question and the use one may make of a reliable answer thereto. Whereas Turing's original test is widely considered to be more of a thought experiment, the Human-or-Machine question as discussed here has obvious practical significance. And while the jury is still not in regarding the possibility of machines that can mimic human behavior with high fidelity in everyday contexts, we argue that near-term exploration of the issues raised here can contribute to development methods for computerized systems, and may also improve our understanding of human behavior in general.
Gun violence is a major problem in contemporary American society, with tens of thousands injured each year. However, relatively little is known about the effects on family members and how effects vary across subpopulations. To study these questions and, more generally, to address a gap in the causal inference literature, we present a framework for the study of effect modification or heterogeneous treatment effects in difference-in-differences designs. We implement a new matching technique, which combines profile matching and risk set matching, to (i) preserve the time alignment of covariates, exposure, and outcomes, avoiding pitfalls of other common approaches for difference-in-differences, and (ii) explicitly control biases due to imbalances in observed covariates in subgroups discovered from the data. Our case study shows significant and persistent effects of nonfatal firearm injuries on several health outcomes for those injured and on the mental health of their family members. Sensitivity analyses reveal that these results are moderately robust to unmeasured confounding bias. Finally, while the effects for those injured are modified largely by the severity of the injury and its documented intent, for families, effects are strongest for those whose relative's injury is documented as resulting from an assault, self-harm, or law enforcement intervention.
Localizing root causes for multi-dimensional data is critical to ensure online service systems' reliability. When a fault occurs, only the measure values within specific attribute combinations are abnormal. Such attribute combinations are substantial clues to the underlying root causes and thus are called root causes of multidimensional data. This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze. We propose a generic property of root cause for multi-dimensional data, generalized ripple effect (GRE). Based on it, we propose a novel probabilistic cluster method and a robust heuristic search method. Moreover, we identify the importance of determining external root causes and propose an effective method for the first time in literature. Our experiments on two real-world datasets with 5400 faults show that the F1-score of PSqueeze outperforms baselines by 32.89%, while the localization time is around 10 seconds across all cases. The F1-score in determining external root causes of PSqueeze achieves 0.90. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful to fault diagnosis in the real world.
The aim of this paper is to study the shape optimization method for solving the Bernoulli free boundary problem, a well-known ill-posed problem that seeks the unknown free boundary through Cauchy data. Different formulations have been proposed in the literature that differ in the choice of the objective functional. Specifically, it was shown respectively in [14] and [16] that tracking Neumann data is well-posed but tracking Dirichlet data is not. In this paper we propose a new well-posed objective functional that tracks Dirichlet data at the free boundary. By calculating the Euler derivative and the shape Hessian of the objective functional we show that the new formulation is well-posed, i.e., the shape Hessian is coercive at the minimizers. The coercivity of the shape Hessian may ensure the existence of optimal solutions for the nonlinear Ritz-Galerkin approximation method and its convergence, thus is crucial for the formulation. As a summary, we conclude that tracking Dirichlet or Neumann data in its energy norm is not sufficient, but tracking it in a half an order higher norm will be well-posed. To support our theoretical results we carry out extensive numerical experiments.
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autonomously navigate neighborhood streets. In response to these developments, new paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning. These paradigms leverage the existence of ever-larger datasets curated for multimodal, multitask, and generalist interaction. Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems that can interact effectively across a diverse range of applications such as dialogue, autonomous driving, healthcare, education, and robotics. In this manuscript, we examine the scope of foundation models for decision making, and provide conceptual tools and technical background for understanding the problem space and exploring new research directions. We review recent approaches that ground foundation models in practical decision making applications through a variety of methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, and discuss common challenges and open problems in the field.
Human-in-the-loop aims to train an accurate prediction model with minimum cost by integrating human knowledge and experience. Humans can provide training data for machine learning applications and directly accomplish some tasks that are hard for computers in the pipeline with the help of machine-based approaches. In this paper, we survey existing works on human-in-the-loop from a data perspective and classify them into three categories with a progressive relationship: (1) the work of improving model performance from data processing, (2) the work of improving model performance through interventional model training, and (3) the design of the system independent human-in-the-loop. Using the above categorization, we summarize major approaches in the field, along with their technical strengths/ weaknesses, we have simple classification and discussion in natural language processing, computer vision, and others. Besides, we provide some open challenges and opportunities. This survey intends to provide a high-level summarization for human-in-the-loop and motivates interested readers to consider approaches for designing effective human-in-the-loop solutions.