Localizing root causes for multi-dimensional data is critical to ensure online service systems' reliability. When a fault occurs, only the measure values within specific attribute combinations are abnormal. Such attribute combinations are substantial clues to the underlying root causes and thus are called root causes of multidimensional data. This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze. We propose a generic property of root cause for multi-dimensional data, generalized ripple effect (GRE). Based on it, we propose a novel probabilistic cluster method and a robust heuristic search method. Moreover, we identify the importance of determining external root causes and propose an effective method for the first time in literature. Our experiments on two real-world datasets with 5400 faults show that the F1-score of PSqueeze outperforms baselines by 32.89%, while the localization time is around 10 seconds across all cases. The F1-score in determining external root causes of PSqueeze achieves 0.90. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful to fault diagnosis in the real world.
Multi-agent dynamical systems refer to scenarios where multiple units interact with each other and evolve collectively over time. To make informed decisions in multi-agent dynamical systems, such as determining the optimal vaccine distribution plan, it is essential for decision-makers to estimate the continuous-time counterfactual outcomes. However, existing studies of causal inference over time rely on the assumption that units are mutually independent, which is not valid for multi-agent dynamical systems. In this paper, we aim to bridge this gap and study how to estimate counterfactual outcomes in multi-agent dynamical systems. Causal inference in a multi-agent dynamical system has unique challenges: 1) Confounders are time-varying and are present in both individual unit covariates and those of other units; 2) Units are affected by not only their own but also others' treatments; 3) The treatments are naturally dynamic, such as receiving vaccines and boosters in a seasonal manner. We model a multi-agent dynamical system as a graph and propose CounterFactual GraphODE (CF-GODE), a causal model that estimates continuous-time counterfactual outcomes in the presence of inter-dependencies between units. To facilitate continuous-time estimation, we propose Treatment-Induced GraphODE, a novel ordinary differential equation based on GNN, which incorporates dynamical treatments as additional inputs to predict potential outcomes over time. To remove confounding bias, we propose two domain adversarial learning based objectives that learn balanced continuous representation trajectories, which are not predictive of treatments and interference. We further provide theoretical justification to prove their effectiveness. Experiments on two semi-synthetic datasets confirm that CF-GODE outperforms baselines on counterfactual estimation. We also provide extensive analyses to understand how our model works.
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at //github.com/ncbi/BioREx.
Consider a two-person zero-sum search game between a Hider and a Searcher. The Hider chooses to hide in one of $n$ discrete locations (or "boxes") and the Searcher chooses a search sequence specifying which order to look in these boxes until finding the Hider. A search at box $i$ takes $t_i$ time units and finds the Hider - if hidden there - independently with probability $q_i$, for $i=1,\ldots,n$. The Searcher wants to minimize the expected total time needed to find the Hider, while the Hider wants to maximize it. It is shown in the literature that the Searcher has an optimal search strategy that mixes up to $n$ distinct search sequences with appropriate probabilities. This paper investigates the existence of optimal pure strategies for the Searcher - a single deterministic search sequence that achieves the optimal expected total search time regardless of where the Hider hides. We identify several cases in which the Searcher has an optimal pure strategy, and several cases in which such optimal pure strategy does not exist. An optimal pure search strategy has significant practical value because the Searcher does not need to randomize their actions and will avoid second guessing themselves if the chosen search sequence from an optimal mixed strategy does not turn out well.
In multivariate time series analysis, the coherence measures the linear dependency between two-time series at different frequencies. However, real data applications often exhibit nonlinear dependency in the frequency domain. Conventional coherence analysis fails to capture such dependency. The quantile coherence, on the other hand, characterizes nonlinear dependency by defining the coherence at a set of quantile levels based on trigonometric quantile regression. Although quantile coherence is a more powerful tool, its estimation remains challenging due to the high level of noise. This paper introduces a new estimation technique for quantile coherence. The proposed method is semi-parametric, which uses the parametric form of the spectrum of the vector autoregressive (VAR) model as an approximation to the quantile spectral matrix, along with nonparametric smoothing across quantiles. For each fixed quantile level, we obtain the VAR parameters from the quantile periodograms, then, using the Durbin-Levinson algorithm, we calculate the preliminary estimate of quantile coherence using the VAR parameters. Finally, we smooth the preliminary estimate of quantile coherence across quantiles using a nonparametric smoother. Numerical results show that the proposed estimation method outperforms nonparametric methods. We show that quantile coherence-based bivariate time series clustering has advantages over the ordinary VAR coherence. For applications, the identified clusters of financial stocks by quantile coherence with a market benchmark are shown to have an intriguing and more accurate structure of diversified investment portfolios that may be used by investors to make better decisions.
Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how observed neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity, or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time series data as a sparse combination of simpler, more interpretable components. Our model is trained through a dictionary learning procedure, where we leverage recent results in tracking sparse vectors over time. The decomposed nature of the dynamics is more expressive than previous switched approaches for a given number of parameters and enables modeling of overlapping and non-stationary dynamics. In both continuous-time and discrete-time instructional examples we demonstrate that our model can well approximate the original system, learn efficient representations, and capture smooth transitions between dynamical modes, focusing on intuitive low-dimensional non-stationary linear and nonlinear systems. Furthermore, we highlight our model's ability to efficiently capture and demix population dynamics generated from multiple independent subnetworks, a task that is computationally impractical for switched models. Finally, we apply our model to neural "full brain" recordings of C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
The entropy production rate is a central quantity in non-equilibrium statistical physics, scoring how far a stochastic process is from being time-reversible. In this paper, we compute the entropy production of diffusion processes at non-equilibrium steady-state under the condition that the time-reversal of the diffusion remains a diffusion. We start by characterising the entropy production of both discrete and continuous-time Markov processes. We investigate the time-reversal of time-homogeneous stationary diffusions and recall the most general conditions for the reversibility of the diffusion property, which includes hypoelliptic and degenerate diffusions, and locally Lipschitz vector fields. We decompose the drift into its time-reversible and irreversible parts, or equivalently, the generator into symmetric and antisymmetric operators. We show the equivalence with a decomposition of the backward Kolmogorov equation considered in hypocoercivity theory, and a decomposition of the Fokker-Planck equation in GENERIC form. The main result shows that when the time-irreversible part of the drift is in the range of the volatility matrix (almost everywhere) the forward and time-reversed path space measures of the process are mutually equivalent, and evaluates the entropy production. When this does not hold, the measures are mutually singular and the entropy production is infinite. We verify these results using exact numerical simulations of linear diffusions. We illustrate the discrepancy between the entropy production of non-linear diffusions and their numerical simulations in several examples and illustrate how the entropy production can be used for accurate numerical simulation. Finally, we discuss the relationship between time-irreversibility and sampling efficiency, and how we can modify the definition of entropy production to score how far a process is from being generalised reversible.
This paper addresses the problem of data-driven model discrimination for unknown switched systems with unknown linear temporal logic (LTL) specifications, representing tasks, that govern their mode sequences, where only sampled data of the unknown dynamics and tasks are available. To tackle this problem, we propose data-driven methods to over-approximate the unknown dynamics and to infer the unknown specifications such that both set-membership models of the unknown dynamics and LTL formulas are guaranteed to include the ground truth model and specification/task. Moreover, we present an optimization-based algorithm for analyzing the distinguishability of a set of learned/inferred model-task pairs as well as a model discrimination algorithm for ruling out model-task pairs from this set that are inconsistent with new observations at run time. Further, we present an approach for reducing the size of inferred specifications to increase the computational efficiency of the model discrimination algorithms.
The direct parametrisation method for invariant manifold is a model-order reduction technique that can be directly applied to finite element problems in order to derive efficient and converged reduced-order models (ROMs) for non-linear structures. In the field of nonlinear vibrations, it has already been applied to autonomous and non-autonomous problems in order to propose ROMs that can compute backbones and frequency-response curves of structures with geometric nonlinearity. While previous developments used a first-order development in order to cope with the non-autonomous term, this assumption is here relaxed by proposing a completely different treatment. The key idea is to enlarge the dimension of the dynamical system to make it autonomous and treat the added coordinates related to the forcing as already being written with normal coordinates. The parametrisation method is derived with this starting assumption and, as a key consequence, the resonance relationships appearing through the homological equations involve multiple occurrences of the forcing frequency, showing that with this new development, one is able to compute ROMs for superharmonic resonance. The method is implemented and validated on academic test cases involving beams and arches. It is numerically demonstrated that the method generates efficient ROMs for 3:1 and 2:1 superharmonic resonances, as well as converged results for problems where the first-order truncation on the non-autonomous terms used in previous developments showed a clear limitation.
We revisit the problem of sampling from a target distribution that has a smooth strongly log-concave density everywhere in $\mathbb R^p$. In this context, if no additional density information is available, the randomized midpoint discretization for the kinetic Langevin diffusion is known to be the most scalable method in high dimensions with large condition numbers. Our main result is a nonasymptotic and easy to compute upper bound on the Wasserstein-2 error of this method. To provide a more thorough explanation of our method for establishing the computable upper bound, we conduct an analysis of the midpoint discretization for the vanilla Langevin process. This analysis helps to clarify the underlying principles and provides valuable insights that we use to establish an improved upper bound for the kinetic Langevin process with the midpoint discretization. Furthermore, by applying these techniques we establish new guarantees for the kinetic Langevin process with Euler discretization, which have a better dependence on the condition number than existing upper bounds.
PhotoBook is a collaborative dialogue game where two players receive private, partially-overlapping sets of images and resolve which images they have in common. It presents machines with a great challenge to learn how people build common ground around multimodal context to communicate effectively. Methods developed in the literature, however, cannot be deployed to real gameplay since they only tackle some subtasks of the game, and they require additional reference chains inputs, whose extraction process is imperfect. Therefore, we propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with partner. Our DeBERTa-based listener model reads the full dialogue, and utilizes CLIPScore features to assess utterance-image relevance. We achieve >77% accuracy on unseen sets of images/game themes, outperforming baseline by >17 points.