We propose a MANOVA test for semicontinuous data that remains applicable when the dimensionality exceeds the sample size. The test statistic is obtained as a likelihood ratio, whose numerator and denominator are evaluated at the maxima of penalized likelihood functions under each hypothesis. Closed-form solutions for the regularized estimators allow us to avoid computational overhead. We derive the null distribution using a permutation scheme. The power and level of the resulting test are evaluated in a simulation study. We illustrate the new methodology with two original data analyses, one regarding microRNA expression in human blastocyst cultures, and another regarding alien plant species invasion on the island of Socotra (Yemen).
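To make the permutation scheme concrete, here is a minimal sketch of a generic permutation test for a likelihood-ratio-type statistic. The penalized, closed-form estimators of the actual test are not specified in the abstract, so `stat_fn` below is a hypothetical placeholder for them.

```python
import numpy as np

def permutation_pvalue(stat_fn, X, groups, n_perm=999, seed=0):
    """Permutation null for a likelihood-ratio-type MANOVA statistic.

    stat_fn(X, groups) -> float is a placeholder for the penalized
    likelihood-ratio statistic; group labels are permuted to emulate
    the null hypothesis of no group effect.
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(X, groups)
    exceed = sum(
        stat_fn(X, rng.permutation(groups)) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)  # add-one p-value estimate
```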
Learning unknown stochastic differential equations (SDEs) from observed data is a significant and challenging task with applications across many fields. Current approaches often use neural networks to represent the drift and diffusion functions and construct a likelihood-based loss, obtained by approximating the transition density, to train these networks. However, such methods typically rely on one-step stochastic numerical schemes and therefore require data with sufficiently high time resolution. In this paper, we introduce novel approximations to the transition density of the parameterized SDE: a Gaussian density approximation inspired by the random perturbation theory of dynamical systems, and its extension, the dynamical Gaussian mixture approximation (DynGMA). Benefiting from this robust density approximation, our method achieves superior accuracy over baseline methods in learning the fully unknown drift and diffusion functions and in computing the invariant distribution from trajectory data. Moreover, it can handle trajectory data with low time resolution and variable, even uncontrollable, time step sizes, such as data generated by Gillespie's stochastic simulations. We conduct several experiments across various scenarios to verify the advantages and robustness of the proposed method.
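For context, the one-step Gaussian (Euler-Maruyama) transition density that such likelihood-based methods typically rely on, and that the proposed approximations improve upon, can be sketched as follows; `drift` and `diffusion` stand in for the parameterized networks.

```python
import numpy as np

def em_gaussian_loglik(x0, x1, dt, drift, diffusion):
    """Log-likelihood of one transition under the Euler-Maruyama
    Gaussian approximation: x1 ~ N(x0 + f(x0) dt, g(x0) g(x0)^T dt).

    drift(x) -> (d,), diffusion(x) -> (d, d). This is the one-step
    baseline the abstract refers to; DynGMA itself is more elaborate.
    """
    mean = x0 + drift(x0) * dt
    cov = diffusion(x0) @ diffusion(x0).T * dt
    resid = x1 - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (logdet + resid @ np.linalg.solve(cov, resid))
```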
The generalized Pareto distribution (GPD) is the classical tool for analyzing extreme value data. When the GPD is used to explain a target variable with a large number of covariates, the shape and scale functions of the covariates included in the GPD are sometimes modeled using generalized additive models (GAMs). In contrast to the many applied results, there are no theoretical results on this hybrid GAM-GPD technique, which motivates us to develop its asymptotic theory. We provide the rate of convergence of the estimators of the shape and scale functions, as well as their local asymptotic normality.
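As a rough illustration of the hybrid technique (not the paper's estimator), the GPD negative log-likelihood with covariate-dependent scale and shape expressed through a spline basis, as in a GAM, might be sketched like this:

```python
import numpy as np

def gpd_negloglik(params, Y, Bmat):
    """Negative log-likelihood of exceedances Y under a GPD whose
    log-scale and shape are spline functions of covariates.

    Bmat: (n, k) spline basis evaluated at the covariates; params
    stacks the k coefficients of log sigma(x) and the k of xi(x).
    Assumes xi(x) is bounded away from zero; the xi -> 0
    (exponential) limit would need special handling.
    """
    k = Bmat.shape[1]
    sigma = np.exp(Bmat @ params[:k])   # scale function, kept positive
    xi = Bmat @ params[k:]              # shape function
    z = 1 + xi * Y / sigma
    if np.any(z <= 0):                  # outside the GPD support
        return np.inf
    return np.sum(np.log(sigma) + (1 / xi + 1) * np.log(z))
```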
Several mixed-effects models for longitudinal data have been proposed to accommodate the non-linearity of late-life cognitive trajectories and to assess the putative influence of covariates on these trajectories. No prior research provides a side-by-side examination of these models to offer guidance on their proper application and interpretation. In this work, we examined five statistical approaches previously used to answer research questions related to non-linear changes in cognitive aging: the linear mixed model (LMM) with a quadratic term, the LMM with splines, the functional mixed model, the piecewise linear mixed model, and the sigmoidal mixed model. We first describe the models theoretically. Next, using data from two prospective cohorts with annual cognitive testing, we compared the interpretation of the models by investigating the association of education with cognitive change before death. Lastly, we performed a simulation study to empirically evaluate the models and provide practical recommendations. Except for the LMM with a quadratic term, all models fit the data adequately, capturing the non-linearity of cognitive change, and were relatively robust. Although spline-based models have no interpretable non-linearity parameters, their convergence was easier to achieve and they allow graphical interpretation. In contrast, the piecewise and sigmoidal models, whose non-linear parameters are interpretable, may require more data to achieve convergence.
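A minimal sketch of one of the compared approaches, the LMM with splines, fit with statsmodels; the column names (`cog`, `time`, `educ`, `id`), the file name, and the spline degrees of freedom are all hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cohort_long.csv")  # hypothetical long-format data

# B-spline basis for time before death, moderated by education;
# random intercept and slope per participant.
model = smf.mixedlm(
    "cog ~ bs(time, df=4) * educ",
    data=df,
    groups=df["id"],
    re_formula="~time",
)
fit = model.fit()
print(fit.summary())
```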
Genome assembly is a prominent problem in bioinformatics: reconstructing a source string from a set of its overlapping substrings. Classically, genome assembly builds assembly graphs over this set of substrings to compute the source string efficiently, trading off scalability against information loss. Scalable de Bruijn graphs come at the price of losing crucial overlap information, while overlap graphs store the complete overlap information at quadratic space. Hierarchical overlap graphs (HOG) [IPL20] overcome these limitations, avoiding information loss while using only linear space. After a series of suboptimal improvements, Khan and Park et al. simultaneously presented two optimal algorithms [CPM2021], of which only the former appeared practical. We empirically analyze all the practical algorithms for computing HOG; the optimal algorithm [CPM2021] outperforms the previous algorithms as expected, though at the expense of extra memory, and it relies on a non-intuitive approach and non-trivial data structures. We present arguably the most intuitive algorithm, using only elementary arrays, which is also optimal. Our algorithm empirically performs best among all the algorithms in both time and memory, highlighting its significance in both theory and practice. We further explore applications of hierarchical overlap graphs to various forms of suffix-prefix queries on a set of strings. Loukides et al. [CPM2023] recently presented state-of-the-art algorithms for these queries, but they require complex black-box data structures and appear impractical. Our algorithms, despite not matching the state-of-the-art bounds theoretically, answer the different queries in 0.01-100 milliseconds on a data set of around a billion characters.
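As a baseline for intuition, the elementary suffix-prefix query underlying these data structures can be answered naively as below; HOG-based algorithms answer all-pairs versions of such queries in (near-)optimal time rather than by this quadratic scan.

```python
def longest_suffix_prefix(s: str, t: str) -> int:
    """Length of the longest suffix of s that is a prefix of t.

    Naive O(min(|s|, |t|)^2) baseline for the suffix-prefix queries
    that HOG-based algorithms support efficiently.
    """
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

assert longest_suffix_prefix("ACGT", "GTAC") == 2  # overlap "GT"
```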
By computing a feedback control via the linear quadratic regulator (LQR) approach and simulating a non-linear, non-autonomous closed-loop system using this feedback, we combine two numerically challenging tasks. For the first task, the computation of the feedback control, we use the non-autonomous generalized differential Riccati equation (DRE), whose solution determines the time-varying feedback gain matrix. For the second task, we want to simulate non-linear closed-loop systems for which the regulator is known to be valid only for sufficiently small perturbations; one thus easily runs into numerical issues in the integrators when the closed-loop control varies strongly, and for such systems even the A-stable implicit Euler method fails. On the one hand, we implement non-autonomous versions of splitting schemes and BDF methods, which are well-established DRE solvers in the autonomous case, for the solution of our non-autonomous DREs. On the other hand, to tackle the numerical issues in the simulation of the non-linear closed-loop system, we apply a fractional-step-theta scheme with time-adaptivity tuned specifically to this kind of challenge: we additionally base the time-adaptivity on the activity of the control. We compare this approach to the more classical error-based time-adaptivity. We describe techniques that make these two tasks computable in a reasonable amount of time, and we are able to simulate closed-loop systems with strongly varying controls while avoiding numerical issues. Our time-adaptivity approach requires fewer time steps than the error-based alternative and is more reliable.
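For a flavor of the first task, here is a minimal sketch that integrates a small autonomous DRE backward in time with an off-the-shelf ODE solver; the paper's non-autonomous splitting and BDF solvers are of course more specialized, and the system matrices here are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Small hypothetical LTI system and LQR weights.
A = np.array([[0.0, 1.0], [-2.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
PT = np.zeros((2, 2))   # terminal condition P(T) = 0
T = 5.0

def riccati_rhs(s, p_flat):
    # Time reversal s = T - t, so integrating forward in s
    # runs the DRE backward in t.
    P = p_flat.reshape(2, 2)
    dP = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T) @ P + Q
    return dP.ravel()

sol = solve_ivp(riccati_rhs, (0.0, T), PT.ravel())
P0 = sol.y[:, -1].reshape(2, 2)    # Riccati solution at t = 0
K0 = np.linalg.solve(R, B.T @ P0)  # feedback gain: u(0) = -K0 x(0)
```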
New biological assays like Perturb-seq link highly parallel CRISPR interventions to a high-dimensional transcriptomic readout, providing insight into gene regulatory networks. Causal gene regulatory networks can be represented by directed acyclic graphs (DAGs), but learning DAGs from observational data is complicated by a lack of identifiability and a combinatorial solution space. Score-based structure learning improves the practical scalability of inferring DAGs. Previous score-based methods are sensitive to the error variance structure; on the other hand, estimating the error variance is difficult without prior knowledge of the structure. Accordingly, we present $\texttt{dotears}$ [doo-tairs], a continuous optimization framework that leverages observational and interventional data to infer a single causal structure, assuming a linear structural equation model (SEM). $\texttt{dotears}$ exploits structural consequences of hard interventions to give a marginal estimate of the exogenous error structure, bypassing this circular estimation problem. We show that $\texttt{dotears}$ is a provably consistent estimator of the true DAG under mild assumptions. $\texttt{dotears}$ outperforms other methods in varied simulations, and in real data it infers edges that validate with higher precision and recall than state-of-the-art methods, as judged by differential expression tests and high-confidence protein-protein interactions.
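The name $\texttt{dotears}$ suggests it builds on the NOTEARS family of continuous DAG learners; the standard continuous acyclicity characterization used by that family is sketched below (an assumption for illustration, not a quotation of the $\texttt{dotears}$ objective).

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W: np.ndarray) -> float:
    """NOTEARS-style constraint h(W) = tr(exp(W * W)) - d.

    h(W) = 0 exactly when the weighted adjacency matrix W encodes a
    DAG; continuous optimizers drive h(W) to zero while minimizing a
    least-squares score for the linear SEM.
    """
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # W * W is elementwise
```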
Many recent works address the question of characterizing induced obstructions to bounded treewidth. In 2022, Lozin and Razgon completely answered this question for graph classes defined by finitely many forbidden induced subgraphs. Their result also implies a characterization of graph classes defined by finitely many forbidden induced subgraphs that are $(tw,\omega)$-bounded, that is, treewidth can only be large due to the presence of a large clique. This condition is known to be satisfied for any graph class with bounded tree-independence number, a graph parameter introduced independently by Yolov in 2018 and by Dallard, Milani\v{c}, and \v{S}torgel in 2024. Dallard et al. conjectured that $(tw,\omega)$-boundedness is actually equivalent to bounded tree-independence number. We address this conjecture in the context of graph classes defined by finitely many forbidden induced subgraphs and prove it for the case of graph classes excluding an induced star. We also prove it for subclasses of the class of line graphs, determine the exact values of the tree-independence numbers of line graphs of complete graphs and line graphs of complete bipartite graphs, and characterize the tree-independence number of $P_4$-free graphs, which implies a linear-time algorithm for its computation. Applying the algorithmic framework provided in a previous paper of the series leads to polynomial-time algorithms for the Maximum Weight Independent Set problem in an infinite family of graph classes.
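For reference, the tree-independence number discussed above is defined, following Dallard, Milani\v{c}, and \v{S}torgel, as the following min-max over tree decompositions:

```latex
\[
  \operatorname{tree}\text{-}\alpha(G)
  \;=\;
  \min_{(T,\,\{X_t\}_{t \in V(T)})}\;
  \max_{t \in V(T)}\;
  \alpha\bigl(G[X_t]\bigr),
\]
% where the minimum ranges over all tree decompositions of G and
% \alpha(G[X_t]) is the independence number of the subgraph induced
% by the bag X_t.
```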
We introduce a fine-grained framework for uncertainty quantification of predictive models under distributional shifts. This framework distinguishes the shift in covariate distributions from that in the conditional relationship between the outcome (Y) and the covariates (X). We propose to reweight the training samples to adjust for an identifiable covariate shift while protecting against worst-case conditional distribution shift bounded in an $f$-divergence ball. Based on ideas from conformal inference and distributionally robust learning, we present an algorithm that outputs (approximately) valid and efficient prediction intervals in the presence of distributional shifts. As a use case, we apply the framework to sensitivity analysis of individual treatment effects with hidden confounding. The proposed methods are evaluated in simulation studies and three real data applications, demonstrating superior robustness and efficiency compared with existing benchmarks.
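A minimal sketch of the covariate-shift building block on which such intervals rest, the weighted split-conformal quantile in the style of Tibshirani et al. (2019); the paper's worst-case $f$-divergence adjustment for conditional shift is omitted here.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha=0.1):
    """Weighted split-conformal score quantile under covariate shift.

    scores: nonconformity scores on the calibration set.
    weights: estimated density ratios w(X_i) at calibration points.
    test_weight: estimated density ratio w(x) at the test point.
    """
    order = np.argsort(scores)
    s = scores[order]
    p = weights[order] / (weights.sum() + test_weight)
    cum = np.cumsum(p)
    # The test point contributes a mass at +infinity, so the
    # (1 - alpha)-quantile may be infinite.
    hits = np.nonzero(cum >= 1 - alpha)[0]
    return s[hits[0]] if hits.size else np.inf
```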
Time-series models typically assume untainted and legitimate streams of data. However, a self-interested adversary may have an incentive to corrupt this data, thereby altering a decision maker's inference. Within the broader field of adversarial machine learning, this research provides a novel, probabilistic perspective on the manipulation of hidden Markov model inferences via corrupted data. In particular, we provision a suite of corruption problems for filtering, smoothing, and decoding inferences leveraging an adversarial risk analysis approach. Multiple stochastic programming models are set forth that incorporate realistic uncertainties and varied attacker objectives. Three general solution methods are developed by alternatively viewing the problem from frequentist and Bayesian perspectives. The efficacy of each method is illustrated via extensive empirical testing. The developed methods are characterized by their solution quality and computational effort, resulting in a stratification of techniques across varying problem-instance architectures. This research highlights the weaknesses of hidden Markov models under adversarial activity, thereby motivating the need for robustification techniques to ensure their security.
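For concreteness, the filtering inference targeted by these corruption problems is the standard HMM forward recursion, sketched here; the adversarial layer of the paper sits on top of computations like this one.

```python
import numpy as np

def hmm_filter(pi, A, B, obs):
    """Forward-algorithm filtering: P(state_t | obs_{1:t}).

    pi: initial state distribution (K,), A: transition matrix (K, K),
    B: emission matrix (K, M), obs: sequence of observation indices.
    """
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()              # normalize for stability
    filt = [alpha]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o] # predict, then weight by emission
        alpha /= alpha.sum()
        filt.append(alpha)
    return np.vstack(filt)            # row t is the filtered posterior
```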
Hashing has been widely used in approximate nearest neighbor search for large-scale database retrieval, owing to its computational and storage efficiency. Deep hashing, which devises convolutional neural network architectures to extract the semantic information or features of images, has received increasing attention recently. In this survey, several deep supervised hashing methods for image retrieval are evaluated, and I identify three main directions for deep supervised hashing methods, with several comments made at the end. Moreover, to break through the bottleneck of existing hashing methods, I propose a Shadow Recurrent Hashing (SRH) method as a first attempt. Specifically, I devise a CNN architecture to extract the semantic features of images and design a loss function that encourages similar images to be projected close together. To this end, I propose a new concept: the shadow of the CNN output. During the optimization process, the CNN output and its shadow guide each other so as to approach the optimal solution as closely as possible. Several experiments on the CIFAR-10 dataset show the satisfying performance of SRH.
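The abstract does not specify SRH's loss, but a generic pairwise similarity-preserving deep-hashing loss of the kind described (pulling similar pairs together, pushing dissimilar ones apart, with a quantization penalty driving outputs toward binary codes) might be sketched as follows; all names and the margin value are illustrative.

```python
import torch

def pairwise_hash_loss(codes, sim, margin=2.0, quant_weight=0.1):
    """Generic deep-hashing loss (not the SRH loss, which the
    abstract does not specify).

    codes: (N, L) real-valued CNN outputs before binarization.
    sim:   (N, N) binary matrix, 1 for similar image pairs.
    """
    d = torch.cdist(codes, codes)                 # pairwise L2 distances
    pull = sim * d.pow(2)                         # similar pairs: close
    push = (1 - sim) * torch.clamp(margin - d, min=0).pow(2)
    quant = (codes.abs() - 1).pow(2).mean()       # push outputs toward +/-1
    return (pull + push).mean() + quant_weight * quant
```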