In this paper, we propose a general subgroup analysis framework based on semiparametric additive mixed-effects models for longitudinal data, which can identify subgroups on each covariate and estimate the corresponding regression functions simultaneously. In addition, the proposed procedure is applicable to both balanced and unbalanced longitudinal data. A backfitting algorithm combined with k-means clustering is developed to estimate each semiparametric additive component across subgroups and to detect the subgroup structure on each covariate. The number of groups is estimated by minimizing a Bayesian information criterion. Numerical studies demonstrate the efficacy and accuracy of the proposed procedure in identifying the subgroups and estimating the regression functions. In addition, we illustrate the usefulness of our method with an application to PBC data and provide a meaningful partition of the population.
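The grouping step can be conveyed with a minimal sketch: assuming subject-level effect estimates have already been obtained (here they are simulated), k-means is run over a range of candidate group numbers and a BIC-type criterion selects among them. This is not the authors' full backfitting procedure, only the clustering-plus-BIC ingredient, with hypothetical data and tuning choices.

```python
# Minimal sketch: group simulated subject-level effect estimates with k-means
# and pick the number of subgroups with a BIC-type criterion. This is NOT the
# authors' full backfitting procedure; all quantities below are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 150                                    # number of subjects
true_centers = np.array([-2.0, 0.0, 2.0])  # three latent subgroups
labels = rng.integers(0, 3, size=n)
theta_hat = true_centers[labels] + 0.3 * rng.standard_normal(n)  # per-subject estimates
X = theta_hat.reshape(-1, 1)

def bic_kmeans(X, k):
    """BIC-type score: n*log(RSS/n) + (#centers)*log(n)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    n_obs = X.shape[0]
    return n_obs * np.log(km.inertia_ / n_obs) + k * np.log(n_obs), km

scores = {k: bic_kmeans(X, k) for k in range(1, 7)}
best_k = min(scores, key=lambda k: scores[k][0])
print("estimated number of subgroups:", best_k)
print("subgroup centers:", np.sort(scores[best_k][1].cluster_centers_.ravel()))
```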
One of the most studied models of SAT is random SAT. In this model, instances consist of clauses chosen uniformly at random and independently of each other. This model may be unsatisfactory in that it fails to capture various features of SAT instances arising in real-world applications. Various modifications have been suggested to define models of industrial SAT. Here, we focus mainly on the aspect of community structure: the set of variables is partitioned into a number of disjoint communities, and clauses tend to consist of variables from the same community. Accordingly, we suggest a model of random industrial SAT in which the central generalization with respect to random SAT is the additional community structure. There has been a lot of work on the satisfiability threshold of random $k$-SAT, starting with the calculation of the threshold of $2$-SAT, up to the recent result that the threshold exists for sufficiently large $k$. In this paper, we study the satisfiability threshold for the proposed model of random industrial SAT. Our main result is that the threshold in this model tends to be smaller than its counterpart for random SAT. Moreover, under some conditions, this threshold even vanishes.
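A plausible generator in this spirit is sketched below as a hedged illustration (the paper's exact model may differ in how communities and clauses are drawn): variables are split into disjoint communities, and each clause draws its variables from a single community with probability `p_intra`, otherwise uniformly from all variables.

```python
# Hedged sketch of a community-structured random k-SAT generator; the precise
# model studied in the paper may differ in its details.
import random

def random_industrial_ksat(n_vars, n_clauses, k, n_communities, p_intra, seed=0):
    rng = random.Random(seed)
    variables = list(range(1, n_vars + 1))
    # split variables into (roughly equal) disjoint communities
    communities = [variables[i::n_communities] for i in range(n_communities)]
    formula = []
    for _ in range(n_clauses):
        if rng.random() < p_intra:
            pool = rng.choice(communities)   # intra-community clause
        else:
            pool = variables                 # "crossing" clause over all variables
        vs = rng.sample(pool, k)
        clause = [v if rng.random() < 0.5 else -v for v in vs]  # random polarities
        formula.append(clause)
    return formula

# Example: 3-SAT, 100 variables in 5 communities, clause density 4.0
print(random_industrial_ksat(100, 400, 3, 5, p_intra=0.9)[:3])
```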
A central question in multi-agent strategic games is learning the underlying utilities driving the agents' behaviour. Motivated by the increasing availability of large datasets, we develop a unifying data-driven technique to estimate agents' utility functions from their observed behaviour, irrespective of whether the observations correspond to (Nash) equilibrium configurations or to action-profile trajectories. Under standard assumptions on the parametrization of the utilities, the proposed inference method is computationally efficient and finds all the parameters that best rationalize the observed behaviour. We numerically validate our theoretical findings on the market share estimation problem under advertising competition, using historical data from the Coca-Cola Company and Pepsi Inc. duopoly.
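As a hedged illustration of the rationalization idea (not the paper's actual method), suppose utilities are linear in an unknown parameter vector and a pure Nash equilibrium of a finite two-player game is observed; the best-response conditions are then linear inequalities in the parameters, so the rationalizing parameters form a polyhedron, and one of them can be recovered by maximizing the best-response margin with a linear program. The game, feature maps, and simplex normalization below are hypothetical.

```python
# Hedged sketch: recover a parameter vector theta that rationalizes an observed
# pure Nash equilibrium when utilities are linear in theta. Made-up data.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, m = 3, 4                        # parameter dimension, actions per player
phi = [rng.standard_normal((m, m, d)) for _ in range(2)]  # feature maps phi_i(a_i, a_-i)
eq = (2, 1)                        # observed action profile, assumed to be an equilibrium

# inequalities: theta . (phi_i(a_i*, a_-i*) - phi_i(a, a_-i*)) >= eps for every deviation a
rows = []
for i, (ai, aj) in enumerate([(eq[0], eq[1]), (eq[1], eq[0])]):
    for a in range(m):
        if a != ai:
            rows.append(phi[i][ai, aj] - phi[i][a, aj])
A = np.array(rows)

# variables x = (theta, eps); maximize eps s.t. A theta >= eps, sum(theta) = 1, theta >= 0
c = np.zeros(d + 1); c[-1] = -1.0
A_ub = np.hstack([-A, np.ones((A.shape[0], 1))])   # -A theta + eps <= 0
b_ub = np.zeros(A.shape[0])
A_eq = np.array([[1.0] * d + [0.0]]); b_eq = [1.0]
bounds = [(0, None)] * d + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
# a negative optimal margin means no theta in the simplex rationalizes the profile
print("rationalizing theta:", res.x[:d], "margin:", res.x[-1])
```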
The hydrodynamic performance of a sea-going ship can be analysed using data obtained from the ship. Such data can be gathered from different sources, like onboard recorded in-service data, AIS data, and noon reports. Each of these sources is known to have its own inherent problems. The current work gives a brief introduction to these data sources as well as the common problems associated with them, along with some examples. In order to resolve most of these problems, a streamlined, semi-automatic framework for fast data processing is developed and presented here. The framework can be used to process the data obtained from any of the three sources mentioned above. It incorporates processing steps like interpolating weather hindcast (metocean) data to the ship's location in time, deriving additional features, validating data, estimating resistance components, cleaning data, and detecting outliers. A brief description of each of the processing steps is provided with examples from existing datasets. The processed data can be further used to analyse the hydrodynamic performance of a ship.
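A skeleton of such a processing pipeline is sketched below; the column names, the reduction of feature derivation to shaft power, the nearest-record hindcast matching, and the outlier thresholds are all hypothetical simplifications of the steps listed above.

```python
# Skeleton of a semi-automatic data processing pipeline for ship in-service
# data. Column names, formulas and thresholds are illustrative only.
import numpy as np
import pandas as pd

def process(ship: pd.DataFrame, hindcast: pd.DataFrame) -> pd.DataFrame:
    df = ship.sort_values("time").copy()
    # 1. match hindcast (metocean) data to the ship's time stamps (nearest record)
    df = pd.merge_asof(df, hindcast.sort_values("time"), on="time", direction="nearest")
    # 2. derive additional features, e.g. shaft power from torque and rpm
    df["power_kw"] = df["torque_knm"] * df["rpm"] * 2 * np.pi / 60.0
    # 3. basic validation: keep physically plausible records
    df = df[(df["speed_kn"] > 0) & (df["speed_kn"] < 30)]
    # 4. simple outlier detection: drop points far from a rolling median of power
    med = df["power_kw"].rolling(25, center=True, min_periods=1).median()
    return df[(df["power_kw"] - med).abs() < 3 * df["power_kw"].std()]

# tiny synthetic demonstration
rng = np.random.default_rng(3)
times = pd.date_range("2022-01-01", periods=200, freq="10min")
ship = pd.DataFrame({"time": times,
                     "speed_kn": 14 + rng.standard_normal(200),
                     "rpm": 80 + rng.standard_normal(200),
                     "torque_knm": 900 + 20 * rng.standard_normal(200)})
hindcast = pd.DataFrame({"time": pd.date_range("2022-01-01", periods=40, freq="60min"),
                         "wave_height_m": 1.5 + 0.5 * rng.random(40)})
print(process(ship, hindcast).head())
```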
We derive minimax testing errors in a distributed framework where the data are split over multiple machines and their communication to a central machine is limited to $b$ bits. We investigate both the $d$- and infinite-dimensional signal detection problems under Gaussian white noise. We also derive distributed testing algorithms attaining the theoretical lower bounds. Our results show that distributed testing is subject to fundamentally different phenomena that are not observed in distributed estimation. Among our findings, we show that testing protocols that have access to shared randomness can perform strictly better in some regimes than those that do not. Furthermore, we show that consistent nonparametric distributed testing is always possible, even with as little as $1$ bit of communication, and the corresponding test outperforms the best local test that uses only the information available at a single machine.
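The flavor of a 1-bit protocol can be conveyed with a toy sketch (a generic voting scheme, not the paper's minimax-optimal procedure): each machine transmits only the indicator that its local z-statistic exceeds a threshold, and the central machine aggregates the bits with a binomial test. The sample sizes, threshold, and signal strengths below are hypothetical.

```python
# Toy illustration of distributed testing with 1 bit per machine.
import numpy as np
from scipy.stats import binom, norm

rng = np.random.default_rng(42)
m, n = 50, 100          # machines, samples per machine
alpha_local = 0.5       # each machine "votes" with a median-type cutoff

def one_bit_test(mu, alpha=0.05):
    data = mu + rng.standard_normal((m, n))        # local observations
    local_stat = np.sqrt(n) * data.mean(axis=1)    # local z-statistics
    bits = (local_stat > norm.ppf(1 - alpha_local)).astype(int)  # 1 bit per machine
    # under H0 each bit is Bernoulli(alpha_local); reject if the count is too large
    p_value = binom.sf(bits.sum() - 1, m, alpha_local)
    return p_value < alpha

print("reject under H0:", one_bit_test(0.0))
print("reject under H1:", one_bit_test(0.1))
```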
Latent class analysis (LCA) is a useful tool for investigating the heterogeneity of a disease population with time-to-event data. We propose a new method based on the nonparametric maximum likelihood estimator (NPMLE), which facilitates a theoretically validated inference procedure for covariate effects and cumulative hazard functions. We assess the proposed method via extensive simulation studies and demonstrate improved predictive performance over the standard Cox regression model. We further illustrate the practical utility of the proposed method through an application to a mild cognitive impairment (MCI) cohort dataset.
Semi-competing risks refers to the survival analysis setting where the occurrence of a non-terminal event is subject to whether a terminal event has occurred, but not vice versa. Semi-competing risks arise in a broad range of clinical contexts, a novel example being the pregnancy condition preeclampsia, which can only occur before the `terminal' event of giving birth. Models that acknowledge semi-competing risks enable investigation of relationships between covariates and the joint timing of the outcomes, but methods for model selection and prediction of semi-competing risks in high dimensions are lacking. Instead, researchers commonly analyze only a single or composite outcome, losing valuable information and limiting clinical utility -- in the obstetric setting, this means ignoring valuable insight into the timing of delivery after the onset of preeclampsia. To address this gap, we propose a novel penalized estimation framework for frailty-based illness-death multi-state modeling of semi-competing risks. Our approach combines non-convex and structured fusion penalization, inducing global sparsity as well as parsimony across submodels. We perform estimation and model selection via a pathwise routine for non-convex optimization, and prove the first statistical error bound results in this setting. We present a simulation study investigating estimation error and model selection performance, and a comprehensive application of the method to joint risk modeling of preeclampsia and timing of delivery using pregnancy data from an electronic health record.
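One common non-convex penalty used in pathwise routines of this kind is the minimax concave penalty (MCP); as a hedged, generic sketch (the paper does not necessarily use this particular penalty, and the structured fusion component is omitted here), its proximal operator can be applied coefficient-wise inside a proximal-gradient or coordinate-wise update along a path of tuning parameters.

```python
# Generic ingredient of pathwise non-convex penalized estimation (a hedged
# sketch, not the authors' illness-death estimator): the minimax concave
# penalty (MCP) and its proximal operator with unit step size.
import numpy as np

def mcp(t, lam, gamma):
    """MCP penalty value, requires gamma > 1."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def prox_mcp(z, lam, gamma):
    """Proximal operator of MCP: soft-threshold-then-rescale, no shrinkage for large |z|."""
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    inner = soft / (1.0 - 1.0 / gamma)
    return np.where(np.abs(z) <= gamma * lam, inner, z)

z = np.array([-3.0, -0.8, 0.05, 0.9, 2.5])
print("penalty values:", mcp(z, 0.5, 3.0))
print("prox output:   ", prox_mcp(z, lam=0.5, gamma=3.0))
```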
Large longitudinal studies provide a wealth of valuable information, especially in medical applications. To utilize their full potential, one must account for the correlation between intra-subject measurements taken at different times. For data in Euclidean space this can be done with hierarchical models, that is, models that consider intra-subject and between-subject variability in two different stages. Nevertheless, data from medical studies often take values in nonlinear manifolds. Here, as a first step, geodesic hierarchical models have been developed that generalize the linear ansatz by assuming that time-induced intra-subject variations occur along a generalized straight line in the manifold. However, this is often not the case (e.g., periodic motion or processes with saturation). We propose a hierarchical model for manifold-valued data that extends this to include trends along higher-order curves, namely B\'ezier splines in the manifold. To this end, we present a principled way of comparing shape trends in terms of a functional-based Riemannian metric. Remarkably, this metric allows efficient yet simple computations by virtue of a variational time discretization requiring only the solution of regression problems. We validate our model on longitudinal data from the osteoarthritis initiative, including classification of disease \emph{progression}.
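The Bezier-spline building block can be illustrated on the unit 2-sphere, where the de Casteljau construction replaces straight-line interpolation by geodesic interpolation (slerp). The sketch below shows only that building block with hypothetical control points, not the hierarchical model or its functional-based Riemannian metric.

```python
# Hedged sketch: a manifold-valued Bezier curve on the unit 2-sphere via the
# geodesic de Casteljau algorithm (slerp replaces linear interpolation).
import numpy as np

def slerp(p, q, t):
    """Geodesic interpolation between unit vectors p and q."""
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if omega < 1e-12:
        return p.copy()
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def bezier_sphere(control_points, t):
    """Geodesic de Casteljau: repeatedly slerp between consecutive control points."""
    pts = [np.asarray(p, dtype=float) for p in control_points]
    while len(pts) > 1:
        pts = [slerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]

# cubic Bezier on the sphere with four (normalized) control points
cps = [np.array(v, dtype=float) / np.linalg.norm(v) for v in
       ([1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1])]
curve = np.array([bezier_sphere(cps, t) for t in np.linspace(0, 1, 5)])
print(np.round(curve, 3))   # each row stays (numerically) on the unit sphere
```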
The author's recent research papers, "Cumulative deviation of a subpopulation from the full population" and "A graphical method of cumulative differences between two subpopulations" (both published in volume 8 of Springer's open-access "Journal of Big Data" during 2021), propose graphical methods and summary statistics without extensively calibrating formal significance tests. The summary metrics and methods can measure the calibration of probabilistic predictions and can assess differences in responses between a subpopulation and the full population, while controlling for a covariate or score by conditioning on it. These recently published papers construct significance tests based on the scalar summary statistics, but only sketch how to calibrate the attained significance levels (also known as "P-values") for the tests. The present article reviews and synthesizes work spanning many decades in order to detail how to calibrate the P-values. It presents computationally efficient, easily implemented numerical methods for evaluating properly calibrated P-values, together with rigorous mathematical proofs guaranteeing their accuracy, and illustrates and validates the methods with open-source software and numerical examples.
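As a hedged illustration of such numerical evaluation (the article's own calibration formulas may differ in detail), the classical alternating series for the distribution of the maximum absolute value of a standard Brownian motion on $[0,1]$, a common asymptotic reference for Kolmogorov--Smirnov-type cumulative-deviation statistics, can be computed in a few lines.

```python
# Hedged illustration: P-value from the classical series for the maximum
# absolute value of a standard Brownian motion on [0, 1]. This is a standard
# reference distribution, not necessarily the article's exact calibration.
import numpy as np

def pvalue_max_abs_bm(x, terms=100):
    """P( max_{0<=t<=1} |B(t)| >= x ) via the rapidly converging alternating series."""
    if x <= 0:
        return 1.0
    k = np.arange(terms)
    cdf = (4.0 / np.pi) * np.sum((-1.0) ** k / (2 * k + 1)
                                 * np.exp(-(2 * k + 1) ** 2 * np.pi ** 2 / (8 * x ** 2)))
    return float(np.clip(1.0 - cdf, 0.0, 1.0))

for x in (1.0, 2.0, 2.5, 3.0):
    print(f"x = {x:.1f}   P-value = {pvalue_max_abs_bm(x):.5f}")
```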
Person re-identification is widely used in forensics and in security and surveillance systems, yet it remains a challenging task in real-life scenarios. Hence, in this work, a new feature descriptor is proposed using a multilayer framework of Gaussian distributions over pixel features, which include color moments, color-space values, and Schmid filter responses. An image of a person usually consists of distinct body regions, typically with distinguishable clothing, local colors, and texture patterns. Thus, the image is evaluated locally by dividing it into overlapping regions. Each region is further fragmented into a set of local Gaussians on small patches. A global Gaussian then encodes these local Gaussians for each region, creating a multi-level structure. Hence, the global appearance of a person is described by the local-level information it contains, which is often ignored. We also analyze the effectiveness of earlier metric-learning methods on this descriptor. The performance of the descriptor is evaluated on four publicly available, challenging datasets, and the highest accuracies achieved are compared with similar state-of-the-art methods, demonstrating its superior performance.
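A simplified sketch of the two-level construction is given below. Real hierarchical Gaussian descriptors embed each Gaussian into a symmetric positive-definite matrix before the next level; here, purely for illustration, each Gaussian is vectorized as its mean and flattened covariance, and the pixel features are reduced to RGB plus the vertical coordinate.

```python
# Simplified two-level (local/global) Gaussian descriptor sketch; the feature
# set, embedding, and region layout are illustrative simplifications only.
import numpy as np

def patch_gaussian(patch):
    """Local Gaussian of a patch: per-pixel features -> (mean, flattened covariance)."""
    h, w, _ = patch.shape
    ys = np.repeat(np.linspace(0, 1, h), w)[:, None]
    feats = np.hstack([patch.reshape(-1, 3), ys])          # RGB + vertical position
    return np.concatenate([feats.mean(axis=0), np.cov(feats, rowvar=False).ravel()])

def region_descriptor(region, patch=8, step=4):
    """Global Gaussian over the local patch Gaussians of one region."""
    h, w, _ = region.shape
    locals_ = [patch_gaussian(region[i:i + patch, j:j + patch])
               for i in range(0, h - patch + 1, step)
               for j in range(0, w - patch + 1, step)]
    L = np.array(locals_)
    return np.concatenate([L.mean(axis=0), np.cov(L, rowvar=False).ravel()])

def person_descriptor(img, n_regions=4):
    """Split the image into overlapping horizontal regions and stack their descriptors."""
    h = img.shape[0]
    band = 2 * h // (n_regions + 1)                         # roughly 50% overlap
    starts = np.linspace(0, h - band, n_regions).astype(int)
    return np.concatenate([region_descriptor(img[s:s + band]) for s in starts])

img = np.random.rand(128, 48, 3)                            # stand-in for a person image
print(person_descriptor(img).shape)
```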
Discrete random structures are important tools in Bayesian nonparametrics, and the resulting models have proven effective in density estimation, clustering, topic modeling, and prediction, among other tasks. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, which is inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
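The "add, then normalise" construction can be mimicked in a finite-dimensional sketch: on a fixed grid of atoms, gamma random weights play the role of the common and group-specific completely random measures, and normalising their sum yields dependent random probability measures. This is only a truncation-style approximation with hypothetical parameters, not the authors' exact latent nested process or its MCMC sampler.

```python
# Hedged finite-dimensional sketch of latent-nested-style dependence: normalise
# the sum of a shared and a group-specific gamma random measure on a fixed grid.
import numpy as np

rng = np.random.default_rng(7)
K, n_groups = 200, 2                 # number of atoms, number of groups/samples
a_shared, a_idio = 2.0, 2.0          # total masses of common and idiosyncratic CRMs

atoms = rng.standard_normal(K)                              # common atom locations
shared = rng.gamma(a_shared / K, 1.0, size=K)               # common gamma CRM (jumps)
group = rng.gamma(a_idio / K, 1.0, size=(n_groups, K))      # group-specific gamma CRMs

weights = shared + group                                     # add common + idiosyncratic parts
weights /= weights.sum(axis=1, keepdims=True)                # normalise -> dependent RPMs

# dependence across groups comes only from the shared component
corr = np.corrcoef(weights[0], weights[1])[0, 1]
print("correlation between the two groups' weights:", round(corr, 3))
samples = atoms[[rng.choice(K, p=w) for w in weights]]       # one draw from each group's RPM
print("draws:", np.round(samples, 3))
```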