Model-based unsupervised learning, like any learning task, stalls as soon as data are missing. This is all the more true when the missing data are informative, i.e., missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) that jointly models the data distribution and the MNAR mechanism, while remaining vigilant about the degrees of freedom of each. We propose eight different MNAR models in which the missingness depends on the class membership and/or on the values of the missing variables themselves. For one particular type of MNAR model, in which the missingness depends on the class membership, we show that statistical inference can be carried out on the data matrix concatenated with the missing mask under a MAR mechanism instead; this specifically underlines the versatility of the studied MNAR models. We then establish sufficient conditions for identifiability of the parameters of both the data distribution and the mechanism. Regardless of the type of data and the mechanism, we propose to perform clustering using EM or stochastic EM algorithms developed specifically for this purpose. Finally, we assess the numerical performance of the proposed methods on synthetic data as well as on the real medical registry TraumaBase.
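Schematically (with hypothetical notation; the paper covers several data types and eight mechanism variants), the selection-model factorization behind this approach writes the observed-data likelihood for an observation $x_i$ with binary missingness pattern $c_i$ as
\[
p\bigl(x_i^{\mathrm{obs}}, c_i; \theta, \psi\bigr) \;=\; \sum_{k=1}^{K} \pi_k \int f_k(x_i; \theta_k)\, \rho_k(c_i \mid x_i; \psi_k)\, \mathrm{d}x_i^{\mathrm{mis}},
\]
where $f_k$ is the class-conditional data distribution and $\rho_k$ the class-conditional missingness mechanism. In the special case where $\rho_k$ depends only on the class $k$ and not on $x_i$, the mechanism factors out of the integral, so the pair $(x_i^{\mathrm{obs}}, c_i)$ behaves like a draw from a standard mixture with an extra block of binary variables; this is the intuition behind running inference on the data matrix concatenated with the mask under a MAR assumption.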
Spatially correlated data with an excess of zeros, usually referred to as zero-inflated spatial data, arise in many disciplines. Examples include count data, for instance, abundance (or lack thereof) of animal species and disease counts, as well as semi-continuous data like observed precipitation. Spatial two-part models are a flexible class of models for such data. Fitting two-part models can be computationally expensive for large data due to high-dimensional dependent latent variables, costly matrix operations, and slow mixing Markov chains. We describe a flexible, computationally efficient approach for modeling large zero-inflated spatial data using the projection-based intrinsic conditional autoregression (PICAR) framework. We study our approach, which we call PICAR-Z, through extensive simulation studies and two environmental data sets. Our results suggest that PICAR-Z provides accurate predictions while remaining computationally efficient. An important goal of our work is to allow researchers who are not experts in computation to easily build computationally efficient extensions to zero-inflated spatial models; this also allows for a more thorough exploration of modeling choices in two-part models than was previously possible. We show that PICAR-Z is easy to implement and extend in popular probabilistic programming languages such as nimble and stan.
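As a rough sketch of the model class involved (notation hypothetical; the paper's exact specifications vary with the data type), a spatial two-part model couples a binary occurrence process with a positive-part process, and the PICAR step replaces each high-dimensional latent spatial field with a low-dimensional basis expansion:
\[
Z(s) \sim \mathrm{Bernoulli}\bigl(p(s)\bigr), \qquad Y(s) \mid Z(s) = 1 \sim f\bigl(\cdot\,; \mu(s)\bigr),
\]
\[
\operatorname{logit} p(s) = x(s)^{\top}\beta_1 + \Phi(s)^{\top}\delta_1, \qquad g\bigl(\mu(s)\bigr) = x(s)^{\top}\beta_2 + \Phi(s)^{\top}\delta_2,
\]
where $Z(s)$ indicates a nonzero observation, $f$ is, e.g., a zero-truncated count or a gamma distribution, and $\Phi(s)$ collects a small number of spatial basis functions (Moran's eigenvector basis in PICAR), so that only the low-dimensional coefficients $\delta_1$ and $\delta_2$ need to be sampled rather than the full latent fields.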
We study the balanced $k$-way hypergraph partitioning problem, with a special focus on its practical applications to manycore scheduling. Given a hypergraph on $n$ nodes, our goal is to partition the node set into $k$ parts of size at most $(1+\epsilon)\cdot \frac{n}{k}$ each, while minimizing the cost of the partitioning, defined as the number of cut hyperedges, possibly also weighted by the number of partitions they intersect. We show that this problem cannot be approximated to within an $n^{1/\text{poly}\log\log n}$ factor of the optimal solution in polynomial time if the Exponential Time Hypothesis holds, even for hypergraphs of maximum degree 2. We also study the hardness of the partitioning problem from a parameterized complexity perspective, and in the more general case when we have multiple balance constraints. Furthermore, we consider two extensions of the partitioning problem that are motivated from practical considerations. Firstly, we introduce the concept of hyperDAGs to model precedence-constrained computations as hypergraphs, and we analyze the adaptation of the balanced partitioning problem to this case. Secondly, we study the hierarchical partitioning problem to model hierarchical NUMA (non-uniform memory access) effects in modern computer architectures, and we show that ignoring this hierarchical aspect of the communication cost can yield significantly weaker solutions.
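To make the two cost functions concrete, here is a small Python sketch (a toy example, not code from the paper) that evaluates a $k$-way partition under both the plain cut metric and the connectivity-weighted metric, in which each cut hyperedge is charged once per additional block it touches:
\begin{verbatim}
def cut_costs(hyperedges, part):
    # hyperedges: list of lists of node ids; part: dict node -> block id.
    # Returns (cut, connectivity): the number of cut hyperedges, and the
    # sum over hyperedges of (number of blocks intersected - 1).
    cut, connectivity = 0, 0
    for e in hyperedges:
        blocks = {part[v] for v in e}
        if len(blocks) > 1:
            cut += 1
        connectivity += len(blocks) - 1
    return cut, connectivity

# 6 nodes, k = 2 blocks; with eps = 0 each block may hold at most n/k = 3 nodes
edges = [[0, 1, 2], [2, 3], [3, 4, 5]]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(cut_costs(edges, part))  # (1, 1): only the hyperedge [2, 3] is cut
\end{verbatim}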
Randomized controlled trials (RCTs) are commonly regarded as the gold standard for causal inference and play a pivotal role in modern evidence-based medicine. However, their sample sizes are often too limited to draw significant causal conclusions for subgroups that are less prevalent in the population. In contrast, observational data are becoming increasingly accessible in large volumes but can be subject to bias as a result of hidden confounding. Given these complementary features, we propose a power likelihood approach to augmenting RCTs with observational data for robust estimation of heterogeneous treatment effects. We provide a data-adaptive procedure for maximizing the Expected Log Predictive Density (ELPD) to select the influence factor that best regulates the information from the observational data. We conduct a simulation study to illustrate the efficacy of our method and its favourable features compared to existing approaches. Lastly, we apply the proposed method to data from Tennessee's Student Teacher Achievement Ratio (STAR) Study to demonstrate its usefulness and practicality in real-world data analysis.
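In schematic form (notation hypothetical; the exact construction in the paper may weight only parts of the model), a power likelihood tempers the contribution of the observational sample by an influence factor $\eta$:
\[
L_{\eta}(\theta) \;\propto\; L_{\mathrm{RCT}}(\theta) \times L_{\mathrm{obs}}(\theta)^{\eta}, \qquad \eta \in [0, 1],
\]
so that $\eta = 0$ discards the observational data entirely, $\eta = 1$ pools it at full weight, and intermediate values down-weight it; the data-adaptive procedure mentioned above chooses $\eta$ by maximizing the ELPD.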
Missing data are an unavoidable complication in many machine learning tasks. When data are `missing at random', there exists a range of tools and techniques to deal with the issue. However, as machine learning studies become more ambitious and seek to learn from ever-larger volumes of heterogeneous data, an increasingly encountered problem arises in which missing values exhibit an association or structure, either explicitly or implicitly. Such `structured missingness' raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here, we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.
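As a toy illustration of the distinction (just one of many possible structures, and not an example from the paper), compare an entrywise-random mask with a mask in which a whole block of variables, say an entire modality, goes missing together for some records:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 6
X = rng.normal(size=(n, p))

# unstructured missingness: each entry is missing independently
random_mask = rng.random((n, p)) < 0.3

# structured missingness: columns 3-5 (e.g. one modality) are missing together
# for a subset of the records
structured_mask = np.zeros((n, p), dtype=bool)
structured_mask[rng.random(n) < 0.4, 3:] = True

X_random = np.where(random_mask, np.nan, X)
X_structured = np.where(structured_mask, np.nan, X)
\end{verbatim}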
Repairing inconsistent knowledge bases is a task that has been addressed, with great advances over several decades, both within the knowledge representation and reasoning community and within the database theory community. As information becomes more complex and interconnected, new types of repositories, representation languages and semantics are developed in order to query and reason about it. Graph databases provide an effective way to represent relationships among data, and allow processing and querying these connections efficiently. In this work, we focus on the problem of computing preferred (subset and superset) repairs for graph databases with data values, using a notion of consistency based on a set of Reg-GXPath expressions as integrity constraints. Specifically, we study the problem of computing preferred repairs based on two different preference criteria, one based on weights and the other based on multisets, showing that in most cases it is possible to retain the same computational complexity as in the case where no preference criterion is available for exploitation.
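As a toy illustration of weight-based preferred subset repairs (with a simple functional constraint standing in for a Reg-GXPath expression, and brute force standing in for the algorithms studied in the paper): a subset repair is a maximal consistent subgraph, and a weight-based preference keeps the repair of largest total weight.
\begin{verbatim}
from itertools import combinations

# toy property graph: (source, label, target, weight)
edges = [("fr", "capital", "paris", 3),
         ("fr", "capital", "lyon", 1),
         ("de", "capital", "berlin", 2)]

def consistent(db):
    # toy constraint standing in for a Reg-GXPath expression:
    # every node has at most one outgoing "capital" edge
    seen = set()
    for s, lbl, _, _ in db:
        if lbl == "capital":
            if s in seen:
                return False
            seen.add(s)
    return True

def subset_repairs(db):
    # subset repairs = maximal (w.r.t. set inclusion) consistent subsets
    repairs = []
    for r in range(len(db), -1, -1):
        for cand in combinations(db, r):
            if consistent(cand) and not any(set(cand) < set(rep) for rep in repairs):
                repairs.append(cand)
    return repairs

# weight-based preference: keep the repair with the largest total weight
best = max(subset_repairs(edges), key=lambda rep: sum(w for *_, w in rep))
print(best)  # keeps ("fr", "capital", "paris", 3) and ("de", "capital", "berlin", 2)
\end{verbatim}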
Comparison-based learning addresses the problem of learning when, instead of explicit features or pairwise similarities, one only has access to comparisons of the form: \emph{Object $A$ is more similar to $B$ than to $C$.} Recently, it has been shown that, in hierarchical clustering, single and complete linkage can be directly implemented using only such comparisons, while several algorithms have been proposed to emulate the behaviour of average linkage. Hence, finding hierarchies (or dendrograms) using only comparisons is a well-understood problem. However, evaluating their meaningfulness when neither ground truth nor explicit similarities are available remains an open question. In this paper, we bridge this gap by proposing a new revenue function that allows one to measure the goodness of dendrograms using only comparisons. We show that this function is closely related to Dasgupta's cost for hierarchical clustering that uses pairwise similarities. On the theoretical side, we use the proposed revenue function to resolve the open problem of whether one can approximately recover a latent hierarchy using few triplet comparisons. On the practical side, we present principled algorithms for comparison-based hierarchical clustering based on the maximisation of the revenue, and we empirically compare them with existing methods.
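For reference, the similarity-based Dasgupta cost that the proposed revenue function relates to can be computed as in the following small Python sketch (the revenue function itself uses only triplet comparisons and is not reproduced here):
\begin{verbatim}
from itertools import combinations

def leaves(t):
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

def dasgupta_cost(t, sim):
    # Dasgupta's cost of a binary dendrogram t (nested 2-tuples over leaf labels)
    # given pairwise similarities sim[(i, j)] for i < j: every pair of leaves is
    # charged its similarity times the number of leaves below their lowest
    # common ancestor.
    if not isinstance(t, tuple):
        return 0.0
    left, right = leaves(t[0]), leaves(t[1])
    n = len(left) + len(right)
    cross = sum(sim[min(i, j), max(i, j)] for i in left for j in right)
    return n * cross + dasgupta_cost(t[0], sim) + dasgupta_cost(t[1], sim)

# 4 objects; similar pairs (0, 1) and (2, 3) are merged first
sim = {(i, j): 1.0 for i, j in combinations(range(4), 2)}
sim[0, 1] = sim[2, 3] = 5.0
print(dasgupta_cost(((0, 1), (2, 3)), sim))  # 4 * 4 + 2 * 5 + 2 * 5 = 36.0
\end{verbatim}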
Modern robotics often involves multiple embodied agents operating within a shared environment. Path planning in these cases is considerably more challenging than in single-agent scenarios. Although standard sampling-based algorithms (SBAs) can be used to search for solutions in the robots' joint space, this approach quickly becomes computationally intractable as the number of agents increases. To address this issue, we integrate the concept of factorization into sampling-based algorithms, which requires only minimal modifications to existing methods. During the search for a solution, we can decouple (i.e., factorize) different subsets of agents into independent lower-dimensional search spaces once a factorization heuristic certifies that their future solutions will be independent of each other. Consequently, we progressively construct a lean hypergraph where certain (hyper-)edges split the agents into independent subgraphs. In the best case, this approach can reduce the growth in dimensionality of the search space from exponential to linear in the number of agents. On average, fewer samples are needed to find high-quality solutions while preserving the optimality, completeness, and anytime properties of SBAs. We present a general implementation of a factorized SBA, derive an analytical gain in terms of sample complexity for PRM*, and showcase empirical results for RRG.
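As a toy sketch of the decoupling idea only (the box-overlap test below is a hypothetical stand-in, not the paper's factorization heuristic), agents whose certified interaction regions are pairwise disjoint can be grouped into independent lower-dimensional subproblems:
\begin{verbatim}
def boxes_overlap(a, b):
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def factorize(regions):
    # union-find over agents: agents whose regions overlap are planned jointly,
    # disjoint groups become independent planning subproblems
    parent = list(range(len(regions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if boxes_overlap(regions[i], regions[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(regions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# 3 agents: agents 0 and 1 may interact, agent 2 is independent
regions = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
print(factorize(regions))  # [[0, 1], [2]]
\end{verbatim}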
We develop flexible and nonparametric estimators of the average treatment effect (ATE) transported to a new population that offer potential efficiency gains by incorporating into the transport step only a sufficient subset of effect modifiers that are differentially distributed between the source and target populations. We develop both a one-step estimator when this sufficient subset of effect modifiers is known and a collaborative one-step estimator when it is unknown. We discuss when we would expect our estimators to be more efficient than those that assume all covariates may be relevant effect modifiers, as well as the exceptions where we would expect worse efficiency. We use simulation to compare finite-sample performance across our proposed estimators and existing estimators of the transported ATE, including in the presence of practical violations of the positivity assumption. Lastly, we apply our proposed estimators to a large-scale housing trial.
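For context (standard notation, not the paper's; the proposed estimators further restrict the adjustment set in the transport step), the transported ATE is typically identified as
\[
\psi \;=\; \mathbb{E}\Bigl[\, \mathbb{E}[Y \mid A = 1, W, S = 1] \;-\; \mathbb{E}[Y \mid A = 0, W, S = 1] \,\Bigm|\, S = 0 \Bigr],
\]
where $S = 1$ indexes the source population and $S = 0$ the target population; the point made above is that the outer expectation only needs to average over a sufficient subset of effect modifiers whose distributions differ between $S = 1$ and $S = 0$, rather than over all of $W$.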
Data in Knowledge Graphs often represents part of the current state of the real world. Thus, to stay up-to-date, the graph data needs to be updated frequently. To utilize information from Knowledge Graphs, many state-of-the-art machine learning approaches use embedding techniques. These techniques typically compute an embedding, i.e., vector representations of the nodes, as input for the main machine learning algorithm. If a graph update occurs later on -- specifically when nodes are added or removed -- the training has to be done all over again. This is undesirable because of the time it takes, and also because downstream models that were trained with these embeddings have to be retrained if the embeddings change significantly. In this paper, we investigate embedding updates that do not require full retraining and evaluate them in combination with various embedding models on real dynamic Knowledge Graphs covering multiple use cases. We study approaches that place newly appearing nodes optimally according to local information, but observe that this does not work well. However, we find that if we continue the training of the old embedding, interleaved with epochs during which we only optimize for the added and removed parts, we obtain good results in terms of typical metrics used in link prediction. This performance is obtained much faster than with a complete retraining and hence makes it possible to maintain embeddings for dynamic Knowledge Graphs.
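Sketched below in PyTorch (a toy setup with random triples and a bare TransE-style objective, not the paper's models or training code), the interleaved schedule warm-starts from the old embeddings and alternates epochs that only move the newly added entities with epochs that continue training everything:
\begin{verbatim}
import torch

# toy sizes; old_entity_matrix stands in for the embeddings trained before the update
n_old, n_new, n_rel, dim = 100, 10, 5, 32
n_ent = n_old + n_new
old_entity_matrix = torch.randn(n_old, dim)

ent = torch.nn.Embedding(n_ent, dim)
rel = torch.nn.Embedding(n_rel, dim)
with torch.no_grad():
    ent.weight[:n_old] = old_entity_matrix      # warm-start with the old embeddings

def toy_loss(h, r, t):
    # TransE-style distance, for illustration only (a real setup adds negatives)
    return (ent(h) + rel(r) - ent(t)).norm(dim=1).mean()

# random toy triples; those touching an added entity form the "new" subset
h = torch.randint(0, n_ent, (500,))
r = torch.randint(0, n_rel, (500,))
t = torch.randint(0, n_ent, (500,))
new = (h >= n_old) | (t >= n_old)

opt = torch.optim.SGD(list(ent.parameters()) + list(rel.parameters()), lr=0.01)
for epoch in range(20):
    only_new = epoch % 2 == 0                   # interleave the two kinds of epochs
    idx = new if only_new else torch.ones_like(new)
    loss = toy_loss(h[idx], r[idx], t[idx])
    opt.zero_grad()
    loss.backward()
    if only_new:                                # this epoch, move only the added rows
        ent.weight.grad[:n_old] = 0
        rel.weight.grad.zero_()
    opt.step()
\end{verbatim}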
In this paper, we propose a one-stage online clustering method called Contrastive Clustering (CC), which explicitly performs instance- and cluster-level contrastive learning. To be specific, for a given dataset, positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, instance- and cluster-level contrastive learning are conducted in the row and column space, respectively, by maximizing the similarities of positive pairs while minimizing those of negative ones. Our key observation is that the rows of the feature matrix can be regarded as soft labels of instances, and accordingly the columns can be further regarded as cluster representations. By simultaneously optimizing the instance- and cluster-level contrastive losses, the model jointly learns representations and cluster assignments in an end-to-end manner. Extensive experimental results show that CC remarkably outperforms 17 competitive clustering methods on six challenging image benchmarks. In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an improvement of up to 19\% (39\%) over the best baseline.
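The row/column idea can be sketched as follows in PyTorch (a minimal illustration with random tensors; the actual CC architecture, augmentations and exact loss weighting are not reproduced here): the same NT-Xent contrastive loss is applied to rows of the instance features and to columns of the soft cluster-assignment matrix, with an entropy term on the marginal cluster distribution to discourage collapse.
\begin{verbatim}
import torch
import torch.nn.functional as F

def nt_xent(a, b, tau=0.5):
    # NT-Xent over two aligned sets of vectors (n x d each): row i of a and row i
    # of b form a positive pair; the other 2n - 2 rows in the batch are negatives.
    z = F.normalize(torch.cat([a, b]), dim=1)
    n = a.shape[0]
    logits = (z @ z.T) / tau
    logits = logits.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(logits, targets)

# z1, z2: instance features of two augmented views; p1, p2: soft cluster assignments
n, d, k = 256, 128, 10
z1, z2 = torch.randn(n, d), torch.randn(n, d)
p1, p2 = torch.softmax(torch.randn(n, k), dim=1), torch.softmax(torch.randn(n, k), dim=1)

instance_loss = nt_xent(z1, z2)           # rows act as instance representations
cluster_loss = nt_xent(p1.T, p2.T)        # columns act as cluster representations
entropy = -sum((p.mean(0) * p.mean(0).log()).sum() for p in (p1, p2))
loss = instance_loss + cluster_loss - entropy   # maximize entropy to avoid collapse
\end{verbatim}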