Clustering multidimensional points is a fundamental data mining task, with applications in many fields, such as astronomy, neuroscience, bioinformatics, and computer vision. The goal of clustering algorithms is to group similar objects together. Density-based clustering is a clustering approach that defines clusters as dense regions of points. It has the advantage of being able to detect clusters of arbitrary shapes, rendering it useful in many applications. In this paper, we propose fast parallel algorithms for Density Peaks Clustering (DPC), a popular version of density-based clustering. Existing exact DPC algorithms suffer from low parallelism both in theory and in practice, which limits their application to large-scale data sets. Our most performant algorithm, which is based on priority search kd-trees, achieves $O(\log n\log\log n)$ span (parallel time complexity) for a data set of $n$ points. Our algorithm is also work-efficient, achieving a work complexity matching the best existing sequential exact DPC algorithm. In addition, we present another DPC algorithm based on a Fenwick tree that makes fewer assumptions for its average-case complexity to hold. We provide optimized implementations of our algorithms and evaluate their performance via extensive experiments. On a 30-core machine with two-way hyperthreading, we find that our best algorithm achieves a 10.8--13169x speedup over the previous best parallel exact DPC algorithm. Compared to the state-of-the-art parallel approximate DPC algorithm, our best algorithm achieves a 1.5--4206x speedup, while being exact.
Time series data, spanning applications ranging from climatology to finance to healthcare, presents significant challenges in data mining due to its size and complexity. One open issue lies in time series clustering, which is crucial for processing large volumes of unlabeled time series data and unlocking valuable insights. Traditional and modern analysis methods, however, often struggle with these complexities. To address these limitations, we introduce R-Clustering, a novel method that utilizes convolutional architectures with randomly selected parameters. Through extensive evaluations, R-Clustering demonstrates superior performance over existing methods in terms of clustering accuracy, computational efficiency and scalability. Empirical results obtained using the UCR archive demonstrate the effectiveness of our approach across diverse time series datasets. The findings highlight the significance of R-Clustering in various domains and applications, contributing to the advancement of time series data mining.
The problem of comparing probability distributions is at the heart of many tasks in statistics and machine learning and the most classical comparison methods assume that the distributions occur in spaces of the same dimension. Recently, a new geometric solution has been proposed to address this problem when the measures live in Euclidean spaces of differing dimensions. Here, we study the same problem of comparing probability distributions of different dimensions in the tropical geometric setting, which is becoming increasingly relevant in computations and applications involving complex, geometric data structures. Specifically, we construct a Wasserstein distance between measures on different tropical projective tori - the focal metric spaces in both theory and applications of tropical geometry - via tropical mappings between probability measures. We prove equivalence of the directionality of the maps, whether starting from the lower dimensional space and mapping to the higher dimensional space or vice versa. As an important practical implication, our work provides a framework for comparing probability distributions on the spaces of phylogenetic trees with different leaf sets.
Collaborative robots (cobots) are machines designed to work safely alongside people in human-centric environments. Providing cobots with the ability to quickly infer the inertial parameters of manipulated objects will improve their flexibility and enable greater usage in manufacturing and other areas. To ensure safety, cobots are subject to kinematic limits that result in low signal-to-noise ratios (SNR) for velocity, acceleration, and force-torque data. This renders existing inertial parameter identification algorithms prohibitively slow and inaccurate. Motivated by the desire for faster model acquisition, we investigate the use of an approximation of rigid body dynamics to improve the SNR. Additionally, we introduce a mass discretization method that can make use of shape information to quickly identify plausible inertial parameters for a manipulated object. We present extensive simulation studies and real-world experiments demonstrating that our approach complements existing inertial parameter identification methods by specifically targeting the typical cobot operating regime.
The paradigm of self-supervision focuses on representation learning from raw data without the need of labor-consuming annotations, which is the main bottleneck of current data-driven methods. Self-supervision tasks are often used to pre-train a neural network with a large amount of unlabeled data and extract generic features of the dataset. The learned model is likely to contain useful information which can be transferred to the downstream main task and improve performance compared to random parameter initialization. In this paper, we propose a new self-supervision task called source identification (SI), which is inspired by the classic blind source separation problem. Synthetic images are generated by fusing multiple source images and the network's task is to reconstruct the original images, given the fused images. A proper understanding of the image content is required to successfully solve the task. We validate our method on two medical image segmentation tasks: brain tumor segmentation and white matter hyperintensities segmentation. The results show that the proposed SI task outperforms traditional self-supervision tasks for dense predictions including inpainting, pixel shuffling, intensity shift, and super-resolution. Among variations of the SI task fusing images of different types, fusing images from different patients performs best.
The Independent Cutset problem asks whether there is a set of vertices in a given graph that is both independent and a cutset. Such a problem is $\textsf{NP}$-complete even when the input graph is planar and has maximum degree five. In this paper, we first present a $\mathcal{O}^*(1.4423^{n})$-time algorithm for the problem. We also show how to compute a minimum independent cutset (if any) in the same running time. Since the property of having an independent cutset is MSO$_1$-expressible, our main results are concerned with structural parameterizations for the problem considering parameters that are not bounded by a function of the clique-width of the input. We present $\textsf{FPT}$-time algorithms for the problem considering the following parameters: the dual of the maximum degree, the dual of the solution size, the size of a dominating set (where a dominating set is given as an additional input), the size of an odd cycle transversal, the distance to chordal graphs, and the distance to $P_5$-free graphs. We close by introducing the notion of $\alpha$-domination, which allows us to identify more fixed-parameter tractable and polynomial-time solvable cases.
Dynamic programming on various graph decompositions is one of the most fundamental techniques used in parameterized complexity. Unfortunately, even if we consider concepts as simple as path or tree decompositions, such dynamic programming uses space that is exponential in the decomposition's width, and there are good reasons to believe that this is necessary. However, it has been shown that in graphs of low treedepth it is possible to design algorithms which achieve polynomial space complexity without requiring worse time complexity than their counterparts working on tree decompositions of bounded width. Here, treedepth is a graph parameter that, intuitively speaking, takes into account both the depth and the width of a tree decomposition of the graph, rather than the width alone. Motivated by the above, we consider graphs that admit clique expressions with bounded depth and label count, or equivalently, graphs of low shrubdepth (sd). Here, sd is a bounded-depth analogue of cliquewidth, in the same way as td is a bounded-depth analogue of treewidth. We show that also in this setting, bounding the depth of the decomposition is a deciding factor for improving the space complexity. Precisely, we prove that on $n$-vertex graphs equipped with a tree-model (a decomposition notion underlying sd) of depth $d$ and using $k$ labels, we can solve - Independent Set in time $2^{O(dk)}\cdot n^{O(1)}$ using $O(dk^2\log n)$ space; - Max Cut in time $n^{O(dk)}$ using $O(dk\log n)$ space; and - Dominating Set in time $2^{O(dk)}\cdot n^{O(1)}$ using $n^{O(1)}$ space via a randomized algorithm. We also establish a lower bound, conditional on a certain assumption about the complexity of Longest Common Subsequence, which shows that at least in the case of IS the exponent of the parametric factor in the time complexity has to grow with $d$ if one wishes to keep the space complexity polynomial.
This paper proposes a novel approach to predict epidemiological parameters by integrating new real-time signals from various sources of information, such as novel social media-based population density maps and Air Quality data. We implement an ensemble of Convolutional Neural Networks (CNN) models using various data sources and fusion methodology to build robust predictions and simulate several dynamic parameters that could improve the decision-making process for policymakers. Additionally, we used data assimilation to estimate the state of our system from fused CNN predictions. The combination of meteorological signals and social media-based population density maps improved the performance and flexibility of our prediction of the COVID-19 outbreak in London. While the proposed approach outperforms standard models, such as compartmental models traditionally used in disease forecasting (SEIR), generating robust and consistent predictions allows us to increase the stability of our model while increasing its accuracy.
Tensor ring (TR) decomposition is an efficient approach to discover the hidden low-rank patterns for higher-order tensors, and streaming tensors are becoming highly prevalent in real-world applications. In this paper, we investigate how to track TR decompositions of streaming tensors. An efficient algorithm is first proposed. Then, based on this algorithm and randomized techniques, we present a randomized streaming TR decomposition. The proposed algorithms make full use of the structure of TR decomposition, and the randomized version can allow any sketching type. Theoretical results on sketch size are provided. In addition, the complexity analyses for the obtained algorithms are also given. We compare our proposals with the existing batch methods using both real and synthetic data. Numerical results show that they have better performance in computing time with maintaining similar accuracy.
To comprehend complex systems with multiple states, it is imperative to reveal the identity of these states by system outputs. Nevertheless, the mathematical models describing these systems often exhibit nonlinearity so that render the resolution of the parameter inverse problem from the observed spatiotemporal data a challenging endeavor. Starting from the observed data obtained from such systems, we propose a novel framework that facilitates the investigation of parameter identification for multi-state systems governed by spatiotemporal varying parametric partial differential equations. Our framework consists of two integral components: a constrained self-adaptive physics-informed neural network, encompassing a sub-network, as our methodology for parameter identification, and a finite mixture model approach to detect regions of probable parameter variations. Through our scheme, we can precisely ascertain the unknown varying parameters of the complex multi-state system, thereby accomplishing the inversion of the varying parameters. Furthermore, we have showcased the efficacy of our framework on two numerical cases: the 1D Burgers' equation with time-varying parameters and the 2D wave equation with a space-varying parameter.
Classic algorithms and machine learning systems like neural networks are both abundant in everyday life. While classic computer science algorithms are suitable for precise execution of exactly defined tasks such as finding the shortest path in a large graph, neural networks allow learning from data to predict the most likely answer in more complex tasks such as image classification, which cannot be reduced to an exact algorithm. To get the best of both worlds, this thesis explores combining both concepts leading to more robust, better performing, more interpretable, more computationally efficient, and more data efficient architectures. The thesis formalizes the idea of algorithmic supervision, which allows a neural network to learn from or in conjunction with an algorithm. When integrating an algorithm into a neural architecture, it is important that the algorithm is differentiable such that the architecture can be trained end-to-end and gradients can be propagated back through the algorithm in a meaningful way. To make algorithms differentiable, this thesis proposes a general method for continuously relaxing algorithms by perturbing variables and approximating the expectation value in closed form, i.e., without sampling. In addition, this thesis proposes differentiable algorithms, such as differentiable sorting networks, differentiable renderers, and differentiable logic gate networks. Finally, this thesis presents alternative training strategies for learning with algorithms.