Density peaks clustering (DPC) has become a rising star among clustering algorithms because of its simplicity and practicality. However, it has one main drawback: it is time-consuming due to its high computational complexity. Herein, a density peaks clustering algorithm with sparse search and a K-d tree is developed to address this problem. Firstly, a sparse distance matrix is calculated using a K-d tree to replace the original full-rank distance matrix, so as to accelerate the calculation of local density. Secondly, a sparse search strategy is proposed to accelerate the computation of the relative separation of each data point, using the intersection between its set of $k$ nearest neighbors and the set of data points with larger local density. Furthermore, a second-order difference method on the decision values is adopted to determine the cluster centers adaptively. Finally, experiments are carried out on datasets with different distribution characteristics, comparing against six other state-of-the-art clustering algorithms. It is shown that the algorithm effectively reduces the computational complexity of the original DPC from $O(n^2K)$ to $O(n(n^{1-1/K}+k))$, with efficiency gains that are especially pronounced on larger datasets. Moreover, the clustering accuracy is also improved to a certain extent. Overall, the newly proposed algorithm exhibits excellent performance.
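As a concrete illustration of the first step, the sketch below uses a K-d tree to restrict DPC's density estimation to each point's $k$ nearest neighbors instead of the full $n \times n$ distance matrix. This is a minimal sketch, not the authors' implementation: the Gaussian kernel and the mean-distance bandwidth heuristic are assumptions made here for illustration.

```python
# Sketch: accelerating DPC's local-density step with a K-d tree.
import numpy as np
from scipy.spatial import cKDTree

def knn_local_density(X, k=16):
    """Estimate each point's local density from its k nearest neighbours only."""
    tree = cKDTree(X)                    # O(n log n) build
    dists, idx = tree.query(X, k=k + 1)  # column 0 is each point itself
    dists, idx = dists[:, 1:], idx[:, 1:]  # drop the zero self-distance
    sigma = dists.mean() + 1e-12         # bandwidth heuristic (assumption)
    rho = np.exp(-(dists / sigma) ** 2).sum(axis=1)  # Gaussian-kernel density
    return rho, dists, idx

X = np.random.rand(1000, 2)
rho, knn_dists, knn_idx = knn_local_density(X)
```

The returned neighbor lists are exactly the sparse structure the second step would intersect with the higher-density set when computing relative separations.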
Vehicular communication is an essential part of a smart city. Scalability is a major issue for vehicular communication, especially when the number of vehicles increases at any given point. Vehicles also suffer from other problems, such as the broadcast problem. Clustering can address these issues in vehicular ad hoc networks (VANETs); however, due to the high mobility of vehicles, clustering in VANET suffers from stability issues. Previously proposed clustering algorithms for VANET are optimized either for straight roads or for intersections. Moreover, the absence of intelligent use of a combination of mobility parameters, such as direction, movement, position, velocity, vehicle degree, and movement at intersections, results in cluster stability issues. A dynamic clustering algorithm making efficient use of all the mobility parameters can solve the stability problem in VANET. To achieve higher stability for VANET, a novel robust and dynamic clustering algorithm, stable dynamic predictive clustering (SDPC), is proposed in this paper. In contrast to previous studies, vehicle relative velocity, vehicle position, vehicle distance, transmission range, and vehicle density are considered in the creation of a cluster, whereas relative distance, movement at intersections, and vehicle degree are considered to select the cluster head. From the mobility parameters, the future road scenario is constructed; the cluster is created and the cluster head is selected based on this predicted scenario. The performance of SDPC is compared in terms of the average cluster head change rate, the average cluster head duration, the average cluster member duration, and the ratio of clustering overhead to total packet transmission. The simulation results show that SDPC outperforms existing algorithms and achieves better clustering stability.
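To make the head-selection idea concrete, here is a purely hypothetical scoring function in the spirit of SDPC. The abstract names the mobility parameters but gives no formula, so the weights, normalizations, and the function itself are illustrative assumptions, not the paper's method.

```python
# Hypothetical mobility-aware cluster-head score (all inputs normalised to [0, 1]).
def cluster_head_score(rel_velocity, rel_distance, degree, stays_on_route,
                       w_v=0.4, w_d=0.3, w_deg=0.2, w_r=0.1):
    """Higher score -> better cluster-head candidate. Weights are assumptions."""
    return (w_v * (1.0 - rel_velocity)    # moves similarly to its neighbours
            + w_d * (1.0 - rel_distance)  # central within the cluster
            + w_deg * degree              # many vehicles within range
            + w_r * stays_on_route)       # predicted to continue past intersection

# e.g. a central, well-connected vehicle predicted to continue straight:
print(cluster_head_score(0.1, 0.2, 0.8, 1.0))  # -> 0.86
```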
Search trees on trees (STTs) generalize the fundamental binary search tree (BST) data structure: in STTs the underlying search space is an arbitrary tree, whereas in BSTs it is a path. An optimal BST of size $n$ can be computed for a given distribution of queries in $O(n^2)$ time [Knuth 1971] and centroid BSTs provide a nearly-optimal alternative, computable in $O(n)$ time [Mehlhorn 1977]. By contrast, optimal STTs are not known to be computable in polynomial time, and the fastest constant-approximation algorithm runs in $O(n^3)$ time [Berendsohn, Kozma 2022]. Centroid trees can be defined for STTs analogously to BSTs, and they have been used in a wide range of algorithmic applications. In the unweighted case (i.e., for a uniform distribution of queries), a centroid tree can be computed in $O(n)$ time [Brodal et al. 2001; Della Giustina et al. 2019]. These algorithms, however, do not readily extend to the weighted case. Moreover, no approximation guarantees were previously known for centroid trees in either the unweighted or weighted cases. In this paper we revisit centroid trees in a general, weighted setting, and we settle both the algorithmic complexity of constructing them and the quality of their approximation. For constructing a weighted centroid tree, we give an output-sensitive $O(n\log h)\subseteq O(n\log n)$ time algorithm, where $h$ is the height of the resulting centroid tree. If the weights are of polynomial complexity, the running time is $O(n\log\log n)$. We show these bounds to be optimal, in a general decision tree model of computation. For approximation, we prove that the cost of a centroid tree is at most twice the optimum, and this guarantee is best possible, both in the weighted and unweighted cases. We also give tight, fine-grained bounds on the approximation ratio for bounded-degree trees and for the more general $\alpha$-centroid trees.
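For intuition, a weighted centroid is a node whose removal leaves no component carrying more than half the total weight; recursing on the components yields a centroid tree. The sketch below finds one weighted centroid naively in linear time per call, which gives only an $O(nh)$-style baseline construction, not the paper's output-sensitive $O(n\log h)$ algorithm.

```python
def weighted_centroid(adj, weight):
    """Return a node whose removal leaves no component of weight > total/2.

    adj: dict node -> list of neighbours (a tree); weight: dict node -> w >= 0.
    """
    nodes = list(adj)
    total = sum(weight.values())
    root = nodes[0]
    parent, order, stack = {root: None}, [], [root]
    while stack:                          # iterative DFS, preorder
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)
    sub = dict(weight)                    # subtree weights, leaves upward
    for u in reversed(order):
        if parent[u] is not None:
            sub[parent[u]] += sub[u]
    for c in nodes:
        comps = [sub[v] for v in adj[c] if parent[v] == c]  # child components
        comps.append(total - sub[c])      # component through c's parent
        if max(comps) <= total / 2:       # a weighted centroid always exists
            return c

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
w = {1: 5.0, 2: 1.0, 3: 1.0, 4: 1.0}
print(weighted_centroid(adj, w))  # -> 1 (the heavy endpoint)
```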
Positive and unlabelled (PU) learning is an important problem which arises naturally in many applications. A significant limitation of almost all existing methods lies in assuming that the propensity score function is constant (the SCAR assumption), which is unrealistic in many practical situations. Avoiding this assumption, we consider a parametric approach to the problem of joint estimation of the posterior probability and the propensity score functions. We show that, under mild assumptions, when both functions have the same parametric form (e.g. logistic with different parameters), the corresponding parameters are identifiable. Motivated by this, we propose two approaches to their estimation: a joint maximum likelihood method, and a second approach based on alternating maximization of two Fisher-consistent expressions. Our experimental results show that the proposed methods are comparable to, or better than, the existing methods based on the Expectation-Maximisation scheme.
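A minimal sketch of the joint maximum likelihood idea, under the stated parametric setting: assume the posterior $q(x)=\sigma(b^\top x)$ and the propensity score $e(x)=\sigma(a^\top x)$ are both logistic, so the label indicator satisfies $P(s=1\mid x)=e(x)\,q(x)$. The optimizer choice and starting point below are assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def joint_nll(params, X, s, eps=1e-9):
    """Negative log-likelihood of observed labels s in {0,1} under e(x)*q(x)."""
    d = X.shape[1]
    a, b = params[:d], params[d:]
    p_obs = expit(X @ a) * expit(X @ b)     # P(s = 1 | x) = e(x) q(x)
    p_obs = np.clip(p_obs, eps, 1 - eps)
    return -np.sum(s * np.log(p_obs) + (1 - s) * np.log(1 - p_obs))

def fit_joint_mle(X, s):
    d = X.shape[1]
    res = minimize(joint_nll, np.zeros(2 * d), args=(X, s), method="L-BFGS-B")
    return res.x[:d], res.x[d:]             # propensity and posterior parameters
```

Note that with identical parametric forms this likelihood is symmetric in $(a, b)$; the paper's identifiability assumptions are what rule out such swaps, and a practical implementation needs a convention for which factor plays the role of the propensity.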
Computing the edit distance of two strings is one of the most basic problems in computer science and combinatorial optimization. Tree edit distance is a natural generalization of edit distance in which the task is to compute a measure of dissimilarity between two (unweighted) rooted trees with node labels. Perhaps the most notable recent application of tree edit distance is in NoSQL big databases, such as MongoDB, where each row of the database is a JSON document represented as a labeled rooted tree, and finding dissimilarity between two rows is a basic operation. Until recently, the fastest algorithm for tree edit distance ran in cubic time (Demaine, Mozes, Rossman, Weimann; TALG'10); however, Mao (FOCS'21) broke the cubic barrier for the tree edit distance problem using fast matrix multiplication. Given a parameter $k$ as an upper bound on the distance, an $O(n+k^2)$-time algorithm for edit distance has been known since the 1980s due to the works of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88). The existence of an $\tilde{O}(n+\mathrm{poly}(k))$-time algorithm for tree edit distance has been posed as an open question, e.g., by Akmal and Jin (ICALP'21), who gave a state-of-the-art $\tilde{O}(nk^2)$-time algorithm. In this paper, we answer this question positively.
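For strings, the $O(n+k^2)$-style idea is simple to sketch: for each diagonal of the DP table, track the furthest row reachable with $e$ edits, sliding for free along runs of matching characters. The version below is the plain banded variant; it omits the suffix-tree jump oracle of Landau and Vishkin, so its sliding step is not worst-case optimal, and it is offered only as an illustration.

```python
def bounded_edit_distance(a, b, k):
    """Return edit(a, b) if it is at most k, else None."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    off = k + 1                              # index shift for diagonals d = j - i
    NEG = -(10 ** 9)
    cur = [NEG] * (2 * k + 3)                # furthest row per diagonal

    def slide(i, d):                         # extend for free along matches
        while i < n and i + d < m and a[i] == b[i + d]:
            i += 1
        return i

    cur[off] = slide(0, 0)                   # e = 0: main diagonal only
    if n == m and cur[off] == n:
        return 0
    for e in range(1, k + 1):
        nxt = [NEG] * (2 * k + 3)
        for d in range(-e, e + 1):
            i = max(cur[off + d] + 1,        # substitution (advance i and j)
                    cur[off + d - 1],        # insertion into a (advance j only)
                    cur[off + d + 1] + 1)    # deletion from a (advance i only)
            i = min(i, n, m - d)             # stay inside both strings
            if i < 0:
                continue
            nxt[off + d] = slide(i, d)
            if d == m - n and nxt[off + d] == n:
                return e
        cur = nxt
    return None

print(bounded_edit_distance("kitten", "sitting", 3))  # -> 3
```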
Mobile apps are becoming an integral part of people's daily lives by providing various functionalities, such as messaging and gaming. App developers try their best to ensure a good user experience during app development and maintenance, in order to improve the rating of their apps on app platforms and attract more downloads. Previous studies indicated that responding to users' reviews tends to change their attitude towards an app positively: users who have received replies are more likely to update their ratings. However, reading and responding to every user review is not an easy task for developers, since it is common for popular apps to receive a large number of reviews every day. Thus, automation tools for review replying are needed. To address this need, this paper introduces a Transformer-based approach, named TRRGen, to automatically generate responses to given user reviews. TRRGen extracts the app category, rating, and review text as input features. By adapting a Transformer-based model, TRRGen can generate appropriate replies for new reviews. Comprehensive experiments and analysis on real-world datasets indicate that the proposed approach can generate high-quality replies for users' reviews and significantly outperforms current state-of-the-art approaches on this task. Manual validation of the generated replies further demonstrates the effectiveness of the proposed approach.
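As a rough illustration of the setup, the sketch below serializes the three input features into one source sequence and feeds it to a bare-bones PyTorch encoder-decoder. TRRGen's actual architecture, tokenization, and special tokens are not specified in the abstract, so everything here (the `[CAT]`/`[RATE]`/`[REV]` markers, model sizes, and the omission of positional encodings) is an illustrative assumption.

```python
import torch
import torch.nn as nn

def serialize_review(category: str, rating: int, review: str) -> str:
    # Fold the structured features into a single source sequence (assumed format).
    return f"[CAT] {category} [RATE] {rating} [REV] {review}"

class ReplyGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead, num_encoder_layers=layers,
            num_decoder_layers=layers, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so the decoder only attends to earlier reply tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                             tgt_mask=tgt_mask)   # positional encodings omitted
        return self.out(h)                        # token logits for the reply

model = ReplyGenerator(vocab_size=8000)
logits = model(torch.randint(0, 8000, (2, 64)), torch.randint(0, 8000, (2, 32)))
```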
In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to information at training time that is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time-series data is never worse, and often better, in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending the theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
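One way to picture the random-features variant: treat the privileged intermediate states $z_1,\dots,z_{T-1}$ (available only during training) as stepping stones, fit each step with ridge regression on a random Fourier feature lift, and compose the learned maps at test time when the intermediates are absent. This is a loose sketch under those assumptions, not the paper's algorithm; all names and hyperparameters are illustrative.

```python
import numpy as np

def rff(X, W, b):
    # Random Fourier features approximating an RBF kernel lift.
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def fit_ridge(Phi, Y, lam=1e-2):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

def fit_lupi_chain(X, Z_list, y, D=300, rng=np.random.default_rng(0)):
    """Z_list: privileged intermediate states [z_1, ..., z_{T-1}] (training only)."""
    steps = []
    inputs, targets = [X] + Z_list, Z_list + [y[:, None]]
    for A, B in zip(inputs, targets):       # one ridge fit per time step
        W = rng.normal(size=(A.shape[1], D))
        b = rng.uniform(0, 2 * np.pi, size=D)
        steps.append((W, b, fit_ridge(rff(A, W, b), B)))
    return steps

def predict(steps, X):
    H = X
    for W, b, beta in steps:                # compose the learned step maps
        H = rff(H, W, b) @ beta
    return H.ravel()
```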
We describe a method to obtain spherical parameterizations of arbitrary data through the use of persistent cohomology and variational optimization. We begin by computing the second-degree persistent cohomology of the filtered Vietoris-Rips (VR) complex of a data set $X$. We extract a cocycle $\alpha$ from any significant feature and define an associated map $\alpha: VR(X) \to S^2$. We use this map as an infeasible initialization for a variational optimization problem with a unique minimizer, up to rigid motion. We employ an alternating gradient descent/M\"{o}bius transformation update method to solve the problem and generate a more suitable, i.e., smoother, representative of the homotopy class of $\alpha$. We show that this process preserves the relevant topological feature of the data and converges to a feasible optimum. Finally, we conduct numerical experiments on both synthetic and real-world data sets to show the efficacy of our proposed approach.
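The first stage is straightforward to reproduce with off-the-shelf tools. The sketch below computes degree-2 persistent cohomology of a Vietoris-Rips filtration with ripser.py (assumed available) and pulls out a representative cocycle of the most persistent feature; the variational smoothing stage is not shown, and the sample data is an assumption for illustration.

```python
import numpy as np
from ripser import ripser

# Sample points uniformly from the unit sphere, so H^2 has one strong feature.
X = np.random.default_rng(0).normal(size=(400, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

res = ripser(X, maxdim=2, do_cocycles=True)
dgm2 = res["dgms"][2]                          # (birth, death) pairs in degree 2
most_persistent = np.argmax(dgm2[:, 1] - dgm2[:, 0])
alpha = res["cocycles"][2][most_persistent]    # representative 2-cocycle
birth, death = dgm2[most_persistent]
print(f"feature persists on [{birth:.3f}, {death:.3f}); "
      f"cocycle supported on {len(alpha)} simplices")
```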
Learning on big data has brought success to artificial intelligence (AI), but annotation and training costs are expensive. Looking ahead, learning on small data is one of the ultimate goals of AI, requiring machines to recognize objects and scenarios from small data, as humans do. A series of machine learning models pursue this direction, such as active learning, few-shot learning, and deep clustering. However, there are few theoretical guarantees for their generalization performance. Moreover, most of their settings are passive, that is, the label distribution is explicitly controlled by one specified sampling scenario. This survey follows agnostic active sampling under a PAC (Probably Approximately Correct) framework to analyze the generalization error and label complexity of learning on small data in both supervised and unsupervised fashion. With these theoretical analyses, we categorize small data learning models from two geometric perspectives: the Euclidean and non-Euclidean (hyperbolic) mean representations, whose optimization solutions are also presented and discussed. Some potential learning scenarios that may benefit from learning on small data are then summarized and analyzed. Finally, challenging applications, such as computer vision and natural language processing, that may benefit from learning on small data are also surveyed.
With the explosive growth of information technology, multi-view graph data have become increasingly prevalent and valuable. Most existing multi-view clustering techniques focus either on the scenario of multiple graphs or on multi-view attributes. In this paper, we propose a generic framework to cluster multi-view attributed graph data. Specifically, inspired by the success of contrastive learning, we propose a multi-view contrastive graph clustering (MCGC) method to learn a consensus graph, since the original graph could be noisy or incomplete and is not directly applicable. Our method consists of two key steps: we first filter out undesirable high-frequency noise while preserving the graph's geometric features via graph filtering, obtaining a smooth representation of the nodes; we then learn a consensus graph regularized by a graph contrastive loss. Results on several benchmark datasets show the superiority of our method over state-of-the-art approaches. In particular, our simple approach outperforms existing deep learning-based methods.
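A minimal sketch of the graph-filtering step described above: applying a low-pass filter $(I - L/2)^t$ to the node attributes suppresses high-frequency noise and yields a smooth representation. The filter order $t$, the self-loops, and the symmetric normalization are common choices assumed here, not necessarily the paper's exact settings.

```python
import numpy as np

def smooth_representation(A, X, t=2):
    """A: adjacency matrix (n x n), X: node attributes (n x d)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                             # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ A_hat @ D_inv_sqrt   # normalised Laplacian
    H = X.copy()
    for _ in range(t):
        H = (np.eye(n) - 0.5 * L) @ H                 # low-pass graph filtering
    return H                                          # smooth node representation
```

The smoothed representation H would then feed the consensus-graph learning step with the contrastive regularizer.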
We study how to generate captions that are not only accurate in describing an image but also discriminative across different images. The problem is both fundamental and interesting, as most machine-generated captions, despite phenomenal research progress in the past several years, are expressed in a very monotonic and featureless format. While such captions are normally accurate, they often lack important characteristics of human language: distinctiveness for each caption and diversity across different images. To address this problem, we propose a novel conditional generative adversarial network for generating diverse captions across images. Instead of estimating the quality of a caption solely on the basis of one image, the proposed comparative adversarial learning framework better assesses the quality of captions by comparing a set of captions within the image-caption joint space. By contrasting with human-written captions and image-mismatched captions, the caption generator effectively exploits the inherent characteristics of human language and generates more discriminative captions. We show that our proposed network is capable of producing accurate and diverse captions across images.
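To illustrate the comparative idea, the sketch below scores a set of candidate captions against one image in a joint embedding space and normalizes the scores with a softmax, so each caption is judged relative to the others rather than in isolation. The cosine-similarity scorer, temperature, and candidate ordering are assumptions for illustration, not the paper's exact losses or encoders.

```python
import torch
import torch.nn.functional as F

def comparative_scores(img_emb, caption_embs, temperature=0.1):
    """img_emb: (d,); caption_embs: (m, d) candidates for this image.
    Returns a distribution over candidates: more mass = judged a better match."""
    sims = F.cosine_similarity(caption_embs, img_emb.unsqueeze(0), dim=1)
    return F.softmax(sims / temperature, dim=0)

# Convention assumed here: index 0 is the human-written caption,
# index 1 the generated one, the rest image-mismatched captions.
def d_loss(probs):
    return -torch.log(probs[0] + 1e-9)   # discriminator favours the human caption

def g_loss(probs):
    return -torch.log(probs[1] + 1e-9)   # generator pushes mass to its own caption
```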