Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.
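For readers unfamiliar with the ARI used above as the performance measure, the following is a minimal sketch of its standard pair-counting formula in plain Python. It is not the DCSI itself and not code from the paper; the function name `adjusted_rand_index` and the convention of returning 1.0 in degenerate cases are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on
    # Contingency counts n_ij and their row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (maximum - expected)
```

The index is invariant to renaming cluster labels, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0.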
We present a rigorous and precise analysis of the maximum degree and the average degree in a dynamic duplication-divergence graph model introduced by Sol\'e, Pastor-Satorras et al. in which the graph grows according to a duplication-divergence mechanism, i.e. by iteratively creating a copy of some node and then randomly altering the neighborhood of the new node with probability $p$. This model captures the growth of some real-world processes, e.g. biological or social networks. In this paper, we prove that for some $0 < p < 1$ the maximum degree and the average degree of a duplication-divergence graph on $t$ vertices are asymptotically concentrated with high probability around $t^p$ and $\max\{t^{2 p - 1}, 1\}$, respectively, i.e. they are within at most a polylogarithmic factor from these values with probability at least $1 - t^{-A}$ for any constant $A > 0$.
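The growth mechanism described above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the exact model analysed in the paper: the node to duplicate is assumed to be chosen uniformly at random, each copied edge is assumed to be kept independently with probability $p$, the seed graph is assumed to be a single edge, and any further random edges the full model may add are omitted. The function names are hypothetical.

```python
import random


def duplication_divergence(t, p, seed=None):
    """Grow a graph to t nodes; returns adjacency sets keyed by node id."""
    if t < 2:
        raise ValueError("t must be at least 2")
    rng = random.Random(seed)
    # Assumed seed graph: a single edge between nodes 0 and 1.
    adj = {0: {1}, 1: {0}}
    for new in range(2, t):
        parent = rng.randrange(new)  # assumed: uniform choice of node to copy
        # Divergence step: keep each copied edge independently with prob. p.
        kept = {w for w in sorted(adj[parent]) if rng.random() < p}
        adj[new] = kept
        for w in kept:
            adj[w].add(new)
    return adj


def degree_summary(adj):
    """Return (maximum degree, average degree) of the graph."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    return max(degrees), sum(degrees) / len(degrees)
```

Calling `degree_summary(duplication_divergence(t, p, seed=0))` for growing `t` gives empirical values that can be set against $t^p$ and $\max\{t^{2p-1}, 1\}$, keeping in mind that the stated result allows polylogarithmic factors.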
The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
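To make the quantity being tested concrete, here is a minimal sketch of a binned plug-in estimate of the squared $\ell_2$-ECE for binary predictions. It is deliberately the naive estimator, not the debiased statistic used by T-Cal; the equal-width binning, the default of 10 bins, and the function name are assumptions made for illustration.

```python
def binned_l2_ece_squared(probs, labels, n_bins=10):
    """Naive plug-in estimate of the squared l2-ECE for binary predictions.

    probs: predicted probabilities of class 1; labels: observed 0/1 outcomes.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    sums_p = [0.0] * n_bins
    sums_y = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums_p[b] += p
        sums_y[b] += y
        counts[b] += 1
    n = len(probs)
    total = 0.0
    for b in range(n_bins):
        if counts[b]:
            # Gap between observed frequency and mean prediction in the bin.
            gap = (sums_y[b] - sums_p[b]) / counts[b]
            total += counts[b] / n * gap ** 2
    return total
```

Because each squared per-bin gap also contains sampling variance, this plug-in value tends to be positive even for a calibrated model on a finite sample, which is the kind of artifact a debiased estimator and a formal test are meant to guard against.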
Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data to predict the expected cost given any data and action, and can thus be used directly to infer the action with the lowest cost. We apply our method to several benchmark problems and demonstrate that it incurs a cost similar to that obtained with the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it makes it possible to infer actions associated with low cost after few simulations.
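The decision problem described above can be written out as a short worked formulation. The notation is an assumption introduced here for illustration and is not taken from the paper: $c(\theta, a)$ is a cost function, $x_o$ the observed data, $f_\phi$ a network with weights $\phi$, and $p(a)$ a distribution from which training actions are drawn. The squared-error training objective is likewise one plausible choice rather than a confirmed detail of the method.

```latex
\begin{align*}
a^{*}(x_o) &= \arg\min_{a} \; \mathbb{E}_{\theta \sim p(\theta \mid x_o)}\!\left[ c(\theta, a) \right], \\
\phi^{*} &= \arg\min_{\phi} \; \mathbb{E}_{\theta \sim p(\theta),\; x \sim p(x \mid \theta),\; a \sim p(a)}\!\left[ \left( f_{\phi}(x, a) - c(\theta, a) \right)^{2} \right].
\end{align*}
```

Since the minimizer of a squared-error objective over all functions is the conditional mean, a sufficiently flexible and well-trained $f_{\phi}(x_o, a)$ approximates $\mathbb{E}[c(\theta, a) \mid x_o]$, and minimizing it over $a$ approximates $a^{*}(x_o)$ without an explicit posterior.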
Robust inferential methods based on divergence measures have shown an appealing trade-off between efficiency and robustness in many different statistical models. In this paper, minimum density power divergence estimators (MDPDEs) for the scale and shape parameters of the log-logistic distribution are considered. The log-logistic is a versatile distribution for modeling lifetime data, commonly adopted in survival analysis and reliability engineering studies when the hazard rate initially increases but then decreases after some point. Further, it is shown that the classical estimators based on maximum likelihood (MLE) are included as a particular case of the MDPDE family. Moreover, the corresponding influence function of the MDPDE is obtained, and its boundedness is proved, thus leading to robust estimators. A simulation study is carried out to illustrate the slight loss in efficiency of MDPDE with respect to MLE and, in addition, the considerable gain in robustness.
An important question in statistical network analysis is how to estimate models of discrete and dependent network data with intractable likelihood functions, without sacrificing computational scalability and statistical guarantees. We demonstrate that scalable estimation of random graph models with dependent edges is possible, by establishing convergence rates of pseudo-likelihood-based $M$-estimators for discrete undirected graphical models with exponential parameterizations and parameter vectors of increasing dimension in single-observation scenarios. We highlight the impact of two complex phenomena on the convergence rate: phase transitions and model near-degeneracy. The main results have possible applications to discrete and dependent network, spatial, and temporal data. To showcase convergence rates, we introduce a novel class of generalized $\beta$-models with dependent edges and parameter vectors of increasing dimension, which leverage additional structure in the form of overlapping subpopulations to control dependence. We establish convergence rates of pseudo-likelihood-based $M$-estimators for generalized $\beta$-models in dense- and sparse-graph settings.
We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
Data assimilation algorithms combine information from observations and prior model information to obtain the most likely state of a dynamical system. The linearised weak-constraint four-dimensional variational assimilation problem can be reformulated as a saddle point problem, which admits more scope for preconditioners than the primal form. In this paper we design new terms which can be used within existing preconditioners, such as block diagonal and constraint-type preconditioners. Our novel preconditioning approaches: (i) incorporate model information, and (ii) are designed to target correlated observation error covariance matrices. To our knowledge (i) has not previously been considered for data assimilation problems. We develop new theory demonstrating the effectiveness of the new preconditioners within Krylov subspace methods. Linear and non-linear numerical experiments reveal that our new approach leads to faster convergence than existing state-of-the-art preconditioners for a broader range of problems than indicated by the theory alone. We present a range of numerical experiments performed in serial.
We propose an innovative and generic methodology to analyse individual and collective behaviour through individual trajectory data. The work is motivated by the analysis of GPS trajectories of fishing vessels collected from regulatory tracking data in the context of marine biodiversity conservation and ecosystem-based fisheries management. We build a low-dimensional latent representation of trajectories using convolutional neural networks as a non-linear mapping. This is done by training a conditional variational auto-encoder taking into account covariates. The posterior distributions of the latent representations can be linked to the characteristics of the actual trajectories. The latent distributions of the trajectories are compared with the Bhattacharyya coefficient, which is well-suited for comparing distributions. Using this coefficient, we analyse the variation of the individual behaviour of each vessel over time. For collective behaviour analysis, we build proximity graphs and use an extension of the stochastic block model for multiple networks. This model results in a clustering of the individuals based on their set of trajectories. The application to French fishing vessels enables us to obtain groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns over the period 2014-2018.
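As an illustration of the comparison step, the Bhattacharyya coefficient has a closed form for Gaussian distributions. The sketch below assumes that each latent posterior is a Gaussian with diagonal covariance, which is a common choice for variational auto-encoders but is not confirmed by the text above; the function name and argument layout are hypothetical.

```python
import math


def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Bhattacharyya coefficient of two Gaussians with diagonal covariances.

    mu*: per-dimension means; var*: per-dimension (positive) variances.
    """
    if not (len(mu1) == len(var1) == len(mu2) == len(var2)):
        raise ValueError("all inputs must have the same length")
    distance = 0.0  # Bhattacharyya distance, summed over dimensions
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)  # averaged variance in this dimension
        distance += (m1 - m2) ** 2 / (8.0 * v)
        distance += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return math.exp(-distance)
```

The coefficient lies in (0, 1], equals 1 for identical distributions, and decays towards 0 as the two latent distributions separate, so a sequence of coefficients between consecutive trajectories of one vessel gives a simple view of behavioural change.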
Agricultural robotics and automation face challenges rooted in the high variability of products, task complexity, crop quality requirements, and dense vegetation. Such a set of challenges demands a more versatile and safe robotic system. Soft robotics is a young yet promising field of research that aims to enhance these aspects of current rigid robots, which makes it a good candidate solution for these challenges. In general, it aims to provide robots and machines with adaptive locomotion (Ansari et al., 2015), safe and adaptive manipulation (Arleo et al., 2020) and versatile grasping (Langowski et al., 2020). In agriculture, however, soft robots have been mainly used in harvesting tasks and more specifically in grasping. In this chapter, we review a candidate group of soft grippers that were used for handling and harvesting crops with regard to agricultural challenges, i.e. safety in handling and adaptability to the high variation of crops. The review aims to show why and to what extent soft grippers have been successful in handling agricultural tasks. The analysis carried out on the results provides future directions for the systematic design of soft robots in agricultural tasks.
The numerical integration of stiff equations is a challenging problem that needs to be approached by specialized numerical methods. Exponential integrators form a popular class of such methods since they are provably robust to stiffness and have been successfully applied to a variety of problems. The dynamical low-rank approximation is a recent technique for solving high-dimensional differential equations by means of low-rank approximations. However, the domain is lacking numerical methods for stiff equations since existing methods are either not robust to stiffness or have unreasonably large hidden constants. In this paper, we focus on solving large-scale stiff matrix differential equations with a Sylvester-like structure that admit good low-rank approximations. We propose two new methods that have good convergence properties, a small memory footprint, and that are fast to compute. The theoretical analysis shows that the new methods have order one and two, respectively. We also propose a practical implementation based on Krylov techniques. The approximation error is analyzed, leading to a priori error bounds and, therefore, a means of choosing the size of the Krylov space. Numerical experiments are performed on several examples, confirming the theory and showing good speedup in comparison to existing techniques.
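To illustrate why exponential integrators are robust to stiffness, the following sketch applies the classical first-order exponential Euler scheme to a scalar stiff test problem and compares it with explicit Euler. This is only the textbook scalar scheme, not the low-rank matrix methods proposed in the paper; the test problem $y' = \lambda y + g(t)$ with exact solution $\cos t$, the value $\lambda = -1000$, and all function names are assumptions chosen for illustration.

```python
import math


def phi1(z):
    """phi_1(z) = (exp(z) - 1) / z, with the limit value 1 at z = 0."""
    return math.expm1(z) / z if z != 0.0 else 1.0


def exponential_euler(lam, g, y0, t_end, n_steps):
    """Integrate y' = lam * y + g(t), treating the linear part exactly."""
    h = t_end / n_steps
    decay = math.exp(h * lam)
    weight = h * phi1(h * lam)
    y = y0
    for n in range(n_steps):
        # g is frozen at the left endpoint of each step (first order).
        y = decay * y + weight * g(n * h)
    return y


def explicit_euler(lam, g, y0, t_end, n_steps):
    """Standard explicit Euler on the same problem, for comparison."""
    h = t_end / n_steps
    y = y0
    for n in range(n_steps):
        y = y + h * (lam * y + g(n * h))
    return y


LAM = -1000.0  # assumed stiffness parameter for the test problem


def forcing(t):
    """Forcing term chosen so that y(t) = cos(t) solves the problem."""
    return -LAM * math.cos(t) - math.sin(t)
```

With 100 steps on $[0, 1]$ the step size gives $h\lambda = -10$, outside the stability region of explicit Euler, so `explicit_euler(LAM, forcing, 1.0, 1.0, 100)` diverges while `exponential_euler(LAM, forcing, 1.0, 1.0, 100)` stays close to $\cos 1$.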