R is a language and environment for statistical computing and graphics that provides a wide variety of statistical tools (modeling, statistical testing, time series analysis, classification, machine learning, ...), together with powerful graphical techniques and the great advantage of being highly extensible. It is now the software of choice in statistical courses at any level, for theoretical and applied subjects alike, and it has become an almost essential tool for any research that involves data analysis or visualization. It is also one of the most widely used general-purpose programming languages. The goal of this work is to share ideas and resources for improving teaching and/or research with the statistical software R. We will cover its benefits, show how to get started and where to locate specific resources, and offer recommendations for using R based on our experience. For the classroom, we will develop a curricular and assessment infrastructure to support both dissemination and evaluation, while for research we will offer a broader approach to quantitative studies that provides excellent support for work in science and technology.
Including a large number of predictors in the imputation model underlying a multiple imputation (MI) procedure is one of the most challenging tasks imputers face. A variety of high-dimensional MI techniques can help, but there has been limited research on their relative performance. In this study, we investigated a wide range of extant high-dimensional MI techniques that can handle a large number of predictors in the imputation models and general missing data patterns. We assessed the relative performance of seven high-dimensional MI methods with a Monte Carlo simulation study and a resampling study based on real survey data. The performance of the methods was defined by the degree to which they facilitate unbiased and confidence-valid estimates of the parameters of complete-data analysis models. We found that using a lasso penalty or forward selection to select the predictors in the MI model, and using principal component analysis to reduce the dimensionality of auxiliary data, produce the best results.
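As an illustration of the best-performing family of strategies, the sketch below uses scikit-learn's IterativeImputer with a lasso estimator, so that the penalty selects the active predictors in each conditional imputation model. This is a simplified, single-completion analogue, not the exact MI implementation evaluated in the study: a full MI workflow would also propagate parameter uncertainty and pool estimates across several completed datasets (e.g., via Rubin's rules).

```python
# Sketch: lasso-based iterative imputation with scikit-learn.
# This mirrors the spirit of using a lasso penalty to select predictors
# in the imputation model; it is NOT the exact MI procedure from the study
# (a full MI workflow would also draw imputation-model parameters and
# pool estimates across several completed datasets).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # 50 candidate predictors
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing, general pattern

imputer = IterativeImputer(
    estimator=Lasso(alpha=0.1),                # lasso selects active predictors
    max_iter=10,
    random_state=0,
)
X_completed = imputer.fit_transform(X)
```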
Convex PCA, introduced by Bigot et al., is a dimension-reduction methodology for data taking values in a convex subset of a Hilbert space. This setting arises naturally in many applications, including distributional data in the Wasserstein space of an interval and ranked compositional data under the Aitchison geometry. Our contribution in this paper is threefold. First, we present several new theoretical results, including consistency as well as continuity and differentiability of the objective function in the finite-dimensional case. Second, we develop a numerical implementation of finite-dimensional convex PCA when the convex set is polyhedral, and show that this provides a natural approximation of Wasserstein geodesic PCA. Third, we illustrate our results with two financial applications, namely distributions of stock returns ranked by size and the capital distribution curve, both of which are of independent interest in stochastic portfolio theory.
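Schematically (our notation here is informal and elides details such as the choice of base point and nested versus global components), a $k$-dimensional convex principal component of data $x_1, \dots, x_n$ in a convex set $C \subset H$ can be described as a minimizer of

\[
\min_{U} \ \sum_{i=1}^{n} \big\| x_i - \Pi_{C \cap (\bar{x} + U)}(x_i) \big\|_H^{2},
\]

where $U$ ranges over $k$-dimensional subspaces of $H$, $\bar{x}$ is a suitably chosen base point, and $\Pi_K$ denotes metric projection onto a closed convex set $K$; the precise formulation follows Bigot et al.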
Edge computing has become a promising computing paradigm for building IoT (Internet of Things) applications, particularly applications with specific constraints such as latency or privacy requirements. Given the resource constraints at the edge, it is important to utilize all available computing resources efficiently in order to satisfy these constraints. A key challenge in doing so is scheduling different computing tasks in a dynamically varying, highly hybrid computing environment. This paper describes the design, implementation, and evaluation of a distributed scheduler for the edge that constantly monitors the current state of the computing infrastructure and dynamically schedules computing tasks to ensure that all application constraints are met. The scheduler has been extensively evaluated with real-world AI applications under different scenarios, and the evaluation demonstrates that it outperforms current scheduling approaches in satisfying application constraints.
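To make the constraint-satisfaction aspect concrete, here is a deliberately minimal sketch of the kind of greedy, constraint-aware placement step such a scheduler might perform; all names and fields are hypothetical and do not reflect the paper's actual design or interfaces.

```python
# Illustrative sketch only: a greedy, constraint-aware placement step in the
# spirit of the scheduler described above. Node and Task are hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu: float          # cores currently available
    latency_ms: float        # observed network latency to this node
    local_only: bool         # True if data may stay on-premises (privacy)

@dataclass
class Task:
    name: str
    cpu: float               # cores required
    max_latency_ms: float    # latency constraint
    privacy_sensitive: bool  # must run on a local_only node

def schedule(task: Task, nodes: list[Node]) -> Node | None:
    """Pick the feasible node with the most spare capacity (None if none fit)."""
    feasible = [
        n for n in nodes
        if n.free_cpu >= task.cpu
        and n.latency_ms <= task.max_latency_ms
        and (not task.privacy_sensitive or n.local_only)
    ]
    if not feasible:
        return None
    best = max(feasible, key=lambda n: n.free_cpu)
    best.free_cpu -= task.cpu   # update monitored state after placement
    return best
```

A real scheduler would also refresh node state continuously and handle migration and preemption; the sketch only illustrates the constraint-filtering step.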
We develop sampling algorithms to fit Bayesian hierarchical models whose computational complexity scales linearly with the number of observations and the number of parameters in the model. We focus on crossed random effects and nested multilevel models, which are used ubiquitously in the applied sciences. The posterior dependence in both classes is sparse: in crossed random effects models it resembles a random graph, whereas in nested multilevel models it is tree-structured. For each class we identify a framework for scalable computation, building on previous work. Methods for crossed models are based on extensions of appropriately designed collapsed Gibbs samplers, for which we introduce the idea of local centering, while methods for nested models are based on sparse linear algebra and data augmentation. We provide a theoretical analysis of the proposed algorithms in some simplified settings, including a comparison with previously proposed methodologies and an average-case analysis based on random graph theory. Numerical experiments, including two challenging real-data analyses on predicting electoral results and real estate prices, compare the proposed methods with off-the-shelf Hamiltonian Monte Carlo and show drastic improvements in performance.
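For intuition, the sketch below implements one plain (uncollapsed) Gibbs sweep for a two-factor crossed model in a simplified setting (fully observed balanced design, no global mean, known variances). This is the baseline that collapsed samplers with local centering are designed to improve upon, not the proposed algorithm itself.

```python
# Sketch: one vanilla Gibbs sweep for a fully crossed two-factor model
#   y[i, j] = a[i] + b[j] + noise,  a_i ~ N(0, ta2), b_j ~ N(0, tb2),
# with known variances on a fully observed I x J grid. This plain sampler
# can mix slowly on sparse, unbalanced designs, which is what motivates the
# collapsed / locally centered schemes discussed above.
import numpy as np

def gibbs_sweep(y, a, b, s2=1.0, ta2=1.0, tb2=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    I, J = y.shape
    # Update row effects a_i given b (conjugate normal update).
    prec_a = J / s2 + 1.0 / ta2
    mean_a = (y - b[None, :]).sum(axis=1) / s2 / prec_a
    a = mean_a + rng.normal(size=I) / np.sqrt(prec_a)
    # Update column effects b_j given a.
    prec_b = I / s2 + 1.0 / tb2
    mean_b = (y - a[:, None]).sum(axis=0) / s2 / prec_b
    b = mean_b + rng.normal(size=J) / np.sqrt(prec_b)
    return a, b
```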
In novelty detection, the objective is to determine whether a test sample contains any outliers, using a sample of controls (inliers). This involves many-to-one comparisons of individual test points against the control sample. A recent approach applies the Benjamini-Hochberg procedure to the conformal $p$-values resulting from these comparisons, ensuring false discovery rate control. In this paper, we suggest using Wilcoxon-Mann-Whitney tests for the comparisons and then applying the closed testing principle to derive post-hoc confidence bounds for the number of outliers in any subset of the test sample. We revisit an elegant result: under a nonparametric alternative known as Lehmann's alternative, the Wilcoxon-Mann-Whitney test is locally most powerful among rank tests. Combining this result with a simple observation, we demonstrate that the proposed procedure is more powerful for the null hypothesis of no outliers than the Benjamini-Hochberg procedure applied to conformal $p$-values.
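For reference, the baseline we compare against can be sketched in a few lines: compute a conformal $p$-value for each test point against the calibration (control) scores, then apply Benjamini-Hochberg. The score convention (higher means more outlying) and function names below are illustrative; the proposed procedure, which replaces these marginal comparisons with Wilcoxon-Mann-Whitney statistics combined via closed testing, is not shown.

```python
# Sketch: conformal p-values for many-to-one outlier comparisons, followed by
# Benjamini-Hochberg (the baseline discussed above). Scores are assumed to be
# "higher = more outlying"; any trained score function can produce them.
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """p_i = (1 + #{calibration scores >= test score i}) / (n + 1)."""
    cal = np.sort(np.asarray(cal_scores))
    n = cal.size
    # number of calibration scores >= each test score
    ge = n - np.searchsorted(cal, np.asarray(test_scores), side="left")
    return (1.0 + ge) / (n + 1.0)

def benjamini_hochberg(pvals, alpha=0.1):
    """Indices of hypotheses rejected by BH at level alpha."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, p.size + 1) / p.size
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]
```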
With a rapidly increasing amount and diversity of remote sensing (RS) data sources, there is a strong need for multi-view learning methods. This is a complex task given the differences in resolution, magnitude, and noise across RS data. The typical approach to merging multiple RS sources has been input-level fusion, but other, more advanced fusion strategies may outperform this traditional approach. This work assesses different fusion strategies for crop classification in the CropHarvest dataset. The fusion methods proposed in this work outperform models based on individual views as well as previous fusion methods. However, we do not find a single fusion method that consistently outperforms all other approaches. Instead, we present a comparison of multi-view fusion methods on three different datasets and show that, depending on the test region, different methods obtain the best performance. Despite this, we suggest a preliminary criterion for the selection of fusion methods.
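To fix terminology, the sketch below contrasts the two simplest strategies on two hypothetical views: input-level fusion, which concatenates raw inputs before a joint encoder, and feature-level fusion, which encodes each view separately and merges the learned representations. The modules are illustrative toy models, not the architectures evaluated in this work.

```python
# Sketch: input-level vs. feature-level fusion of two RS views with
# hypothetical dimensionalities d1 and d2. Toy models for illustration only.
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Concatenate raw views, then encode jointly (the traditional baseline)."""
    def __init__(self, d1, d2, hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d1 + d2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))
    def forward(self, x1, x2):
        return self.net(torch.cat([x1, x2], dim=-1))

class FeatureFusion(nn.Module):
    """Encode each view separately, then merge the learned features."""
    def __init__(self, d1, d2, hidden, n_classes):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)
    def forward(self, x1, x2):
        return self.head(torch.cat([self.enc1(x1), self.enc2(x2)], dim=-1))
```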
Surgical instrument segmentation is recognised as a key enabler of advanced surgical assistance and improved computer-assisted interventions. In this work, we propose SegMatch, a semi-supervised learning method that reduces the need for expensive annotation of laparoscopic and robotic surgical images. SegMatch builds on FixMatch, a widespread semi-supervised classification pipeline combining consistency regularization and pseudo-labelling, and adapts it for segmentation. In SegMatch, unlabelled images are weakly augmented and fed into the segmentation model to generate pseudo-labels; an unsupervised loss then enforces consistency between these pseudo-labels and the model's output for an adversarially augmented version of the same image, restricted to pixels with high confidence scores. Our adaptation to segmentation tasks includes carefully considering the equivariance and invariance properties of the augmentation functions we rely on. To increase the relevance of our augmentations, we depart from using only handcrafted augmentations and introduce a trainable adversarial augmentation strategy. Our algorithm was evaluated on the MICCAI Instrument Segmentation Challenge datasets Robust-MIS 2019 and EndoVis 2017. Our results demonstrate that adding unlabelled data for training allows us to surpass the performance of fully supervised approaches, which are limited by the availability of training data in these challenges. SegMatch also outperforms a range of state-of-the-art semi-supervised semantic segmentation models across different labelled-to-unlabelled data ratios.
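The core unsupervised term can be sketched as follows: pseudo-labels are produced from a weakly augmented image, and a cross-entropy loss is applied to the prediction for the strongly (here, adversarially) augmented version, masked to pixels whose confidence exceeds a threshold. This is a schematic reduction; geometric augmentations would additionally require warping the pseudo-labels to respect equivariance, and the trainable adversarial augmentation itself is not shown.

```python
# Sketch of a FixMatch-style unsupervised loss adapted to segmentation:
# pseudo-labels come from the weakly augmented image; the loss is applied to
# the prediction for the strongly augmented image, masked to high-confidence
# pixels. Purely photometric augmentations are assumed here; geometric ones
# would require warping the pseudo-labels accordingly.
import torch
import torch.nn.functional as F

def unsupervised_loss(model, weak_img, strong_img, tau=0.95):
    with torch.no_grad():
        probs = torch.softmax(model(weak_img), dim=1)   # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                 # per-pixel confidence
        mask = (conf >= tau).float()                    # keep confident pixels
    logits = model(strong_img)
    loss = F.cross_entropy(logits, pseudo, reduction="none")  # (B, H, W)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```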
Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling interacting systems prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases. However, combining data from various sources is challenging because the appropriate inductive bias may vary by data modality. Multimodal learning methods address this challenge by fusing multiple data modalities while leveraging cross-modal dependencies. Here, we survey 140 studies in graph-centric AI and find that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models. These models stratify into image-, language-, and knowledge-grounded multimodal learning. Based on this categorization, we put forward an algorithmic blueprint for multimodal graph learning. The blueprint serves as a way to group state-of-the-art architectures according to how they appropriately choose four components for treating multimodal data. This effort can pave the way for standardizing the design of sophisticated multimodal architectures for highly complex real-world problems.
Most algorithms for representation learning and link prediction in relational data have been designed for static data. However, the data they are applied to usually evolves with time, such as friend graphs in social networks or user interactions with items in recommender systems. This is also the case for knowledge bases, which contain facts such as (US, has president, B. Obama, [2009-2017]) that are valid only at certain points in time. For the problem of link prediction under temporal constraints, i.e., answering queries such as (US, has president, ?, 2012), we propose a solution inspired by the canonical decomposition of tensors of order 4. We introduce new regularization schemes and present an extension of ComplEx (Trouillon et al., 2016) that achieves state-of-the-art performance. Additionally, we propose a new dataset for knowledge base completion, constructed from Wikidata and an order of magnitude larger than previous benchmarks, as a reference for evaluating temporal and non-temporal link prediction methods.
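Schematically (this is a sketch in the spirit of the ComplEx extension, not its exact parameterization), a rank-$d$ order-4 decomposition scores a quadruple $(s, r, o, t)$ as

\[
\phi(s, r, o, t) \;=\; \mathrm{Re}\Big( \sum_{k=1}^{d} e_s^{(k)} \, w_r^{(k)} \, \bar{e}_o^{(k)} \, v_t^{(k)} \Big),
\]

where $e_s, w_r, e_o, v_t \in \mathbb{C}^d$ are complex embeddings of the subject, relation, object, and timestamp, and $\bar{e}_o$ denotes the complex conjugate; setting $v_t \equiv 1$ recovers the static ComplEx score.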
Deep learning constitutes a recent, modern technique for image processing and data analysis, with promising results and large potential. As deep learning has been successfully applied in various domains, it has recently also entered the domain of agriculture. In this paper, we survey 40 research efforts that employ deep learning techniques to address various agricultural and food production challenges. We examine the particular agricultural problems under study, the specific models and frameworks employed, the sources, nature, and pre-processing of the data used, and the overall performance achieved according to the metrics reported in each work. Moreover, we study comparisons of deep learning with other existing popular techniques with respect to differences in classification or regression performance. Our findings indicate that deep learning provides high accuracy, outperforming existing commonly used image processing techniques.