Artificial intelligence and data access are already mainstream. One of the main challenges when designing an artificial intelligence or disclosing content from a database is preserving the privacy of individuals who participate in the process. Differential privacy for synthetic data generation has received much attention due to the ability of preserving privacy while freely using the synthetic data. Private sampling is the first noise-free method to construct differentially private synthetic data with rigorous bounds for privacy and accuracy. However, this synthetic data generation method comes with constraints which seem unrealistic and not applicable for real-world datasets. In this paper, we provide an implementation of the private sampling algorithm and discuss the realism of its constraints in practical cases.
The human cerebral cortex has many bumps and grooves called gyri and sulci. Even though there is a high inter-individual consistency for the main cortical folds, this is not the case when we examine the exact shapes and details of the folding patterns. Because of this complexity, characterizing the cortical folding variability and relating them to subjects' behavioral characteristics or pathologies is still an open scientific problem. Classical approaches include labeling a few specific patterns, either manually or semi-automatically, based on geometric distances, but the recent availability of MRI image datasets of tens of thousands of subjects makes modern deep-learning techniques particularly attractive. Here, we build a self-supervised deep-learning model to detect folding patterns in the cingulate region. We train a contrastive self-supervised model (SimCLR) on both Human Connectome Project (1101 subjects) and UKBioBank (21070 subjects) datasets with topological-based augmentations on the cortical skeletons, which are topological objects that capture the shape of the folds. We explore several backbone architectures (convolutional network, DenseNet, and PointNet) for the SimCLR. For evaluation and testing, we perform a linear classification task on a database manually labeled for the presence of the "double-parallel" folding pattern in the cingulate region, which is related to schizophrenia characteristics. The best model, giving a test AUC of 0.76, is a convolutional network with 6 layers, a 10-dimensional latent space, a linear projection head, and using the branch-clipping augmentation. This is the first time that a self-supervised deep learning model has been applied to cortical skeletons on such a large dataset and quantitatively evaluated. We can now envisage the next step: applying it to other brain regions to detect other biomarkers.
Among semiparametric regression models, partially linear additive models provide a useful tool to include additive nonparametric components as well as a parametric component, when explaining the relationship between the response and a set of explanatory variables. This paper concerns such models under sparsity assumptions for the covariates included in the linear component. Sparse covariates are frequent in regression problems where the task of variable selection is usually of interest. As in other settings, outliers either in the residuals or in the covariates involved in the linear component have a harmful effect. To simultaneously achieve model selection for the parametric component of the model and resistance to outliers, we combine preliminary robust estimators of the additive component, robust linear $MM-$regression estimators with a penalty such as SCAD on the coefficients in the parametric part. Under mild assumptions, consistency results and rates of convergence for the proposed estimators are derived. A Monte Carlo study is carried out to compare, under different models and contamination schemes, the performance of the robust proposal with its classical counterpart. The obtained results show the advantage of using the robust approach. Through the analysis of a real data set, we also illustrate the benefits of the proposed procedure.
With the development and popularity of sensors installed in manufacturing systems, complex data are collected during manufacturing processes, which brings challenges for traditional process control methods. This paper proposes a novel process control and monitoring method for the complex structure of high-dimensional image-based overlay errors (modeled in tensor form), which are collected in semiconductor manufacturing processes. The proposed method aims to reduce overlay errors using limited control recipes. We first build a high-dimensional process model and propose different tensor-on-vector regression algorithms to estimate parameters in the model to alleviate the curse of dimensionality. Then, based on the estimate of tensor parameters, the exponentially weighted moving average (EWMA) controller for tensor data is designed whose stability is theoretically guaranteed. Considering the fact that low-dimensional control recipes cannot compensate for all high-dimensional disturbances on the image, control residuals are monitored to prevent significant drifts of uncontrollable high-dimensional disturbances. Through extensive simulations and real case studies, the performances of parameter estimation algorithms and the EWMA controller in tensor space are evaluated. Compared with existing image-based feedback controllers, the superiority of our method is verified especially when disturbances are not stable.
Time series and extreme value analyses are two statistical approaches usually applied to study hydrological data. Classical techniques, such as ARIMA models (in the case of mean flow predictions), and parametric generalised extreme value (GEV) fits and nonparametric extreme value methods (in the case of extreme value theory) have been usually employed in this context. In this paper, nonparametric functional data methods are used to perform mean monthly flow predictions and extreme value analysis, which are important for flood risk management. These are powerful tools that take advantage of both, the functional nature of the data under consideration and the flexibility of nonparametric methods, providing more reliable results. Therefore, they can be useful to prevent damage caused by floods and to reduce the likelihood and/or the impact of floods in a specific location. The nonparametric functional approaches are applied to flow samples of two rivers in the U.S. In this way, monthly mean flow is predicted and flow quantiles in the extreme value framework are estimated using the proposed methods. Results show that the nonparametric functional techniques work satisfactorily, generally outperforming the behaviour of classical parametric and nonparametric estimators in both settings.
A statistical emulator can be used as a surrogate of complex physics-based calculations to drastically reduce the computational cost. Its successful implementation hinges on an accurate representation of the nonlinear response surface with a high-dimensional input space. Conventional "space-filling" designs, including random sampling and Latin hypercube sampling, become inefficient as the dimensionality of the input variables increases, and the predictive accuracy of the emulator can degrade substantially for a test input distant from the training input set. To address this fundamental challenge, we develop a reliable emulator for predicting complex functionals by active learning with error control (ALEC). The algorithm is applicable to infinite-dimensional mapping with high-fidelity predictions and a controlled predictive error. The computational efficiency has been demonstrated by emulating the classical density functional theory (cDFT) calculations, a statistical-mechanical method widely used in modeling the equilibrium properties of complex molecular systems. We show that ALEC is much more accurate than conventional emulators based on the Gaussian processes with "space-filling" designs and alternative active learning methods. Besides, it is computationally more efficient than direct cDFT calculations. ALEC can be a reliable building block for emulating expensive functionals owing to its minimal computational cost, controllable predictive error, and fully automatic features.
Regression with random data objects is becoming increasingly common in modern data analysis. Unfortunately, like the traditional regression setting with Euclidean data, random response regression is not immune to the trouble caused by unusual observations. A metric Cook's distance extending the classical Cook's distances of Cook (1977) to general metric-valued response objects is proposed. The performance of the metric Cook's distance in both Euclidean and non-Euclidean response regression with Euclidean predictors is demonstrated in an extensive experimental study. A real data analysis of county-level COVID-19 transmission in the United States also illustrates the usefulness of this method in practice.
The analysis of survey data is a frequently arising issue in clinical trials, particularly when capturing quantities which are difficult to measure. Typical examples are questionnaires about patient's well-being, pain, or consent to an intervention. In these, data is captured on a discrete scale containing only a limited number of possible answers, from which the respondent has to pick the answer which fits best his/her personal opinion. This data is generally located on an ordinal scale as answers can usually be arranged in an ascending order, e.g., "bad", "neutral", "good" for well-being. Since responses are usually stored numerically for data processing purposes, analysis of survey data using ordinary linear regression models are commonly applied. However, assumptions of these models are often not met as linear regression requires a constant variability of the response variable and can yield predictions out of the range of response categories. By using linear models, one only gains insights about the mean response which may affect representativeness. In contrast, ordinal regression models can provide probability estimates for all response categories and yield information about the full response scale beyond the mean. In this work, we provide a concise overview of the fundamentals of latent variable based ordinal models, applications to a real data set, and outline the use of state-of-the-art-software for this purpose. Moreover, we discuss strengths, limitations and typical pitfalls. This is a companion work to a current vignette-based structured interview study in paediatric anaesthesia.
An important problem in project management is determining ways to distribute amongst activities the costs that are incurred when a project is delayed because some activities end later than expected. In this study, we address this problem in stochastic projects, where the durations of activities are unknown but their corresponding probability distributions are known. We propose and characterise an allocation rule based on the Shapley value, illustrate its behaviour by using examples, and analyse features of its calculation for large problems.
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communications and synchronisation, and is difficult to scale. In this paper we present four algorithms to solve these problems. The combination of these algorithms enable each agent to improve their task allocation strategy through reinforcement learning, while changing how much they explore the system in response to how optimal they believe their current strategy is, given their past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, to then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to 6.7% of the theoretical optimal within the system configurations considered. It provides 5x better performance recovery over no-knowledge retention approaches when system connectivity is impacted, and is tested against systems up to 100 agents with less than a 9% impact on the algorithms' performance.
In recent years, object detection has experienced impressive progress. Despite these improvements, there is still a significant gap in the performance between the detection of small and large objects. We analyze the current state-of-the-art model, Mask-RCNN, on a challenging dataset, MS COCO. We show that the overlap between small ground-truth objects and the predicted anchors is much lower than the expected IoU threshold. We conjecture this is due to two factors; (1) only a few images are containing small objects, and (2) small objects do not appear enough even within each image containing them. We thus propose to oversample those images with small objects and augment each of those images by copy-pasting small objects many times. It allows us to trade off the quality of the detector on large objects with that on small objects. We evaluate different pasting augmentation strategies, and ultimately, we achieve 9.7\% relative improvement on the instance segmentation and 7.1\% on the object detection of small objects, compared to the current state of the art method on MS COCO.