Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing these data is recommended to comply with Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees (CART) and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.
Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.
Results: The computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling-out and linkability attacks. Distributional and inferential similarity with the original data were high for all datasets.
Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. The formal and empirical tools developed specifically for this demonstration are a valuable contribution to the field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.
Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
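As a rough illustration of the general approach (a sketch of sequential CART synthesis with a distance-based filter, not the CESP framework itself), the following Python code shows the idea; the numeric-only assumption, tree settings, and distance threshold are illustrative choices:

    # Minimal sketch: sequential CART synthesis with a distance-based filter.
    # Assumes a purely numeric pandas DataFrame `original`; column order,
    # tree depth and the distance threshold are illustrative choices only.
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import pairwise_distances

    def synthesize(original: pd.DataFrame, random_state=0) -> pd.DataFrame:
        rng = np.random.default_rng(random_state)
        cols = list(original.columns)
        synth = pd.DataFrame(index=original.index)
        # Seed the first column by resampling its marginal distribution.
        synth[cols[0]] = rng.choice(original[cols[0]].values, size=len(original))
        # Each subsequent column is modeled by a CART fitted on the previous ones,
        # then sampled from the leaf reached by the already-synthesized values.
        for j in range(1, len(cols)):
            tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=random_state)
            tree.fit(original[cols[:j]], original[cols[j]])
            leaves_orig = tree.apply(original[cols[:j]])
            leaves_synth = tree.apply(synth[cols[:j]])
            values = original[cols[j]].values
            synth[cols[j]] = [
                rng.choice(values[leaves_orig == leaf]) for leaf in leaves_synth
            ]
        return synth

    def distance_filter(original, synth, threshold=0.05):
        # Drop synthetic rows lying closer than `threshold` (normalized Euclidean
        # distance) to any original record: a crude stand-in for the
        # distance-based filtering step.
        scale = original.std().replace(0, 1)
        d = pairwise_distances(synth / scale, original / scale).min(axis=1)
        return synth[d > threshold]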
Aberrant respondents are common yet extremely detrimental to the quality of social surveys and questionnaires. Recently, factor mixture models have been employed to identify individuals providing deceptive or careless responses. We propose a comprehensive factor mixture model that combines confirmatory and exploratory factor models to represent both the non-aberrant and aberrant components of the responses. The flexibility of the proposed solution allows for the identification of two of the most common aberrant response styles, namely faking and careless responding. We validated our approach by means of two simulations and two case studies. The results indicate the effectiveness of the proposed model in handling aberrant responses in social and behavioral surveys.
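Schematically, such a two-component factor mixture could be written as follows; the Gaussian form, the mixing weight $\pi$, and the loading matrices $\Lambda_C$ (confirmatory) and $\Lambda_E$ (exploratory) are illustrative notation rather than the authors' exact specification:

\[
p(\mathbf{y}_i) \;=\; \pi\,\mathcal{N}\!\bigl(\mathbf{y}_i;\ \boldsymbol{\nu}_C,\ \Lambda_C \Psi_C \Lambda_C^{\top} + \Theta_C\bigr)
\;+\; (1-\pi)\,\mathcal{N}\!\bigl(\mathbf{y}_i;\ \boldsymbol{\nu}_E,\ \Lambda_E \Psi_E \Lambda_E^{\top} + \Theta_E\bigr),
\]

where the confirmatory loadings $\Lambda_C$ follow the hypothesized measurement structure, the exploratory loadings $\Lambda_E$ are left unrestricted to absorb faking or careless responding, and posterior class probabilities flag likely aberrant respondents.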
Sequential neural posterior estimation (SNPE) techniques have recently been proposed for dealing with simulation-based models with intractable likelihoods. Unlike approximate Bayesian computation, SNPE techniques learn the posterior from sequential simulation using neural network-based conditional density estimators by minimizing a specific loss function. The SNPE method proposed by Lueckmann et al. (2017) used a calibration kernel to boost the sample weights around the observed data, resulting in a concentrated loss function. However, the use of calibration kernels may increase the variances of both the empirical loss and its gradient, making the training inefficient. To improve the stability of SNPE, this paper proposes an adaptive calibration kernel together with several variance reduction techniques. The proposed method greatly speeds up training and provides a better approximation of the posterior than the original SNPE method and some existing competitors, as confirmed by numerical experiments.
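In schematic form (notation ours, not necessarily the paper's), the calibration-weighted loss over simulated pairs $(\boldsymbol{\theta}_n, \mathbf{x}_n)$ drawn from a proposal $\tilde{p}$ is

\[
\mathcal{L}(\phi) \;=\; -\frac{1}{N}\sum_{n=1}^{N}
\frac{p(\boldsymbol{\theta}_n)}{\tilde{p}(\boldsymbol{\theta}_n)}\,
K_{\tau}\!\left(\mathbf{x}_n, \mathbf{x}_o\right)\,
\log q_{\phi}\!\left(\boldsymbol{\theta}_n \mid \mathbf{x}_n\right),
\]

where the kernel bandwidth $\tau$ concentrates the loss on simulations near the observation $\mathbf{x}_o$; shrinking $\tau$ sharpens the approximation but inflates the variance of the Monte Carlo loss and its gradient, which is the trade-off the adaptive kernel and variance reduction techniques target.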
Advancements in artificial intelligence (AI) over the last decade demonstrate that machines can exhibit communicative behavior and influence how humans think, feel, and behave. In fact, the recent development of ChatGPT has shown that large language models (LLMs) can be leveraged to generate high-quality communication content at scale and across domains, suggesting that they will be increasingly used in practice. However, many questions remain about how knowing the source of the messages influences recipients' evaluation of and preference for AI-generated messages compared to human-generated messages. This paper investigated this topic in the context of vaping prevention messaging. In Study 1, which was pre-registered, we examined the influence of source disclosure on people's evaluation of AI-generated health prevention messages compared to human-generated messages. We found that source disclosure (i.e., labeling the source of a message as AI vs. human) significantly impacted the evaluation of the messages but did not significantly alter message rankings. In a follow-up study (Study 2), we examined how the influence of source disclosure may vary by the participants' negative attitudes towards AI. We found a significant moderating effect of negative attitudes towards AI on message evaluation, but not on message selection. However, for those with moderate levels of negative attitudes towards AI, source disclosure decreased the preference for AI-generated messages. Overall, the results of this series of studies showed a slight bias against AI-generated messages once the source was disclosed, adding to the emerging area of study that lies at the intersection of AI and communication.
Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.
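A minimal single-view sketch of the concept-bottleneck idea is given below; the feature representation, concept encoding, and choice of logistic models are placeholder assumptions, not the paper's multiview architecture:

    # Minimal concept-bottleneck sketch: predict clinician-level concepts from
    # (precomputed) image features, then predict the label from the concepts.
    # Feature extraction and concept names are placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class ConceptBottleneck:
        def __init__(self, n_concepts):
            self.concept_models = [LogisticRegression(max_iter=1000) for _ in range(n_concepts)]
            self.label_model = LogisticRegression(max_iter=1000)

        def fit(self, X, C, y):
            # X: image features, C: binary concept annotations, y: diagnosis label.
            for j, m in enumerate(self.concept_models):
                m.fit(X, C[:, j])
            self.label_model.fit(self._concepts(X), y)
            return self

        def _concepts(self, X):
            return np.column_stack([m.predict_proba(X)[:, 1] for m in self.concept_models])

        def predict_proba(self, X, interventions=None):
            # `interventions` lets a clinician overwrite predicted concepts with
            # known values: the interaction that bottleneck models enable.
            C_hat = self._concepts(X)
            if interventions is not None:
                for j, value in interventions.items():
                    C_hat[:, j] = value
            return self.label_model.predict_proba(C_hat)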
Nonlinear systems arising from time integrators like Backward Euler can sometimes be reformulated as optimization problems, known as incremental potentials. We show through a comprehensive experimental analysis that the widely used Projected Newton method, which relies on unconditional semidefinite projection of Hessian contributions, typically exhibits a reduced convergence rate compared to classical Newton's method. We demonstrate how factors such as resolution, element order, projection method, material model, and boundary handling affect the convergence of Projected Newton and Newton. Drawing on these findings, we propose the hybrid method Project-on-Demand Newton, which projects only conditionally, and show that it enjoys both the robustness of Projected Newton and the convergence rate of Newton. We additionally introduce Kinetic Newton, a regularization-based method that takes advantage of the structure of incremental potentials and avoids projection altogether. We compare the four solvers on hyperelasticity and contact problems. We also present a nuanced discussion of convergence criteria and propose a new acceleration-based criterion that avoids the problems associated with existing residual-norm criteria and is easier to interpret. We finally address a fundamental limitation of the Armijo backtracking line search that occasionally blocks convergence, especially for stiff problems, and propose a novel parameter-free, robust line search technique to eliminate this issue.
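The following sketch contrasts unconditional projection with a project-on-demand rule on an assembled Hessian (in practice Projected Newton projects per-element contributions; `energy`, `grad`, and `hess` are stand-ins for an incremental potential and its derivatives, not the paper's code):

    import numpy as np

    def project_spd(H, eps=1e-8):
        # Eigenvalue clamping: the unconditional projection used by Projected
        # Newton (done per element in practice; done on the assembled Hessian
        # here for brevity).
        w, V = np.linalg.eigh(H)
        return (V * np.maximum(w, eps)) @ V.T

    def solver_step(x, energy, grad, hess, mode="on_demand"):
        g, H = grad(x), hess(x)
        if mode == "always":                        # classical Projected Newton
            H = project_spd(H)
        d = np.linalg.solve(H, -g)
        if mode == "on_demand" and g @ d >= 0.0:    # project only when the Newton
            d = np.linalg.solve(project_spd(H), -g) # direction fails to descend
        # Armijo backtracking line search (whose failure modes the paper discusses).
        alpha, c, beta = 1.0, 1e-4, 0.5
        for _ in range(50):
            if energy(x + alpha * d) <= energy(x) + c * alpha * (g @ d):
                break
            alpha *= beta
        return x + alpha * d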
When modeling scientific and industrial problems, geometries are typically described by explicit boundary representations obtained from computer-aided design software. Unfitted (also known as embedded or immersed) finite element methods offer a significant advantage in dealing with complex geometries, eliminating the need to generate unstructured body-fitted meshes. However, current unfitted finite element methods on nonlinear geometries are restricted to implicit (possibly high-order) level-set geometries. In this work, we introduce a novel automatic computational pipeline to approximate solutions of partial differential equations on domains defined by explicit nonlinear boundary representations. For the geometrical discretization, we propose a novel algorithm to generate the bulk and surface quadratures on nonlinear polytopes required to compute all the terms in unfitted finite element methods. The algorithm relies on a nonlinear triangulation of the boundary; a kd-tree refinement of the surface cells that reduces the nonlinear intersections of surface and background cells to simple cases diffeomorphically equivalent to linear intersections; robust polynomial root-finding algorithms; and surface parameterization techniques. We prove the correctness of the proposed algorithm. We have successfully applied this algorithm to simulate partial differential equations with unfitted finite elements on nonlinear domains described by computer-aided design models, demonstrating the robustness of the geometric algorithm and the high-order accuracy of the overall method.
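As a much-simplified 2D illustration of the cut-cell quadrature problem (a toy, not the paper's algorithm for CAD boundary representations), one can integrate over the part of a unit background cell lying below an explicit nonlinear curve y = g(x) by combining robust root finding with Gauss quadrature:

    # Toy cut-cell quadrature: integrate f over {(x, y) in [0,1]^2 : y < g(x)}
    # for an explicit nonlinear boundary y = g(x). Root finding splits the cell
    # where the curve enters or leaves it; Gauss quadrature handles each piece.
    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import roots_legendre

    def cut_cell_integral(f, g, n=8):
        xs, ws = roots_legendre(n)

        def gauss(a, b, func):
            mid, half = 0.5 * (a + b), 0.5 * (b - a)
            return half * sum(w * func(mid + half * x) for x, w in zip(xs, ws))

        # Split [0, 1] at points where the boundary curve crosses y = 0 or y = 1.
        grid = np.linspace(0.0, 1.0, 65)
        breaks = {0.0, 1.0}
        for level in (0.0, 1.0):
            h = lambda x, lvl=level: g(x) - lvl
            for a, b in zip(grid[:-1], grid[1:]):
                if h(a) * h(b) < 0:
                    breaks.add(brentq(h, a, b))
        breaks = sorted(breaks)

        # On each sub-interval the clipped height min(max(g, 0), 1) is smooth.
        height = lambda x: np.clip(g(x), 0.0, 1.0)
        inner = lambda x: gauss(0.0, height(x), lambda y: f(x, y)) if height(x) > 0 else 0.0
        return sum(gauss(a, b, inner) for a, b in zip(breaks[:-1], breaks[1:]))

    # Example: area below y = 0.5 + 0.3*sin(3x) inside the unit cell.
    area = cut_cell_integral(lambda x, y: 1.0, lambda x: 0.5 + 0.3 * np.sin(3 * x))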
Previous researchers conducting Just-In-Time (JIT) defect prediction tasks have primarily focused on the performance of individual pre-trained models, without exploring the relationships between different pre-trained models used as backbones. In this study, we build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone. We systematically explore the differences and connections between these models. Specifically, we investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution among these six models. Additionally, we conduct an ablation experiment to explore each model's sensitivity to its inputs. Furthermore, we investigate how the models perform in zero-shot and few-shot scenarios. Our findings indicate that each of the models, regardless of its backbone, shows improvements, and that models with similar pre-trained backbones consume comparable amounts of training resources. We also observe that Commit code plays a significant role in defect detection, and that different pre-trained models demonstrate better defect detection ability on a balanced dataset under few-shot scenarios. These results provide new insights for optimizing JIT defect prediction tasks using pre-trained models and highlight the factors that require more attention when constructing such models. Additionally, CodeGPTJIT and GPT2JIT achieved better performance than DeepJIT and CC2Vec on the two datasets, respectively, with 2,000 training samples. These findings emphasize the effectiveness of transformer-based pre-trained models in JIT defect prediction tasks, especially in scenarios with limited training data.
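A sketch of how such a classifier could pair a pre-trained backbone with commit inputs is shown below; the CodeBERT checkpoint, the message/code pairing, and the hyperparameters are illustrative assumptions rather than the paper's exact setup:

    # Sketch of a JIT defect classifier with a pre-trained backbone (here CodeBERT).
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/codebert-base", num_labels=2  # defect-inducing vs. clean
    )

    def predict_defect(commit_message: str, commit_code: str) -> float:
        # Commit message and code change are paired so the backbone can attend
        # to both; truncation keeps inputs within the model's context window.
        inputs = tokenizer(
            commit_message, commit_code,
            truncation=True, max_length=512, return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()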
We propose and compare methods for the analysis of extreme events in complex systems governed by PDEs that involve random parameters, in situations where we are interested in quantifying the probability that a scalar function of the system's solution is above a threshold. If the threshold is large, this probability is small and its accurate estimation is challenging. To tackle this difficulty, we blend theoretical results from large deviation theory (LDT) with numerical tools from PDE-constrained optimization. Our methods first compute parameters that minimize the LDT-rate function over the set of parameters leading to extreme events, using adjoint methods to compute the gradient of this rate function. The minimizers give information about the mechanism of the extreme events as well as estimates of their probability. We then propose a series of methods to refine these estimates, either via importance sampling or geometric approximation of the extreme event sets. Results are formulated for general parameter distributions, and detailed expressions are provided when the parameter distributions are Gaussian. We give theoretical and numerical arguments showing that the performance of our methods is insensitive to the extremeness of the events we are interested in. We illustrate the application of our approach to quantify the probability of extreme tsunami events on shore. Tsunamis are typically caused by a sudden, unpredictable change of the ocean floor elevation during an earthquake. We model this change as a random process, which takes into account the underlying physics. We use the one-dimensional shallow water equation to model tsunamis numerically. In the context of this example, we present a comparison of our methods for extreme event probability estimation, and find which type of ocean floor elevation change leads to the largest tsunamis on shore.
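Schematically (notation ours), for Gaussian parameters $\theta \sim \mathcal{N}(0, C)$ and a PDE-dependent quantity of interest $F(\theta)$, the LDT step solves

\[
\theta^{\star} \;=\; \operatorname*{arg\,min}_{\theta \,:\, F(\theta) \ge z} I(\theta),
\qquad I(\theta) = \tfrac{1}{2}\,\theta^{\top} C^{-1} \theta,
\]

with $\nabla F$ supplied by adjoint solves, and yields the leading-order estimate $P[F(\theta)\ge z] \approx \exp(-I(\theta^{\star}))$ up to a prefactor, which the importance-sampling and geometric refinements are designed to estimate.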
In large-scale systems, centralised techniques for task allocation face fundamental challenges: the number of interactions is limited by resource constraints such as limits on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, this also increases the resource cost of communication and synchronisation, and is itself difficult to scale. In this paper we present four algorithms to address these problems. In combination, these algorithms enable each agent to improve its task-allocation strategy through reinforcement learning, while adjusting how much it explores the system in response to how optimal it believes its current strategy to be, given its past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, restricting agents to local rather than system-wide knowledge. We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those tasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within 6.7% of the theoretical optimum for the system configurations considered. It provides 5x better performance recovery than approaches without knowledge retention when system connectivity is impacted, and is tested on systems of up to 100 agents with less than a 9% impact on the algorithms' performance.
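A minimal sketch of one such agent, using tabular Q-learning with exploration scaled by how settled its value estimates appear, is given below; the confidence heuristic and all parameters are illustrative assumptions rather than the paper's algorithms:

    # Sketch of an agent that learns a task-allocation policy with Q-learning and
    # scales its exploration by how much its value estimates are still changing.
    import random
    from collections import defaultdict

    class AllocatorAgent:
        def __init__(self, actions, alpha=0.1, gamma=0.9):
            self.q = defaultdict(float)      # (state, action) -> value estimate
            self.actions = actions           # candidate agents to delegate a subtask to
            self.alpha, self.gamma = alpha, gamma
            self.recent_errors = []          # recent TD errors drive exploration

        def epsilon(self):
            # Explore more while value estimates are still changing a lot,
            # less once the strategy appears near-optimal.
            if not self.recent_errors:
                return 1.0
            window = self.recent_errors[-50:]
            return min(1.0, 0.05 + sum(abs(e) for e in window) / len(window))

        def choose(self, state):
            if random.random() < self.epsilon():
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            td_error = reward + self.gamma * best_next - self.q[(state, action)]
            self.q[(state, action)] += self.alpha * td_error
            self.recent_errors.append(td_error)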
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
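For instance, in overparametrized linear regression ($X \in \mathbb{R}^{n \times d}$ with $d > n$ and full row rank), the implicit regularization of gradient descent initialized at zero can be stated exactly: it converges to the minimum-norm interpolant

\[
\hat{w} \;=\; \operatorname*{arg\,min}_{w}\bigl\{\, \lVert w \rVert_2 \;:\; Xw = y \,\bigr\}
\;=\; X^{\top}\bigl(XX^{\top}\bigr)^{-1} y ,
\]

and the benign-overfitting analyses then decompose such interpolants along leading (signal-bearing) and trailing (noise-absorbing) eigendirections of the data covariance.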