In this study, Synthetic Aperture Radar (SAR) and optical data are both considered for Earth surface classification. Specifically, the integration of Sentinel-1 (S-1) and Sentinel-2 (S-2) data is carried out through supervised Machine Learning (ML) algorithms implemented on the Google Earth Engine (GEE) platform for the classification of a particular region of interest. The achieved results demonstrate that, in this case, radar and optical remote sensing provide complementary information, benefiting surface-cover classification and generally leading to increased mapping accuracy. In addition, this paper helps demonstrate the emerging role of GEE as an effective cloud-based tool for handling large amounts of satellite data.
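To make the described workflow concrete, the following minimal Google Earth Engine Python API sketch stacks Sentinel-1 and Sentinel-2 composites and trains a supervised classifier. The region, date range, band selection, random forest classifier, and training asset path are illustrative assumptions, not the study's actual configuration.

# Minimal GEE Python API sketch: stack S-1 and S-2 composites and run a
# supervised classifier. All choices below are illustrative assumptions.
import ee

ee.Initialize()

roi = ee.Geometry.Rectangle([9.0, 45.0, 9.5, 45.5])   # hypothetical region of interest

s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(roi)
      .filterDate('2021-06-01', '2021-09-01')
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
      .median()
      .select(['B2', 'B3', 'B4', 'B8']))               # optical bands

s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(roi)
      .filterDate('2021-06-01', '2021-09-01')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .median()
      .select(['VV', 'VH']))                           # radar backscatter

stack = s2.addBands(s1)                                # joint optical + SAR feature stack

# 'training_points' is a hypothetical FeatureCollection with a 'landcover' label.
training_points = ee.FeatureCollection('users/example/training_points')
samples = stack.sampleRegions(collection=training_points,
                              properties=['landcover'], scale=10)

classifier = ee.Classifier.smileRandomForest(100).train(
    features=samples, classProperty='landcover',
    inputProperties=stack.bandNames())

classified = stack.classify(classifier)                # land-cover map over the ROI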
With the emergence of Transformer architectures and their powerful understanding of textual data, a new horizon has opened up for predicting molecular properties from text descriptions. While SMILES strings are the most common form of representation, they lack robustness, rich information, and canonicity, which limits their effectiveness as generalizable representations. Here, we present GPT-MolBERTa, a self-supervised large language model (LLM) which uses detailed textual descriptions of molecules to predict their properties. Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn molecular representations. To predict properties in the downstream tasks, both BERT and RoBERTa models were used in the finetuning stage. Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks, approaching state-of-the-art performance in regression tasks. Additionally, further analysis of the attention mechanisms shows that GPT-MolBERTa is able to pick up important information from the input textual data, demonstrating the interpretability of the model.
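As a rough sketch of the finetuning stage described above (not GPT-MolBERTa's exact pipeline), the following fine-tunes a RoBERTa encoder on textual molecule descriptions for a single regression target using the Hugging Face transformers library; the dataset files, column names, and hyperparameters are assumptions.

# Hedged sketch of fine-tuning a RoBERTa encoder on molecule descriptions for
# one regression target. File names, columns and hyperparameters are assumed.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-base', num_labels=1, problem_type='regression')

# Hypothetical CSV files with a 'description' column (ChatGPT-generated text)
# and a numeric 'label' column (the molecular property to regress).
data = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})

def tokenize(batch):
    return tokenizer(batch['description'], truncation=True,
                     padding='max_length', max_length=256)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir='gpt-molberta-ft', num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=data['train'], eval_dataset=data['test'])
trainer.train()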
Because public-key cryptosystems are vulnerable to quantum computers, the need for quantum-resistant alternatives has emerged. The McEliece cryptosystem and its security equivalent, the Niederreiter cryptosystem, both based on Goppa codes, are among the proposed solutions, but they are not practical due to their long key length. Several prior attempts to decrease the length of the public key in code-based cryptosystems involved substituting the Goppa code family with other code families; however, these efforts ultimately proved to be insecure. In 2016, the National Institute of Standards and Technology (NIST) called for proposals from around the world to standardize post-quantum cryptography (PQC) schemes to address this issue. Among the proposals received, the Classic McEliece cryptosystem, as well as Hamming Quasi-Cyclic (HQC) and Bit Flipping Key Encapsulation (BIKE), were chosen as the code-based encryption cryptosystems that successfully progressed to the final stage. This article proposes a method for developing a code-based public key cryptography scheme that is both simple and implementable. The proposed scheme has a much shorter public key length than the NIST finalist cryptosystems: for the primary parameters of the McEliece cryptosystem (n=1024, k=524, t=50), the key length ranges from 18 to 500 bits. The security of this system is at least as strong as that of the Niederreiter cryptosystem. The proposed structure is based on the Niederreiter cryptosystem and exhibits a set of highly advantageous properties that make it a suitable candidate for implementation in existing systems.
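For reference, a standard textbook description of the Niederreiter cryptosystem on which the proposed structure is based (this is the classical scheme, not the paper's new construction):

\[
\begin{aligned}
&\textbf{Key generation: } \text{pick a binary Goppa code with } (n-k)\times n \text{ parity-check matrix } H \text{ correcting } t \text{ errors,}\\
&\quad \text{a random invertible } S \in \mathbb{F}_2^{(n-k)\times(n-k)} \text{ and a permutation matrix } P \in \mathbb{F}_2^{n\times n}; \quad \text{public key } H_{\mathrm{pub}} = S H P.\\
&\textbf{Encryption: } \text{encode the message as } e \in \mathbb{F}_2^{n} \text{ with } \mathrm{wt}(e) \le t; \quad c = H_{\mathrm{pub}}\, e^{\top}.\\
&\textbf{Decryption: } \text{compute } S^{-1} c = H \,(P e^{\top}), \text{ syndrome-decode the Goppa code to recover } P e^{\top}, \text{ then } e^{\top} = P^{-1}(P e^{\top}).
\end{aligned}
\]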
In situations where both extreme and non-extreme data are of interest, modelling the whole data set accurately is important. In a univariate framework, modelling the bulk and tail of a distribution has been extensively studied. However, when more than one variable is of concern, models that aim specifically at capturing both regions correctly are scarce in the literature. A dependence model that blends two copulas with different characteristics over the whole range of the data support is proposed. One copula is tailored to the bulk and the other to the tail, with a dynamic weighting function employed to transition smoothly between them. Tail dependence properties are investigated numerically, and simulation is used to confirm that the blended model is sufficiently flexible to capture a wide variety of structures. The model is applied to study the dependence between temperature and ozone concentration at two sites in the UK and compared with a single copula fit. The proposed model provides a better, more flexible fit to the data and is also capable of capturing complex dependence structures.
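A generic form of such a blend is sketched below for illustration; the exact weighting function and copula families are not specified here, and suitable conditions on the weight are required for the blend to remain a valid copula.

\[
C_{\mathrm{blend}}(u,v) \;=\; \pi(u,v)\, C_{\mathrm{bulk}}(u,v) \;+\; \bigl(1-\pi(u,v)\bigr)\, C_{\mathrm{tail}}(u,v),
\qquad \pi:[0,1]^2 \to [0,1],
\]
with \(\pi\) close to one in the body of the data and close to zero in the joint tail, so that the bulk copula governs central observations and the tail copula governs extremes.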
Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation. Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches. Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets. Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods. Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
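For illustration, a minimal sketch of sequential CART-based synthesis in the spirit of the framework (not the authors' implementation): each numeric variable is modelled by a tree conditional on previously synthesised variables, and synthetic values are drawn from the observed values in the predicted leaf. The distance-based filtering and the other protection steps are not reproduced here.

# Illustrative sequential CART synthesis for numeric columns; an assumption-laden
# sketch, not the framework's actual code.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def synthesize(df: pd.DataFrame, min_leaf: int = 20) -> pd.DataFrame:
    cols = list(df.columns)
    synth = pd.DataFrame(index=df.index)
    # Seed the first column by resampling its marginal distribution.
    synth[cols[0]] = rng.choice(df[cols[0]].to_numpy(), size=len(df), replace=True)
    for j in range(1, len(cols)):
        target, predictors = cols[j], cols[:j]
        tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(
            df[predictors], df[target])
        # Group the observed target values by the leaf they fall into.
        leaves_real = tree.apply(df[predictors])
        donors = {leaf: grp.to_numpy()
                  for leaf, grp in pd.Series(df[target].to_numpy()).groupby(leaves_real)}
        # For each synthetic record, draw a donor value from its predicted leaf.
        leaves_synth = tree.apply(synth[predictors])
        synth[target] = [rng.choice(donors[leaf]) for leaf in leaves_synth]
    return synth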
One of the main challenges in interpreting black-box models is uniquely decomposing square-integrable functions of non-mutually independent random inputs into a sum of functions of every possible subset of variables. However, dealing with dependencies among inputs can be complicated. We propose a novel framework to study this problem, linking three domains of mathematics: probability theory, functional analysis, and combinatorics. We show that, under two reasonable assumptions on the inputs (non-perfect functional dependence and non-degenerate stochastic dependence), it is always possible to uniquely decompose such a function. This "canonical decomposition" is relatively intuitive and unveils the linear nature of non-linear functions of non-linearly dependent inputs. In this framework, we effectively generalize the well-known Hoeffding decomposition, which can be seen as a particular case. Oblique projections of the black-box model allow for novel interpretability indices for evaluation and variance decomposition. Aside from their intuitive nature, the properties of these novel indices are studied and discussed. This result offers a path towards more precise uncertainty quantification, which can benefit sensitivity analyses and interpretability studies whenever the inputs are dependent. The decomposition is illustrated analytically, and the challenges of adopting these results in practice are discussed.
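For reference, the classical Hoeffding (functional ANOVA) decomposition that is generalized here, stated for mutually independent inputs:

\[
f(X_1,\dots,X_d) \;=\; \sum_{A \subseteq \{1,\dots,d\}} f_A(X_A),
\qquad
f_A(x_A) \;=\; \sum_{B \subseteq A} (-1)^{|A \setminus B|}\, \mathbb{E}\bigl[f(X) \mid X_B = x_B\bigr],
\]
where, under mutual independence, the terms are mutually orthogonal, so that \(\operatorname{Var} f(X) = \sum_{A \neq \emptyset} \operatorname{Var} f_A(X_A)\), which yields the classical Sobol' indices; the paper's setting drops the independence assumption.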
Radiomics is an emerging area of medical imaging data analysis, particularly for cancer. It involves the conversion of digital medical images into mineable ultra-high dimensional data. Machine learning algorithms are widely used in radiomics data analysis to develop powerful decision support models to improve precision in diagnosis, assessment of prognosis, and prediction of therapy response. However, machine learning algorithms for causal inference have not previously been employed in radiomics analysis. In this paper, we assess the value of machine learning algorithms for causal inference in radiomics. We select three recent competitive variable selection algorithms for causal inference: outcome-adaptive lasso (OAL), generalized outcome-adaptive lasso (GOAL), and causal ball screening (CBS). We use a sure independence screening (SIS) procedure to extend GOAL and OAL to ultra-high dimensional data, yielding SIS + GOAL and SIS + OAL. We compare SIS + GOAL, SIS + OAL, and CBS using a simulation study and two radiomics cancer datasets (osteosarcoma and gliosarcoma). Both radiomics studies and the simulation study identify SIS + GOAL as the best-performing variable selection algorithm.
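A simplified sketch of the SIS + OAL idea (a rough approximation, not the authors' code): covariates are first screened by marginal association with the outcome, and an outcome-adaptive, lasso-type propensity model is then fitted with penalty weights derived from an outcome regression, so that covariates unrelated to the outcome are penalized more heavily. Function names, the screening size, and the penalty approximation are illustrative.

# Hedged sketch of SIS screening followed by an outcome-adaptive L1 propensity fit.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def sis_screen(X, y, d=None):
    """Keep the d covariates most marginally correlated with the outcome."""
    n, p = X.shape
    d = d or int(n / np.log(n))
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    return np.argsort(corr)[::-1][:d]

def outcome_adaptive_weights(X, y, gamma=1.0):
    """Penalty weights inversely related to outcome-model coefficients."""
    beta = LinearRegression().fit(X, y).coef_
    return 1.0 / (np.abs(beta) ** gamma + 1e-8)

def sis_oal(X, y, treat, d=None):
    keep = sis_screen(X, y, d)
    Xs = X[:, keep]
    w = outcome_adaptive_weights(Xs, y)
    # Approximate the adaptive penalty by rescaling columns before an L1 fit.
    ps_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    ps_model.fit(Xs / w, treat)
    # Return indices of covariates retained for confounding adjustment.
    return keep[np.abs(ps_model.coef_).ravel() > 1e-8]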
In high-temperature plasma physics, a strong magnetic field is usually used to confine charged particles. Therefore, when studying the classical mathematical models of such physical problems, the effect of external magnetic fields needs to be considered. One of the important model equations in plasma physics is the Vlasov-Poisson equation with an external magnetic field. In this paper, we study the error analysis of Hamiltonian particle methods for this kind of system. The convergence of the particle method for the Vlasov equation and that of the Hamiltonian method for the particle equations are established independently. By combining them, we conclude that the numerical solutions converge to the exact particle trajectories.
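In a standard normalized form (physical constants such as charge, mass, and permittivity set to one), the Vlasov-Poisson system with an external magnetic field reads:

\[
\partial_t f + v \cdot \nabla_x f + \bigl(E(t,x) + v \times B_{\mathrm{ext}}(x)\bigr) \cdot \nabla_v f = 0,
\qquad
E = -\nabla_x \phi, \qquad -\Delta_x \phi = \int f(t,x,v)\,\mathrm{d}v,
\]
where \(f(t,x,v)\) is the particle distribution function and \(B_{\mathrm{ext}}\) is the prescribed external magnetic field.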
Learning and predicting the dynamics of physical systems requires a profound understanding of the underlying physical laws. Recent works on learning physical laws involve generalizing equation discovery frameworks to the discovery of the Hamiltonian and Lagrangian of physical systems. While existing methods parameterize the Lagrangian using neural networks, we propose an alternative framework for learning interpretable Lagrangian descriptions of physical systems from limited data using a sparse Bayesian approach. Unlike existing neural network-based approaches, the proposed approach (a) yields an interpretable description of the Lagrangian, (b) exploits Bayesian learning to quantify the epistemic uncertainty due to limited data, (c) automates the distillation of the Hamiltonian from the learned Lagrangian using the Legendre transformation, and (d) provides ordinary differential equation (ODE) and partial differential equation (PDE) based descriptions of the observed systems. Six different examples involving both discrete and continuous systems illustrate the efficacy of the proposed approach.
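The Legendre transformation used in step (c) is the standard one: with generalized coordinates \(q\) and velocities \(\dot q\),

\[
p_i = \frac{\partial \mathcal{L}(q,\dot q)}{\partial \dot q_i},
\qquad
\mathcal{H}(q,p) = \sum_i p_i\, \dot q_i(q,p) - \mathcal{L}\bigl(q, \dot q(q,p)\bigr),
\]
where the velocities \(\dot q\) are expressed in terms of \((q,p)\) by inverting the first relation.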
Controlling spurious oscillations is crucial for designing reliable numerical schemes for hyperbolic conservation laws. This paper proposes a novel, robust, and efficient oscillation-eliminating discontinuous Galerkin (OEDG) method on general meshes, motivated by the damping technique in [Lu, Liu, and Shu, SIAM J. Numer. Anal., 59:1299-1324, 2021]. The OEDG method incorporates an OE procedure after each Runge-Kutta stage, devised by alternately evolving the conventional semidiscrete DG scheme and a damping equation. A novel damping operator is carefully designed to possess scale-invariant and evolution-invariant properties. We rigorously prove optimal error estimates of the fully discrete OEDG method for linear scalar conservation laws. These might be the first generic fully discrete error estimates for nonlinear DG schemes with an automatic oscillation control mechanism. The OEDG method exhibits many notable advantages. It effectively eliminates spurious oscillations for challenging problems across various scales and wave speeds, without problem-specific parameters. It obviates the need for characteristic decomposition in hyperbolic systems. It retains key properties of the conventional DG method, such as conservation, optimal convergence rates, and superconvergence. Moreover, it remains stable under the normal CFL condition. The OE procedure is non-intrusive, facilitating integration into existing DG codes as an independent module. Its implementation is easy and efficient, involving only simple multiplications of modal coefficients by scalars. The OEDG approach provides new insights into the damping mechanism for oscillation control. It reveals the role of the damping operator as a modal filter and establishes close relations between the damping and spectral viscosity techniques. Extensive numerical results confirm the theoretical analysis and validate the effectiveness and advantages of the OEDG method.
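A hedged sketch of how such an OE step can act as a modal filter (illustrative only; the paper's actual damping operator and its scale- and evolution-invariant rates are not reproduced here): solving the damping equation for the modal coefficients exactly amounts to multiplying each higher-order mode by a scalar after a Runge-Kutta stage.

# Illustrative OE step as a modal filter; the damping rates 'sigma' are assumed
# to be given and are not the paper's actual scale-/evolution-invariant choice.
import numpy as np

def oe_step(u_modal, sigma, dt):
    """
    u_modal : (num_cells, k+1) array of Legendre modal coefficients per cell.
    sigma   : (num_cells, k+1) array of nonnegative damping rates; mode 0 has
              sigma = 0, so cell averages (and hence conservation) are untouched.
    dt      : time-step size of the current Runge-Kutta stage.
    """
    # Exact solution of d(u_hat)/dt = -sigma * u_hat over one stage.
    return u_modal * np.exp(-sigma * dt)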
Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits the invariance of causal models. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control), and additive noise models are not suitable for applications in which the response is not measured on a continuous scale but reflects categories or counts. Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, in general, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose TRAM-GCM, based on the expected conditional covariance between environments and score residuals, with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients.
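A schematic of a GCM-type invariance test of the kind described (not the paper's exact definition): for a candidate covariate set S,

\[
H_0(S):\;\; \mathbb{E}\bigl[\operatorname{Cov}(E,\, R \mid X_S)\bigr] = 0,
\qquad
T_n \;=\; \frac{\sqrt{n}\;\frac{1}{n}\sum_{i=1}^{n} \bigl(E_i - \hat m(X_{S,i})\bigr) R_i}{\widehat{\operatorname{sd}}\bigl(\bigl(E_i - \hat m(X_{S,i})\bigr) R_i\bigr)},
\]
where \(E\) is the environment indicator, \(R_i\) are the score residuals of the TRAM fitted using covariates \(X_S\), and \(\hat m\) is a regression of \(E\) on \(X_S\); \(H_0(S)\) is rejected for large \(|T_n|\), and, as in standard ICP, the final output is the intersection of all non-rejected sets.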