Current high-throughput technologies provide a large number of variables to describe a phenomenon, yet only a few of them are generally sufficient to answer the question at hand. Identifying them in a high-dimensional Gaussian linear regression model is one of the most widely used statistical approaches. In this article, we describe step by step the variable selection procedures built upon regularization paths. Regularization paths are obtained by combining a regularization function and an algorithm. They are then combined either with a model selection procedure using penalty functions or with a sampling strategy to obtain the final selected variables. We perform a comparison study by considering three simulation settings with various dependency structures on variables. In all the settings, we evaluate (i) the ability to discriminate between the active variables and the non-active variables along the regularization path (pROC-AUC), (ii) the prediction performance of the selected variable subset (MSE) and (iii) the relevance of the selected variables (recall, specificity, FDR). From the results, we provide recommendations on the strategies to be favored depending on the characteristics of the problem at hand. We find that the Elastic-net regularization function provides better results than the $\ell_1$ one most of the time, and that the lars algorithm is to be preferred over the GD one. ESCV provides the best prediction performance. Bolasso and the knockoffs method are judicious choices to limit the selection of non-active variables while ensuring the selection of enough active variables. Conversely, the data-driven penalties considered in this review are not to be favored. As for Tigress and LinSelect, they are conservative methods.
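The relevance criteria named in (iii) above can be illustrated with a minimal sketch. It assumes that variables are identified by integer indices and that the true active set is known, as in a simulation study; the function name `selection_metrics` and the convention of returning 0.0 when a denominator is empty are assumptions made for this illustration, not taken from the article.

```python
def selection_metrics(selected, active, n_variables):
    """Return (recall, specificity, fdr) for a selected variable subset.

    selected    -- indices of the variables chosen by a procedure
    active      -- indices of the truly active variables (known in simulations)
    n_variables -- total number of candidate variables
    """
    selected, active = set(selected), set(active)
    tp = len(selected & active)   # active variables that were selected
    fp = len(selected - active)   # non-active variables that were selected
    fn = len(active - selected)   # active variables that were missed
    tn = n_variables - tp - fp - fn
    # Assumed convention: a ratio with an empty denominator is reported as 0.0.
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return recall, specificity, fdr


# Hypothetical example: 10 variables, 3 active, 3 selected (2 of them correct).
recall, specificity, fdr = selection_metrics({0, 1, 5}, {0, 1, 2}, n_variables=10)
```

A conservative method in this sense selects few variables, which tends to give a low FDR and a high specificity at the cost of a lower recall.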
Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't need to quickly update printers to deal with newly discovered currencies, whereas we regularly learn of new viruses and other biological threats. Second, anti-counterfeiting specifications on a local printer can't be extracted and misused by malicious actors, unlike information on biological threats. Finally, any screening must keep the inspected DNA sequences private, as they may constitute valuable trade secrets. Here we describe SecureDNA, a free, privacy-preserving, and fully automated system capable of verifiably screening all DNA synthesis orders of 30+ base pairs against an up-to-date database of hazards, and its operational performance and specificity when applied to 67 million base pairs of DNA synthesized by providers in the United States, Europe, and China.
We present a specific-purpose globalized and preconditioned Newton-CG solver to minimize a metric-aware curved high-order mesh distortion. The solver is specially devised to optimize curved high-order meshes for high polynomial degrees with a target metric featuring non-uniform sizing, high stretching ratios, and curved alignment -- exactly the features that stiffen the optimization problem. To this end, we consider two ingredients: a specific-purpose globalization and a specific-purpose Jacobi-$\text{iLDL}^{\text{T}}(0)$ preconditioning with varying accuracy and curvature tolerances (dynamic forcing terms) for the CG method. These improvements are critical in stiff problems because, without them, the large number of non-linear and linear iterations makes curved optimization impractical. Finally, to analyze the performance of our method, we compare the specific-purpose solver with standard optimization methods. For this, we measure the matrix-vector products, which indicate the solver's computational cost, and the line-search iterations, which indicate the total number of objective function evaluations. When we combine the globalization and the linear solver ingredients, we conclude that the specific-purpose Newton-CG solver reduces the total number of matrix-vector products by one order of magnitude. Moreover, the number of non-linear and line-search iterations is generally smaller but of similar magnitude.
We propose a material design method via gradient-based optimization on compositions, overcoming the limitations of traditional methods: exhaustive database searches and conditional generation models. It optimizes inputs via backpropagation, aligning the model's output closely with the target property and facilitating the discovery of unlisted materials and precise property determination. Our method is also capable of adaptive optimization under new conditions without retraining. Applying it to the exploration of high-Tc superconductors, we identified potential compositions beyond existing databases and discovered new hydrogen superconductors via conditional optimization. This method is versatile and significantly advances material design by enabling efficient, extensive searches and adaptability to new constraints.
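The idea of optimizing the inputs of a fixed model by gradient descent can be sketched on a toy problem. Everything below is an assumption made for illustration: the trained property predictor is replaced by a hypothetical linear surrogate with fixed `weights`, the composition is parameterized through a softmax so that the fractions stay positive and sum to one, and the gradient is written out by hand with the chain rule instead of being obtained from an automatic differentiation library.

```python
import math


def softmax(logits):
    """Map unconstrained logits to fractions that are positive and sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_composition(weights, target, steps=2000, lr=0.05):
    """Gradient descent on the inputs of a frozen toy surrogate model.

    The surrogate predicts property = sum_k weights[k] * fraction[k].
    Only the logits (the inputs) are updated; the weights stay fixed.
    """
    logits = [0.0] * len(weights)
    for _ in range(steps):
        x = softmax(logits)
        pred = sum(w * xi for w, xi in zip(weights, x))
        # Chain rule through the softmax for loss = (pred - target)^2:
        # d loss / d logit_k = 2 (pred - target) * x_k * (w_k - pred)
        grads = [2.0 * (pred - target) * xi * (w - pred)
                 for w, xi in zip(weights, x)]
        logits = [z - lr * g for z, g in zip(logits, grads)]
    x = softmax(logits)
    return x, sum(w * xi for w, xi in zip(weights, x))


# Hypothetical three-element system; the target lies within the reachable range.
fractions, predicted = optimize_composition([1.0, 5.0, 10.0], target=4.0)
```

Because only the inputs are updated, a new target or an added penalty term changes the loss but leaves the model itself untouched, which is the sense in which no retraining is needed.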
Lattices are architected metamaterials whose properties strongly depend on their geometrical design. The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling. In this work, we generate a large dataset of structure-property relationships for strut-based lattices. The dataset is made available to the community and can fuel the development of methods anchored in physical principles for the fitting of fourth-order tensors. In addition, we present a higher-order GNN model trained on this dataset. The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy. We compare the model to non-equivariant models based on a number of error metrics and demonstrate its benefits in terms of predictive performance and reduced training requirements. Finally, we demonstrate an example application of the model to an architected material design task. The methods we developed are applicable to fourth-order tensors beyond elasticity, such as the piezo-optical tensor.
This work addresses the problem of high-dimensional classification by exploring the generalized Bayesian logistic regression method under a sparsity-inducing prior distribution. The method uses a fractional power of the likelihood, resulting in the fractional posterior. Our study yields concentration results for the fractional posterior, not only on the joint distribution of the predictor and response variable but also for the regression coefficients. Significantly, we derive novel findings concerning misclassification excess risk bounds using sparse generalized Bayesian logistic regression. These results parallel recent findings for penalized methods in the frequentist literature. Furthermore, we extend our results to the scenario of model misspecification, which is of critical importance.
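For reference, the fractional posterior mentioned above is commonly written as follows. The notation ($\beta$ for the regression coefficients, $\pi$ for the sparsity-inducing prior, $\alpha$ for the fractional power, responses $y_i \in \{0,1\}$) is an assumption made here for illustration and may differ from the paper's own.

```latex
\[
  \pi_{n,\alpha}(\beta \mid \text{data})
  \;\propto\;
  \Bigl[\,\prod_{i=1}^{n} p_{\beta}(y_i \mid x_i)\Bigr]^{\alpha}\,\pi(\beta),
  \qquad \alpha \in (0,1),
\]
\[
  p_{\beta}(y \mid x)
  = \frac{\exp\bigl(y\,x^{\top}\beta\bigr)}{1+\exp\bigl(x^{\top}\beta\bigr)},
  \qquad y \in \{0,1\}.
\]
```

Setting $\alpha = 1$ would recover the usual posterior; taking $\alpha < 1$ tempers the likelihood.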
To date, most methods for simulating conditioned diffusions are limited to the Euclidean setting. The conditioned process can be constructed using a change of measure known as Doob's $h$-transform. The specific type of conditioning depends on a function $h$ which is typically unknown in closed form. To resolve this, we extend the notion of guided processes to a manifold $M$, where one replaces $h$ by a function based on the heat kernel on $M$. We consider the case of a Brownian motion with drift, constructed using the frame bundle of $M$, conditioned to hit a point $x_T$ at time $T$. We prove equivalence of the laws of the conditioned process and the guided process with a tractable Radon-Nikodym derivative. Subsequently, we show how one can obtain guided processes on any manifold $N$ that is diffeomorphic to $M$ without assuming knowledge of the heat kernel on $N$. We illustrate our results with numerical simulations and an example of parameter estimation where a diffusion process on the torus is observed discretely in time.
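As a point of comparison, the Euclidean form of Doob's $h$-transform for conditioning a diffusion to hit $x_T$ at time $T$ can be written as below. The symbols $b$, $\sigma$, $a$, $p$ and $\tilde h$ are notation assumed for this sketch; the manifold construction through the frame bundle is not reproduced here.

```latex
\[
  dX_t = b(t,X_t)\,dt + \sigma(t,X_t)\,dW_t ,
\]
\[
  dX^{\star}_t
  = \bigl[\,b + a\,\nabla_x \log h\,\bigr](t,X^{\star}_t)\,dt
    + \sigma(t,X^{\star}_t)\,dW_t ,
  \qquad a = \sigma\sigma^{\top},
  \qquad h(t,x) = p(t,x;\,T,x_T),
\]
```

Here $p$ denotes the transition density of $X$, which is rarely available in closed form. A guided process is obtained by replacing $h$ in the drift with a tractable substitute $\tilde h$.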
Most currently used tensor regression models for high-dimensional data are based on Tucker decomposition, which has good properties but loses its efficiency in compressing tensors very quickly as the order of tensors increases, say beyond four or five. However, even for the simplest tensor autoregression for time series data, the coefficient tensor already has order six. This paper revises a newly proposed tensor train (TT) decomposition and then applies it to tensor regression such that a nice statistical interpretation can be obtained. The new tensor regression matches data with hierarchical structures well, and it can even lead to a better interpretation for data with factorial structures, which are supposed to be better fitted by models with Tucker decomposition. More importantly, the new tensor regression can be easily applied to the case of higher-order tensors since TT decomposition can compress the coefficient tensors much more efficiently. The methodology is also extended to tensor autoregression for time series data, and nonasymptotic properties are derived for the ordinary least squares estimators of both tensor regression and autoregression. A new algorithm is introduced to search for estimators, and its theoretical justification is also discussed. Theoretical and computational properties of the proposed methodology are verified by simulation studies, and the advantages over existing methods are illustrated by two real examples.
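The compression argument above can be made concrete by counting parameters. The sketch below assumes a coefficient tensor with the same dimension `dim` in every mode and a single common rank `rank`, which is a simplification chosen for illustration; the function names are hypothetical and the counts ignore any identifiability constraints.

```python
def full_params(order, dim):
    """Number of entries of an uncompressed tensor of the given order."""
    return dim ** order


def tucker_params(order, dim, rank):
    """Tucker format: one rank^order core plus one (dim x rank) factor per mode."""
    return rank ** order + order * dim * rank


def tt_params(order, dim, rank):
    """Tensor train format: one (r_{k-1} x dim x r_k) core per mode, r_0 = r_d = 1."""
    ranks = [1] + [rank] * (order - 1) + [1]
    return sum(ranks[k] * dim * ranks[k + 1] for k in range(order))


# Order-six tensor, as for the simplest tensor autoregression mentioned above.
counts = (full_params(6, 10), tucker_params(6, 10, 3), tt_params(6, 10, 3))
```

The Tucker core grows exponentially with the order, whereas the tensor train count grows linearly in the order for a fixed rank.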
The direct parametrisation method for invariant manifolds is a model-order reduction technique that can be applied to nonlinear systems described by PDEs and discretised, e.g., with a finite element procedure in order to derive efficient reduced-order models (ROMs). In nonlinear vibrations, it has already been applied to autonomous and non-autonomous problems to propose ROMs that can compute backbone and frequency-response curves of structures with geometric nonlinearity. While previous developments used a first-order expansion to cope with the non-autonomous term, this assumption is here relaxed by proposing a different treatment. The key idea is to enlarge the dimension of the parametrising coordinates with additional entries related to the forcing. A new algorithm is derived from this starting assumption and, as a key consequence, the resonance relationships appearing through the homological equations involve multiple occurrences of the forcing frequency, showing that, with this new development, ROMs for systems exhibiting a superharmonic resonance can be derived. The method is implemented and validated on academic test cases involving beams and arches. It is numerically demonstrated that the method generates efficient ROMs for problems involving 3:1 and 2:1 superharmonic resonances, as well as converged results for systems where the first-order truncation on the non-autonomous term showed a clear limitation.
Mendelian randomization uses genetic variants as instrumental variables to make causal inferences about the effects of modifiable risk factors on diseases from observational data. One of the major challenges in Mendelian randomization is that many genetic variants are only modestly or even weakly associated with the risk factor of interest, a setting known as many weak instruments. Many existing methods, such as the popular inverse-variance weighted (IVW) method, can be biased when the instrument strength is weak. To address this issue, the debiased IVW (dIVW) estimator, which is shown to be robust to many weak instruments, was recently proposed. However, this estimator still has non-ignorable bias when the effective sample size is small. In this paper, we propose a modified debiased IVW (mdIVW) estimator obtained by multiplying the original dIVW estimator by a modification factor. After this simple correction, we show that the bias of the mdIVW estimator converges to zero at a faster rate than that of the dIVW estimator under some regularity conditions. Moreover, the mdIVW estimator has smaller variance than the dIVW estimator. We further extend the proposed method to account for the presence of instrumental variable selection and balanced horizontal pleiotropy. We demonstrate the improvement of the mdIVW estimator over the dIVW estimator through extensive simulation studies and real data analysis.
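The two baseline estimators discussed above can be sketched from two-sample summary statistics. The formulas used are the standard IVW ratio and the dIVW estimator as it is usually stated in the literature, in which the squared exposure standard error is subtracted in the denominator; the variable names and the numerical inputs are hypothetical. The modification factor of the mdIVW estimator is not given in the abstract and is therefore not implemented.

```python
def ivw(exposure_effects, outcome_effects, se_outcome):
    """Inverse-variance weighted estimate of the causal effect."""
    num = sum(g * G / s ** 2
              for g, G, s in zip(exposure_effects, outcome_effects, se_outcome))
    den = sum(g ** 2 / s ** 2
              for g, s in zip(exposure_effects, se_outcome))
    return num / den


def divw(exposure_effects, outcome_effects, se_exposure, se_outcome):
    """Debiased IVW: removes the measurement-error term from the denominator."""
    num = sum(g * G / s ** 2
              for g, G, s in zip(exposure_effects, outcome_effects, se_outcome))
    den = sum((g ** 2 - sx ** 2) / sy ** 2
              for g, sx, sy in zip(exposure_effects, se_exposure, se_outcome))
    return num / den


# Hypothetical summary statistics for two genetic variants.
gamma_hat = [0.1, 0.2]      # variant-exposure associations
Gamma_hat = [0.05, 0.1]     # variant-outcome associations
se_x = [0.01, 0.01]         # standard errors of gamma_hat
se_y = [0.02, 0.02]         # standard errors of Gamma_hat

beta_ivw = ivw(gamma_hat, Gamma_hat, se_y)
beta_divw = divw(gamma_hat, Gamma_hat, se_x, se_y)
```

With these inputs the dIVW denominator is smaller than the IVW one, so the dIVW estimate is slightly larger in magnitude, which offsets the attenuation caused by weak instruments.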
To succeed in their objectives, groups of individuals must be able to make quick and accurate collective decisions on the best option among a set of alternatives with different qualities. Group-living animals aim to do that all the time. Plants and fungi are thought to do so too. Swarms of autonomous robots can also be programmed to make best-of-n decisions for solving tasks collaboratively. Humans, too, critically depend on such decisions and often should be better at making them. Thanks to their mathematical tractability, simple models like the voter model and the local majority rule model have proven useful to describe the dynamics of such collective decision-making processes. To reach a consensus, individuals change their opinion by interacting with neighbors in their social network. At least among animals and robots, options with a better quality are exchanged more often and therefore spread faster than lower-quality options, leading to the collective selection of the best option. In this work, we study the impact of individuals making errors in pooling others' opinions, caused, for example, by the need to reduce the cognitive load. Our analysis is grounded on the introduction of a model that generalizes the two existing models (local majority rule and voter model), showing a speed-accuracy trade-off regulated by the cognitive effort of individuals. We also investigate the impact of the interaction network topology on the collective dynamics. To do so, we extend our model and, by using the heterogeneous mean-field approach, we show the presence of another speed-accuracy trade-off regulated by network connectivity. An interesting result is that reduced network connectivity corresponds to an increase in collective decision accuracy.
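The two baseline update rules named above can be sketched as follows. This is only an illustration of the voter model and the local majority rule for a single individual; it does not implement the generalized model of the paper, it ignores option quality, and the tie-breaking choice (keep the current opinion) is an assumption.

```python
import random
from collections import Counter


def voter_update(own, neighbor_opinions, rng):
    """Voter model: copy the opinion of one neighbor chosen uniformly at random."""
    if not neighbor_opinions:
        return own
    return rng.choice(neighbor_opinions)


def majority_update(own, neighbor_opinions):
    """Local majority rule: adopt the most common opinion among all neighbors."""
    if not neighbor_opinions:
        return own
    ranked = Counter(neighbor_opinions).most_common()
    # Assumed tie-breaking: keep the current opinion when the top counts are equal.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return own
    return ranked[0][0]


rng = random.Random(0)
after_majority = majority_update("A", ["B", "B", "A"])
after_tie = majority_update("A", ["A", "B"])
after_voter = voter_update("A", ["B", "B"], rng)
```

The voter rule needs only one observation per update, whereas the majority rule pools every neighbor's opinion, which is one way to read the cognitive effort that regulates the trade-off described above.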