Non-overlapping codes are block codes that have arisen in diverse contexts of computer science and biology. Applications typically require finding non-overlapping codes with large cardinalities, but the maximum size of non-overlapping codes has been determined only for cases where the codeword length divides the size of the alphabet, and for codes with codewords of length two or three. For all other alphabet sizes and codeword lengths no computationally feasible way to identify non-overlapping codes that attain the maximum size has been found to date. Herein we characterize maximal non-overlapping codes. We formulate the maximum non-overlapping code problem as an integer optimization problem and determine necessary conditions for optimality of a non-overlapping code. Moreover, we solve several instances of the optimization problem to show that the hitherto known constructions do not generate the optimal codes for many alphabet sizes and codeword lengths. We also evaluate the number of distinct maximum non-overlapping codes.
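As a concrete illustration of the objects discussed above, the following minimal Python sketch checks the non-overlapping property and finds a maximum code by exhaustive search for very small parameters. It assumes the standard definition (no proper prefix of any codeword equals a suffix of any codeword, the codeword itself included), which the abstract does not spell out. The function names are hypothetical, and the brute-force search is only a stand-in for, not an implementation of, the integer optimization formulation mentioned above.

```python
from itertools import combinations, product


def is_non_overlapping(code):
    """Return True if no proper prefix of a codeword equals a suffix of any codeword."""
    words = list(code)
    if not words:
        return True
    n = len(words[0])
    if any(len(w) != n for w in words):
        raise ValueError("a block code needs codewords of equal length")
    for k in range(1, n):
        prefixes = {w[:k] for w in words}
        suffixes = {w[n - k:] for w in words}
        # A shared string of length k means two codewords (possibly the
        # same one) can overlap in k symbols.
        if prefixes & suffixes:
            return False
    return True


def max_non_overlapping_size(q, n):
    """Exhaustive search over a q-ary alphabet; feasible only for tiny q and n."""
    if not 1 <= q <= 10:
        raise ValueError("this sketch supports alphabet sizes 1..10 only")
    alphabet = "0123456789"[:q]
    words = ["".join(t) for t in product(alphabet, repeat=n)]
    # Only words that do not overlap with themselves can belong to a code.
    candidates = [w for w in words if is_non_overlapping([w])]
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            if is_non_overlapping(subset):
                return size
    return 0
```

For instance, `max_non_overlapping_size(3, 2)` evaluates to 2, attained by the code {01, 02}. The search space grows doubly exponentially in the codeword length, which is why such a direct enumeration is not a practical route to maximum codes.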
Optimal transport and Wasserstein distances are flourishing in many scientific fields as a means for comparing and connecting random structures. Here we pioneer the use of an optimal transport distance between L\'{e}vy measures to solve a statistical problem. Dependent Bayesian nonparametric models provide flexible inference on distinct, yet related, groups of observations. Each component of a vector of random measures models a group of exchangeable observations, while their dependence regulates the borrowing of information across groups. We derive the first statistical index of dependence in $[0,1]$ for (completely) random measures that accounts for their whole infinite-dimensional distribution, which is assumed to be equal across different groups. This is accomplished by using the geometric properties of the Wasserstein distance to solve a max-min problem at the level of the underlying L\'{e}vy measures. The Wasserstein index of dependence sheds light on the models' deep structure and has desirable properties: (i) it is $0$ if and only if the random measures are independent; (ii) it is $1$ if and only if the random measures are completely dependent; (iii) it simultaneously quantifies the dependence of $d \ge 2$ random measures, avoiding the need for pairwise comparisons; (iv) it can be evaluated numerically. Moreover, the index allows for informed prior specifications and fair model comparisons for Bayesian nonparametric models.
The Johnson--Lindenstrauss (JL) lemma is a powerful tool for dimensionality reduction in modern algorithm design. The lemma states that any set of high-dimensional points in a Euclidean space can be flattened to lower dimensions while approximately preserving pairwise Euclidean distances. Random matrices satisfying this lemma are called JL transforms (JLTs). Inspired by existing $s$-hashing JLTs with exactly $s$ nonzero elements on each column, the present work introduces an ensemble of sparse matrices encompassing so-called $s$-hashing-like matrices whose expected number of nonzero elements on each column is~$s$. The independence of the sub-Gaussian entries of these matrices and the knowledge of their exact distribution play an important role in their analyses. Using properties of independent sub-Gaussian random variables, these matrices are demonstrated to be JLTs, and their smallest and largest singular values are estimated non-asymptotically using a technique from geometric functional analysis. As the dimensions of the matrix grow to infinity, these singular values are proved to converge almost surely to fixed quantities (by using the universal Bai--Yin law), and in distribution to the Gaussian orthogonal ensemble (GOE) Tracy--Widom law after proper rescalings. Understanding the behavior of extreme singular values is important in general because they are often used to define a measure of stability of matrix algorithms. For example, JLTs were recently used in derivative-free optimization algorithmic frameworks to select random subspaces in which random models or poll directions are constructed to achieve scalability; estimating their smallest singular value in particular helps determine the dimension of these subspaces.
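To make the notion of an $s$-hashing-like matrix more tangible, here is a minimal pure-Python sketch. It assumes one particular entry distribution that is consistent with the description above but not stated in it: each entry of an $m \times n$ matrix is independently $\pm 1/\sqrt{s}$ with probability $s/(2m)$ each and $0$ otherwise, so that a column has $s$ nonzeros in expectation and squared norms are preserved in expectation. The function names and the chosen dimensions are hypothetical.

```python
import math
import random


def s_hashing_like(m, n, s, rng):
    """Sample an m x n sparse matrix with independent entries.

    Assumed distribution: +1/sqrt(s) or -1/sqrt(s) with probability
    s/(2m) each, and 0 with probability 1 - s/m.
    """
    if not 0 < s <= m:
        raise ValueError("need 0 < s <= m")
    scale = 1.0 / math.sqrt(s)
    p = s / m
    matrix = []
    for _ in range(m):
        row = []
        for _ in range(n):
            u = rng.random()
            if u < p / 2:
                row.append(scale)
            elif u < p:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def apply(matrix, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def mean_nonzeros_per_column(matrix):
    n = len(matrix[0])
    return sum(1 for row in matrix for a in row if a != 0.0) / n


def squared_norm_ratio(matrix, x):
    """Return ||Sx||^2 / ||x||^2, which should be close to 1 for a JLT."""
    y = apply(matrix, x)
    return sum(v * v for v in y) / sum(v * v for v in x)
```

With, say, `rng = random.Random(0)`, `S = s_hashing_like(200, 1000, 4, rng)` and a Gaussian test vector `x`, the average number of nonzeros per column should be near 4 and `squared_norm_ratio(S, x)` should be near 1, with fluctuations that shrink as the embedding dimension `m` grows.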
The family of multivariate skew-normal distributions has many interesting properties. It is shown here that these hold for a general class of skew-elliptical distributions. For this class, several stochastic representations are established and then their probabilistic properties, such as characteristic function, moments, quadratic forms as well as transformation properties, are investigated.
Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. The algorithm scales to thousands of features while requiring minimal hyperparameter tuning. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
The swift progression of machine learning (ML) has not gone unnoticed in the realm of statistical mechanics. ML techniques have attracted attention from the classical density-functional theory (DFT) community, as they enable discovery of free-energy functionals to determine the equilibrium-density profile of a many-particle system. Within DFT, the external potential accounts for the interaction of the many-particle system with an external field, thus affecting the density distribution. In this context, we introduce a statistical-learning framework to infer the external potential exerted on a many-particle system. We combine a Bayesian inference approach with the classical DFT apparatus to reconstruct the external potential, yielding a probabilistic description of the external potential functional form with inherent uncertainty quantification. Our framework is exemplified with a grand-canonical one-dimensional particle ensemble with excluded volume interactions in a confined geometry. The required training dataset is generated using a Monte Carlo (MC) simulation where the external potential is applied to the grand-canonical ensemble. The resulting particle coordinates from the MC simulation are fed into the learning framework to uncover the external potential. This eventually allows us to compute the equilibrium density profile of the system by using the tools of DFT. Our approach benchmarks the inferred density against the exact one calculated through the DFT formulation with the true external potential. The proposed Bayesian procedure accurately infers the external potential and the density profile. We also highlight the external-potential uncertainty quantification conditioned on the amount of available simulated data. The seemingly simple case study introduced in this work might serve as a prototype for studying a wide variety of applications, including adsorption and capillarity.
This paper presents a numerical method for the simulation of elastic solid materials coupled to fluid inclusions. The application is motivated by the modeling of vascularized tissues and by problems in medical imaging which target the estimation of effective (i.e., macroscale) material properties, taking into account the influence of microscale dynamics, such as fluid flow in the microvasculature. The method is based on the recently proposed Reduced Lagrange Multipliers framework. In particular, the interface between solid and fluid domains is not resolved within the computational mesh for the elastic material but discretized independently, imposing the coupling condition via non-matching Lagrange multipliers. Exploiting the multiscale properties of the problem, the resulting Lagrange multipliers space is reduced to a lower-dimensional characteristic set. We present the details of the stability analysis of the resulting method considering a non-standard boundary condition that enforces a local deformation on the solid-fluid boundary. The method is validated with several numerical examples.
This study focuses on the use of model and data fusion for improving the Spalart-Allmaras (SA) closure model for Reynolds-averaged Navier-Stokes solutions of separated flows. In particular, our goal is to develop models that not only assimilate sparse experimental data to improve performance in computational models, but also generalize to unseen cases by recovering classical SA behavior. We achieve our goals using data assimilation, namely the Ensemble Kalman Filtering approach (EnKF), to calibrate the coefficients of the SA model for separated flows. A holistic calibration strategy is implemented via a parameterization of the production, diffusion, and destruction terms. This calibration relies on the assimilation of experimental data, namely velocity profiles, skin friction, and pressure coefficients, for separated flows. Despite using observational data from a single flow condition around a backward-facing step (BFS), the recalibrated SA model demonstrates generalization to other separated flows, including cases such as the 2D-bump and modified BFS. Significant improvement is observed in the quantities of interest, i.e., skin friction coefficient ($C_f$) and pressure coefficient ($C_p$), for each flow tested. Finally, it is also demonstrated that the newly proposed model recovers SA proficiency for external, unseparated flows, such as flow around a NACA-0012 airfoil, without any danger of extrapolation, and that the individually calibrated terms in the SA model are targeted towards specific flow physics: the calibrated production term improves the recirculation zone, while the destruction term improves the recovery zone.
Many low-Mach or all-Mach number codes are based on space discretizations which in combination with the first order explicit Euler method as time integration would lead to an unstable scheme. In this paper, we investigate how the choice of a suitable explicit time integration method can stabilize these schemes. We restrict ourselves to some old prototypical examples in order to find directions for further research in this field.
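The stabilizing effect of the time integrator can be illustrated on a textbook model problem, which is an assumption of this sketch and not necessarily one of the examples studied in the paper: linear advection $u_t + a u_x = 0$ on a periodic grid with central differences. The semi-discrete eigenvalues are purely imaginary, $\lambda = -\mathrm{i}\, a \sin\theta / \Delta x$, so explicit Euler amplifies every nonconstant mode, whereas the classical four-stage Runge--Kutta method (chosen here only as a familiar example) is stable under a CFL restriction. The function names are hypothetical.

```python
import math


def euler(z):
    """Stability function of the explicit Euler method."""
    return 1 + z


def rk4(z):
    """Stability function of the classical four-stage Runge-Kutta method."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24


def max_amplification(stability_function, cfl, n_modes=64):
    """Largest per-step amplification over the Fourier modes of a periodic grid.

    For central differences applied to linear advection, the scaled
    eigenvalues are z = -i * cfl * sin(theta), with cfl = a * dt / dx.
    """
    worst = 0.0
    for k in range(n_modes):
        theta = 2 * math.pi * k / n_modes
        z = complex(0.0, -cfl * math.sin(theta))
        worst = max(worst, abs(stability_function(z)))
    return worst
```

A value above 1 signals an unstable scheme. `max_amplification(euler, cfl)` exceeds 1 for every positive `cfl`, since $|1 + \mathrm{i}y| = \sqrt{1 + y^2}$, while `max_amplification(rk4, cfl)` stays at 1 as long as `cfl` does not exceed $2\sqrt{2}$.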
An adjacency-crossing graph is a graph that can be drawn such that every two edges that cross the same edge share a common endpoint. We show that the number of edges in an $n$-vertex adjacency-crossing graph is at most $5n-10$. If we require the edges to be drawn as straight-line segments, then this upper bound becomes $5n-11$. Both of these bounds are tight. The former result also follows from a very recent and independent work of Cheong et al.\cite{cheong2023weakly} who showed that the maximum sizes of weakly and strongly fan-planar graphs coincide. By combining this result with the bound of Kaufmann and Ueckerdt\cite{KU22} on the size of strongly fan-planar graphs and results of Brandenburg\cite{Br20}, by which the maximum size of adjacency-crossing graphs equals the maximum size of fan-crossing graphs, which in turn equals the maximum size of weakly fan-planar graphs, one obtains the same bound on the size of adjacency-crossing graphs. However, the proof presented here is different, simpler and direct.
The goal of explainable Artificial Intelligence (XAI) is to generate human-interpretable explanations, but there are no computationally precise theories of how humans interpret AI-generated explanations. The lack of theory means that validation of XAI must be done empirically, on a case-by-case basis, which prevents systematic theory-building in XAI. We propose a psychological theory of how humans draw conclusions from saliency maps, the most common form of XAI explanation, which for the first time allows for precise prediction of explainee inference conditioned on explanation. Our theory posits that absent explanation humans expect the AI to make similar decisions to themselves, and that they interpret an explanation by comparison to the explanations they themselves would give. Comparison is formalized via Shepard's universal law of generalization in a similarity space, a classic theory from cognitive science. A pre-registered user study on AI image classifications with saliency map explanations demonstrates that our theory quantitatively matches participants' predictions of the AI.
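The role of Shepard's law in such a theory can be sketched with a toy calculation. Shepard's universal law of generalization states that similarity decays exponentially with distance in a psychological space. The Python sketch below combines that law with a prior over the AI's decisions through Bayes' rule. Everything beyond the exponential law itself is an assumption made for illustration (the Euclidean distance, the flattened-vector representation of saliency maps, the `sensitivity` parameter, and the function names), and it should not be read as the model fitted in the paper.

```python
import math


def shepard_similarity(a, b, sensitivity=1.0):
    """Shepard's law: similarity decays exponentially with distance."""
    return math.exp(-sensitivity * math.dist(a, b))


def infer_ai_decision(prior, own_explanations, observed, sensitivity=1.0):
    """Toy posterior over the AI's decision given an observed explanation.

    prior: dict mapping each label to the explainee's prior belief that
        the AI chooses it (absent any explanation).
    own_explanations: dict mapping each label to the explanation the
        explainee would give for that label (here, a flattened saliency map).
    observed: the explanation actually shown.
    """
    scores = {
        label: prior[label]
        * shepard_similarity(observed, own_explanations[label], sensitivity)
        for label in prior
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}
```

In this toy setting, an observed saliency map that lies closer to the explanation the explainee would give for one label shifts the posterior towards that label, and a larger `sensitivity` makes the shift sharper.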