As interest in graph data has grown in recent years, the computation of various geometric tools has become essential. In some area such as mesh processing, they often rely on the computation of geodesics and shortest paths in discretized manifolds. A recent example of such a tool is the computation of Wasserstein barycenters (WB), a very general notion of barycenters derived from the theory of Optimal Transport, and their entropic-regularized variant. In this paper, we examine how WBs on discretized meshes relate to the geometry of the underlying manifold. We first provide a generic stability result with respect to the input cost matrices. We then apply this result to random geometric graphs on manifolds, whose shortest paths converge to geodesics, hence proving the consistency of WBs computed on discretized shapes.
The aim of this article is to analyze numerical schemes using two-layer neural networks with infinite width for the resolution of the high-dimensional Poisson-Neumann partial differential equations (PDEs) with Neumann boundary conditions. Using Barron's representation of the solution with a measure of probability, the energy is minimized thanks to a gradient curve dynamic on the $2$ Wasserstein space of parameters defining the neural network. Inspired by the work from Bach and Chizat, we prove that if the gradient curve converges, then the represented function is the solution of the elliptic equation considered. Numerical experiments are given to show the potential of the method.
Dynamical mean-field theory is a powerful physics tool used to analyze the typical behavior of neural networks, where neurons can be recurrently connected, or multiple layers of neurons can be stacked. However, it is not easy for beginners to access the essence of this tool and the underlying physics. Here, we give a pedagogical introduction of this method in a particular example of generic random neural networks, where neurons are randomly and fully connected by correlated synapses and therefore the network exhibits rich emergent collective dynamics. We also review related past and recent important works applying this tool. In addition, a physically transparent and alternative method, namely the dynamical cavity method, is also introduced to derive exactly the same results. The numerical implementation of solving the integro-differential mean-field equations is also detailed, with an illustration of exploring the fluctuation dissipation theorem.
Malware is a significant threat to the security of computer systems and networks which requires sophisticated techniques to analyze the behavior and functionality for detection. Traditional signature-based malware detection methods have become ineffective in detecting new and unknown malware due to their rapid evolution. One of the most promising techniques that can overcome the limitations of signature-based detection is to use control flow graphs (CFGs). CFGs leverage the structural information of a program to represent the possible paths of execution as a graph, where nodes represent instructions and edges represent control flow dependencies. Machine learning (ML) algorithms are being used to extract these features from CFGs and classify them as malicious or benign. In this survey, we aim to review some state-of-the-art methods for malware detection through CFGs using ML, focusing on the different ways of extracting, representing, and classifying. Specifically, we present a comprehensive overview of different types of CFG features that have been used as well as different ML algorithms that have been applied to CFG-based malware detection. We provide an in-depth analysis of the challenges and limitations of these approaches, as well as suggest potential solutions to address some open problems and promising future directions for research in this field.
Non-linear polynomial systems over finite fields are used to model functional behavior of cryptosystems, with applications in system security, computer cryptography, and post-quantum cryptography. Solving polynomial systems is also one of the most difficult problems in mathematics. In this paper, we propose an automated reasoning procedure for deciding the satisfiability of a system of non-linear equations over finite fields. We introduce zero decomposition techniques to prove that polynomial constraints over finite fields yield finite basis explanation functions. We use these explanation functions in model constructing satisfiability solving, allowing us to equip a CDCL-style search procedure with tailored theory reasoning in SMT solving over finite fields. We implemented our approach and provide a novel and effective reasoning prototype for non-linear arithmetic over finite fields.
Kernel two-sample tests have been widely used for multivariate data in testing equal distribution. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space are mainly targeted at specific alternatives and do not work well for some scenarios when the dimension of the data is moderate to high due to the curse of dimensionality. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power with low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared to other state-of-the-art tests under various settings and show good performance. The new approaches are illustrated on two applications: The comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips started from John F.Kennedy airport in consecutive months. All proposed methods are implemented in an R package kerTests.
Granular material is showing very often in geotechnical engineering, petroleum engineering, material science and physics. The packings of the granular material play a very important role in their mechanical behaviors, such as stress-strain response, stability, permeability and so on. Although packing is such an important research topic that its generation has been attracted lots of attentions for a long time in theoretical, experimental, and numerical aspects, packing of granular material is still a difficult and active research topic, especially the generation of random packing of non-spherical particles. To this end, we will generate packings of same particles with same shapes, numbers, and same size distribution using geometry method and dynamic method, separately. Specifically, we will extend one of Monte Carlo models for spheres to ellipsoids and poly-ellipsoids.
Quantile regression (QR) can be used to describe the comprehensive relationship between a response and predictors. Prior domain knowledge and assumptions in application are usually formulated as constraints of parameters to improve the estimation efficiency. This paper develops methods based on multi-block ADMM to fit general penalized QR with linear constraints of regression coefficients. Different formulations to handle the linear constraints and general penalty are explored and compared. The most efficient one has explicit expressions for each parameter and avoids nested-loop iterations in some existing algorithms. Additionally, parallel ADMM algorithm for big data is also developed when data are stored in a distributed fashion. The stopping criterion and convergence of the algorithm are established. Extensive numerical experiments and a real data example demonstrate the computational efficiency of the proposed algorithms. The details of theoretical proofs and different algorithm variations are presented in Appendix.
The paper considers the distribution of a general linear combination of central and non-central chi-square random variables by exploring the branch cut regions that appear in the standard Laplace inversion process. Due to the original interest from the directional statistics, the focus of this paper is on the density function of such distributions and not on their cumulative distribution function. In fact, our results confirm that the latter is a special case of the former. Our approach provides new insight by generating alternative characterizations of the probability density function in terms of a finite number of feasible univariate integrals. In particular, the central cases seem to allow an interesting representation in terms of the branch cuts, while general degrees of freedom and non-centrality can be easily adopted using recursive differentiation. Numerical results confirm that the proposed approach works well while more transparency and therefore easier control in the accuracy is ensured.
Many modern data analytics applications on graphs operate on domains where graph topology is not known a priori, and hence its determination becomes part of the problem definition, rather than serving as prior knowledge which aids the problem solution. Part III of this monograph starts by addressing ways to learn graph topology, from the case where the physics of the problem already suggest a possible topology, through to most general cases where the graph topology is learned from the data. A particular emphasis is on graph topology definition based on the correlation and precision matrices of the observed data, combined with additional prior knowledge and structural conditions, such as the smoothness or sparsity of graph connections. For learning sparse graphs (with small number of edges), the least absolute shrinkage and selection operator, known as LASSO is employed, along with its graph specific variant, graphical LASSO. For completeness, both variants of LASSO are derived in an intuitive way, and explained. An in-depth elaboration of the graph topology learning paradigm is provided through several examples on physically well defined graphs, such as electric circuits, linear heat transfer, social and computer networks, and spring-mass systems. As many graph neural networks (GNN) and convolutional graph networks (GCN) are emerging, we have also reviewed the main trends in GNNs and GCNs, from the perspective of graph signal filtering. Tensor representation of lattice-structured graphs is next considered, and it is shown that tensors (multidimensional data arrays) are a special class of graph signals, whereby the graph vertices reside on a high-dimensional regular lattice structure. This part of monograph concludes with two emerging applications in financial data processing and underground transportation networks modeling.
The area of Data Analytics on graphs promises a paradigm shift as we approach information processing of classes of data, which are typically acquired on irregular but structured domains (social networks, various ad-hoc sensor networks). Yet, despite its long history, current approaches mostly focus on the optimization of graphs themselves, rather than on directly inferring learning strategies, such as detection, estimation, statistical and probabilistic inference, clustering and separation from signals and data acquired on graphs. To fill this void, we first revisit graph topologies from a Data Analytics point of view, and establish a taxonomy of graph networks through a linear algebraic formalism of graph topology (vertices, connections, directivity). This serves as a basis for spectral analysis of graphs, whereby the eigenvalues and eigenvectors of graph Laplacian and adjacency matrices are shown to convey physical meaning related to both graph topology and higher-order graph properties, such as cuts, walks, paths, and neighborhoods. Next, to illustrate estimation strategies performed on graph signals, spectral analysis of graphs is introduced through eigenanalysis of mathematical descriptors of graphs and in a generic way. Finally, a framework for vertex clustering and graph segmentation is established based on graph spectral representation (eigenanalysis) which illustrates the power of graphs in various data association tasks. The supporting examples demonstrate the promise of Graph Data Analytics in modeling structural and functional/semantic inferences. At the same time, Part I serves as a basis for Part II and Part III which deal with theory, methods and applications of processing Data on Graphs and Graph Topology Learning from data.