Consider a random vector $\mathbf{y}=\mathbf{\Sigma}^{1/2}\mathbf{x}$, where the $p$ elements of the vector $\mathbf{x}$ are i.i.d. real-valued random variables with zero mean and finite fourth moment, and $\mathbf{\Sigma}^{1/2}$ is a deterministic $p\times p$ matrix such that the spectral norm of the population correlation matrix $\mathbf{R}$ of $\mathbf{y}$ is uniformly bounded. In this paper, we find that the log determinant of the sample correlation matrix $\hat{\mathbf{R}}$ based on a sample of size $n$ from the distribution of $\mathbf{y}$ satisfies a CLT (central limit theorem) for $p/n\to \gamma\in (0, 1]$ and $p\leq n$. Explicit formulas for the asymptotic mean and variance are provided. In case the mean of $\mathbf{y}$ is unknown, we show that after recentering by the empirical mean the obtained CLT holds with a shift in the asymptotic mean. This result is of independent interest in both large dimensional random matrix theory and high-dimensional statistical literature of large sample correlation matrices for non-normal data. At last, the obtained findings are applied for testing of uncorrelatedness of $p$ random variables. Surprisingly, in the null case $\mathbf{R}=\mathbf{I}$, the test statistic becomes completely pivotal and the extensive simulations show that the obtained CLT also holds if the moments of order four do not exist at all, which conjectures a promising and robust test statistic for heavy-tailed high-dimensional data.
In this paper, we consider the problem of joint parameter estimation for drift and diffusion coefficients of a stochastic McKean-Vlasov equation and for the associated system of interacting particles. The analysis is provided in a general framework, as both coefficients depend on the solution of the process and on the law of the solution itself. Starting from discrete observations of the interacting particle system over a fixed interval $[0, T]$, we propose a contrast function based on a pseudo likelihood approach. We show that the associated estimator is consistent when the discretization step ($\Delta_n$) and the number of particles ($N$) satisfy $\Delta_n \rightarrow 0$ and $N \rightarrow \infty$, and asymptotically normal when additionally the condition $\Delta_n N \rightarrow 0$ holds.
We prove that there is a randomized polynomial-time algorithm that given an edge-weighted graph $G$ excluding a fixed-minor $Q$ on $n$ vertices and an accuracy parameter $\varepsilon>0$, constructs an edge-weighted graph~$H$ and an embedding $\eta\colon V(G)\to V(H)$ with the following properties: * For any constant size $Q$, the treewidth of $H$ is polynomial in $\varepsilon^{-1}$, $\log n$, and the logarithm of the stretch of the distance metric in $G$. * The expected multiplicative distortion is $(1+\varepsilon)$: for every pair of vertices $u,v$ of $G$, we have $\mathrm{dist}_H(\eta(u),\eta(v))\geq \mathrm{dist}_G(u,v)$ always and $\mathrm{Exp}[\mathrm{dist}_H(\eta(u),\eta(v))]\leq (1+\varepsilon)\mathrm{dist}_G(u,v)$. Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with $\mathcal{O}(1)$ expected distortion requires the host graph to have treewidth $\Omega(\log n)$. It also provides a unified framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of problems including network design, clustering or routing problems, in minor-free metrics where the optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing problem, and capacitated clustering problems.
Recently, deep learning-based algorithms are widely adopted due to the advantage of being able to establish anomaly detection models without or with minimal domain knowledge of the task. Instead, to train the artificial neural network more stable, it should be better to define the appropriate neural network structure or the loss function. For the training anomaly detection model, the mean squared error (MSE) function is adopted widely. On the other hand, the novel loss function, logarithmic mean squared error (LMSE), is proposed in this paper to train the neural network more stable. This study covers a variety of comparisons from mathematical comparisons, visualization in the differential domain for backpropagation, loss convergence in the training process, and anomaly detection performance. In an overall view, LMSE is superior to the existing MSE function in terms of strongness of loss convergence, anomaly detection performance. The LMSE function is expected to be applicable for training not only the anomaly detection model but also the general generative neural network.
We provide a novel Neural Network architecture that can: i) output R-matrix for a given quantum integrable spin chain, ii) search for an integrable Hamiltonian and the corresponding R-matrix under assumptions of certain symmetries or other restrictions, iii) explore the space of Hamiltonians around already learned models and reconstruct the family of integrable spin chains which they belong to. The neural network training is done by minimizing loss functions encoding Yang-Baxter equation, regularity and other model-specific restrictions such as hermiticity. Holomorphy is implemented via the choice of activation functions. We demonstrate the work of our Neural Network on the two-dimensional spin chains of difference form. In particular, we reconstruct the R-matrices for all 14 classes. We also demonstrate its utility as an \textit{Explorer}, scanning a certain subspace of Hamiltonians and identifying integrable classes after clusterisation. The last strategy can be used in future to carve out the map of integrable spin chains in higher dimensions and in more general settings where no analytical methods are available.
In this paper, we investigate a two-layer fully connected neural network of the form $f(X)=\frac{1}{\sqrt{d_1}}\boldsymbol{a}^\top \sigma\left(WX\right)$, where $X\in\mathbb{R}^{d_0\times n}$ is a deterministic data matrix, $W\in\mathbb{R}^{d_1\times d_0}$ and $\boldsymbol{a}\in\mathbb{R}^{d_1}$ are random Gaussian weights, and $\sigma$ is a nonlinear activation function. We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$: the empirical conjugate kernel (CK) and neural tangent kernel (NTK), beyond the linear-width regime ($d_1\asymp n$). We focus on the $\textit{ultra-wide regime}$, where the width $d_1$ of the first layer is much larger than the sample size $n$. Under appropriate assumptions on $X$ and $\sigma$, a deformed semicircle law emerges as $d_1/n\to\infty$ and $n\to\infty$. We first prove this limiting law for generalized sample covariance matrices with some dependency. To specify it for our neural network model, we provide a nonlinear Hanson-Wright inequality that is suitable for neural networks with random weights and Lipschitz activation functions. We also demonstrate non-asymptotic concentrations of the empirical CK and NTK around their limiting kernels in the spectral norm, along with lower bounds on their smallest eigenvalues. As an application, we show that random feature regression induced by the empirical kernel achieves the same asymptotic performance as its limiting kernel regression under the ultra-wide regime. This allows us to calculate the asymptotic training and test errors for random feature regression using the corresponding kernel regression.
In this work we obtain results related to the approximation of $h$-dimensional dominant subspaces and low rank approximations of matrices $ A\in\mathbb K^{m\times n}$ (where $\mathbb K=\mathbb R$ or $\mathbb C)$ in case there is no singular gap at the index $h$, i.e. if $\sigma_h=\sigma_{h+1}$ (where $\sigma_1\geq \ldots\geq \sigma_p\geq 0$ denote the singular values of $ A$, and $p=\min\{m,n\}$). In order to do this, we develop a novel perspective for the convergence analysis of the classical deterministic block Krylov methods in this context. Indeed, starting with a matrix $ X\in\mathbb K^{n\times r}$ with $r\geq h$ satisfying a compatibility assumption with some $h$-dimensional right dominant subspace, we show that block Krylov methods produce arbitrarily good approximations for both problems mentioned above. Our approach is based on recent work by Drineas, Ipsen, Kontopoulou and Magdon-Ismail on approximation of structural left dominant subspaces. The main difference between our work and previous work on this topic is that instead of exploiting a singular gap at $h$ (which is zero in this case) we exploit the nearest existing singular gaps.
Seeking the equivalent entities among multi-source Knowledge Graphs (KGs) is the pivotal step to KGs integration, also known as \emph{entity alignment} (EA). However, most existing EA methods are inefficient and poor in scalability. A recent summary points out that some of them even require several days to deal with a dataset containing 200,000 nodes (DWY100K). We believe over-complex graph encoder and inefficient negative sampling strategy are the two main reasons. In this paper, we propose a novel KG encoder -- Dual Attention Matching Network (Dual-AMN), which not only models both intra-graph and cross-graph information smartly, but also greatly reduces computational complexity. Furthermore, we propose the Normalized Hard Sample Mining Loss to smoothly select hard negative samples with reduced loss shift. The experimental results on widely used public datasets indicate that our method achieves both high accuracy and high efficiency. On DWY100K, the whole running process of our method could be finished in 1,100 seconds, at least 10* faster than previous work. The performances of our method also outperform previous works across all datasets, where Hits@1 and MRR have been improved from 6% to 13%.
Reasoning with knowledge expressed in natural language and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. General neural architectures that jointly learn representations and transformations of text are very data-inefficient, and it is hard to analyse their reasoning process. These issues are addressed by end-to-end differentiable reasoning systems such as Neural Theorem Provers (NTPs), although they can only be used with small-scale symbolic KBs. In this paper we first propose Greedy NTPs (GNTPs), an extension to NTPs addressing their complexity and scalability limitations, thus making them applicable to real-world datasets. This result is achieved by dynamically constructing the computation graph of NTPs and including only the most promising proof paths during inference, thus obtaining orders of magnitude more efficient models. Then, we propose a novel approach for jointly reasoning over KBs and textual mentions, by embedding logic facts and natural language sentences in a shared embedding space. We show that GNTPs perform on par with NTPs at a fraction of their cost while achieving competitive link prediction results on large datasets, providing explanations for predictions, and inducing interpretable models. Source code, datasets, and supplementary material are available online at //github.com/uclnlp/gntp.
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.