Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF), maps parameterized by deep neural networks and trained to push a reference distribution towards a target. Recently proposed NF-enhanced samplers blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions the performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can be reliably handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are otherwise more robust in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.
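As a concrete illustration of approach (i), the sketch below shows one Independent Metropolis-Hastings step with a trained flow as the proposal; `flow.sample`, `flow.log_prob`, and `target_log_prob` are assumed interfaces for illustration, not part of the paper.

```python
import numpy as np

def imh_step(x, target_log_prob, flow, rng):
    """One IMH step: propose independently from the flow, then accept/reject."""
    y = flow.sample(rng)                    # proposal draw, independent of x
    log_alpha = (target_log_prob(y) - target_log_prob(x)
                 + flow.log_prob(x) - flow.log_prob(y))
    if np.log(rng.uniform()) < log_alpha:   # Metropolis-Hastings acceptance ratio
        return y, True                      # move accepted
    return x, False                         # move rejected
```

The closer the flow is to the target, the closer the log-ratio stays to zero and the higher the acceptance rate, which is exactly the target-proposal adequacy that the mixing-time bound quantifies.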
We prove that in an approximate factor model for an $n$-dimensional vector of stationary time series, the factor loadings estimated via Principal Components are asymptotically equivalent, as $n\to\infty$, to those estimated by Quasi Maximum Likelihood. Both estimators are, in turn, also asymptotically equivalent, as $n\to\infty$, to the unfeasible Ordinary Least Squares estimator we would have if the factors were observed. We also show that the usual sandwich form of the asymptotic covariance matrix of the Quasi Maximum Likelihood estimator is asymptotically equivalent to the simpler asymptotic covariance matrix of the unfeasible Ordinary Least Squares. This provides a simple way to estimate asymptotic confidence intervals for the Quasi Maximum Likelihood estimator without the need to estimate the Hessian and Fisher information matrices, whose expressions are very complex. All our results hold in the general case in which the idiosyncratic components are cross-sectionally heteroskedastic as well as serially and cross-sectionally weakly correlated.
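For concreteness, in standard factor-model notation (ours, not necessarily the paper's), the model and the unfeasible OLS loading estimator read
\[
x_{it} = \lambda_i' F_t + \xi_{it},
\qquad
\hat\lambda_i^{\mathrm{OLS}} = \Big(\sum_{t=1}^{T} F_t F_t'\Big)^{-1} \sum_{t=1}^{T} F_t\, x_{it},
\]
where $F_t$ is the $r$-dimensional factor vector, $\lambda_i$ the loadings of series $i$, and $\xi_{it}$ the idiosyncratic component; the estimator is unfeasible because $F_t$ is not observed.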
Manifold learning is a central task in modern statistics and data science. Many datasets (cells, documents, images, molecules) can be represented as point clouds embedded in a high-dimensional ambient space; however, the degrees of freedom intrinsic to the data are usually far fewer than the number of ambient dimensions. The task of detecting a latent manifold along which the data are embedded is a prerequisite for a wide family of downstream analyses. Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge. We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation to construct a sparse and adaptive affinity matrix, which can be interpreted as a generalisation of the bistochastic kernel normalisation. We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise, and illustrate these results in simulations. We identify a highly efficient computational scheme for computing this optimal transport for discrete data and demonstrate that it outperforms competing methods in a set of examples.
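For orientation, the bistochastic kernel normalisation that the proposed affinity generalises can be sketched as follows; the Gaussian kernel, the damped symmetric Sinkhorn update, and all names are illustrative rather than the paper's construction.

```python
import numpy as np

def bistochastic_kernel(X, bandwidth, n_iter=200):
    """Symmetric Sinkhorn scaling: find d so that diag(d) K diag(d) is doubly stochastic."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))   # dense Gaussian kernel
    d = np.ones(len(X))
    for _ in range(n_iter):
        d = np.sqrt(d / (K @ d))                   # damped symmetric fixed-point update
    return d[:, None] * K * d[None, :]             # rows and columns sum to 1
```

At the fixed point $d_i (Kd)_i = 1$, so every row (and, by symmetry, column) of the scaled kernel sums to one; the OT-based affinity replaces this dense scaling with a sparse, quadratically regularised transport plan.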
In this paper, we present a controller that combines motion generation and control in one loop, to endow robots with reactivity and safety. In particular, we propose a control approach that enables a robot to follow the motion plan of a first-order Dynamical System (DS) with a variable stiffness profile, in a closed-loop configuration where the controller is always aware of the current robot state. This allows the robot to follow a desired path with an interactive behavior dictated by the desired stiffness. We also present two solutions that enable a robot to follow the desired velocity profile, in a manner similar to trajectory-tracking controllers, while maintaining the closed-loop configuration. Additionally, we exploit the concept of energy tanks to guarantee the passivity of our closed-loop system during interactions with the environment, as well as its asymptotic stability in free motion. The developed approach is evaluated extensively in simulation, as well as in real-robot experiments, in terms of performance and safety, both in free motion and during the execution of physical interaction tasks.
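A minimal, illustrative sketch of the closed-loop idea (not the paper's exact controller; the energy-tank passivity machinery is omitted): the commanded force steers the measured velocity towards the velocity prescribed by a first-order DS, with a state-dependent gain playing the role of the variable stiffness profile.

```python
import numpy as np

def ds_following_force(x, x_dot, ds, stiffness):
    """Velocity-error feedback towards the DS-prescribed velocity ds(x)."""
    v_des = ds(x)               # motion plan from the first-order DS
    K = stiffness(x)            # state-dependent gain ("stiffness profile")
    return K @ (v_des - x_dot)  # compliant tracking force

# Example: linear DS converging to the origin, constant diagonal gain.
ds = lambda x: -0.5 * x
stiffness = lambda x: np.diag([50.0, 50.0])
x, x_dot = np.array([1.0, 0.5]), np.zeros(2)
dt, mass = 0.01, 1.0
for _ in range(1000):           # crude Euler simulation of a point mass
    F = ds_following_force(x, x_dot, ds, stiffness)
    x_dot += dt * F / mass
    x += dt * x_dot
```

Because the DS is re-evaluated at the current state on every cycle, a perturbation simply changes where the plan is queried, which is what makes the scheme reactive.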
For a fixed $T$ and $k \geq 2$, a $k$-dimensional vector stochastic differential equation $dX_t=\mu(X_t, \theta)\,dt+\nu(X_t)\,dW_t$ is studied over a time interval $[0,T]$. The vector of drift parameters $\theta$ is unknown, and the dependence on $\theta$ is in general nonlinear. We prove that the difference between the approximate maximum likelihood estimator of the drift parameter, $\overline{\theta}_n\equiv \overline{\theta}_{n,T}$, obtained from discrete observations $(X_{i\Delta_n}, 0 \leq i \leq n)$, and the maximum likelihood estimator, $\hat{\theta}\equiv \hat{\theta}_T$, obtained from continuous observations $(X_t, 0\leq t\leq T)$, converges stably in law, as $\Delta_n=T/n$ tends to zero, to a mixed normal random vector with a covariance matrix that depends on $\hat{\theta}$ and on the path $(X_t, 0 \leq t\leq T)$. The uniform ellipticity of the diffusion matrix $S(x)=\nu(x)\nu(x)^T$ emerges as the main assumption on the diffusion coefficient function.
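For context, the two estimators can be defined through the standard Girsanov-type log-likelihood and its Euler-type discretization (our reconstruction of the standard setup, with $t_i = i\Delta_n$; not copied from the paper):
\[
\ell_T(\theta) = \int_0^T \mu(X_t,\theta)^\top S(X_t)^{-1}\, dX_t
 - \frac{1}{2}\int_0^T \mu(X_t,\theta)^\top S(X_t)^{-1}\mu(X_t,\theta)\, dt,
\]
\[
\ell_n(\theta) = \sum_{i=0}^{n-1} \mu(X_{t_i},\theta)^\top S(X_{t_i})^{-1}\,(X_{t_{i+1}}-X_{t_i})
 - \frac{\Delta_n}{2}\sum_{i=0}^{n-1} \mu(X_{t_i},\theta)^\top S(X_{t_i})^{-1}\mu(X_{t_i},\theta),
\]
with $\hat{\theta}_T$ maximizing $\ell_T$ and $\overline{\theta}_{n,T}$ maximizing $\ell_n$; the result quantifies the limiting law of their difference.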
While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT into a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-offs between parameter count and runtime as well as between parameter count and accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks also demonstrate strong performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.
Let $G$ be a graph, which represents a social network, and suppose each node $v$ has a threshold value $\tau(v)$. Consider an initial configuration where each node is either positive or negative. In each discrete time step, a node $v$ becomes/remains positive if at least $\tau(v)$ of its neighbors are positive, and negative otherwise. A node set $\mathcal{S}$ is a Target Set (TS) whenever the following holds: if $\mathcal{S}$ is fully positive initially, all nodes in the graph eventually become positive. We focus on a generalization of TS, called Timed TS (TTS), where it is permitted to assign a positive state to a node at any step of the process, rather than just at the beginning. We provide graph structures for which the minimum TTS is significantly smaller than the minimum TS, indicating that timing is an essential aspect of successful target selection strategies. Furthermore, we prove tight bounds on the minimum size of a TTS in terms of the number of nodes and the maximum degree when the thresholds are assigned based on the majority rule. We show that the problem of determining the minimum size of a TTS is NP-hard and provide an Integer Linear Programming formulation and a greedy algorithm. We evaluate the performance of our algorithm by conducting experiments on various synthetic and real-world networks. We also present a linear-time exact algorithm for trees.
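To make the dynamics concrete, the sketch below simulates the synchronous process and checks whether a seed set turns the whole graph positive. Note that, under the rule exactly as stated, even seed nodes may later flip back; since the deterministic process on a finite state space either reaches the all-positive configuration or enters a cycle, the simulation stops on the first repeated configuration. `graph` and `tau` are illustrative data structures, not the paper's code.

```python
def becomes_all_positive(graph, tau, seed):
    """Simulate synchronous threshold dynamics from an initially positive seed set.

    graph: dict mapping each node to the set of its neighbors.
    tau:   dict mapping each node to its threshold.
    Returns True iff the all-positive configuration is reached.
    """
    state = {v: v in seed for v in graph}
    seen = set()
    while True:
        if all(state.values()):
            return True
        config = frozenset(v for v in graph if state[v])
        if config in seen:      # repeated configuration: a cycle, never all-positive
            return False
        seen.add(config)
        state = {v: sum(state[u] for u in graph[v]) >= tau[v] for v in graph}
```

In the TTS setting, one would additionally force chosen nodes to be positive at prescribed steps rather than only at step zero.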
Deep learning techniques are increasingly applied to scientific problems, where the precision of networks is crucial. Despite being deemed universal function approximators, neural networks in practice struggle to reduce prediction errors below $O(10^{-5})$, even with large network sizes and extended training iterations. To address this issue, we develop multi-stage neural networks, which divide the training process into stages, with each stage using a new network that is optimized to fit the residue from the previous stage. Across successive stages, the residue magnitude decreases substantially and follows an inverse power-law relationship with the residue frequency. The multi-stage neural networks effectively mitigate the spectral bias associated with regular neural networks, enabling them to capture the high-frequency features of target functions. We demonstrate that the prediction error of multi-stage training, for both regression problems and physics-informed neural networks, can nearly reach the machine precision $O(10^{-16})$ of double-precision floating point within a finite number of iterations. Such levels of accuracy are rarely attainable using single neural networks alone.
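A minimal sketch of the multi-stage idea for regression, assuming hypothetical helpers `make_net` (returns a fresh network) and `train` (standard fitting): each stage trains a new network on the rescaled residue left by the combined previous stages, so the next stage always sees an $O(1)$ target.

```python
import numpy as np

def multi_stage_fit(x, y, n_stages, make_net, train):
    """Fit a sequence of networks, each trained on the previous stages' residue."""
    stages, residue = [], y.copy()
    for _ in range(n_stages):
        scale = np.abs(residue).max()        # normalize the current target to O(1)
        net = train(make_net(), x, residue / scale)
        stages.append((scale, net))
        residue = residue - scale * net(x)   # what the next stage must fit
    return stages

def multi_stage_predict(stages, x):
    """Sum the rescaled stage outputs."""
    return sum(scale * net(x) for scale, net in stages)
```

The rescaling step is what lets later stages resolve residues far below the error floor a single network would plateau at.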
We introduce a new class of objectives for optimal transport computations on datasets in high-dimensional Euclidean spaces. The new objectives are parametrized by $\rho \geq 1$ and provide a metric $\mathcal{R}_{\rho}(\cdot, \cdot)$ on discrete probability distributions in $\mathbb{R}^d$. As $\rho$ approaches $1$, the metric approaches the Earth Mover's distance, but for $\rho$ larger than (but close to) $1$, it admits significantly faster algorithms. Namely, for distributions $\mu$ and $\nu$ supported on $n$ and $m$ vectors in $\mathbb{R}^d$ of norm at most $r$, and any $\epsilon > 0$, we give an algorithm which outputs an additive $\epsilon r$-approximation to $\mathcal{R}_{\rho}(\mu, \nu)$ in time $(n+m) \cdot \mathrm{poly}((nm)^{(\rho-1)/\rho} \cdot 2^{\rho / (\rho-1)} / \epsilon)$.
Link prediction on knowledge graphs (KGs) is a key research topic. Previous work mainly focused on binary relations, paying less attention to higher-arity relations even though they are ubiquitous in real-world KGs. This paper considers link prediction over n-ary relational facts and proposes a graph-based approach to this task. The key to our approach is to represent the n-ary structure of a fact as a small heterogeneous graph and to model this graph with edge-biased fully-connected attention. The fully-connected attention captures universal inter-vertex interactions, while edge-aware attentive biases specifically encode the graph structure and its heterogeneity. In this fashion, our approach fully models the global and local dependencies in each n-ary fact, and hence can more effectively capture the associations therein. Extensive evaluation verifies the effectiveness and superiority of our approach. It performs substantially and consistently better than the current state of the art across a variety of n-ary relational benchmarks. Our code is publicly available.
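The core computation can be sketched as follows, with queries, keys, and values collapsed to the vertex embeddings for brevity; the shape and construction of the edge biases are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def edge_biased_attention(H, edge_bias):
    """One head of fully-connected attention with additive edge-aware biases.

    H:         (V, d) vertex embeddings of a fact's small heterogeneous graph.
    edge_bias: (V, V) learned scalars, one per (typed) vertex pair.
    """
    d = H.shape[1]
    logits = H @ H.T / np.sqrt(d) + edge_bias    # fully-connected scores + edge bias
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ H                              # updated vertex representations
```

Every vertex attends to every other vertex (the global dependencies), while the additive biases let the scores differ by edge type (the local, heterogeneous structure).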
Sampling methods (e.g., node-wise, layer-wise, or subgraph-based) have become an indispensable strategy for speeding up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamics of optimization, which leads to high variance in estimating the stochastic gradients. The high-variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage, and that mitigating both types of variance is necessary to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and that explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and achieves better generalization compared to existing methods.
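A minimal sketch of the gradient-adaptive sampling component, assuming per-node gradient-norm estimates are maintained during training (all names illustrative): nodes are drawn with probabilities proportional to these estimates, and the resulting gradient estimator is debiased with importance weights.

```python
import numpy as np

def sample_minibatch(grad_norm_est, batch_size, rng):
    """Draw nodes with probabilities shaped by estimated per-node gradient norms."""
    p = grad_norm_est / grad_norm_est.sum()             # variance-aware probabilities
    idx = rng.choice(len(p), size=batch_size, replace=True, p=p)
    weights = 1.0 / (len(p) * p[idx])                   # importance weights for unbiasedness
    return idx, weights
```

This addresses the backward-stage (stochastic gradient) variance; the forward-stage embedding approximation variance is reduced separately, which is why the strategy is called decoupled.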