To accurately reproduce measurements from the real world, simulators need to have an adequate model of the physical system and require the parameters of the model be identified. We address the latter problem of estimating parameters through a Bayesian inference approach that approximates a posterior distribution over simulation parameters given real sensor measurements. By extending the commonly used Gaussian likelihood model for trajectories via the multiple-shooting formulation, our chosen particle-based inference algorithm Stein Variational Gradient Descent is able to identify highly nonlinear, underactuated systems. We leverage GPU code generation and differentiable simulation to evaluate the likelihood and its gradient for many particles in parallel. Our algorithm infers non-parametric distributions over simulation parameters more accurately than comparable baselines and handles constraints over parameters efficiently through gradient-based optimization. We evaluate estimation performance on several physical experiments. On an underactuated mechanism where a 7-DOF robot arm excites an object with an unknown mass configuration, we demonstrate how our inference technique can identify symmetries between the parameters and provide highly accurate predictions. Project website: //uscresl.github.io/prob-diff-sim
This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the function-value query complexity and the distribution of search directions. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choices of random search directions to minimize the asymptotic covariance matrix, and conclude that the optimal search direction depends on the optimality criteria with respect to different summary statistics of the Fisher information matrix. Based on the asymptotic distribution result, we conduct online statistical inference by providing two construction procedures of valid confidence intervals. We provide numerical experiments verifying our theoretical results with the practical effectiveness of the procedures.
Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are work-horses for modern machine learning algorithms. Their constant stepsize variants are preferred in practice due to fast convergence behavior. However, constant step stochastic iterative algorithms do not converge asymptotically to the optimal solution, but instead have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit when the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) SGD algorithms with smooth and strongly convex objective, (2) linear SA algorithms involving a Hurwitz matrix, and (3) nonlinear SA algorithms involving a contractive operator. When the iterate is scaled by $1/\sqrt{\alpha}$, where $\alpha$ is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an integral equation. Under a uniqueness assumption (which can be removed in certain settings) on this equation, we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that unlike central limit theorem type results: (1) the scaling factor need not be $1/\sqrt{\alpha}$, and (2) the limiting distribution need not be Gaussian. Based on the numerical study, we come up with a formula to determine the right scaling factor, and make insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.
Improving learning efficiency is paramount for learning resource allocation with deep neural networks (DNNs) in wireless communications over highly dynamic environments. Incorporating domain knowledge into learning is a promising way of dealing with this issue, which is an emerging topic in the wireless community. In this article, we first briefly summarize two classes of approaches to using domain knowledge: introducing mathematical models or prior knowledge to deep learning. Then, we consider a kind of symmetric prior, permutation equivariance, which widely exists in wireless tasks. To explain how such a generic prior is harnessed to improve learning efficiency, we resort to ranking, which jointly sorts the input and output of a DNN. We use power allocation among subcarriers, probabilistic content caching, and interference coordination to illustrate the improvement of learning efficiency by exploiting the property. From the case study, we find that the required training samples to achieve given system performance decreases with the number of subcarriers or contents, owing to an interesting phenomenon: "sample hardening". Simulation results show that the training samples, the free parameters in DNNs and the training time can be reduced dramatically by harnessing the prior knowledge. The samples required to train a DNN after ranking can be reduced by $15 \sim 2,400$ folds to achieve the same system performance as the counterpart without using prior.
In this paper, we investigate the problem of statistical inference of the model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the function query complexity and the distribution of search directions. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choices of random search directions to minimize the asymptotic covariance matrix, and conclude that the optimal search direction depends on the optimality criteria with respect to different summary statistics of the Fisher information matrix. Based on the asymptotic distribution result, we conduct one-pass statistical inference by providing two constructions of valid confidence intervals. We provide numerical experiments verifying our theoretical results with the practical effectiveness of the procedures.
Meta-learning optimizes the hyperparameters of a training procedure, such as its initialization, kernel, or learning rate, based on data sampled from a number of auxiliary tasks. A key underlying assumption is that the auxiliary tasks, known as meta-training tasks, share the same generating distribution as the tasks to be encountered at deployment time, known as meta-test tasks. This may, however, not be the case when the test environment differ from the meta-training conditions. To address shifts in task generating distribution between meta-training and meta-testing phases, this paper introduces weighted free energy minimization (WFEM) for transfer meta-learning. We instantiate the proposed approach for non-parametric Bayesian regression and classification via Gaussian Processes (GPs). The method is validated on a toy sinusoidal regression problem, as well as on classification using miniImagenet and CUB data sets, through comparison with standard meta-learning of GP priors as implemented by PACOH.
Artificial reverberation (AR) models play a central role in various audio applications. Therefore, estimating the AR model parameters (ARPs) of a target reverberation is a crucial task. Although a few recent deep-learning-based approaches have shown promising performance, their non-end-to-end training scheme prevents them from fully exploiting the potential of deep neural networks. This motivates to introduce differentiable artificial reverberation (DAR) models which allows loss gradients to be back-propagated end-to-end. However, implementing the AR models with their difference equations "as is" in the deep-learning framework severely bottlenecks the training speed when executed with a parallel processor like GPU due to their infinite impulse response (IIR) components. We tackle this problem by replacing the IIR filters with finite impulse response (FIR) approximations with the frequency-sampling method (FSM). Using the FSM, we implement three DAR models -- differentiable Filtered Velvet Noise (FVN), Advanced Filtered Velvet Noise (AFVN), and Feedback Delay Network (FDN). For each AR model, we train its ARP estimation networks for analysis-synthesis (RIR-to-ARP) and blind estimation (reverberant-speech-to-ARP) task in an end-to-end manner with its DAR model counterpart. Experiment results show that the proposed method achieves consistent performance improvement over the non-end-to-end approaches in both objective metrics and subjective listening test results.
Personalized recommender systems are playing an increasingly important role as more content and services become available and users struggle to identify what might interest them. Although matrix factorization and deep learning based methods have proved effective in user preference modeling, they violate the triangle inequality and fail to capture fine-grained preference information. To tackle this, we develop a distance-based recommendation model with several novel aspects: (i) each user and item are parameterized by Gaussian distributions to capture the learning uncertainties; (ii) an adaptive margin generation scheme is proposed to generate the margins regarding different training triplets; (iii) explicit user-user/item-item similarity modeling is incorporated in the objective function. The Wasserstein distance is employed to determine preferences because it obeys the triangle inequality and can measure the distance between probabilistic distributions. Via a comparison using five real-world datasets with state-of-the-art methods, the proposed model outperforms the best existing models by 4-22% in terms of recall@K on Top-K recommendation.
This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. The inference of iTM-VAE is modeled by neural networks such that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. Actually, the hyper-prior technique is quite general and we show that it can be applied to other AEVB based models to alleviate the {\it collapse-to-prior} problem elegantly. Moreover, we also propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic distributions with better variability. Experimental results on 20News and Reuters RCV1-V2 datasets show that the proposed models outperform the state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
Latent Dirichlet Allocation (LDA) is a topic model widely used in natural language processing and machine learning. Most approaches to training the model rely on iterative algorithms, which makes it difficult to run LDA on big corpora that are best analyzed in parallel and distributed computational environments. Indeed, current approaches to parallel inference either don't converge to the correct posterior or require storage of large dense matrices in memory. We present a novel sampler that overcomes both problems, and we show that this sampler is faster, both empirically and theoretically, than previous Gibbs samplers for LDA. We do so by employing a novel P\'olya-urn-based approximation in the sparse partially collapsed sampler for LDA. We prove that the approximation error vanishes with data size, making our algorithm asymptotically exact, a property of importance for large-scale topic models. In addition, we show, via an explicit example, that -- contrary to popular belief in the topic modeling literature -- partially collapsed samplers can be more efficient than fully collapsed samplers. We conclude by comparing the performance of our algorithm with that of other approaches on well-known corpora.
Owing to the recent advances in "Big Data" modeling and prediction tasks, variational Bayesian estimation has gained popularity due to their ability to provide exact solutions to approximate posteriors. One key technique for approximate inference is stochastic variational inference (SVI). SVI poses variational inference as a stochastic optimization problem and solves it iteratively using noisy gradient estimates. It aims to handle massive data for predictive and classification tasks by applying complex Bayesian models that have observed as well as latent variables. This paper aims to decentralize it allowing parallel computation, secure learning and robustness benefits. We use Alternating Direction Method of Multipliers in a top-down setting to develop a distributed SVI algorithm such that independent learners running inference algorithms only require sharing the estimated model parameters instead of their private datasets. Our work extends the distributed SVI-ADMM algorithm that we first propose, to an ADMM-based networked SVI algorithm in which not only are the learners working distributively but they share information according to rules of a graph by which they form a network. This kind of work lies under the umbrella of `deep learning over networks' and we verify our algorithm for a topic-modeling problem for corpus of Wikipedia articles. We illustrate the results on latent Dirichlet allocation (LDA) topic model in large document classification, compare performance with the centralized algorithm, and use numerical experiments to corroborate the analytical results.