A lower bound is an important tool for predicting the performance that an estimator can achieve under a particular statistical model. Bayesian bounds are a kind of such bounds which not only utilizes the observation statistics but also includes the prior model information. In reality, however, the true model generating the data is either unknown or simplified when deriving estimators, which motivates the works to derive estimation bounds under modeling mismatch situations. This paper provides a derivation of a Bayesian Cram\'{e}r-Rao bound under model misspecification, defining important concepts such as pseudotrue parameter that were not clearly identified in previous works. The general result is particularized in linear and Gaussian problems, where closed-forms are available and results are used to validate the results.
Recently, tensor singular value decomposition (t-SVD) has emerged as a promising tool for hyperspectral image (HSI) processing. In the t-SVD, there are two key building blocks: (i) the low-rank enhanced transform and (ii) the accompanying low-rank characterization of transformed frontal slices. Previous t-SVD methods mainly focus on the developments of (i), while neglecting the other important aspect, i.e., the exact characterization of transformed frontal slices. In this letter, we exploit the potentiality in both building blocks by leveraging the \underline{\bf H}ierarchical nonlinear transform and the \underline{\bf H}ierarchical matrix factorization to establish a new \underline{\bf T}ensor \underline{\bf F}actorization (termed as H2TF). Compared to shallow counter partners, e.g., low-rank matrix factorization or its convex surrogates, H2TF can better capture complex structures of transformed frontal slices due to its hierarchical modeling abilities. We then suggest the H2TF-based HSI denoising model and develop an alternating direction method of multipliers-based algorithm to address the resultant model. Extensive experiments validate the superiority of our method over state-of-the-art HSI denoising methods.
The paper analyses properties of a large class of "path-based" Data Envelopment Analysis models through a unifying general scheme. The scheme includes the well-known oriented radial models, the hyperbolic distance function model, the directional distance function models, and even permits their generalisations. The modelling is not constrained to non-negative data and is flexible enough to accommodate variants of standard models over arbitrary data. Mathematical tools developed in the paper allow systematic analysis of the models from the point of view of ten desirable properties. It is shown that some of the properties are satisfied (resp., fail) for all models in the general scheme, while others have a more nuanced behaviour and must be assessed individually in each model. Our results can help researchers and practitioners navigate among the different models and apply the models to mixed data.
Under-approximations of reachable sets and tubes have been receiving growing research attention due to their important roles in control synthesis and verification. Available under-approximation methods applicable to continuous-time linear systems typically assume the ability to compute transition matrices and their integrals exactly, which is not feasible in general, and/or suffer from high computational costs. In this note, we attempt to overcome these drawbacks for a class of linear time-invariant (LTI) systems, where we propose a novel method to under-approximate finite-time forward reachable sets and tubes, utilizing approximations of the matrix exponential and its integral. In particular, we consider the class of continuous-time LTI systems with an identity input matrix and initial and input values belonging to full dimensional sets that are affine transformations of closed unit balls. The proposed method yields computationally efficient under-approximations of reachable sets and tubes, when implemented using zonotopes, with first-order convergence guarantees in the sense of the Hausdorff distance. To illustrate its performance, we implement our approach in three numerical examples, where linear systems of dimensions ranging between 2 and 200 are considered.
Recent advances of data-driven machine learning have revolutionized fields like computer vision, reinforcement learning, and many scientific and engineering domains. In many real-world and scientific problems, systems that generate data are governed by physical laws. Recent work shows that it provides potential benefits for machine learning models by incorporating the physical prior and collected data, which makes the intersection of machine learning and physics become a prevailing paradigm. In this survey, we present this learning paradigm called Physics-Informed Machine Learning (PIML) which is to build a model that leverages empirical data and available physical prior knowledge to improve performance on a set of tasks that involve a physical mechanism. We systematically review the recent development of physics-informed machine learning from three perspectives of machine learning tasks, representation of physical prior, and methods for incorporating physical prior. We also propose several important open research problems based on the current trends in the field. We argue that encoding different forms of physical prior into model architectures, optimizers, inference algorithms, and significant domain-specific applications like inverse engineering design and robotic control is far from fully being explored in the field of physics-informed machine learning. We believe that this study will encourage researchers in the machine learning community to actively participate in the interdisciplinary research of physics-informed machine learning.
We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often refereed to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of hitherto attempts at handling uncertainty in general and formalizing this distinction in particular.
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings.
Many tasks in natural language processing can be viewed as multi-label classification problems. However, most of the existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all the labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are further implemented for prediction. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method can obtain more accurate multi-label classification results.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.