Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: $(i)$ Estimate latent variables associated to rows and columns; $(ii)$ Partition the table in blocks according to the row/column latents; $(iii)$ Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; $(iv)$ Append a compressed encoding of the latents. We evaluate it on several benchmark datasets, and study optimal compression in a probabilistic model for that tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
Mixture distributions with dynamic weights are an efficient way of modeling loss data characterized by heavy tails. However, maximum likelihood estimation of this family of models is difficult, mostly because of the need to evaluate numerically an intractable normalizing constant. In such a setup, simulation-based estimation methods are an appealing alternative. The approximate maximum likelihood estimation (AMLE) approach is employed. It is a general method that can be applied to mixtures with any component densities, as long as simulation is feasible. The focus is on the dynamic lognormal-generalized Pareto distribution, and the Cram\'er - von Mises distance is used to measure the discrepancy between observed and simulated samples. After deriving the theoretical properties of the estimators, a hybrid procedure is developed, where standard maximum likelihood is first employed to determine the bounds of the uniform priors required as input for AMLE. Simulation experiments and two real-data applications suggest that this approach yields a major improvement with respect to standard maximum likelihood estimation.
A predictive model makes outcome predictions based on some given features, i.e., it estimates the conditional probability of the outcome given a feature vector. In general, a predictive model cannot estimate the causal effect of a feature on the outcome, i.e., how the outcome will change if the feature is changed while keeping the values of other features unchanged. This is because causal effect estimation requires interventional probabilities. However, many real world problems such as personalised decision making, recommendation, and fairness computing, need to know the causal effect of any feature on the outcome for a given instance. This is different from the traditional causal effect estimation problem with a fixed treatment variable. This paper first tackles the challenge of estimating the causal effect of any feature (as the treatment) on the outcome w.r.t. a given instance. The theoretical results naturally link a predictive model to causal effect estimations and imply that a predictive model is causally interpretable when the conditions identified in the paper are satisfied. The paper also reveals the robust property of a causally interpretable model. We use experiments to demonstrate that various types of predictive models, when satisfying the conditions identified in this paper, can estimate the causal effects of features as accurately as state-of-the-art causal effect estimation methods. We also show the potential of such causally interpretable predictive models for robust predictions and personalised decision making.
Reducing the data footprint of visual content via image compression is essential to reduce storage requirements, but also to reduce the bandwidth and latency requirements for transmission. In particular, the use of compressed images allows for faster transfer of data, and faster response times for visual recognition in edge devices that rely on cloud-based services. In this paper, we first analyze the impact of image compression using traditional codecs, as well as recent state-of-the-art neural compression approaches, on three visual recognition tasks: image classification, object detection, and semantic segmentation. We consider a wide range of compression levels, ranging from 0.1 to 2 bits-per-pixel (bpp). We find that for all three tasks, the recognition ability is significantly impacted when using strong compression. For example, for segmentation mIoU is reduced from 44.5 to 30.5 mIoU when compressing to 0.1 bpp using the best compression model we evaluated. Second, we test to what extent this performance drop can be ascribed to a loss of relevant information in the compressed image, or to a lack of generalization of visual recognition models to images with compression artefacts. We find that to a large extent the performance loss is due to the latter: by finetuning the recognition models on compressed training images, most of the performance loss is recovered. For example, bringing segmentation accuracy back up to 42 mIoU, i.e. recovering 82% of the original drop in accuracy.
We present an alternating least squares type numerical optimization scheme to estimate conditionally-independent mixture models in $\mathbb{R}^n$, without parameterizing the distributions. Following the method of moments, we tackle an incomplete tensor decomposition problem to learn the mixing weights and componentwise means. Then we compute the cumulative distribution functions, higher moments and other statistics of the component distributions through linear solves. Crucially for computations in high dimensions, the steep costs associated with high-order tensors are evaded, via the development of efficient tensor-free operations. Numerical experiments demonstrate the competitive performance of the algorithm, and its applicability to many models and applications. Furthermore we provide theoretical analyses, establishing identifiability from low-order moments of the mixture and guaranteeing local linear convergence of the ALS algorithm.
To estimate causal effects, analysts performing observational studies in health settings utilize several strategies to mitigate bias due to confounding by indication. There are two broad classes of approaches for these purposes: use of confounders and instrumental variables (IVs). Because such approaches are largely characterized by untestable assumptions, analysts must operate under an indefinite paradigm that these methods will work imperfectly. In this tutorial, we formalize a set of general principles and heuristics for estimating causal effects in the two approaches when the assumptions are potentially violated. This crucially requires reframing the process of observational studies as hypothesizing potential scenarios where the estimates from one approach are less inconsistent than the other. While most of our discussion of methodology centers around the linear setting, we touch upon complexities in non-linear settings and flexible procedures such as target minimum loss-based estimation (TMLE) and double machine learning (DML). To demonstrate the application of our principles, we investigate the use of donepezil off-label for mild cognitive impairment (MCI). We compare and contrast results from confounder and IV methods, traditional and flexible, within our analysis and to a similar observational study and clinical trial.
Higher-order tensor datasets arise commonly in recommendation systems, neuroimaging, and social networks. Here we develop probable methods for estimating a possibly high rank signal tensor from noisy observations. We consider a generative latent variable tensor model that incorporates both high rank and low rank models, including but not limited to, simple hypergraphon models, single index models, low-rank CP models, and low-rank Tucker models. Comprehensive results are developed on both the statistical and computational limits for the signal tensor estimation. We find that high-dimensional latent variable tensors are of log-rank; the fact explains the pervasiveness of low-rank tensors in applications. Furthermore, we propose a polynomial-time spectral algorithm that achieves the computationally optimal rate. We show that the statistical-computational gap emerges only for latent variable tensors of order 3 or higher. Numerical experiments and two real data applications are presented to demonstrate the practical merits of our methods.
In this work, we consider a differential description of the evolution of the state of a reaction-diffusion system under environmental fluctuations. We are interested in estimating the state of the system when only partial observations are available. To describe how observations and states are related, we combine multiplicative noise-driven dynamics with an observation model. More specifically, we ensure that the observations are subjected to error in the form of additive noise. We focus on the state estimation of a Belousov-Zhabotinskii chemical reaction. We simulate a reaction conducted in a quasi-two-dimensional physical domain, such as on the surface of a Petri dish. We aim at reconstructing the emerging chemical patterns based on noisy spectral observations. For this task, we consider a finite difference representation on the spatial domain, where nodes are chosen according to observation sites. We approximate the solution to this state estimation problem with the Block particle filter, a sequential Monte Carlo method capable of addressing the associated high-dimensionality of this state-space representation.
Since the 1950s, machine translation (MT) has become one of the important tasks of AI and development, and has experienced several different periods and stages of development, including rule-based methods, statistical methods, and recently proposed neural network-based learning methods. Accompanying these staged leaps is the evaluation research and development of MT, especially the important role of evaluation methods in statistical translation and neural translation research. The evaluation task of MT is not only to evaluate the quality of machine translation, but also to give timely feedback to machine translation researchers on the problems existing in machine translation itself, how to improve and how to optimise. In some practical application fields, such as in the absence of reference translations, the quality estimation of machine translation plays an important role as an indicator to reveal the credibility of automatically translated target languages. This report mainly includes the following contents: a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the the cutting-edge progress, including human evaluation, automatic evaluation, and evaluation of evaluation methods (meta-evaluation). Manual evaluation and automatic evaluation include reference-translation based and reference-translation independent participation; automatic evaluation methods include traditional n-gram string matching, models applying syntax and semantics, and deep learning models; evaluation of evaluation methods includes estimating the credibility of human evaluations, the reliability of the automatic evaluation, the reliability of the test set, etc. Advances in cutting-edge evaluation methods include task-based evaluation, using pre-trained language models based on big data, and lightweight optimisation models using distillation techniques.
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics, for decades. Nowadays, estimating causal effect from observational data has become an appealing research direction owing to the large amount of available data and low budget requirement, compared with randomized controlled trials. Embraced with the rapidly developed machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well known causal inference framework. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including the applications in advertising, recommendation, medicine and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summarized, which facilitate researchers and practitioners to explore, evaluate and apply the causal inference methods.
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.