This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which is possibly different for one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
We study streaming algorithms in the white-box adversarial stream model, where the internal state of the streaming algorithm is revealed to an adversary who adaptively generates the stream updates, but the algorithm obtains fresh randomness unknown to the adversary at each time step. We incorporate cryptographic assumptions to construct robust algorithms against such adversaries. We propose efficient algorithms for sparse recovery of vectors, low rank recovery of matrices and tensors, as well as low rank plus sparse recovery of matrices, i.e., robust PCA. Unlike deterministic algorithms, our algorithms can report when the input is not sparse or low rank even in the presence of such an adversary. We use these recovery algorithms to improve upon and solve new problems in numerical linear algebra and combinatorial optimization on white-box adversarial streams. For example, we give the first efficient algorithm for outputting a matching in a graph with insertions and deletions to its edges provided the matching size is small, and otherwise we declare the matching size is large. We also improve the approximation versus memory tradeoff of previous work for estimating the number of non-zero elements in a vector and computing the matrix rank.
The aim of this paper is to describe a novel non-parametric noise reduction technique from the point of view of Bayesian inference that may automatically improve the signal-to-noise ratio of one- and two-dimensional data, such as e.g. astronomical images and spectra. The algorithm iteratively evaluates possible smoothed versions of the data, the smooth models, obtaining an estimation of the underlying signal that is statistically compatible with the noisy measurements. Iterations stop based on the evidence and the $\chi^2$ statistic of the last smooth model, and we compute the expected value of the signal as a weighted average of the whole set of smooth models. In this paper, we explain the mathematical formalism and numerical implementation of the algorithm, and we evaluate its performance in terms of the peak signal to noise ratio, the structural similarity index, and the time payload, using a battery of real astronomical observations. Our Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) yields results that, without any parameter tuning, are comparable to standard image processing algorithms whose parameters have been optimized based on the true signal to be recovered, something that is impossible in a real application. State-of-the-art non-parametric methods, such as BM3D, offer slightly better performance at high signal-to-noise ratio, while our algorithm is significantly more accurate for extremely noisy data (higher than $20-40\%$ relative errors, a situation of particular interest in the field of astronomy). In this range, the standard deviation of the residuals obtained by our reconstruction may become more than an order of magnitude lower than that of the original measurements. The source code needed to reproduce all the results presented in this report, including the implementation of the method, is publicly available at //github.com/PabloMSanAla/fabada
Machine learning algorithms dedicated to financial time series forecasting have gained a lot of interest. But choosing between several algorithms can be challenging, as their estimation accuracy may be unstable over time. Online aggregation of experts combine the forecasts of a finite set of models in a single approach without making any assumption about the models. In this paper, a Bernstein Online Aggregation (BOA) procedure is applied to the construction of long-short strategies built from individual stock return forecasts coming from different machine learning models. The online mixture of experts leads to attractive portfolio performances even in environments characterised by non-stationarity. The aggregation outperforms individual algorithms, offering a higher portfolio Sharpe Ratio, lower shortfall, with a similar turnover. Extensions to expert and aggregation specialisations are also proposed to improve the overall mixture on a family of portfolio evaluation metrics.
Many food products involve mixtures of ingredients, where the mixtures can be expressed as combinations of ingredient proportions. In many cases, the quality and the consumer preference may also depend on the way in which the mixtures are processed. The processing is generally defined by the settings of one or more process variables. Experimental designs studying the joint impact of the mixture ingredient proportions and the settings of the process variables are called mixture-process variable experiments. In this article, we show how to combine mixture-process variable experiments and discrete choice experiments, to quantify and model consumer preferences for food products that can be viewed as processed mixtures. First, we describe the modeling of data from such combined experiments. Next, we describe how to generate D- and I-optimal designs for choice experiments involving mixtures and process variables, and we compare the two kinds of designs using two examples.
Model overconfidence and poor calibration are common in machine learning and difficult to account for when applying standard empirical risk minimization. In this work, we propose a novel method to alleviate these problems that we call odd-$k$-out learning (OKO), which minimizes the cross-entropy error for sets rather than for single examples. This naturally allows the model to capture correlations across data examples and achieves both better accuracy and calibration, especially in limited training data and class-imbalanced regimes. Perhaps surprisingly, OKO often yields better calibration even when training with hard labels and dropping any additional calibration parameter tuning, such as temperature scaling. We provide theoretical justification, establishing that OKO naturally yields better calibration, and provide extensive experimental analyses that corroborate our theoretical findings. We emphasize that OKO is a general framework that can be easily adapted to many settings and the trained model can be applied to single examples at inference time, without introducing significant run-time overhead or architecture changes.
The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed RL system to efficiently generate and process a massive amount of data to train intelligent agents. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. While industrial systems from OpenAI and DeepMind have achieved successful large-scale RL training, their system architecture and implementation details remain undisclosed to the community. In this paper, we present a novel abstraction on the dataflows of RL training, which unifies practical RL training across diverse applications into a general framework and enables fine-grained optimizations. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLly Scalable RL (SRL). The system architecture of SRL separates major RL computation components and allows massively parallelized training. Moreover, SRL offers user-friendly and extensible interfaces for customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries in both a single machine and a medium-sized cluster. In a large-scale cluster, the novel architecture of SRL leads to up to 3.7x speedup compared to the design choices adopted by the existing libraries. We also conduct a direct benchmark comparison to OpenAI's industrial system, Rapid, in the challenging hide-and-seek environment. SRL reproduces the same solution as reported by OpenAI with up to 5x speedup in wall-clock time. Furthermore, we also examine the performance of SRL in a much harder variant of the hide-and-seek environment and achieve substantial learning speedup by scaling SRL to over 15k CPU cores and 32 A100 GPUs. Notably, SRL is the first in the academic community to perform RL experiments at such a large scale.
Time-series clustering serves as a powerful data mining technique for time-series data in the absence of prior knowledge about clusters. A large amount of time-series data with large size has been acquired and used in various research fields. Hence, clustering method with low computational cost is required. Given that a quantum-inspired computing technology, such as a simulated annealing machine, surpasses conventional computers in terms of fast and accurately solving combinatorial optimization problems, it holds promise for accomplishing clustering tasks that are challenging to achieve using existing methods. This study proposes a novel time-series clustering method that leverages an annealing machine. The proposed method facilitates an even classification of time-series data into clusters close to each other while maintaining robustness against outliers. Moreover, its applicability extends to time-series images. We compared the proposed method with a standard existing method for clustering an online distributed dataset. In the existing method, the distances between each data are calculated based on the Euclidean distance metric, and the clustering is performed using the k-means++ method. We found that both methods yielded comparable results. Furthermore, the proposed method was applied to a flow measurement image dataset containing noticeable noise with a signal-to-noise ratio of approximately 1. Despite a small signal variation of approximately 2%, the proposed method effectively classified the data without any overlap among the clusters. In contrast, the clustering results by the standard existing method and the conditional image sampling (CIS) method, a specialized technique for flow measurement data, displayed overlapping clusters. Consequently, the proposed method provides better results than the other two methods, demonstrating its potential as a superior clustering method.
Entity resolution (ER) is the process of identifying records that refer to the same entities within one or across multiple databases. Numerous techniques have been developed to tackle ER challenges over the years, with recent emphasis placed on machine and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluations of learning-based matching algorithms has not been examined in the literature. To cover this gap, we propose four different approaches to assessing the difficulty and appropriateness of 13 established datasets: two theoretical approaches, which involve new measures of linearity and existing measures of complexity, and two practical approaches: the difference between the best non-linear and linear matchers, as well as the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks. As a result, they are not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for yielding benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore more suitable for further advancements in the field.
This paper focuses on the expected difference in borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects and hence the estimation error can be magnificent. As such, we propose another approach to construct the estimators such that the error can be greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of estimating the causal quantities between the classical estimators and the proposed estimators. The comparison is tested across a wide range of models, including linear regression models, tree-based models, and neural network-based models, under different simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction of estimation error is strikingly substantial if the causal effects are accounted for correctly.
Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. Within this framework, the forward and backward networks are able to iteratively promote or demote each other's generated instances by comparing the real and synthetic data distributions. We also develop a latent noise sampling strategy that leverages random variations at the high-level semantic space to enhance the long term dependency in the generation process. Extensive experiments based on different text generation tasks demonstrate that the proposed AMI framework can significantly outperform several strong baselines, and we also show that AMI has potential to lead to a tighter lower bound of maximum mutual information for the variational information maximization problem.