We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are to carefully consider the null model as a competitor and to use the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.
Learning from imprecise labels such as "animal" or "bird", but making precise predictions like "snow bunting" at inference time is an important capability for any classifier when expertly labeled training data is scarce. Contributions by volunteers or results of web crawling lack precision in this manner, but are still valuable. Crucially, these weakly labeled examples are available in larger quantities and at lower cost than high-quality bespoke training data. CHILLAX, a recently proposed method to tackle this task, leverages a hierarchical classifier to learn from imprecise labels. However, it has two major limitations. First, it does not learn from examples labeled as the root of the hierarchy, e.g., "object". Second, an extrapolation of annotations to precise labels is only performed at test time, whereas confident extrapolations could already be used as training data. In this work, we extend CHILLAX with a self-supervised scheme using constrained semantic extrapolation to generate pseudo-labels. This addresses the second concern, which in turn solves the first problem, enabling an even weaker supervision requirement than CHILLAX. We evaluate our approach empirically, showing that our method allows for a consistent accuracy improvement of 0.84 to 1.19 percentage points over CHILLAX and is suitable as a drop-in replacement without any negative consequences such as longer training times.
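To make the idea of constrained extrapolation concrete, the following is a minimal sketch and not the CHILLAX implementation. It assumes a small hypothetical hierarchy, a hypothetical dictionary of leaf probabilities from a classifier, and a hypothetical confidence threshold of 0.8; the renormalization over the allowed leaves is likewise an assumption made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical label hierarchy: parent -> children. Labels without an entry are leaves.
HIERARCHY: Dict[str, List[str]] = {
    "object": ["animal", "vehicle"],
    "animal": ["bird", "dog"],
    "bird": ["snow bunting", "sparrow"],
    "vehicle": ["car"],
}


def leaves_under(label: str, hierarchy: Dict[str, List[str]]) -> List[str]:
    """Collect all precise (leaf) labels that are descendants of `label`."""
    children = hierarchy.get(label, [])
    if not children:
        return [label]
    leaves: List[str] = []
    for child in children:
        leaves.extend(leaves_under(child, hierarchy))
    return leaves


def extrapolate(imprecise_label: str,
                leaf_probs: Dict[str, float],
                hierarchy: Dict[str, List[str]],
                threshold: float = 0.8) -> Optional[str]:
    """Return a precise pseudo-label consistent with the annotation, or None.

    The prediction is constrained to leaves below the imprecise label and is
    accepted only if its renormalized probability reaches the threshold.
    """
    candidates = leaves_under(imprecise_label, hierarchy)
    mass = sum(leaf_probs.get(leaf, 0.0) for leaf in candidates)
    if mass <= 0.0:
        return None
    best = max(candidates, key=lambda leaf: leaf_probs.get(leaf, 0.0))
    confidence = leaf_probs.get(best, 0.0) / mass
    return best if confidence >= threshold else None


# Hypothetical classifier output over the precise classes.
probs = {"snow bunting": 0.45, "sparrow": 0.05, "dog": 0.10, "car": 0.40}
label_from_bird = extrapolate("bird", probs, HIERARCHY)
label_from_root = extrapolate("object", probs, HIERARCHY)
```

In this toy setting the annotation "bird" narrows the candidates enough for a confident pseudo-label, while the root annotation "object" leaves the prediction too uncertain and yields no pseudo-label.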
Image super-resolution (SR) is a fast-moving field with novel architectures attracting the spotlight. However, most SR models were optimized with dated training strategies. In this work, we revisit the popular RCAN model and examine the effect of different training options in SR. Surprisingly (or perhaps as expected), we show that RCAN can outperform or match nearly all the CNN-based SR architectures published after RCAN on standard benchmarks with a proper training strategy and minimal architecture change. Moreover, although RCAN is a very large SR architecture with more than four hundred convolutional layers, we draw the notable conclusion that underfitting, rather than overfitting, is still the main problem restricting model capability. We observe supporting evidence that increasing the number of training iterations clearly improves model performance, while applying regularization techniques generally degrades the predictions. We denote our simply revised RCAN as RCAN-it and recommend that practitioners use it as a baseline for future research. Code is publicly available at //github.com/zudi-lin/rcan-it.
High levels of missing data and strong class imbalance are ubiquitous challenges that are often presented simultaneously in real-world time series data. Existing methods approach these problems separately, frequently making significant assumptions about the underlying data generation process in order to lessen the impact of missing information. In this work, we instead demonstrate how a general self-supervised training method, namely Autoregressive Predictive Coding (APC), can be leveraged to overcome both missing data and class imbalance simultaneously without strong assumptions. Specifically, on a synthetic dataset, we show that standard baselines are substantially improved upon through the use of APC, yielding the greatest gains in the combined setting of high missingness and severe class imbalance. We further apply APC on two real-world medical time-series datasets, and show that APC improves the classification performance in all settings, ultimately achieving state-of-the-art AUPRC results on the Physionet benchmark.
This study concentrates on clustering problems and aims to find compact clusters that are informative regarding the outcome variable. The main goal is to partition the data points so that observations in each cluster are similar and, at the same time, the outcome variable can be predicted using these clusters. We model this semi-supervised clustering problem as a multi-objective optimization problem, with the deviation of data points within clusters and the prediction error of the outcome variable as the two objective functions to be minimized. To find optimal clustering solutions, we employ the non-dominated sorting genetic algorithm II (NSGA-II), and local regression is applied as the prediction method for the outcome variable. To assess the performance of the proposed model, we compare seven models on five real-world data sets. Furthermore, we investigate the impact of using local regression for predicting the outcome variable in all models, and examine the performance of the multi-objective models compared to single-objective models.
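Because the two objectives generally trade off, candidate clusterings are compared by Pareto dominance rather than by a single score. The sketch below illustrates the non-dominated filter that underlies non-dominated sorting, assuming both objectives are minimized. The function names and the toy objective values are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# (within-cluster deviation, prediction error); both are minimized.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Objectives]) -> List[Objectives]:
    """Return the first Pareto front: points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective values for four candidate clusterings.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = non_dominated(candidates)
```

Here the candidate (3.0, 4.0) is dominated by (2.0, 3.0) and drops out, while the other three remain as incomparable trade-offs. A full non-dominated sort would repeat this filter on the remaining points to obtain successive fronts.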
Generalized linear mixed models are useful in studying hierarchical data with possibly non-Gaussian responses. However, the intractability of likelihood functions poses challenges for estimation. We develop a new method suitable for this problem, called imputation maximization stochastic approximation (IMSA). For each iteration, IMSA first imputes latent variables/random effects, then maximizes over the complete data likelihood, and finally moves the estimate towards the new maximizer while preserving a proportion of the previous value. The limiting point of IMSA satisfies a self-consistency property and can be less biased in finite samples than the maximum likelihood estimator solved by score-equation based stochastic approximation (ScoreSA). Numerically, IMSA can also be advantageous over ScoreSA in achieving more stable convergence and respecting the parameter ranges under various transformations such as nonnegative variance components. This is corroborated through our simulation studies where IMSA consistently outperforms ScoreSA.
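The three steps of an IMSA iteration can be illustrated on a deliberately simple model. The sketch below is not the authors' implementation: it assumes a toy random-intercept model y_i = b_i + e_i with b_i ~ N(0, tau2) and e_i ~ N(0, 1), a known unit noise variance, and a step-size schedule of t^(-0.75). The function name and all parameter names are hypothetical.

```python
import random
from typing import List


def imsa_random_intercept(y: List[float],
                          tau2_init: float = 1.0,
                          n_iter: int = 300,
                          seed: int = 0) -> float:
    """Toy IMSA-style estimate of the random-effect variance tau2."""
    rng = random.Random(seed)
    tau2 = tau2_init
    for t in range(1, n_iter + 1):
        # Imputation: draw each random effect from its conditional law given y_i,
        # which is N(shrink * y_i, shrink) under the assumed toy model.
        shrink = tau2 / (tau2 + 1.0)
        b = [rng.gauss(shrink * yi, shrink ** 0.5) for yi in y]
        # Maximization: complete-data maximum likelihood estimate of tau2.
        tau2_hat = sum(bi * bi for bi in b) / len(b)
        # Stochastic approximation: move toward the new maximizer while
        # preserving a proportion (1 - gamma) of the previous value.
        gamma = t ** -0.75  # assumed step-size schedule
        tau2 = (1.0 - gamma) * tau2 + gamma * tau2_hat
    return tau2


# Simulated data with a true random-effect variance of 2 (hypothetical setup).
data_rng = random.Random(1)
y = [data_rng.gauss(0.0, 2.0 ** 0.5) + data_rng.gauss(0.0, 1.0) for _ in range(500)]
estimate = imsa_random_intercept(y)
```

Because every update is a convex combination of two nonnegative quantities, the iterate stays in the valid range for a variance component without any reparameterization. In this toy model the self-consistent point is the mean of y_i^2 minus one, which the iterate approaches as the step size shrinks.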
Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic ``expansion'' assumption, which states that a low-probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.
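One plausible way to write the expansion assumption as a formula is given below. The notation is assumed for illustration: P denotes the data distribution within one class, N(S) the neighborhood of a set S, and a and c are constants.

```latex
% Expansion condition; notation assumed for illustration.
% P: data distribution within one class, \mathcal{N}(S): neighborhood of S.
P\bigl(\mathcal{N}(S)\bigr) \;\ge\; \min\{\, c \cdot P(S),\; 1 \,\}
\quad \text{for all } S \text{ with } P(S) \le a,
\qquad a \in (0, 1),\; c > 1 .
```

Read this way, any sufficiently small set of examples has a neighborhood whose probability is larger than that of the set by at least a factor of c.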
Despite much success, deep learning generally does not perform well with small labeled training sets. In these scenarios, data augmentation has shown much promise in alleviating the need for more labeled data, but it so far has mostly been applied in supervised settings and achieved limited gains. In this work, we propose to apply data augmentation to unlabeled data in a semi-supervised learning setting. Our method, named Unsupervised Data Augmentation or UDA, encourages the model predictions to be consistent between an unlabeled example and an augmented unlabeled example. Unlike previous methods that use random noise such as Gaussian noise or dropout noise, UDA has a small twist in that it makes use of harder and more realistic noise generated by state-of-the-art data augmentation methods. This small twist leads to substantial improvements on six language tasks and three vision tasks even when the labeled set is extremely small. For example, on the IMDb text classification dataset, with only 20 labeled examples, UDA achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On standard semi-supervised learning benchmarks CIFAR-10 and SVHN, UDA outperforms all previous approaches and achieves an error rate of 2.7% on CIFAR-10 with only 4,000 examples and an error rate of 2.85% on SVHN with only 250 examples, nearly matching the performance of models trained on the full sets which are one or two orders of magnitude larger. UDA also works well on large-scale datasets such as ImageNet. When trained with 10% of the labeled set, UDA improves the top-1/top-5 accuracy from 55.1/77.3% to 68.7/88.5%. For the full ImageNet with 1.3M extra unlabeled data, UDA further pushes the performance from 78.3/94.4% to 79.0/94.5%.
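The consistency term can be sketched as a divergence between the model's predictions on an unlabeled example and on its augmented version. The sketch below assumes a KL divergence on softmax outputs and works on plain lists of logits; the function names and the example logits are hypothetical, and the augmentation itself is outside the scope of the snippet.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def consistency_loss(logits_clean: List[List[float]],
                     logits_aug: List[List[float]]) -> float:
    """Mean KL(p_clean || p_aug) over a batch of unlabeled examples."""
    total = 0.0
    for clean, aug in zip(logits_clean, logits_aug):
        log_p = log_softmax(clean)  # prediction on the original example
        log_q = log_softmax(aug)    # prediction on the augmented example
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(logits_clean)


# Hypothetical logits for a batch of two unlabeled examples and their augmentations.
clean_batch = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
aug_batch = [[1.5, 0.7, -0.8], [0.3, 0.2, 0.1]]
loss = consistency_loss(clean_batch, aug_batch)
```

In a training loop this term would be added, with a weighting factor, to the supervised loss on the labeled examples. The loss is zero when the two predictions coincide and grows as the augmented prediction drifts away from the original one.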
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.
Graph-based semi-supervised learning (SSL) is an important learning problem where the goal is to assign labels to initially unlabeled nodes in a graph. Graph Convolutional Networks (GCNs) have recently been shown to be effective for graph-based SSL problems. GCNs inherently assume existence of pairwise relationships in the graph-structured data. However, in many real-world problems, relationships go beyond pairwise connections and hence are more complex. Hypergraphs provide a natural modeling tool to capture such complex relationships. In this work, we explore the use of GCNs for hypergraph-based SSL. In particular, we propose HyperGCN, an SSL method which uses a layer-wise propagation rule for convolutional neural networks operating directly on hypergraphs. To the best of our knowledge, this is the first principled adaptation of GCNs to hypergraphs. HyperGCN is able to encode both the hypergraph structure and hypernode features in an effective manner. Through detailed experimentation, we demonstrate HyperGCN's effectiveness at hypergraph-based SSL.
Using low-dimensional vector spaces to represent words has been very effective in many NLP tasks. However, this approach does not work well for rare and unseen words. In this paper, we propose to leverage the knowledge in a semantic dictionary in combination with some morphological information to build an enhanced vector space. We get an improvement of 2.3% over the state-of-the-art HeidelTime system in temporal expression recognition, and obtain a large gain in other named entity recognition (NER) tasks. The semantic dictionary HowNet alone also shows promising results in computing lexical similarity.