This study aims to show the fundamental difference between logistic regression and Bayesian classifiers in the case of exponential and non-exponential families of distributions, yielding the following findings. First, logistic regression is a less general representation of a Bayesian classifier. Second, correctly specifying the logistic regression equation requires assumptions about the class-conditional distributions. Third, in specific cases there is no difference between the predicted probabilities of a correctly specified generative Bayesian classifier and a discriminative logistic regression.
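As a standard illustration of the third point (a textbook derivation, not taken from the study itself): for two Gaussian classes with shared covariance $\Sigma$, means $\mu_0, \mu_1$, and priors $\pi_0, \pi_1$, the Bayes posterior is exactly a logistic model,
\[
P(Y=1 \mid x) = \frac{\pi_1\,\mathcal{N}(x;\mu_1,\Sigma)}{\pi_0\,\mathcal{N}(x;\mu_0,\Sigma) + \pi_1\,\mathcal{N}(x;\mu_1,\Sigma)} = \sigma\big(w^\top x + b\big),
\]
with $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b = -\tfrac{1}{2}\big(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_0^\top \Sigma^{-1} \mu_0\big) + \log(\pi_1/\pi_0)$, so the generative and discriminative predicted probabilities coincide when the Gaussian assumption is correct.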
Discrete and especially binary random variables occur in many machine learning models, notably in variational autoencoders with binary latent states and in stochastic binary networks. When learning such models, a key tool is an estimator of the gradient of the expected loss with respect to the probabilities of binary variables. The straight-through (ST) estimator gained popularity due to its simplicity and efficiency, in particular in deep networks, where unbiased estimators are impractical. Several techniques have been proposed to improve on ST while keeping the same low computational complexity: Gumbel-Softmax, ST-Gumbel-Softmax, BayesBiNN, and FouST. We conduct a theoretical analysis of the bias and variance of these methods in order to understand tradeoffs and verify the originally claimed properties. The presented theoretical results are mainly negative, showing limitations of these methods and, in some cases, revealing serious issues.
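As a concrete reference point, the following is a minimal sketch of the ST estimator as it is commonly implemented (our illustration in PyTorch, not code from the paper): the forward pass uses a hard Bernoulli sample, while the backward pass treats sampling as the identity map, which is precisely the source of the bias analyzed here.

```python
import torch

def st_bernoulli(p):
    """Straight-through Bernoulli sampling.

    Forward: a hard sample b ~ Bernoulli(p).
    Backward: gradients flow to p as if the sample were p itself,
    i.e. d(output)/d(p) is taken to be 1 (a biased estimator).
    """
    b = torch.bernoulli(p)
    # (b - p).detach() + p equals b in value, but its gradient w.r.t. p is 1.
    return (b - p).detach() + p

p = torch.full((4,), 0.3, requires_grad=True)
loss = st_bernoulli(p).sum()
loss.backward()
print(p.grad)  # tensor of ones: the loss gradient passed straight through
```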
Logistic regression is one of the most fundamental methods for modeling the probability of a binary outcome based on a collection of covariates. However, the classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure. This necessitates the development of models that can simultaneously handle both the network peer-effect (arising from neighborhood interactions) and the effect of high-dimensional covariates. In this paper, we develop a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We propose a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates, which, in addition to handling the high-dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Consequently, our method is computationally efficient and, under various standard regularity conditions, our estimate attains the classical high-dimensional rate of consistency. In particular, our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical logistic regression, when the true parameter is sparse and the underlying network is not too dense. As a consequence of the general results, we derive the rates of consistency of our estimator for various natural graph ensembles, such as bounded degree graphs, sparse Erd\H{o}s-R\'{e}nyi random graphs, and stochastic block models.
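Schematically (our notation, not the paper's): with adjacency matrix $A$, peer-effect parameter $\beta$, covariate effects $\gamma$, and $\sigma$ the logistic function, each outcome is conditionally logistic given the others, and the penalized maximum pseudo-likelihood estimate takes the form
\[
(\hat\beta, \hat\gamma) \in \arg\max_{\beta, \gamma}\ \sum_{i=1}^{n} \log P_{\beta,\gamma}\big(y_i \mid y_{-i}, x_i\big) - \lambda \|\gamma\|_1, \qquad P_{\beta,\gamma}\big(y_i = 1 \mid y_{-i}, x_i\big) = \sigma\Big(\beta \sum_{j} A_{ij} y_j + x_i^\top \gamma\Big),
\]
which replaces the intractable Ising normalizing constant with a product of univariate logistic terms.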
For many inference problems in statistics and econometrics, the unknown parameter is identified by a set of moment conditions. A generic method of solving moment conditions is the Generalized Method of Moments (GMM). However, classical GMM estimation is potentially very sensitive to outliers. Robustified GMM estimators have been developed in the past, but suffer from several drawbacks: computational intractability, poor dimension-dependence, and no quantitative recovery guarantees in the presence of a constant fraction of outliers. In this work, we develop the first computationally efficient GMM estimator (under intuitive assumptions) that can tolerate a constant $\epsilon$ fraction of adversarially corrupted samples, and that has an $\ell_2$ recovery guarantee of $O(\sqrt{\epsilon})$. To achieve this, we draw upon and extend a recent line of work on algorithmic robust statistics for related but simpler problems such as mean estimation, linear regression and stochastic optimization. As two examples of the generality of our algorithm, we show how our estimation algorithm and assumptions apply to instrumental variables linear and logistic regression. Moreover, we experimentally validate that our estimator outperforms classical IV regression and two-stage Huber regression on synthetic and semi-synthetic datasets with corruption.
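For reference (standard definitions, not specific to this paper), the parameter $\theta^*$ is identified by $\mathbb{E}[g(Z;\theta^*)] = 0$, and GMM minimizes a quadratic form in the empirical moments,
\[
\hat\theta_{\mathrm{GMM}} \in \arg\min_{\theta}\ \Big(\frac{1}{n}\sum_{i=1}^{n} g(Z_i;\theta)\Big)^{\!\top} W \Big(\frac{1}{n}\sum_{i=1}^{n} g(Z_i;\theta)\Big),
\]
for a positive semidefinite weighting matrix $W$. Instrumental variables linear regression corresponds to $g\big((x,y,z);\theta\big) = z\,(y - x^\top \theta)$; a single corrupted sample can shift the empirical moments arbitrarily, which is exactly the sensitivity the robust estimator must control.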
Multiclass logistic regression is a fundamental task in machine learning with applications in classification and boosting. Previous work (Foster et al., 2018) has highlighted the importance of improper predictors for achieving "fast rates" in the online multiclass logistic regression problem without suffering exponentially from secondary problem parameters, such as the norm of the predictors in the comparison class. While Foster et al. (2018) introduced a statistically optimal algorithm, it is computationally impractical because its running time is a large polynomial in the time horizon and in the dimension of the input feature vectors. In this paper, we develop a new algorithm, FOLKLORE, for this problem, which runs significantly faster than the algorithm of Foster et al. (2018) -- the running time per iteration scales quadratically in the dimension -- at the cost of a linear dependence on the norm of the predictors in the regret bound. This yields the first practical algorithm for online multiclass logistic regression, resolving an open problem of Foster et al. (2018). Furthermore, we show that our algorithm can be applied to online bandit multiclass prediction and online multiclass boosting, yielding more practical algorithms for both problems compared to those of Foster et al. (2018), with similar performance guarantees. Finally, we also provide an online-to-batch conversion result for our algorithm.
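For concreteness (a standard setup, not notation from the paper): with weight matrix $W = (w_1, \dots, w_K)$, the multiclass logistic loss on an example $(x, y)$ is
\[
\ell(W; x, y) = -\log \frac{\exp(w_y^\top x)}{\sum_{k=1}^{K} \exp(w_k^\top x)},
\]
and an improper online learner may issue predictions outside this class while its regret is still measured against the best predictor within it.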
The one-dimensional fragment of first-order logic is obtained by restricting quantification to blocks of existential (universal) quantifiers that leave at most one variable free. We investigate this fragment over words and trees, presenting a complete classification of the complexity of its satisfiability problem for various navigational signatures, and comparing its expressive power with other important formalisms. These include the two-variable fragment with counting and the unary negation fragment.
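For instance (an illustrative formula, not taken from the abstract), $\exists y\, \exists z\, \big(E(x,y) \wedge E(y,z)\big)$ is one-dimensional, since the quantifier block $\exists y\, \exists z$ leaves only $x$ free, whereas $\exists z\, \big(E(x,z) \wedge E(y,z)\big)$ is not, since its block leaves both $x$ and $y$ free.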
Simultaneous analysis of gene expression data and genetic variants is of great interest, especially when the numbers of gene expressions and genetic variants both exceed the sample size. Since complex traits are associated with only a subset of causal genes and effective SNPs, sparse modeling of such genetic data sets is highly important. High-dimensional sparse instrumental variables models are one such useful class of association models; they describe the simultaneous relation of the gene expressions and genetic variants to complex traits. From a Bayesian viewpoint, sparsity can be favored using sparsity-enforcing priors such as spike-and-slab priors. We propose and examine a two-stage modification of the expectation propagation (EP) algorithm for approximate inference in high-dimensional sparse instrumental variables models with spike-and-slab priors. This method is an adaptation of the classical two-stage least squares method to the Bayesian context. A simulation study is performed to examine the performance of the methods. The proposed method is applied to the analysis of the mouse obesity data.
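For reference, a common form of the spike-and-slab prior (illustrative; the paper's exact parametrization may differ) mixes a point mass at zero with a diffuse slab for each coefficient,
\[
\beta_j \sim (1 - \pi)\,\delta_0 + \pi\,\mathcal{N}\big(0, \sigma_\beta^2\big), \qquad j = 1, \dots, p,
\]
so that posterior inclusion probabilities directly quantify which gene expressions and genetic variants are relevant.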
Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of why contrastive learning works is still limited. In this paper, (i) we provably show that contrastive learning outperforms the autoencoder, a classical unsupervised learning method, for both feature recovery and downstream tasks; (ii) we also illustrate the role of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations on in-domain downstream tasks but can harm performance in transfer learning. We verify our theory with numerical experiments.
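As one standard example of a contrastive objective (the InfoNCE loss, used here for illustration; it is not necessarily the exact loss analyzed in the paper), given an anchor $x$, a positive pair $x^+$, and negatives $x_1^-, \dots, x_m^-$, the representation $f$ minimizes
\[
-\log \frac{\exp\big(f(x)^\top f(x^+)/\tau\big)}{\exp\big(f(x)^\top f(x^+)/\tau\big) + \sum_{j=1}^{m} \exp\big(f(x)^\top f(x_j^-)/\tau\big)},
\]
with temperature $\tau > 0$; supervised contrastive learning additionally uses labels to select positives.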
We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to high-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.
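Schematically (our notation, following the order-agnostic objective of Uria et al., 2014), training samples a step $t$ and a uniformly random ordering $\sigma$ of the $D$ dimensions, and maximizes
\[
\mathbb{E}_{t}\,\mathbb{E}_{\sigma}\left[\frac{D}{D - t + 1} \sum_{k \in \sigma(\ge t)} \log p\big(x_k \mid x_{\sigma(<t)}\big)\right],
\]
where $x_{\sigma(<t)}$ denotes the dimensions generated before step $t$; because the sum ranges over all not-yet-generated dimensions, several of them can be decoded in parallel at test time.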
This paper considers inference in a linear regression model with random right censoring and outliers. The number of outliers can grow with the sample size while their proportion goes to zero. The model is semiparametric and we make only very mild assumptions on the distribution of the error term, contrary to most other existing approaches in the literature. We propose to penalize the estimator proposed by Stute for censored linear regression by the $\ell_1$-norm. We derive rates of convergence and establish asymptotic normality of the estimator of the regression coefficients. Our estimator has the same asymptotic variance as Stute's estimator in the censored linear model without outliers. Hence, there is no loss of efficiency as a result of robustness. Tests and confidence sets can therefore rely on the theory developed by Stute. The outlined procedure is also computationally advantageous, since it amounts to solving a convex optimization program. We also propose a second estimator which uses the proposed penalized Stute estimator as a first step to detect outliers. It has similar theoretical properties but better performance in finite samples as assessed by simulations.
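Concretely (a sketch based on the standard form of Stute's weighted least squares estimator), the proposed estimator solves the convex program
\[
\hat\beta \in \arg\min_{\beta}\ \sum_{i=1}^{n} w_{i,n}\,\big(Y_{(i)} - X_{(i)}^\top \beta\big)^2 + \lambda \|\beta\|_1,
\]
where the $w_{i,n}$ are the Kaplan-Meier jumps attached to the ordered observed responses $Y_{(i)}$, which correct for the right censoring.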
Discrete random structures are important tools in Bayesian nonparametrics, and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
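Schematically (our notation): with a common completely random measure $\mu_0$ and group-specific completely random measures $\mu_\ell$, the latent nested construction yields the dependent random probability measures
\[
\tilde p_\ell = \frac{\mu_\ell + \mu_0}{\mu_\ell(\mathbb{X}) + \mu_0(\mathbb{X})}, \qquad \ell = 1, \dots, d,
\]
so the shared component $\mu_0$ induces dependence across groups without collapsing to full exchangeability.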