Model selection is a ubiquitous problem that arises in the application of many statistical and machine learning methods. In the likelihood and related settings, it is typical to use the method of information criteria (IC) to choose the most parsimonious among competing models by penalizing the likelihood-based objective function. Theorems guaranteeing the consistency of IC can be difficult to verify, and their conditions tend to be specific and bespoke. We present a set of results that guarantee consistency for a class of IC, which we call PanIC (from the Greek root 'pan', meaning 'of everything'), with easily verifiable regularity conditions. The PanIC are applicable in any loss-based learning problem and are not exclusive to likelihood problems. We illustrate the verification of regularity conditions for model selection problems regarding finite mixture models, least absolute deviation and support vector regression, and principal component analysis, and we demonstrate the effectiveness of the PanIC for such problems via numerical simulations. Furthermore, we present new sufficient conditions for the consistency of BIC-like estimators and provide comparisons of the BIC to PanIC.
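In the generic form such criteria take (our notation, for orientation; the PanIC penalties and their conditions are more specific than this), one selects the model minimizing an empirical risk plus a complexity penalty:

```latex
\mathrm{IC}_n(m) \;=\; \hat{R}_n(m) \;+\; \lambda_n\,\mathrm{pen}(m),
\qquad
\hat{m}_n \;=\; \operatorname*{arg\,min}_{m \in \mathcal{M}} \mathrm{IC}_n(m),
```

where $\hat{R}_n(m)$ is the minimized loss of model $m$ on $n$ observations, $\mathrm{pen}(m)$ measures model complexity, and $\lambda_n$ is a penalty weight; the familiar AIC and BIC arise from particular likelihood-based choices of these quantities.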
In consensus clustering, a clustering algorithm is used in combination with a subsampling procedure to detect stable clusters. Previous studies on both simulated and real data suggest that consensus clustering outperforms native algorithms. Here, we extend consensus clustering to allow for attribute weighting in the calculation of pairwise distances using existing regularised approaches. We propose a procedure for the calibration of the number of clusters (and regularisation parameter) by maximising a novel consensus score calculated directly from consensus clustering outputs, making the calibration extremely computationally competitive. Our simulation study shows better clustering performance of (i) models calibrated by maximising our consensus score compared to existing calibration scores, and (ii) weighted compared to unweighted approaches in the presence of features that do not contribute to cluster definition. Application to real gene expression data measured in lung tissue reveals clear clusters corresponding to different lung cancer subtypes. The R package sharp (version 1.4.0) is available on CRAN.
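As an illustration of the general consensus-clustering mechanism (a minimal sketch, not the sharp package's implementation; `cluster_fn` and all parameter names are ours):

```python
import numpy as np

def consensus_matrix(X, cluster_fn, n_subsamples=50, frac=0.8, seed=0):
    """Entry (i, j): among the subsamples containing both i and j, the
    fraction in which the two points were assigned to the same cluster.
    `cluster_fn` maps a data matrix to a vector of integer labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # times i and j were co-clustered
    both = np.zeros((n, n))       # times i and j were co-sampled
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = np.asarray(cluster_fn(X[idx]))
        both[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    return np.where(both > 0, together / np.maximum(both, 1), 0.0)
```

Stable clusterings yield near-binary consensus matrices; a consensus score in the spirit of the one above summarizes that stability, and the number of clusters (and regularisation parameter) is then chosen to maximise it.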
Focusing on stochastic programming (SP) with covariate information, this paper proposes an empirical risk minimization (ERM) method embedded within a nonconvex piecewise affine decision rule (PADR), which aims to learn the direct mapping from features to optimal decisions. We establish a nonasymptotic consistency result for our PADR-based ERM model on unconstrained problems and an asymptotic consistency result on constrained ones. To solve the nonconvex and nondifferentiable ERM problem, we develop an enhanced stochastic majorization-minimization algorithm and establish its asymptotic convergence to (composite strong) directional stationarity, along with a complexity analysis. We show that the proposed PADR-based ERM method applies to a broad class of nonconvex SP problems with theoretical consistency guarantees and computational tractability. Our numerical study demonstrates the superior performance of PADR-based ERM methods compared to state-of-the-art approaches under various settings, with significantly lower costs, less computation time, and robustness to feature dimensions and nonlinearity of the underlying dependency.
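Concretely, a common representation of a scalar-output piecewise affine decision rule is a difference of two max-affine functions; the sketch below is our illustration of that representation, not the paper's code (all names are ours):

```python
import numpy as np

def padr(x, A, b, C, d):
    """Piecewise affine decision rule as a difference of max-affine maps:
    f(x) = max_i (A[i] @ x + b[i]) - max_j (C[j] @ x + d[j]).
    Nonconvex in general, since it is a difference of convex functions."""
    return np.max(A @ x + b) - np.max(C @ x + d)
```

For example, the rule $f(x) = |x|$ is recovered with `A = [[1], [-1]]`, `b = [0, 0]` and a trivial second piece; richer choices of the affine pieces let the rule approximate nonconvex feature-to-decision mappings.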
Probabilistic programming languages rely fundamentally on some notion of sampling, and this is doubly true for probabilistic programming languages which perform Bayesian inference using Monte Carlo techniques. Verifying samplers - proving that they generate samples from the correct distribution - is crucial to the use of probabilistic programming languages for statistical modelling and inference. However, the typical denotational semantics of probabilistic programs is incompatible with deterministic notions of sampling. This is problematic, considering that most statistical inference is performed using pseudorandom number generators. We present a higher-order probabilistic programming language centred on the notion of samplers and sampler operations. We give this language an operational and denotational semantics in terms of continuous maps between topological spaces. Our language also supports discontinuous operations, such as comparisons between reals, by using the type system to track discontinuities. This feature might be of independent interest, for example in the context of differentiable programming. Using this language, we develop tools for the formal verification of sampler correctness. We present an equational calculus to reason about equivalence of samplers, and a sound calculus to prove semantic correctness of samplers, i.e. that a sampler correctly targets a given measure by construction.
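To make the notion of a sampler "targeting" a measure concrete, here is a classical example outside the paper's formal calculus: inverse-transform sampling pushes the uniform measure on [0, 1] forward to an exponential distribution. Below we only check this empirically; the point of the paper is to prove such correctness facts by construction (the code and names are ours):

```python
import numpy as np

def exponential_sampler(u, rate):
    """Inverse-CDF transform: if u ~ Uniform(0, 1), the output is
    distributed Exponential(rate), since F^{-1}(u) = -log(1 - u) / rate."""
    return -np.log1p(-u) / rate

# Empirical sanity check with a pseudorandom source of uniforms.
rng = np.random.default_rng(0)
samples = exponential_sampler(rng.uniform(size=100_000), rate=2.0)
```

The sample mean should be close to 1/rate = 0.5; an equational calculus such as the paper's replaces this statistical spot-check with a proof that the sampler targets the exponential measure.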
Feature selection is the problem of selecting a subset of features for a machine learning model that maximizes model quality subject to a budget constraint. For neural networks, prior methods, including those based on $\ell_1$ regularization, attention, and other techniques, typically select the entire feature subset in one evaluation round, ignoring the residual value of features during selection, i.e., the marginal contribution of a feature given that other features have already been selected. We propose a feature selection algorithm called Sequential Attention that achieves state-of-the-art empirical results for neural networks. This algorithm is based on an efficient one-pass implementation of greedy forward selection and uses attention weights at each step as a proxy for feature importance. We give theoretical insights into our algorithm for linear regression by showing that an adaptation to this setting is equivalent to the classical Orthogonal Matching Pursuit (OMP) algorithm, and thus inherits all of its provable guarantees. Our theoretical and empirical analyses offer new explanations for the effectiveness of attention and its connections to overparameterization, which may be of independent interest.
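In the linear-regression setting referenced above, greedy forward selection with exact refitting is Orthogonal Matching Pursuit; a minimal sketch (our implementation, not the paper's code):

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily select k columns of X."""
    selected, residual = [], y.copy()
    for _ in range(k):
        # Score each feature by its correlation with the current residual,
        # i.e. its marginal (residual) value given the features chosen so far.
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf
        selected.append(int(np.argmax(scores)))
        # Refit least squares on the selected features and update the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected
```

Sequential Attention replaces the correlation score with attention weights learned during training, which plays the analogous role of a feature-importance proxy at each greedy step.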
An important issue in many multivariate regression problems is eliminating candidate predictors whose coefficient vectors are null. In the large-dimensional (LD) setting, where the numbers of responses and predictors are both large, model selection encounters a scalability challenge. Knock-one-out (KOO) statistics hold promise for meeting this challenge. In this paper, the almost sure limits and the central limit theorem of the KOO statistics are derived under the LD setting and mild distributional assumptions on the errors (finite fourth moments). These theoretical results guarantee the strong consistency of a subset selection rule based on the KOO statistics with a general threshold. To enhance the robustness of the selection rule, we also propose a bootstrap threshold for the KOO approach. Simulation results support our conclusions and demonstrate that the selection probabilities of the KOO approach with the bootstrap threshold outperform those of methods using the Akaike information, Bayesian information, and Mallows's C$_p$ thresholds. We apply the proposed KOO approach and the information-threshold-based methods to a chemometrics dataset and a yeast cell-cycle dataset; the results suggest that our proposed method identifies useful models.
This paper concerns the efficient implementation of a method for optimal binary labeling of graph vertices, originally proposed by Malmberg and Ciesielski (2020). This method finds, in quadratic time with respect to graph size, a labeling that globally minimizes an objective function based on the $L_\infty$-norm. The method enables global optimization for a novel class of optimization problems, with high relevance in application areas such as image processing and computer vision. In the original formulation, the Malmberg-Ciesielski algorithm is unfortunately very computationally expensive, limiting its utility in practical applications. Here, we present a modified version of the algorithm that exploits redundancies in the original method to reduce computation time. While our proposed method has the same theoretical asymptotic time complexity, we demonstrate that it is substantially more efficient in practice. Even for small problems, we observe a speedup of 4-5 orders of magnitude. This reduction in computation time makes the Malmberg-Ciesielski method a viable option for many practical applications.
This paper considers the robust phase retrieval problem, which can be cast as a nonsmooth and nonconvex optimization problem. We propose a new inexact proximal linear algorithm, in which the subproblem is solved only approximately. Our contributions are two adaptive stopping criteria for the subproblem. The convergence behavior of the proposed methods is analyzed. Through experiments on both synthetic and real datasets, we demonstrate that our methods are much more efficient than existing methods, such as the original proximal linear algorithm and the subgradient method.
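For orientation, here is the robust phase retrieval objective together with a plain subgradient step, i.e. the baseline method mentioned above rather than the proposed inexact proximal linear algorithm (a minimal sketch; all names are ours):

```python
import numpy as np

def robust_pr_loss(A, b, x):
    """Robust phase retrieval objective: mean_i | <a_i, x>^2 - b_i |.
    Nonsmooth (absolute value) and nonconvex (quadratic inside)."""
    return np.mean(np.abs((A @ x) ** 2 - b))

def subgradient_step(A, b, x, step):
    """One subgradient descent step on the loss above."""
    r = (A @ x) ** 2 - b
    # A subgradient of the loss: mean_i sign(r_i) * 2 <a_i, x> a_i.
    g = A.T @ (np.sign(r) * 2.0 * (A @ x)) / len(b)
    return x - step * g
```

The proximal linear approach instead linearizes the inner quadratic and solves a convex model subproblem at each iteration; the paper's contribution is to solve those subproblems inexactly with adaptive stopping criteria.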
We consider generalized Bayesian inference on stochastic processes and dynamical systems with potentially long-range dependency. Given a sequence of observations, a class of parametrized model processes with a prior distribution, and a loss function, we specify the generalized posterior distribution. The problem of frequentist posterior consistency is concerned with whether, as more and more samples are observed, the posterior distribution on parameters will asymptotically concentrate on the "right" parameters. We show that posterior consistency can be derived using a combination of classical large deviation techniques, such as Varadhan's lemma, conditional/quenched large deviations, annealed large deviations, and exponential approximations. We show that the posterior distribution will asymptotically concentrate on parameters that minimize the expected loss and a divergence term, and we identify the divergence term as the Donsker-Varadhan relative entropy rate from process-level large deviations. As an application, we prove new quenched and annealed large deviation asymptotics and new Bayesian posterior consistency results for a class of mixing stochastic processes. In the case of Markov processes, one can obtain explicit conditions for posterior consistency whenever estimates of log-Sobolev constants are available, which makes our framework essentially a black box. We also recover state-of-the-art posterior consistency on classical dynamical systems with a simple proof. Our approach has the potential to prove posterior consistency for a wide range of Bayesian procedures in a unified way.
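In symbols, a standard form of the generalized (Gibbs) posterior is the following (our notation; the paper's exact specification may differ):

```latex
\pi_n(d\theta \mid x_{1:n}) \;\propto\; \exp\!\big(-\ell_n(\theta, x_{1:n})\big)\,\pi_0(d\theta),
```

and posterior consistency asks that $\pi_n$ concentrate, as $n \to \infty$, on $\operatorname*{arg\,min}_\theta \{\bar{\ell}(\theta) + h(\theta)\}$, where $\ell_n$ is the cumulative loss, $\pi_0$ is the prior, $\bar{\ell}$ is the expected loss rate, and $h$ is the divergence term identified above as a Donsker-Varadhan relative entropy rate.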
We establish the existence and consistency of a maximum likelihood estimator for the joint probability distribution of random parameters in discrete-time abstract parabolic systems, taking a nonparametric approach in the context of a mixed effects statistical model and using a Prohorov metric framework on a set of feasible measures. We also establish a theoretical convergence result for a finite dimensional approximation scheme for computing the maximum likelihood estimator, and we demonstrate the efficacy of the approach by applying the scheme to the transdermal transport of alcohol modeled by a random parabolic PDE. Numerical studies show that the maximum likelihood estimator is statistically consistent: in an example involving simulated data, the estimated distribution is observed to converge to the "true" distribution. The algorithm developed is then applied to two datasets collected using two different transdermal alcohol biosensors. Using the leave-one-out cross-validation method, we obtain an estimate for the distribution of the random parameters based on a training set. The input from a test drinking episode is then used to quantify the uncertainty propagated from the random parameters to the output of the model, in the form of a 95% error band surrounding the estimated output signal.
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-\beta^{n})/(1-\beta)$, where $n$ is the number of samples and $\beta \in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that, when trained with the proposed class-balanced loss, the network achieves significant performance gains on long-tailed datasets.
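The stated formula translates directly into per-class weights; below is a minimal sketch (the normalization so that weights sum to the number of classes is a common convention we assume, not something stated above):

```python
import numpy as np

def effective_number(n, beta):
    """Effective number of samples: (1 - beta**n) / (1 - beta)."""
    return (1.0 - beta ** n) / (1.0 - beta)

def class_balanced_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number of samples,
    normalized so the weights sum to the number of classes (our convention)."""
    eff = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / eff
    return w * len(counts) / w.sum()
```

These weights multiply the per-class loss terms; as $\beta \to 0$ the scheme degenerates to uniform weighting, while as $\beta \to 1$ the effective number approaches the raw count $n$ and the scheme approaches inverse-frequency re-weighting.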