A wide variety of model explanation approaches have been proposed in recent years, guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is an instance of amortized interpretability, in which a neural network acts as a selector to allow fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularised maximum likelihood for our general model. We propose new datasets with ground-truth selection, which allow the evaluation of feature importance maps. Using these datasets, we show experimentally that using multiple imputation provides more reasonable interpretations.
The increasing requirements for data protection and privacy have attracted substantial research interest in distributed artificial intelligence, and specifically in federated learning, an emerging machine learning approach that allows a model to be built collaboratively by several participants who hold their own private data. In the initial proposal of federated learning, the architecture was centralised and aggregation was performed with federated averaging, meaning that a central server orchestrates the federation using the most straightforward averaging strategy. This research focuses on testing different federated strategies in a peer-to-peer environment. The authors propose various aggregation strategies for federated learning, including weighted averaging aggregation, using different factors and strategies based on participant contribution. The strategies are tested with varying data sizes to identify the most robust ones. This research tests the strategies on several biomedical datasets, and the results of the experiments show that the accuracy-based weighted average outperforms the classical federated averaging method.
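The weighted aggregation strategies discussed above can be sketched minimally as follows, assuming participant models are flattened into parameter vectors; the participant counts and accuracy scores are illustrative, not from the paper's experiments:

```python
import numpy as np

def weighted_average(params, weights):
    """Aggregate participant parameter vectors by normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                    # normalize to a convex combination
    stacked = np.stack(params)         # shape: (n_participants, n_params)
    return (w[:, None] * stacked).sum(axis=0)

# Classical FedAvg weights each participant by local data size; the
# accuracy-based variant weights by local validation accuracy instead.
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
fedavg = weighted_average(params, weights=[100, 300])     # data-size weights
acc_based = weighted_average(params, weights=[0.9, 0.6])  # accuracy weights
```

In a peer-to-peer setting the same aggregation step would be executed by each participant on the parameter vectors received from its neighbours, rather than by a central server.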
Prediction is a central problem in statistics, and there is currently renewed interest in the so-called predictive approach in Bayesian statistics. What is the latter about? One has to return to foundational concepts, which we do in this paper, starting from the role of exchangeability and reviewing forms of partial exchangeability for more structured data, with the aim of discussing their use and implications in Bayesian statistics. We highlight the underlying concept that, in Bayesian statistics, a predictive rule is meant as a learning rule - how one transfers information from past observations to future events. This concept has implications for the use of exchangeability and pervades all statistical problems, including inference. It applies to classic contexts as well as to less explored situations, such as the use of predictive algorithms that can be read as Bayesian learning rules. The paper offers a historical overview, but also includes a few new results, presents some recent developments, and poses some open questions.
What is the best paradigm to recognize objects -- discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data. We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.
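The generative-classifier principle underlying the comparison above can be illustrated with a toy sketch: instead of a text-to-image model, each class is represented by a simple class-conditional density, and a test point is assigned to the class whose model explains it best. The class names and Gaussian models are hypothetical stand-ins, not the paper's pipeline:

```python
import numpy as np

# Toy class-conditional generative models: one 2-D Gaussian per class.
means = {"cat": np.array([0.0, 0.0]), "dog": np.array([3.0, 3.0])}

def log_likelihood(x, mean):
    # Isotropic Gaussian log-density up to a constant shared by all classes.
    return -0.5 * np.sum((x - mean) ** 2)

def generative_classify(x):
    # Generative classification: pick the class whose model explains x best.
    return max(means, key=lambda c: log_likelihood(x, means[c]))

pred = generative_classify(np.array([2.8, 3.1]))  # near the "dog" mean
```

Diffusion-based classifiers follow the same argmax-over-class-conditional-scores recipe, with the denoising objective conditioned on each class prompt playing the role of the log-likelihood.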
Selecting the appropriate number of clusters is a critical step in applying clustering algorithms. To assist in this process, various cluster validity indices (CVIs) have been developed. These indices are designed to identify the optimal number of clusters within a dataset. However, users may not always seek the absolute optimal number of clusters but rather a secondary option that better aligns with their specific applications. This realization has led us to introduce a Bayesian cluster validity index (BCVI), which builds upon existing indices. The BCVI utilizes either Dirichlet or generalized Dirichlet priors, resulting in the same posterior distribution. We evaluate our BCVI using the Wiroonsri index for hard clustering and the Wiroonsri-Preedasawakul index for soft clustering as underlying indices. We compare the performance of our proposed BCVI with that of the original underlying indices and several other existing CVIs, including the Davies-Bouldin, Starczewski, Xie-Beni, and KWON2 indices. Our BCVI offers clear advantages in situations where user expertise is valuable, allowing users to specify their desired range for the final number of clusters. To illustrate this, we conduct experiments covering three different scenarios. Additionally, we showcase the practical applicability of our approach through real-world datasets, such as MRI brain tumor images. These tools will be published as a new R package 'BayesCVI'.
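The idea of combining an underlying index with a user prior can be sketched roughly as follows. This is a hypothetical simplification, not the paper's BCVI formula: the index curve is normalized into a probability-like vector over candidate numbers of clusters, combined with a Dirichlet prior by conjugacy, and the posterior mean is reported; the normalization, the pseudo-sample size `n`, and the assumption that larger index values are better are all illustrative choices:

```python
import numpy as np

def bcvi_sketch(index_values, alpha, n=50):
    """Hedged sketch: blend a CVI curve with a Dirichlet prior.

    index_values[i] is the underlying index at the i-th candidate number
    of clusters (assumed larger-is-better); alpha encodes the user's
    prior preference over the same candidates.
    """
    v = np.asarray(index_values, dtype=float)
    r = (v - v.min()) / (v - v.min()).sum()  # probability-like normalization
    alpha = np.asarray(alpha, dtype=float)
    # Dirichlet-multinomial-style posterior mean over the candidates.
    return (alpha + n * r) / (alpha.sum() + n)

# The raw index slightly prefers the second candidate; a strong user
# prior on the last candidate shifts the posterior choice.
post_flat = bcvi_sketch([0.2, 0.5, 0.45, 0.4], alpha=[1, 1, 1, 1])
post = bcvi_sketch([0.2, 0.5, 0.45, 0.4], alpha=[1, 1, 1, 20])
```

The example shows the mechanism the abstract describes: with a flat prior the underlying index decides, while an informative prior lets the user steer the final number of clusters toward their preferred range.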
Motivated by the important statistical role of sparsity, the paper uncovers four reparametrizations for covariance matrices in which sparsity is associated with conditional independence graphs in a notional Gaussian model. The intimate relationship between the Iwasawa decomposition of the general linear group and the open cone of positive definite matrices allows a unifying perspective. Specifically, the positive definite cone can be reconstructed without loss or redundancy from the exponential map applied to four Lie subalgebras determined by the Iwasawa decomposition of the general linear group. This accords geometric interpretations to the reparametrizations and the corresponding notion of sparsity. Conditions that ensure legitimacy of the reparametrizations for statistical models are identified. While the focus of this work is on understanding population-level structure, there are strong methodological implications. In particular, since the population-level sparsity manifests in a vector space, imposition of sparsity on relevant sample quantities produces a covariance estimate that respects the positive definite cone constraint.
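The general principle at work, sparsity imposed in a vector space of parameters that maps back onto the positive definite cone without constraint violations, can be illustrated with the classical Cholesky reparametrization. This is an illustration of the principle only, not one of the paper's four Iwasawa-based reparametrizations:

```python
import numpy as np

def pd_from_params(strict_lower, log_diag):
    """Map unconstrained (possibly sparse) parameters to a PD covariance.

    strict_lower: strictly lower-triangular entries, living in a vector
    space where zeros encode sparsity. log_diag: logarithms of the
    Cholesky diagonal, so the diagonal is positive by construction.
    """
    p = len(log_diag)
    L = np.zeros((p, p))
    L[np.tril_indices(p, k=-1)] = strict_lower
    L[np.diag_indices(p)] = np.exp(log_diag)
    return L @ L.T  # positive definite for any parameter values

# Any parameter values, including sparse ones, yield a valid covariance.
Sigma = pd_from_params(strict_lower=[0.5, 0.0, -0.3], log_diag=[0.0, 0.1, -0.2])
eigvals = np.linalg.eigvalsh(Sigma)
```

Because the parameters are unconstrained, shrinking or zeroing entries of `strict_lower` never leaves the positive definite cone, which is exactly the methodological payoff the abstract points to.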
Composite quantile regression has been used to obtain robust estimators of regression coefficients in linear models with good statistical efficiency. By revealing an intrinsic link between the composite quantile regression loss function and the Wasserstein distance from the residuals to the set of quantiles, we establish a generalization of composite quantile regression to the multiple-output setting. Theoretical convergence rates of the proposed estimator are derived both under the setting where the additive error possesses only a finite $\ell$-th moment (for $\ell > 2$) and where it exhibits a sub-Weibull tail. In doing so, we develop novel techniques for analyzing the M-estimation problem that involves the Wasserstein distance in the loss. Numerical studies confirm the practical effectiveness of our proposed procedure.
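For reference, the classical single-output composite quantile regression objective that the paper generalizes sums the pinball loss over several quantile levels, with a shared slope and one intercept per level. A minimal sketch (the data and quantile levels are illustrative):

```python
import numpy as np

def pinball(u, tau):
    # Quantile (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

def cqr_loss(beta, intercepts, X, y, taus):
    """Classical composite quantile regression objective: a shared slope
    beta combined with one intercept b_k per quantile level tau_k."""
    total = 0.0
    for b_k, tau in zip(intercepts, taus):
        total += pinball(y - X @ beta - b_k, tau).sum()
    return total

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
# Perfect fit: all residuals are zero at every quantile level.
loss = cqr_loss(np.array([1.0]), intercepts=[0.0, 0.0], X=X, y=y, taus=[0.25, 0.75])
# Zero slope leaves positive residuals u, each penalized by u * tau.
loss2 = cqr_loss(np.array([0.0]), intercepts=[0.0, 0.0], X=X, y=y, taus=[0.25, 0.75])
```

The paper's multiple-output extension replaces this sum of per-level pinball losses with a Wasserstein distance from the residual vectors to the set of quantiles.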
We present an information-theoretic lower bound for the problem of parameter estimation with time-uniform coverage guarantees. Via a new reduction to sequential testing, we obtain stronger lower bounds that capture the hardness of the time-uniform setting. In the case of location model estimation, logistic regression, and exponential family models, our $\Omega(\sqrt{n^{-1}\log \log n})$ lower bound is sharp to within constant factors in typical settings.
This article studies structure-preserving discretizations of Hilbert complexes with nonconforming spaces that rely on projections onto an underlying conforming subcomplex. This approach follows the conforming/nonconforming Galerkin (CONGA) method introduced in [doi.org/10.1090/mcom/3079, doi.org/10.5802/smai-jcm.20, doi.org/10.5802/smai-jcm.21] to derive efficient structure-preserving finite element schemes for the time-dependent Maxwell and Maxwell-Vlasov systems by relaxing the curl-conforming constraint in finite element exterior calculus (FEEC) spaces. Here, it is extended to the discretization of full Hilbert complexes with possibly nontrivial harmonic fields, and the properties of the CONGA Hodge Laplacian operator are investigated. By using block-diagonal mass matrices which may be locally inverted, this framework possesses a canonical sequence of dual commuting projection operators which are local, and it naturally yields local discrete coderivative operators, in contrast to conforming FEEC discretizations. The resulting CONGA Hodge Laplacian operator is also local, and its kernel consists of the same discrete harmonic fields as the underlying conforming operator, provided that a symmetric stabilization term is added to handle the space nonconformities. Under the assumption that the underlying conforming subcomplex admits a bounded cochain projection, and that the conforming projections are stable with moment-preserving properties, a priori convergence results are established for both the CONGA Hodge Laplace source and eigenvalue problems. Our theory is finally illustrated with a spectral element method, and numerical experiments are performed which corroborate our results. Applications to spline finite elements on multi-patch mapped domains are described in a related article [arXiv:2208.05238] for which the present work provides a theoretical background.
We construct an estimator $\widehat{\Sigma}$ for the covariance matrix of an unknown, centred random vector X, with the given data consisting of N independent measurements $X_1,...,X_N$ of X and the desired confidence level. We show that, under minimal assumptions on X, the estimator performs with optimal accuracy with respect to the operator norm. In addition, the estimator is also optimal with respect to direction-dependent accuracy: $\langle \widehat{\Sigma}u,u\rangle$ is an optimal estimator of $\sigma^2(u)=\mathbb{E}\langle X,u\rangle^2$ whenever $\sigma^2(u)$ is ``large''.
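The directional quantity $\sigma^2(u)$ can be estimated robustly with a classical median-of-means construction, shown here only to illustrate the direction-dependent viewpoint; it is not the paper's estimator, whose construction and guarantees are stronger:

```python
import numpy as np

def mom_directional_variance(X, u, n_blocks=10):
    """Median-of-means estimate of sigma^2(u) = E <X, u>^2.

    A classical robust technique: split the samples into blocks, average
    <X_i, u>^2 within each block, and take the median of the block means.
    """
    proj_sq = (X @ u) ** 2                        # <X_i, u>^2 per sample
    blocks = np.array_split(proj_sq, n_blocks)    # disjoint blocks
    return np.median([b.mean() for b in blocks])  # median of block means

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))                # true covariance: identity
u = np.zeros(5); u[0] = 1.0
est = mom_directional_variance(X, u)              # close to sigma^2(u) = 1
```

The median step makes the estimate insensitive to a small number of grossly corrupted blocks, which is the standard route to confidence-level-dependent guarantees under weak moment assumptions.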
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
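The implicit-regularization phenomenon surveyed above has a concrete elementary instance: for overparametrized least squares, gradient descent initialized at zero converges to the minimum-norm solution that interpolates the training data. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                 # overparametrized: more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolating solution via the pseudoinverse; gradient
# descent on the squared loss, started at zero, converges to this same w.
w = np.linalg.pinv(X) @ y

residual = np.linalg.norm(X @ w - y)   # ~0: the training data are fit exactly
```

This is the simplest setting in which the survey's decomposition applies: the interpolant fits noisy labels perfectly, yet in favorable regimes the "spiky" component absorbing the noise does not harm prediction accuracy.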