This study proposes an interpretable neural network-based non-proportional odds model (N3POM) for ordinal regression. In the model, the response variable can take continuous values, and the regression coefficients vary with the response. In contrast to conventional approaches, which directly estimate the linear regression coefficients from the discrete response, we train a non-linear neural network that outputs the linear coefficients by taking the response as its input. Owing to the neural network, N3POM gains flexibility while preserving the interpretability of conventional ordinal regression. We show a sufficient condition under which the predicted conditional cumulative probability (CCP) locally satisfies the monotonicity constraint over a user-specified region in the covariate space. We also provide a monotonicity-preserving stochastic (MPS) algorithm for adequately training the neural network.
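A minimal PyTorch sketch of the core idea, not the authors' implementation (the class `CoefficientNet` and all names are hypothetical): a small network maps a response value y to response-dependent linear coefficients, and the CCP is the sigmoid of the resulting linear predictor. The paper's monotonicity condition and MPS training algorithm are omitted here.

```python
import torch
import torch.nn as nn

class CoefficientNet(nn.Module):
    """Maps a scalar response y to response-dependent linear coefficients."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, n_features + 1),  # coefficients plus intercept
        )

    def forward(self, y):
        out = self.body(y.unsqueeze(-1))     # (batch, n_features + 1)
        return out[..., :-1], out[..., -1]   # beta(y), intercept b(y)

def ccp(net, x, y):
    """Predicted conditional cumulative probability P(Y <= y | x)."""
    beta, b = net(y)
    return torch.sigmoid(b + (beta * x).sum(-1))

net = CoefficientNet(n_features=3)
x = torch.randn(8, 3)
y = torch.rand(8)
print(ccp(net, x, y).shape)  # torch.Size([8])
```

Note that nothing in this sketch forces `ccp` to be monotone in y; enforcing that constraint, at least locally, is precisely what the paper's sufficient condition and MPS algorithm address.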
Graph Neural Networks (GNNs) are able to achieve high classification accuracy on many important real-world datasets, but provide no rigorous notion of predictive uncertainty. Quantifying the confidence of GNN models is difficult due to the dependence between datapoints induced by the graph structure. We leverage recent advances in conformal prediction to construct prediction sets for node classification in inductive learning scenarios. We do so by taking an existing approach to conformal classification that relies on \textit{exchangeable} data and modifying it by appropriately weighting the conformal scores to reflect the network structure. Through experiments on standard benchmark datasets with popular GNN models, we show that our approach provides tighter and better-calibrated prediction sets than a naive application of conformal prediction.
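A sketch of the weighting mechanism, following the generic weighted-split-conformal recipe; the specific weights derived from the graph structure are the paper's contribution and are taken as given inputs here.

```python
import numpy as np

def weighted_conformal_set(cal_scores, cal_weights, test_weight,
                           test_scores, labels, alpha=0.1):
    """Prediction set for one test node via weighted split conformal.

    cal_scores:  nonconformity scores of the calibration nodes
    cal_weights: per-node weights encoding graph similarity to the test node
    test_scores: nonconformity score of each candidate label at the test node
    """
    w = np.append(cal_weights, test_weight).astype(float)
    p = w / w.sum()
    order = np.argsort(cal_scores)
    cum = np.cumsum(p[:-1][order])
    # weighted (1 - alpha)-quantile; the test point's mass sits at +infinity
    hit = np.nonzero(cum >= 1 - alpha)[0]
    qhat = np.inf if hit.size == 0 else cal_scores[order][hit[0]]
    return [y for y, s in zip(labels, test_scores) if s <= qhat]
```

With softmax outputs, a standard choice of nonconformity score is one minus the probability the model assigns to a label; uniform weights recover ordinary split conformal prediction.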
We introduce a next of kin to the generalized additive model for location, scale, and shape (GAMLSS), aiming at distribution-free and parsimonious regression modelling for arbitrary outcomes. We replace the strict parametric distribution underlying such a model with a transformation function, which in turn is estimated from data. Doing so not only makes the model distribution-free but also allows us to limit the number of linear or smooth model terms to a pair of location-scale predictor functions. We derive the likelihood for continuous, discrete, and randomly censored observations, along with the corresponding score functions. A plethora of existing algorithms is leveraged for model estimation, including constrained maximum likelihood, the original GAMLSS algorithm, and transformation trees. Parameter interpretability in the resulting models is closely connected to model selection. We propose the application of a novel best subset selection procedure to achieve especially simple ways of interpretation. All techniques are motivated and illustrated by a collection of applications from different domains, including crossing and partial proportional hazards, complex count regression, non-linear ordinal regression, and growth curves. All analyses are reproducible with the help of the "tram" add-on package to the R system for statistical computing and graphics.
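A minimal numpy sketch of a location-scale transformation model likelihood for continuous, uncensored observations, with a standard-normal reference and a monotone Bernstein-polynomial transformation. The actual implementation lives in the tram R package; this sketch only follows the general recipe and every name in it is hypothetical.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import comb

def bernstein(y, K):
    """Bernstein basis on [0, 1]; returns an (n, K+1) matrix."""
    k = np.arange(K + 1)
    return comb(K, k) * y[:, None] ** k * (1.0 - y[:, None]) ** (K - k)

def loglik(theta_raw, beta, gamma, y, X):
    """Log-likelihood of a location-scale transformation model
    F(y | x) = Phi( exp(x'gamma) * (h(y) - x'beta) )
    for continuous, uncensored y rescaled to [0, 1]; h is a monotone
    Bernstein polynomial."""
    # first coefficient free, increments positive (softplus) => h increasing
    theta = theta_raw[0] + np.concatenate(
        ([0.0], np.cumsum(np.logaddexp(0.0, theta_raw[1:]))))
    K = theta.size - 1
    h = bernstein(y, K) @ theta
    hprime = K * (bernstein(y, K - 1) @ np.diff(theta))  # h'(y) > 0
    s = np.exp(X @ gamma)                                # scale predictor
    z = s * (h - X @ beta)                               # standardized value
    # density: f(y|x) = phi(z) * s * h'(y)
    return np.sum(norm.logpdf(z) + np.log(s) + np.log(hprime))
```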
This paper investigates the case of interference, in which a unit's treatment also affects other units' outcomes. Under interference, policy evaluation has mostly relied on randomized experiments with cluster interference and binary treatments. Instead, we consider a non-experimental setting with continuous treatment and network interference. In particular, we define spillover effects by specifying the exposure to network treatment as a weighted average of the treatment received by units connected through physical, social, or economic interactions. We provide a generalized propensity score-based estimator for both the direct and spillover effects of a continuous treatment. Our estimator also accommodates asymmetric network connections characterized by heterogeneous intensities. To showcase this methodology, we investigate whether and how spillover effects shape the optimal level of policy interventions in agricultural markets. Our results show that, in this context, neglecting interference may lead to underestimating policy effectiveness.
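The following sketch implements only the exposure mapping described above, a weighted average of connected units' treatments with possibly asymmetric weights; the generalized propensity score step is omitted and the example adjacency matrix is hypothetical.

```python
import numpy as np

def network_exposure(A, t):
    """Exposure to network treatment: row-normalized weighted average of
    neighbors' treatments. A[i, j] is the (possibly asymmetric) intensity
    of the tie from unit i to unit j; t is the continuous treatment."""
    W = A / np.clip(A.sum(axis=1, keepdims=True), 1e-12, None)
    return W @ t

# hypothetical example: 4 units, directed weighted ties, continuous treatment
A = np.array([[0, 2, 0, 1],
              [1, 0, 0, 0],
              [0, 3, 0, 1],
              [0, 0, 0, 0]], dtype=float)
t = np.array([0.5, 1.2, 0.0, 2.0])
print(network_exposure(A, t))  # isolated unit 4 receives ~zero exposure
```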
Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain -- where the model will ultimately be used -- is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as `combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and -- notably -- can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
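A sketch of OT-based interpolation between two datasets, assuming the POT library (`ot`). It interpolates unlabeled point clouds via the barycentric projection of the discrete OT coupling; the paper's labeled-dataset distance, generalized geodesics, and neural OT maps are not reproduced here.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def interpolate_datasets(Xa, Xb, t):
    """Point cloud at time t on the displacement path from Xa toward Xb,
    using the barycentric projection of the optimal coupling as a map."""
    n, m = len(Xa), len(Xb)
    M = ot.dist(Xa, Xb)                                  # squared Euclidean
    G = ot.emd(np.full(n, 1 / n), np.full(m, 1 / m), M)  # OT plan
    T = (G @ Xb) / G.sum(axis=1, keepdims=True)          # barycentric map
    return (1 - t) * Xa + t * T

Xa = np.random.randn(100, 2)
Xb = np.random.randn(120, 2) + 3.0
Xmid = interpolate_datasets(Xa, Xb, t=0.5)  # synthesized 'in-between' data
```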
Correlated data are ubiquitous in today's data-driven society. While regression models for analyzing the means and variances of responses of interest are relatively well developed, models for analyzing correlations are largely confined to longitudinal data, a special form of sequentially correlated data. This paper proposes a new method for the analysis of correlations that fully exploits the use of covariates for general correlated data. In a renewed analysis of the Classroom data, a highly unbalanced multilevel clustered dataset with within-class and within-school correlations, our method reveals informative insights into these structures not previously known. In another analysis of the malaria immune response data from Benin, a longitudinal study with time-dependent covariates in which the exact times of the observations are not available, our approach again provides promising new results. At the heart of our approach is a new generalized z-transformation that converts correlation matrices, constrained to be positive definite, to vectors with unrestricted support, and that is order-invariant. These two properties enable us to develop a maximum likelihood-based regression analysis that incorporates covariates in the modelling of correlations.
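One well-known transform with both properties maps a correlation matrix to the off-diagonal entries of its matrix logarithm; whether this coincides with the paper's generalized z-transformation is an assumption of this sketch.

```python
import numpy as np

def corr_to_z(C):
    """Map a positive-definite correlation matrix to an unconstrained,
    order-invariant real vector: the upper-triangular off-diagonal
    entries of its matrix logarithm."""
    w, V = np.linalg.eigh(C)           # C symmetric positive definite
    L = (V * np.log(w)) @ V.T          # matrix log via eigendecomposition
    return L[np.triu_indices_from(C, k=1)]

C = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
print(corr_to_z(C))  # 3 unrestricted real numbers
```

For p = 2 this construction reduces to Fisher's classical z-transform; the inverse direction requires solving for the diagonal of the log-matrix so that the matrix exponential has unit diagonal, typically by fixed-point iteration.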
This paper introduces a novel, computationally efficient algorithm for predictive inference (PI) that requires no distributional assumptions on the data and can be computed faster than existing bootstrap-type methods for neural networks. Specifically, if there are $n$ training samples, bootstrap methods require training a model on each of the $n$ subsamples of size $n-1$; for large models like neural networks, this process can be computationally prohibitive. In contrast, our proposed method trains one neural network on the full dataset with $(\epsilon, \delta)$-differential privacy (DP) and then approximates each leave-one-out model efficiently using a linear approximation around the differentially private neural network estimate. With exchangeable data, we prove that our approach has a rigorous coverage guarantee that depends on the preset privacy parameters and the stability of the neural network, regardless of the data distribution. Simulations and experiments on real data demonstrate that our method satisfies the coverage guarantees with substantially reduced computation compared to bootstrap methods.
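A sketch of only the final interval construction, jackknife+-style, given approximate leave-one-out predictions; the DP training and the linear approximation that produce these inputs are the paper's contribution and are not reproduced here.

```python
import numpy as np

def jackknife_plus_interval(loo_preds, y_train, test_preds_loo, alpha=0.1):
    """Predictive interval from (approximate) leave-one-out models.

    loo_preds:      loo_preds[i] = prediction of the model-without-i at x_i
    test_preds_loo: test_preds_loo[i] = its prediction at the test point
    """
    r = np.abs(y_train - loo_preds)  # approximate LOO residuals
    # jackknife+ endpoints (finite-sample quantile correction omitted)
    lo = np.quantile(test_preds_loo - r, alpha)
    hi = np.quantile(test_preds_loo + r, 1 - alpha)
    return lo, hi
```

In the paper's scheme each `loo_preds[i]` would come from a first-order expansion around the single differentially private fit rather than from $n$ retrained models, which is the source of the computational savings.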
Our goal is to develop an efficient contact detection algorithm for large-scale GPU-based simulation of non-convex objects. Current GPU-based simulators such as IsaacGym and Brax must trade off speed against fidelity, generality, or both when simulating non-convex objects. Their main issue lies in contact detection (CD): existing CD algorithms, such as Gilbert-Johnson-Keerthi (GJK), must trade off computational speed against accuracy, and become expensive as the number of collisions among non-convex objects increases. We propose a data-driven approach to CD whose accuracy depends only on the quality and quantity of the offline dataset rather than on online computation time. Unlike GJK, our method inherently has a uniform computational flow, which facilitates efficient GPU usage via advanced compilers such as XLA (Accelerated Linear Algebra). Further, we offer a data-efficient solution by learning the collision patterns of local object-shape crops, rather than global object shapes, which are harder to learn. We demonstrate that our approach improves the efficiency of existing CD methods by a factor of 5-10 for non-convex objects with comparable accuracy. Building on previous work on contact resolution for neural-network-based contact detectors, we integrate our CD algorithm into the open-source GPU-based simulator Brax, and show that we can improve on the efficiency of IsaacGym and the generality of standard Brax. We highly recommend the videos of our simulator included in the supplementary materials.
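A sketch of what "uniform computational flow" can look like for a learned contact detector: a single batched, branch-free forward pass over fixed-size local crop features, in contrast to GJK's data-dependent iterations. The featurization and all names here are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CropContactNet(nn.Module):
    """Scores collision probability for pairs of local shape crops,
    represented here as fixed-size feature vectors (e.g., flattened
    SDF/occupancy samples). The forward pass has no data-dependent
    branching, so large batches map uniformly onto the GPU."""
    def __init__(self, crop_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * crop_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, crop_a, crop_b):
        z = torch.cat([crop_a, crop_b], dim=-1)
        return torch.sigmoid(self.mlp(z)).squeeze(-1)

net = CropContactNet()
# 4096 candidate crop pairs evaluated in one uniform, compiler-friendly pass
probs = net(torch.randn(4096, 256), torch.randn(4096, 256))
```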
Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. To improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model that makes the objective function convex and the penalty scale-invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. Our method provides a substantial improvement in protein expression prediction accuracy for individuals of African ancestry and, in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population.
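A toy sketch of cross-ancestry borrowing via a fusion penalty on group-specific coefficients. This is not the authors' estimator: their convex reparameterization, heterogeneous-variance likelihood, and approximation are all omitted, and the data are simulated placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def objective(params, X1, y1, X2, y2, lam):
    """Per-group least squares plus a fusion penalty that shrinks the
    two ancestry-specific coefficient vectors toward each other."""
    p = X1.shape[1]
    b1, b2 = params[:p], params[p:]
    rss = np.sum((y1 - X1 @ b1) ** 2) + np.sum((y2 - X2 @ b2) ** 2)
    return rss + lam * np.sum((b1 - b2) ** 2)

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(200, 5)), rng.normal(size=(40, 5))
beta = rng.normal(size=5)
y1 = X1 @ beta + rng.normal(size=200)
y2 = X2 @ (beta + 0.2) + rng.normal(size=40)   # related, not identical
fit = minimize(objective, np.zeros(10), args=(X1, y1, X2, y2, 5.0))
b1_hat, b2_hat = fit.x[:5], fit.x[5:]  # the small group borrows strength
```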
Data transformations are essential for broad applicability of parametric regression models. However, for Bayesian analysis, joint inference of the transformation and model parameters typically involves restrictive parametric transformations or nonparametric representations that are computationally inefficient and cumbersome for implementation and theoretical analysis, which limits their usability in practice. This paper introduces a simple, general, and efficient strategy for joint posterior inference of an unknown transformation and all regression model parameters. The proposed approach directly targets the posterior distribution of the transformation by linking it with the marginal distributions of the independent and dependent variables, and then deploys a Bayesian nonparametric model via the Bayesian bootstrap. Crucially, this approach delivers (1) joint posterior consistency under general conditions, including multiple model misspecifications, and (2) efficient Monte Carlo (not Markov chain Monte Carlo) inference for the transformation and all parameters for important special cases. These tools apply across a variety of data domains, including real-valued, integer-valued, compactly supported, and positive data. Simulation studies and an empirical application demonstrate the effectiveness and efficiency of this strategy for semiparametric Bayesian analysis with linear models, quantile regression, and Gaussian processes.
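A sketch of the Bayesian-bootstrap mechanics for the transformation alone, using a Gaussian reference and only the marginal of the response; the paper's link to the covariate marginals and to the regression model parameters is omitted, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def transformation_posterior_draws(y, n_draws=1000, eps=1e-3):
    """Monte Carlo (not MCMC) draws of an unknown transformation g such
    that g(Y) is approximately standard normal. Each draw perturbs the
    empirical CDF of y with Dirichlet (Bayesian-bootstrap) weights."""
    n = len(y)
    order = np.argsort(y)
    draws = np.empty((n_draws, n))
    for s in range(n_draws):
        w = np.random.dirichlet(np.ones(n))    # Bayesian-bootstrap weights
        F = np.cumsum(w[order])                # weighted ECDF at sorted y
        draws[s, order] = norm.ppf(np.clip(F, eps, 1 - eps))
    return draws  # draws[s] is the s-th posterior draw of g at observed y
```

Because each draw is an independent Dirichlet-weighted functional of the data, posterior uncertainty about g is obtained by direct simulation rather than by running a Markov chain.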
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is, counterintuitively, to train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed large models achieve higher accuracy than lightly compressed small models.
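A sketch of the compression step using standard PyTorch utilities (magnitude pruning followed by post-training dynamic int8 quantization); the model and the compression amounts are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.utils.prune as prune

# placeholder for a large trained model; in the paper's setting this would
# be a big Transformer trained for few iterations, then compressed
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# magnitude pruning: zero out 60% of each Linear layer's weights
for mod in model.modules():
    if isinstance(mod, torch.nn.Linear):
        prune.l1_unstructured(mod, name="weight", amount=0.6)
        prune.remove(mod, "weight")  # bake the pruning mask into the weights

# post-training dynamic quantization of the remaining weights to int8
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```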