We discuss the difficulties of evaluating partisan gerrymandering in Utah's congressional districts and the failure of many common metrics there. We explain why the Republican vote share in the least-Republican district (LRVS) is a good indicator of the advantage or disadvantage each party has in the Utah congressional districts. Although the LRVS only makes sense in settings with at most one competitive district, in that setting it directly captures the extent to which a given redistricting plan advantages or disadvantages the Republican and Democratic parties. We use the LRVS to evaluate the most common measures of partisan gerrymandering in the context of Utah's 2011 congressional districts. We do this by generating large ensembles of alternative redistricting plans using Markov chain Monte Carlo methods. We also discuss the implications of this new metric and of our results for the question of whether the 2011 Utah congressional plan was gerrymandered.
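As a rough illustration of the LRVS defined above, the following minimal sketch computes it for a plan given as per-district vote counts and compares an enacted plan against a small ensemble. It assumes a two-party vote share; the function names, the vote counts, and the "fraction of ensemble plans at or below the enacted value" comparison are hypothetical choices for illustration, not the authors' procedure or data.

```python
def lrvs(plan):
    """Republican vote share in the least-Republican district.

    `plan` is a list of (republican_votes, democratic_votes) tuples, one per
    district. Two-party vote share is an assumption made for this sketch.
    """
    shares = [rep / (rep + dem) for rep, dem in plan]
    return min(shares)


def share_of_ensemble_at_or_below(enacted_value, ensemble_values):
    """Fraction of ensemble plans whose LRVS is at or below the enacted value."""
    at_or_below = sum(1 for v in ensemble_values if v <= enacted_value)
    return at_or_below / len(ensemble_values)


# Hypothetical four-district plans; every plan uses the same statewide totals.
enacted = [(150_000, 90_000), (160_000, 80_000), (140_000, 100_000), (120_000, 115_000)]
ensemble = [
    [(170_000, 70_000), (165_000, 75_000), (150_000, 90_000), (85_000, 150_000)],
    [(150_000, 90_000), (150_000, 90_000), (145_000, 95_000), (125_000, 110_000)],
    [(160_000, 80_000), (155_000, 85_000), (150_000, 90_000), (105_000, 130_000)],
]

enacted_lrvs = lrvs(enacted)
ensemble_lrvs = [lrvs(plan) for plan in ensemble]
print(enacted_lrvs, share_of_ensemble_at_or_below(enacted_lrvs, ensemble_lrvs))
```

In practice the ensemble would contain many thousands of plans drawn by a Markov chain Monte Carlo sampler rather than three hand-written ones.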
The Gaussian kernel and its traditional normalizations (e.g., row-stochastic) are popular approaches for assessing similarities between data points, commonly used for manifold learning and clustering, as well as supervised and semi-supervised learning on graphs. In many practical situations, the data can be corrupted by noise that prevents traditional affinity matrices from correctly assessing similarities, especially if the noise magnitudes vary considerably across the data, e.g., under heteroskedasticity or outliers. An alternative approach that behaves more stably under noise is the doubly stochastic normalization of the Gaussian kernel. In this work, we investigate this normalization in a setting where points are sampled from an unknown density on a low-dimensional manifold embedded in high-dimensional space and corrupted by possibly strong, non-identically distributed, sub-Gaussian noise. We establish the pointwise concentration of the doubly stochastic affinity matrix and its scaling factors around certain population forms. We then utilize these results to develop several tools for robust inference. First, we derive a robust density estimator that can substantially outperform the standard kernel density estimator under high-dimensional noise. Second, we provide estimators for the pointwise noise magnitudes, the pointwise signal magnitudes, and the pairwise Euclidean distances between clean data points. Lastly, we derive robust graph Laplacian normalizations that approximate popular manifold Laplacians, including the Laplace-Beltrami operator, showing that the local geometry of the manifold can be recovered under high-dimensional noise. We exemplify our results in simulations and on real single-cell RNA-sequencing data. In the latter, we show that our proposed normalizations are robust to technical variability associated with different cell types.
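To make the doubly stochastic normalization concrete, here is a minimal sketch that scales a symmetric Gaussian kernel $K$ by positive factors $d_i$ so that $W_{ij} = d_i K_{ij} d_j$ has unit row (and, by symmetry, column) sums. Several choices are assumptions of this sketch rather than details from the abstract: the zero diagonal, the kernel form $\exp(-\|x_i - x_j\|^2/\epsilon)$, the damped Sinkhorn-type iteration, the function names, and the toy points.

```python
import math


def gaussian_kernel(points, eps):
    """Gaussian kernel matrix with zero diagonal (an assumption of this sketch)."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = K[j][i] = math.exp(-sq_dist / eps)
    return K


def doubly_stochastic_scaling(K, max_iter=1000, tol=1e-13):
    """Find d > 0 such that W[i][j] = d[i] * K[i][j] * d[j] has unit row sums."""
    n = len(K)
    d = [1.0] * n
    for _ in range(max_iter):
        Kd = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
        # Damped update: geometric mean of d and 1 / (K d), which avoids the
        # oscillation of the undamped symmetric iteration d <- 1 / (K d).
        new_d = [math.sqrt(d[i] / Kd[i]) for i in range(n)]
        change = max(abs(a - b) for a, b in zip(new_d, d))
        d = new_d
        if change < tol:
            break
    W = [[d[i] * K[i][j] * d[j] for j in range(n)] for i in range(n)]
    return W, d


# Toy data: five points in the plane (hypothetical).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
W, d = doubly_stochastic_scaling(gaussian_kernel(points, eps=2.0))
row_sums = [sum(row) for row in W]
print(row_sums)
```

The pure-Python loops cost $O(n^2)$ per iteration and are meant only for small examples; a vectorized implementation would be preferable for real data.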
The asymptotic study of the partition function $p(n)$ began with the work of Hardy and Ramanujan. Later, Rademacher obtained a convergent series for $p(n)$, and an error bound was given by Lehmer. Despite this, a full asymptotic expansion for $p(n)$ with an explicit error bound is not known. Recently, O'Sullivan studied the asymptotic expansion of $p^{k}(n)$, the number of partitions into $k$th powers, a study initiated by Wright, and consequently obtained an asymptotic expansion for $p(n)$ along with a concise description of the coefficients involved in the expansion, but without any estimate of the error term. Here we give a detailed and comprehensive analysis of the error term obtained by truncating the asymptotic expansion for $p(n)$ at any positive integer $n$. This gives rise to an infinite family of inequalities for $p(n)$, which finally answers a question posed by Chen. Our error term estimation relies predominantly on applications of algorithmic methods from symbolic summation.
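For orientation, the sketch below computes exact values of $p(n)$ with Euler's pentagonal number recurrence and compares them with the classical Hardy-Ramanujan leading term $p(n) \sim \frac{1}{4n\sqrt{3}} e^{\pi\sqrt{2n/3}}$. It illustrates only the leading-order behavior; it does not implement the full expansion or the error bounds discussed in the abstract, and the function names are hypothetical.

```python
import math


def partitions_upto(n_max):
    """Return [p(0), ..., p(n_max)] via Euler's pentagonal number recurrence."""
    p = [1] + [0] * n_max
    for n in range(1, n_max + 1):
        total = 0
        k = 1
        while True:
            g1 = k * (3 * k - 1) // 2  # generalized pentagonal number for +k
            if g1 > n:
                break
            sign = 1 if k % 2 == 1 else -1
            total += sign * p[n - g1]
            g2 = k * (3 * k + 1) // 2  # generalized pentagonal number for -k
            if g2 <= n:
                total += sign * p[n - g2]
            k += 1
        p[n] = total
    return p


def hardy_ramanujan(n):
    """Leading-order Hardy-Ramanujan approximation to p(n)."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))


p = partitions_upto(1000)
for n in (10, 100, 1000):
    print(n, p[n], p[n] / hardy_ramanujan(n))
```

The printed ratios approach 1 only slowly, which is one motivation for wanting further terms of the expansion together with explicit error bounds.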
Datasets with significant proportions of bias threaten the training of trustworthy models on NLU tasks. Despite yielding great progress, current debiasing methods rely excessively on knowledge of bias attributes. The definition of these attributes, however, is elusive and varies across datasets. Furthermore, leveraging these attributes at the input level for bias mitigation may leave a gap between intrinsic properties and the underlying decision rule. To narrow this gap and remove the need for supervision on bias, we suggest extending bias mitigation into feature space. To this end, we develop a novel model, Recovering Intended-Feature Subspace with Knowledge-Free (RISK). Assuming that shortcut features caused by various biases are unintended for prediction, RISK views them as redundant features. When delving into a lower-dimensional manifold to remove redundancies, RISK reveals that an extremely low-dimensional subspace with intended features can robustly represent the highly biased dataset. Empirical results demonstrate that our model consistently improves generalization to out-of-distribution sets and achieves new state-of-the-art performance.
Linear computation broadcast (LCBC) refers to a setting with $d$-dimensional data stored at a central server, where $K$ users, each with some prior linear side-information, wish to retrieve various linear combinations of the data. The goal is to determine the minimum amount of information that must be broadcast to satisfy all the users. The reciprocal of the optimal broadcast cost is the capacity of LCBC. The capacity is known for up to $K=3$ users. Since LCBC includes index coding as a special case, large $K$ settings of LCBC are at least as hard as the index coding problem. Instead of the general setting (all instances), by focusing on the generic setting (almost all instances) this work shows that the generic capacity of the symmetric LCBC (where every user has $m'$ dimensions of side-information and $m$ dimensions of demand) for a large number of users ($K>d$ suffices) is $C_g=1/\Delta_g$, where $\Delta_g=\min\left\{\max\{0,d-m'\}, Km, \frac{dm}{m+m'}\right\}$ is the broadcast cost that is both achievable and unbeatable asymptotically almost surely for large $n$, among all LCBC instances with the given parameters $p,K,d,m,m'$. Relative to baseline schemes of random coding or separate transmissions, $C_g$ shows an extremal gain by a factor of $K$ as a function of the number of users, and by a factor of $\approx d/4$ as a function of the data dimension, when optimized over the remaining parameters. For an arbitrary number of users, the generic capacity of the symmetric LCBC is characterized within a factor of $2$.
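The closed-form expression for $\Delta_g$ can be evaluated directly. The sketch below is a plain transcription of the formula stated above using exact rational arithmetic; the function names, the input checks, and the example parameters are hypothetical, and it applies only to the large-$K$ regime ($K>d$) for which the formula is stated.

```python
from fractions import Fraction


def generic_broadcast_cost(K, d, m, m_prime):
    """Delta_g = min{ max{0, d - m'}, K m, d m / (m + m') }, stated for K > d."""
    if K <= d:
        raise ValueError("the formula is stated for K > d")
    if m <= 0 or m_prime < 0:
        raise ValueError("expected m > 0 and m' >= 0")
    return min(
        Fraction(max(0, d - m_prime)),
        Fraction(K * m),
        Fraction(d * m, m + m_prime),
    )


def generic_capacity(K, d, m, m_prime):
    """C_g = 1 / Delta_g; returns None when Delta_g is zero (nothing to send)."""
    cost = generic_broadcast_cost(K, d, m, m_prime)
    return None if cost == 0 else 1 / cost


# Hypothetical parameters: K = 5 users, d = 4 data dimensions, m = m' = 1.
print(generic_broadcast_cost(5, 4, 1, 1), generic_capacity(5, 4, 1, 1))
```

For these example parameters the three candidate terms are $3$, $5$, and $2$, so the third term determines the cost.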
The precision matrix, which encodes conditional linear dependency relations among a set of variables, is an important object of interest in multivariate analysis. Sparse estimation procedures for precision matrices such as the graphical lasso (Glasso) have gained popularity because they facilitate interpretability by separating pairs of variables that are conditionally dependent from those that are independent (given all other variables). Glasso, however, lacks robustness to outliers. To overcome this problem, one typically applies a robust plug-in procedure in which the Glasso is computed from a robust covariance estimate instead of the sample covariance, thereby providing protection against outliers. In this paper, we study such estimators theoretically by deriving and comparing their influence functions, sensitivity curves, and asymptotic variances.
Variational Bayesian posterior inference often requires simplifying approximations such as mean-field parametrisation to ensure tractability. However, prior work has associated the variational mean-field approximation for Bayesian neural networks with underfitting in the case of small datasets or large model sizes. In this work, we show that invariances in the likelihood function of over-parametrised models contribute to this phenomenon because these invariances complicate the structure of the posterior by introducing discrete and/or continuous modes which cannot be well approximated by Gaussian mean-field distributions. In particular, we show that the mean-field approximation has an additional gap in the evidence lower bound compared to a purpose-built posterior that takes into account the known invariances. Importantly, this invariance gap is not constant; it vanishes as the approximation reverts to the prior. We proceed by first considering translation invariances in a linear model with a single data point in detail. We show that, while the true posterior can be constructed from a mean-field parametrisation, this is achieved only if the objective function takes into account the invariance gap. Then, we transfer our analysis of the linear model to neural networks. Our analysis provides a framework for future work to explore solutions to the invariance problem.
Generative Adversarial Networks (GANs) have achieved great success in data generation. However, their statistical properties are not fully understood. In this paper, we consider the statistical behavior of the general $f$-divergence formulation of GAN, which includes the Kullback--Leibler divergence that is closely related to the maximum likelihood principle. We show that for parametric generative models that are correctly specified, all $f$-divergence GANs with the same discriminator classes are asymptotically equivalent under suitable regularity conditions. Moreover, with an appropriately chosen local discriminator, they become equivalent to the maximum likelihood estimate asymptotically. For generative models that are misspecified, GANs with different $f$-divergences converge to different estimators, and thus cannot be directly compared. However, it is shown that for some commonly used $f$-divergences, the original $f$-GAN is not optimal in that one can achieve a smaller asymptotic variance when the discriminator training in the original $f$-GAN formulation is replaced by logistic regression. The resulting estimation method is referred to as Adversarial Gradient Estimation (AGE). Empirical studies are provided to support the theory and to demonstrate the advantage of AGE over the original $f$-GANs under model misspecification.
We train graph neural networks on halo catalogues from Gadget N-body simulations to perform field-level likelihood-free inference of cosmological parameters. The catalogues contain $\lesssim$5,000 halos with masses $\gtrsim 10^{10}~h^{-1}M_\odot$ in a periodic volume of $(25~h^{-1}{\rm Mpc})^3$; every halo in the catalogue is characterized by several properties such as position, mass, velocity, concentration, and maximum circular velocity. Our models, built to be permutationally, translationally, and rotationally invariant, do not impose a minimum scale on which to extract information and are able to infer the values of $\Omega_{\rm m}$ and $\sigma_8$ with a mean relative error of $\sim6\%$, when using positions plus velocities and positions plus masses, respectively. More importantly, we find that our models are very robust: they can infer the values of $\Omega_{\rm m}$ and $\sigma_8$ when tested using halo catalogues from thousands of N-body simulations run with five different N-body codes: Abacus, CUBEP$^3$M, Enzo, PKDGrav3, and Ramses. Surprisingly, the model trained to infer $\Omega_{\rm m}$ also works when tested on thousands of state-of-the-art CAMELS hydrodynamic simulations run with four different codes and subgrid physics implementations. Using halo properties such as concentration and maximum circular velocity allows our models to extract more information, at the expense of breaking the robustness of the models. This may happen because the different N-body codes are not converged on the relevant scales corresponding to these parameters.
In this paper, we test for symmetry in the distribution of data observed on a random variable. We propose test statistics using the cumulative past and residual extropy of record values, based on the characterization developed by Gupta and Chaudhary (2022) [5]. The resulting estimator is shown to be consistent. Our proposed test has the advantage that the centre of symmetry need not be estimated. The empirical density, critical values, and power of the proposed test statistics are obtained. The test procedure is applied to six real-life data sets to verify its performance in identifying symmetry. Simulations indicate that our test performs better than competing tests.
This paper focuses on the expected difference in a borrower's repayment when there is a change in the lender's credit decisions. Classical estimators overlook the confounding effects, and hence the estimation error can be substantial. We therefore propose another approach to constructing the estimators such that the error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and the proposed estimators in estimating the causal quantities. The comparison is carried out across a wide range of models, including linear regression models, tree-based models, and neural network-based models, on simulated datasets that exhibit different levels of causality, different degrees of nonlinearity, and different distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is substantial if the causal effects are accounted for correctly.