In 2023, the International Conference on Machine Learning (ICML) required authors with multiple submissions to rank their submissions based on perceived quality. In this paper, we aim to employ these author-specified rankings to enhance peer review in machine learning and artificial intelligence conferences by extending the Isotonic Mechanism (Su, 2021, 2022) to exponential family distributions. This mechanism generates adjusted scores that align closely with the original scores while adhering to the author-specified rankings. Despite its applicability to a broad spectrum of exponential family distributions, implementing the mechanism does not require knowledge of the specific distribution form. We demonstrate that an author is incentivized to provide accurate rankings when her utility takes the form of a convex additive function of the adjusted review scores. For a certain subclass of exponential family distributions, we prove that the author reports truthfully only if the questions asked involve only pairwise comparisons between her submissions, thus indicating the optimality of ranking in truthful information elicitation. Lastly, we show that the adjusted scores dramatically improve the accuracy of the original scores and achieve nearly minimax optimality, with statistical consistency, for estimating the true scores when the true scores have bounded total variation.
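To make the adjustment step concrete, here is a minimal Python sketch of the squared-error case, where the adjusted scores are the isotonic (pool-adjacent-violators) projection of the raw scores onto the author's ranking; the function name and the example scores are illustrative, not from the paper.

```python
import numpy as np

def adjust_scores(raw_scores, ranking):
    """Project raw review scores onto an author-specified ranking.

    `ranking` lists paper indices from claimed best to worst. The
    adjusted scores minimize squared distance to `raw_scores` subject
    to being non-increasing along the ranking (pool adjacent violators).
    """
    y = list(np.asarray(raw_scores, dtype=float)[list(ranking)])
    counts = [1] * len(y)
    i = 0
    while i < len(y) - 1:
        if y[i] < y[i + 1]:                      # ordering violated: pool
            y[i] = (y[i] * counts[i] + y[i + 1] * counts[i + 1]) / \
                   (counts[i] + counts[i + 1])
            counts[i] += counts[i + 1]
            del y[i + 1], counts[i + 1]
            i = max(i - 1, 0)                    # re-check previous block
        else:
            i += 1
    fitted = np.repeat(y, counts)
    adjusted = np.empty(len(fitted))
    adjusted[list(ranking)] = fitted
    return adjusted

# Paper 1 was ranked below paper 0, so its higher raw score is averaged down.
print(adjust_scores([6.0, 7.5, 5.0], ranking=[0, 1, 2]))  # [6.75 6.75 5. ]
```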
Distributed computing is critically important for modern statistical analysis. Herein, we develop a distributed quasi-Newton (DQN) framework with excellent statistical, computational, and communication efficiency. The DQN method requires neither Hessian matrix inversion nor Hessian communication, which considerably reduces its computational and communication complexity. Notably, related existing methods analyze only numerical convergence and require a diverging number of iterations to converge. In contrast, we investigate the statistical properties of the DQN method and theoretically demonstrate that the resulting estimator is statistically efficient after a small number of iterations under mild conditions. Extensive numerical analyses demonstrate the finite-sample performance of the proposed method.
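The abstract does not spell out the exact DQN update, so the sketch below only illustrates the two ingredients it emphasizes: workers communicate gradients (never Hessians), and the central node forms a curvature-aware direction without any matrix inversion, here via the standard L-BFGS two-loop recursion. All names and the unit step size are assumptions for illustration, not the paper's method.

```python
import numpy as np

def dqn_round(theta, worker_grads, s_hist, y_hist):
    """One schematic round: average worker gradients, then build a
    quasi-Newton direction via the L-BFGS two-loop recursion, which
    needs no Hessian storage, inversion, or communication."""
    g = np.mean(worker_grads, axis=0)            # O(d) per worker
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        a = np.dot(s, q) / np.dot(y, s)          # assumes curvature > 0
        alphas.append(a)
        q -= a * y
    if s_hist:                                   # initial scaling of H0
        s, y = s_hist[-1], y_hist[-1]
        q *= np.dot(s, y) / np.dot(y, y)
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = np.dot(y, q) / np.dot(y, s)
        q += (a - b) * s
    return theta - q                             # unit step, illustrative
```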
We study causal effect estimation from a mixture of observational and interventional data in a confounded linear regression model with multivariate treatments. We show that the statistical efficiency, in terms of expected squared error, can be improved by combining estimators arising from the observational and interventional settings. To this end, we derive methods based on matrix-weighted linear estimators and prove that our methods are asymptotically unbiased in the infinite-sample limit. This is an important improvement over the pooled estimator using the union of interventional and observational data, whose bias vanishes only if the ratio of observational to interventional data tends to zero. Studies on synthetic data confirm our theoretical findings. In settings where confounding is substantial and the ratio of observational to interventional data is large, our estimators outperform a Stein-type estimator and various other baselines.
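As a rough illustration of matrix-weighted combination (not the paper's estimator, whose weights additionally account for the confounding bias of the observational estimate), one can blend the two OLS estimates with inverse-covariance matrix weights:

```python
import numpy as np

def combine(beta_obs, beta_int, cov_obs, cov_int):
    """Blend two estimates with matrix weights W and I - W chosen by
    inverse-covariance weighting (a naive choice: it ignores the
    confounding bias of beta_obs, which the paper's weights address)."""
    p_obs = np.linalg.inv(cov_obs)               # precision matrices
    p_int = np.linalg.inv(cov_int)
    W = np.linalg.solve(p_obs + p_int, p_int)    # weight on interventional
    return W @ beta_int + (np.eye(len(beta_int)) - W) @ beta_obs
```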
Finite element methods and kinematically coupled schemes that decouple the fluid velocity and the structure's displacement have been extensively studied for incompressible fluid-structure interaction (FSI) over the past decade. While these methods are known to be stable and easy to implement, optimal error analysis has remained challenging. Previous work has primarily relied on the classical elliptic projection technique, which is only suitable for parabolic problems and does not lead to optimal convergence of numerical solutions to FSI problems in the standard $L^2$ norm. In this article, we propose a new kinematically coupled scheme for an incompressible FSI thin-structure model and establish a new framework for the numerical analysis of FSI problems in terms of a newly introduced coupled non-stationary Ritz projection, which allows us to prove optimal-order convergence of the proposed method in the $L^2$ norm. The methodology presented in this article is also applicable to numerous other FSI models and serves as a fundamental tool for advancing research in this field.
This paper discusses the approximate distributions of eigenvalues of a singular Wishart matrix. We give the approximate joint density of eigenvalues via Laplace approximation of the hypergeometric functions of matrix arguments. Furthermore, we show that the distribution of each eigenvalue can be approximated by a chi-square distribution with varying degrees of freedom when the population eigenvalues are infinitely dispersed. The derived result is applied to testing the equality of eigenvalues in two populations.
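A quick Monte Carlo check of the flavor of this result, for the largest eigenvalue only: when the top population eigenvalue dominates the rest, the scaled top sample eigenvalue is approximately $\chi^2_n$ (the degrees of freedom for the other eigenvalues vary with the index, as the abstract indicates). All constants below are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n = 50, 10                                  # p > n: singular Wishart
lam = np.array([1000.0] + [1.0] * (p - 1))     # widely dispersed spectrum

top = []
for _ in range(2000):
    Z = rng.standard_normal((p, n))
    A = np.sqrt(lam)[:, None] * Z              # Sigma^{1/2} Z
    top.append(np.linalg.eigvalsh(A @ A.T)[-1] / lam[0])

# The scaled top eigenvalue should be approximately chi-square with n df.
print(stats.kstest(top, "chi2", args=(n,)))
```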
Demand for reliable statistics at a local area (small area) level has greatly increased in recent years. Traditional area-specific estimators based on probability samples are not adequate because of small, or even zero, sample sizes in a local area. As a result, methods based on models linking the areas are widely used. The World Bank focused on estimating poverty measures, in particular the poverty incidence and the poverty gap, known as FGT measures, using a simulated census method called ELL, based on a one-fold nested error model for a suitable transformation of the welfare variable. Modified ELL methods that lead to significant efficiency gains over ELL have also been proposed under the one-fold model. An advantage of ELL and modified ELL methods is that distributional assumptions on the random effects in the model are not needed. In this paper, we extend ELL and modified ELL to two-fold nested error models to estimate poverty indicators for areas (say, states) and subareas (say, counties within a state). Our simulation results indicate that the modified ELL estimators lead to large efficiency gains over ELL at both the area and subarea levels. Further, the modified ELL method retaining both area and subarea estimated effects in the model (called MELL2) performs significantly better in terms of mean squared error (MSE) for sampled subareas than the modified ELL method retaining only the estimated area effect (called MELL1).
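For concreteness, the following schematic snippet generates data from a two-fold nested error model and computes the FGT measures; ELL-type methods would repeatedly simulate such censuses from a fitted model and average the resulting measures. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def fgt(welfare, z, alpha):
    """FGT poverty measure: alpha=0 is incidence, alpha=1 is poverty gap."""
    poor = welfare < z
    return np.mean(poor * ((z - welfare) / z) ** alpha)

# Two-fold nested error model for log-welfare:
# y = x*beta + area effect + subarea effect + unit error.
n_area, n_sub, n_unit = 5, 4, 200
a = rng.normal(0, 0.3, n_area)                       # area effects
b = rng.normal(0, 0.2, (n_area, n_sub))              # subarea effects
x = rng.normal(10, 1, (n_area, n_sub, n_unit))
y = 0.5 * x + a[:, None, None] + b[..., None] + rng.normal(0, 0.5, x.shape)

welfare = np.exp(y)                                  # back-transform
z = np.quantile(welfare, 0.25)                       # poverty line
for i in range(n_area):
    print(f"area {i}: FGT0={fgt(welfare[i], z, 0):.3f}, "
          f"FGT1={fgt(welfare[i], z, 1):.3f}")
```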
Posterior predictive p-values (ppps) have become popular tools for Bayesian model criticism, being general-purpose and easy to use. However, their interpretation can be difficult because their distribution is not uniform under the hypothesis that the model did generate the data. To address this issue, procedures to obtain calibrated ppps (cppps) have been proposed, although they are rarely used in practice because they require repeated simulation of new data and model estimation via MCMC. Here we give methods to balance the computational trade-off between the number of calibration replicates and the number of MCMC samples per replicate. Our results suggest that investing in a large number of calibration replicates while using short MCMC chains can save substantial computation time compared to naive implementations, without appreciable loss in accuracy. We propose different estimators for the variance of the cppp that can be used to quickly confirm that the model fits the data well. Variance estimation requires the effective sample sizes of many short MCMC chains; we show that these can be well approximated using the single long MCMC chain from the real-data model fit. The cppp procedure is implemented in NIMBLE, a flexible framework for hierarchical modeling that supports many models and discrepancy measures.
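The computational trade-off can be seen in a self-contained toy version of the calibration loop, with a conjugate normal model standing in for the per-replicate MCMC fit (in NIMBLE each replicate involves a real, possibly short, MCMC run, which is what makes the trade-off matter); everything below is an illustrative assumption, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: y_i ~ N(mu, 1) with prior mu ~ N(0, 10^2); the conjugate
# posterior stands in for a short MCMC run.
def posterior_draws(y, n_draws):
    prec = 1 / 100 + len(y)                  # posterior precision
    mean = np.sum(y) / prec                  # posterior mean
    return rng.normal(mean, 1 / np.sqrt(prec), n_draws)

def discrepancy(y):
    return np.var(y)                         # targets misfit in spread

def ppp(y, n_mcmc=100):
    draws = posterior_draws(y, n_mcmc)
    reps = rng.normal(draws[:, None], 1.0, (n_mcmc, len(y)))
    return np.mean(np.var(reps, axis=1) >= discrepancy(y))

def cppp(y, n_cal=500, n_mcmc=100):
    # Calibration replicates: data simulated from the fitted model,
    # each with its own (short) refit.
    cal = [ppp(rng.normal(m, 1.0, len(y)), n_mcmc)
           for m in posterior_draws(y, n_cal)]
    return np.mean(np.array(cal) <= ppp(y, n_mcmc))

y = rng.normal(0.0, 2.0, 50)                 # overdispersed vs. the model
print(cppp(y))                               # small value flags the misfit
```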
In the setting of functional data analysis, we derive optimal rates of convergence in the supremum norm for estimating the H\"older-smooth mean function of a stochastic process that is repeatedly and discretely observed at fixed, multivariate, synchronous design points with additional errors. Similarly to the $L_2$ rates obtained in Cai and Yuan (2011), a discretization term dominates for sparse designs, while in the dense case the $\sqrt n$ rate can be achieved as if the $n$ processes were continuously observed without errors. However, our analysis differs in several respects from Cai and Yuan (2011). First, we do not assume that the paths of the processes are as smooth as the mean, but still obtain the $\sqrt n$ rate of convergence without additional logarithmic factors in the dense setting. Second, we show that in the supremum norm there is an intermediate regime between the sparse and dense cases, dominated by the contribution of the observation errors. Third, and in contrast to the analysis in $L_2$, interpolation estimators turn out to be sub-optimal in $L_\infty$ in the dense setting, which explains their poor empirical performance. We also obtain a central limit theorem in the supremum norm and discuss the selection of the bandwidth. Simulations and real data applications illustrate the results.
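As a point of reference, the simplest pooled kernel estimator of the mean function looks as follows (a local-constant sketch with an ad hoc bandwidth; the rate-optimal results concern carefully tuned estimators, and all names here are illustrative):

```python
import numpy as np

def nw_mean(grid, obs_times, obs_values, h):
    """Local-constant (Nadaraya-Watson) estimate of the mean function,
    pooling the discrete noisy observations of all curves."""
    d = grid[:, None] - obs_times[None, :]
    K = np.exp(-0.5 * (d / h) ** 2)              # Gaussian kernel
    return (K @ obs_values) / K.sum(axis=1)

rng = np.random.default_rng(4)
t = np.tile(np.linspace(0, 1, 20), 50)           # 50 curves, 20 points each
v = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
mu_hat = nw_mean(np.linspace(0, 1, 200), t, v, h=0.05)
```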
Long-term outcomes of experimental evaluations are necessarily observed after long delays. We develop semiparametric methods for combining the short-term outcomes of experiments with observational measurements of short-term and long-term outcomes, in order to estimate long-term treatment effects. We characterize semiparametric efficiency bounds for various instances of this problem. These calculations facilitate the construction of several estimators. We analyze the finite-sample performance of these estimators with a simulation calibrated to data from an evaluation of the long-term effects of a poverty alleviation program.
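The basic imputation idea can be sketched as follows (a plug-in, surrogate-index style estimator with a linear first stage; the paper's estimators are semiparametrically efficient refinements of this idea, and all names are illustrative):

```python
import numpy as np

def long_term_effect(s_exp, d_exp, s_obs, y_obs):
    """Learn E[Y_long | short-term outcomes] on observational data,
    impute long-term outcomes in the experiment, and difference the
    imputed means across treatment arms (d_exp in {0, 1})."""
    X = np.column_stack([np.ones(len(s_obs)), s_obs])
    coef, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    X_exp = np.column_stack([np.ones(len(s_exp)), s_exp])
    y_hat = X_exp @ coef
    return y_hat[d_exp == 1].mean() - y_hat[d_exp == 0].mean()
```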
Motivated by pathwise stochastic calculus, we say that a continuous real-valued function $x$ admits the roughness exponent $R$ if the $p^{\text{th}}$ variation of $x$ converges to zero for $p>1/R$ and to infinity for $p<1/R$. For the sample paths of many stochastic processes, such as fractional Brownian motion, the roughness exponent exists and equals the standard Hurst parameter. In our main result, we provide a mild condition on the Faber--Schauder coefficients of $x$ under which the roughness exponent exists and is given as the limit of the classical Gladyshev estimates $\widehat R_n(x)$. This result can be viewed as a strong consistency result for the Gladyshev estimators in an entirely model-free setting, because no assumption whatsoever is made on the possible dynamics of the function $x$. Nonetheless, our proof is probabilistic and relies on a martingale that is hidden in the Faber--Schauder expansion of $x$. Since the Gladyshev estimators are not scale-invariant, we construct several scale-invariant estimators that are derived from the sequence $(\widehat R_n)_{n\in\mathbb N}$. We also discuss how a dynamic change in the roughness parameter of a time series can be detected. Finally, we extend our results to the case in which the $p^{\text{th}}$ variation of $x$ is defined over a sequence of unequally spaced partitions. Our results are illustrated by means of high-frequency financial time series.
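For illustration, a common normalization of the Gladyshev estimate can be computed from dyadic increments as below; the paper's exact definition may differ slightly.

```python
import numpy as np

def gladyshev(x, n):
    """Gladyshev-type estimate from level-n dyadic increments of a path
    sampled on a grid of 2**n_max + 1 points (requires n <= n_max)."""
    step = (len(x) - 1) // 2 ** n
    s_n = np.sum(np.diff(x[::step]) ** 2)        # p-th variation, p = 2
    return 0.5 - np.log2(s_n) / (2 * n)

rng = np.random.default_rng(3)
# Brownian motion on [0, 1] with 2**14 steps: roughness exponent 1/2.
bm = np.concatenate([[0.0], np.cumsum(rng.standard_normal(2 ** 14)) * 2 ** -7])
print(gladyshev(bm, 12))                         # approx 0.5
```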
The dominant NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (e.g., sentiment classification, span-prediction-based question answering, or machine translation). However, it builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and to propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we show that these approaches yield more robust models on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule that alleviates catastrophic forgetting during adaptation.
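As a minimal illustration of the DRO idea (one simple exponential-tilt adversary; the thesis develops richer parametric families, and this sketch is not its method):

```python
import numpy as np

def dro_loss(losses, tau=1.0):
    """KL-tilted DRO surrogate: the adversary reweights per-example
    losses with an exponential tilt controlled by temperature tau,
    emphasizing the hardest examples."""
    w = np.exp((losses - losses.max()) / tau)    # numerically stable tilt
    w /= w.sum()
    return float(np.sum(w * losses))
```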