The large-sample behavior of non-degenerate multivariate $U$-statistics of arbitrary degree is investigated under the assumption that their kernel depends on parameters that can be estimated consistently. Mild regularity conditions are given which guarantee that, once properly normalized, such statistics are asymptotically multivariate Gaussian both under the null hypothesis and under sequences of local alternatives. The work of Randles (1982, Ann. Statist.) is extended in three ways: the data and the kernel values can be multivariate rather than univariate, the limiting behavior under local alternatives is studied for the first time, and the effect of knowing some of the nuisance parameters is quantified. These results can be applied in a broad range of goodness-of-fit testing contexts, as illustrated by one specific example.
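For concreteness, here is a minimal sketch, in notation of our own choosing rather than the paper's, of the object under study: a degree-$m$ $U$-statistic whose kernel $h$ depends on a nuisance parameter $\lambda$ that is replaced by a consistent estimator $\hat\lambda_n$,
\[
U_n(\hat\lambda_n) \;=\; \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} h\bigl(X_{i_1},\dots,X_{i_m};\hat\lambda_n\bigr),
\]
and the asymptotic statement is that $\sqrt{n}\,\{U_n(\hat\lambda_n) - \theta\}$ converges to a multivariate Gaussian limit under the stated regularity conditions.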
This paper investigates extremal quantiles under two-way cluster dependence. We show that suitably normalized unconditional quantiles of intermediate order are asymptotically Gaussian. This is remarkable because two-way cluster dependence generally entails potentially non-Gaussian limits, yet extremal quantiles do not suffer from this issue. Building on this result, we extend the analysis to extremal quantile regressions of intermediate order.
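As a reminder of standard extremal-quantile terminology, which the abstract does not spell out: a quantile index $\tau_n$ is of intermediate order when
\[
\tau_n \to 0 \qquad \text{and} \qquad n\tau_n \to \infty,
\]
so the quantile moves into the tail while the effective number of tail observations still grows, which is what makes a Gaussian limit plausible; when $n\tau_n$ converges to a finite constant instead, the quantile is of extreme order and non-Gaussian extreme-value limits arise.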
With the expanding scale of supermarket operations in China, the vegetable market has grown considerably. Decisions about procurement costs and allocation quantities of vegetables have become a pivotal factor in determining the profitability of supermarkets. This paper analyzes the relationship between pricing and allocation faced by supermarkets in vegetable operations. Optimization algorithms are employed to determine replenishment and pricing strategies. Linear regression is used to model the historical data of various products, establishing the relationship between sale prices and sales volumes for 61 products. By integrating historical data on vegetable costs with time information based on the 24 solar terms, a cost prediction model is trained using TCN-Attention. A TOPSIS evaluation model identifies the 32 products in highest market demand. A genetic algorithm is then used to search for the globally optimal allocation and pricing decision for vegetable products.
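A hypothetical sketch of the final search step, under simplifying assumptions of our own: demand for each product is approximated by the fitted linear price-volume relationship, and a genetic algorithm searches the joint price vector for maximal profit. All bounds and hyper-parameters below are illustrative placeholders; the TCN-Attention cost forecast and the TOPSIS screening are assumed to have already produced `costs` and the 32-product shortlist.

import numpy as np

rng = np.random.default_rng(0)

def fit_demand(prices, volumes):
    """Least-squares fit of volume = a + b * price for one product."""
    A = np.column_stack([np.ones_like(prices), prices])
    (a, b), *_ = np.linalg.lstsq(A, volumes, rcond=None)
    return a, b

def profit(price_vec, demand_params, costs):
    """Total profit when product i is sold at price_vec[i]."""
    a, b = demand_params[:, 0], demand_params[:, 1]
    volume = np.maximum(a + b * price_vec, 0.0)   # predicted sales volume
    return np.sum((price_vec - costs) * volume)

def genetic_search(demand_params, costs, pop=60, gens=200, sigma=0.2):
    """Search price vectors in [cost, 3*cost] by selection + Gaussian mutation."""
    n = len(costs)
    lo, hi = costs, 3.0 * costs
    population = rng.uniform(lo, hi, size=(pop, n))
    for _ in range(gens):
        fitness = np.array([profit(p, demand_params, costs) for p in population])
        parents = population[np.argsort(fitness)[-pop // 2:]]          # selection
        children = parents[rng.integers(len(parents), size=pop - len(parents))]
        children = children + rng.normal(0.0, sigma, children.shape)   # mutation
        population = np.clip(np.vstack([parents, children]), lo, hi)
    scores = [profit(p, demand_params, costs) for p in population]
    return population[int(np.argmax(scores))]

# Usage: demand_params = np.array([fit_demand(p, v) for p, v in history])
#        best_prices = genetic_search(demand_params, costs)

In practice the linear demand surrogate would be replaced by the per-product fitted models, and the clipped price box by the supermarket's business constraints.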
Privacy protection methods, such as differentially private mechanisms, introduce noise into the resulting statistics, which often produces complex and intractable sampling distributions. In this paper, we propose a simulation-based "repro sample" approach to produce statistically valid confidence intervals and hypothesis tests, building on the work of Xie and Wang (2022). We show that this methodology is applicable to a wide variety of private inference problems, appropriately accounts for biases introduced by privacy mechanisms (such as by clamping), and improves over other state-of-the-art inference methods, such as the parametric bootstrap, in terms of the coverage and type I error of the private inference. We also develop significant improvements and extensions to the repro sample methodology for general models (not necessarily related to privacy), including 1) modifying the procedure to ensure guaranteed coverage and type I errors, even accounting for Monte Carlo error, and 2) proposing efficient numerical algorithms to implement the confidence intervals and $p$-values.
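A heavily simplified, test-inversion-flavored sketch of the idea, not the authors' actual construction: for the differentially private mean of clamped data, a candidate parameter value is retained when the observed private statistic is typical among replicates simulated under that value, and the retained values form a confidence set. The data model (normal with known spread), the clamp range, and the grid below are all illustrative assumptions of our own.

import numpy as np

rng = np.random.default_rng(1)
n, eps, clamp = 200, 1.0, (0.0, 1.0)

def private_mean(x):
    xc = np.clip(x, *clamp)                    # clamping bias is reproduced in simulation
    scale = (clamp[1] - clamp[0]) / (n * eps)  # Laplace mechanism noise scale
    return xc.mean() + rng.laplace(0.0, scale)

def accepted(mu, s_obs, reps=2000, alpha=0.05):
    """Is mu consistent with the observed private statistic s_obs?"""
    sims = np.array([private_mean(rng.normal(mu, 0.3, n)) for _ in range(reps)])
    lo, hi = np.quantile(sims, [alpha / 2, 1 - alpha / 2])
    return lo <= s_obs <= hi

x = rng.normal(0.7, 0.3, n)                    # raw (sensitive) data
s_obs = private_mean(x)
ci = [mu for mu in np.linspace(0.0, 1.2, 121) if accepted(mu, s_obs)]
if ci:
    print(f"95% confidence set: [{min(ci):.3f}, {max(ci):.3f}]")

This grid inversion handles the clamping bias automatically, because every simulated replicate passes through the same clamp-and-noise pipeline as the observed statistic; the paper's refinements for Monte Carlo error and computational efficiency are not reproduced here.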
Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower-dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting both their interpretability and their generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, an approach we coin ``$\Gamma$-VAE''. We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project, and single-cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if the model had been trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells -- which have not yet specialized -- according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
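One way to picture the approach: penalize an estimate of the decoder's curvature so that the generated manifold does not bend excessively in data space. The finite-difference penalty and weights below are illustrative choices of our own, not the paper's exact objective.

import torch

def curvature_penalty(decoder, z, h=0.1):
    """Second-difference estimate of decoder curvature along a random direction."""
    v = torch.randn_like(z)
    v = h * v / v.norm(dim=-1, keepdim=True)
    # f(z+v) - 2 f(z) + f(z-v) ~= h^2 * directional second derivative of f
    second_diff = decoder(z + v) - 2.0 * decoder(z) + decoder(z - v)
    return (second_diff / h**2).pow(2).sum(-1).mean()

def gamma_vae_loss(x, decoder, mu, logvar, gamma=1e-3):
    """Standard VAE ELBO terms plus the curvature regularizer, weighted by gamma."""
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    recon = torch.nn.functional.mse_loss(decoder(z), x)
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return recon + kl + gamma * curvature_penalty(decoder, z)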
We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.
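For readers unfamiliar with how such coding benchmarks are scored, here is a minimal sketch of one plausible harness, ours rather than the study's actual pipeline: a generated Python solution passes a problem when its output matches the expected output on every test case.

import subprocess

def passes_all(solution_path, cases, timeout=5):
    """cases: list of (stdin_text, expected_stdout) pairs for one problem."""
    for stdin_text, expected in cases:
        result = subprocess.run(
            ["python", solution_path], input=stdin_text,
            capture_output=True, text=True, timeout=timeout,
        )
        if result.stdout.strip() != expected.strip():
            return False
    return True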
We study the problem of adaptive variable selection in a Gaussian white noise model of intensity $\varepsilon$ under certain sparsity and regularity conditions on an unknown regression function $f$. The $d$-variate regression function $f$ is assumed to be a sum of functions each depending on a smaller number $k$ of variables ($1 \leq k \leq d$). These functions are unknown to us and only a few of them are nonzero. We assume that $d=d_\varepsilon \to \infty$ as $\varepsilon \to 0$ and consider the cases when $k$ is fixed and when $k=k_\varepsilon \to \infty$, $k=o(d)$ as $\varepsilon \to 0$. In this work, we introduce an adaptive selection procedure that, under some model assumptions, identifies exactly all nonzero $k$-variate components of $f$. In addition, we establish conditions under which exact identification of the nonzero components is impossible. These conditions ensure that the proposed selection procedure is the best possible in the asymptotically minimax sense with respect to the Hamming risk.
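One standard formalization consistent with the abstract (the paper's exact assumptions may differ): observations follow the Gaussian white noise model
\[
dX_\varepsilon(t) = f(t)\,dt + \varepsilon\,dW(t), \qquad t \in [0,1]^d,
\]
with the additive structure $f(t) = \sum_{\eta} f_\eta(t_\eta)$, where $\eta$ ranges over subsets of $\{1,\dots,d\}$ of size $k$ and $t_\eta$ collects the coordinates indexed by $\eta$. Variable selection then means recovering the set $\{\eta : f_\eta \neq 0\}$, and the Hamming risk counts the expected number of components selected incorrectly.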
Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. As a consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to mitigate this problem. As the generative probability distribution used to simulate synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, casting doubt on the calibration diagnostic. This study explores various facets of this problem and shows that some statistics are excessively sensitive to the choice of generative distribution when that distribution is unknown. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow for dealing with simulated reference values is proposed.
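To make the simulated-reference idea concrete, here is a sketch under conventions of our own (equal-count bins over sorted uncertainties; a normal generative distribution, which is precisely the kind of unconstrained choice whose influence the study probes):

import numpy as np

rng = np.random.default_rng(2)

def ence(errors, uncertainties, bins=10):
    """Expected normalized calibration error over equal-count uncertainty bins."""
    order = np.argsort(uncertainties)
    e, u = errors[order], uncertainties[order]
    terms = []
    for be, bu in zip(np.array_split(e, bins), np.array_split(u, bins)):
        rmv = np.sqrt(np.mean(bu**2))    # root mean variance (predicted)
        rmse = np.sqrt(np.mean(be**2))   # root mean squared error (observed)
        terms.append(abs(rmv - rmse) / rmv)
    return float(np.mean(terms))

def simulated_reference(uncertainties, n_sim=1000, gen=None):
    """Reference ENCE distribution from synthetic *calibrated* errors e_i ~ gen(0, u_i)."""
    gen = gen or rng.normal
    return np.array([ence(gen(0.0, uncertainties), uncertainties)
                     for _ in range(n_sim)])

Swapping gen for, e.g., rng.laplace probes exactly the sensitivity at issue: for a robust statistic, the reference distribution should barely move.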
We propose an algorithm that predicts each subsequent time step of an intractable short rate model from the previous time step (adjusted for drift and for the overall distribution of the previous percentile result), and show that the method achieves superior outcomes to the unbiased estimate both on the training dataset and on separate validation data.
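The description above is terse, so the following is only one possible reading, sketched entirely under our own assumptions: given a Monte Carlo ensemble of simulated short-rate paths, the next value of an observed path is predicted by carrying its current percentile forward to the next time step's simulated cross-section, which adjusts simultaneously for drift and for the shape of the distribution.

import numpy as np

def percentile_step_prediction(paths, r_now, t):
    """paths: (n_sims, n_steps) array of simulated short rates.
    Map r_now's percentile in the step-t cross-section to step t+1."""
    q = np.mean(paths[:, t] <= r_now)        # empirical percentile of the current rate
    return np.quantile(paths[:, t + 1], q)   # same percentile one step ahead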
This paper presents a method for assessing the thematic agreement of geospatial data products of different semantics and spatial granularities, which may be affected by spatial offsets between test and reference data. The proposed method uses a multi-scale framework that allows for a probabilistic evaluation of whether thematic disagreement between datasets is induced by spatial offsets arising from the differing nature of the datasets. We test our method using real-estate-derived settlement locations and remote-sensing-derived building footprint data.
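A deliberately simple sketch of the multi-scale intuition, not the paper's probabilistic framework: if two binary presence grids disagree cell by cell at fine resolution but agree once cells are aggregated into coarser blocks, the disagreement is plausibly driven by spatial offsets rather than by genuine thematic differences. The grid layout and aggregation rule below are our own assumptions.

import numpy as np

def agreement_by_scale(test_grid, ref_grid, factors=(1, 2, 4, 8)):
    """Cell-wise thematic agreement of two boolean grids at several block scales."""
    rates = {}
    for f in factors:
        h, w = (test_grid.shape[0] // f) * f, (test_grid.shape[1] // f) * f
        t = test_grid[:h, :w].reshape(h // f, f, w // f, f).any(axis=(1, 3))
        r = ref_grid[:h, :w].reshape(h // f, f, w // f, f).any(axis=(1, 3))
        rates[f] = float(np.mean(t == r))
    return rates   # agreement rising with f hints at offset-induced disagreement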
We propose and analyse boundary-preserving schemes for the strong approximation of some scalar SDEs with non-globally Lipschitz drift and diffusion coefficients whose state-space is bounded. The schemes consist of a Lamperti transform followed by a Lie--Trotter splitting. We prove $L^{p}(\Omega)$-convergence of order $1$, for every $p \geq 1$, of the schemes and exploit the Lamperti transform to confine the numerical approximations to the state-space of the considered SDE. We provide numerical experiments that confirm the theoretical results and compare the proposed Lamperti-splitting schemes to other numerical schemes for SDEs.
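To illustrate the construction on a toy example of our own (not the paper's exact scheme or SDE): for $dX = rX(1-X)\,dt + \sigma X(1-X)\,dW$ on $(0,1)$, the Lamperti map $Y = \mathrm{logit}(X)/\sigma$ yields $dY = \{r/\sigma - (\sigma/2)(1-2X)\}\,dt + dW$ with $X = \mathrm{sigmoid}(\sigma Y)$, so $Y$ lives on all of $\mathbb{R}$ and $X$ stays in $(0,1)$ by construction. The Lie--Trotter step below composes a drift substep (here a single explicit Euler substep; an exact or implicit drift flow could be substituted) with the exactly solvable noise flow.

import numpy as np

rng = np.random.default_rng(3)

def lamperti_splitting_path(x0, r, sigma, dt, n_steps):
    y = np.log(x0 / (1.0 - x0)) / sigma       # Lamperti-transformed initial state
    xs = [x0]
    for _ in range(n_steps):
        x = 1.0 / (1.0 + np.exp(-sigma * y))
        y += dt * (r / sigma - 0.5 * sigma * (1.0 - 2.0 * x))   # drift substep
        y += np.sqrt(dt) * rng.standard_normal()                # exact noise flow
        xs.append(1.0 / (1.0 + np.exp(-sigma * y)))
    return np.array(xs)                        # every value lies strictly in (0,1)

path = lamperti_splitting_path(x0=0.3, r=1.0, sigma=0.8, dt=1e-2, n_steps=500)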