Classical confidence intervals after best subset selection are widely implemented in statistical software and are routinely used by practitioners across scientific fields to assess significance. However, there are increasing concerns in the recent literature about the validity of these confidence intervals, in that the intended frequentist coverage is not attained. In the context of the Akaike information criterion (AIC), recent studies attribute an observed under-coverage phenomenon to overfitting, whereby the estimate of the error variance under the selected submodel is smaller than that under the true model. Under-coverage is particularly troubling in selective inference because it points to inflated Type I errors that would invalidate significant findings. In this article, we delineate a complementary, yet provably more decisive, factor behind the incorrect coverage of classical confidence intervals under AIC: the altered conditional sampling distributions of the pivotal quantities. Building on selective inference techniques developed in other settings, our finite-sample characterization of the AIC selection event uncovers its geometry as a union of finitely many intervals on the real line, based on which we derive new confidence intervals with guaranteed coverage for any sample size. This geometry enables conditioning exactly on the AIC selection event, circumventing the excessive conditioning common in other post-selection methods. The proposed methods are easy to implement and apply broadly to other commonly used best subset selection criteria. In an application to a classical US consumption dataset, the proposed confidence intervals arrive at conclusions different from the conventional ones, even when the selected model is the full model, leading to interpretable findings that better align with empirical observations.
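For concreteness, the selection rule underlying the discussion above can be written as follows (a standard formulation, stated here as an assumption since the abstract does not fix notation): for a Gaussian linear model with $n$ observations, a candidate submodel $M$ with residual sum of squares $\mathrm{RSS}_M$ and $|M|$ free parameters has, up to an additive constant,
\[
\mathrm{AIC}(M) \;=\; n \log\!\left(\frac{\mathrm{RSS}_M}{n}\right) + 2\,|M|,
\qquad
\widehat{M} \;=\; \operatorname*{arg\,min}_{M} \mathrm{AIC}(M),
\]
and the selection event $\{\widehat{M}=M_0\}$ is the set of samples for which $M_0$ attains the minimum; the finite-sample characterization described above expresses this event, in terms of the relevant pivot, as a union of finitely many intervals on the real line.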
Within the ViSE (Voting in Stochastic Environment) model, we study the effectiveness of majority voting in various environments. According to the ``pit of losses'' paradox, majority decisions in apparently hostile environments systematically reduce the capital of society. In such cases, the basic action of ``rejecting all proposals without voting'' outperforms simple majority. We reveal another pit of losses, which appears in favorable environments. Here, the simple action of ``accepting all proposals without voting'' is superior to simple majority, which thus causes a loss compared to total acceptance. We show that the second pit of losses is a mirror image of the pit of losses in hostile environments and explain this phenomenon. Technically, we consider a voting society consisting of individual agents, whose strategy is to support all proposals that increase their capital, and a group, whose members vote to increase the total capital of the group. According to the main result, the expected capital gain of each agent in an environment whose proposal generator $\xi$ has mean $\mu>0$ exceeds, by $\mu$, their expected capital gain under the generator $-\xi$. This result extends to shift-based families of generators with symmetric distributions. The difference of $\mu$ induces a symmetry relative to the basic action that rejects/accepts all proposals in unfavorable/favorable environments.
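In symbols (notation introduced here only for illustration): writing $G(\xi)$ for the expected capital gain of an individual agent under simple majority voting when proposals are generated by $\xi$ with mean $\mu>0$, the main result stated above reads
\[
G(\xi) \;=\; G(-\xi) + \mu,
\qquad\text{equivalently}\qquad
G(\xi) - \mu \;=\; G(-\xi) - 0,
\]
that is, the gain relative to the corresponding basic action (accepting every proposal under $\xi$, which yields $\mu$ per proposal in expectation; rejecting every proposal under $-\xi$, which yields $0$) is the same in both environments, which is the mirror symmetry between the two pits of losses.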
For multivariate data, tandem clustering is a well-known technique that aims to improve cluster identification through an initial dimension reduction. Nevertheless, the usual approach based on principal component analysis (PCA) has been criticized for focusing solely on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Theoretical results derived previously guarantee that, under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has received little attention in the context of clustering. Two challenges associated with ICS are the choice of the pair of scatter matrices and the selection of the components to retain. For effective clustering, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of how well it preserves the cluster structure in the data. In an extensive simulation study and in empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.
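As a minimal illustration of the ICS step in such a tandem pipeline (a sketch only: it uses the classical covariance--fourth-moment scatter pair rather than the local shape, pairwise, or MCD-based scatters highlighted above, and the function names are ours):
\begin{verbatim}
import numpy as np
from scipy.linalg import eigh

def cov4(X):
    """Fourth-moment scatter matrix COV4, a classical companion to COV."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)   # squared Mahalanobis distances
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))

def ics_components(X):
    """Invariant coordinates from the scatter pair (COV, COV4).

    Solves the generalized eigenproblem COV4 b = lambda COV b; the resulting
    coordinates are affine invariant and are returned ordered by decreasing
    eigenvalue, so candidate cluster structure is sought among the first
    and/or last components.
    """
    S1 = np.cov(X, rowvar=False)
    S2 = cov4(X)
    eigvals, B = eigh(S2, S1)          # columns satisfy B^T S1 B = I
    order = np.argsort(eigvals)[::-1]
    Z = (X - X.mean(axis=0)) @ B[:, order]
    return Z, eigvals[order]

# Usage: retain a few of the first and/or last columns of Z and feed them
# to a clustering algorithm such as k-means.
\end{verbatim}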
We present accurate and mathematically consistent formulations of a diffuse-interface model for two-phase flow problems involving rapid evaporation. The model addresses challenges such as density discontinuities of several orders of magnitude, which lead to large velocity and pressure jumps across the liquid-vapor interface, along with dynamically changing interface topologies. To this end, we integrate an incompressible Navier--Stokes solver, combined with a conservative level-set formulation and a regularized, i.e., diffuse, representation of discontinuities, into a matrix-free adaptive finite element framework. The achievements are threefold. First, this work proposes mathematically consistent definitions of the level-set transport velocity in the diffuse interface region, obtained by extrapolating the velocity from the liquid or gas phase, which exhibit superior prediction accuracy for the evaporated mass and the resulting interface dynamics compared to a local velocity evaluation, especially for highly curved interfaces. Second, we show that accurate prediction of the evaporation-induced pressure jump requires a consistent, namely reciprocal, density interpolation across the interface, which satisfies local mass conservation. Third, the combination of diffuse interface models for evaporation with standard Stokes-type constitutive relations for viscous flows leads to significant pressure artifacts in the diffuse interface region. To mitigate these artifacts, we propose a modification of such constitutive models. The aforementioned properties are validated through selected analytical and numerical examples. The presented model promises new insights into the simulation-based prediction of melt-vapor interactions in thermal multiphase flows, such as in laser-based powder bed fusion of metals.
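As a hedged illustration of the reciprocal density interpolation mentioned above (the notation is ours, and the precise regularized indicator used in the paper may differ): with $\phi\in[0,1]$ a smoothed indicator of the vapor phase across the diffuse interface, the mixture density is defined through its reciprocal,
\[
\frac{1}{\rho(\phi)} \;=\; \frac{\phi}{\rho_{\mathrm{v}}} + \frac{1-\phi}{\rho_{\mathrm{l}}},
\]
rather than by the arithmetic blend $\rho(\phi)=\phi\,\rho_{\mathrm{v}}+(1-\phi)\,\rho_{\mathrm{l}}$; interpolating $1/\rho$ (the specific volume) linearly is what ties the interpolation to local mass conservation in the sense described above.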
We develop an anytime-valid permutation test, where the dataset is fixed and the permutations are sampled sequentially one by one, with the objective of saving computational resources by sampling fewer permutations and stopping early. The core technical advance is the development of new test martingales (nonnegative martingales with initial value one) for testing exchangeability against a very particular alternative. These test martingales are constructed using new and simple betting strategies that smartly bet on the relative ranks of permuted test statistics. The betting strategy is guided by the derivation of a simple log-optimal bet and displays excellent power in practice. In contrast to a well-known method by Besag and Clifford, our method yields a valid e-value or p-value at any stopping time and, with particular stopping rules, it yields computational gains under both the null and the alternative without compromising power.
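A minimal sketch of a sequential permutation test in this spirit (our own simplified bet with a fixed multiplier, not the paper's log-optimal strategy; the function and its arguments are illustrative, and a continuous test statistic is assumed so that ties are rare and, when they occur, are counted conservatively as exceedances):
\begin{verbatim}
import numpy as np

def sequential_permutation_test(x, y, stat, alpha=0.05, max_perms=10_000,
                                c=0.1, rng=None):
    """Anytime-valid two-sample permutation test via a test martingale (sketch).

    Under the exchangeability null, conditional on the pooled data, the observed
    statistic T0 and the permuted statistics T1, T2, ... behave like i.i.d.
    draws, so for a continuous statistic the probability that the next permuted
    statistic reaches T0, given the comparisons seen so far, is
    p_t = (1 + #exceedances) / (t + 2).  Betting the fraction lam_t = c * p_t of
    wealth on "exceeds" at fair odds keeps the wealth a nonnegative martingale
    with initial value one; rejecting once wealth >= 1/alpha gives a level-alpha
    test by Ville's inequality.
    """
    rng = np.random.default_rng(rng)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    t0 = stat(x, y)
    wealth, exceed = 1.0, 0
    for t in range(max_perms):
        perm = rng.permutation(pooled)
        ti = stat(perm[:n_x], perm[n_x:])
        p_t = (1 + exceed) / (t + 2)      # null prob. that T_{t+1} >= T0
        lam = c * p_t                     # fraction of wealth bet on "exceeds"
        if ti >= t0:                      # ties counted as exceedances
            wealth *= lam / p_t           # = c
            exceed += 1
        else:
            wealth *= (1 - lam) / (1 - p_t)
        if wealth >= 1 / alpha:
            return {"reject": True, "e_value": wealth, "n_perms": t + 1}
    return {"reject": False, "e_value": wealth, "n_perms": max_perms}

# Example usage with the difference in means as the test statistic:
# res = sequential_permutation_test(x, y, lambda a, b: a.mean() - b.mean())
\end{verbatim}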
Map matching is a common preprocessing step for analysing vehicle trajectories. In the theory community, the most popular approach for map matching is to compute a path on the road network that is the most spatially similar to the trajectory, where spatial similarity is measured using the Fr\'echet distance. A shortcoming of existing map matching algorithms under the Fr\'echet distance is that every time a trajectory is matched, the entire road network needs to be reprocessed from scratch. An open problem is whether one can preprocess the road network into a data structure, so that map matching queries can be answered in sublinear time. In this paper, we investigate map matching queries under the Fr\'echet distance. We provide a negative result for geometric planar graphs. We show that, unless SETH fails, there is no data structure that can be constructed in polynomial time that answers map matching queries in $O((pq)^{1-\delta})$ query time for any $\delta > 0$, where $p$ and $q$ are the complexities of the geometric planar graph and the query trajectory, respectively. We provide a positive result for realistic input graphs, which we regard as the main result of this paper. We show that for $c$-packed graphs, one can construct a data structure of $\tilde O(cp)$ size that can answer $(1+\varepsilon)$-approximate map matching queries in $\tilde O(c^4 q \log^4 p)$ time, where $\tilde O(\cdot)$ hides lower-order factors and the dependence on $\varepsilon$.
A key challenge in analyzing the behavior of change-plane estimators is that the objective function has multiple minimizers. Two estimators are proposed to deal with this non-uniqueness. For each estimator, an $n$-rate of convergence is established and the limiting distribution is derived. Based on these results, we provide a parametric bootstrap procedure for inference. The validity of our theoretical results and the finite-sample performance of the bootstrap are demonstrated through simulation experiments. We illustrate the proposed methods with an application to latent subgroup identification in precision medicine using data from the ACTG175 AIDS study.
We study knowable informational dependence between empirical questions, modeled as continuous functional dependence between variables in a topological setting. We also investigate epistemic independence in topological terms and show that it is compatible with functional (but non-continuous) dependence. We then proceed to study a stronger notion of knowability based on uniformly continuous dependence. On the technical logical side, we determine the complete logics of languages that combine general functional dependence, continuous dependence, and uniformly continuous dependence.
High-dimensional data are routinely collected in many areas. We are particularly interested in Bayesian classification models in which one or more variables are imbalanced. Current Markov chain Monte Carlo algorithms for posterior computation are inefficient as $n$ and/or $p$ increase, due to worsening time per step and mixing rates. One strategy is to use a gradient-based sampler to improve mixing while using data sub-samples to reduce the per-step computational complexity. However, the usual sub-sampling breaks down when applied to imbalanced data. Instead, we generalize piecewise deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch sub-sampling. These approaches maintain the correct stationary distribution with arbitrarily small sub-samples and substantially outperform current competitors. We provide theoretical support and illustrate gains in simulated and real data applications.
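To make the importance-weighted sub-sampling idea concrete, here is a minimal sketch of an unbiased single-observation gradient estimator for a logistic-regression log-likelihood (an illustration under our own assumptions: the model, the sampling probabilities, and the function names are not from the paper, and the piecewise deterministic sampler built on top of such estimators is not shown):
\begin{verbatim}
import numpy as np

def subsample_grad(theta, X, y, probs, rng):
    """Unbiased single-observation estimate of the full log-likelihood gradient.

    Observation j is drawn with probability probs[j]; dividing its gradient by
    probs[j] keeps the estimator unbiased for sum_i grad_i(theta) no matter how
    skewed probs is.  Upweighting the rare class keeps the estimator's variance
    under control for imbalanced data.
    """
    j = rng.choice(len(y), p=probs)
    resid = y[j] - 1.0 / (1.0 + np.exp(-X[j] @ theta))   # y_j - sigmoid(x_j' theta)
    return (resid * X[j]) / probs[j]

def balanced_probs(y):
    """Illustrative choice: give the rare class (y = 1, say) half the mass."""
    p = np.where(y == 1, 0.5 / max((y == 1).sum(), 1),
                          0.5 / max((y == 0).sum(), 1))
    return p / p.sum()

# rng = np.random.default_rng(0)
# g_hat = subsample_grad(theta, X, y, balanced_probs(y), rng)
\end{verbatim}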
Various methods have recently been proposed to estimate causal effects with confidence intervals that are uniformly valid over a set of data-generating processes when high-dimensional nuisance models are estimated by post-model-selection or machine learning estimators. These methods typically require that all confounders are observed to ensure identification of the effects. We contribute by showing how valid semiparametric inference can be obtained in the presence of unobserved confounders and high-dimensional nuisance models. We propose uncertainty intervals that allow for unobserved confounding and show that the resulting inference is valid when the amount of unobserved confounding is small relative to the sample size; the latter is formalized in terms of convergence rates. Simulation experiments illustrate the finite-sample properties of the proposed intervals and investigate an alternative procedure that improves the empirical coverage of the intervals when the amount of unobserved confounding is large. Finally, a case study on the effect of smoking during pregnancy on birth weight illustrates how the proposed methods can be used to perform a sensitivity analysis with respect to unobserved confounding.
Building robust, interpretable, and secure AI systems requires quantifying and representing uncertainty from a probabilistic perspective, mimicking human cognitive abilities. However, probabilistic computation presents significant challenges for most conventional artificial neural networks, as they are essentially implemented in a deterministic manner. In this paper, we develop an efficient probabilistic computation framework by truncating the probabilistic representation of neural activations at their mean and covariance, and we construct a moment neural network that encapsulates the nonlinear coupling between the mean and covariance of the underlying stochastic network. We reveal that when only the mean, but not the covariance, is supervised during gradient-based learning, the unsupervised covariance spontaneously emerges from its nonlinear coupling with the mean and faithfully captures the uncertainty associated with model predictions. Our findings highlight the inherent simplicity of probabilistic computation, which seamlessly incorporates uncertainty into model predictions, paving the way for integrating probabilistic computation into large-scale AI systems.
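A minimal numerical sketch of the moment propagation described above (our own simplified illustration rather than the paper's construction: mean and covariance are propagated exactly through a linear layer, and the standard Gaussian-closure formulas are used for a ReLU on the diagonal of the covariance only; the full mean--covariance coupling would also track the off-diagonal terms):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def linear_moments(mu, Sigma, W, b):
    """Exact mean/covariance of W x + b when x has mean mu and covariance Sigma."""
    return W @ mu + b, W @ Sigma @ W.T

def relu_moments_diag(mu, Sigma, eps=1e-12):
    """Gaussian-closure mean and (diagonal) variance of relu(x), elementwise.

    For x ~ N(m, s^2):  E[relu(x)]   = m * Phi(m/s) + s * phi(m/s)
                        E[relu(x)^2] = (m^2 + s^2) * Phi(m/s) + m * s * phi(m/s).
    Off-diagonal covariances (part of the coupling emphasized in the text) are
    omitted here for brevity.
    """
    s = np.sqrt(np.clip(np.diag(Sigma), eps, None))
    z = mu / s
    mean = mu * norm.cdf(z) + s * norm.pdf(z)
    second = (mu**2 + s**2) * norm.cdf(z) + mu * s * norm.pdf(z)
    return mean, np.diag(np.clip(second - mean**2, 0.0, None))

# One hidden "moment layer": propagate (mu, Sigma) instead of samples.
rng = np.random.default_rng(0)
mu, Sigma = np.zeros(3), np.eye(3)
W, b = rng.standard_normal((4, 3)), np.zeros(4)
mu_h, Sigma_h = relu_moments_diag(*linear_moments(mu, Sigma, W, b))
print(mu_h, np.diag(Sigma_h))
\end{verbatim}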