The presented methodology for testing the goodness-of-fit of an Autoregressive Hilbertian model (ARH(1) model) provides an infinite-dimensional formulation of the approach proposed in Koul and Stute (1999), based on empirical process marked by residuals. Applying a central and functional central limit result for Hilbert-valued martingale difference sequences, the asymptotic behavior of the formulated H-valued empirical process, also indexed by H, is obtained under the null hypothesis. The limiting process is H-valued generalized (i.e., indexed by H) Wiener process, leading to an asymptotically distribution free test. Consistency is also analyzed. The case of misspecified autocorrelation operator of the ARH(1) process is addressed as well. Beyond the Euclidean setting, this approach allows to implement goodness of fit testing in the context of manifold and spherical functional autoregressive processes.
We consider the problem of learning functions in the $\mathcal{F}_{p,\pi}$ and Barron spaces, which are natural function spaces that arise in the high-dimensional analysis of random feature models (RFMs) and two-layer neural networks. Through a duality analysis, we reveal that the approximation and estimation of these spaces can be considered equivalent in a certain sense. This enables us to focus on the easier problem of approximation and estimation when studying the generalization of both models. The dual equivalence is established by defining an information-based complexity that can effectively control estimation errors. Additionally, we demonstrate the flexibility of our duality framework through comprehensive analyses of two concrete applications. The first application is to study learning functions in $\mathcal{F}_{p,\pi}$ with RFMs. We prove that the learning does not suffer from the curse of dimensionality as long as $p>1$, implying RFMs can work beyond the kernel regime. Our analysis extends existing results [CMM21] to the noisy case and removes the requirement of overparameterization. The second application is to investigate the learnability of reproducing kernel Hilbert space (RKHS) under the $L^\infty$ metric. We derive both lower and upper bounds of the minimax estimation error by using the spectrum of the associated kernel. We then apply these bounds to dot-product kernels and analyze how they scale with the input dimension. Our results suggest that learning with ReLU (random) features is generally intractable in terms of reaching high uniform accuracy.
Discovering causal relations from observational data is important. The existence of unobserved variables (e.g. latent confounding or mediation) can mislead the causal identification. To overcome this problem, proximal causal discovery methods attempted to adjust for the bias via the proxy of the unobserved variable. Particularly, hypothesis test-based methods proposed to identify the causal edge by testing the induced violation of linearity. However, these methods only apply to discrete data with strict level constraints, which limits their practice in the real world. In this paper, we fix this problem by extending the proximal hypothesis test to cases where the system consists of continuous variables. Our strategy is to present regularity conditions on the conditional distributions of the observed variables given the hidden factor, such that if we discretize its observed proxy with sufficiently fine, finite bins, the involved discretization error can be effectively controlled. Based on this, we can convert the problem of testing continuous causal relations to that of testing discrete causal relations in each bin, which can be effectively solved with existing methods. These non-parametric regularities we present are mild and can be satisfied by a wide range of structural causal models. Using both simulated and real-world data, we show the effectiveness of our method in recovering causal relations when unobserved variables exist.
Let a polytope $P$ be defined by a system $A x \leq b$. We consider the problem of counting the number of integer points inside $P$, assuming that $P$ is $\Delta$-modular, where the polytope $P$ is called $\Delta$-modular if all the rank sub-determinants of $A$ are bounded by $\Delta$ in the absolute value. We present a new FPT-algorithm, parameterized by $\Delta$ and by the maximal number of vertices in $P$, where the maximum is taken by all r.h.s. vectors $b$. We show that our algorithm is more efficient for $\Delta$-modular problems than the approach of A. Barvinok et al. To this end, we do not directly compute the short rational generating function for $P \cap Z^n$, which is commonly used for the considered problem. Instead, we use the dynamic programming principle to compute its particular representation in the form of exponential series that depends on a single variable. We completely do not rely to the Barvinok's unimodular sign decomposition technique. Using our new complexity bound, we consider different special cases that may be of independent interest. For example, we give FPT-algorithms for counting the integer points number in $\Delta$-modular simplices and similar polytopes that have $n + O(1)$ facets. As a special case, for any fixed $m$, we give an FPT-algorithm to count solutions of the unbounded $m$-dimensional $\Delta$-modular subset-sum problem.
Clustering is a powerful and extensively used data science tool. While clustering is generally thought of as an unsupervised learning technique, there are also supervised variations such as Spath's clusterwise regression that attempt to find clusters of data that yield low regression error on a supervised target. We believe that clusterwise regression is just a single vertex of a largely unexplored design space of supervised clustering models. In this article, we define a generalized optimization framework for predictive clustering that admits different cluster definitions (arbitrary point assignment, closest center, and bounding box) and both regression and classification objectives. We then present a joint optimization strategy that exploits mixed-integer linear programming (MILP) for global optimization in this generalized framework. To alleviate scalability concerns for large datasets, we also provide highly scalable greedy algorithms inspired by the Majorization-Minimization (MM) framework. Finally, we demonstrate the ability of our models to uncover different interpretable discrete cluster structures in data by experimenting with four real-world datasets.
In this paper, we focus our attention on the high-dimensional double sparse linear regression, that is, a combination of element-wise and group-wise sparsity.To address this problem, we propose an IHT-style (iterative hard thresholding) procedure that dynamically updates the threshold at each step. We establish the matching upper and lower bounds for parameter estimation, showing the optimality of our proposal in the minimax sense. Coupled with a novel sparse group information criterion, we develop a fully adaptive procedure to handle unknown group sparsity and noise levels.We show that our adaptive procedure achieves optimal statistical accuracy with fast convergence. Finally, we demonstrate the superiority of our method by comparing it with several state-of-the-art algorithms on both synthetic and real-world datasets.
Most inverse problems from physical sciences are formulated as PDE-constrained optimization problems. This involves identifying unknown parameters in equations by optimizing the model to generate PDE solutions that closely match measured data. The formulation is powerful and widely used in many sciences and engineering fields. However, one crucial assumption is that the unknown parameter must be deterministic. In reality, however, many problems are stochastic in nature, and the unknown parameter is random. The challenge then becomes recovering the full distribution of this unknown random parameter. It is a much more complex task. In this paper, we examine this problem in a general setting. In particular, we conceptualize the PDE solver as a push-forward map that pushes the parameter distribution to the generated data distribution. This way, the SDE-constrained optimization translates to minimizing the distance between the generated distribution and the measurement distribution. We then formulate a gradient-flow equation to seek the ground-truth parameter probability distribution. This opens up a new paradigm for extending many techniques in PDE-constrained optimization to that for systems with stochasticity.
Random smoothing data augmentation is a unique form of regularization that can prevent overfitting by introducing noise to the input data, encouraging the model to learn more generalized features. Despite its success in various applications, there has been a lack of systematic study on the regularization ability of random smoothing. In this paper, we aim to bridge this gap by presenting a framework for random smoothing regularization that can adaptively and effectively learn a wide range of ground truth functions belonging to the classical Sobolev spaces. Specifically, we investigate two underlying function spaces: the Sobolev space of low intrinsic dimension, which includes the Sobolev space in $D$-dimensional Euclidean space or low-dimensional sub-manifolds as special cases, and the mixed smooth Sobolev space with a tensor structure. By using random smoothing regularization as novel convolution-based smoothing kernels, we can attain optimal convergence rates in these cases using a kernel gradient descent algorithm, either with early stopping or weight decay. It is noteworthy that our estimator can adapt to the structural assumptions of the underlying data and avoid the curse of dimensionality. This is achieved through various choices of injected noise distributions such as Gaussian, Laplace, or general polynomial noises, allowing for broad adaptation to the aforementioned structural assumptions of the underlying data. The convergence rate depends only on the effective dimension, which may be significantly smaller than the actual data dimension. We conduct numerical experiments on simulated data to validate our theoretical results.
Interpretability methods are developed to understand the working mechanisms of black-box models, which is crucial to their responsible deployment. Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them. While the former has been addressed in prior work, the latter is often overlooked, resulting in informal model understanding derived from a handful of local explanations. In this paper, we introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding, and propose metrics for its quality assessment. On two domains, ExSum highlights various limitations in the current practice, helps develop accurate model understanding, and reveals easily overlooked properties of the model. We also connect understandability to other properties of explanations such as human alignment, robustness, and counterfactual minimality and plausibility.
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.