Consider the normal linear regression setup when the number of covariates p is much larger than the sample size n and the covariates form correlated groups. The response variable y is not related to an entire group of covariates on an all-or-none basis; rather, sparsity persists both within and between groups. We extend the traditional g-prior setup to this framework. Variable selection consistency of the proposed method is shown under fairly general conditions, assuming the covariates to be random and allowing the true model to grow with both n and p. To implement the proposed g-prior method in the high-dimensional setup, we propose two procedures: first, a group screening procedure, termed group SIS (GSIS), and second, a novel stochastic search variable selection algorithm, termed the group informed variable selection algorithm (GiVSA), which uses the known group structure efficiently to explore the model space without discarding any covariate based on an initial screening. Screening consistency of GSIS and the theoretical mixing time of GiVSA are studied using the canonical path ensemble approach of Yang et al. (2016). Performance of the proposed prior, implemented with both GSIS and GiVSA, is validated using various simulated examples and a real dataset on residential buildings.
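The abstract does not spell out the GSIS statistic; a minimal sketch, assuming GSIS ranks groups by the norm of their group-wise marginal correlations with the response (in the spirit of sure independence screening) and keeps the top d groups:

```python
import numpy as np

def gsis(X, y, groups, d):
    """Illustrative group screening: score each group by the norm of its
    marginal correlations with y, keep the top d groups. This is a sketch
    of a group-wise screening rule, not the paper's exact GSIS statistic.

    `groups` maps a group label to the list of its column indices in X.
    """
    n = X.shape[0]
    scores = {g: np.linalg.norm(X[:, idx].T @ y) / n for g, idx in groups.items()}
    return sorted(scores, key=scores.get, reverse=True)[:d]

rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.standard_normal((n, p))
groups = {g: list(range(4 * g, 4 * (g + 1))) for g in range(10)}  # 10 groups of 4
y = X[:, groups[0]].sum(axis=1) + 0.1 * rng.standard_normal(n)    # signal in group 0
print(gsis(X, y, groups, d=2))
```

With the signal confined to group 0, its score dominates the noise groups and it is retained by the screen.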
Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path toward more informed scaling.
We prove that a polynomial fraction of the set of $k$-component forests in the $m \times n$ grid graph have equal numbers of vertices in each component. This resolves a conjecture of Charikar, Liu, Liu, and Vuong. It also establishes the first provably polynomial-time algorithm for (exactly or approximately) sampling balanced grid graph partitions according to the spanning tree distribution, which weights each $k$-partition according to the product, across its $k$ pieces, of the number of spanning trees of each piece. Our result has applications to political redistricting, where there is an underlying graph of indivisible geographic units that must be partitioned into $k$ population-balanced connected subgraphs. In this setting, tree-weighted partitions have interesting geometric properties, and this has stimulated significant effort to develop methods to sample them.
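The spanning-tree counts that define the weights above are classically computable: by Kirchhoff's matrix-tree theorem, the number of spanning trees of a connected graph equals any cofactor of its Laplacian. A small illustration:

```python
import numpy as np

def spanning_tree_count(adj):
    """Count spanning trees via Kirchhoff's matrix-tree theorem:
    any cofactor of the graph Laplacian L = D - A gives the count."""
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return round(np.linalg.det(L[1:, 1:]))  # delete row/col 0, take determinant

# The 2x2 grid graph is a 4-cycle: vertices 0-1-3-2-0
C4 = [[0, 1, 1, 0],
      [1, 0, 0, 1],
      [1, 0, 0, 1],
      [0, 1, 1, 0]]
print(spanning_tree_count(C4))  # a 4-cycle has 4 spanning trees
```

In the spanning tree distribution, a $k$-partition's weight is the product of such counts over its $k$ pieces.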
Continual Federated Learning (CFL) combines Federated Learning (FL), the decentralized learning of a central model on a number of client devices that may not communicate their data, and Continual Learning (CL), the learning of a model from a continual stream of data without keeping the entire history. In CL, the main challenge is \textit{forgetting} what was learned from past data. While replay-based algorithms that keep a small pool of past training data are effective at reducing forgetting, only simple replay sample selection strategies have been applied to CFL in prior work, and no previous work has explored coordination among clients for better sample selection. To bridge this gap, we adapt a replay sample selection objective based on loss gradient diversity to CFL and propose a new relaxation-based selection of samples to optimize the objective. Next, we propose a practical algorithm to coordinate gradient-based replay sample selection across clients without communicating private data. We benchmark our coordinated and uncoordinated replay sample selection algorithms against random sampling-based baselines with language models trained on a large-scale de-identified real-world text dataset. We show that gradient-based sample selection methods both boost performance and reduce forgetting compared to random sampling methods, with our coordination method showing gains early in the low-replay-size regime (when the budget for storing past data is small).
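The abstract does not detail the relaxation-based optimizer; a minimal sketch of the underlying idea, assuming a greedy stand-in that picks replay samples whose loss gradients are mutually diverse (each pick maximizes its minimum cosine distance to the samples already chosen):

```python
import numpy as np

def select_diverse(grads, budget):
    """Greedy gradient-diversity replay selection (illustrative only;
    the paper optimizes a diversity objective via a relaxation instead).
    Each new pick maximizes cosine distance to the nearest chosen sample.
    """
    G = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    chosen = [0]  # seed with the first sample
    while len(chosen) < budget:
        sims = G @ G[chosen].T              # cosine similarity to chosen set
        nearest = sims.max(axis=1)          # closeness to nearest chosen sample
        nearest[chosen] = np.inf            # never re-pick a chosen sample
        chosen.append(int(nearest.argmin()))  # most diverse remaining sample
    return chosen

# Two near-duplicate gradients and one orthogonal one
grads = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(select_diverse(grads, budget=2))  # → [0, 2]
```

Under a budget of two, the near-duplicate of sample 0 is skipped in favor of the orthogonal gradient, which is the behavior a diversity objective rewards.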
A minimal perfect hash function (MPHF) maps a set of n keys to the first n integers without collisions. Representing this bijection needs at least $\log_2(e) \approx 1.443$ bits per key, and there is a wide range of practical implementations achieving about 2 bits per key. Minimal perfect hashing is a key ingredient in many compact data structures such as updatable retrieval data structures and approximate membership data structures. A simple implementation reaching the space lower bound is to sample random hash functions by brute force, which needs about $e^n \approx 2.718^n$ tries in expectation. ShockHash recently reduced that to about $(e/2)^n \approx 1.359^n$ tries in expectation by sampling random graphs. With bipartite ShockHash, we now sample random bipartite graphs. In this paper, we describe the general algorithmic ideas of bipartite ShockHash and give an experimental evaluation. The key insight is that we can try all combinations of two hash functions, each mapping into one half of the output range. This reduces the number of sampled hash functions to only about $(\sqrt{e/2})^n \approx 1.166^n$ in expectation. In itself, this does not reduce the asymptotic running time much because all combinations still need to be tested. However, by filtering the candidates before combining them, we can reduce the number of combinations tested to less than $1.175^n$ in expectation. Our implementation of bipartite ShockHash is up to 3 orders of magnitude faster than the original ShockHash. Inside the RecSplit framework, bipartite ShockHash-RS enables significantly larger base cases, leading to a construction that is, depending on the allotted space budget, up to 20 times faster. In our most extreme configuration, ShockHash-RS can build an MPHF for 10 million keys with 1.489 bits per key (within 3.3% of the lower bound) in about half an hour, pushing the limits of what is possible.
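The per-key bases quoted above follow from simple constants and can be checked directly; a quick sketch of the arithmetic:

```python
import math

def expected_tries(n, per_key_base):
    """Expected number of seeds tried for n keys, given the per-key base."""
    return per_key_base ** n

brute_force = math.e                 # ~2.718 per key: random hash functions
shockhash   = math.e / 2             # ~1.359 per key: random graph sampling
bipartite   = math.sqrt(math.e / 2)  # ~1.166 per key: pairs of half-range functions

print(f"bases: {brute_force:.3f}  {shockhash:.3f}  {bipartite:.3f}")
print(f"space lower bound: {math.log2(math.e):.3f} bits per key")
print(f"n = 20 keys: {expected_tries(20, bipartite):.1f} expected hash functions")
```

Squaring the bipartite base recovers the ShockHash base, reflecting that each sampled half-range function participates in many tested combinations.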
Anomaly Detection (AD) is a critical task that involves identifying observations that do not conform to a learned model of normality. Prior work in deep AD is predominantly based on a familiarity hypothesis, where familiar features serve as the reference in a pre-trained embedding space. While this strategy has proven highly successful, it causes consistent false negatives when anomalies consist of truly novel features that the pre-trained encoding does not capture well. We propose a novel approach to AD that uses explainability to capture novel features as unexplained observations in the input space. By combining similarity and novelty in a hybrid approach, our method establishes a new state-of-the-art across a wide range of anomaly benchmarks, handling diverse anomaly types while eliminating the need for expensive background models and dense matching. In particular, we show that by taking novel features into account, we reduce false-negative anomalies by up to 40% on challenging benchmarks compared to the state-of-the-art. Our method gives visually inspectable explanations for pixel-level anomalies.
We consider an equation in several variables for which a partial derivative does not vanish at a point. The implicit function theorem then guarantees local existence and uniqueness of a function solving the equation. In this paper, we propose an algorithm to approximate this function by a polynomial without using higher-order differentiability; the method depends essentially on integrability. Moreover, we extend the method to a system of equations whose Jacobian determinant does not vanish. This yields a robust method for implicit functions that are not differentiable to higher order. Additionally, we present two numerical experiments to verify the theoretical results.
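The paper's integrability-based construction is not reproduced in the abstract; as a hedged illustration of the setting, one can tabulate the implicit function by derivative-free bisection and fit a polynomial to the samples:

```python
import numpy as np

def bisect(f, lo, hi, iters=60):
    """Derivative-free root finding; assumes f(lo) and f(hi) differ in sign."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# F(x, y) = x^2 + y^2 - 1 with dF/dy != 0 near (0, 1), so y(x) = sqrt(1 - x^2)
xs = np.linspace(-0.5, 0.5, 21)
ys = [bisect(lambda y: x**2 + y**2 - 1, 0.5, 1.5) for x in xs]
coeffs = np.polyfit(xs, ys, deg=4)  # polynomial approximant of the implicit y(x)
print(np.polyval(coeffs, 0.3), np.sqrt(1 - 0.3**2))
```

Bisection uses only sign information, so no derivatives of F beyond the nondegeneracy hypothesis are needed; the degree-4 fit already matches the implicit function to about three decimal places on this interval.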
This article discusses uncertainty quantification (UQ) for time-independent linear and nonlinear partial differential equation (PDE)-based systems with random model parameters, carried out using a sampling-free intrusive stochastic Galerkin method that leverages multilevel scalable solvers combining a two-grid Schwarz method with algebraic multigrid (AMG). High-resolution spatial meshes together with a large number of stochastic expansion terms increase the system size, leading to significant memory consumption and computational cost. To this end, domain decomposition (DD)-based parallel scalable solvers are developed for linear and nonlinear stochastic PDEs. A generalized minimal residual (GMRES) iterative solver is equipped with a multilevel preconditioner, consisting of restricted additive Schwarz (RAS) on the fine grid and AMG on the coarse grid, to improve scalability. Numerical experiments illustrate the scalability of the proposed solver for stochastic linear and nonlinear Poisson problems.
We consider the distributionally robust optimization (DRO) problem with a spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and the average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
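One of the objectives named above is easy to make concrete: the average top-$k$ loss is the empirical CVaR at level $k/n$, a spectral risk measure that puts weight $1/k$ on each of the $k$ largest per-sample losses:

```python
import numpy as np

def average_top_k(losses, k):
    """Average top-k loss: mean of the k largest per-sample losses.
    Equals the empirical CVaR at level k/n; as a spectral risk measure,
    its spectrum puts weight 1/k on each of the k largest losses."""
    return float(np.sort(np.asarray(losses))[-k:].mean())

losses = [0.2, 1.5, 0.1, 3.0, 0.7]
print(average_top_k(losses, k=2))  # mean of {3.0, 1.5} = 2.25
```

At $k = n$ this recovers the ordinary average loss, and at $k = 1$ the worst-case loss, so the spectral family interpolates between risk-neutral and worst-case training.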
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs that use only modus ponens, are of a specific size, and come from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations along multiple axes: depth, width, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
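The abstract describes the dataset only as synthetic and programmable; a minimal sketch of the idea, with a hypothetical generator (names and output format are illustrative, not the paper's schema) that produces a problem whose shortest proof is a modus ponens chain of controlled depth:

```python
def modus_ponens_chain(depth):
    """Build a toy deduction problem whose shortest proof is a modus
    ponens chain of the given depth. Illustrative format only; the
    paper's dataset controls more rules and complexity axes than this.
    """
    atoms = [f"A{i}" for i in range(depth + 1)]
    premises = [atoms[0]] + [f"{a} -> {b}" for a, b in zip(atoms, atoms[1:])]
    proof = [f"{atoms[i]}, {atoms[i]} -> {atoms[i + 1]} |- {atoms[i + 1]}"
             for i in range(depth)]
    return {"premises": premises, "goal": atoms[-1], "proof": proof}

ex = modus_ponens_chain(3)
print(ex["goal"])        # A3
print(len(ex["proof"]))  # 3 modus ponens steps
```

Varying the `depth` parameter (and, in the paper's setting, the rule set and proof width) is what enables the controlled depth-, width-, and compositional-generalization tests.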
A novel recurrence formula for moments with respect to M\"{u}ntz-Legendre polynomials is proposed and applied to construct a numerical method for computing generalized Gauss quadrature rules with power-function weights for M\"{u}ntz systems. These quadrature rules exhibit several properties of the classical Gaussian quadratures for polynomial systems, including positive weights and rapid convergence. They are applicable to a wide range of integrands, including smooth functions and functions with endpoint singularities, which commonly arise in integral equations with singular kernels, complex analysis, potential theory, and other areas.