Adversarially robust models are locally smooth around each data sample so that small perturbations cannot drastically change model outputs. In modern systems, such smoothness is usually obtained via Adversarial Training, which explicitly enforces models to perform well on perturbed examples. In this work, we show the surprising effectiveness of instead regularizing the gradient with respect to model inputs on natural examples only. Penalizing input Gradient Norm is commonly believed to be a much inferior approach. Our analyses identify that the performance of Gradient Norm regularization critically depends on the smoothness of activation functions, and are in fact extremely effective on modern vision transformers that adopt smooth activations over piecewise linear ones (eg, ReLU), contrary to prior belief. On ImageNet-1k, Gradient Norm training achieves > 90% the performance of state-of-the-art PGD-3 Adversarial Training} (52% vs.~56%), while using only 60% computation cost of the state-of-the-art without complex adversarial optimization. Our analyses also highlight the relationship between model robustness and properties of natural input gradients, such as asymmetric sample and channel statistics. Surprisingly, we find model robustness can be significantly improved by simply regularizing its gradients to concentrate on image edges without explicit conditioning on the gradient norm.
Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we take the first step in fast sampling of DDBMs without extra training, motivated by the well-established recipes in diffusion models. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same marginal distributions and training objectives, give rise to generative processes ranging from stochastic to deterministic, and result in diffusion bridge implicit models (DBIMs). DBIMs are not only up to 25$\times$ faster than the vanilla sampler of DDBMs but also induce a novel, simple, and insightful form of ordinary differential equation (ODE) which inspires high-order numerical solvers. Moreover, DBIMs maintain the generation diversity in a distinguished way, by using a booting noise in the initial sampling step, which enables faithful encoding, reconstruction, and semantic interpolation in image translation tasks. Code is available at \url{//github.com/thu-ml/DiffusionBridge}.
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
The adoption of increasingly complex deep models has fueled an urgent need for insight into how these models make predictions. Counterfactual explanations form a powerful tool for providing actionable explanations to practitioners. Previously, counterfactual explanation methods have been designed by traversing the latent space of generative models. Yet, these latent spaces are usually greatly simplified, with most of the data distribution complexity contained in the decoder rather than the latent embedding. Thus, traversing the latent space naively without taking the nonlinear decoder into account can lead to unnatural counterfactual trajectories. We introduce counterfactual explanations obtained using a Riemannian metric pulled back via the decoder and the classifier under scrutiny. This metric encodes information about the complex geometric structure of the data and the learned representation, enabling us to obtain robust counterfactual trajectories with high fidelity, as demonstrated by our experiments in real-world tabular datasets.
Compromise estimation entails using a weighted average of outputs from several candidate models, and is a viable alternative to model selection when the choice of model is not obvious. As such, it is a tool used by both frequentists and Bayesians, and in both cases, the literature is vast and includes studies of performance in simulations and applied examples. However, frequentist researchers often prove oracle properties, showing that a proposed average asymptotically performs at least as well as any other average comprising the same candidates. On the Bayesian side, such oracle properties are yet to be established. This paper considers Bayesian stacking estimators, and evaluates their performance using frequentist asymptotics. Oracle properties are derived for estimators stacking Bayesian linear and logistic regression models, and combined with Monte Carlo experiments that show Bayesian stacking may outperform the best candidate model included in the stack. Thus, the result is not only a frequentist motivation of a fundamentally Bayesian procedure, but also an extended range of methods available to frequentist practitioners.
Efficient sampling from un-normalized target distributions is pivotal in scientific computing and machine learning. While neural samplers have demonstrated potential with a special emphasis on sampling efficiency, existing neural implicit samplers still have issues such as poor mode covering behavior, unstable training dynamics, and sub-optimal performances. To tackle these issues, in this paper, we introduce Denoising Fisher Training (DFT), a novel training approach for neural implicit samplers with theoretical guarantees. We frame the training problem as an objective of minimizing the Fisher divergence by deriving a tractable yet equivalent loss function, which marks a unique theoretical contribution to assessing the intractable Fisher divergences. DFT is empirically validated across diverse sampling benchmarks, including two-dimensional synthetic distribution, Bayesian logistic regression, and high-dimensional energy-based models (EBMs). Notably, in experiments with high-dimensional EBMs, our best one-step DFT neural sampler achieves results on par with MCMC methods with up to 200 sampling steps, leading to a substantially greater efficiency over 100 times higher. This result not only demonstrates the superior performance of DFT in handling complex high-dimensional sampling but also sheds light on efficient sampling methodologies across broader applications.
As quantum computing advances, traditional cryptographic security measures, including token obfuscation, are increasingly vulnerable to quantum attacks. This paper introduces a quantum-enhanced approach to token obfuscation leveraging quantum superposition and multi-basis verification to establish a robust defense against these threats. In our method, tokens are encoded in superposition states, making them simultaneously exist in multiple states until measured, thus enhancing obfuscation complexity. Multi-basis verification further secures these tokens by enforcing validation across multiple quantum bases, thwarting unauthorized access. Additionally, we incorporate a quantum decay protocol and a refresh mechanism to manage the token life-cycle securely. Our experimental results demonstrate significant improvements in token security and robustness, validating this approach as a promising solution for quantum-secure cryptographic applications. This work not only highlights the feasibility of quantum-based token obfuscation but also lays the foundation for future quantum-safe security architectures.
Explainability methods are often challenging to evaluate and compare. With a multitude of explainers available, practitioners must often compare and select explainers based on quantitative evaluation metrics. One particular differentiator between explainers is the diversity of explanations for a given dataset; i.e. whether all explanations are identical, unique and uniformly distributed, or somewhere between these two extremes. In this work, we define a complexity measure for explainers, globalness, which enables deeper understanding of the distribution of explanations produced by feature attribution and feature selection methods for a given dataset. We establish the axiomatic properties that any such measure should possess and prove that our proposed measure, Wasserstein Globalness, meets these criteria. We validate the utility of Wasserstein Globalness using image, tabular, and synthetic datasets, empirically showing that it both facilitates meaningful comparison between explainers and improves the selection process for explainability methods.
Bayesian hierarchical models are commonly employed for inference in count datasets, as they account for multiple levels of variation by incorporating prior distributions for parameters at different levels. Examples include Beta-Binomial, Negative-Binomial (NB), Dirichlet-Multinomial (DM) distributions. In this paper, we address two crucial challenges that arise in various Bayesian count models: inference for the concentration parameter in the ratio of Gamma functions and the inability of these models to effectively handle excessive zeros and small nonzero counts. We propose a novel class of prior distributions that facilitates conjugate updating of the concentration parameter in Gamma ratios, enabling full Bayesian inference for the aforementioned count distributions. We use DM models as our running examples. Our methodology leverages fast residue computation and admits closed-form posterior moments. Additionally, we recommend a default horseshoe type prior which has a heavy tail and substantial mass around zero. It admits continuous shrinkage, making the posterior highly adaptable to sparsity or quasi-sparsity in the data. Furthermore, we offer insights and potential generalizations to other count models facing the two challenges. We demonstrate the usefulness of our approach on both simulated examples and on real-world applications. Finally, we conclude with directions for future research.
We prove that the long-run behavior of Hawkes processes is fully determined by the average number and the dispersion of child events. For subcritical processes we provide FLLNs and FCLTs under minimal conditions on the kernel of the process with the precise form of the limit theorems depending strongly on the dispersion of child events. For a critical Hawkes process with weakly dispersed child events, functional central limit theorems do not hold. Instead, we prove that the rescaled intensity processes and rescaled Hawkes processes behave like CIR-processes without mean-reversion, respectively integrated CIR-processes. We provide the rate of convergence by establishing an upper bound on the Wasserstein distance between the distributions of rescaled Hawkes process and the corresponding limit process. By contrast, critical Hawkes process with heavily dispersed child events share many properties of subcritical ones. In particular, functional limit theorems hold. However, unlike subcritical processes critical ones with heavily dispersed child events display long-range dependencies.
2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use a decision layer fusion with multiple memory banks to avoid loss of information and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods on both detection and segmentation precision on MVTec-3D AD dataset. Code is available at //github.com/nomewang/M3DM.