Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.
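The decomposition described above can be illustrated with a minimal sketch. It assumes the usual CFG form, in which the guided prediction is the conditional prediction plus a scaled difference between the conditional and unconditional predictions. The names `scale` and `eta` and the flat-list representation of a prediction are assumptions made for illustration, and the rescaling and momentum steps mentioned in the abstract are not included.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_update(delta: Vector, reference: Vector) -> Tuple[Vector, Vector]:
    """Split delta into parts parallel and orthogonal to reference."""
    denom = dot(reference, reference)
    if denom == 0.0:
        # No direction to project onto: treat the whole update as orthogonal.
        return [0.0] * len(delta), list(delta)
    coef = dot(delta, reference) / denom
    parallel = [coef * r for r in reference]
    orthogonal = [d - p for d, p in zip(delta, parallel)]
    return parallel, orthogonal


def projected_guidance(cond: Vector, uncond: Vector,
                       scale: float, eta: float) -> Vector:
    """Guided prediction with the parallel update component weighted by eta.

    eta is a hypothetical parameter name: eta = 1 gives the standard CFG
    update, and eta < 1 down-weights the parallel component.
    """
    delta = [c - u for c, u in zip(cond, uncond)]  # CFG update direction
    parallel, orthogonal = split_update(delta, cond)
    return [c + (scale - 1.0) * (o + eta * p)
            for c, o, p in zip(cond, orthogonal, parallel)]
```

With `eta = 1.0` the function returns `uncond + scale * (cond - uncond)`, i.e. plain CFG, so the projection only changes the result when the parallel component is actually down-weighted.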
The optimal branch number of MDS matrices has established their importance in designing diffusion layers for various block ciphers and hash functions. As a result, numerous matrix structures, including Hadamard and circulant matrices, have been proposed for constructing MDS matrices. Also, in the literature, significant attention is typically given to identifying MDS candidates with optimal implementations or proposing new constructions across different orders. However, this paper takes a different approach by not emphasizing efficiency issues or introducing new constructions. Instead, its primary objective is to enumerate Hadamard MDS and involutory Hadamard MDS matrices of order $4$ within the field $\mathbb{F}_{2^r}$. Specifically, it provides an explicit formula for the count of both Hadamard MDS and involutory Hadamard MDS matrices of order $4$ over $\mathbb{F}_{2^r}$. Additionally, it derives the count of Hadamard Near-MDS (NMDS) and involutory Hadamard NMDS matrices, each with exactly one zero in each row, of order $4$ over $\mathbb{F}_{2^r}$. Furthermore, the paper discusses some circulant-like matrices for constructing NMDS matrices and proves that when $n$ is even, any $2n \times 2n$ Type-II circulant-like matrix can never be an NMDS matrix. While it is known that NMDS matrices may be singular, this paper establishes that singular Hadamard matrices can never be NMDS matrices. Moreover, it proves that there exist exactly two orthogonal Type-I circulant-like matrices of order $4$ over $\mathbb{F}_{2^r}$.
While induction is considered a key mechanism for in-context learning in LLMs, understanding its precise circuit decomposition beyond toy models remains elusive. Here, we study the emergence of induction behavior within LLMs by probing their response to weak single-token perturbations of the residual stream. We find that LLMs exhibit a robust, universal regime in which their response remains scale-invariant under changes in perturbation strength, thereby allowing us to quantify the build-up of token correlations throughout the model. By applying our method, we observe signatures of induction behavior within the residual stream of Gemma-2-2B, Llama-3.2-3B, and GPT-2-XL. Across all models, we find that these induction signatures gradually emerge within intermediate layers and identify the relevant model sections composing this behavior. Our results provide insights into the collective interplay of components within LLMs and serve as a benchmark for large-scale circuit analysis.
It has been shown that one can design distributed algorithms that are (nearly) singularly optimal, meaning they simultaneously achieve optimal time and message complexity (within polylogarithmic factors), for several fundamental global problems such as broadcast, leader election, and spanning tree construction, under the $\text{KT}_0$ assumption. With this assumption, nodes have initial knowledge only of themselves, not their neighbors. In this case the time and message lower bounds are $\Omega(D)$ and $\Omega(m)$, respectively, where $D$ is the diameter of the network and $m$ is the number of edges, and there exist (even) deterministic algorithms that simultaneously match these bounds. On the other hand, under the $\text{KT}_1$ assumption, whereby each node has initial knowledge of itself and the identifiers of its neighbors, the situation is not clear. For the $\text{KT}_1$ CONGEST model (where messages are of small size), King, Kutten, and Thorup (KKT) showed that one can solve several fundamental global problems (with the notable exception of BFS tree construction) such as broadcast, leader election, and spanning tree construction with $\tilde{O}(n)$ message complexity ($n$ is the network size), which can be significantly smaller than $m$. Randomization is crucial in obtaining this result. While the message complexity of the KKT result is near-optimal, its time complexity is $\tilde{O}(n)$ rounds, which is far from the standard lower bound of $\Omega(D)$. In this paper, we show that in the $\text{KT}_1$ LOCAL model (where message sizes are not restricted), singular optimality is achievable. Our main result is that all global problems, including BFS tree construction, can be solved in $\tilde{O}(D)$ rounds and $\tilde{O}(n)$ messages, where both bounds are optimal up to polylogarithmic factors. Moreover, we show that this can be achieved deterministically.
Kernel ridge regression (KRR) is a generalization of linear ridge regression that is non-linear in the data but linear in the parameters. The solution can be obtained either in closed form, which involves solving a system of linear equations, or iteratively through gradient descent. The iterative approach makes it possible to change the kernel during training, which is what we investigate in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translation-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data that decreasing the bandwidth during training outperforms using a constant bandwidth selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that, with a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.
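As a rough illustration of changing the kernel during iterative training, the sketch below runs gradient descent on the squared loss in function space with a Gaussian kernel whose bandwidth shrinks at every step. The geometric decay schedule, the step size, and all function and parameter names are assumptions made for this example; this is not the bandwidth update scheme proposed in the paper.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_kernel(a: float, b: float, bandwidth: float) -> float:
    """Translation-invariant Gaussian kernel on scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def fit_decreasing_bandwidth(
    xs: Sequence[float], ys: Sequence[float], x_new: Sequence[float],
    steps: int = 300, start_bw: float = 2.0, decay: float = 0.98,
    min_bw: float = 1e-3,
) -> Tuple[List[float], List[float]]:
    """Kernel gradient descent with a bandwidth that shrinks every step.

    Returns the fitted values at the training inputs and the predictions
    at x_new. The decay schedule is an assumed, illustrative choice.
    """
    n = len(xs)
    # The kernel matrix has unit diagonal, so its largest eigenvalue is at
    # most n and a step size of 1/n keeps the iteration stable.
    lr = 1.0 / n
    fitted = [0.0] * n
    predictions = [0.0] * len(x_new)
    bw = start_bw
    for _ in range(steps):
        residual = [y - f for y, f in zip(ys, fitted)]
        # Move every prediction along the kernel-weighted residuals.
        fitted = [
            f + lr * sum(gaussian_kernel(xi, xj, bw) * r
                         for xj, r in zip(xs, residual))
            for xi, f in zip(xs, fitted)
        ]
        predictions = [
            p + lr * sum(gaussian_kernel(xq, xj, bw) * r
                         for xj, r in zip(xs, residual))
            for xq, p in zip(x_new, predictions)
        ]
        bw = max(min_bw, bw * decay)  # shrink the bandwidth toward zero
    return fitted, predictions
```

Early steps use a wide kernel and fit the smooth trend; as the bandwidth shrinks, the kernel matrix approaches the identity and the remaining training residuals are driven toward zero, which is one way to see how zero training error can arise in this setting.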
We give an approach for characterizing interference by lower bounding the number of units whose outcome depends on the treatment of certain groups of individuals, such as the treatment of others, or of others who are at least a certain distance away. The approach is applicable to randomized experiments with binary-valued outcomes. Asymptotically conservative point estimates and one-sided confidence intervals may be constructed with no assumptions beyond the known randomization design, allowing the approach to be used when interference is poorly understood, or when an observed network might only be a crude proxy for the underlying social mechanisms. Point estimates are equal to Hajek-weighted comparisons of units with differing levels of treatment exposure. Empirically, we find that the size of our interval estimates is competitive with (and often smaller than) those of the EATE, an assumption-lean treatment effect, suggesting that the proposed estimands may be intrinsically easier to estimate than treatment effects.
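For readers unfamiliar with Hajek weighting, a generic sketch of a Hajek-weighted comparison between two exposure levels follows. It assumes each unit's probability of receiving each exposure level is known from the randomization design; the function names and the two-level setup are illustrative assumptions, and the sketch does not reproduce the paper's specific estimands or confidence intervals.

```python
from typing import Sequence


def hajek_mean(outcomes: Sequence[float], exposed: Sequence[bool],
               probs: Sequence[float]) -> float:
    """Hajek (ratio) estimate of the mean outcome under one exposure level.

    probs[i] is the design probability that unit i receives this exposure.
    """
    numerator = 0.0
    denominator = 0.0
    for y, is_exposed, p in zip(outcomes, exposed, probs):
        if is_exposed:
            numerator += y / p       # inverse-probability-weighted outcome
            denominator += 1.0 / p   # normalizing sum of the weights
    if denominator == 0.0:
        raise ValueError("no unit received this exposure level")
    return numerator / denominator


def hajek_contrast(outcomes: Sequence[float],
                   exposed_a: Sequence[bool], probs_a: Sequence[float],
                   exposed_b: Sequence[bool], probs_b: Sequence[float]) -> float:
    """Difference of Hajek means between exposure levels a and b."""
    return (hajek_mean(outcomes, exposed_a, probs_a)
            - hajek_mean(outcomes, exposed_b, probs_b))
```

Normalizing by the sum of the weights, rather than by the number of units, is what distinguishes the Hajek estimator from the Horvitz-Thompson estimator.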
The decidability of the reachability problem for finitary PCF has been used as a theoretical basis for fully automated verification tools for functional programs. The reachability problem, however, often becomes undecidable for slight extensions of finitary PCF with side effects, such as exceptions, algebraic effects, and references, which has hindered the extension of these verification tools to functional programs with side effects. In this paper, we first give simple proofs of undecidability for four extensions of finitary PCF, which help us understand and analyze the source of undecidability. We then focus on an extension with references and give a decidable fragment using a type system. To our knowledge, this is the first non-trivial decidable fragment that features higher-order recursive functions containing reference cells.
The immersed interface method (IIM) for models of fluid flow and fluid-structure interaction imposes jump conditions that capture stress discontinuities generated by forces that are concentrated along immersed boundaries. Most prior work using the IIM for fluid dynamic applications has focused on smooth interfaces, but boundaries with sharp features such as corners and edges can appear in practical analyses, particularly on engineered structures. The present study builds on our work to integrate finite element-type representations of interface geometries with the IIM. Initial realizations of this approach used a continuous Galerkin (CG) finite element discretization for the boundary, but as we show herein, these approaches generate large errors near sharp geometrical features. To overcome this difficulty, this study introduces an IIM approach using discontinuous Galerkin (DG) representation of the jump conditions. Numerical examples explore the impacts of different interface representations on accuracy for both smooth and sharp boundaries, particularly flows interacting with fixed interface configurations. We demonstrate that using a DG approach provides accuracy that is comparable to the CG method for smooth cases. Further, we identify a time step size restriction for the CG representation that is directly related to the sharpness of the geometry. In contrast, time step size restrictions imposed by DG representations are demonstrated to be insensitive to the presence of sharp features.
We present a computational formulation for the approximate version of several variational inequality problems, investigating their computational complexity and establishing PPAD-completeness. Examining applications in computational game theory, we specifically focus on two key concepts: resilient Nash equilibrium, and multi-leader-follower games -- domains traditionally known for the absence of general solutions. In the presence of standard assumptions and relaxation techniques, we formulate problem versions for such games that are expressible in terms of variational inequalities, ultimately leading to proofs of PPAD-completeness.
The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers are supporting the relentless progress of these architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two computational plasma physics codes relevant to magnetically confined nuclear fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools for achieving performance portability and decent "out-of-the-box" performance, even for the very latest hardware platforms. For our applications, Kokkos provided performance portability to the broadest range of hardware architectures from different vendors.
Neural machine translation (NMT) is a deep-learning-based approach to machine translation that yields state-of-the-art translation performance in scenarios where large-scale parallel corpora are available. Although high-quality, domain-specific translation is crucial in the real world, domain-specific corpora are usually scarce or nonexistent, and thus vanilla NMT performs poorly in such scenarios. Domain adaptation, which leverages both out-of-domain parallel corpora and monolingual corpora for in-domain translation, is therefore very important for domain-specific translation. In this paper, we give a comprehensive survey of state-of-the-art domain adaptation techniques for NMT.