In this note we give proofs of results concerning the Instrumental Variable (IV) model with a binary response $Y$ and a binary treatment $X$, but with an instrument $Z$ taking $K$ states. These results were originally stated in Richardson & Robins (2014), "ACE Bounds; SEMs with Equilibrium Conditions," arXiv:1410.0470.
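For context, a minimal statement of the estimand such IV bounds concern; the potential-outcome notation $Y(x)$ is an assumption chosen to match standard usage, not quoted from the note itself:

```latex
% Average causal effect (ACE) of the binary treatment X on the binary
% outcome Y, written with potential outcomes Y(x):
\[
  \mathrm{ACE}(X \to Y) \;=\; P\{Y(1)=1\} \;-\; P\{Y(0)=1\}.
\]
% Under the IV assumptions (the K-state instrument Z affects Y only
% through X, and Z is independent of the unobserved confounders), the
% ACE is only partially identified: the observed law of (Z, X, Y)
% yields upper and lower bounds rather than a point value.
```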
Sequential testing, always-valid $p$-values, and confidence sequences promise flexible statistical inference and on-the-fly decision making. However, unlike fixed-$n$ inference based on asymptotic normality, existing sequential tests either make parametric assumptions and end up under-covering/over-rejecting when these fail, or use non-parametric but conservative concentration inequalities and end up over-covering/under-rejecting. To circumvent these issues, we sidestep exact at-least-$\alpha$ coverage and focus on asymptotic calibration and asymptotic optimality. That is, we seek sequential tests whose probability of \emph{ever} rejecting a true hypothesis approaches $\alpha$ and whose expected time to reject a false hypothesis approaches a lower bound over all such asymptotically calibrated tests, both "approaches" occurring under an appropriate limit. We permit observations to be both non-parametric and dependent and focus on testing whether the observations form a martingale difference sequence. We propose the universal sequential probability ratio test (uSPRT), a slight modification to the normal-mixture sequential probability ratio test, where we add a burn-in period and adjust thresholds accordingly. We show that even in this very general setting, the uSPRT is asymptotically optimal under mild generic conditions. We apply the results to stabilized estimating equations to test means, treatment effects, etc. Our results also provide corresponding guarantees for the implied confidence sequences. Numerical simulations verify our guarantees and the benefits of the uSPRT over alternatives.
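As a concrete illustration of the kind of test being modified, here is a minimal sketch of a normal-mixture sequential test with a burn-in period. The statistic is Robbins' normal-mixture martingale; the burn-in length, mixture parameter `rho`, and threshold $1/\alpha$ below are illustrative defaults, not the paper's tuned choices:

```python
import numpy as np

def usprt_reject_time(x, alpha=0.05, rho=1.0, burn_in=50):
    """Sketch of a normal-mixture sequential test with burn-in.

    Tracks Robbins' normal-mixture martingale
        M_t = sqrt(rho / (V_t + rho)) * exp(S_t^2 / (2 (V_t + rho))),
    where S_t is the running sum and V_t the running sum of squares
    (a variance proxy), and rejects the martingale-difference null the
    first time M_t >= 1/alpha after the burn-in period.
    """
    s, v = 0.0, 0.0
    for t, xt in enumerate(x, start=1):
        s += xt
        v += xt * xt
        m = np.sqrt(rho / (v + rho)) * np.exp(s * s / (2.0 * (v + rho)))
        if t > burn_in and m >= 1.0 / alpha:
            return t          # rejection time
    return None               # never rejected

# Under the null (mean-zero noise) rejections should be rare; under a
# shifted mean the test should stop early.
rng = np.random.default_rng(0)
print(usprt_reject_time(rng.normal(0.0, 1.0, 10_000)))   # typically None
print(usprt_reject_time(rng.normal(0.3, 1.0, 10_000)))   # typically small
```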
In general, diffusion model-based MRI reconstruction methods incrementally remove artificially added noise while imposing data consistency to reconstruct the underlying images. However, real-world MRI acquisitions already contain inherent noise due to thermal fluctuations. This phenomenon is particularly notable when using ultra-fast, high-resolution imaging sequences for advanced research, or low-field systems favored by low- and middle-income countries. These common scenarios can lead to sub-optimal performance or complete failure of existing diffusion model-based reconstruction techniques. Specifically, as the artificially added noise is gradually removed, the inherent MRI noise becomes increasingly pronounced, making the actual noise level inconsistent with the predefined denoising schedule and consequently leading to inaccurate image reconstruction. To tackle this problem, we propose a posterior sampling strategy with a novel NoIse Level Adaptive Data Consistency (Nila-DC) operation. Extensive experiments are conducted on two public datasets and an in-house clinical dataset with field strengths ranging from 0.3T to 3T, showing that our method surpasses state-of-the-art MRI reconstruction methods and is highly robust against various noise levels. The code will be released after review.
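A toy numerical illustration of the mismatch described above; the additive-Gaussian model and the linear schedule are assumptions for illustration only, not the paper's Nila-DC operation:

```python
import numpy as np

# Diffusion schedule: artificial noise sigma_t shrinks to 0 as sampling ends.
sigma_schedule = np.linspace(1.0, 0.0, 11)   # assumed linear schedule
sigma_inherent = 0.15                        # assumed inherent (thermal) noise

# If the measurement already carries inherent noise, the *effective* noise
# at each step is sqrt(sigma_t^2 + sigma_inherent^2), not the scheduled
# sigma_t; the relative mismatch blows up precisely as sigma_t -> 0, which
# is when data-consistency steps using the scheduled level misbehave.
sigma_effective = np.sqrt(sigma_schedule**2 + sigma_inherent**2)

for s_t, s_eff in zip(sigma_schedule, sigma_effective):
    print(f"scheduled {s_t:.2f}  effective {s_eff:.2f}")
```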
We give new data-dependent locality-sensitive hashing (LSH) schemes for the Earth Mover's Distance ($\mathsf{EMD}$), and as a result, improve the best approximation for nearest neighbor search under $\mathsf{EMD}$ by a quadratic factor. Here, the metric $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ consists of sets of $s$ vectors in $\mathbb{R}^d$, and for any two sets $x,y$ of $s$ vectors, the distance $\mathsf{EMD}(x,y)$ is the minimum cost of a perfect matching between $x$ and $y$, where the cost of matching two vectors is their $\ell_p$ distance. Previously, Andoni, Indyk, and Krauthgamer gave a (data-independent) locality-sensitive hashing scheme for $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ for $p \in [1,2]$ with approximation $O(\log^2 s)$. By being data-dependent, we improve the approximation to $\tilde{O}(\log s)$. Our main technical contribution is to show that for any distribution $\mu$ supported on the metric $\mathsf{EMD}_s(\mathbb{R}^d, \ell_p)$, there exists a data-dependent LSH for dense regions of $\mu$ which achieves approximation $\tilde{O}(\log s)$, and that the data-independent LSH actually achieves a $\tilde{O}(\log s)$-approximation outside of those dense regions. Finally, we show how to "glue" together these two hashing schemes without any additional loss in the approximation. Beyond nearest neighbor search, our data-dependent LSH also gives optimal (distributional) sketches for the Earth Mover's Distance. By known sketching lower bounds, this implies that our LSH is optimal (up to $\mathrm{poly}(\log \log s)$ factors) among those that collide close points with constant probability.
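Since the abstract defines $\mathsf{EMD}$ as a minimum-cost perfect matching, a brute-force reference implementation is straightforward. This sketch computes the metric itself (in $O(s^3)$ time via the Hungarian method) and is unrelated to the paper's hashing scheme:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(x, y, p=2):
    """EMD between two size-s point sets, exactly as defined above: the
    minimum cost of a perfect matching, where matching two vectors costs
    their l_p distance."""
    cost = cdist(x, y, metric="minkowski", p=p)   # s x s matrix of l_p distances
    rows, cols = linear_sum_assignment(cost)      # min-cost perfect matching
    return cost[rows, cols].sum()

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
print(emd(x, y))
```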
Indexing a set of strings for prefix search or membership queries is a fundamental task with many applications, such as information retrieval or database systems. A classic abstract data type for modelling such an index is a trie. Due to the fundamental nature of this problem, it has sparked much interest, leading to a variety of trie implementations with different characteristics. A trie implementation that has been widely used in practice is the double-array (trie), consisting of merely two integer arrays. While a traversal takes constant time per node visit, the space consumption in computer words can be as large as the product of the number of nodes and the alphabet size. Although several heuristics have been proposed for lowering the space requirements, we are unaware of any theoretical guarantees. In this paper, we study the decision problem of whether there exists a double-array of a given size. To this end, we first draw a connection to the sparse matrix compression problem, which makes our problem NP-complete for alphabet sizes linear in the number of nodes. We further propose a reduction from the restricted directed Hamiltonian path problem, leading to NP-completeness even for logarithmic-sized alphabets.
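For readers unfamiliar with the structure, here is a textbook sketch of traversal over the two integer arrays (conventionally called BASE and CHECK); the array contents are assumed to have been built elsewhere, and characters are assumed to be small integer codes:

```python
def traverse(base, check, s, c):
    """One step in a double-array trie: from state s, follow the edge
    labelled c. The candidate child slot is base[s] + c; it is a real
    child iff check records s as its parent. Constant time per visit."""
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t                    # valid transition: t is the child state
    return None                     # no edge labelled c from s

def lookup(base, check, word, root=0):
    """Prefix/membership walk: follow each character code in turn;
    returning None means the string is not in the trie."""
    s = root
    for c in word:
        s = traverse(base, check, s, c)
        if s is None:
            return None
    return s
```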
Let $\mathcal{P}$ be a simple polygon with $m$ vertices and let $P$ be a set of $n$ points inside $\mathcal{P}$. We prove that there exists, for any $\varepsilon>0$, a set $\mathcal{C} \subset P$ of size $O(1/\varepsilon^2)$ such that the following holds: for any query point $q$ inside the polygon $\mathcal{P}$, the geodesic distance from $q$ to its furthest neighbor in $\mathcal{C}$ is at least $1-\varepsilon$ times the geodesic distance to its furthest neighbor in $P$. Thus the set $\mathcal{C}$ can be used for answering $\varepsilon$-approximate furthest-neighbor queries with a data structure whose storage requirement is independent of the size of $P$. The coreset can be constructed in $O\left(\frac{1}{\varepsilon} \left( n\log(1/\varepsilon) + (n+m)\log(n+m)\right) \right)$ time.
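A hypothetical sketch of how such a coreset answers a query: report the point of $\mathcal{C}$ furthest from $q$, whose distance is at least $(1-\varepsilon)$ times the true furthest distance in $P$. Euclidean distance stands in here for the geodesic metric, which would require shortest-path machinery inside the polygon:

```python
import numpy as np

def approx_furthest(q, coreset, dist):
    """Answer an eps-approximate furthest-neighbor query using only the
    coreset C (storage independent of |P|): return the point of C that
    maximizes dist(q, .). `dist` is a pluggable metric."""
    dists = [dist(q, c) for c in coreset]
    i = int(np.argmax(dists))
    return coreset[i], dists[i]

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
coreset = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(approx_furthest((0.9, 0.9), coreset, euclid))
```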
Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.
Consider a binary statistical hypothesis testing problem, where $n$ independent and identically distributed random variables $Z^n$ are either distributed according to the null hypothesis $P$ or the alternative hypothesis $Q$, and only $P$ is known. A well-known test that is suitable for this case is the so-called Hoeffding test, which accepts $P$ if the Kullback-Leibler (KL) divergence between the empirical distribution of $Z^n$ and $P$ is below some threshold. This work characterizes the first- and second-order terms of the type-II error probability, for a fixed type-I error probability, for the Hoeffding test as well as for divergence tests, where the KL divergence is replaced by a general divergence. It is demonstrated that, irrespective of the divergence, divergence tests achieve the first-order term of the Neyman-Pearson test, which is the optimal test when both $P$ and $Q$ are known. In contrast, the second-order term of divergence tests is strictly worse than that of the Neyman-Pearson test. It is further demonstrated that divergence tests with an invariant divergence achieve the same second-order term as the Hoeffding test, but divergence tests with a non-invariant divergence may outperform the Hoeffding test for some alternative hypotheses $Q$. Potentially, this behavior could be exploited by a composite hypothesis test with partial knowledge of the alternative hypothesis $Q$ by tailoring the divergence of the divergence test to the set of possible alternative hypotheses.
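The Hoeffding test itself is simple to state in code. A minimal sketch over a finite alphabet, with an illustrative (uncalibrated) threshold; in practice the threshold is chosen to meet the fixed type-I error probability:

```python
import numpy as np

def hoeffding_test(z, p, threshold):
    """Hoeffding test as described above: accept the null P iff the KL
    divergence between the empirical distribution of the sample z and P
    is below the threshold. Alphabet is {0, ..., len(p)-1}."""
    counts = np.bincount(z, minlength=len(p))
    q_hat = counts / counts.sum()                 # empirical distribution
    mask = q_hat > 0                              # convention: 0*log(0/p) = 0
    kl = np.sum(q_hat[mask] * np.log(q_hat[mask] / p[mask]))
    return kl < threshold                         # True: accept P

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
z_null = rng.choice(3, size=1000, p=p)                    # drawn from P
z_alt = rng.choice(3, size=1000, p=[0.2, 0.3, 0.5])       # drawn from some Q
print(hoeffding_test(z_null, p, threshold=0.01))          # typically True
print(hoeffding_test(z_alt, p, threshold=0.01))           # typically False
```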
I examine a conceptual model of a recommendation system (RS) with user inflow and churn dynamics. When inflow and churn balance out, the user distribution reaches a steady state. Changing the recommendation algorithm alters the steady state and creates a transition period, during which the RS behaves differently from its new steady state. In particular, A/B experiment metrics obtained during transition periods are biased indicators of the RS's long-term performance. Scholars and practitioners, however, often conduct A/B tests shortly after introducing new algorithms to validate their effectiveness. This A/B experiment paradigm, widely regarded as the gold standard for assessing RS improvements, may consequently yield false conclusions. I also briefly discuss the data bias caused by user retention dynamics.
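A toy simulation makes the transition bias concrete; all rates below are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

def simulate(inflow, churn, periods, n0=0.0):
    """Each period, `inflow` new users arrive and a fraction `churn` of
    existing users leave; returns the user-count trajectory. The steady
    state, where inflow and churn balance, is n* = inflow / churn."""
    n, sizes = n0, []
    for _ in range(periods):
        n = n * (1.0 - churn) + inflow
        sizes.append(n)
    return np.array(sizes)

old = simulate(inflow=100, churn=0.10, periods=200)   # settles near 1000
# A new algorithm that lowers churn shifts the steady state to 1250;
# metrics read mid-transition differ from the new long-run values.
new = simulate(inflow=100, churn=0.08, periods=200, n0=old[-1])
print(old[-1], new[5], new[-1])   # ~1000, mid-transition value, ~1250
```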
The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as coresets and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically, the number of bits required to compress the cost function within a multiplicative error $\varepsilon$, remains unclear in the existing literature. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound result for $(k, z)$-Clustering establishes a tight space bound of $\Theta( n d )$ for terminal embedding, where $n$ represents the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may hold independent interest.
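For reference, the standard formulation of the cost function whose compressibility is studied (the paper's exact normalization may differ):

```latex
% (k, z)-Clustering: given a dataset X in R^d, find a set C of k centers
% minimizing the sum of z-th powers of distances to the nearest center;
% z = 1 is k-median, z = 2 is k-means.
\[
  \mathrm{cost}_z(X, C) \;=\; \sum_{x \in X} \, \min_{c \in C} \, \lVert x - c \rVert^{z},
  \qquad |C| = k .
\]
% A compression scheme must allow recovering cost_z(X, C) up to a
% multiplicative (1 \pm \varepsilon) factor for every candidate center set C.
```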
This paper presents an algorithmic method that, given a positive integer $j$, generates the $j$-th convergence stair, containing all natural numbers for which the Collatz conjecture is verified by exactly $j$ applications of the Collatz function. To this end, we present a novel formulation of the Collatz conjecture as a concurrent program, and provide the general-case specification of the $j$-th convergence stair for any $j > 0$. The proposed specifications provide a layered and linearized orientation of Collatz numbers, organized in an infinite set of infinite binary trees. To the best of our knowledge, this is the first time that such a general specification has been provided, which can have significant applications in analyzing and testing the behaviors of complex non-linear systems. We have implemented this method as a software tool that generates the Collatz numbers of individual stairs. We also show that, starting from any value in any convergence stair, the conjecture holds. However, to prove the conjecture, one has to show that every natural number appears in some stair; i.e., that the union of all stairs equals the set of natural numbers, which remains an open problem.
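A small sketch of the stair notion: stair $j$ collects the natural numbers whose trajectory reaches $1$ after exactly $j$ applications of the Collatz function. The plain map ($n/2$ if even, $3n+1$ if odd) is assumed here; the paper's exact variant may differ, which would shift the indices:

```python
from collections import defaultdict

def stair_index(n):
    """Number of applications of the (plain) Collatz function needed to
    reach 1, i.e. the index j of the convergence stair containing n."""
    j = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        j += 1
    return j

# Group the first few naturals by stair: stair j holds every n whose
# trajectory reaches 1 in exactly j steps.
stairs = defaultdict(list)
for n in range(1, 33):
    stairs[stair_index(n)].append(n)
print(dict(sorted(stairs.items())))   # e.g. stair 0: [1], stair 1: [2], ...
```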