Two typical fixed-length random number generation problems in information theory are considered for general sources. One is the source resolvability problem and the other is the intrinsic randomness problem. In each of these problems, the optimum achievable rate with respect to the given approximation measure is one of our main concerns and has been characterized using two different information quantities: the information spectrum and the smooth R\'enyi entropy. Recently, optimum achievable rates with respect to $f$-divergences have been characterized using the information spectrum quantity. The $f$-divergence is a general non-negative measure between two probability distributions on the basis of a convex function $f$. The class of f-divergences includes several important measures such as the variational distance, the KL divergence, the Hellinger distance and so on. Hence, it is meaningful to consider the random number generation problems with respect to $f$-divergences. However, optimum achievable rates with respect to $f$-divergences using the smooth R\'enyi entropy have not been clarified yet in both of two problems. In this paper we try to analyze the optimum achievable rates using the smooth R\'enyi entropy and to extend the class of $f$-divergence. To do so, we first derive general formulas of the first-order optimum achievable rates with respect to $f$-divergences in both problems under the same conditions as imposed by previous studies. Next, we relax the conditions on $f$-divergence and generalize the obtained general formulas. Then, we particularize our general formulas to several specified functions $f$. As a result, we reveal that it is easy to derive optimum achievable rates for several important measures from our general formulas. Furthermore, a kind of duality between the resolvability and the intrinsic randomness is revealed in terms of the smooth R\'enyi entropy.
In this work we establish local limit theorems for q-multinomial and multiple Heine distributions. Specifically, the pointwise convergence of the q-multinomial distribution of the first kind, as well as for its discrete limit, the multiple Heine distribution, to a multivariate Stieltjes-Wigert type distribution, are provided.
Preconditioning is essential in iterative methods for solving linear systems. It is also the implicit objective in updating approximations of Jacobians in optimization methods, e.g., in quasi-Newton methods. Motivated by the latter, we study a nonclassic matrix condition number, the $\omega$-condition number. We do this in the context of optimal conditioning for: (i) our application to low rank updating of generalized Jacobians; (ii) iterative methods for linear systems: (iia) clustering of eigenvalues and (iib) convergence rates. For a positive definite matrix, the $\omega$-condition measure is the ratio of the arithmetic and geometric means of the eigenvalues. In particular, our applications concentrate on linear systems with low rank updates of ill-conditioned positive definite matrices. These systems arise in the context of nonsmooth Newton methods using generalized Jacobians. We are able to use optimality conditions and derive explicit formulae for $\omega$-optimal preconditioners and preconditioned updates. Connections to partial Cholesky sparse preconditioners are made. Evaluating or estimating the classical condition number $\kappa$ can be expensive. We show that the $\omega$-condition number can be evaluated explicitly following a Cholesky or LU factorization. Moreover, the simplicity of $\omega$ allows for the derivation of formulae for optimal preconditioning in various scenarios, i.e., this avoids the need for expensive algorithmic calculations. Our empirics show that $\omega$ estimates the actual condition of a linear system significantly better. Moreover, our empirical results show a significant decrease in the number of iterations required for a requested accuracy in the residual during an iterative method, i.e., these results confirm the efficacy of using the $\omega$-condition number compared to the classical condition number.
In many contexts involving ranked preferences, agents submit partial orders over available alternatives. Statistical models often treat these as marginal in the space of total orders, but this approach overlooks information contained in the list length itself. In this work, we introduce and taxonomize approaches for jointly modeling distributions over top-$k$ partial orders and list lengths $k$, considering two classes of approaches: composite models that view a partial order as a truncation of a total order, and augmented ranking models that model the construction of the list as a sequence of choice decisions, including the decision to stop. For composite models, we consider three dependency structures for joint modeling of order and truncation length. For augmented ranking models, we consider different assumptions on how the stop-token choice is modeled. Using data consisting of partial rankings from San Francisco school choice and San Francisco ranked choice elections, we evaluate how well the models predict observed data and generate realistic synthetic datasets. We find that composite models, explicitly modeling length as a categorical variable, produce synthetic datasets with accurate length distributions, and an augmented model with position-dependent item utilities jointly models length and preferences in the training data best, as measured by negative log loss. Methods from this work have significant implications on the simulation and evaluation of real-world social systems that solicit ranked preferences.
Graphs serve as generic tools to encode the underlying relational structure of data. Often this graph is not given, and so the task of inferring it from nodal observations becomes important. Traditional approaches formulate a convex inverse problem with a smoothness promoting objective and rely on iterative methods to obtain a solution. In supervised settings where graph labels are available, one can unroll and truncate these iterations into a deep network that is trained end-to-end. Such a network is parameter efficient and inherits inductive bias from the optimization formulation, an appealing aspect for data constrained settings in, e.g., medicine, finance, and the natural sciences. But typically such settings care equally about uncertainty over edge predictions, not just point estimates. Here we introduce novel iterations with independently interpretable parameters, i.e., parameters whose values - independent of other parameters' settings - proportionally influence characteristics of the estimated graph, such as edge sparsity. After unrolling these iterations, prior knowledge over such graph characteristics shape prior distributions over these independently interpretable network parameters to yield a Bayesian neural network (BNN) capable of graph structure learning (GSL) from smooth signal observations. Fast execution and parameter efficiency allow for high-fidelity posterior approximation via Markov Chain Monte Carlo (MCMC) and thus uncertainty quantification on edge predictions. Synthetic and real data experiments corroborate this model's ability to provide well-calibrated estimates of uncertainty, in test cases that include unveiling economic sector modular structure from S$\&$P$500$ data and recovering pairwise digit similarities from MNIST images. Overall, this framework enables GSL in modest-scale applications where uncertainty on the data structure is paramount.
Game theory relies heavily on the availability of cardinal utility functions, but in fields such as matching markets, only ordinal preferences are typically elicited. The literature focuses on mechanisms with simple dominant strategies, but many real-world applications lack dominant strategies, making the intensity of preferences between outcomes important for determining strategies. Even though precise information about cardinal utilities is not available, some data about the likelihood of utility functions is often accessible. We propose to use Bayesian games to formalize uncertainty about the decision-makers' utilities by viewing them as a collection of normal-form games. Instead of searching for the Bayes-Nash equilibrium, we study how uncertainty in utilities is reflected in uncertainty of strategic play. To do this, we introduce a novel solution concept called $\alpha$-Rank-collections, which extends $\alpha$-Rank to Bayesian games. This allows us to analyze strategic play in, for example, non-strategyproof matching markets, for which appropriate solution concepts are currently lacking. $\alpha$-Rank-collections characterize the expected probability of encountering a certain strategy profile under replicator dynamics in the long run, rather than predicting a specific equilibrium strategy profile. We experimentally evaluate $\alpha$-Rank-collections using instances of the Boston mechanism, finding that our solution concept provides more nuanced predictions compared to Bayes-Nash equilibria. Additionally, we prove that $\alpha$-Rank-collections are invariant to positive affine transformations, a standard property for a solution concept, and are efficient to approximate.
We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p \in (1, \infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p \in [2, \infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.
Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. Benchmark methods raise a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we determine to capture the true causality between the discriminative knowledge of predominant modality and predictive label while considering the auxiliary modality. Thus, we introduce the $\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method.
The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ and the attention scores over cached KV pairs, where a low $L_2$ of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
The Johnson-Lindenstrauss (JL) Lemma introduced the concept of dimension reduction via a random linear map, which has become a fundamental technique in many computational settings. For a set of $n$ points in $\mathbb{R}^d$ and any fixed $\epsilon>0$, it reduces the dimension $d$ to $O(\log n)$ while preserving, with high probability, all the pairwise Euclidean distances within factor $1+\epsilon$. Perhaps surprisingly, the target dimension can be lower if one only wishes to preserve the optimal value of a certain problem on the pointset, e.g., Euclidean max-cut or $k$-means. However, for some notorious problems, like diameter (aka furthest pair), dimension reduction via the JL map to below $O(\log n)$ does not preserve the optimal value within factor $1+\epsilon$. We propose to focus on another regime, of \emph{moderate dimension reduction}, where a problem's value is preserved within factor $\alpha>1$ using target dimension $\tfrac{\log n}{poly(\alpha)}$. We establish the viability of this approach and show that the famous $k$-center problem is $\alpha$-approximated when reducing to dimension $O(\tfrac{\log n}{\alpha^2}+\log k)$. Along the way, we address the diameter problem via the special case $k=1$. Our result extends to several important variants of $k$-center (with outliers, capacities, or fairness constraints), and the bound improves further with the input's doubling dimension. While our $poly(\alpha)$-factor improvement in the dimension may seem small, it actually has significant implications for streaming algorithms, and easily yields an algorithm for $k$-center in dynamic geometric streams, that achieves $O(\alpha)$-approximation using space $poly(kdn^{1/\alpha^2})$. This is the first algorithm to beat $O(n)$ space in high dimension $d$, as all previous algorithms require space at least $\exp(d)$. Furthermore, it extends to the $k$-center variants mentioned above.
Imaging is the process of transforming noisy, incomplete data into a space that humans can interpret. NIFTy is a Bayesian framework for imaging and has already successfully been applied to many fields in astrophysics. Previous design decisions held the performance and the development of methods in NIFTy back. We present a rewrite of NIFTy, coined NIFTy.re, which reworks the modeling principle, extends the inference strategies, and outsources much of the heavy lifting to JAX. The rewrite dramatically accelerates models written in NIFTy, lays the foundation for new types of inference machineries, improves maintainability, and enables interoperability between NIFTy and the JAX machine learning ecosystem.