We introduce a divergence measure between data distributions based on operators in reproducing kernel Hilbert spaces defined by infinitely divisible kernels. The empirical estimator of the divergence is computed using the eigenvalues of positive definite matrices that are obtained by evaluating the kernel over pairs of samples. The new measure shares similar properties to Jensen-Shannon divergence. Convergence of the proposed estimators follows from concentration results based on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. The proposed measure of divergence avoids the estimation of the probability distribution underlying the data. Numerical experiments involving comparing distributions and applications to sampling unbalanced data for classification show that the proposed divergence can achieve state of the art results.
The foremost challenge to causal inference with real-world data is to handle the imbalance in the covariates with respect to different treatment options, caused by treatment selection bias. To address this issue, recent literature has explored domain-invariant representation learning based on different domain divergence metrics (e.g., Wasserstein distance, maximum mean discrepancy, position-dependent metric, and domain overlap). In this paper, we reveal the weaknesses of these strategies, i.e., they lead to the loss of predictive information when enforcing the domain invariance; and the treatment effect estimation performance is unstable, which heavily relies on the characteristics of the domain distributions and the choice of domain divergence metrics. Motivated by information theory, we propose to learn the Infomax and Domain-Independent Representations to solve the above puzzles. Our method utilizes the mutual information between the global feature representations and individual feature representations, and the mutual information between feature representations and treatment assignment predictions, in order to maximally capture the common predictive information for both treatment and control groups. Moreover, our method filters out the influence of instrumental and irrelevant variables, and thus it effectively increases the predictive ability of potential outcomes. Experimental results on both the synthetic and real-world datasets show that our method achieves state-of-the-art performance on causal effect inference. Moreover, our method exhibits reliable prediction performances when facing data with different characteristics of data distributions, complicated variable types, and severe covariate imbalance.
By calculating the Kullback-Leibler divergence between two probability measures belonging to different exponential families, we end up with a formula that generalizes the ordinary Fenchel-Young divergence which is recovered in the special case when we let the two exponential families coincide. Inspired by this formula, we define the duo Fenchel-Young divergence and reports a dominance condition on its pair of generators which guarantees that it is always non-negative.
Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.
We consider the problem of estimating the autocorrelation operator of an autoregressive Hilbertian process. By means of a Tikhonov approach, we establish a general result that yields the convergence rate of the estimated autocorrelation operator as a function of the rate of convergence of the estimated lag zero and lag one autocovariance operators. The result is general in that it can accommodate any consistent estimators of the lagged autocovariances. Consequently it can be applied to processes under any mode of observation: complete, discrete, sparse, and/or with measurement errors. An appealing feature is that the result does not require delicate spectral decay assumptions on the autocovariances but instead rests on natural source conditions. The result is illustrated by application to important special cases.
We present an efficient basis for imaginary time Green's functions based on a low rank decomposition of the spectral Lehmann representation. The basis functions are simply a set of well-chosen exponentials, so the corresponding expansion may be thought of as a discrete form of the Lehmann representation using an effective spectral density which is a sum of $\delta$ functions. The basis is determined only by an upper bound on the product $\beta \omega_{\max}$, with $\beta$ the inverse temperature and $\omega_{\max}$ an energy cutoff, and a user-defined error tolerance $\epsilon$. The number $r$ of basis functions scales as $\mathcal{O}\left(\log(\beta \omega_{\max}) \log (1/\epsilon)\right)$. The discrete Lehmann representation of a particular imaginary time Green's function can be recovered by interpolation at a set of $r$ imaginary time nodes. Both the basis functions and the interpolation nodes can be obtained rapidly using standard numerical linear algebra routines. Due to the simple form of the basis, the discrete Lehmann representation of a Green's function can be explicitly transformed to the Matsubara frequency domain, or obtained directly by interpolation on a Matsubara frequency grid. We benchmark the efficiency of the representation on simple cases, and with a high precision solution of the Sachdev-Ye-Kitaev equation at low temperature. We compare our approach with the related intermediate representation method, and introduce an improved algorithm to build the intermediate representation basis and a corresponding sampling grid.
Evaluating the quality of learned representations without relying on a downstream task remains one of the challenges in representation learning. In this work, we present Geometric Component Analysis (GeomCA) algorithm that evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrate its applicability by analyzing representations obtained from a variety of scenarios, such as contrastive learning models, generative models and supervised learning models.
To learn intrinsic low-dimensional structures from high-dimensional data that most discriminate between classes, we propose the principle of Maximal Coding Rate Reduction ($\text{MCR}^2$), an information-theoretic measure that maximizes the coding rate difference between the whole dataset and the sum of each individual class. We clarify its relationships with most existing frameworks such as cross-entropy, information bottleneck, information gain, contractive and contrastive learning, and provide theoretical guarantees for learning diverse and discriminative features. The coding rate can be accurately computed from finite samples of degenerate subspace-like distributions and can learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner. Empirically, the representations learned using this principle alone are significantly more robust to label corruptions in classification than those using cross-entropy, and can lead to state-of-the-art results in clustering mixed data from self-learned invariant features.
Neural models of dialog rely on generalized latent representations of language. This paper introduces a novel training procedure which explicitly learns multiple representations of language at several levels of granularity. The multi-granularity training algorithm modifies the mechanism by which negative candidate responses are sampled in order to control the granularity of learned latent representations. Strong performance gains are observed on the next utterance retrieval task using both the MultiWOZ dataset and the Ubuntu dialog corpus. Analysis significantly demonstrates that multiple granularities of representation are being learned, and that multi-granularity training facilitates better transfer to downstream tasks.
Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.