Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
We present a framework for speeding up the time it takes to sample from discrete distributions $\mu$ defined over subsets of size $k$ of a ground set of $n$ elements, in the regime $k\ll n$. We show that having estimates of marginals $\mathbb{P}_{S\sim \mu}[i\in S]$, the task of sampling from $\mu$ can be reduced to sampling from distributions $\nu$ supported on size $k$ subsets of a ground set of only $n^{1-\alpha}\cdot \operatorname{poly}(k)$ elements. Here, $1/\alpha\in [1, k]$ is the parameter of entropic independence for $\mu$. Further, the sparsified distributions $\nu$ are obtained by applying a sparse (mostly $0$) external field to $\mu$, an operation that often retains algorithmic tractability of sampling from $\nu$. This phenomenon, which we dub domain sparsification, allows us to pay a one-time cost of estimating the marginals of $\mu$, and in return reduce the amortized cost needed to produce many samples from the distribution $\mu$, as is often needed in upstream tasks such as counting and inference. For a wide range of distributions where $\alpha=\Omega(1)$, our result reduces the domain size, and as a corollary, the cost-per-sample, by a $\operatorname{poly}(n)$ factor. Examples include monomers in a monomer-dimer system, non-symmetric determinantal point processes, and partition-constrained Strongly Rayleigh measures. Our work significantly extends the reach of prior work of Anari and Derezi\'nski who obtained domain sparsification for distributions with a log-concave generating polynomial (corresponding to $\alpha=1$). As a corollary of our new analysis techniques, we also obtain a less stringent requirement on the accuracy of marginal estimates even for the case of log-concave polynomials; roughly speaking, we show that constant-factor approximation is enough for domain sparsification, improving over $O(1/k)$ relative error established in prior work.
In this work, we study an inverse problem of recovering a space-time dependent diffusion coefficient in the subdiffusion model from the distributed observation, where the mathematical model involves a Djrbashian-Caputo fractional derivative of order $\alpha\in(0,1)$ in time. The main technical challenges of both theoretical and numerical analysis lie in the limited smoothing properties due to the fractional differential operator and the high degree of nonlinearity of the forward map from the unknown diffusion coefficient to the distributed observation. Theoretically, we establish two conditional stability results using a novel test function, which leads to a stability bound in $L^2(0,T;L^2(\Omega))$ under a suitable positivity condition. The positivity condition is verified for a large class of problem data. Numerically, we develop a rigorous procedure for the recovery of the diffusion coefficient based on a regularized least-squares formulation, which is then discretized by the standard Galerkin method with continuous piecewise linear elements in space and backward Euler convolution quadrature in time. We provide a complete error analysis of the fully discrete formulation, by combining several new error estimates for the direct problem (optimal in terms of data regularity), a discrete version of fractional maximal $L^p$ regularity, and a nonstandard energy argument. Under the positivity condition, we obtain a standard $L^2(0,T; L^2(\Omega))$ error estimate consistent with the conditional stability. Further, we illustrate the analysis with some numerical examples.
Denoising diffusion probabilistic models (DDPMs) have emerged as competitive generative models yet brought challenges to efficient sampling. In this paper, we propose novel bilateral denoising diffusion models (BDDMs), which take significantly fewer steps to generate high-quality samples. From a bilateral modeling objective, BDDMs parameterize the forward and reverse processes with a score network and a scheduling network, respectively. We show that a new lower bound tighter than the standard evidence lower bound can be derived as a surrogate objective for training the two networks. In particular, BDDMs are efficient, simple-to-train, and capable of further improving any pre-trained DDPM by optimizing the inference noise schedules. Our experiments demonstrated that BDDMs can generate high-fidelity samples with as few as 3 sampling steps and produce comparable or even higher quality samples than DDPMs using 1000 steps with only 16 sampling steps (a 62x speedup).
Edge-exchangeable probabilistic network models generate edges as an i.i.d.~sequence from a discrete measure, providing a simple means for statistical inference of latent network properties. The measure is often constructed using the self-product of a realization from a Bayesian nonparametric (BNP) discrete prior; but unlike in standard BNP models, the self-product measure prior is not conjugate the likelihood, hindering the development of exact simulation and inference algorithms. Approximation via finite truncation of the discrete measure is a straightforward alternative, but incurs an unknown approximation error. In this paper, we develop methods for forward simulation and posterior inference in random self-product-measure models based on truncation, and provide theoretical guarantees on the quality of the results as a function of the truncation level. The techniques we present are general and extend to the broader class of discrete Bayesian nonparametric models.
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
In this paper, we propose Latent Relation Language Models (LRLMs), a class of language models that parameterizes the joint distribution over the words in a document and the entities that occur therein via knowledge graph relations. This model has a number of attractive properties: it not only improves language modeling performance, but is also able to annotate the posterior probability of entity spans for a given text through relations. Experiments demonstrate empirical improvements over both a word-based baseline language model and a previous approach that incorporates knowledge graph information. Qualitative analysis further demonstrates the proposed model's ability to learn to predict appropriate relations in context.
This paper focuses on the discrimination capacity of aggregation functions: these are the permutation invariant functions used by graph neural networks to combine the features of nodes. Realizing that the most powerful aggregation functions suffer from a dimensionality curse, we consider a restricted setting. In particular, we show that the standard sum and a novel histogram-based function have the capacity to discriminate between any fixed number of inputs chosen by an adversary. Based on our insights, we design a graph neural network aiming, not to maximize discrimination capacity, but to learn discriminative graph representations that generalize well. Our empirical evaluation provides evidence that our choices can yield benefits to the problem of structural graph classification.
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
Recurrent models for sequences have been recently successful at many tasks, especially for language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, it is not apparent in current language models. We propose to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients though this discrete representation we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze latent codes produced by the model showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.