Many data-science problems can be formulated as an inverse problem, where the parameters are estimated by minimizing a proper loss function. When complicated black-box models are involved, derivative-free optimization tools are often needed. The ensemble Kalman filter (EnKF) is a particle-based derivative-free Bayesian algorithm originally designed for data assimilation. Recently, it has been applied to inverse problems for computational efficiency. The resulting algorithm, known as ensemble Kalman inversion (EKI), involves running an ensemble of particles with EnKF update rules so they can converge to a minimizer. In this article, we investigate EKI convergence in general nonlinear settings. To improve convergence speed and stability, we consider applying EKI with non-constant step-sizes and covariance inflation. We prove that EKI can hit critical points with finite steps in non-convex settings. We further prove that EKI converges to the global minimizer polynomially fast if the loss function is strongly convex. We verify the analysis presented with numerical experiments on two inverse problems.
Over the past few years, numerous computational models have been developed to solve Optimal Transport (OT) in a stochastic setting, where distributions are represented by samples. In such situations, the goal is to find a transport map that has good generalization properties on unseen data, ideally the closest map to the ground truth, unknown in practical settings. However, in the absence of ground truth, no quantitative criterion has been put forward to measure its generalization performance although it is crucial for model selection. We propose to leverage the Brenier formulation of OT to perform this task. Theoretically, we show that this formulation guarantees that, up to a distortion parameter that depends on the smoothness/strong convexity and a statistical deviation term, the selected map achieves the lowest quadratic error to the ground truth. This criterion, estimated via convex optimization, enables parameter and model selection among entropic regularization of OT, input convex neural networks and smooth and strongly convex nearest-Brenier (SSNB) models. Last, we make an experiment questioning the use of OT in Domain-Adaptation. Thanks to the criterion, we can identify the potential that is closest to the true OT map between the source and the target and we observe that this selected potential is not the one that performs best for the downstream transfer classification task.
We present a novel sampling-based method for estimating probabilities of rare or failure events. Our approach is founded on the Ensemble Kalman filter (EnKF) for inverse problems. Therefore, we reformulate the rare event problem as an inverse problem and apply the EnKF to generate failure samples. To estimate the probability of failure, we use the final EnKF samples to fit a distribution model and apply Importance Sampling with respect to the fitted distribution. This leads to an unbiased estimator if the density of the fitted distribution admits positive values within the whole failure domain. To handle multi-modal failure domains, we localise the covariance matrices in the EnKF update step around each particle and fit a mixture distribution model in the Importance Sampling step. For affine linear limit-state functions, we investigate the continuous-time limit and large time properties of the EnKF update. We prove that the mean of the particles converges to a convex combination of the most likely failure point and the mean of the optimal Importance Sampling density if the EnKF is applied without noise. We provide numerical experiments to compare the performance of the EnKF with Sequential Importance Sampling.
In this paper, we address a new problem of reversing the effect of an image filter, which can be linear or nonlinear. The assumption is that the algorithm of the filter is unknown and the filter is available as a black box. We formulate this inverse problem as minimizing a local patch-based cost function and use total derivative to approximate the gradient which is used in gradient descent to solve the problem. We analyze factors affecting the convergence and quality of the output in the Fourier domain. We also study the application of accelerated gradient descent algorithms in three gradient-free reverse filters, including the one proposed in this paper. We present results from extensive experiments to evaluate the complexity and effectiveness of the proposed algorithm. Results demonstrate that the proposed algorithm outperforms the state-of-the-art in that (1) it is at the same level of complexity as that of the fastest reverse filter, but it can reverse a larger number of filters, and (2) it can reverse the same list of filters as that of the very complex reverse filter, but its complexity is much smaller.
We study optimization problems where the objective function is modeled through feedforward neural networks with rectified linear unit (ReLU) activation. Recent literature has explored the use of a single neural network to model either uncertain or complex elements within an objective function. However, it is well known that ensembles of neural networks produce more stable predictions and have better generalizability than models with single neural networks, which suggests the application of ensembles of neural networks in a decision-making pipeline. We study how to incorporate a neural network ensemble as the objective function of an optimization model and explore computational approaches for the ensuing problem. We present a mixed-integer linear program based on existing popular big-$M$ formulations for optimizing over a single neural network. We develop two acceleration techniques for our model, the first one is a preprocessing procedure to tighten bounds for critical neurons in the neural network while the second one is a set of valid inequalities based on Benders decomposition. Experimental evaluations of our solution methods are conducted on one global optimization problem and two real-world data sets; the results suggest that our optimization algorithm outperforms the adaption of an state-of-the-art approach in terms of computational time and optimality gaps.
Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era where the standard, full-resample size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size $b$ can be computationally challenging with both $b$ and the sample size $n$ being very large. In the paper at hand, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective -- and computationally feasible -- distribution estimation via subsampling. Further, we show how the same set of subsamples can be used to yield a procedure for subsampling aggregation -- also known as subagging -- that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as compared to $\hat \theta _n$. The paper is concluded by showing how to conduct inference, e.g., confidence intervals, based on the scalable subagging estimator instead of the original $\hat \theta _n$.
Nesterov's accelerated forward-backward algorithm (AFBA) is an efficient algorithm for solving a class of two-term convex optimization models consisting of a differentiable function with a Lipschitz continuous gradient plus a nondifferentiable function with a closed form of its proximity operator. It has been shown that the iterative sequence generated by AFBA with a modified Nesterov's momentum scheme converges to a minimizer of the objective function with an $o\left(\frac{1}{k^2}\right)$ convergence rate in terms of the function value (FV-convergence rate) and an $o\left(\frac{1}{k}\right)$ convergence rate in terms of the distance between consecutive iterates (DCI-convergence rate). In this paper, we propose a more general momentum scheme with an introduced power parameter $\omega\in(0,1]$ and show that AFBA with the proposed momentum scheme converges to a minimizer of the objective function with an $o\left(\frac{1}{k^{2\omega}}\right)$ FV-convergence rate and an $o\left(\frac{1}{k^{\omega}}\right)$ DCI-convergence rate. The generality of the proposed momentum scheme provides us a variety of parameter selections for different scenarios, which makes the resulting algorithm more flexible to achieve better performance. We then employ AFBA with the proposed momentum scheme to solve the smoothed hinge loss $\ell_1$-support vector machine model. Numerical results demonstrate that the proposed generalized momentum scheme outperforms two existing momentum schemes.
Spatially inhomogeneous functions, which may be smooth in some regions and rough in other regions, are modelled naturally in a Bayesian manner using so-called Besov priors which are given by random wavelet expansions with Laplace-distributed coefficients. This paper studies theoretical guarantees for such prior measures - specifically, we examine their frequentist posterior contraction rates in the setting of non-linear inverse problems with Gaussian white noise. Our results are first derived under a general local Lipschitz assumption on the forward map. We then verify the assumption for two non-linear inverse problems arising from elliptic partial differential equations, the Darcy flow model from geophysics as well as a model for the Schr\"odinger equation appearing in tomography. In the course of the proofs, we also obtain novel concentration inequalities for penalized least squares estimators with $\ell^1$ wavelet penalty, which have a natural interpretation as maximum a posteriori (MAP) estimators. The true parameter is assumed to belong to some spatially inhomogeneous Besov class $B^{\alpha}_{11}$, $\alpha>0$. In a setting with direct observations, we complement these upper bounds with a lower bound on the rate of contraction for arbitrary Gaussian priors. An immediate consequence of our results is that while Laplace priors can achieve minimax-optimal rates over $B^{\alpha}_{11}$-classes, Gaussian priors are limited to a (by a polynomial factor) slower contraction rate. This gives information-theoretical justification for the intuition that Laplace priors are more compatible with $\ell^1$ regularity structure in the underlying parameter.
The Bayesian inference is widely used in many scientific and engineering problems, especially in the linear inverse problems in infinite-dimensional setting where the unknowns are functions. In such problems, choosing an appropriate prior distribution is an important task. In particular, when the function to infer has much detail information, such as many sharp jumps, corners, and the discontinuous and nonsmooth oscillation, the so-called total variation-Gaussian (TG) prior is proposed in function space to address it. However, the TG prior is easy to lead the blocky (staircase) effect in numerical results. In this work, we present a fractional order-TG (FTG) hybrid prior to deal with such problems, where the fractional order total variation (FTV) term is used to capture the detail information of the unknowns and simultaneously uses the Gaussian measure to ensure that it results in a well-defined posterior measure. For the numerical implementations of linear inverse problems in function spaces, we also propose an efficient independence sampler based on a transport map, which uses a proposal distribution derived from a diagonal map, and the acceptance probability associated to the proposal is independent of discretization dimensionality. And in order to take full advantage of the transport map, the hierarchical Bayesian framework is applied to flexibly determine the regularization parameter. Finally we provide some numerical examples to demonstrate the performance of the FTG prior and the efficiency and robustness of the proposed independence sampler method.
Diffusion models have recently attained significant interest within the community owing to their strong performance as generative models. Furthermore, its application to inverse problems have demonstrated state-of-the-art performance. Unfortunately, diffusion models have a critical downside - they are inherently slow to sample from, needing few thousand steps of iteration to generate images from pure Gaussian noise. In this work, we show that starting from Gaussian noise is unnecessary. Instead, starting from a single forward diffusion with better initialization significantly reduces the number of sampling steps in the reverse conditional diffusion. This phenomenon is formally explained by the contraction theory of the stochastic difference equations like our conditional diffusion strategy - the alternating applications of reverse diffusion followed by a non-expansive data consistency step. The new sampling strategy, dubbed Come-Closer-Diffuse-Faster (CCDF), also reveals a new insight on how the existing feed-forward neural network approaches for inverse problems can be synergistically combined with the diffusion models. Experimental results with super-resolution, image inpainting, and compressed sensing MRI demonstrate that our method can achieve state-of-the-art reconstruction performance at significantly reduced sampling steps.
Policy gradient methods are widely used in reinforcement learning algorithms to search for better policies in the parameterized policy space. They do gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems. This has been recently extended for non-convex and also stochastic optimization. We use Nesterov's acceleration for policy gradient search in the well-known actor-critic algorithm and show the convergence using ODE method. We tested this algorithm on a scheduling problem. Here an incoming job is scheduled into one of the four queues based on the queue lengths. We see from experimental results that algorithm using Nesterov's acceleration has significantly better performance compared to algorithm which do not use acceleration. To the best of our knowledge this is the first time Nesterov's acceleration has been used with actor-critic algorithm.