亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Plug-and-Play (PnP) methods solve ill-posed inverse problems through iterative proximal algorithms by replacing a proximal operator by a denoising operation. When applied with deep neural network denoisers, these methods have shown state-of-the-art visual performance for image restoration problems. However, their theoretical convergence analysis is still incomplete. Most of the existing convergence results consider nonexpansive denoisers, which is non-realistic, or limit their analysis to strongly convex data-fidelity terms in the inverse problem to solve. Recently, it was proposed to train the denoiser as a gradient descent step on a functional parameterized by a deep neural network. Using such a denoiser guarantees the convergence of the PnP version of the Half-Quadratic-Splitting (PnP-HQS) iterative algorithm. In this paper, we show that this gradient denoiser can actually correspond to the proximal operator of another scalar function. Given this new result, we exploit the convergence theory of proximal algorithms in the nonconvex setting to obtain convergence results for PnP-PGD (Proximal Gradient Descent) and PnP-ADMM (Alternating Direction Method of Multipliers). When built on top of a smooth gradient denoiser, we show that PnP-PGD and PnP-ADMM are convergent and target stationary points of an explicit functional. These convergence results are confirmed with numerical experiments on deblurring, super-resolution and inpainting.

相關內容

We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as the Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme enjoys local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the iteration, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still enjoys a superlinear convergence~rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.

In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of optimization strategies that utilize a cubic-regularization (CR) term in place of trust-regions and line-searches. LQN methods offer a large-scale alternative to using explicit second-order information by taking identical inputs to those used by popular first-order methods such as stochastic gradient descent (SGD). Solving the CR subproblem exactly requires Newton's method, yet using properties of the internal structure of LQN matrices, we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing large speedups and scaling into modern size requirements. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide experimental results when the SR1 update is used, which show substantial speed-ups and competitive performance compared to Adam and other second order optimizers on deep neural networks (DNNs). We find that our new approach, ARCLQN, compares to modern optimizers with minimal tuning, a common pain-point for second order methods.

We employ kernel-based approaches that use samples from a probability distribution to approximate a Kolmogorov operator on a manifold. The self-tuning variable-bandwidth kernel method [Berry & Harlim, Appl. Comput. Harmon. Anal., 40(1):68--96, 2016] computes a large, sparse matrix that approximates the differential operator. Here, we use the eigendecomposition of the discretization to (i) invert the operator, solving a differential equation, and (ii) represent gradient vector fields on the manifold. These methods only require samples from the underlying distribution and, therefore, can be applied in high dimensions or on geometrically complex manifolds when spatial discretizations are not available. We also employ an efficient $k$-$d$ tree algorithm to compute the sparse kernel matrix, which is a computational bottleneck.

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate - in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, both for sampling with or without replacement. Furthermore, for logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$ - similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation to similar behaviors observed in deep networks, when trained with SGD.

Multigrid is a powerful solver for large-scale linear systems arising from discretized partial differential equations. The convergence theory of multigrid methods for symmetric positive definite problems has been well developed over the past decades, while, for nonsymmetric problems, such theory is still not mature. As a foundation for multigrid analysis, two-grid convergence theory plays an important role in motivating multigrid algorithms. Regarding two-grid methods for nonsymmetric problems, most previous works focus on the spectral radius of iteration matrix or rely on convergence measures that are typically difficult to compute in practice. Moreover, the existing results are confined to two-grid methods with exact solution of the coarse-grid system. In this paper, we analyze the convergence of a two-grid method for nonsymmetric positive definite problems (e.g., linear systems arising from the discretizations of convection-diffusion equations). In the case of exact coarse solver, we establish an elegant identity for characterizing two-grid convergence factor, which is measured by a smoother-induced norm. The identity can be conveniently used to derive a class of optimal restriction operators and analyze how the convergence factor is influenced by restriction. More generally, we present some convergence estimates for an inexact variant of the two-grid method, in which both linear and nonlinear coarse solvers are considered.

In the storied Colonel Blotto game, two colonels allocate $a$ and $b$ troops, respectively, to $k$ distinct battlefields. A colonel wins a battle if they assign more troops to that particular battle, and each colonel seeks to maximize their total number of victories. Despite the problem's formulation in 1921, the first polynomial-time algorithm to compute Nash equilibrium (NE) strategies for this game was discovered only quite recently. In 2016, \citep{ahmadinejad_dehghani_hajiaghayi_lucier_mahini_seddighin_2019} formulated a breakthrough algorithm to compute NE strategies for the Colonel Blotto game\footnote{To the best of our knowledge, the algorithm from \citep{ahmadinejad_dehghani_hajiaghayi_lucier_mahini_seddighin_2019} has computational complexity $O(k^{14}\max\{a,b\}^{13})$}, receiving substantial media coverage (e.g. \citep{Insider}, \citep{NSF}, \citep{ScienceDaily}). In this work, we present the first known $\epsilon$-approximation algorithm to compute NE strategies in the two-player Colonel Blotto game in runtime $\widetilde{O}(\epsilon^{-4} k^8 \max\{a,b\}^2)$ for arbitrary settings of these parameters. Moreover, this algorithm computes approximate coarse correlated equilibrium strategies in the multiplayer (continuous and discrete) Colonel Blotto game (when there are $\ell > 2$ colonels) with runtime $\widetilde{O}(\ell \epsilon^{-4} k^8 n^2 + \ell^2 \epsilon^{-2} k^3 n (n+k))$, where $n$ is the maximum troop count. Before this work, no polynomial-time algorithm was known to compute exact or approximate equilibrium (in any sense) strategies for multiplayer Colonel Blotto with arbitrary parameters. Our algorithm computes these approximate equilibria by a novel (to the author's knowledge) sampling technique with which we implicitly perform multiplicative weights update over the exponentially many strategies available to each player.

Low-rank matrix estimation under heavy-tailed noise is challenging, both computationally and statistically. Convex approaches have been proven statistically optimal but suffer from high computational costs, especially since robust loss functions are usually non-smooth. More recently, computationally fast non-convex approaches via sub-gradient descent are proposed, which, unfortunately, fail to deliver a statistically consistent estimator even under sub-Gaussian noise. In this paper, we introduce a novel Riemannian sub-gradient (RsGrad) algorithm which is not only computationally efficient with linear convergence but also is statistically optimal, be the noise Gaussian or heavy-tailed. Convergence theory is established for a general framework and specific applications to absolute loss, Huber loss, and quantile loss are investigated. Compared with existing non-convex methods, ours reveals a surprising phenomenon of dual-phase convergence. In phase one, RsGrad behaves as in a typical non-smooth optimization that requires gradually decaying stepsizes. However, phase one only delivers a statistically sub-optimal estimator which is already observed in the existing literature. Interestingly, during phase two, RsGrad converges linearly as if minimizing a smooth and strongly convex objective function and thus a constant stepsize suffices. Underlying the phase-two convergence is the smoothing effect of random noise to the non-smooth robust losses in an area close but not too close to the truth. Lastly, RsGrad is applicable for low-rank tensor estimation under heavy-tailed noise where a statistically optimal rate is attainable with the same phenomenon of dual-phase convergence, and a novel shrinkage-based second-order moment method is guaranteed to deliver a warm initialization. Numerical simulations confirm our theoretical discovery and showcase the superiority of RsGrad over prior methods.

We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the trained weights, as a function of the layer index, admits a scaling limit which is H\"older continuous as the depth of the network tends to infinity. The proofs are based on non-asymptotic estimates of the loss function and of norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.

CP decomposition (CPD) is prevalent in chemometrics, signal processing, data mining and many more fields. While many algorithms have been proposed to compute the CPD, alternating least squares (ALS) remains one of the most widely used algorithm for computing the decomposition. Recent works have introduced the notion of eigenvalues and singular values of a tensor and explored applications of eigenvectors and singular vectors in areas like signal processing, data analytics and in various other fields. We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work. Computing these critical points in an alternating manner motivates an alternating optimization algorithm which corresponds to alternating least squares algorithm in the matrix case. However, for tensors with order greater than equal to $3$, it minimizes an objective function which is different from the commonly used least squares loss. Alternating optimization of this new objective leads to simple updates to the factor matrices with the same asymptotic computational cost as ALS. We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank and verify it experimentally. We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors. This perspective allows us to generalize our approach to interpolate between updates corresponding to the ALS and the new algorithm to manage the tradeoff between stability and fitness of the decomposition. Our experimental results show that for approximating synthetic and real-world tensors, this algorithm and its variants converge to a better conditioned decomposition with comparable and sometimes better fitness as compared to the ALS algorithm.

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.

北京阿比特科技有限公司