亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Online averaged stochastic gradient algorithms are more and more studied since (i) they can deal quickly with large sample taking values in high dimensional spaces, (ii) they enable to treat data sequentially, (iii) they are known to be asymptotically efficient. In this paper, we focus on giving explicit bounds of the quadratic mean error of the estimates, and this, with very weak assumptions, i.e without supposing that the function we would like to minimize is strongly convex or admits a bounded gradient.

相關內容

Bertrand et al. introduced a model of parameterised systems, where each agent is represented by a finite state system, and studied the following control problem: for any number of agents, does there exist a controller able to bring all agents to a target state? They showed that the problem is decidable and EXPTIME-complete in the adversarial setting, and posed as an open problem the stochastic setting, where the agent is represented by a Markov decision process. In this paper, we show that the stochastic control problem is decidable. Our solution makes significant uses of well quasi orders, of the max-flow min-cut theorem, and of the theory of regular cost functions. We introduce an intermediate problem of independent interest called the sequential flow problem, and study the complexity of solving it.

A randomized Kaczmarz method was recently proposed for phase retrieval, which has been shown numerically to exhibit empirical performance over other state-of-the-art phase retrieval algorithms both in terms of the sampling complexity and in terms of computation time. While the rate of convergence has been studied well in the real case where the signals and measurement vectors are all real-valued, there is no guarantee for the convergence in the complex case. In fact, the linear convergence of the randomized Kaczmarz method for phase retrieval in the complex setting is left as a conjecture by Tan and Vershynin. In this paper, we provide the first theoretical guarantees for it. We show that for random measurements $\mathbf{a}_j \in \mathbb{C}^n, j=1,\ldots,m $ which are drawn independently and uniformly from the complex unit sphere, or equivalent are independent complex Gaussian random vectors, when $m \ge Cn$ for some universal positive constant $C$, the randomized Kaczmarz scheme with a good initialization converges linearly to the target solution (up to a global phase) in expectation with high probability. This gives a positive answer to that conjecture.

We consider the problem of estimating the parameters a Gaussian Mixture Model with K components of known weights, all with an identity covariance matrix. We make two contributions. First, at the population level, we present a sharper analysis of the local convergence of EM and gradient EM, compared to previous works. Assuming a separation of $\Omega(\sqrt{\log K})$, we prove convergence of both methods to the global optima from an initialization region larger than those of previous works. Specifically, the initial guess of each component can be as far as (almost) half its distance to the nearest Gaussian. This is essentially the largest possible contraction region. Our second contribution are improved sample size requirements for accurate estimation by EM and gradient EM. In previous works, the required number of samples had a quadratic dependence on the maximal separation between the K components, and the resulting error estimate increased linearly with this maximal separation. In this manuscript we show that both quantities depend only logarithmically on the maximal separation.

Many Markov Chain Monte Carlo (MCMC) methods leverage gradient information of the potential function of target distribution to explore sample space efficiently. However, computing gradients can often be computationally expensive for large scale applications, such as those in contemporary machine learning. Stochastic Gradient (SG-)MCMC methods approximate gradients by stochastic ones, commonly via uniformly subsampled data points, and achieve improved computational efficiency, however at the price of introducing sampling error. We propose a non-uniform subsampling scheme to improve the sampling accuracy. The proposed exponentially weighted stochastic gradient (EWSG) is designed so that a non-uniform-SG-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method, and hence the inaccuracy due to SG approximation is reduced. EWSG differs from classical variance reduction (VR) techniques as it focuses on the entire distribution instead of just the variance; nevertheless, its reduced local variance is also proved. EWSG can also be viewed as an extension of the importance sampling idea, successful for stochastic-gradient-based optimizations, to sampling tasks. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the MCMC algorithm. Numerical experiments are provided, not only to demonstrate EWSG's effectiveness, but also to guide hyperparameter choices, and validate our \emph{non-asymptotic global error bound} despite of approximations in the implementation. Notably, while statistical accuracy is improved, convergence speed can be comparable to the uniform version, which renders EWSG a practical alternative to VR (but EWSG and VR can be combined too).

This paper considers two fundamental sequential decision-making problems: the problem of prediction with expert advice and the multi-armed bandit problem. We focus on stochastic regimes in which an adversary may corrupt losses, and we investigate what level of robustness can be achieved against adversarial corruptions. The main contribution of this paper is to show that optimal robustness can be expressed by a square-root dependency on the amount of corruption. More precisely, we show that two classes of algorithms, anytime Hedge with decreasing learning rate and algorithms with second-order regret bounds, achieve $O( \frac{\log N}{\Delta} + \sqrt{ \frac{C \log N }{\Delta} } )$-regret, where $N, \Delta$, and $C$ represent the number of experts, the gap parameter, and the corruption level, respectively. We further provide a matching lower bound, which means that this regret bound is tight up to a constant factor. For the multi-armed bandit problem, we also provide a nearly tight lower bound up to a logarithmic factor.

Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers explained it by investigating the implicit regularization effect of optimization algorithms. A remarkable progress is the work (Lyu&Li, 2019), which proves gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Except GD, adaptive algorithms such as AdaGrad, RMSProp and Adam are popular owing to their rapid training process. However, theoretical guarantee for the generalization of adaptive optimization algorithms is still lacking. In this paper, we study the implicit regularization of adaptive optimization algorithms when they are optimizing the logistic loss on homogeneous deep neural networks. We prove that adaptive algorithms that adopt exponential moving average strategy in conditioner (such as Adam and RMSProp) can maximize the margin of the neural network, while AdaGrad that directly sums historical squared gradients in conditioner can not. It indicates superiority on generalization of exponential moving average strategy in the design of the conditioner. Technically, we provide a unified framework to analyze convergent direction of adaptive optimization algorithms by constructing novel adaptive gradient flow and surrogate margin. Our experiments can well support the theoretical findings on convergent direction of adaptive optimization algorithms.

We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model for which query access comes in the form of $\langle X_i, A\rangle:=tr(X_i^\top A)$; perhaps surprisingly these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.

Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces; when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretization error. To get around this, we propose the stochastic CIR process, which removes all discretization error and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.

In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

北京阿比特科技有限公司