Monotonic linear interpolation (MLI), the phenomenon that the loss and accuracy are monotonic along the line connecting a random initialization with the minimizer it converges to, is commonly observed in the training of neural networks. This phenomenon may seem to suggest that optimizing neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of the optimization problem, and that empirical observations of MLI in deep neural networks depend heavily on biases. In particular, we show that linearly interpolating the weights and the biases influences the final output in very different ways, and that when different classes have different last-layer biases in a deep network, there is a long plateau in both the loss and the accuracy interpolation (which existing theory of MLI cannot explain). We also show, using a simple model, how the last-layer biases can differ across classes even on a perfectly balanced dataset. Empirically, we demonstrate that similar intuitions hold on practical networks and realistic datasets.
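
As a concrete reference point, here is a minimal sketch of the standard MLI probe, assuming PyTorch models `net_init` (the random initialization) and `net_final` (the converged minimizer), a data `loader`, and a mean-reduced `loss_fn`; all of these names are illustrative, not from the paper. Note that `parameters()` iterates over both weights and biases, whose interpolations the abstract argues affect the output very differently.

```python
import copy
import torch

def interpolation_curve(net_init, net_final, loader, loss_fn, alphas):
    """Evaluate the loss along theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_1."""
    losses = []
    for alpha in alphas:
        net = copy.deepcopy(net_final)
        net.eval()
        with torch.no_grad():
            # Interpolate every parameter tensor, weights and biases alike.
            for p, p0, p1 in zip(net.parameters(),
                                 net_init.parameters(),
                                 net_final.parameters()):
                p.copy_((1 - alpha) * p0 + alpha * p1)
            total, count = 0.0, 0
            for x, y in loader:
                total += loss_fn(net(x), y).item() * len(y)
                count += len(y)
        losses.append(total / count)
    return losses  # MLI holds when this curve is monotonically decreasing
```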

Related Content

Studying causal effects of continuous treatments is important for gaining a deeper understanding of many interventions, policies, or medications, yet researchers are often left with observational studies for doing so. In the observational setting, confounding is a barrier to the estimation of causal effects. Weighting approaches seek to control for confounding by reweighting samples so that confounders are comparable across different treatment values. Yet, for continuous treatments, weighting methods are highly sensitive to model misspecification. In this paper we elucidate the key property that makes weights effective in estimating causal quantities involving continuous treatments. We show that to eliminate confounding, weights should make treatment and confounders independent on the weighted scale. We develop a measure that characterizes the degree to which a set of weights induces such independence. Further, we propose a new model-free method for weight estimation by optimizing our measure. We study the theoretical properties of our measure and our weights, and prove that our weights can explicitly mitigate treatment-confounder dependence. The empirical effectiveness of our approach is demonstrated in a suite of challenging numerical experiments, where we find that our weights are quite robust and work well under a broad range of settings.
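
To illustrate the key property, here is a deliberately crude sketch: score a candidate weight vector by the squared weighted correlations between the treatment and each confounder. The paper's measure captures dependence far more generally than linear correlation, so this is only a toy stand-in.

```python
import numpy as np

def weighted_corr(a, x, w):
    """Pearson correlation of a and x under sample weights w."""
    w = w / w.sum()
    a_c = a - np.average(a, weights=w)
    x_c = x - np.average(x, weights=w)
    cov = np.average(a_c * x_c, weights=w)
    return cov / np.sqrt(np.average(a_c ** 2, weights=w) *
                         np.average(x_c ** 2, weights=w))

def dependence_score(a, X, w):
    """Residual treatment-confounder dependence on the weighted scale.

    a: continuous treatment (n,); X: confounders (n, p); w: weights (n,).
    Zero indicates no (linear) dependence; good weights drive this down.
    """
    return sum(weighted_corr(a, X[:, j], w) ** 2 for j in range(X.shape[1]))
```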

In many research fields in artificial intelligence, deep neural networks have been shown to be useful for estimating unknown functions on high-dimensional input spaces. However, their generalization performance is not yet completely clarified from a theoretical point of view, because they are non-identifiable and singular learning machines. Moreover, the ReLU function is not differentiable, so the algebraic and analytic methods of singular learning theory cannot be applied to it. In this paper, we study deep ReLU neural networks in overparametrized cases and prove that the Bayesian free energy, which equals the minus log marginal likelihood (or the Bayesian stochastic complexity), is bounded even if the number of layers is larger than necessary to estimate the unknown data-generating function. Since the Bayesian generalization error equals the increase of the free energy as a function of the sample size, our result also shows that the Bayesian generalization error does not increase even if a deep ReLU neural network is designed to be sufficiently large, i.e., in an overparametrized state.
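
For orientation, the free-energy/generalization-error relation the abstract relies on can be written as follows (a standard identity from singular learning theory; the notation here is generic rather than the paper's):

```latex
% Z_n: marginal likelihood; F_n: Bayesian free energy;
% G_n: Bayesian generalization error; S: entropy of the true distribution.
F_n = -\log \int \prod_{i=1}^{n} p(X_i \mid w)\, \varphi(w)\, dw,
\qquad
\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] - S.
```

Heuristically, if $\mathbb{E}[F_n] = nS + \lambda \log n + o(\log n)$, differencing gives $\mathbb{E}[G_n] \approx \lambda / n$, which is why a bounded free-energy coefficient controls the generalization error.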

In this article, we introduce parallel-in-time methods for state and parameter estimation in general nonlinear non-Gaussian state-space models using the statistical linear regression and the iterated statistical posterior linearization paradigms. We also reformulate the proposed methods in a square-root form, resulting in improved numerical stability while preserving the parallelization capabilities. We then leverage the fixed-point structure of our methods to perform likelihood-based parameter estimation in logarithmic time with respect to the number of observations. Finally, we demonstrate the practical performance of the methodology with numerical experiments run on a graphics processing unit (GPU).
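
The logarithmic-time claim comes from writing the per-step recursions as an associative operator and evaluating all prefixes with a parallel scan. Below is a minimal JAX sketch of that pattern, using an affine mean recursion $m_k = A_k m_{k-1} + b_k$ as an illustrative stand-in for the paper's full smoothing operator:

```python
import jax
import jax.numpy as jnp

def combine(e1, e2):
    # Compose affine maps: applying (A1, b1) then (A2, b2) gives
    # x -> A2 (A1 x + b1) + b2 = (A2 A1) x + (A2 b1 + b2).
    A1, b1 = e1
    A2, b2 = e2
    return A2 @ A1, jnp.einsum('...ij,...j->...i', A2, b1) + b2

n, d = 1024, 2
key = jax.random.PRNGKey(0)
A = 0.9 * jnp.broadcast_to(jnp.eye(d), (n, d, d))   # per-step transitions
b = jax.random.normal(key, (n, d))                  # per-step offsets

# All n prefix compositions in O(log n) parallel depth on a GPU/TPU.
_, m = jax.lax.associative_scan(combine, (A, b))
# m[k] is the k-th iterate of m_k = A_k m_{k-1} + b_k started from m_0 = 0.
```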

We construct bootstrap confidence intervals for a monotone regression function. It has been shown that the ordinary nonparametric bootstrap, based on the nonparametric least squares estimator (LSE) $\hat f_n$, is inconsistent in this situation. We show, however, that a consistent bootstrap can be based on a smoothed version of $\hat f_n$, called the SLSE (Smoothed Least Squares Estimator). The asymptotic pointwise distribution of the SLSE is derived. The confidence intervals, based on the smoothed bootstrap, are compared to intervals based on the (not necessarily monotone) Nadaraya-Watson estimator, and the effect of Studentization is investigated. We also give a method for automatic bandwidth choice, correcting work in Sen and Xu (2015). The procedure is illustrated on a well-known dataset related to climate change.
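
A minimal sketch of the smoothed-bootstrap construction described above, assuming a Gaussian kernel and a user-supplied bandwidth `h`; the paper's automatic bandwidth rule and exact resampling scheme are not reproduced here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def slse(x, y, grid, h):
    """Isotonic LSE followed by kernel smoothing: a simple SLSE."""
    f_hat = IsotonicRegression(out_of_bounds="clip").fit(x, y).predict(x)
    k = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (k * f_hat).sum(axis=1) / k.sum(axis=1)

def smoothed_bootstrap_ci(x, y, x0, h, B=500, alpha=0.05, seed=0):
    """Pointwise bootstrap CI for the regression function at x0."""
    rng = np.random.default_rng(seed)
    fitted = slse(x, y, x, h)
    resid = y - fitted
    stats = [slse(x, fitted + rng.choice(resid, len(y)), np.array([x0]), h)[0]
             for _ in range(B)]
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))
```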

Training deep neural networks (DNNs) is an important and challenging optimization problem in machine learning due to its non-convexity and non-separable structure. Alternating minimization (AM) approaches split the composition structure of DNNs and have drawn great interest in the deep learning and optimization communities. In this paper, we propose a unified framework for analyzing the convergence rate of AM-type network training methods. Our analysis is based on non-monotone $j$-step sufficient decrease conditions and the Kurdyka-Lojasiewicz (KL) property, which relaxes the requirement of designing descent algorithms. We derive detailed local convergence rates as the KL exponent $\theta$ varies over $[0,1)$. Moreover, local R-linear convergence is discussed under a stronger $j$-step sufficient decrease condition.
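
For reference, the rate trichotomy that KL-based analyses of this kind typically yield as the exponent $\theta$ ranges over $[0,1)$ (the paper's exact constants and conditions may differ):

```latex
\|x^{k} - x^{*}\| \le
\begin{cases}
0 \ \text{after finitely many steps}, & \theta = 0,\\[2pt]
c\,\rho^{k} \ \text{for some } \rho \in (0,1) \ \text{(R-linear)}, & \theta \in \left(0, \tfrac{1}{2}\right],\\[2pt]
C\,k^{-\frac{1-\theta}{2\theta - 1}} \ \text{(sublinear)}, & \theta \in \left(\tfrac{1}{2}, 1\right).
\end{cases}
```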

Obtaining guarantees on the convergence of the minimizers of empirical risks to those of the true risk is a fundamental matter in statistical learning. Instead of deriving guarantees on the usual estimation error, the goal of this paper is to provide concentration inequalities on the distance between the sets of minimizers of the risks for a broad spectrum of estimation problems. In particular, the risks are defined on metric spaces through probability measures that are also supported on metric spaces. Particular attention is therefore given to including unbounded spaces and non-convex cost functions that may also be unbounded. This work identifies a set of assumptions describing a regime that seems to govern the concentration in many estimation problems, in which the empirical minimizers are stable. This stability can then be leveraged to prove parametric concentration rates in probability and in expectation. The assumptions are verified, and the bounds showcased, on a selection of estimation problems such as barycenters on metric spaces with positive or negative curvature, subspaces of covariance matrices, regression problems, and entropic-Wasserstein barycenters.

In the context of large samples, a small number of individuals can spoil basic statistical indicators such as the mean. It is difficult to detect these atypical individuals automatically, and an alternative strategy is to use robust approaches. This paper focuses on estimating the geometric median of a random variable, which is a robust indicator of central tendency. In order to deal with large samples of data arriving sequentially, we introduce online stochastic Newton algorithms for estimating the geometric median and give their rates of convergence. Since estimates of the median and of the Hessian matrix can be recursively updated, we also determine confidence intervals of the median in any designated direction and perform online statistical tests.
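
The recursive flavor of these estimators can be seen in the following sketch of an averaged stochastic gradient scheme for the geometric median; the paper's stochastic Newton algorithm additionally maintains a recursive Hessian estimate, which is omitted here, and the step-size constants are illustrative.

```python
import numpy as np

def online_geometric_median(stream, d, c=1.0, gamma=0.75, eps=1e-12):
    """One pass over a stream of d-dimensional observations."""
    m = np.zeros(d)        # Robbins-Monro iterate
    m_bar = np.zeros(d)    # Polyak-Ruppert average (the estimator)
    for n, x in enumerate(stream, start=1):
        diff = x - m
        norm = max(np.linalg.norm(diff), eps)
        # Stochastic gradient step on E||X - m|| (gradient is -(x-m)/||x-m||).
        m = m + c * n ** (-gamma) * diff / norm
        m_bar = m_bar + (m - m_bar) / n   # recursive averaging
    return m_bar
```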

We consider distributed stochastic variational inequalities (VIs) on unbounded domains with heterogeneous (non-IID) problem data distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized computation over time-varying networks and the centralized topologies commonly used in Federated Learning. Moreover, multiple local updates can be made on the workers to reduce the communication frequency between them. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly-monotone, monotone, and non-monotone (when a Minty solution exists) settings. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), the iteration counter, data heterogeneity, variance, the number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPPs), e.g., to the training of deep generative adversarial networks (GANs), for which decentralized training has been reported to be extremely challenging. In experiments on the decentralized training of GANs, we demonstrate the effectiveness of our proposed approach.
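
For concreteness, here is a single-worker sketch of the stochastic extragradient step that the distributed method builds on, applied to a toy bilinear saddle point whose VI operator is $F(x, y) = (y, -x)$; the names and step sizes are illustrative.

```python
import numpy as np

def extragradient_step(z, F, gamma, rng):
    xi1, xi2 = rng.random(), rng.random()   # stand-ins for stochastic samples
    z_half = z - gamma * F(z, xi1)          # extrapolation ("look-ahead") step
    return z - gamma * F(z_half, xi2)       # update at the look-ahead point

# Toy saddle point min_x max_y x*y: plain gradient descent-ascent cycles,
# while extragradient converges to the solution (0, 0).
F = lambda z, xi: np.array([z[1], -z[0]])
z, rng = np.array([1.0, 1.0]), np.random.default_rng(0)
for _ in range(1000):
    z = extragradient_step(z, F, 0.1, rng)
```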

For deploying a deep learning model into production, it needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression and consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by letting the thin network imitate the intermediate outputs of the wide network from layer to layer. In the last stage, we further fine-tune this well-initialized deep thin network. The theoretical guarantee is established via mean-field analysis, which shows the advantage of layerwise imitation over training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. Trained with our method, ResNet50 can outperform ResNet101, and BERT_BASE can be comparable with BERT_LARGE, where both of the latter models are trained via the standard training procedures in the literature.
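
A sketch of the stage-two layerwise imitation, assuming matched block lists for the wide (teacher) and thin (student) networks and per-block projections that align the two widths; the module names are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

def imitation_loss(wide_blocks, thin_blocks, projections, x):
    """Sum of per-layer MSEs between projected thin and wide activations."""
    h_wide, h_thin, loss = x, x, 0.0
    for wide, thin, proj in zip(wide_blocks, thin_blocks, projections):
        h_wide = wide(h_wide).detach()        # teacher activations, frozen
        h_thin = thin(h_thin)                 # student activations
        loss = loss + nn.functional.mse_loss(proj(h_thin), h_wide)
    return loss
```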

It is a common paradigm in object detection frameworks to treat all samples equally and aim to maximize performance on average. In this work, we revisit this paradigm through a careful study of how different samples contribute to the overall performance measured in terms of mAP. Our study suggests that the samples in each mini-batch are neither independent nor equally important, and therefore a classifier that is better on average does not necessarily yield higher mAP. Motivated by this study, we propose the notion of Prime Samples, those that play a key role in driving the detection performance. We further develop a simple yet effective sampling and learning strategy called PrIme Sample Attention (PISA) that directs the focus of the training process towards such samples. Our experiments demonstrate that it is often more effective to focus on prime samples than on hard samples when training a detector. In particular, on the MSCOCO dataset, PISA outperforms the random sampling baseline and hard mining schemes, e.g., OHEM and Focal Loss, consistently by more than 1% on both single-stage and two-stage detectors with a strong ResNeXt-101 backbone.
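
As a loose illustration of biasing training toward prime samples, the toy reweighting below ranks positive samples by IoU within a batch and upweights the best-localized ones; the actual PISA method uses an IoU-based hierarchical local rank and a more careful weighting scheme, so treat this purely as a sketch.

```python
import torch

def prime_sample_weights(ious, gamma=2.0, beta=0.2):
    """Per-sample loss weights favoring well-localized (high-IoU) positives."""
    ranks = torch.argsort(torch.argsort(ious, descending=True)).float()
    weights = ((1.0 - ranks / len(ious)) * (1.0 - beta) + beta) ** gamma
    return weights / weights.mean()   # keep the overall loss scale unchanged
```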
