亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Recent developments in applications of artificial neural networks with over $n=10^{14}$ parameters make it extremely important to study the large $n$ behaviour of such networks. Most works studying wide neural networks have focused on the infinite width $n \to +\infty$ limit of such networks and have shown that, at initialization, they correspond to Gaussian processes. In this work we will study their behavior for large, but finite $n$. Our main contributions are the following: (1) The computation of the corrections to Gaussianity in terms of an asymptotic series in $n^{-\frac{1}{2}}$. The coefficients in this expansion are determined by the statistics of parameter initialization and by the activation function. (2) Controlling the evolution of the outputs of finite width $n$ networks, during training, by computing deviations from the limiting infinite width case (in which the network evolves through a linear flow). This improves previous estimates and yields sharper decay rates for the (finite width) NTK in terms of $n$, valid during the entire training procedure. As a corollary, we also prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) Estimating how the deviations from Gaussianity evolve with training in terms of $n$. In particular, using a certain metric in the space of measures we find that, along training, the resulting measure is within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time dependent Gaussian process corresponding to the infinite width network (which is explicitly given by precomposing the initial Gaussian process with the linear flow corresponding to training in the infinite width limit).

相關內容

高斯過程(Gaussian Process, GP)是(shi)概率論和(he)數(shu)理統計中(zhong)隨(sui)(sui)機(ji)過程(stochastic process)的(de)(de)一(yi)種,是(shi)一(yi)系列(lie)服從(cong)正(zheng)態(tai)分(fen)布(bu)的(de)(de)隨(sui)(sui)機(ji)變(bian)(bian)量(liang)(random variable)在一(yi)指數(shu)集(ji)(index set)內的(de)(de)組(zu)合。 高斯過程中(zhong)任意隨(sui)(sui)機(ji)變(bian)(bian)量(liang)的(de)(de)線性組(zu)合都(dou)服從(cong)正(zheng)態(tai)分(fen)布(bu),每個(ge)有(you)限維(wei)分(fen)布(bu)都(dou)是(shi)聯合正(zheng)態(tai)分(fen)布(bu),且其本身在連續指數(shu)集(ji)上(shang)的(de)(de)概率密度函數(shu)即是(shi)所有(you)隨(sui)(sui)機(ji)變(bian)(bian)量(liang)的(de)(de)高斯測度,因此被視為聯合正(zheng)態(tai)分(fen)布(bu)的(de)(de)無限維(wei)廣義延伸。高斯過程由其數(shu)學期望和(he)協方差函數(shu)完全(quan)決定,并繼承了正(zheng)態(tai)分(fen)布(bu)的(de)(de)諸多性質(zhi)

Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can the structure of a decision-making task be used to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. These assumptions both lead to approaches with high computational cost, and when they are violated in practice, poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.

The analytic inference, e.g. predictive distribution being in closed form, may be an appealing benefit for machine learning practitioners when they treat wide neural networks as Gaussian process in Bayesian setting. The realistic widths, however, are finite and cause weak deviation from the Gaussianity under which partial marginalization of random variables in a model is straightforward. On the basis of multivariate Edgeworth expansion, we propose a non-Gaussian distribution in differential form to model a finite set of outputs from a random neural network, and derive the corresponding marginal and conditional properties. Thus, we are able to derive the non-Gaussian posterior distribution in Bayesian regression task. In addition, in the bottlenecked deep neural networks, a weight space representation of deep Gaussian process, the non-Gaussianity is investigated through the marginal kernel.

State-of-the-art deep learning models have a parameter count that reaches into the billions. Training, storing and transferring such models is energy and time consuming, thus costly. A big part of these costs is caused by training the network. Model compression lowers storage and transfer costs, and can further make training more efficient by decreasing the number of computations in the forward and/or backward pass. Thus, compressing networks also at training time while maintaining a high performance is an important research topic. This work is a survey on methods which reduce the number of trained weights in deep learning models throughout the training. Most of the introduced methods set network parameters to zero which is called pruning. The presented pruning approaches are categorized into pruning at initialization, lottery tickets and dynamic sparse training. Moreover, we discuss methods that freeze parts of a network at its random initialization. By freezing weights, the number of trainable parameters is shrunken which reduces gradient computations and the dimensionality of the model's optimization space. In this survey we first propose dimensionality reduced training as an underlying mathematical model that covers pruning and freezing during training. Afterwards, we present and discuss different dimensionality reduced training methods.

This paper proposes a minimal contractor and a minimal separator for an area delimited by an hyperbola of the plane. The task is facilitated using actions induced by the hyperoctahedral group of symmetries. An application related to the localization of an object using a TDoA (Time Differential of Arrival) technique is proposed.

A strong blocking set in a finite projective space is a set of points that intersects each hyperplane in a spanning set. We provide a new graph theoretic construction of such sets: combining constant-degree expanders with asymptotically good codes, we explicitly construct strong blocking sets in the $(k-1)$-dimensional projective space over $\mathbb{F}_q$ that have size $O( q k )$. Since strong blocking sets have recently been shown to be equivalent to minimal linear codes, our construction gives the first explicit construction of $\mathbb{F}_q$-linear minimal codes of length $n$ and dimension $k$, for every prime power $q$, for which $n = O (q k)$. This solves one of the main open problems on minimal codes.

Several well known estimators of finite population mean and its functions are investigated under some standard sampling designs. Such functions of mean include the variance, the correlation coefficient and the regression coefficient in the population as special cases. We compare the performance of these estimators under different sampling designs based on their asymptotic distributions. Equivalence classes of estimators under different sampling designs are constructed so that estimators in the same class have equivalent performance in terms of asymptotic mean squared errors (MSEs). Estimators in different equivalence classes are then compared under some superpopulations satisfying linear models. It is shown that the pseudo empirical likelihood (PEML) estimator of the population mean under simple random sampling without replacement (SRSWOR) has the lowest asymptotic MSE among all the estimators under different sampling designs considered in this paper. It is also shown that for the variance, the correlation coefficient and the regression coefficient of the population, the plug-in estimators based on the PEML estimator have the lowest asymptotic MSEs among all the estimators considered in this paper under SRSWOR. On the other hand, for any high entropy $\pi$PS (HE$\pi$PS) sampling design, which uses the auxiliary information, the plug-in estimators of those parameters based on the H\'ajek estimator have the lowest asymptotic MSEs among all the estimators considered in this paper.

It is typically understood that the training of modern neural networks is a process of fitting the probability distribution of desired output. However, recent paradoxical observations in a number of language generation tasks let one wonder if this canonical probability-based explanation can really account for the empirical success of deep learning. To resolve this issue, we propose an alternative utility-based explanation to the standard supervised learning procedure in deep learning. The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function that encodes the preference revealed in training data. In this perspective, training of the neural network corresponds to a utility learning process. Specifically, we show that for all neural networks with softmax outputs, the SGD learning dynamic of maximum likelihood estimation (MLE) can be seen as an iteration process that optimizes the neural network toward an optimal utility function. This utility-based interpretation can explain several otherwise-paradoxical observations about the neural networks thus trained. Moreover, our utility-based theory also entails an equation that can transform the learned utility values back to a new kind of probability estimation with which probability-compatible decision rules enjoy dramatic (double-digits) performance improvements. These evidences collectively reveal a phenomenon of utility-probability duality in terms of what modern neural networks are (truly) modeling: We thought they are one thing (probabilities), until the unexplainable showed up; changing mindset and treating them as another thing (utility values) largely reconcile the theory, despite remaining subtleties regarding its original (probabilistic) identity.

Hypergeometric sequences are rational-valued sequences that satisfy first-order linear recurrence relations with polynomial coefficients; that is, a hypergeometric sequence $\langle u_n \rangle_{n=0}^{\infty}$ is one that satisfies a recurrence of the form $f(n)u_n = g(n)u_{n-1}$ where $f,g \in \mathbb{Z}[x]$. In this paper, we consider the Membership Problem for hypergeometric sequences: given a hypergeometric sequence $\langle u_n \rangle_{n=0}^{\infty}$ and a target value $t\in \mathbb{Q}$, determine whether $u_n=t$ for some index $n$. We establish decidability of the Membership Problem under the assumption that either (i) $f$ and $g$ have distinct splitting fields or (ii) $f$ and $g$ are monic polynomials that both split over a quadratic extension of $\mathbb{Q}$. Our results are based on an analysis of the prime divisors of polynomial sequences $\langle f(n) \rangle_{n=1}^\infty$ and $\langle g(n) \rangle_{n=1}^\infty$ appearing in the recurrence relation.

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.

Graph Convolutional Networks (GCNs) have recently become the primary choice for learning from graph-structured data, superseding hash fingerprints in representing chemical compounds. However, GCNs lack the ability to take into account the ordering of node neighbors, even when there is a geometric interpretation of the graph vertices that provides an order based on their spatial positions. To remedy this issue, we propose Geometric Graph Convolutional Network (geo-GCN) which uses spatial features to efficiently learn from graphs that can be naturally located in space. Our contribution is threefold: we propose a GCN-inspired architecture which (i) leverages node positions, (ii) is a proper generalisation of both GCNs and Convolutional Neural Networks (CNNs), (iii) benefits from augmentation which further improves the performance and assures invariance with respect to the desired properties. Empirically, geo-GCN outperforms state-of-the-art graph-based methods on image classification and chemical tasks.

北京阿比特科技有限公司