
The modelling of small-scale processes is a major source of error in climate models, limiting the accuracy of low-cost models, which must approximate such processes through parameterization. Red noise is essential to many operational parameterization schemes, where it helps to model temporal correlations. We show how to build on the successes of red noise by combining the known benefits of stochasticity with machine learning. This is done using a physically-informed recurrent neural network within a probabilistic framework. Our model is competitive with, and often superior to, both a bespoke baseline and an existing probabilistic machine learning approach (GAN) when applied to the Lorenz 96 atmospheric simulation. This is due to its superior ability to model temporal patterns compared to standard first-order autoregressive schemes. It also generalises to unseen scenarios. We evaluate across a number of metrics from the literature, and also discuss the benefits of using the probabilistic metric of hold-out likelihood.
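For readers unfamiliar with the term, red noise in this setting is usually a first-order autoregressive (AR(1)) process. The following is a minimal sketch of such a process, not the paper's model; the function name `red_noise` and the parameter values (`phi`, `sigma`) are illustrative assumptions.

```python
import numpy as np

def red_noise(n_steps, phi, sigma, rng):
    """Sample an AR(1) series e[t] = phi * e[t-1] + sigma * sqrt(1 - phi^2) * z[t].

    phi is the lag-1 autocorrelation and sigma the stationary standard
    deviation; both are illustrative values, not taken from the paper.
    """
    e = np.zeros(n_steps)
    e[0] = sigma * rng.standard_normal()           # start in the stationary distribution
    scale = sigma * np.sqrt(1.0 - phi**2)          # keeps the variance equal to sigma^2
    for t in range(1, n_steps):
        e[t] = phi * e[t - 1] + scale * rng.standard_normal()
    return e

rng = np.random.default_rng(0)
series = red_noise(100_000, phi=0.9, sigma=1.0, rng=rng)

# Empirical lag-1 correlation; it should be close to phi.
lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
```

A single parameter `phi` fixes the whole correlation structure here, which is the limitation a recurrent network with a richer hidden state is meant to relax.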

Related content

Existing video denoising methods typically assume that noisy videos are degraded from clean videos by adding Gaussian noise. However, deep models trained under this degradation assumption inevitably perform poorly on real videos because of the degradation mismatch. Although some studies attempt to train deep models on noisy and noise-free video pairs captured by cameras, such models only work well for specific cameras and do not generalize well to other videos. In this paper, we propose to lift this limitation and focus on the problem of general real video denoising, with the aim of generalizing well to unseen real-world videos. We tackle this problem by first investigating the common behaviors of video noise and observing two important characteristics: 1) downscaling helps to reduce the noise level in the spatial domain and 2) the information from adjacent frames helps to remove the noise of the current frame in the temporal domain. Motivated by these two observations, we propose a multi-scale recurrent architecture that makes full use of both characteristics. Second, we propose a synthetic real noise degradation model that randomly shuffles different noise types to train the denoising model. With a synthesized and enriched degradation space, our degradation model can help to bridge the distribution gap between training data and real-world data. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance and better generalization ability than existing methods on both synthetic Gaussian denoising and practical real video denoising.
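The first observation (downscaling lowers the spatial noise level) can be illustrated with a small sketch. It assumes independent zero-mean Gaussian noise per pixel and 2x2 average pooling, both chosen here for simplicity; real video noise, which is the paper's actual target, is more structured. The helper `downscale_2x` is hypothetical.

```python
import numpy as np

def downscale_2x(frame):
    """Downscale a 2-D frame by averaging non-overlapping 2x2 blocks."""
    h, w = frame.shape
    h2, w2 = h // 2, w // 2
    # Crop to even dimensions, then group pixels into (h2, 2, w2, 2) blocks.
    blocks = frame[: 2 * h2, : 2 * w2].reshape(h2, 2, w2, 2)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
clean = np.full((256, 256), 0.5)                       # flat grey frame
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)  # assumed i.i.d. Gaussian noise

small = downscale_2x(noisy)
noise_full = (noisy - clean).std()    # noise level at full resolution
noise_small = (small - 0.5).std()     # noise level after downscaling
```

Averaging four independent samples halves the standard deviation, so `noise_small` comes out at roughly half of `noise_full` under these assumptions.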

Efficient Transformers have been developed for long sequence modeling because of their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full attention when important token correlations are multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of full-attention Transformers, we propose Diffuser, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, in addition to attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings compared to state-of-the-art benchmarks, which demonstrates the superior performance of Diffuser in both expressiveness and efficiency.

Since the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), many contact surveys have been conducted to measure changes in human interactions in the face of the pandemic and non-pharmaceutical interventions. These surveys were typically conducted longitudinally, using protocols that differ from those used in the pre-pandemic era. We present a model-based statistical approach that can reconstruct contact patterns at 1-year resolution even when the age of the contacts is reported coarsely in 5- or 10-year age bands. This innovation is rooted in population-level consistency constraints on how contacts between groups must add up, which prompts us to call the approach presented here the Bayesian rate consistency model. The model incorporates computationally efficient Hilbert Space Gaussian process priors to infer the dynamics in age- and gender-structured social contacts and is designed to adjust for reporting fatigue in longitudinal surveys. We demonstrate on simulations the ability to reconstruct contact patterns by gender and 1-year age interval from coarse data with adequate accuracy, within a fully Bayesian framework that quantifies uncertainty. We investigate the patterns of social contact data collected in Germany from April to June 2020 across five longitudinal survey waves. We reconstruct the fine age structure in social contacts during the early stages of the pandemic and demonstrate that social contacts rebounded in a structured, non-homogeneous manner. We also show that by July 2020, social contact intensities remained well below pre-pandemic values despite a considerable easing of non-pharmaceutical interventions. This model-based inference approach is open access, computationally tractable (enabling full Bayesian uncertainty quantification), and readily applicable to contemporary survey data as long as the exact age of survey participants is reported.

The $k$-Server Problem covers many resource allocation scenarios, and several variations have been studied extensively for decades. We present a model that generalizes the $k$-Server Problem with request preferences: the servers are not identical, and requests can express which specific servers should serve them. In our model, requests can either be answered by any server (general requests) or by a specific one (specific requests). If only general requests appear, the instance is one of the original $k$-Server Problem, and a lower bound of $k$ for the competitive ratio applies. If only specific requests appear, a solution with a competitive ratio of $1$ becomes trivial. We show that if both kinds of requests appear, the lower bound rises to $2k-1$. We study deterministic online algorithms and present two algorithms for uniform metrics. The first one has a competitive ratio that depends on the frequency of specific requests. It achieves a worst-case competitive ratio of $3k-2$, while it is optimal when only general requests appear or when specific requests dominate the input sequence. The second has a worst-case competitive ratio of $2k+14$. For the first algorithm, we show a lower bound of $3k-2$, while the second algorithm has a lower bound of $2k-1$ when only general requests appear. The two algorithms differ in only one behavioral rule, which significantly influences the competitive ratio. We show that, based on this rule, there is a trade-off between performing well against instances of the $k$-Server Problem and against mixed instances. Additionally, no deterministic online algorithm can be optimal for both kinds of instances simultaneously. Regarding non-uniform metrics, we present an adaptation of the Double Coverage algorithm for $2$ servers on the line achieving a competitive ratio of $6$, and an adaptation of the Work-Function-Algorithm achieving a competitive ratio of $4k$.

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
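Best-of-$n$ sampling, one of the two optimization methods studied, can be sketched in a few lines. This is a toy illustration under stated assumptions: candidates are plain numbers, and `gold_reward` and `proxy_reward` are hypothetical stand-ins for the gold-standard and proxy reward models, with the proxy deliberately biased.

```python
import numpy as np

def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest score under reward_fn."""
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

def gold_reward(x):
    # Hypothetical "true" preference: outputs closest to 0 are best.
    return -abs(x)

def proxy_reward(x):
    # Hypothetical imperfect proxy: its optimum is shifted to 0.5.
    return -abs(x - 0.5)

candidates = [-1.0, 0.0, 0.4, 2.0]   # stand-ins for n = 4 sampled outputs

picked_by_proxy = best_of_n(candidates, proxy_reward)  # selects 0.4
picked_by_gold = best_of_n(candidates, gold_reward)    # selects 0.0

# Gold score lost by optimizing against the proxy instead of the gold model.
gold_gap = gold_reward(picked_by_gold) - gold_reward(picked_by_proxy)
```

The gap between the two selections is a small-scale instance of the overoptimization effect the abstract describes: the harder the proxy is optimized, the more its errors can dominate the choice.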

We introduce a Fourier-based fast algorithm for Gaussian process regression. It approximates a translationally-invariant covariance kernel by complex exponentials on an equispaced Cartesian frequency grid of $M$ nodes. This results in a weight-space $M\times M$ system matrix with Toeplitz structure, which can thus be applied to a vector in ${\mathcal O}(M \log{M})$ operations via the fast Fourier transform (FFT), independent of the number of data points $N$. The linear system can be set up in ${\mathcal O}(N + M \log{M})$ operations using nonuniform FFTs. This enables efficient massive-scale regression via an iterative solver, even for kernels with fat-tailed spectral densities (large $M$). We include a rigorous error analysis of the kernel approximation, the resulting accuracy (relative to "exact" GP regression), and the condition number. Numerical experiments for squared-exponential and Matérn kernels in one, two and three dimensions often show 1-2 orders of magnitude acceleration over state-of-the-art rank-structured solvers at comparable accuracy. Our method allows 2D Matérn-$\frac{3}{2}$ regression from $N=10^9$ data points to be performed in 2 minutes on a standard desktop, with posterior mean accuracy $10^{-3}$. This opens up spatial statistics applications 100 times larger than previously possible.
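The ${\mathcal O}(M \log{M})$ cost of applying a Toeplitz matrix rests on a standard trick: embed the matrix in a larger circulant one, which the FFT diagonalizes. The sketch below shows the one-dimensional case only and is not the authors' implementation; the function name `toeplitz_matvec` and the test data are assumptions.

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Compute T @ x for the Toeplitz matrix T with first column c and first row r.

    T[i, j] = c[i - j] for i >= j and r[j - i] for i < j (so c[0] == r[0]).
    T is embedded in a circulant matrix of size 2M - 1, applied via the FFT.
    """
    M = len(x)
    # First column of the circulant embedding: c followed by r[M-1], ..., r[1].
    col = np.concatenate([c, r[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(M - 1)])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:M]   # complex dtype; imaginary part is round-off for real input

rng = np.random.default_rng(0)
M = 64
c = rng.standard_normal(M)
r = rng.standard_normal(M)
r[0] = c[0]                      # the diagonal entry must agree
x = rng.standard_normal(M)

fast = toeplitz_matvec(c, r, x)

# Dense O(M^2) reference built directly from the definition.
i, j = np.indices((M, M))
k = i - j
dense = np.where(k >= 0, c[np.abs(k)], r[np.abs(k)])
reference = dense @ x
```

The matrix is never formed in the fast path, so memory stays linear in $M$; this is what makes such a product usable inside an iterative solver.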

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.

To be deployed in production, a deep learning model needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression. It consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by letting the thin network imitate the immediate outputs of the wide network from layer to layer. In the last stage, we further fine-tune this well-initialized deep thin network. The theoretical guarantee is established by using mean field analysis, which shows the advantage of layerwise imitation over the traditional approach of training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. When trained with our method, ResNet50 can outperform ResNet101, and BERT_BASE can be comparable with BERT_LARGE, where ResNet101 and BERT_LARGE are trained via the standard training procedures in the literature.

Transfer learning aims to improve the performance of target learners on target domains by transferring the knowledge contained in different but related source domains. In this way, the dependence on large amounts of target-domain data for constructing target learners can be reduced. Owing to its wide application prospects, transfer learning has become a popular and promising area in machine learning. Although there are already some valuable and impressive surveys on transfer learning, these surveys introduce approaches in a relatively isolated way and do not cover the recent advances in transfer learning. Given the rapid expansion of the transfer learning area, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize existing transfer learning research, as well as to summarize and interpret the mechanisms and strategies in a comprehensive way, which may help readers better understand the current state of research and its ideas. Unlike previous surveys, this survey reviews over forty representative transfer learning approaches from the perspectives of data and model. The applications of transfer learning are also briefly introduced. To show the performance of different transfer learning models, twenty representative transfer learning models are used for experiments. The models are evaluated on three different datasets, i.e., Amazon Reviews, Reuters-21578, and Office-31. The experimental results demonstrate the importance of selecting appropriate transfer learning models for different applications in practice.
