
We consider minimizing a perturbed function $F(W) = \mathbb{E}_{U}[f(W + U)]$, given a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ and a random sample $U$ from a distribution $\mathcal{P}$ with mean zero. When $\mathcal{P}$ is the isotropic Gaussian, $F(W)$ is roughly equal to $f(W)$ plus a penalty on the trace of $\nabla^2 f(W)$, scaled by the variance of $\mathcal{P}$. This penalty on the Hessian can improve generalization, as shown by PAC-Bayes analysis, and is useful in low-sample regimes, for instance when a (large) pre-trained model is fine-tuned on a small data set. One way to minimize $F$ is to add $U$ to $W$ and then run SGD. We observe empirically that this noise injection does not provide significant gains over SGD in our fine-tuning experiments on three image classification data sets. We design a simple, practical algorithm that adds noise along both $U$ and $-U$, with the option of adding several perturbations and taking their average. We analyze the convergence of this algorithm, showing tight rates on the gradient norm of its output. We provide a comprehensive empirical analysis of our algorithm, first showing that in an over-parameterized matrix sensing problem it can find solutions with lower test loss than naive noise injection. We then compare our algorithm with four sharpness-reducing training methods (such as Sharpness-Aware Minimization (Foret et al., 2021)). We find that our algorithm can outperform them by up to 1.8% test accuracy for fine-tuning ResNet on six image classification data sets. It leads to a 17.7% (and 12.8%) reduction in the trace (and largest eigenvalue) of the Hessian of the loss surface. This form of Hessian regularization is compatible with $\ell_2$ weight decay (and data augmentation), in the sense that combining them can lead to improved empirical performance.
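
A minimal NumPy sketch of the two-sided perturbation step described above; the quadratic toy loss, step size, noise scale, and number of perturbations are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def two_sided_noisy_grad(grad_f, w, sigma=0.05, k=2, rng=None):
    """Estimate the gradient of F(w) = E_U[f(w + U)] by averaging
    gradients at the symmetric perturbations w + U and w - U.
    grad_f: callable returning the gradient of f at a point."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(w)
    for _ in range(k):                            # k independent perturbations
        u = sigma * rng.standard_normal(w.shape)  # U ~ N(0, sigma^2 I)
        g += grad_f(w + u) + grad_f(w - u)        # antithetic pair U and -U
    return g / (2 * k)

# Toy usage: f(w) = ||w||^2 / 2, so grad_f(w) = w; run SGD on the smoothed F.
w = np.ones(4)
for _ in range(100):
    w -= 0.1 * two_sided_noisy_grad(lambda v: v, w)
print(w)  # approaches the minimizer 0
```

For a quadratic the antithetic pair cancels the noise exactly; in general it removes the first-order noise term, which is the motivation for perturbing along both $U$ and $-U$.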

Related content

We present X-MDPT ($\underline{Cross}$-view $\underline{M}$asked $\underline{D}$iffusion $\underline{P}$rediction $\underline{T}$ransformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly used U-Net structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior U-Net latent diffusion approach (FID 8.07) with $11\times$ fewer parameters. Our best model surpasses the pixel-based diffusion model with $\frac{2}{3}$ of the parameters and achieves $5.43\times$ faster inference. The code is available at //github.com/trungpx/xmdpt.
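
A hypothetical PyTorch skeleton of how the three named modules could be wired together; all dimensions, layer choices, and the additive conditioning scheme are our own guesses for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class XMDPTSketch(nn.Module):
    def __init__(self, dim=256, n_heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, depth)  # 1) denoising diffusion Transformer
        self.aggregate = nn.Linear(2 * dim, dim)             # 2) consolidates conditions into one vector
        self.mask_pred = nn.Linear(dim, dim)                 # 3) mask cross-prediction head

    def forward(self, noisy_latents, pose_emb, ref_emb):
        cond = self.aggregate(torch.cat([pose_emb, ref_emb], dim=-1))
        x = self.denoiser(noisy_latents + cond.unsqueeze(1))  # condition via an additive bias on patches
        return x, self.mask_pred(x)  # denoised latent patches + masked-patch predictions

model = XMDPTSketch()
z = torch.randn(1, 64, 256)   # 64 latent patches of width 256 (illustrative)
pose, ref = torch.randn(1, 256), torch.randn(1, 256)
out, mask_out = model(z, pose, ref)
```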

A Boolean function $F(x,y)$ for $x,y \in \{0,1\}^n$ is an XOR function if $F(x,y)=f(x\oplus y)$ for some function $f$ on $n$ input bits, where $\oplus$ is the bit-wise XOR. XOR functions are relevant in communication complexity, in part because they admit Fourier-analytic techniques. For total XOR functions it is known that the deterministic communication complexity of $F$ is closely related to the parity decision tree complexity of $f$. Montanaro and Osborne (2009) observed that the one-sided communication complexity $D_{cc}^{\rightarrow}(F)$ of $F$ is exactly equal to the nonadaptive parity decision tree complexity $NADT^{\oplus}(f)$ of $f$. Hatami et al. (2018) showed that the unrestricted communication complexity of $F$ is polynomially related to the parity decision tree complexity of $f$. We initiate the study of a similar connection for partial functions. We show that in the case of one-sided communication complexity, whether these measures are equal depends on the number of undefined inputs of $f$. On the one hand, if $D_{cc}^{\rightarrow}(F)=t$ and $f$ is undefined on at most $O(\frac{2^{n-t}}{\sqrt{n-t}})$ inputs, then $NADT^{\oplus}(f)=t$. On the other hand, for a wide range of values of $D_{cc}^{\rightarrow}(F)$ and $NADT^{\oplus}(f)$ (from constant to $n-2$) we provide partial functions for which $D_{cc}^{\rightarrow}(F) < NADT^{\oplus}(f)$. In particular, we provide a function with an exponential gap between the two measures. Our separation results translate to the case of two-sided communication complexity as well, in particular showing that the result of Hatami et al. (2018) cannot be generalized to partial functions. Previous results for total functions heavily rely on Boolean Fourier analysis, and the technique does not translate to partial functions. For the proofs of our results we instead build a linear algebraic framework. Separation results are proved through a reduction to covering codes.
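
To make the definitions concrete, here is a toy example of our own (not from the paper): take $f$ to be the parity of the first two input bits, so $F(x,y)=f(x\oplus y)$ is computable from the single nonadaptive parity query $\langle s, x\oplus y\rangle$ with $s = 110\cdots0$, giving $NADT^{\oplus}(f)=1$:

```python
from itertools import product

n = 3
f = lambda z: z[0] ^ z[1]                           # f = parity of the first two bits
F = lambda x, y: f([a ^ b for a, b in zip(x, y)])   # F(x, y) = f(x XOR y)

# One nonadaptive parity query with s = (1, 1, 0) suffices to evaluate f:
s = (1, 1, 0)
query = lambda z: sum(si & zi for si, zi in zip(s, z)) % 2  # <s, z> mod 2

assert all(f(z) == query(z) for z in product([0, 1], repeat=n))
assert all(F(x, y) == query([a ^ b for a, b in zip(x, y)])
           for x in product([0, 1], repeat=n)
           for y in product([0, 1], repeat=n))
```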

This paper presents an $O^{*}(1.42^{n})$ time algorithm for the Maximum Cut problem on split graphs, along with a subexponential time algorithm for its decision variant.

We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably covers most "interesting" scenarios.
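
As a concrete reading of the regret quantity being bounded (this sketch uses plain online gradient descent with a tuned step size as a stand-in; it is not the paper's parameter-free algorithm, and the losses are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
w_star = rng.standard_normal(d)   # the comparison point
w = np.zeros(d)
regret = 0.0
for t in range(1, T + 1):
    a = rng.standard_normal(d)                  # loss_t(v) = |a.v - a.w_star|, Lipschitz in v
    loss = lambda v: abs(a @ v - a @ w_star)
    regret += loss(w) - loss(w_star)            # loss(w_star) = 0 by construction
    g = a * np.sign(a @ w - a @ w_star)         # a subgradient of loss_t at w
    w -= g / np.sqrt(t)                         # OGD step (requires tuning the
                                                # parameter-free method avoids)
# Compare against the ||w_star|| sqrt(T) scale (here G is roughly ||a_t|| ~ sqrt(d)).
print(regret, np.linalg.norm(w_star) * np.sqrt(T))
```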

In Path Set Packing, the input is an undirected graph $G$, a collection $\mathcal{P}$ of simple paths in $G$, and a positive integer $k$. The problem is to decide whether there exist $k$ edge-disjoint paths in $\mathcal{P}$. We study the parameterized complexity of Path Set Packing with respect to both natural and structural parameters. We show that the problem is $W[1]$-hard with respect to vertex cover number, and $W[1]$-hard with respect to pathwidth plus maximum degree plus solution size. These results answer an open question raised at COCOON 2018. On the positive side, we present an FPT algorithm parameterized by feedback vertex number plus maximum degree, and an FPT algorithm parameterized by treewidth plus maximum degree plus the maximum length of a path in $\mathcal{P}$. These positive results complement the hardness of Path Set Packing with respect to any subset of the parameters used in the FPT algorithms. We also give a $4$-approximation algorithm for the maximum Path Set Packing problem which runs in FPT time when parameterized by feedback edge number.
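
A direct, exponential-time check of the problem definition, useful only as a reference implementation; the graph and paths below are toy data we made up:

```python
from itertools import combinations

def path_edges(path):
    """Edge set of a simple path given as a vertex sequence."""
    return {frozenset(e) for e in zip(path, path[1:])}

def path_set_packing(paths, k):
    """Decide whether `paths` contains k pairwise edge-disjoint paths
    (brute force over all k-subsets; exponential, for reference only)."""
    edge_sets = [path_edges(p) for p in paths]
    for combo in combinations(edge_sets, k):
        if len(set().union(*combo)) == sum(len(es) for es in combo):
            return True   # the union has full size, so no edge is shared
    return False

# Toy instance on vertices {1, 2, 3, 4}.
P = [(1, 2, 3), (3, 4), (2, 3, 4)]
print(path_set_packing(P, 2))  # True: (1,2,3) and (3,4) are edge-disjoint
print(path_set_packing(P, 3))  # False: (2,3,4) shares an edge with each other path
```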

In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$ where $\Psi_p$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m} $ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. Our result holds whether the problem is realizable or not and we refer to this as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
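
For intuition about the $\Psi_p$ norm (a standard moment-growth fact, not a result of the paper): a Gaussian random variable has finite $\Psi_2$ norm, which is why membership in $L_{\Psi_p}$ is a sub-Gaussian-type light-tail condition on $\mathscr{F}$.

```latex
% For g ~ N(0,1), the moments satisfy
%   \|g\|_{L^m} = (\mathbb{E}|g|^m)^{1/m} \le C \sqrt{m}   for all m \ge 1,
% for an absolute constant C, hence
\[
  \|g\|_{\Psi_2} \;=\; \sup_{m \ge 1} m^{-1/2}\,\|g\|_{L^m} \;\le\; C \;<\; \infty .
\]
```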

We study reinforcement learning with multinomial logistic (MNL) function approximation, where the underlying transition probability kernel of the Markov decision processes (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite-horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration that enjoy frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency and establish that $\texttt{RRL-MNL}$ is both statistically and computationally efficient, achieving a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $\kappa^{-1}$, which can be large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve both computational and statistical efficiency. Numerical experiments demonstrate the superior performance of the proposed algorithms.
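
A minimal NumPy sketch of an MNL transition kernel of the kind described: the next-state distribution is a softmax of linear scores in the unknown transition core $\theta$. The feature values and dimensions are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_states = 4, 5
theta = rng.standard_normal(d)             # the unknown transition core
phi = rng.standard_normal((n_states, d))   # features phi(s, a, s') for one fixed (s, a)

def mnl_transition_probs(phi, theta):
    """P(s' | s, a) proportional to exp(phi(s, a, s') . theta) (multinomial logistic)."""
    logits = phi @ theta
    logits -= logits.max()                 # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

p = mnl_transition_probs(phi, theta)
print(p, p.sum())                          # a valid next-state distribution (sums to 1)
```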

We present two randomised approximate counting algorithms with $\widetilde{O}(n^{2-c}/\varepsilon^2)$ running time for some constant $c>0$ and accuracy $\varepsilon$: (1) for the hard-core model with fugacity $\lambda$ on graphs with maximum degree $\Delta$ when $\lambda=O(\Delta^{-1.5-c_1})$ where $c_1=c/(2-2c)$; (2) for spin systems with strong spatial mixing (SSM) on planar graphs with quadratic growth, such as $\mathbb{Z}^2$. For the hard-core model, Weitz's algorithm (STOC, 2006) achieves sub-quadratic running time when correlation decays faster than the neighbourhood growth, namely when $\lambda = o(\Delta^{-2})$. Our first algorithm does not require this property and extends the range where sub-quadratic algorithms exist. Our second algorithm appears to be the first to achieve sub-quadratic running time up to the SSM threshold, albeit on a restricted family of graphs. It also extends to (not necessarily planar) graphs with polynomial growth, such as $\mathbb{Z}^d$, but with a running time of the form $\widetilde{O}\left(n^2\varepsilon^{-2}/2^{c(\log n)^{1/d}}\right)$ where $d$ is the exponent of the polynomial growth and $c>0$ is some constant.
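
For reference, the quantity being approximately counted in the hard-core model is the partition function $Z(\lambda)=\sum_{I\text{ independent}}\lambda^{|I|}$. A brute-force evaluation on a toy graph (exponential time, unlike the sub-quadratic algorithms above):

```python
from itertools import combinations

def hardcore_partition(n, edges, lam):
    """Z(lam) = sum over independent sets I of lam^|I| (brute force)."""
    E = {frozenset(e) for e in edges}
    Z = 0.0
    for r in range(n + 1):
        for S in combinations(range(n), r):
            if all(frozenset(p) not in E for p in combinations(S, 2)):
                Z += lam ** r    # S is independent: no edge inside S
    return Z

# 4-cycle: independent sets are {}, four singletons, and two opposite pairs.
print(hardcore_partition(4, [(0, 1), (1, 2), (2, 3), (3, 0)], 1.0))  # 1 + 4 + 2 = 7
```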

Given a graph $G$ and sets $\{\alpha_v~|~v \in V(G)\}$ and $\{\beta_v~|~v \in V(G)\}$ of non-negative integers, it is known that the problem of deciding whether $G$ contains a spanning tree $T$ such that $\alpha_v \le d_T (v) \le \beta_v$ for all $v \in V(G)$ is $NP$-complete. In this article, we relax the problem by demanding that the degree restrictions apply only to vertices $v\in U$, where $U$ is a stable set of $G$. In this case, the problem becomes tractable. A. Frank presented a result characterizing the positive instances of the relaxed problem. Using the matroid intersection framework developed by J. Edmonds, we give a new and short proof of Frank's result and show that if $U$ is stable and the edges of $G$ are weighted by arbitrary real numbers, then even a minimum-cost tree $T$ with $\alpha_v \le d_T (v) \le \beta_v$ for all $v \in U$ can be found in polynomial time, if such a tree exists.
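
A small checker for the relaxed feasibility condition (this is only the verification step, not Frank's characterization or the matroid-intersection algorithm): given a spanning tree $T$, it tests $\alpha_v \le d_T(v) \le \beta_v$ for the vertices in the stable set $U$ only.

```python
def degrees(tree_edges):
    """Degree of each vertex in a tree given as an edge list."""
    d = {}
    for u, v in tree_edges:
        d[u] = d.get(u, 0) + 1
        d[v] = d.get(v, 0) + 1
    return d

def respects_bounds_on_U(tree_edges, U, alpha, beta):
    """Check alpha[v] <= d_T(v) <= beta[v] for v in the stable set U only."""
    d = degrees(tree_edges)
    return all(alpha[v] <= d.get(v, 0) <= beta[v] for v in U)

# Star tree on {0, 1, 2, 3} centred at 0; U = {1, 2, 3} is stable (the leaves).
T = [(0, 1), (0, 2), (0, 3)]
print(respects_bounds_on_U(T, {1, 2, 3},
                           alpha={1: 1, 2: 1, 3: 1},
                           beta={1: 1, 2: 1, 3: 1}))  # True: each leaf has degree 1
```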

We prove bounds on the variance of a function $f$ under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing $N$ particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.
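
A minimal sketch of the SMC loop described above (draw $N$ particles, then re-weight, re-sample, and apply a Markov chain at each step). The tempered Gaussian targets, multinomial resampling, and random-walk Metropolis smoothing kernel are our illustrative choices, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_steps = 500, 10
target_logpdf = lambda x, beta: -beta * x**2 / 2   # tempered target N(0, 1/beta)

x = 3.0 * rng.standard_normal(N)                   # N particles from a broad initial law
betas = np.linspace(0.1, 1.0, n_steps + 1)         # the sequence of distributions
for b0, b1 in zip(betas, betas[1:]):
    logw = target_logpdf(x, b1) - target_logpdf(x, b0)   # re-weight
    w = np.exp(logw - logw.max())
    x = rng.choice(x, size=N, p=w / w.sum())             # re-sample
    for _ in range(5):                                   # Markov chain smoothing (Metropolis)
        prop = x + 0.5 * rng.standard_normal(N)
        accept = np.log(rng.random(N)) < target_logpdf(prop, b1) - target_logpdf(x, b1)
        x = np.where(accept, prop, x)

# Empirical mean and variance of f(x) = x under the particle measure:
print(x.mean(), x.var())   # should approach 0 and 1 for the final N(0, 1) target
```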
