亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

The Jeffreys divergence is a renown symmetrization of the oriented Kullback-Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed-form, various techniques with pros and cons have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with arbitrary number of components. Our heuristic relies on converting the mixtures into pairs of dually parameterized probability densities belonging to an exponential family. In particular, we consider the versatile polynomial exponential family densities, and design a divergence to measure in closed-form the goodness of fit between a Gaussian mixture and its polynomial exponential density approximation. This goodness-of-fit divergence is a generalization of the Hyv\"arinen divergence used to estimate models with computationally intractable normalizers. It allows us to perform model selection by choosing the orders of the polynomial exponential densities used to approximate the mixtures. We demonstrate experimentally that our heuristic to approximate the Jeffreys divergence improves by several orders of magnitude the computational time of stochastic Monte Carlo estimations while approximating reasonably well the Jeffreys divergence, specially when the mixtures have a very small number of modes. Besides, our mixture-to-exponential family conversion techniques may prove useful in other settings.

相關內容

In this paper, we introduce a novel family of iterative algorithms which carry out $\alpha$-divergence minimisation in a Variational Inference context. They do so by ensuring a systematic decrease at each step in the $\alpha$-divergence between the variational and the posterior distributions. In its most general form, the variational distribution is a mixture model and our framework allows us to simultaneously optimise the weights and components parameters of this mixture model. Notably, our approach permits to build on various methods previously proposed for $\alpha$-divergence minimisation such as Gradient or Power Descent schemes and we also shed a new light on an integrated Expectation Maximization algorithm. Lastly, we provide empirical evidence that our methodology yields improved results on several multimodal target distributions.

Subclassification and matching are often used in empirical studies to adjust for observed covariates; however, they are largely restricted to relatively simple study designs with a binary treatment and less developed for designs with a continuous exposure. Matching with exposure doses is particularly useful in instrumental variable designs and in understanding the dose-response relationships. In this article, we propose two criteria for optimal subclassification based on subclass homogeneity in the context of having a continuous exposure dose, and propose an efficient polynomial-time algorithm that is guaranteed to find an optimal subclassification with respect to one criterion and serves as a 2-approximation algorithm for the other criterion. We discuss how to incorporate dose and use appropriate penalties to control the number of subclasses in the design. Via extensive simulations, we systematically compare our proposed design to optimal non-bipartite pair matching, and demonstrate that combining our proposed subclassification scheme with regression adjustment helps reduce model dependence for parametric causal inference with a continuous dose. We apply the new design and associated randomization-based inferential procedure to study the effect of transesophageal echocardiography (TEE) monitoring during coronary artery bypass graft (CABG) surgery on patients' post-surgery clinical outcomes using Medicare and Medicaid claims data, and find evidence that TEE monitoring lowers patients' all-cause $30$-day mortality rate.

We consider a network of agents. Associated with each agent are her covariate and outcome. Agents influence each other's outcomes according to a certain connection/influence structure. A subset of the agents participate on a platform, and hence, are observable to it. The rest are not observable to the platform and are called the latent agents. The platform does not know the influence structure of the observable or the latent parts of the network. It only observes the data on past covariates and decisions of the observable agents. Observable agents influence each other both directly and indirectly through the influence they exert on the latent agents. We investigate how the platform can estimate the dependence of the observable agents' outcomes on their covariates, taking the latent agents into account. First, we show that this relationship can be succinctly captured by a matrix and provide an algorithm for estimating it under a suitable approximate sparsity condition using historical data of covariates and outcomes for the observable agents. We also obtain convergence rates for the proposed estimator despite the high dimensionality that allows more agents than observations. Second, we show that the approximate sparsity condition holds under the standard conditions used in the literature. Hence, our results apply to a large class of networks. Finally, we apply our results to two practical settings: targeted advertising and promotional pricing. We show that by using the available historical data with our estimator, it is possible to obtain asymptotically optimal advertising/pricing decisions, despite the presence of latent agents.

Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using constant stepsize (i.e., $\alpha_k\equiv \alpha$), the algorithm achieves exponential fast convergence to a neighborhood (with radius $O(\alpha\log(1/\alpha))$) around the desired limit point. When using diminishing stepsizes with appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use it to derive the finite-sample bounds of the popular $Q$-learning with linear function approximation algorithm, under a condition on the behavior policy. Importantly, we do not need to make the assumption that the samples are i.i.d., and do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.

Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.

We study the problem of estimating the diagonal of an implicitly given matrix $A$. For such a matrix we have access to an oracle that allows us to evaluate the matrix vector product $Av$. For random variable $v$ drawn from an appropriate distribution, this may be used to return an estimate of the diagonal of the matrix $A$. Whilst results exist for probabilistic guarantees relating to the error of estimates of the trace of $A$, no such results have yet been derived for the diagonal. We analyse the number of queries $s$ required to guarantee that with probability at least $1-\delta$ the estimates of the relative error of the diagonal entries is at most $\varepsilon$. We extend this analysis to the 2-norm of the difference between the estimate and the diagonal of $A$. We prove, discuss and experiment with bounds on the number of queries $s$ required to guarantee a probabilistic bound on the estimates of the diagonal by employing Rademacher and Gaussian random variables. Two sufficient upper bounds on the minimum number of query vectors are proved, extending the work of Avron and Toledo [JACM 58(2)8, 2011], and later work of Roosta-Khorasani and Ascher [FoCM 15, 1187-1212, 2015]. We find that, generally, there is little difference between the two, with convergence going as $O(\log(1/\delta)/\varepsilon^2)$ for individual diagonal elements. However for small $s$, we find that the Rademacher estimator is superior. These results allow us to then extend the ideas of Meyer, Musco, Musco and Woodruff [SOSA, 142-155, 2021], suggesting algorithm Diag++, to speed up the convergence of diagonal estimation from $O(1/\varepsilon^2)$ to $O(1/\varepsilon)$ and make it robust to the spectrum of any positive semi-definite matrix $A$.

Many representative graph neural networks, $e.g.$, GPR-GNN and ChebyNet, approximate graph convolutions with graph spectral filters. However, existing work either applies predefined filter weights or learns them without necessary constraints, which may lead to oversimplified or ill-posed filters. To overcome these issues, we propose $\textit{BernNet}$, a novel graph neural network with theoretical support that provides a simple but effective scheme for designing and learning arbitrary graph spectral filters. In particular, for any filter over the normalized Laplacian spectrum of a graph, our BernNet estimates it by an order-$K$ Bernstein polynomial approximation and designs its spectral property by setting the coefficients of the Bernstein basis. Moreover, we can learn the coefficients (and the corresponding filter weights) based on observed graphs and their associated signals and thus achieve the BernNet specialized for the data. Our experiments demonstrate that BernNet can learn arbitrary spectral filters, including complicated band-rejection and comb filters, and it achieves superior performance in real-world graph modeling tasks.

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$ which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.

We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component, we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.

北京阿比特科技有限公司