亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

The notion of omnipredictors (Gopalan, Kalai, Reingold, Sharan and Wieder ITCS 2021), suggested a new paradigm for loss minimization. Rather than learning a predictor based on a known loss function, omnipredictors can easily be post-processed to minimize any one of a rich family of loss functions compared with the loss of a class $C$. It has been shown that such omnipredictors exist and are implied (for all convex and Lipschitz loss functions) by the notion of multicalibration from the algorithmic fairness literature. Nevertheless, it is often the case that the action selected must obey some additional constraints (such as capacity or parity constraints). In itself, the original notion of omnipredictors does not apply in this well-motivated and heavily studied the context of constrained loss minimization. In this paper, we introduce omnipredictors for constrained optimization and study their complexity and implications. The notion that we introduce allows the learner to be unaware of the loss function that will be later assigned as well as the constraints that will be later imposed, as long as the subpopulations that are used to define these constraints are known. The paper shows how to obtain omnipredictors for constrained optimization problems, relying on appropriate variants of multicalibration. For some interesting constraints and general loss functions and for general constraints and some interesting loss functions, we show how omnipredictors are implied by a variant of multicalibration that is similar in complexity to standard multicalibration. We demonstrate that in the general case, standard multicalibration is insufficient and show that omnipredictors are implied by multicalibration with respect to a class containing all the level sets of hypotheses in $C$. We also investigate the implications when the constraints are group fairness notions.

相關內容

In NMT we search for the mode of the model distribution to form predictions. The mode and other high-probability translations found by beam search have been shown to often be inadequate in a number of ways. This prevents improving translation quality through better search, as these idiosyncratic translations end up selected by the decoding algorithm, a problem known as the beam search curse. Recently, an approximation to minimum Bayes risk (MBR) decoding has been proposed as an alternative decision rule that would likely not suffer from the same problems. We analyse this approximation and establish that it has no equivalent to the beam search curse. We then design approximations that decouple the cost of exploration from the cost of robust estimation of expected utility. This allows for much larger hypothesis spaces, which we show to be beneficial. We also show that mode-seeking strategies can aid in constructing compact sets of promising hypotheses and that MBR is effective in identifying good translations in them. We conduct experiments on three language pairs varying in amounts of resources available: English into and from German, Romanian, and Nepali.

Real-world optimization problems may have a different underlying structure. In black-box optimization, the dependencies between decision variables remain unknown. However, some techniques can discover such interactions accurately. In Large Scale Global Optimization (LSGO), problems are high-dimensional. It was shown effective to decompose LSGO problems into subproblems and optimize them separately. The effectiveness of such approaches may be highly dependent on the accuracy of problem decomposition. Many state-of-the-art decomposition strategies are derived from Differential Grouping (DG). However, if a given problem consists of non-additively separable subproblems, DG-based strategies may discover many non-existing interactions. On the other hand, monotonicity checking strategies proposed so far do not report non-existing interactions for any separable subproblems but may miss discovering many of the existing ones. Therefore, we propose Incremental Recursive Ranking Grouping (IRRG) that suffers from none of these flaws. IRRG consumes more fitness function evaluations than the recent DG-based propositions, e.g., Recursive DG 3 (RDG3). Nevertheless, the effectiveness of the considered Cooperative Co-evolution frameworks after embedding IRRG or RDG3 was similar for problems with additively separable subproblems that are suitable for RDG3. After replacing the additive separability with non-additive, embedding IRRG leads to results of significantly higher quality.

In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck and gradient compression is widely used to reduce the number of bits sent within each communication round of iterative methods. There are two classes of compression operators and separate algorithms making use of them. In the case of unbiased random compressors with bounded variance (e.g., rand-k), the DIANA algorithm of Mishchenko et al. (2019), which implements a variance reduction technique for handling the variance introduced by compression, is the current state of the art. In the case of biased and contractive compressors (e.g., top-k), the EF21 algorithm of Richt\'arik et al. (2021), which instead implements an error-feedback mechanism, is the current state of the art. These two classes of compression schemes and algorithms are distinct, with different analyses and proof techniques. In this paper, we unify them into a single framework and propose a new algorithm, recovering DIANA and EF21 as particular cases. Our general approach works with a new, larger class of compressors, which has two parameters, the bias and the variance, and includes unbiased and biased compressors as particular cases. This allows us to inherit the best of the two worlds: like EF21 and unlike DIANA, biased compressors, like top-k, whose good performance in practice is recognized, can be used. And like DIANA and unlike EF21, independent randomness at the compressors allows to mitigate the effects of compression, with the convergence rate improving when the number of parallel workers is large. This is the first time that an algorithm with all these features is proposed. We prove its linear convergence under certain conditions. Our approach takes a step towards better understanding of two so-far distinct worlds of communication-efficient distributed learning.

Explaining to users why some items are recommended is critical, as it can help users to make better decisions, increase their satisfaction, and gain their trust in recommender systems (RS). However, existing explainable RS usually consider explanation as a side output of the recommendation model, which has two problems: (1) it is difficult to evaluate the produced explanations because they are usually model-dependent, and (2) as a result, how the explanations impact the recommendation performance is less investigated. In this paper, explaining recommendations is formulated as a ranking task, and learned from data, similar to item ranking for recommendation. This makes it possible for standard evaluation of explanations via ranking metrics (e.g., NDCG). Furthermore, this paper extends traditional item ranking to an item-explanation joint-ranking formalization to study if purposely selecting explanations could reach certain learning goals, e.g., improving recommendation performance. A great challenge, however, is that the sparsity issue in the user-item-explanation data would be inevitably severer than that in traditional user-item interaction data, since not every user-item pair can be associated with all explanations. To mitigate this issue, this paper proposes to perform two sets of matrix factorization by considering the ternary relationship as two groups of binary relationships. Experiments on three large datasets verify the solution's effectiveness on both explanation ranking and item recommendation.

Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. In this paper, we study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this work compared to doing linear regression on the full data? Here we study the performance loss in estimation, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss in one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase a lot, while prediction error increases much less. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.

Understanding the impact of the most effective policies or treatments on a response variable of interest is desirable in many empirical works in economics, statistics and other disciplines. Due to the widespread winner's curse phenomenon, conventional statistical inference assuming that the top policies are chosen independent of the random sample may lead to overly optimistic evaluations of the best policies. In recent years, given the increased availability of large datasets, such an issue can be further complicated when researchers include many covariates to estimate the policy or treatment effects in an attempt to control for potential confounders. In this manuscript, to simultaneously address the above-mentioned issues, we propose a resampling-based procedure that not only lifts the winner's curse in evaluating the best policies observed in a random sample, but also is robust to the presence of many covariates. The proposed inference procedure yields accurate point estimates and valid frequentist confidence intervals that achieve the exact nominal level as the sample size goes to infinity for multiple best policy effect sizes. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and two empirical studies, evaluating the most effective policies in charitable giving and the most beneficial group of workers in the National Supported Work program.

Bayesian optimization (BO) is a popular approach for sample-efficient optimization of black-box objective functions. While BO has been successfully applied to a wide range of scientific applications, traditional approaches to single-objective BO only seek to find a single best solution. This can be a significant limitation in situations where solutions may later turn out to be intractable. For example, a designed molecule may turn out to violate constraints that can only be reasonably evaluated after the optimization process has concluded. To address this issue, we propose Rank-Ordered Bayesian Optimization with Trust-regions (ROBOT) which aims to find a portfolio of high-performing solutions that are diverse according to a user-specified diversity metric. We evaluate ROBOT on several real-world applications and show that it can discover large sets of high-performing diverse solutions while requiring few additional function evaluations compared to finding a single best solution.

We study the rank of sub-matrices arising out of kernel functions, $F(\pmb{x},\pmb{y}): \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$, where $\pmb{x},\pmb{y} \in \mathbb{R}^d$ with $F(\pmb{x},\pmb{y})$ is smooth everywhere except along the line $\pmb{x}=\pmb{y}$. Such kernel functions are frequently encountered in a wide range of applications such as $N$ body problems, Green's functions, integral equations, geostatistics, kriging, Gaussian processes, etc. One of the challenges in dealing with these kernel functions is that the corresponding matrix associated with these kernels is large and dense and thereby, the computational cost of matrix operations is high. In this article, we prove new theorems bounding the numerical rank of sub-matrices arising out of these kernel functions. Under reasonably mild assumptions, we prove that the rank of certain sub-matrices is rank-deficient in finite precision. This rank depends on the dimension of the ambient space and also on the type of interaction between the hyper-cubes containing the corresponding set of particles. This rank structure can be leveraged to reduce the computational cost of certain matrix operations such as matrix-vector products, solving linear systems, etc. We also present numerical results on the growth of rank of certain sub-matrices in $1$D, $2$D, $3$D and $4$D, which, not surprisingly, agrees with the theoretical results.

In solving multi-modal, multi-objective optimization problems (MMOPs), the objective is not only to find a good representation of the Pareto-optimal front (PF) in the objective space but also to find all equivalent Pareto-optimal subsets (PSS) in the variable space. Such problems are practically relevant when a decision maker (DM) is interested in identifying alternative designs with similar performance. There has been significant research interest in recent years to develop efficient algorithms to deal with MMOPs. However, the existing algorithms still require prohibitive number of function evaluations (often in several thousands) to deal with problems involving as low as two objectives and two variables. The algorithms are typically embedded with sophisticated, customized mechanisms that require additional parameters to manage the diversity and convergence in the variable and the objective spaces. In this letter, we introduce a steady-state evolutionary algorithm for solving MMOPs, with a simple design and no additional userdefined parameters that need tuning compared to a standard EA. We report its performance on 21 MMOPs from various test suites that are widely used for benchmarking using a low computational budget of 1000 function evaluations. The performance of the proposed algorithm is compared with six state-of-the-art algorithms (MO Ring PSO SCD, DN-NSGAII, TriMOEA-TA&R, CPDEA, MMOEA/DC and MMEA-WI). The proposed algorithm exhibits significantly better performance than the above algorithms based on the established metrics including IGDX, PSP and IGD. We hope this study would encourage design of simple, efficient and generalized algorithms to improve its uptake for practical applications.

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.

北京阿比特科技有限公司