
For relational structures A, B of the same signature, the Promise Constraint Satisfaction Problem PCSP(A, B) asks whether a given input structure maps homomorphically to A or does not even map to B. We are promised that the input satisfies exactly one of these two cases. If there exists a structure C with homomorphisms $A\to C\to B$, then PCSP(A, B) reduces naturally to CSP(C). To the best of our knowledge, all known tractable PCSPs reduce to tractable CSPs in this way. However, Barto showed that some PCSPs over finite structures A, B require solving CSPs over infinite C. We show that even when such a reduction to a finite C is possible, this structure may become arbitrarily large. For every integer $n>1$ and every prime $p$ we give A, B of size $n$ with a single relation of arity $n^p$ such that PCSP(A, B) reduces via a chain of homomorphisms $A\to C\to B$ to a tractable CSP over some C of size $p$ but not over any smaller structure. In a second family of examples, for every prime $p\geq 7$ we construct A, B of size $p-1$ with a single ternary relation such that PCSP(A, B) reduces via $A\to C\to B$ to a tractable CSP over some C of size $p$ but not over any smaller structure. In contrast, we show that if A, B are graphs and PCSP(A, B) reduces to a tractable CSP(C) for some finite digraph C, then already A or B has a tractable CSP. This extends results of, and answers a question of, Deng et al.
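
To make the sandwich reduction explicit (a standard observation, restated here in our own notation rather than quoted from the paper): if $A\to C\to B$, then for every input structure X,
\[
X \to A \;\Longrightarrow\; X \to C
\qquad\text{and}\qquad
X \not\to B \;\Longrightarrow\; X \not\to C,
\]
so under the promise the answer to CSP(C) on X coincides with the answer to PCSP(A, B), and any polynomial-time algorithm for CSP(C) solves PCSP(A, B).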

Related content

Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error of the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather, it estimates the average prediction error of models fit on other, unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, we show that the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies of the folds, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.
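
As a point of reference, here is a minimal numpy sketch of plain K-fold cross-validation for OLS together with the naive standard error computed from pooled per-observation errors; it is this naive variance estimate that the correlations described above render too small. This is not the nested scheme proposed in the paper, and the function name and fold count are our own illustrative choices.

import numpy as np

def kfold_cv_mse(X, y, k=10, seed=0):
    # Naive K-fold cross-validation for OLS: returns the point estimate of
    # prediction error and the usual standard error computed from the pooled
    # per-observation squared errors (too small, per the abstract above).
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = np.empty(n)
    for test in folds:
        train = np.setdiff1d(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs[test] = (y[test] - X[test] @ beta) ** 2
    est = errs.mean()
    naive_se = errs.std(ddof=1) / np.sqrt(n)   # ignores between-fold correlation
    return est, naive_se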

To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality reduction and clustering, and its performance is quantified in closed form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA-sequencing data, and demonstrate how they overcome the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.
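
For contrast, a minimal sketch of the two-stage baseline that HMoGs are designed to subsume (linear dimensionality reduction followed by Gaussian-mixture clustering, fit separately); this is not the authors' HMoG, and the scikit-learn components and parameter values are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def two_stage_cluster(X, n_components=10, n_clusters=5, seed=0):
    # Stage 1: project the data to a low-dimensional space.
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    # Stage 2: cluster the projected data. The two stages are fit separately,
    # so there is no single likelihood scoring the complete pipeline -- the
    # limitation that the hierarchical model above is meant to remove.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed).fit(Z)
    return gmm.predict(Z), gmm.score(Z)  # labels and avg log-likelihood in the reduced space only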

Capacities on a finite set are set functions that vanish on the empty set and are monotone with respect to inclusion. Since the set of capacities is an order polytope, randomly generating capacities amounts to randomly generating linear extensions of the Boolean lattice. This problem is known to be intractable as soon as $n>5$; therefore, approximate methods have been proposed, most notably one based on Markov chains. Although quite accurate, this method is time-consuming. In this paper, we propose the 2-layer approximation method, which generates a subset of linear extensions, eliminating those with very low probability. We show that our method has performance similar to the Markov chain method while being much less time-consuming.
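
For concreteness, a minimal sketch of the Markov-chain baseline mentioned above (adjacent-transposition moves over linear extensions of the Boolean lattice), not the proposed 2-layer approximation; the encoding of subsets as bitmasks and the number of steps are our own illustrative choices.

import random

def linear_extension_mcmc(n, steps=10000, seed=0):
    # Ground set: all subsets of {0,...,n-1}, encoded as bitmasks.
    # A linear extension is an ordering of the subsets consistent with inclusion.
    rng = random.Random(seed)
    # Start from a valid linear extension: sort by cardinality, then lexicographically.
    subsets = sorted(range(1 << n), key=lambda s: (bin(s).count("1"), s))
    for _ in range(steps):
        i = rng.randrange(len(subsets) - 1)
        a, b = subsets[i], subsets[i + 1]
        # Swap an adjacent pair only if the two subsets are incomparable under
        # inclusion, so the ordering remains a linear extension; the stationary
        # distribution of this chain is uniform over linear extensions.
        if not (a & b == a or a & b == b):
            subsets[i], subsets[i + 1] = b, a
    return subsets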

Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how a solution trained on empirical data performs on unseen test data, have been relatively underexplored. A fundamental question remains elusive: what is a good metric for studying the generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric for studying generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior of nonconvex minimax problems, due to the possible non-existence of saddle points. We thus propose a new metric for studying the generalization of minimax learners, the primal gap, which circumvents these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also settle two open questions: establishing generalization bounds for the primal risk and the primal-dual risk in the strong sense, i.e., without strong concavity and without assuming that the maximization and expectation can be interchanged, whereas at least one of these assumptions was needed in the prior literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms, gradient descent-ascent (GDA) and gradient descent-max (GDMax), in stochastic minimax optimization.
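
To fix the two algorithms compared at the end, here is a hedged toy sketch (our own setup, not the paper's experiments) of GDA and GDMax on a small strongly convex-strongly concave quadratic; step sizes and the inner-loop budget are illustrative.

import numpy as np

def gda(grad_x, grad_y, x, y, lr=0.05, steps=500):
    # Gradient descent-ascent: simultaneous descent step in x and ascent step in y.
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

def gdmax(grad_x, grad_y, x, y, lr=0.05, steps=500, inner=50):
    # Gradient descent-max: approximately maximize over y before each descent step in x.
    for _ in range(steps):
        for _ in range(inner):
            y = y + lr * grad_y(x, y)
        x = x - lr * grad_x(x, y)
    return x, y

# Toy saddle problem f(x, y) = 0.5*mu*||x||^2 + x.T @ A @ y - 0.5*mu*||y||^2.
mu, A = 1.0, np.array([[1.0, 0.5], [0.0, 1.0]])
gx = lambda x, y: mu * x + A @ y
gy = lambda x, y: A.T @ x - mu * y
x0 = y0 = np.ones(2)
print(gda(gx, gy, x0, y0))     # both methods converge to the saddle point (0, 0)
print(gdmax(gx, gy, x0, y0))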

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if the actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and that they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies, as a corollary, that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression, including, as a special case, min-$\ell_2$-norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, in which the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
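
A minimal numpy sketch (our own, with illustrative sizes) of the empirical NT kernel of a two-layer ReLU network $f(x) = N^{-1/2}\sum_i a_i\,\sigma(w_i\cdot x)$: each entry is the inner product of the parameter gradients at two samples, and in the regime $Nd\gg n$ its minimum eigenvalue should be bounded away from zero.

import numpy as np

rng = np.random.default_rng(0)
n, d, N = 200, 20, 500                        # samples, input dimension, hidden neurons (Nd >> n)
X = rng.standard_normal((n, d)) / np.sqrt(d)  # roughly isotropic covariate vectors
W = rng.standard_normal((N, d)) / np.sqrt(d)  # first-layer weights
a = rng.choice([-1.0, 1.0], size=N)           # second-layer weights

Z = X @ W.T                                   # pre-activations, shape (n, N)
act = np.maximum(Z, 0.0)                      # ReLU(w_i . x_j)
der = (Z > 0).astype(float)                   # ReLU'(w_i . x_j)

# Empirical NT kernel: parameter gradients of f, inner products over all weights.
# df/da_i = relu(w_i.x)/sqrt(N);  df/dw_i = a_i * relu'(w_i.x) * x / sqrt(N).
K_second = act @ act.T / N
K_first = ((der * a) @ (der * a).T) * (X @ X.T) / N
K_NT = K_first + K_second

print("min eigenvalue of empirical NT kernel:", np.linalg.eigvalsh(K_NT).min())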

We address facts and open questions concerning the degree of ill-posedness of the composite Hausdorff moment problem, aimed at the recovery of a function $x \in L^2(0,1)$ from elements of the infinite-dimensional sequence space $\ell^2$ that represent the moments of the antiderivative of $x$. This degree, unknown so far, is determined by the decay rate of the singular values of the associated compact forward operator $A$, which is the composition of the compact simple integration operator mapping in $L^2(0,1)$ and the non-compact Hausdorff moment operator $B^{(H)}$ mapping from $L^2(0,1)$ to $\ell^2$. There is a seeming contradiction between (a) numerical computations, which show (even for large $n$) an exponential decay of the singular values of the $n$-dimensional matrices obtained by discretizing the operator $A$, and (b) the strongly limited smoothness of the well-known kernel $k$ of the Hilbert-Schmidt operator $A^*A$. Fact (a) suggests severe ill-posedness of the infinite-dimensional Hausdorff moment problem, whereas fact (b) lets us expect the opposite, because exponential ill-posedness typically occurs only for $C^\infty$-kernels $k$. We recall arguments for the possible occurrence of a polynomial decay of the singular values of $A$, even though the numerics seem to suggest otherwise, and discuss some issues in the numerical approximation of non-compact operators.
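
A hedged numerical illustration of fact (a): discretize the composite operator $A = B^{(H)}\circ J$ (simple integration followed by Hausdorff moments) on an $n$-point grid and inspect the singular-value decay; the midpoint discretization and the number of retained moments are our own illustrative choices, not the paper's setup.

import numpy as np

n = 200                                     # grid size and number of retained moments
t = (np.arange(n) + 0.5) / n                # midpoint grid on (0, 1)
J = np.tril(np.ones((n, n))) / n            # simple integration operator (J x)(t) = int_0^t x(s) ds
M = np.vander(t, n, increasing=True).T / n  # Hausdorff moments m_j = int_0^1 t^(j-1) u(t) dt, j = 1..n
A = M @ J                                   # discretized composite operator B^(H) o J
sv = np.linalg.svd(A, compute_uv=False)
print(sv[:10] / sv[0])                      # rapid (apparently exponential) decay, cf. fact (a)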

We consider decentralized optimization problems in which a number of agents collaborate to minimize the average of their local functions by exchanging information over an underlying communication graph. Specifically, we place ourselves in an asynchronous model where only a random portion of nodes perform computation at each iteration, while information exchange can be conducted between all nodes and in an asymmetric fashion. For this setting, we propose an algorithm that combines gradient tracking with variance reduction over the entire network. This enables each node to track the average of the gradients of the objective functions. Our theoretical analysis shows that the algorithm converges linearly, when the local objective functions are strongly convex, under mild connectivity conditions on the expected mixing matrices. In particular, our result does not require the mixing matrices to be doubly stochastic. In the experiments, we investigate a broadcast mechanism that transmits information from computing nodes to their neighbors, and we confirm the linear convergence of our method on both synthetic and real-world datasets.
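
For orientation, a minimal sketch of the synchronous gradient-tracking recursion that the proposed method builds on, written with a fixed doubly stochastic mixing matrix for readability; the paper's algorithm additionally uses variance reduction and asynchronous updates, and does not require doubly stochastic mixing. Names and constants below are illustrative.

import numpy as np

def gradient_tracking(grads, W, x0, lr=0.1, iters=200):
    # grads: list of local gradient oracles grad_i(x); W: (m x m) mixing matrix;
    # x0: (m, d) initial local iterates. Each node tracks the network-average gradient in y.
    m, d = x0.shape
    x = x0.copy()
    y = np.stack([grads[i](x[i]) for i in range(m)])  # y_i^0 = grad f_i(x_i^0)
    g_old = y.copy()
    for _ in range(iters):
        x = W @ x - lr * y                            # consensus step + descent along tracked gradient
        g_new = np.stack([grads[i](x[i]) for i in range(m)])
        y = W @ y + g_new - g_old                     # gradient-tracking recursion
        g_old = g_new
    return x

# Example: local objectives f_i(x) = 0.5*||x - b_i||^2 on a ring of 5 nodes.
rng = np.random.default_rng(0)
m, d = 5, 3
B = rng.standard_normal((m, d))
grads = [lambda x, b=B[i]: x - b for i in range(m)]
W = np.zeros((m, m))
for i in range(m):
    W[i, i], W[i, (i - 1) % m], W[i, (i + 1) % m] = 0.5, 0.25, 0.25
x = gradient_tracking(grads, W, np.zeros((m, d)))
print(np.abs(x - B.mean(axis=0)).max())               # near 0: all nodes reach the minimizer of the average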

We establish the following two main results on order types of points in general position in the plane (realizable simple planar order types, realizable uniform acyclic oriented matroids of rank $3$): (a) The number of extreme points in an $n$-point order type, chosen uniformly at random from all such order types, is on average $4+o(1)$. For labeled order types, this number has average $4- \frac{8}{n^2 - n +2}$ and variance at most $3$. (b) The (labeled) order types read off a set of $n$ points sampled independently from the uniform measure on a convex planar domain, smooth or polygonal, or from a Gaussian distribution, are concentrated, i.e., such sampling typically encounters only a vanishingly small fraction of all order types of the given size. Result (a) generalizes to arbitrary dimension $d$ for labeled order types, with the average number of extreme points $2d+o(1)$ and constant variance. We also discuss to what extent our methods generalize to the abstract setting of uniform acyclic oriented matroids. Moreover, our methods allow us to show the following relative of the Erd\H{o}s-Szekeres theorem: for any fixed $k$, as $n \to \infty$, a proportion $1 - O(1/n)$ of the $n$-point simple order types contain a triangle enclosing a convex $k$-chain over an edge. For the unlabeled case in (a), we prove that for any antipodal, finite subset of the $2$-dimensional sphere, the group of orientation-preserving bijections is cyclic, dihedral, or one of $A_4$, $S_4$ or $A_5$ (and each case is possible). These are the finite subgroups of $SO(3)$, and our proof follows the lines of their characterization by Felix Klein.
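
To make the objects concrete, a small illustrative sketch (ours, not taken from the paper) that reads off the labeled order type of a planar point sample as the chirotope of orientation signs and counts its extreme points.

import numpy as np
from itertools import combinations

def orientation(p, q, r):
    # Sign of det(q - p, r - p): +1 counterclockwise, -1 clockwise, 0 collinear.
    return int(np.sign((q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0])))

def order_type(points):
    # Labeled order type (chirotope): the orientation sign of every triple i < j < k.
    return {t: orientation(*points[list(t)]) for t in combinations(range(len(points)), 3)}

def num_extreme_points(points):
    # In general position, a point is extreme iff it lies in no triangle spanned by three others.
    n = len(points)
    count = 0
    for i in range(n):
        inside = any(
            len({orientation(points[a], points[b], points[i]),
                 orientation(points[b], points[c], points[i]),
                 orientation(points[c], points[a], points[i])}) == 1
            for a, b, c in combinations([j for j in range(n) if j != i], 3))
        count += not inside
    return count

P = np.random.default_rng(0).random((10, 2))   # 10 uniform points in the unit square
print(len(order_type(P)), num_extreme_points(P))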

We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent-ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-{\L}ojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.
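
A hedged toy sketch (our own finite-sum example, not the paper's analysis) of epoch-based stochastic gradient descent-ascent under the three sampling schemes discussed: Random Reshuffling (RR), Shuffle Once (SO), and with-replacement sampling.

import numpy as np

def finite_sum_gda(grad_i, n, x, y, lr=0.01, epochs=50, scheme="RR", seed=0):
    # grad_i(i, x, y) returns the component gradients (g_x, g_y) of f_i.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                  # fixed permutation for Shuffle Once (SO)
    for _ in range(epochs):
        if scheme == "RR":
            order = rng.permutation(n)         # reshuffle every epoch
        elif scheme == "SO":
            order = perm                       # shuffle only once, at the start
        else:
            order = rng.integers(0, n, size=n) # uniform with-replacement sampling
        for i in order:
            gx, gy = grad_i(i, x, y)
            x, y = x - lr * gx, y + lr * gy    # one stochastic descent-ascent step
    return x, y

# Toy strongly convex-strongly concave finite sum: f_i(x, y) = 0.5*(x - a_i)^2 + x*y - 0.5*(y - b_i)^2.
rng = np.random.default_rng(1)
a, b = rng.standard_normal(20), rng.standard_normal(20)
grad = lambda i, x, y: (x - a[i] + y, x - (y - b[i]))
for s in ("RR", "SO", "IID"):
    print(s, finite_sum_gda(grad, 20, 0.0, 0.0, scheme=s))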

Inferring missing links in knowledge graphs (KGs) has attracted a lot of attention from the research community. In this paper, we tackle a practical query-answering task: predicting the relation between a given entity pair. We frame this prediction problem as an inference problem in a probabilistic graphical model and aim to solve it from a variational-inference perspective. In order to model the relation between the query entity pair, we assume that there exists an underlying latent variable (the paths connecting the two nodes) in the KG, which carries semantics equivalent to their relation. However, because enumerating these connections in large KGs is intractable, we propose to use variational inference to maximize the evidence lower bound. More specifically, our framework (\textsc{Diva}) is composed of three modules: a posterior approximator, a prior (path finder), and a likelihood (path reasoner). Using variational inference, we are able to incorporate them into a unified architecture and jointly optimize them to perform KG reasoning. With active interactions among these sub-modules, \textsc{Diva} is better at handling noise and coping with more complex reasoning scenarios. To evaluate our method, we conduct link-prediction experiments on two benchmark datasets and achieve state-of-the-art performance on both.
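
As a hedged restatement (our own notation; the symbols $e_s$, $e_d$, $L$ are illustrative) of the variational objective tying the three modules together: with latent paths $L$ between the query entities $(e_s, e_d)$ and target relation $r$, maximizing the evidence lower bound
\[
\log p_\theta(r \mid e_s, e_d) \;\ge\;
\mathbb{E}_{q_\varphi(L \mid r, e_s, e_d)}\!\left[\log p_\theta(r \mid L)\right]
- D_{\mathrm{KL}}\!\left(q_\varphi(L \mid r, e_s, e_d) \,\big\|\, p_\beta(L \mid e_s, e_d)\right)
\]
jointly trains the likelihood (path reasoner $p_\theta$), the prior (path finder $p_\beta$), and the approximate posterior $q_\varphi$.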
