

The emergence of foundation models in Computer Vision and Natural Language Processing has resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on GitHub: //github.com/graphcore-research/pyscf-ipu

Related Content

In the semi-streaming model for processing massive graphs, an algorithm makes multiple passes over the edges of a given $n$-vertex graph and is tasked with computing the solution to a problem using $O(n \cdot \text{polylog}(n))$ space. Semi-streaming algorithms for Maximal Independent Set (MIS) that run in $O(\log\log{n})$ passes have been known for almost a decade; however, the best lower bounds can only rule out single-pass algorithms. We close this large gap by proving that the current algorithms are optimal: any semi-streaming algorithm for finding an MIS with constant probability of success requires $\Omega(\log\log{n})$ passes. This settles the complexity of this fundamental problem in the semi-streaming model, and constitutes one of the first optimal multi-pass lower bounds in this model. We establish our result by proving an optimal round-vs-communication tradeoff for the (multi-party) communication complexity of MIS. The key ingredient of this result is a new technique, called hierarchical embedding, for performing round elimination: we show how to pack many small hard $(r-1)$-round instances of the problem into a single $r$-round instance, in a way that forces any $r$-round protocol to effectively solve all these $(r-1)$-round instances as well. These embeddings are obtained via a novel application of results from extremal graph theory -- in particular dense graphs with many disjoint unique shortest paths -- together with a newly designed graph product, and are analyzed via information-theoretic tools such as direct-sum and message compression arguments.

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
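The core trick behind a Hessian/Jacobian-free method such as the one above is a standard identity: a Hessian-vector product can be approximated with two gradient evaluations and no second derivatives. A minimal sketch with a made-up two-variable objective (this illustrates the finite-difference principle only, not FdeHBO itself):

```python
# Hypothetical objective f(x, y) = x^2 * y + y^3, with its analytic gradient,
# used only to illustrate the finite-difference Hessian-vector trick.
def grad(p):
    x, y = p
    return [2 * x * y, x * x + 3 * y * y]

def hvp_fd(grad_fn, p, v, eps=1e-5):
    # Central difference: H v ~= (grad(p + eps*v) - grad(p - eps*v)) / (2*eps)
    gp = grad_fn([pi + eps * vi for pi, vi in zip(p, v)])
    gm = grad_fn([pi - eps * vi for pi, vi in zip(p, v)])
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]

p, v = [1.0, 2.0], [0.5, -1.0]
# Exact Hessian at p is [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]],
# so the exact product H v is [0, -11].
approx = hvp_fd(grad, p, v)
```

Each iteration of such a scheme therefore costs only first-order gradient information, matching the $O(1)$ samples per iteration claimed in the abstract.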

Maximal Independent Set (MIS) is one of the most fundamental and well-studied problems in distributed graph algorithms. Even after four decades of intensive research, the best-known (randomized) MIS algorithms have $O(\log{n})$ round complexity on general graphs [Luby, STOC 1986] (where $n$ is the number of nodes), while the best-known lower bound is $\Omega(\sqrt{\log{n}/\log\log{n}})$ [Kuhn, Moscibroda, Wattenhofer, JACM 2016]. Breaking past the $O(\log{n})$ round complexity upper bound or showing stronger lower bounds have been longstanding open problems. Our main contribution is to show that MIS can be computed with an awake complexity that is exponentially better than the best-known round complexity of $O(\log n)$, thereby also exponentially bypassing the fundamental $\Omega(\sqrt{\log{n}/\log\log{n}})$ round complexity lower bound. Specifically, we show that MIS can be computed by a randomized distributed (Monte Carlo) algorithm in $O(\log\log{n})$ awake complexity with high probability (i.e., with probability at least $1 - n^{-1}$). This algorithm has a round complexity of $O((\log^7 n) \log \log n)$. We also show that we can improve the round complexity at the cost of a slight increase in awake complexity, by presenting a randomized distributed (Monte Carlo) algorithm for MIS that, with high probability, computes an MIS in $O((\log\log{n})\log^*n)$ awake complexity and $O((\log^3 n) (\log \log n) \log^*n)$ round complexity. Our algorithms work in the CONGEST model, where messages of size $O(\log n)$ bits can be sent per edge per round.
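For context, the classical $O(\log n)$ baseline cited above is Luby's algorithm. A sequential simulation sketch of its round structure (no CONGEST message passing, and no awake-complexity machinery from the paper):

```python
import random

def luby_mis(adj):
    # Each round: every active node draws a random priority; a node joins the
    # MIS if its priority beats all active neighbours; winners and their
    # neighbours then deactivate. Terminates in O(log n) rounds in expectation.
    active, mis = set(adj), set()
    while active:
        r = {v: random.random() for v in active}
        winners = {v for v in active
                   if all(r[v] < r[u] for u in adj[v] if u in active)}
        mis |= winners
        active -= winners | {u for v in winners for u in adj[v]}
    return mis

random.seed(1)
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
mis = luby_mis(cycle6)
```

Whatever the random choices, the returned set is independent (no two adjacent winners can exist in a round) and maximal (a node only deactivates when it or a neighbour joins).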

Consider a set $V$ of voters, represented by a multiset in a metric space $(X,d)$. The voters have to reach a decision -- a point in $X$. A choice $p\in X$ is called a $\beta$-plurality point for $V$, if for any other choice $q\in X$ it holds that $|\{v\in V\mid \beta\cdot d(p,v)\le d(q,v)\}|\ge\frac{|V|}{2}$. In other words, at least half of the voters ``prefer'' $p$ over $q$, when an extra factor of $\beta$ is taken in favor of $p$. For $\beta=1$, this is equivalent to a Condorcet winner, which rarely exists. The concept of $\beta$-plurality was suggested by Aronov, de Berg, Gudmundsson, and Horton [TALG 2021] as a relaxation of the Condorcet criterion. Let $\beta^*_{(X,d)}=\sup\{\beta\mid \mbox{every finite multiset $V$ in $X$ admits a $\beta$-plurality point}\}$. The parameter $\beta^*$ determines the amount of relaxation required in order to reach a stable decision. Aronov et al. showed that for the Euclidean plane $\beta^*_{(\mathbb{R}^2,\|\cdot\|_2)}=\frac{\sqrt{3}}{2}$, and more generally, for $d$-dimensional Euclidean space, $\frac{1}{\sqrt{d}}\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}\le\frac{\sqrt{3}}{2}$. In this paper, we show that $0.557\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}$ for any dimension $d$ (notice that $\frac{1}{\sqrt{d}}<0.557$ for any $d\ge 4$). In addition, we prove that for every metric space $(X,d)$ it holds that $\sqrt{2}-1\le\beta^*_{(X,d)}$, and show that there exists a metric space for which $\beta^*_{(X,d)}\le \frac12$.
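The defining inequality is directly checkable for a given pair of candidate points: count the voters who prefer $p$ to $q$ once $p$'s distances are discounted by $\beta$. A small Euclidean sketch (function name and data are illustrative, not from the paper):

```python
from math import dist

def half_prefer(voters, p, q, beta):
    # Test |{v : beta * d(p, v) <= d(q, v)}| >= |V| / 2 for one alternative q.
    wins = sum(1 for v in voters if beta * dist(p, v) <= dist(q, v))
    return wins >= len(voters) / 2

# Two voters at the origin outvote one voter at (10, 0), so the origin
# beats (10, 0) even with no discount (beta = 1), and not vice versa.
voters = [(0, 0), (0, 0), (10, 0)]
```

Note that being a $\beta$-plurality point requires this check to hold against *every* alternative $q \in X$, not just one sampled point.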

As in other estimation scenarios, likelihood-based estimation in the normal mixture set-up is highly non-robust against model misspecification and the presence of outliers (apart from being an ill-posed optimization problem). A robust alternative to the ordinary likelihood approach for this estimation problem is proposed which performs simultaneous estimation and data clustering and leads to subsequent anomaly detection. To invoke robustness, the methodology based on the minimization of the density power divergence (or alternatively, the maximization of the $\beta$-likelihood) is utilized under suitable constraints. An iteratively reweighted least squares approach is followed to compute the proposed estimators for the component means (or equivalently cluster centers) and component dispersion matrices, which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in statistics and machine learning. The proposed method is validated with simulation studies under different set-ups; it performs competitively with or better than popular existing methods like K-medoids, TCLUST, trimmed K-means and MCLUST, especially when the mixture components (i.e., the clusters) share regions with significant overlap or outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets are also used to illustrate the performance of the newly proposed method in comparison with others, along with an application in image processing. The proposed method detects the clusters with lower misclassification rates and successfully points out the outlying (anomalous) observations in these datasets.
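The density power divergence weighting can be made concrete in one dimension: under the normal model, the location estimating equation weights each observation by $f(x_i;\mu,\sigma)^\beta$, so outliers are automatically down-weighted. The following IRLS toy is a single-component, univariate illustration of that weighting principle, not the paper's multivariate mixture procedure:

```python
import math, random

def dpd_location(xs, beta=0.5, iters=50):
    # IRLS sketch for a minimum-DPD (beta-likelihood) normal location fit:
    # weight w_i proportional to exp(-beta * (x_i - mu)^2 / (2 sigma^2))
    # shrinks the influence of points far from the current centre.
    mu = sorted(xs)[len(xs) // 2]                       # robust start: median
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    for _ in range(iters):
        w = [math.exp(-beta * (x - mu) ** 2 / (2 * sigma ** 2)) for x in xs]
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        sigma = max(1e-9, (sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs))
                           / sum(w)) ** 0.5)
    return mu

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)] + [50.0] * 20  # 10% outliers
```

With 10% gross outliers at 50, the DPD estimate stays near the inlier centre 0, while the plain sample mean is dragged to roughly 4.5.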

This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited improvement on zero-shot tasks. While prior works focus mainly on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, where we find that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats like FP6 has been particularly challenging, and thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the \codestar-15B model performs comparably to its FP16 counterpart in code generation, and a smaller 406M model closely matches its baseline in summarization. Neither can be achieved with INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
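The precision trade-off being measured can be illustrated with plain round-to-nearest fake quantization on a symmetric integer grid. This is a generic stand-in to show why more bits shrink round-off error; it is not the paper's FP6 format or its 4+2 kernel design:

```python
import random

def fake_quant(ws, bits):
    # Symmetric per-tensor ("coarse-grain") quantization: scale weights onto
    # a signed integer grid, round to nearest, and map back to floats.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax
    return [round(w / scale) * scale for w in ws]

random.seed(0)
ws = [random.gauss(0, 1) for _ in range(1000)]

def mean_abs_err(q):
    return sum(abs(a - b) for a, b in zip(q, ws)) / len(ws)

err4 = mean_abs_err(fake_quant(ws, 4))   # coarser grid, larger error
err6 = mean_abs_err(fake_quant(ws, 6))   # 4x finer grid, smaller error
```

The step size shrinks by a factor of about $31/7 \approx 4.4$ going from 4 to 6 bits, which is the accuracy headroom the study trades against system efficiency.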

Recent works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance; and finally, the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early-stopping criteria. In this work, we make three key contributions. First, we propose a learning framework that avoids this phenomenon and improves generalization. Second, we introduce an entropy measure that provides more insight into the emergence of this phenomenon and enables the use of traditional stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at //github.com/VGCQ/DSD2.

Inspired by the split decomposition of graphs and rank-width, we introduce the notion of $r$-splits. We focus on the family of $r$-splits of a graph of order $n$, and we prove that it forms a hypergraph with several useful properties. We prove that such hypergraphs can be represented using only $\mathcal O(n^{r+1})$ of their hyperedges, despite their potentially exponential number of hyperedges. We also prove, using a generalization of set orthogonality, that there exist hypergraphs that need at least $\Omega(n^r)$ hyperedges to be represented.

Click-through rate (CTR) prediction plays a critical role in recommender systems and online advertising. The data used in these applications are multi-field categorical data, where each feature belongs to one field. Field information has proven important, and several works incorporate fields into their models. In this paper, we propose a novel approach to model the field information effectively and efficiently. The proposed approach is a direct improvement of FwFM, and is named Field-matrixed Factorization Machines (FmFM, or $FM^2$). We also propose a new interpretation of FM and FwFM within the FmFM framework, and compare it with FFM. Besides pruning the cross terms, our model supports field-specific variable dimensions of embedding vectors, which acts as soft pruning. We also propose an efficient way to minimize the dimension while preserving model performance. The FmFM model can be optimized further by caching the intermediate vectors, so that a prediction takes only thousands of floating-point operations (FLOPs). Our experimental results show that it can outperform FFM, which is more complex. The FmFM model's performance is also comparable to DNN models, which require far more FLOPs at runtime.
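The field-matrixed interaction can be sketched directly: each pair of features contributes $\langle v_i M_{f(i),f(j)}, v_j\rangle$, where $M_{f(i),f(j)}$ is a learned matrix for the field pair. The toy below shows the mechanics, including field-specific embedding sizes (the "soft pruning" above); all sizes and random values are illustrative, not from the paper:

```python
import random

random.seed(0)
field_of = [0, 0, 1]       # feature index -> field index
dim = {0: 4, 1: 2}         # field-specific embedding dimensions
emb = [[random.gauss(0, 1) for _ in range(dim[field_of[i]])]
       for i in range(3)]
# One matrix per ordered field pair, mapping field-a vectors into field-b space.
M = {(a, b): [[random.gauss(0, 1) for _ in range(dim[b])]
              for _ in range(dim[a])]
     for a in dim for b in dim}

def fmfm_score(active):
    # Sum of pairwise interactions <v_i M_{f(i),f(j)}, v_j> over active features.
    total = 0.0
    for k, i in enumerate(active):
        for j in active[k + 1:]:
            vi, vj, Mij = emb[i], emb[j], M[(field_of[i], field_of[j])]
            proj = [sum(vi[a] * Mij[a][b] for a in range(len(vi)))
                    for b in range(len(vj))]
            total += sum(p * w for p, w in zip(proj, vj))
    return total
```

Because the interaction matrices can be folded into the embeddings and cached, the per-prediction cost reduces to the dot products, consistent with the thousands-of-FLOPs claim.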

Graph convolution networks (GCN) are increasingly popular in many applications, yet remain notoriously hard to train over large graph datasets. They need to compute node representations recursively from their neighbors. Current GCN training algorithms suffer from either high computational costs that grow exponentially with the number of layers, or high memory usage for loading the entire graph and node embeddings. In this paper, we propose a novel efficient layer-wise training framework for GCN (L-GCN) that disentangles feature aggregation and feature transformation during training, greatly reducing time and memory complexity. We present a theoretical analysis of L-GCN under the graph isomorphism framework, showing that, under mild conditions, L-GCN yields GCNs as powerful as those produced by the more costly conventional training algorithm. We further propose L^2-GCN, which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Experiments show that L-GCN is faster than state-of-the-art methods by at least an order of magnitude, with consistent memory usage that does not depend on dataset size, while maintaining comparable prediction performance. With the learned controller, L^2-GCN can further cut the training time in half. Our code is available at //github.com/Shen-Lab/L2-GCN.
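The disentanglement idea can be seen in a tiny sketch: per layer, neighbour aggregation is a fixed precomputation that needs no gradient bookkeeping, and only the per-layer transformation would be trained before its output is handed to the next layer. The graph, feature sizes, and random weights below are illustrative only, not the paper's training procedure:

```python
import random

random.seed(0)
adj = {0: [1], 1: [0, 2], 2: [1]}          # a 3-node path graph
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2-dim input features

def aggregate(adj, H):
    # Mean over each node and its neighbours: a simple stand-in for A_hat @ H.
    out = []
    for v in adj:
        rows = [H[v]] + [H[u] for u in adj[v]]
        out.append([sum(col) / len(rows) for col in zip(*rows)])
    return out

H = X
for _ in range(2):                          # two layers, handled one at a time
    agg = aggregate(adj, H)                 # aggregation: fixed, precomputable
    W = [[random.gauss(0, 1) for _ in range(2)] for _ in range(2)]
    H = [[max(0.0, sum(r[a] * W[a][b] for a in range(2))) for b in range(2)]
         for r in agg]                      # transformation + ReLU (trainable part)
```

Since each layer's aggregated features can be computed and stored once, memory stays proportional to one layer's activations rather than the full recursive neighbourhood expansion.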

北京阿比特科技有限公司
{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on Github: //github.com/graphcore-research/pyscf-ipu ">

亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

The emergence of foundation models in Computer Vision and Natural Language Processing have resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on Github: //github.com/graphcore-research/pyscf-ipu

相關內容

In the semi-streaming model for processing massive graphs, an algorithm makes multiple passes over the edges of a given $n$-vertex graph and is tasked with computing the solution to a problem using $O(n \cdot \text{polylog}(n))$ space. Semi-streaming algorithms for Maximal Independent Set (MIS) that run in $O(\log\log{n})$ passes have been known for almost a decade, however, the best lower bounds can only rule out single-pass algorithms. We close this large gap by proving that the current algorithms are optimal: Any semi-streaming algorithm for finding an MIS with constant probability of success requires $\Omega(\log\log{n})$ passes. This settles the complexity of this fundamental problem in the semi-streaming model, and constitutes one of the first optimal multi-pass lower bounds in this model. We establish our result by proving an optimal round vs communication tradeoff for the (multi-party) communication complexity of MIS. The key ingredient of this result is a new technique, called hierarchical embedding, for performing round elimination: we show how to pack many but small hard $(r-1)$-round instances of the problem into a single $r$-round instance, in a way that enforces any $r$-round protocol to effectively solve all these $(r-1)$-round instances also. These embeddings are obtained via a novel application of results from extremal graph theory -- in particular dense graphs with many disjoint unique shortest paths -- together with a newly designed graph product, and are analyzed via information-theoretic tools such as direct-sum and message compression arguments.

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.

Maximal Independent Set (MIS) is one of the fundamental and most well-studied problems in distributed graph algorithms. Even after four decades of intensive research, the best-known (randomized) MIS algorithms have $O(\log{n})$ round complexity on general graphs [Luby, STOC 1986] (where $n$ is the number of nodes), while the best-known lower bound is $\Omega(\sqrt{\log{n}/\log\log{n}})$ [Kuhn, Moscibroda, Wattenhofer, JACM 2016]. Breaking past the $O(\log{n})$ round complexity upper bound or showing stronger lower bounds have been longstanding open problems. Our main contribution is to show that MIS can be computed in awake complexity that is exponentially better compared to the best known round complexity of $O(\log n)$ and also bypassing its fundamental $\Omega(\sqrt{\log{n}/\log\log{n}})$ round complexity lower bound exponentially. Specifically, we show that MIS can be computed by a randomized distributed (Monte Carlo) algorithm in $O(\log\log{n} )$ awake complexity with high probability (i.e., with probability at least $1 - n^{-1}$). This algorithm has a round complexity of $O((\log^7 n) \log \log n)$. We also show that we can improve the round complexity at the cost of a slight increase in awake complexity, by presenting a randomized distributed (Monte Carlo) algorithm for MIS that, with high probability, computes an MIS in $O((\log\log{n})\log^*n)$ awake complexity and $O((\log^3 n) (\log \log n) \log^*n)$ round complexity. Our algorithms work in the CONGEST model where messages of size $O(\log n)$ bits can be sent per edge per round.

Consider a set $V$ of voters, represented by a multiset in a metric space $(X,d)$. The voters have to reach a decision -- a point in $X$. A choice $p\in X$ is called a $\beta$-plurality point for $V$, if for any other choice $q\in X$ it holds that $|\{v\in V\mid \beta\cdot d(p,v)\le d(q,v)\}|\ge\frac{|V|}{2}$. In other words, at least half of the voters ``prefer'' $p$ over $q$, when an extra factor of $\beta$ is taken in favor of $p$. For $\beta=1$, this is equivalent to Condorcet winner, which rarely exists. The concept of $\beta$-plurality was suggested by Aronov, de Berg, Gudmundsson, and Horton [TALG 2021] as a relaxation of the Condorcet criterion. Let $\beta^*_{(X,d)}=\sup\{\beta\mid \mbox{every finite multiset $V$ in $X$ admits a $\beta$-plurality point}\}$. The parameter $\beta^*$ determines the amount of relaxation required in order to reach a stable decision. Aronov et al. showed that for the Euclidean plane $\beta^*_{(\mathbb{R}^2,\|\cdot\|_2)}=\frac{\sqrt{3}}{2}$, and more generally, for $d$-dimensional Euclidean space, $\frac{1}{\sqrt{d}}\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}\le\frac{\sqrt{3}}{2}$. In this paper, we show that $0.557\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}$ for any dimension $d$ (notice that $\frac{1}{\sqrt{d}}<0.557$ for any $d\ge 4$). In addition, we prove that for every metric space $(X,d)$ it holds that $\sqrt{2}-1\le\beta^*_{(X,d)}$, and show that there exists a metric space for which $\beta^*_{(X,d)}\le \frac12$.

As in other estimation scenarios, likelihood based estimation in the normal mixture set-up is highly non-robust against model misspecification and presence of outliers (apart from being an ill-posed optimization problem). A robust alternative to the ordinary likelihood approach for this estimation problem is proposed which performs simultaneous estimation and data clustering and leads to subsequent anomaly detection. To invoke robustness, the methodology based on the minimization of the density power divergence (or alternatively, the maximization of the $\beta$-likelihood) is utilized under suitable constraints. An iteratively reweighted least squares approach has been followed in order to compute the proposed estimators for the component means (or equivalently cluster centers) and component dispersion matrices which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in the domain of statistics and machine learning. The proposed method is validated with simulation studies under different set-ups; it performs competitively or better compared to the popular existing methods like K-medoids, TCLUST, trimmed K-means and MCLUST, especially when the mixture components (i.e., the clusters) share regions with significant overlap or outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets are also used to illustrate the performance of the newly proposed method in comparison with others along with an application in image processing. The proposed method detects the clusters with lower misclassification rates and successfully points out the outlying (anomalous) observations from these datasets.

This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we extend task scope to more generative categories such as code generation and abstractive summarization, in which we found that INT4 quantization can significantly underperform. However, simply shifting to higher precision formats like FP6 has been particularly challenging, thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with the FP6 quantization, \codestar-15B model performs comparably to its FP16 counterpart in code generation, and for smaller models like the 406M it closely matches their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising solution to the current 4-bit quantization methods used in LLMs.

Neoteric works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such a behavior prevents using traditional early stop criteria. In this work, we have three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon and enabling the use of traditional stop criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at //github.com/VGCQ/DSD2.

Inspired by the split decomposition of graphs and rank-width, we introduce the notion of $r$-splits. We focus on the family of $r$-splits of a graph of order $n$, and we prove that it forms a hypergraph with several properties. We prove that such hypergraphs can be represented using only $\mathcal O(n^{r+1})$ of its hyperedges, despite its potentially exponential number of hyperedges. We also prove that there exist hypergraphs that need at least $\Omega(n^r)$ hyperedges to be represented, using a generalization of set orthogonality.

Click-through rate (CTR) prediction plays a critical role in recommender systems and online advertising. The data used in these applications are multi-field categorical data, where each feature belongs to one field. Field information is proved to be important and there are several works considering fields in their models. In this paper, we proposed a novel approach to model the field information effectively and efficiently. The proposed approach is a direct improvement of FwFM, and is named as Field-matrixed Factorization Machines (FmFM, or $FM^2$). We also proposed a new explanation of FM and FwFM within the FmFM framework, and compared it with the FFM. Besides pruning the cross terms, our model supports field-specific variable dimensions of embedding vectors, which acts as soft pruning. We also proposed an efficient way to minimize the dimension while keeping the model performance. The FmFM model can also be optimized further by caching the intermediate vectors, and it only takes thousands of floating-point operations (FLOPs) to make a prediction. Our experiment results show that it can out-perform the FFM, which is more complex. The FmFM model's performance is also comparable to DNN models which require much more FLOPs in runtime.

Graph convolution networks (GCN) are increasingly popular in many applications, yet remain notoriously hard to train over large graph datasets. They need to compute node representations recursively from their neighbors. Current GCN training algorithms suffer from either high computational costs that grow exponentially with the number of layers, or high memory usage for loading the entire graph and node embeddings. In this paper, we propose a novel efficient layer-wise training framework for GCN (L-GCN), that disentangles feature aggregation and feature transformation during training, hence greatly reducing time and memory complexities. We present theoretical analysis for L-GCN under the graph isomorphism framework, that L-GCN leads to as powerful GCNs as the more costly conventional training algorithm does, under mild conditions. We further propose L^2-GCN, which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Experiments show that L-GCN is faster than state-of-the-arts by at least an order of magnitude, with a consistent of memory usage not dependent on dataset size, while maintaining comparable prediction performance. With the learned controller, L^2-GCN can further cut the training time in half. Our codes are available at //github.com/Shen-Lab/L2-GCN.

北京阿比特科技有限公司
{\text{IPU}}$ - 北京阿比特科技有限公司 {mayi_des}

亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

The emergence of foundation models in Computer Vision and Natural Language Processing have resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on Github: //github.com/graphcore-research/pyscf-ipu

相關內容

In the semi-streaming model for processing massive graphs, an algorithm makes multiple passes over the edges of a given $n$-vertex graph and is tasked with computing the solution to a problem using $O(n \cdot \text{polylog}(n))$ space. Semi-streaming algorithms for Maximal Independent Set (MIS) that run in $O(\log\log{n})$ passes have been known for almost a decade, however, the best lower bounds can only rule out single-pass algorithms. We close this large gap by proving that the current algorithms are optimal: Any semi-streaming algorithm for finding an MIS with constant probability of success requires $\Omega(\log\log{n})$ passes. This settles the complexity of this fundamental problem in the semi-streaming model, and constitutes one of the first optimal multi-pass lower bounds in this model. We establish our result by proving an optimal round vs communication tradeoff for the (multi-party) communication complexity of MIS. The key ingredient of this result is a new technique, called hierarchical embedding, for performing round elimination: we show how to pack many but small hard $(r-1)$-round instances of the problem into a single $r$-round instance, in a way that enforces any $r$-round protocol to effectively solve all these $(r-1)$-round instances also. These embeddings are obtained via a novel application of results from extremal graph theory -- in particular dense graphs with many disjoint unique shortest paths -- together with a newly designed graph product, and are analyzed via information-theoretic tools such as direct-sum and message compression arguments.

In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(\epsilon^{-1.5})$ sample complexity in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
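The finite-difference Hessian/Jacobian-vector idea can be sketched with two gradient evaluations and no second-order computation (a toy quadratic; the helper names are ours, not FdeHBO's actual interface):

```python
def grad_quadratic(x):
    # Analytic gradient of f(x) = sum_i x_i^2; the Hessian is 2*I.
    return [2.0 * xi for xi in x]

def fd_hessian_vector(grad, x, v, delta=1e-4):
    """Hessian-vector product via central differences of the gradient:
    H(x) @ v  ~=  (grad(x + delta*v) - grad(x - delta*v)) / (2*delta).
    Only two first-order gradient calls; no second-order derivatives."""
    xp = [xi + delta * vi for xi, vi in zip(x, v)]
    xm = [xi - delta * vi for xi, vi in zip(x, v)]
    return [(gp - gm) / (2.0 * delta)
            for gp, gm in zip(grad(xp), grad(xm))]

hv = fd_hessian_vector(grad_quadratic, [1.0, -2.0, 0.5], [1.0, 1.0, 1.0])
# For this f the exact product is 2*v = [2.0, 2.0, 2.0].
```

The paper's projection step and momentum updates are omitted; the point is that the hypergradient machinery needs only first-order information.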

Maximal Independent Set (MIS) is one of the fundamental and most well-studied problems in distributed graph algorithms. Even after four decades of intensive research, the best-known (randomized) MIS algorithms have $O(\log{n})$ round complexity on general graphs [Luby, STOC 1986] (where $n$ is the number of nodes), while the best-known lower bound is $\Omega(\sqrt{\log{n}/\log\log{n}})$ [Kuhn, Moscibroda, Wattenhofer, JACM 2016]. Breaking past the $O(\log{n})$ round complexity upper bound or showing stronger lower bounds have been longstanding open problems. Our main contribution is to show that MIS can be computed in awake complexity that is exponentially better compared to the best known round complexity of $O(\log n)$ and also bypassing its fundamental $\Omega(\sqrt{\log{n}/\log\log{n}})$ round complexity lower bound exponentially. Specifically, we show that MIS can be computed by a randomized distributed (Monte Carlo) algorithm in $O(\log\log{n} )$ awake complexity with high probability (i.e., with probability at least $1 - n^{-1}$). This algorithm has a round complexity of $O((\log^7 n) \log \log n)$. We also show that we can improve the round complexity at the cost of a slight increase in awake complexity, by presenting a randomized distributed (Monte Carlo) algorithm for MIS that, with high probability, computes an MIS in $O((\log\log{n})\log^*n)$ awake complexity and $O((\log^3 n) (\log \log n) \log^*n)$ round complexity. Our algorithms work in the CONGEST model where messages of size $O(\log n)$ bits can be sent per edge per round.
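For context, the classical $O(\log n)$-round baseline the abstract compares against is Luby's algorithm; a centralized simulation of its rounds might look like this (our naming, not the paper's code):

```python
import random

def luby_mis(adj, seed=0):
    """Luby-style randomized MIS: each round, every surviving vertex draws a
    random priority; local minima join the MIS and their neighbours drop out."""
    rng = random.Random(seed)
    alive = set(adj)
    mis = set()
    while alive:
        priority = {v: rng.random() for v in alive}
        winners = {v for v in alive
                   if all(priority[v] < priority[u] for u in adj[v] if u in alive)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & alive
        alive -= removed
    return mis

# 4-cycle 0-1-2-3-0: the two maximal independent sets are {0,2} and {1,3}.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
mis = luby_mis(adj)
```

The awake-complexity results in the abstract concern a different resource: how many rounds a node must be listening at all, not how many rounds the algorithm takes overall.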

Consider a set $V$ of voters, represented by a multiset in a metric space $(X,d)$. The voters have to reach a decision -- a point in $X$. A choice $p\in X$ is called a $\beta$-plurality point for $V$, if for any other choice $q\in X$ it holds that $|\{v\in V\mid \beta\cdot d(p,v)\le d(q,v)\}|\ge\frac{|V|}{2}$. In other words, at least half of the voters ``prefer'' $p$ over $q$, when an extra factor of $\beta$ is taken in favor of $p$. For $\beta=1$, this is equivalent to Condorcet winner, which rarely exists. The concept of $\beta$-plurality was suggested by Aronov, de Berg, Gudmundsson, and Horton [TALG 2021] as a relaxation of the Condorcet criterion. Let $\beta^*_{(X,d)}=\sup\{\beta\mid \mbox{every finite multiset $V$ in $X$ admits a $\beta$-plurality point}\}$. The parameter $\beta^*$ determines the amount of relaxation required in order to reach a stable decision. Aronov et al. showed that for the Euclidean plane $\beta^*_{(\mathbb{R}^2,\|\cdot\|_2)}=\frac{\sqrt{3}}{2}$, and more generally, for $d$-dimensional Euclidean space, $\frac{1}{\sqrt{d}}\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}\le\frac{\sqrt{3}}{2}$. In this paper, we show that $0.557\le \beta^*_{(\mathbb{R}^d,\|\cdot\|_2)}$ for any dimension $d$ (notice that $\frac{1}{\sqrt{d}}<0.557$ for any $d\ge 4$). In addition, we prove that for every metric space $(X,d)$ it holds that $\sqrt{2}-1\le\beta^*_{(X,d)}$, and show that there exists a metric space for which $\beta^*_{(X,d)}\le \frac12$.
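The $\beta$-plurality condition can be checked directly from its definition; a brute-force sketch against a finite set of rival points (the helper name is ours):

```python
from math import dist

def is_beta_plurality(p, voters, candidates, beta):
    """p is a beta-plurality point if, against every rival q, at least half
    the voters v satisfy beta * d(p, v) <= d(q, v)."""
    half = len(voters) / 2.0
    for q in candidates:
        if q == p:
            continue
        support = sum(1 for v in voters if beta * dist(p, v) <= dist(q, v))
        if support < half:
            return False
    return True

# A single voter's own location beats any rival, even with beta = 1.
ok = is_beta_plurality((0.0, 0.0), [(0.0, 0.0)], [(1.0, 1.0), (2.0, 0.0)], 1.0)
```

The paper's question is about the supremum $\beta^*$ for which such a point always exists over *all* of $X$, which a finite candidate check can only approximate.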

As in other estimation scenarios, likelihood-based estimation in the normal mixture set-up is highly non-robust against model misspecification and the presence of outliers (apart from being an ill-posed optimization problem). We propose a robust alternative to the ordinary likelihood approach for this estimation problem which performs simultaneous estimation and data clustering and leads to subsequent anomaly detection. To invoke robustness, a methodology based on the minimization of the density power divergence (or, alternatively, the maximization of the $\beta$-likelihood) is utilized under suitable constraints. An iteratively reweighted least squares approach is followed to compute the proposed estimators for the component means (equivalently, cluster centers) and component dispersion matrices, which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in statistics and machine learning. The proposed method is validated with simulation studies under different set-ups; it performs competitively with or better than popular existing methods such as K-medoids, TCLUST, trimmed K-means, and MCLUST, especially when the mixture components (i.e., the clusters) share regions with significant overlap or when outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets are also used to illustrate the performance of the proposed method in comparison with others, along with an application in image processing. The proposed method detects the clusters with lower misclassification rates and successfully identifies the outlying (anomalous) observations in these datasets.
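The spirit of the density-power-divergence downweighting can be illustrated with a one-dimensional iteratively reweighted mean (a toy sketch with a unit-variance normal kernel and names of our own choosing, not the paper's full multivariate, constrained estimator):

```python
from math import exp

def beta_weights(data, center, beta):
    """Downweight points far from the current centre; with a unit-variance
    normal kernel the density-power weights are proportional to f(x)^beta."""
    return [exp(-beta * (x - center) ** 2 / 2.0) for x in data]

def irls_mean(data, beta, iters=50):
    """Iteratively reweighted mean: alternate weight and centre updates."""
    center = sum(data) / len(data)
    for _ in range(iters):
        w = beta_weights(data, center, beta)
        center = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    return center

data = [0.1, -0.2, 0.0, 0.2, -0.1, 10.0]   # one gross outlier
robust = irls_mean(data, beta=0.5)          # stays near the inliers' centre, ~0
plain = sum(data) / len(data)               # dragged to ~1.67 by the outlier
```

The vanishing weight assigned to the point at 10.0 is also what makes the outlier visible, which is the anomaly-detection side of the method.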

This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement on zero-shot tasks. While prior works focus mainly on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, in which we find that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats like FP6 has been particularly challenging, and thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grained quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the \codestar-15B model performs comparably to its FP16 counterpart in code generation, and smaller models like the 406M closely match their baselines in summarization. Neither can be achieved with INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 that achieves latency similar to the state-of-the-art INT4 fine-grained quantization. With our design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
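The 4+2 storage idea, splitting each 6-bit code into aligned 4-bit and 2-bit segments, can be sketched with plain bit operations (the real kernel layout is hardware-specific; this shows only the packing arithmetic):

```python
def split_fp6(value6):
    """Split a 6-bit quantized code into a 4-bit high part and a 2-bit low
    part, so each piece lands on a hardware-friendly power-of-two width."""
    assert 0 <= value6 < 64
    return (value6 >> 2) & 0xF, value6 & 0x3

def merge_fp6(high4, low2):
    """Reassemble the original 6-bit code from its two segments."""
    return ((high4 & 0xF) << 2) | (low2 & 0x3)

# The split is lossless: every representable 6-bit code round-trips.
ok = all(merge_fp6(*split_fp6(v)) == v for v in range(64))
```

Storing the 4-bit and 2-bit halves in separate, aligned arrays is what lets FP6 reuse memory-access patterns tuned for 4-bit formats.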

Recent works have shown that modern deep learning models can exhibit a sparse double descent phenomenon: as the sparsity of the model increases, the test performance first worsens because the model overfits the training data; then the overfitting reduces, leading to an improvement in performance; and finally the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early-stopping criteria. In this work, we make three key contributions. First, we propose a learning framework that avoids this phenomenon and improves generalization. Second, we introduce an entropy measure that provides more insight into the emergence of this phenomenon and enables the use of traditional stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at //github.com/VGCQ/DSD2.
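One step along the sparsity axis where the double descent is traced can be sketched as simple magnitude pruning (an illustration of the sweep, not the paper's framework or entropy measure):

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights; sweeping
    `sparsity` from 0 to 1 traces the axis on which the test error shows
    its worsen-improve-worsen (double descent) shape."""
    k = int(len(weights) * sparsity)
    keep = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[k:])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

w = [0.9, -0.1, 0.5, 0.05, -0.7]
pruned = prune_by_magnitude(w, 0.4)  # drops the two smallest: -0.1 and 0.05
```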

Inspired by the split decomposition of graphs and rank-width, we introduce the notion of $r$-splits. We focus on the family of $r$-splits of a graph of order $n$, and we prove that it forms a hypergraph with several properties. We prove that such hypergraphs can be represented using only $\mathcal O(n^{r+1})$ of its hyperedges, despite its potentially exponential number of hyperedges. We also prove that there exist hypergraphs that need at least $\Omega(n^r)$ hyperedges to be represented, using a generalization of set orthogonality.

Click-through rate (CTR) prediction plays a critical role in recommender systems and online advertising. The data used in these applications are multi-field categorical data, where each feature belongs to one field. Field information has proven to be important, and several works consider fields in their models. In this paper, we propose a novel approach to model field information effectively and efficiently. The proposed approach is a direct improvement of FwFM and is named Field-matrixed Factorization Machines (FmFM, or $FM^2$). We also propose a new interpretation of FM and FwFM within the FmFM framework and compare it with FFM. Besides pruning the cross terms, our model supports field-specific variable dimensions of embedding vectors, which acts as soft pruning. We also propose an efficient way to minimize the dimension while preserving model performance. The FmFM model can be further optimized by caching the intermediate vectors, so it takes only thousands of floating-point operations (FLOPs) to make a prediction. Our experimental results show that it can outperform FFM, which is more complex. The FmFM model's performance is also comparable to DNN models, which require many more FLOPs at runtime.
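The field-matrixed interaction term, $v_i^\top M v_j$ with one learned matrix $M$ per field pair, can be sketched with plain lists (our helper names; FM is recovered when $M$ is the identity, and a scalar multiple of the identity plays the role of FwFM's field-pair weight):

```python
def fmfm_pair_score(v_i, M, v_j):
    """FmFM-style interaction v_i^T M v_j. M is the learned matrix for the
    (field(i), field(j)) pair and may be rectangular when the two fields
    use different embedding dimensions."""
    Mv = [sum(m * x for m, x in zip(row, v_j)) for row in M]
    return sum(a * b for a, b in zip(v_i, Mv))

# With M = identity this collapses to the plain FM inner product <v1, v2>.
v1, v2 = [1.0, 2.0], [3.0, 4.0]
score = fmfm_pair_score(v1, [[1.0, 0.0], [0.0, 1.0]], v2)  # 1*3 + 2*4 = 11
```

Caching $M v_j$ per feature, as the abstract mentions, turns each prediction into a handful of inner products.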

Graph convolution networks (GCNs) are increasingly popular in many applications, yet remain notoriously hard to train over large graph datasets: they need to compute node representations recursively from their neighbors. Current GCN training algorithms suffer from either high computational costs that grow exponentially with the number of layers, or high memory usage for loading the entire graph and node embeddings. In this paper, we propose a novel, efficient layer-wise training framework for GCNs (L-GCN) that disentangles feature aggregation and feature transformation during training, greatly reducing time and memory complexity. We present a theoretical analysis of L-GCN under the graph isomorphism framework, showing that, under mild conditions, L-GCN leads to GCNs as powerful as those produced by the more costly conventional training algorithm. We further propose L^2-GCN, which learns a controller for each layer that can automatically adjust the training epochs per layer in L-GCN. Experiments show that L-GCN is faster than the state of the art by at least an order of magnitude, with consistent memory usage independent of dataset size, while maintaining comparable prediction performance. With the learned controller, L^2-GCN can further cut the training time in half. Our code is available at //github.com/Shen-Lab/L2-GCN.
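The disentangling of feature aggregation from feature transformation that L-GCN exploits can be sketched on a toy graph (our naming; aggregation here is a simple mean over the closed neighbourhood, one of several common GCN propagation rules):

```python
def aggregate(adj, feats):
    """Feature aggregation: mean over each node's closed neighbourhood.
    In layer-wise training this is computed once per layer and frozen."""
    out = []
    for v, neigh in enumerate(adj):
        group = [feats[v]] + [feats[u] for u in neigh]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def transform(feats, weight):
    """Feature transformation: the learnable linear map fitted on top of
    the frozen aggregated features."""
    return [[sum(f * w for f, w in zip(row, wcol)) for wcol in zip(*weight)]
            for row in feats]

adj = [[1], [0, 2], [1]]        # path graph 0-1-2
feats = [[1.0], [3.0], [5.0]]
agg = aggregate(adj, feats)     # aggregation done once, reused every epoch
hidden = transform(agg, [[2.0]])
```

Because `aggregate` never changes while `transform` is being fitted, each layer trains on fixed inputs, which is what removes the recursive neighbor expansion from the training loop.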

北京阿比特科技有限公司