
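GNN-algorithms tutorial series, part 7: Graph Isomorphism Network (GIN)

August 9, 2020

**[Overview]** Since GCN rose to prominence, graph neural networks have grown into a field of their own. That growth raises a question: why are GNNs so effective? The paper "How Powerful Are Graph Neural Networks?" offers an answer. This article briefly traces where the Graph Isomorphism Network (GIN) comes from and then walks through building a GIN model with TensorFlow.

Preface

"How powerful are graph neural networks?" Many people, like me, have probably asked this while losing hair over a training run. This tutorial walks through building a GIN model with TensorFlow and briefly summarizes the paper's (ICLR 2019 best student paper) explanation of why GNNs work.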

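GIN overview

Origins of GIN

Models such as GCN and GraphSAGE update node representations by iteratively aggregating information from first-order neighbors. The process can be split into three steps:

* Aggregate: aggregate the features of the first-order neighbors.
* Combine: merge the neighborhood features with the center node's own features to update the center node.
* Readout: for graph classification, turn the features of all nodes in a graph into a representation of the whole graph.

These methods are all empirical and lack a theoretical analysis of GNNs. GIN uses the Weisfeiler-Lehman (WL) test to analyze what a GNN actually does, how powerful it can become, and under which conditions it is as powerful as the WL test on graph classification.

WL test

The WL test is an effective (though not complete) way to check whether two graphs have the same structure. After initialization (each node's id is used as its label), it tests isomorphism by iterating two steps:

* Aggregate: aggregate the labels of each node's neighbors together with the node's own label.
* Update node labels: use a hash table to map each aggregated label to a new node label.

The figure below shows the WL test iterations:

    (Figure taken from Zhihu, //zhuanlan.zhihu.com/p/62006729; it will be removed on request if it infringes.)

In graph G of panel (a), node 1 has neighbor 4; node 2 has neighbors 3 and 5; node 3 has neighbors 2, 4, and 5; node 4 has neighbors 1, 3, and 5; node 5 has neighbors 2, 3, and 4. Step 1 aggregates each node's own label with its neighbors' labels, which gives G in panel (b). A hash then maps each aggregated result to a new label (label compression, panel (c)). Step 2 replaces the aggregated results with the compressed labels (panel (d)). G' is handled the same way.

As the graph representation, the WL test uses the counts of node labels before and after the iterations, as in panel (e). The WL iteration closely resembles neighborhood aggregation in a GNN, and the authors prove that the WL test is an upper bound on how well a GNN can aggregate neighborhood information.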
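The two WL steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `wl_histograms` is a hypothetical helper (not part of tf_geometric), all nodes start with the same label rather than the labels used in the figure, and a shared dictionary plays the role of the hash table.

```python
from collections import Counter


def wl_histograms(adj_a, adj_b, num_iters=2):
    """Refine labels on two graphs with a shared label table and return
    one label histogram per graph."""
    graphs = (adj_a, adj_b)
    # Assumption: all nodes start with the same label (no node attributes).
    labels = [{v: 0 for v in adj} for adj in graphs]
    for _ in range(num_iters):
        table = {}  # aggregated signature -> compressed label, shared by both graphs
        refined = []
        for adj, lab in zip(graphs, labels):
            new_lab = {}
            for v, neighbors in adj.items():
                # Aggregate: own label plus the sorted multiset of neighbor labels.
                signature = (lab[v], tuple(sorted(lab[u] for u in neighbors)))
                # Update: compress the signature into a small integer label.
                new_lab[v] = table.setdefault(signature, len(table))
            refined.append(new_lab)
        labels = refined
    return [Counter(lab.values()) for lab in labels]


# Graph G as described in the text.
g = {1: [4], 2: [3, 5], 3: [2, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}
# The same graph with renamed nodes, and a 5-node path for contrast.
g_renamed = {"a": ["d"], "b": ["c", "e"], "c": ["b", "d", "e"],
             "d": ["a", "c", "e"], "e": ["b", "c", "d"]}
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

hist_g, hist_renamed = wl_histograms(g, g_renamed)
hist_g2, hist_path = wl_histograms(g, path)
print(hist_g == hist_renamed)  # equal histograms: WL cannot tell these graphs apart
print(hist_g2 == hist_path)    # unequal histograms: the graphs are not isomorphic
```

Equal histograms do not prove isomorphism; unequal histograms do prove that the graphs differ.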

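GIN node update

The authors argue that if the Aggregate, Combine, and Readout functions of a GNN are injective (distinct inputs always map to distinct outputs), the GNN reaches this upper bound and is as powerful as the WL test.

They prove that when the node feature space $\mathcal{X}$ is countable, summing the neighbor features (Aggregate) and weighting the center node's feature by $1+\epsilon$ admits a function $f$ that makes the Combine step injective, that is,

$$h(c, X) = (1+\epsilon)\cdot f(c) + \sum_{x \in X} f(x)$$

is injective, where $c$ is the center node's feature and $X$ is the multiset of its neighbors' features. They further prove that any function $g$ over such pairs $(c, X)$ can be decomposed as

$$g(c, X) = \varphi\Big((1+\epsilon)\cdot f(c) + \sum_{x \in X} f(x)\Big)$$

for some function $\varphi$.

A multilayer perceptron (MLP), with its strong fitting capacity, is then used to learn $\varphi$ and $f$, which gives the MLP + SUM form of GIN:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1+\epsilon^{(k)})\cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big)$$

For the graph representation, the node features produced in each iteration are summed, and the per-iteration sums are concatenated:

$$h_G = \mathrm{CONCAT}\Big(\mathrm{READOUT}\big(\{h_v^{(k)} \mid v \in G\}\big) \;\Big|\; k = 0, 1, \dots, K\Big)$$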
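A small numeric check helps show why sum aggregation and the $1+\epsilon$ weight matter. The sketch below is illustrative only: `aggregate` and `combine` are hypothetical helpers, features are scalars, and $f$ is taken as the identity, so it shows failure cases of mean and max rather than proving that sum is injective in general.

```python
def aggregate(multiset, how):
    """Aggregate a multiset of scalar neighbor features."""
    if how == "sum":
        return sum(multiset)
    if how == "mean":
        return sum(multiset) / len(multiset)
    if how == "max":
        return max(multiset)
    raise ValueError(f"unknown aggregator: {how}")


def combine(center, neighbors, eps):
    """(1 + eps) * center + sum(neighbors), with f taken as the identity."""
    return (1.0 + eps) * center + sum(neighbors)


one_neighbor = [1.0]
two_neighbors = [1.0, 1.0]
for how in ("mean", "max", "sum"):
    # mean and max give the same value for both multisets; sum does not.
    print(how, aggregate(one_neighbor, how), aggregate(two_neighbors, how))

# With eps = 0 the center and its single neighbor can swap roles unnoticed;
# a non-zero eps separates the two cases.
print(combine(1.0, [2.0], eps=0.0), combine(2.0, [1.0], eps=0.0))
print(combine(1.0, [2.0], eps=0.5), combine(2.0, [1.0], eps=0.5))
```

With raw scalar features a plain sum can still collide (for example {1, 2} and {3}), which is why the result above relies on a suitable function $f$.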

Full code: //github.com/wangyouze/tf_geometric/blob/sage/demo/demo_gin.py

Tutorial contents

* Development environment
* GIN implementation
* Model construction
* GIN training
* GIN evaluation

Development environment

Operating system: Windows / Linux / Mac OS

        pip install -U tf_geometric # Uses your existing TensorFlow; requires tensorflow/tensorflow-gpu >= 1.14.0 or >= 2.0.0b1

        pip install -U tf_geometric[tf1-cpu] # Installs the TensorFlow 1.x CPU build automatically

        pip install -U tf_geometric[tf1-gpu] # Installs the TensorFlow 1.x GPU build automatically

        pip install -U tf_geometric[tf2-cpu] # Installs the TensorFlow 2.x CPU build automatically

        pip install -U tf_geometric[tf2-gpu] # Installs the TensorFlow 2.x GPU build automatically

The core library used in this tutorial is tf_geometric, a GNN library built on TensorFlow. Detailed tf_geometric tutorials are available on its GitHub home page.

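GIN implementation

GIN aggregates node information with:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1+\epsilon^{(k)})\cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big)$$

Implementing GIN is simple. First aggregate the first-order neighborhood of each center node, $\sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}$. tf_geometric provides a convenient API for neighborhood aggregation:

```python
h = aggregate_neighbors(
    x, edge_index, edge_weight,
    identity_mapper,
    sum_reducer,
    identity_updater
)
```

Then compute $(1+\epsilon^{(k)})\cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}$ to update the center node's representation. $\epsilon$ can be a learnable parameter or a fixed value:

```python
h = x * (1 + eps) + h
```

Finally, an MLP fits the feature transformations $\varphi$ and $f$:

```python
h = mlp(h)
if activation is not None:
    h = activation(h)

return h
```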
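To make the sum aggregation over an edge list concrete, here is a NumPy sketch of the same update without the MLP. It is an illustration, not the tf_geometric implementation: `gin_update` is a hypothetical name, and the edge list is assumed to store each undirected edge in both directions.

```python
import numpy as np


def gin_update(x, edge_index, eps=0.0):
    """Return (1 + eps) * x_v + sum of neighbor features for every node v.

    x: [num_nodes, num_features]; edge_index: [2, num_edges] (row, col).
    """
    row, col = edge_index
    h = np.zeros_like(x)
    # Scatter-add: the feature of node col[e] is added to node row[e].
    np.add.at(h, row, x[col])
    return (1.0 + eps) * x + h


# A 3-node path 0 - 1 - 2, each edge stored in both directions.
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
x = np.array([[1.0], [2.0], [3.0]])
updated = gin_update(x, edge_index, eps=0.0)
print(updated.ravel())
```

`np.add.at` is used instead of `h[row] += x[col]` because the latter silently drops repeated indices.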

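Model construction

Import the required libraries. The core library is tf_geometric, which we use to load and preprocess graph data and to build the graph neural network. The GIN implementation itself was covered above; later we also use keras.metrics.Accuracy to evaluate the model.

```python
# coding=utf-8
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # select the GPU before TensorFlow initializes
import tensorflow as tf
import tf_geometric as tfg  # added: the code below refers to tfg.*
import numpy as np
from tensorflow import keras
from sklearn.model_selection import train_test_split
```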

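We train and evaluate on NCI1, one of the biological datasets used in the paper. Loading NCI1 for the first time is expected to take a few minutes. After the first preprocessing pass, tf_geometric saves the result so that later calls can reuse it. A TU dataset contains node labels, node attributes, and so on; each graph is stored as a dict, and the graphs are returned as a list of dicts.

```python
graph_dicts = tfg.datasets.TUDataset("NCI1").load_data()
```

Build the Graph objects yourself from the three inputs of a graph model: node features, edge connectivity, and labels. GIN aims to learn the network topology when the model does not rely on input node features, so for NCI1 we use the one-hot encoded node labels as input features (convert_node_labels_to_one_hot turns node labels into node features; it is very simple, and its implementation can be found in the source code).

```python
def construct_graph(graph_dict):
    return tfg.Graph(
        x=convert_node_labels_to_one_hot(graph_dict["node_labels"]),
        edge_index=graph_dict["edge_index"],
        y=graph_dict["graph_label"]  # graph_dict["graph_label"] is a list with one int element
    )

graphs = [construct_graph(graph_dict) for graph_dict in graph_dicts]
```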

Define the model. Following the paper, the model has five GIN layers as hidden layers, each with a two-layer MLP that learns the feature transformations $\varphi$ and $f$. Each hidden layer is followed by batch normalization (to curb vanishing and exploding gradients).

```python
class GINPoolNetwork(keras.Model):
    def __init__(self, num_gins, units, num_classes, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.gins = [
            tfg.layers.GIN(
                keras.Sequential([
                    keras.layers.Dense(units, activation=tf.nn.relu),
                    keras.layers.Dense(units),
                    keras.layers.BatchNormalization()
                ])
            )
            for _ in range(num_gins)  # num_gins blocks
        ]

        self.mlp = keras.Sequential([
            keras.layers.Dense(128, activation=tf.nn.relu),
            keras.layers.Dropout(0.3),
            keras.layers.Dense(num_classes)
        ])

    def call(self, inputs, training=False, mask=None):

        if len(inputs) == 4:
            x, edge_index, edge_weight, node_graph_index = inputs
        else:
            x, edge_index, node_graph_index = inputs
            edge_weight = None

        hidden_outputs = []
        h = x

        for gin in self.gins:
            h = gin([h, edge_index, edge_weight], training=training)
            hidden_outputs.append(h)
```

The output of every hidden layer is sum-pooled, and the five pooled results are concatenated and passed through a nonlinear transformation to produce the output:

$$h_G = \mathrm{CONCAT}\Big(\sum_{v \in G} h_v^{(k)} \;\Big|\; k = 1, \dots, 5\Big)$$

```python
        # Continuation of GINPoolNetwork.call
        h = tf.concat(hidden_outputs, axis=-1)
        h = tfg.nn.sum_pool(h, node_graph_index)
        logits = self.mlp(h, training=training)
        return logits  # the original returned the undefined name `logit`
```
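The readout can be checked on toy data with NumPy. This sketch assumes a hypothetical `sum_pool` helper that mirrors the idea of summing node features per graph; it also shows that concatenating layer outputs before pooling gives the same result as pooling each layer and concatenating afterwards, because the sum acts on each column independently.

```python
import numpy as np


def sum_pool(h, node_graph_index, num_graphs):
    """Sum node features per graph; node_graph_index[i] is the graph of node i."""
    out = np.zeros((num_graphs, h.shape[1]), dtype=h.dtype)
    np.add.at(out, node_graph_index, h)
    return out


# Three nodes in a batch of two graphs: nodes 0 and 1 belong to graph 0.
node_graph_index = np.array([0, 0, 1])
layer1 = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
layer2 = np.array([[0.5], [0.5], [2.0]])

concat_then_pool = sum_pool(np.concatenate([layer1, layer2], axis=-1),
                            node_graph_index, num_graphs=2)
pool_then_concat = np.concatenate(
    [sum_pool(layer1, node_graph_index, 2), sum_pool(layer2, node_graph_index, 2)],
    axis=-1,
)
print(concat_then_pool)
```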

GIN training

Split the dataset:

```python
train_graphs, test_graphs = train_test_split(graphs, test_size=0.1)
```

Count the label classes:

```python
num_classes = np.max([graph.y[0] for graph in graphs]) + 1
```

Initialize the model:

```python
# The original line was `model = GIN(32)`, which does not match the class defined above;
# five GIN layers follow the model description, and 32 units follow the original argument.
model = GINPoolNetwork(num_gins=5, units=32, num_classes=num_classes)
```

Training follows the same pattern as other TensorFlow models: define an optimizer, compute the loss and gradients, and backpropagate. The training graphs are fed to the model in batches; tf_geometric's create_graph_generator function splits a list of graphs into batches.

```python
# batch_size must be defined beforehand; its value is not given in this tutorial.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
train_batch_generator = create_graph_generator(train_graphs, batch_size, shuffle=True, infinite=True)

best_test_acc = 0
for step in range(0, 1000):
    batch_graph = next(train_batch_generator)
    with tf.GradientTape() as tape:
        inputs = [batch_graph.x, batch_graph.edge_index, batch_graph.edge_weight,
                  batch_graph.node_graph_index]
        logits = model(inputs, training=True)
        losses = tf.nn.softmax_cross_entropy_with_logits(
            logits=logits,
            labels=tf.one_hot(batch_graph.y, depth=num_classes)
        )

        loss = tf.reduce_mean(losses)

    vars = tape.watched_variables()
    grads = tape.gradient(loss, vars)
    optimizer.apply_gradients(zip(grads, vars))

    if step % 10 == 0:
        train_acc = evaluate(train_graphs, batch_size)
        test_acc = evaluate(test_graphs, batch_size)

        if best_test_acc < test_acc:
            best_test_acc = test_acc

        print("step = {}\tloss = {}\ttrain_acc = {}\ttest_acc={}".format(step, loss, train_acc, best_test_acc))
```

GIN evaluation

To evaluate the model, the graphs are fed to the model in batches and keras.metrics.Accuracy computes the accuracy.

```python
# The original signature was `evaluate()` and it called an undefined `forward`;
# the training loop calls evaluate(graphs, batch_size), so the function takes both arguments.
def evaluate(graphs, batch_size):
    accuracy_m = keras.metrics.Accuracy()

    for batch_graph in create_graph_generator(graphs, batch_size, shuffle=False, infinite=False):
        inputs = [batch_graph.x, batch_graph.edge_index, batch_graph.edge_weight,
                  batch_graph.node_graph_index]
        logits = model(inputs)
        preds = tf.argmax(logits, axis=-1)
        accuracy_m.update_state(batch_graph.y, preds)

    return accuracy_m.result().numpy()
```

Results

```
step = 0 loss = 12.347851753234863 train_acc = 0.49905380606651306 test_acc=0.5036496520042419
step = 10 loss = 0.8783968091011047 train_acc = 0.5509597063064575 test_acc=0.525547444820404
step = 20 loss = 0.6645355820655823 train_acc = 0.54044 test_acc=0.525547444820404
step = 30 loss = 0.65831 train_acc = 0.5904298424720764 test_acc=0.5790753960609436
...
step = 820 loss = 0.363844 train_acc = 0.8553662896156311 test_acc=0.89297
step = 830 loss = 0.33948060870170593 train_acc = 0.8645579814910889 test_acc=0.82486
step = 840 loss = 0.3843861520290375 train_acc = 0.8599621653556824 test_acc=0.82486
step = 850 loss = 0.3698282241821289 train_acc = 0.850229799747467 test_acc=0.82486
```




**Full code**

demo_gin.py:
 //github.com/wangyouze/tf_geometric/blob/sage/demo/demo_gin.py

GitHub repository for this tutorial (part of the **GNN-algorithms** series):
//github.com/wangyouze/GNN-algorithms


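GNN-algorithms tutorial series, part 6: Multi-Kernel Topology Graph Convolution (TAGCN)

August 8, 2020

[Overview] Spectral graph convolutional networks approximate the convolution kernel with polynomials to avoid high computational cost, but the approximation costs model performance. TAGCN is a spatial-domain graph convolution model that learns the topological features of a graph with several fixed-size convolution kernels. In essence, TAGCN matches convolution in CNNs. Combining the theory with code, this tutorial walks through building a TAGCN model with TensorFlow for node classification on the Cora dataset.

Preface

An earlier part of this series introduced SGC, a variant of GCN. This part continues with another GCN variant, TAGCN, and shows how to build it with TensorFlow for node classification. The full code is available on GitHub.

TAGCN overview

TAGCN, short for Topology Adaptive Graph Convolutional Networks, is a variant of GCN. GCN approximates the convolution kernel with Chebyshev polynomials and then fixes k=1. TAGCN instead uses K graph convolution kernels to extract local features at different scales and keeps K as a hyperparameter. The K kernels have receptive fields of size 1 to K, much like GoogLeNet, where each convolutional layer uses kernels of several sizes.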

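The TAGCN convolution proceeds as follows.

Add self-loops and normalize the adjacency matrix:

$$\tilde{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$$

The powers $\tilde{A}^{k}$ are the graph convolution kernels and $g_k$ is the coefficient of each kernel; unlike GCN, TAGCN keeps the hyperparameter K (the $k=0$ term is the node's own feature):

$$G = \sum_{k=0}^{K} g_k \tilde{A}^{k}$$

The kernels extract features from the graph-structured data, and the results are combined linearly:

$$y = \sum_{k=0}^{K} g_k \tilde{A}^{k} x + b$$

Following the CNN design, a nonlinearity $\sigma$ is then applied:

$$h = \sigma(y)$$

The figure shows the TAGCN convolution for K=3, which resembles a CNN layer where several kernels extract feature maps that form multiple channels.

The features extracted by the kernels are then combined linearly:

$$y = \sum_{k=0}^{3} g_k \tilde{A}^{k} x + b$$

Summary:

1. Like a CNN, TAGCN uses K graph convolution kernels in each layer to extract local features at different scales. This avoids the earlier drawback of approximating the kernel, which prevented the model from extracting graph information fully, and it improves the model's expressive power.
2. TAGCN works on both undirected and directed graphs, and because only the coefficients of the adjacency matrix polynomial need to be computed, the computational cost is lower.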

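Tutorial contents

* Development environment
* TAGCN implementation
* Model construction
* TAGCN training
* TAGCN evaluation

Development environment

Operating system: Windows / Linux / Mac OS

        pip install -U tf_geometric # Uses your existing TensorFlow; requires tensorflow/tensorflow-gpu >= 1.14.0 or >= 2.0.0b1

        pip install -U tf_geometric[tf1-cpu] # Installs the TensorFlow 1.x CPU build automatically

        pip install -U tf_geometric[tf1-gpu] # Installs the TensorFlow 1.x GPU build automatically

        pip install -U tf_geometric[tf2-cpu] # Installs the TensorFlow 2.x CPU build automatically

        pip install -U tf_geometric[tf2-gpu] # Installs the TensorFlow 2.x GPU build automatically

The core library used in this tutorial is tf_geometric, a GNN library built on TensorFlow. Detailed tf_geometric tutorials are available on its GitHub home page.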

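TAGCN implementation

First add self-loops to the adjacency matrix and normalize it. The list xs stores the feature maps extracted by the polynomial kernels.

```python
xs = [x]
updated_edge_index, normed_edge_weight = gcn_norm_edge(edge_index, x.shape[0], edge_weight, renorm, improved, cache)
```

Each graph convolution kernel then gathers neighborhood information for every node, that is, it evaluates one more power of the polynomial, and the results are appended to xs in order:

```python
for k in range(K):
    h = aggregate_neighbors(
        xs[-1], updated_edge_index, normed_edge_weight,
        gcn_mapper,
        sum_reducer,
        identity_updater
    )

    xs.append(h)
```

Concatenate the feature maps extracted by the K kernels and apply a linear transformation to produce the output:

```python
h = tf.concat(xs, axis=-1)

out = h @ kernel
if bias is not None:
    out += bias

if activation is not None:
    out = activation(out)

return out
```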
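The same computation can be written with a dense adjacency matrix in NumPy, which makes the link to a matrix polynomial explicit. This is a sketch under assumptions, not tf_geometric's code: `normalize_adj` and `tagcn_layer` are hypothetical names, the graph is small and undirected, and bias and activation are omitted.

```python
import numpy as np


def normalize_adj(adj):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def tagcn_layer(x, adj, kernel, K):
    """Concatenate [X, AX, ..., A^K X] and apply one linear map."""
    a_norm = normalize_adj(adj)
    xs = [x]
    for _ in range(K):
        xs.append(a_norm @ xs[-1])  # one more hop per iteration
    return np.concatenate(xs, axis=-1) @ kernel


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
x = rng.normal(size=(4, 3))        # 4 nodes, 3 features
kernel = rng.normal(size=(9, 2))   # (K + 1) * 3 input columns, 2 output units
out = tagcn_layer(x, adj, kernel, K=2)

# The concatenation form equals the sum over per-hop weight blocks W_k.
blocks = kernel.reshape(3, 3, 2)
a_norm = normalize_adj(adj)
poly = sum(np.linalg.matrix_power(a_norm, k) @ x @ blocks[k] for k in range(3))
print(np.allclose(out, poly))
```

Concatenating the hop features and multiplying by one kernel is just a compact way to apply a separate weight block to each power of the normalized adjacency matrix.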

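Model construction

Import the required libraries. The core library is tf_geometric, which we use to load and preprocess graph data and to build the graph neural network. The TAGCN implementation itself was covered above; later we also use keras.metrics.Accuracy to evaluate the model.

```python
# coding=utf-8
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import tensorflow as tf
import numpy as np
from tensorflow import keras
from tf_geometric.layers.conv.tagcn import TAGCN
from tf_geometric.datasets.cora import CoraDataset
```

Load the Cora dataset through the graph data interface that ships with **tf_geometric**:

```python
graph, (train_index, valid_index, test_index) = CoraDataset().load_data()
```

Define the model. A Dropout layer from keras.layers randomly disables neurons to reduce overfitting. Because Dropout behaves differently during training and prediction, the training argument decides whether it is active.

```python
# Added: num_classes is used below but not defined in the original snippet;
# this assumes graph.y holds integer class labels.
num_classes = np.max(graph.y) + 1

tagcn0 = TAGCN(16)
tagcn1 = TAGCN(num_classes)
dropout = keras.layers.Dropout(0.3)

def forward(graph, training=False):
    h = tagcn0([graph.x, graph.edge_index, graph.edge_weight], cache=graph.cache)
    h = dropout(h, training=training)
    h = tagcn1([h, graph.edge_index, graph.edge_weight], cache=graph.cache)

    return h
```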





**TAGCN training**

Training follows the same pattern as other TensorFlow models: define an optimizer, compute the loss and gradients, and backpropagate. The TAGCN paper evaluates the model after 100 training epochs, so we set the number of epochs to 100.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

best_test_acc = tmp_valid_acc = 0
for step in range(1, 101):
    with tf.GradientTape() as tape:
        logits = forward(graph, training=True)
        loss = compute_loss(logits, train_index, tape.watched_variables())

    vars = tape.watched_variables()
    grads = tape.gradient(loss, vars)
    optimizer.apply_gradients(zip(grads, vars))

    valid_acc = evaluate(valid_index)
    test_acc = evaluate(test_index)
    if test_acc > best_test_acc:
        best_test_acc = test_acc
        tmp_valid_acc = valid_acc
    print("step = {}\tloss = {}\tvalid_acc = {}\tbest_test_acc = {}".format(step, loss, tmp_valid_acc, best_test_acc))
```

The loss is the cross-entropy. Loading Cora returns the whole graph together with train_mask, valid_mask, and test_mask. TAGCN takes the whole graph as input during training, and the training loss is computed only on the nodes selected by train_mask, so the mask_index passed in here is train_index. Because this is a multi-class task, node labels are converted to one-hot vectors so that they match the dimensions of the model output. Graph neural networks overfit small datasets very easily, so L2 regularization is added to reduce overfitting.

```python
def compute_loss(logits, mask_index, vars):
    masked_logits = tf.gather(logits, mask_index)
    masked_labels = tf.gather(graph.y, mask_index)
    losses = tf.nn.softmax_cross_entropy_with_logits(
        logits=masked_logits,
        labels=tf.one_hot(masked_labels, depth=num_classes)
    )

    # L2 penalty on all kernel (weight) variables
    kernel_vals = [var for var in vars if "kernel" in var.name]
    l2_losses = [tf.nn.l2_loss(kernel_var) for kernel_var in kernel_vals]

    return tf.reduce_mean(losses) + tf.add_n(l2_losses) * 5e-4
```




**TAGCN evaluation**

To evaluate the model, pass in valid_mask or test_mask. tf.gather then selects the predictions and true labels of the validation or test nodes, and keras.metrics.Accuracy computes the accuracy.

```python
def evaluate(mask):
    logits = forward(graph)
    logits = tf.nn.log_softmax(logits, axis=-1)
    masked_logits = tf.gather(logits, mask)
    masked_labels = tf.gather(graph.y, mask)

    y_pred = tf.argmax(masked_logits, axis=-1, output_type=tf.int32)

    accuracy_m = keras.metrics.Accuracy()
    accuracy_m.update_state(masked_labels, y_pred)
    return accuracy_m.result().numpy()
```




**Results**

```
step = 1 loss = 1.9557496309280396 valid_acc = 0.3240000009536743 best_test_acc = 0.3700000047683716
step = 2 loss = 1.69263 valid_acc = 0.45869 best_test_acc = 0.54885
step = 3 loss = 1.3922057151794434 valid_acc = 0.5220000147819519 best_test_acc = 0.5849999785423279
step = 4 loss = 1.8711 valid_acc = 0.6539999842643738 best_test_acc = 0.73694
...
step = 96 loss = 0.03752553462982178 valid_acc = 0.7960000038146973 best_test_acc = 0.8209999799728394
step = 97 loss = 0.03963441401720047 valid_acc = 0.7960000038146973 best_test_acc = 0.8209999799728394
step = 98 loss = 0.048121 valid_acc = 0.7960000038146973 best_test_acc = 0.8209999799728394
step = 99 loss = 0.03467567265033722 valid_acc = 0.7960000038146973 best_test_acc = 0.8209999799728394
step = 100 loss = 0.035629 valid_acc = 0.7960000038146973 best_test_acc = 0.8209999799728394
```


**Full code**

Full code for this tutorial:
demo_tagcn.py: //github.com/CrawlScript/tf_geometric/blob/master/demo/demo_tagcn.py

GitHub repository for this tutorial (part of the **GNN-algorithms** series):
//github.com/wangyouze/GNN-algorithms


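GNN-algorithms tutorial series, part 4: GraphSAGE, the Master of Inductive Learning

August 6, 2020

[Overview] Even with GCN at the height of its popularity, its major flaw should not be overlooked: it cannot quickly produce representations for new nodes, which limits where it can be applied. GraphSAGE is an inductive method. It learns the topology of a graph by learning aggregation functions, so features for new nodes are easy to obtain, and the resulting node features change as the neighborhood changes. This article uses the MaxPooling aggregator as an example and walks through building a GraphSAGE model.

Preface

In this tutorial we use TensorFlow to build the MaxPooling aggregator of the GraphSAGE framework on the PPI (protein-protein interaction) dataset and perform supervised node label prediction. The full code is available on GitHub.

GraphSAGE overview

GraphSAGE is an inductive learning framework that uses node attributes to generate representations for unseen nodes efficiently on very large graphs. It can produce low-dimensional node embeddings and works especially well on graphs with rich node attributes. Most existing frameworks are transductive: they learn representations on one fixed graph, cannot embed nodes that were not seen during training, and cannot learn node representations across graphs. As an inductive framework, GraphSAGE uses rich node attributes to generate representations for unseen nodes.

The core idea of GraphSAGE is to learn a function that aggregates the representations of neighboring nodes to produce the center node's representation, instead of learning an embedding for each node. It supports both supervised and unsupervised learning. GraphSAGE offers the following aggregators:

* Mean Aggregator: nearly equivalent to the convolutional propagation in GCN. It averages the feature vectors of the center node's neighbors and concatenates the result with the center node's feature vector, with two nonlinear transformations in between.
* GCN Aggregator: the inductive version of GCN.
* Pooling Aggregator: first applies a nonlinear transformation to the neighbors' representations, then pools the transformed vectors (mean pooling or max pooling). The pooled result and the center node's representation each go through a nonlinear transformation, and the two results are concatenated or added to give the center node's representation at that layer.
* LSTM Aggregator: feeds the center node's neighbors, in random order, to an LSTM as an input sequence. The output and the center node's representation each go through a nonlinear transformation and are concatenated to give the center node's representation at that layer. LSTMs are designed for sequences, while neighbors have no natural order, so the neighbors are shuffled before being fed to the LSTM.

Below we use the MaxPooling aggregator to build a GraphSAGE model for supervised classification.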

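Tutorial contents

* The PPI dataset
* Development environment
* Implementation of max_pooling_graph_sage
* GraphSAGE training
* GraphSAGE evaluation

The PPI dataset

The PPI (protein-protein interaction networks) dataset consists of 24 graphs, each corresponding to a different human tissue. Of these, 20 graphs are used for training, 2 for validation, and 2 for testing. Each graph has 2372 nodes on average, and each node has 50 features. The test graphs do not overlap with the training graphs, so they are unseen during training. Each node can carry several labels, and there are 121 label types in total.

Development environment

Operating system: Windows / Linux / Mac OS

Python version: >= 3.5

Dependencies: tf_geometric (a GNN library built on TensorFlow)

Depending on your environment (whether TensorFlow is already installed, whether you need a GPU), choose one of the commands below to install all Python dependencies in one step:

```
pip install -U tf_geometric # Uses your existing TensorFlow; requires tensorflow/tensorflow-gpu >= 1.14.0 or >= 2.0.0b1

pip install -U tf_geometric[tf1-cpu] # Installs the TensorFlow 1.x CPU build automatically

pip install -U tf_geometric[tf1-gpu] # Installs the TensorFlow 1.x GPU build automatically

pip install -U tf_geometric[tf2-cpu] # Installs the TensorFlow 2.x CPU build automatically

pip install -U tf_geometric[tf2-gpu] # Installs the TensorFlow 2.x GPU build automatically
```

The core library used in this tutorial is tf_geometric, a GNN library built on TensorFlow. Detailed tf_geometric tutorials are available on its GitHub home page:

//github.com/CrawlScript/tf_geometric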



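**Implementation of max_pooling_graph_sage**

The MaxPooling aggregator is a single-layer neural network followed by a max-pooling operation. The neighbor vectors of each center node are first passed through a nonlinear layer. Because tf_geometric performs graph operations on an edge list, we first use tf.gather to build the feature matrix made of every node's neighbor feature vectors.

```python
row, col = edge_index
repeated_x = tf.gather(x, row)
neighbor_x = tf.gather(x, col)
```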


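row is the sequence of source nodes in the graph, col is the sequence of target nodes, and x is the node feature matrix. tf.gather selects rows of the feature matrix by node index and stacks them into the feature matrix of all neighbor nodes. The figure below illustrates how tf.gather works:
![](//cdn.zhuanzhi.ai/vfiles/004896f9f92645f6ac3434a1a843986c)

tf.gather: //www.tensorflow.org/api_docs/python/tf/gather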
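For readers without the figure, the gather step on an edge list can be reproduced with NumPy integer indexing, which selects rows the same way `tf.gather` does along axis 0. The toy features and edges below are made up for illustration.

```python
import numpy as np

# Toy node feature matrix: 3 nodes, 2 features each.
x = np.array([[10.0, 0.0],
              [20.0, 1.0],
              [30.0, 2.0]])

# Edge list of shape [2, num_edges]: edges 0->1, 0->2, 2->1.
edge_index = np.array([[0, 0, 2],
                       [1, 2, 1]])
row, col = edge_index

repeated_x = x[row]   # feature of the source node of every edge
neighbor_x = x[col]   # feature of the target (neighbor) node of every edge
print(repeated_x)
print(neighbor_x)
```

Both results have one row per edge, so node features are repeated as often as the node appears in the edge list.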
Compute the weighted neighbor feature vectors:

    neighbor_x = gcn_mapper(repeated_x, neighbor_x, edge_weight=edge_weight)

Before max-pooling, all neighbor feature vectors are passed through a fully connected layer to compute the neighbors' representations. (The MLP can be viewed as a set of functions.)

```python
neighbor_x = dropout(neighbor_x)
h = neighbor_x @ mlp_kernel
if mlp_bias is not None:
    h += mlp_bias

if activation is not None:
    h = activation(h)
```

Max-pool the neighbor feature vectors, then concatenate the result with the transformed center node feature vector to produce the output. An ideal aggregator should be simple, learnable, and symmetric. In other words, it should learn how to aggregate neighbor representations, be insensitive to the order of the neighbors, and not add a large training cost.

```python
reduced_h = max_reducer(h, row, num_nodes=len(x))
reduced_h = dropout(reduced_h)
x = dropout(x)

from_neighs = reduced_h @ neighs_kernel
from_x = x @ self_kernel
output = tf.concat([from_neighs, from_x], axis=1)
if bias is not None:
    output += bias
if activation is not None:
    output = activation(output)
if normalize:
    output = tf.nn.l2_normalize(output, axis=-1)

return output
```
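The max-pooling reduction over an edge list, and its insensitivity to neighbor order, can be sketched with NumPy. `max_reduce` is a hypothetical stand-in for the reducer used above, and filling nodes without neighbors with zeros is an assumption of this sketch.

```python
import numpy as np


def max_reduce(h, row, num_nodes):
    """Element-wise max of the edge features h grouped by center node row[e]."""
    out = np.full((num_nodes, h.shape[1]), -np.inf)
    np.maximum.at(out, row, h)
    out[out == -np.inf] = 0.0  # assumption: nodes without neighbors get zeros
    return out


# Transformed neighbor features, one row per edge.
h = np.array([[1.0, 5.0],
              [4.0, 2.0],
              [7.0, 7.0]])
row = np.array([0, 0, 1])  # center node of each edge; node 2 has no neighbors

pooled = max_reduce(h, row, num_nodes=3)

# Shuffling the edges does not change the result: max is symmetric.
perm = np.array([2, 0, 1])
pooled_shuffled = max_reduce(h[perm], row[perm], num_nodes=3)
print(pooled)
```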




**Model construction**

Import the required libraries.
The core library is **tf_geometric**, which makes it easy to load datasets, preprocess graph data, and build graph neural networks. We also use Dropout from tf.keras.layers to reduce overfitting and the micro f1_score function from sklearn as the evaluation metric.

```python
# coding=utf-8
import os
import tensorflow as tf
from tensorflow import keras
import numpy as np
from tf_geometric.layers.conv.graph_sage import MaxPoolingGraphSage
from tf_geometric.datasets.ppi import PPIDataset
from sklearn.metrics import f1_score
from tqdm import tqdm
from tf_geometric.utils.graph_utils import RandomNeighborSampler
```



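Load the dataset. We use the PPI dataset that ships with **tf_geometric**. tf_geometric offers a very simple interface for building graph data: passing plain Python lists or NumPy arrays as node features and an edge list is enough to build your own dataset. See the GIN tutorial for an example.

    train_graphs, valid_graphs, test_graphs = PPIDataset().load_data()

The call returns the dataset already split into training (20 graphs), validation (2 graphs), and test (2 graphs) sets.

Sample the neighbors of every node in each graph.
Nodes have different numbers of neighbors, so for efficiency we sample a fixed number of neighbors per node and use them when aggregating neighborhood information. Let the sample size be num_sample. If a node has more than num_sample neighbors, we sample without replacement. If it has fewer, we sample with replacement until num_sample neighbors are collected. RandomNeighborSampler preprocesses each graph in advance and binds the resulting information to that graph.

```python
# traverse all graphs
for graph in train_graphs + valid_graphs + test_graphs:
    neighbor_sampler = RandomNeighborSampler(graph.edge_index)
    graph.cache["sampler"] = neighbor_sampler
```

Note that the model may be applied to several graphs. To keep the sampled neighbors of different graphs from getting mixed up, the sampling state is bound to each Graph object, that is, stored in its "cache" dictionary.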
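The sampling rule described above can be sketched in plain Python. `sample_neighbors` is a hypothetical helper that only illustrates the rule; it is not how RandomNeighborSampler is implemented, and returning an empty list for isolated nodes is an assumption.

```python
import random


def sample_neighbors(neighbors, num_sample, rng):
    """Sample exactly num_sample neighbors (fewer only for isolated nodes)."""
    if not neighbors:
        return []  # assumption: isolated nodes contribute no neighbors
    if len(neighbors) >= num_sample:
        # Enough neighbors: sample without replacement.
        return rng.sample(neighbors, num_sample)
    # Too few neighbors: sample with replacement until num_sample are drawn.
    return [rng.choice(neighbors) for _ in range(num_sample)]


rng = random.Random(0)
adjacency = {0: [1, 2, 3, 4, 5], 1: [0, 2], 2: []}
sampled = {v: sample_neighbors(ns, 3, rng) for v, ns in adjacency.items()}
print(sampled)
```

A fixed sample size gives every node the same number of neighbor rows, which keeps the per-layer cost predictable.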

Two MaxPooling aggregator layers aggregate the information carried by neighboring nodes.

```python
# num_classes must be defined beforehand; PPI has 121 label types.
graph_sages = [
    MaxPoolingGraphSage(units=128, activation=tf.nn.relu),
    MaxPoolingGraphSage(units=128, activation=tf.nn.relu)
]

fc = tf.keras.Sequential([
    keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes)
])

num_sampled_neighbors_list = [25, 10]

def forward(graph, training=False):
    neighbor_sampler = graph.cache["sampler"]
    h = graph.x
    for i, (graph_sage, num_sampled_neighbors) in enumerate(zip(graph_sages, num_sampled_neighbors_list)):
        sampled_edge_index, sampled_edge_weight = neighbor_sampler.sample(k=num_sampled_neighbors)
        h = graph_sage([h, sampled_edge_index, sampled_edge_weight], training=training)
    h = fc(h, training=training)
    return h
```

The two MaxPooling layers sample 25 and 10 neighbors respectively. RandomNeighborSampler has already prepared the structure information of each graph, so each layer only needs to sample according to its own num_sampled_neighbors (neighbor_sampler.sample()). The sampled edges sampled_edge_index, the edge weights sampled_edge_weight, and the node feature vectors are then fed to the GraphSAGE layer. Because Dropout behaves differently during training and prediction, the training argument decides whether it is active.

**GraphSAGE training**

Training follows the same pattern as other TensorFlow models: define an optimizer, compute the loss and gradients, and backpropagate. During training, forward is called with training=True, so Dropout is applied. During prediction, when the input is valid_graphs or test_graphs, forward is called with training=False and Dropout is skipped. The GraphSAGE paper evaluates the model after 10 training epochs, so we set the number of epochs to 10.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)

for epoch in tqdm(range(10)):
    for graph in train_graphs:
        with tf.GradientTape() as tape:
            logits = forward(graph, training=True)
            loss = compute_loss(logits, tape.watched_variables())

        vars = tape.watched_variables()
        grads = tape.gradient(loss, vars)
        optimizer.apply_gradients(zip(grads, vars))

    if epoch % 1 == 0:
        valid_f1_mic = evaluate(valid_graphs)
        test_f1_mic = evaluate(test_graphs)
        print("epoch = {}\tloss = {}\tvalid_f1_micro = {}".format(epoch, loss, valid_f1_mic))
        print("epoch = {}\ttest_f1_micro = {}".format(epoch, test_f1_mic))
```


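Computing the loss

Each node in PPI carries several labels, which makes this a multi-label classification task, so we use the sigmoid cross-entropy. Here logits are the model's predictions for the node labels and graph.y holds the true labels. To prevent overfitting, L2 regularization is applied to the model parameters.

```python
def compute_loss(logits, vars):
    # Note: `graph` is the loop variable of the training loop (a module-level name).
    losses = tf.nn.sigmoid_cross_entropy_with_logits(
        logits=logits,
        labels=tf.convert_to_tensor(graph.y, dtype=tf.float32)
    )

    kernel_vals = [var for var in vars if "kernel" in var.name]
    l2_losses = [tf.nn.l2_loss(kernel_var) for kernel_var in kernel_vals]
    return tf.reduce_mean(losses) + tf.add_n(l2_losses) * 1e-5
```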
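For intuition about the per-label loss, the numerically stable form of the sigmoid cross-entropy, max(x, 0) - x*z + log(1 + exp(-|x|)) for logit x and label z, can be compared with the textbook definition in plain Python. Both function names are hypothetical and the inputs are made up.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_sigmoid_cross_entropy(logit, label):
    """Textbook form; overflows or hits log(0) for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)


# One node with three independent binary labels (multi-label setting).
logits = [2.0, -3.0, 0.5]
labels = [1.0, 1.0, 0.0]
per_label = [sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels)]
mean_loss = sum(per_label) / len(per_label)
print(per_label, mean_loss)

# The stable form stays finite where the naive one would fail.
print(sigmoid_cross_entropy(1000.0, 0.0))
```

Each label is scored independently, which is what allows a node to carry several labels at once.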





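**GraphSAGE evaluation**

We use the F1 score to measure how well MaxPoolingGraphSAGE classifies nodes by aggregating neighbor information. The test graphs (unseen during training) are fed to the trained model to obtain predictions. The predictions and their labels are then flattened into one-dimensional arrays and passed to sklearn's f1_score to obtain the F1 score.

```python
def evaluate(graphs):
    y_preds = []
    y_true = []

    for graph in graphs:
        y_true.append(graph.y)
        logits = forward(graph)
        y_preds.append(logits.numpy())

    y_pred = np.concatenate(y_preds, axis=0)
    y = np.concatenate(y_true, axis=0)

    # calc_f1 is not shown in this tutorial; see the full demo code.
    mic = calc_f1(y, y_pred)

    return mic
```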
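The helper `calc_f1` is not shown in the text. As an illustration of what a micro-averaged F1 computes, here is a plain-Python sketch; the name `micro_f1` and the choice of thresholding logits at 0 (sigmoid probability 0.5) are assumptions, and the demo's own helper may differ.

```python
def micro_f1(y_true, logits, threshold=0.0):
    """Micro-averaged F1: pool TP, FP, FN over all nodes and all labels."""
    tp = fp = fn = 0
    for true_row, logit_row in zip(y_true, logits):
        for t, z in zip(true_row, logit_row):
            pred = 1 if z > threshold else 0  # assumption: logit > 0 means positive
            if pred == 1 and t == 1:
                tp += 1
            elif pred == 1 and t == 0:
                fp += 1
            elif pred == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two nodes, three labels each.
y_true = [[1, 0, 1],
          [0, 1, 0]]
logits = [[2.0, -1.0, -0.5],
          [0.3, 1.2, -2.0]]
score = micro_f1(y_true, logits)
print(score)
```

Micro averaging counts every (node, label) pair equally, so frequent labels dominate the score.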




**Results**

```
epoch = 1 loss = 0.52325 valid_f1_micro = 0.45228990047917433
epoch = 1 test_f1_micro = 0.455067
epoch = 2 loss = 0.50827 valid_f1_micro = 0.4825462475136504
epoch = 2 test_f1_micro = 0.4882603340749235
epoch = 3 loss = 0.49998781085014343 valid_f1_micro = 0.4906942451215627
epoch = 3 test_f1_micro = 0.502555249743498
epoch = 4 loss = 0.49064 valid_f1_micro = 0.53833
epoch = 4 test_f1_micro = 0.5478608072643453
epoch = 5 loss = 0.484283983707428 valid_f1_micro = 0.5455753374297568
epoch = 5 test_f1_micro = 0.55046
epoch = 6 loss = 0.47615 valid_f1_micro = 0.54828
epoch = 6 test_f1_micro = 0.5504290907273931
epoch = 7 loss = 0.46836230158805847 valid_f1_micro = 0.5720065995217665
epoch = 7 test_f1_micro = 0.58437
epoch = 8 loss = 0.4760943651199341 valid_f1_micro = 0.5752257074185534
epoch = 8 test_f1_micro = 0.5855495700393325
epoch = 9 loss = 0.4628 valid_f1_micro = 0.58496
epoch = 9 test_f1_micro = 0.5930584548044271
epoch = 10 loss = 0.4568028450012207 valid_f1_micro = 0.5833869662874881
epoch = 10 test_f1_micro = 0.5964539684054789
```




**Full code**

Full code for this tutorial:
demo_graph_sage.py: //github.com/CrawlScript/tf_geometric/blob/master/demo/demo_graph_sage.py

GitHub repository for this tutorial (part of the **GNN-algorithms** series):

//github.com/wangyouze/GNN-algorithms


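GNN-algorithms tutorial series, part 3: Simplifying Graph Convolution All the Way (SGC)

August 5, 2020

GNN-algorithms tutorial series, part 2: Chebyshev Shows Its Power (ChebyNet)

[Overview] SGC simplifies GCN: it repeatedly removes the nonlinear transformations between GCN layers and collapses the resulting function into a single linear transformation, which removes GCN's extra complexity. Experiments show that these simplifications do not hurt accuracy in many downstream applications. This article combines the derivation with code and walks through building an SGC model with TensorFlow.

Preface

GCN is a classic graph neural network and has become the model most newcomers learn first, and modified versions of GCN have kept appearing in recent years. In that spirit, this tutorial shows how to build SGC, a GCN variant, with TensorFlow for node classification. The full code is available on GitHub.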

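SGC overview

SGC, short for Simplifying Graph Convolutional Networks, is a variant of GCN; the paper was published at ICML 2019. SGC removes the nonlinearities between GCN layers and turns the nonlinear GCN into a simple linear model. This reduces model complexity and makes SGC more efficient than GCN and other GNN models on many tasks.

Below we compare GCN and SGC on node classification.

GCN for node classification:

1. Add self-loops to the adjacency matrix and normalize it: $S = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, with $\tilde{A} = A + I$.
2. Smooth the input node features: $\bar{H}^{(k)} = S H^{(k-1)}$.
3. Apply a nonlinear transformation to the node features: $H^{(k)} = \mathrm{ReLU}(\bar{H}^{(k)}\Theta^{(k)})$.

For node classification, a K-layer GCN can therefore be written as:

$$\hat{Y} = \mathrm{softmax}(S H^{(K-1)}\Theta^{(K)})$$

SGC removes the activation function between GCN layers, which turns the nonlinear transformations into linear ones. SGC for node classification:

1. Add self-loops to the adjacency matrix and normalize it: $S = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$.
2. Smooth the input node features: $\bar{H}^{(k)} = S H^{(k-1)}$.
3. Apply a linear transformation to the node features: $H^{(k)} = \bar{H}^{(k)}\Theta^{(k)}$.

For node classification, a K-layer SGC can therefore be written as:

$$\hat{Y} = \mathrm{softmax}(S \cdots S\, X\, \Theta^{(1)} \cdots \Theta^{(K)})$$

or, in short:

$$\hat{Y} = \mathrm{softmax}(S^{K} X \Theta)$$

The term $S^{K} X$ contains no learnable parameters and can be computed in advance, which greatly reduces the amount of computation. The comparison makes the similarities and differences between the two models clear. Next we implement SGC with tf_geometric (SGC is already part of the tf_geometric GNN library).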
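The collapse of stacked linear layers into $S^{K} X \Theta$ can be verified numerically. The NumPy sketch below uses a made-up 3-node graph and random weights, and it only covers K=2; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph: a path 0 - 1 - 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
a_tilde = adj + np.eye(3)                      # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
s = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 (A + I) D^-1/2

x = rng.normal(size=(3, 4))
theta1 = rng.normal(size=(4, 5))
theta2 = rng.normal(size=(5, 2))

# Two "GCN layers" with the nonlinearity removed ...
stacked = s @ (s @ x @ theta1) @ theta2
# ... equal one propagation S^2 X (computable once, before training)
# followed by a single linear map Theta = Theta1 @ Theta2.
precomputed = np.linalg.matrix_power(s, 2) @ x
collapsed = precomputed @ (theta1 @ theta2)
print(np.allclose(stacked, collapsed))
```

Because `precomputed` does not depend on any trainable weight, training reduces to fitting a single linear classifier on fixed features.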

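Tutorial contents

* Development environment
* SGC implementation
* Model construction
* SGC training
* SGC evaluation

Development environment

Operating system: Windows / Linux / Mac OS

Python version: >= 3.5

Dependencies: tf_geometric (a GNN library built on TensorFlow)

pip install -U tf_geometric[tf1-cpu] # Installs the TensorFlow 1.x CPU build automatically

pip install -U tf_geometric[tf1-gpu] # Installs the TensorFlow 1.x GPU build automatically

pip install -U tf_geometric[tf2-cpu] # Installs the TensorFlow 2.x CPU build automatically

pip install -U tf_geometric[tf2-gpu] # Installs the TensorFlow 2.x GPU build automatically

The core library used in this tutorial is tf_geometric, a GNN library built on TensorFlow. Detailed tf_geometric tutorials are available on its GitHub home page:
//github.com/CrawlScript/tf_geometric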



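**SGC implementation**

Add self-loops to the adjacency matrix and apply symmetric normalization:
![](//cdn.zhuanzhi.ai/vfiles/16ac36198a83fead03e861140736a39e)

```python
updated_edge_index, normed_edge_weight = gcn_norm_edge(edge_index, x.shape[0], edge_weight, renorm, improved, cache)
```

Compute $S^{K} X$ to enlarge the model's receptive field. aggregate_neighbors aggregates first-order neighbor information; repeating it K times aggregates the neighborhood within K hops of the center node.

```python
h = x
for _ in range(K):
    h = aggregate_neighbors(
        h,
        updated_edge_index,
        normed_edge_weight,
        gcn_mapper,
        sum_reducer,
        identity_updater
    )
```

Apply a linear transformation to the aggregated result, that is, compute $S^{K} X \Theta$, and return it:

```python
h = h @ kernel

if bias is not None:
    h += bias
return h
```

This implements the $S^{K} X \Theta$ part of SGC. All that remains is to add a softmax activation on the output of the last layer (to obtain probabilities), and the model is ready for node classification.

Model construction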

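Import the required libraries. The core library is tf_geometric, which we use to load and preprocess graph data and to build the graph neural network. The SGC implementation itself was covered above; later we also use keras.metrics.Accuracy to evaluate the model.

```python
# coding=utf-8
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import tensorflow as tf
from tensorflow import keras
from tf_geometric.layers.conv.sgc import SGC
from tf_geometric.datasets.cora import CoraDataset
```

Load the Cora dataset through the graph data interface that ships with tf_geometric:

```python
graph, (train_index, valid_index, test_index) = CoraDataset().load_data()
```

Define the model. Here we aggregate information only within the 2-hop neighborhood.

```python
# Added: num_classes is used below but not defined in the original snippet;
# this assumes graph.y is a NumPy array of integer class labels.
num_classes = graph.y.max() + 1

model = SGC(num_classes, k=2)
```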

SGC training

Training follows the same pattern as other TensorFlow models: define an optimizer, compute the loss and gradients, and backpropagate. The K-hop output of the SGC model is mapped to (0, 1) by softmax and used directly for multi-class classification. At the end of every step we compute the accuracy on the validation and test sets.

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.2)
for step in range(1, 101):
    with tf.GradientTape() as tape:
        logits = model([graph.x, graph.edge_index, graph.edge_weight], cache=graph.cache)
        logits = tf.nn.log_softmax(logits, axis=1)
        loss = compute_loss(logits, train_index, tape.watched_variables())

    vars = tape.watched_variables()
    grads = tape.gradient(loss, vars)
    optimizer.apply_gradients(zip(grads, vars))

    valid_acc = evaluate(valid_index)
    test_acc = evaluate(test_index)

    print("step = {}\tloss = {}\tvalid_acc = {}\ttest_acc = {}".format(step, loss, valid_acc, test_acc))
```


The loss is the cross-entropy. Loading Cora returns the whole graph together with train_mask, valid_mask, and test_mask. SGC takes the whole graph as input during training, and the training loss is computed only on the nodes selected by train_mask, so the mask_index passed in here is train_index. Because this is a multi-class task, node labels are converted to one-hot vectors so that they match the dimensions of the model output. Graph neural networks overfit small datasets very easily, so L2 regularization is added to reduce overfitting.

```python
def compute_loss(logits, mask_index, vars):
    masked_logits = tf.gather(logits, mask_index)
    masked_labels = tf.gather(graph.y, mask_index)

    losses = tf.nn.softmax_cross_entropy_with_logits(
        logits=masked_logits,
        labels=tf.one_hot(masked_labels, depth=num_classes)
    )

    # L2 penalty on all kernel (weight) variables
    kernel_vals = [var for var in vars if "kernel" in var.name]
    l2_losses = [tf.nn.l2_loss(kernel_var) for kernel_var in kernel_vals]

    return tf.reduce_mean(losses) + tf.add_n(l2_losses) * 5e-5
```

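SGC evaluation

To evaluate the model, pass in valid_mask or test_mask. tf.gather then selects the predictions and true labels of the validation or test nodes, and keras.metrics.Accuracy computes the accuracy.

```python
def evaluate(mask):
    # The original called an undefined forward(graph); the model is called directly here.
    logits = model([graph.x, graph.edge_index, graph.edge_weight], cache=graph.cache)
    logits = tf.nn.log_softmax(logits, axis=1)
    masked_logits = tf.gather(logits, mask)
    masked_labels = tf.gather(graph.y, mask)

    y_pred = tf.argmax(masked_logits, axis=-1, output_type=tf.int32)
    accuracy_m = keras.metrics.Accuracy()
    accuracy_m.update_state(masked_labels, y_pred)
    return accuracy_m.result().numpy()
```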




**Results**

After 100 training steps, SGC reaches a best test accuracy of 0.81.

```
step = 1 loss = 1.9458770751953125 valid_acc = 0.56951 test_acc = 0.5389999747276306
step = 2 loss = 1.8324840068817139 valid_acc = 0.722000002861023 test_acc = 0.7350000143051147
step = 3 loss = 1.7052000761032104 valid_acc = 0.4740000069141388 test_acc = 0.4729999899864197
step = 4 loss = 1.60918 valid_acc = 0.5580000281333923 test_acc = 0.5360000133514404
...
step = 97 loss = 0.96839 valid_acc = 0.79656 test_acc = 0.80208
step = 98 loss = 0.9678354263305664 valid_acc = 0.79656 test_acc = 0.81858
step = 99 loss = 0.9675441384315491 valid_acc = 0.79656 test_acc = 0.81858
step = 100 loss = 0.967261791229248 valid_acc = 0.79656 test_acc = 0.81858
```




**Full code**

Full code for this tutorial:
demo_sgc.py: //github.com/CrawlScript/tf_geometric/blob/master/demo/demo_sgc.py

GitHub repository for this tutorial (part of the **GNN-algorithms** series):
//github.com/wangyouze/GNN-algorithms


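August 4, 2020

GNN-algorithms tutorial series, part 1: The Past and Present of Graph Convolutional Networks (GCN)

[Overview] Fitting the graph convolution kernel with Chebyshev polynomials is probably one of the most common approaches in GCN. The Chebyshev polynomial kernel solves two problems: 1. after the formula is transformed, the eigendecomposition is no longer needed; 2. the recursive definition of the Chebyshev polynomials lowers the computational cost. This article combines the derivation with code and describes a TensorFlow implementation of ChebyNet in detail.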


August 2, 2020