Vectorized Representations of Graphs


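Feature engineering is the foundation of machine learning, and a neural network always takes vectors as input. When the input is a graph, how should the raw data be processed and vectorized so that the network structure and the node and edge features are preserved? This post covers currently popular approaches.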

References:

Introduction

Documentation | paper

PyTorch Geometric (PyG) is a geometric deep learning extension library for PyTorch.

It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers. In addition, it consists of an easy-to-use mini-batch loader, multi gpu-support, a large number of common benchmark datasets (based on simple interfaces to create your own), and helpful transforms, both for learning on arbitrary graphs as well as on 3D meshes or point clouds.

Installation

  • Tested and working on Ubuntu 16.04.6 LTS
  • Download Anaconda: wget https://www.anaconda.com/distribution/ (this URL is the download page; the installer link is listed there)
  • Install Anaconda: bash /path/to/.sh
  • source ~/.bashrc
  • Switch conda to a mirror channel:
    1. conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    2. conda config --set show_channel_urls yes
  • Install PyTorch (pick the build that matches your hardware; this is the CPU build):
    1. pip install https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp37-cp37m-linux_x86_64.whl
    2. pip install torchvision
  • Install pytorch_geometric:
    • $ pip install --upgrade torch-scatter
    • $ pip install --upgrade torch-sparse
    • $ pip install --upgrade torch-cluster
    • $ pip install --upgrade torch-spline-conv (optional)
    • $ pip install torch-geometric

Usage notes from hands-on testing (continuously updated)

The Data structure

torch_geometric.data.Data:

  • data.x: Node feature matrix with shape [num_nodes, num_node_features]
  • data.edge_index: Graph connectivity in COO format with shape [2, num_edges] and type torch.long
  • data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]
  • data.y: Target to train against (may have arbitrary shape)
  • data.pos: Node position matrix with shape [num_nodes, num_dimensions]
    None of these attributes is required; any of them can be left undefined.
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

# edge_index is written as (source, target) pairs, so transpose it to shape [2, num_edges].
data = Data(x=x, edge_index=edge_index.t().contiguous())
>>> Data(x=[3, 1], edge_index=[2, 4])
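
To make the COO layout of edge_index more concrete, here is a minimal pure-Python sketch that needs neither PyTorch nor PyG. It takes the same four directed edges, rearranges them into the [2, num_edges] layout, and expands them into a dense adjacency matrix. The helper names to_coo and to_dense_adjacency are hypothetical and exist only for this illustration; they are not part of torch_geometric.

```python
# Illustrative only: plain-Python stand-ins, not part of torch_geometric.

def to_coo(edge_pairs):
    """Turn [(src, dst), ...] into two parallel rows [[src...], [dst...]]."""
    sources = [src for src, _ in edge_pairs]
    targets = [dst for _, dst in edge_pairs]
    return [sources, targets]


def to_dense_adjacency(coo, num_nodes):
    """Expand a COO edge index into a dense num_nodes x num_nodes 0/1 matrix."""
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in zip(coo[0], coo[1]):
        adjacency[src][dst] = 1
    return adjacency


# The undirected path 0 - 1 - 2, stored as 4 directed edges.
edge_pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
coo = to_coo(edge_pairs)
adjacency = to_dense_adjacency(coo, num_nodes=3)

print(coo)        # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(adjacency)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The adjacency matrix is symmetric because every undirected edge is stored twice, once per direction, which is also why the example above lists four edges for a three-node path.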

Graph classification

Several graphs are combined into one large sparse (block-diagonal) adjacency matrix.

from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader


dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    batch
    >>> Batch(x=[1082, 21], edge_index=[2, 4066], y=[32], batch=[1082])

    batch.num_graphs
    >>> 32

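In the code above, one batch consists of 32 graphs with 1082 nodes in total. y is a vector of length 32 holding the class label of each of the 32 graphs, and batch is a vector of length 1082 that records which graph each node belongs to.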
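
The idea behind this kind of batching can be sketched in plain Python: the node indices of each graph are shifted by the number of nodes that came before it, and a batch vector remembers the graph id of every node. This is only a conceptual sketch under that assumption, not the torch_geometric implementation, and batch_graphs is a hypothetical helper name.

```python
# Illustrative only: a plain-Python sketch of mini-batching graphs,
# not the torch_geometric implementation.

def batch_graphs(graphs):
    """Merge graphs given as (num_nodes, [(src, dst), ...]) into one big graph.

    Returns (merged_edges, batch): node indices of each graph are shifted by the
    number of nodes before it, and batch[i] is the graph id of node i.
    """
    merged_edges = []
    batch = []
    offset = 0
    for graph_id, (num_nodes, edges) in enumerate(graphs):
        merged_edges.extend((src + offset, dst + offset) for src, dst in edges)
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes
    return merged_edges, batch


graphs = [
    (3, [(0, 1), (1, 2)]),  # graph 0 has 3 nodes
    (2, [(0, 1)]),          # graph 1 has 2 nodes
]
merged_edges, batch = batch_graphs(graphs)

print(merged_edges)  # [(0, 1), (1, 2), (3, 4)]
print(batch)         # [0, 0, 0, 1, 1]
```

Because no edge ever connects nodes from different graphs, the merged graph is a set of disconnected components, which is what makes the combined adjacency matrix block-diagonal.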

Building your own dataset

import torch
from torch_geometric.data import Data, DataLoader, InMemoryDataset


class MyOwnDatasetTest(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDatasetTest, self).__init__(root, transform, pre_transform)
        # Load the graphs that process() serialized on the first run.
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2']

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        edge_index = torch.tensor([[0, 1],
                                   [1, 2]], dtype=torch.long)
        # One graph-level label per graph.
        y1 = torch.tensor([0], dtype=torch.long)
        y2 = torch.tensor([1], dtype=torch.long)

        data1 = Data(edge_index=edge_index.t().contiguous(), y=y1)
        data2 = Data(edge_index=edge_index.t().contiguous(), y=y2)

        data_list = [data1, data2, data1, data2]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        # Pack the list into one object plus slice offsets and serialize it.
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])


if __name__ == '__main__':
    dataset = MyOwnDatasetTest(root="/tmp/MyOwnDatasetTest")
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in loader:
        print(batch)  # each batch merges 2 graphs
    print("end")

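The core is the process() function. Note that the individual graphs are collected in data_list; you do not need to merge them yourself, because the library does that automatically. For graph-level classification, each graph needs only a single label (a tensor of shape [1]); there is no need to label every node. To use the dataset, just pass root: on the first call the framework creates the processed and raw folders under root and serializes the data there, and later calls load the serialized file directly.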
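
As a rough mental model of what self.collate(data_list) hands back, the sketch below concatenates the edge lists of several graphs into one flat list and records slice boundaries, so that any single graph can be recovered later. This is a simplified plain-Python illustration under that assumption; the real method works on tensors and on every attribute of the Data objects, and collate_edges and get_graph are hypothetical names.

```python
# Illustrative only: hypothetical helpers, not torch_geometric's collate().

def collate_edges(edge_lists):
    """Concatenate per-graph edge lists and record where each graph starts and ends."""
    flat = []
    slices = [0]
    for edges in edge_lists:
        flat.extend(edges)
        slices.append(len(flat))
    return flat, slices


def get_graph(flat, slices, index):
    """Recover the edge list of graph `index` from the flat storage."""
    return flat[slices[index]:slices[index + 1]]


edge_lists = [
    [(0, 1), (1, 2)],
    [(0, 1)],
    [(0, 1), (1, 2), (2, 0)],
]
flat, slices = collate_edges(edge_lists)

print(slices)                      # [0, 2, 3, 6]
print(get_graph(flat, slices, 1))  # [(0, 1)]
```

Storing one flat object plus offsets is what lets the whole dataset be written with a single torch.save call and read back with a single torch.load call.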

Training with a neural network

examples

import os.path as osp

import torch
import torch.nn.functional as F
from torch.nn import Sequential, Linear, ReLU
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader
from torch_geometric.nn import GINConv, global_add_pool

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'MUTAG')
dataset = TUDataset(path, name='MUTAG').shuffle()
# Hold out the first 10% of the shuffled graphs for testing.
test_dataset = dataset[:len(dataset) // 10]
train_dataset = dataset[len(dataset) // 10:]
test_loader = DataLoader(test_dataset, batch_size=128)
train_loader = DataLoader(train_dataset, batch_size=128)


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        num_features = dataset.num_features
        dim = 32

        # Five GIN layers, each wrapping a small MLP and followed by batch norm.
        nn1 = Sequential(Linear(num_features, dim), ReLU(), Linear(dim, dim))
        self.conv1 = GINConv(nn1)
        self.bn1 = torch.nn.BatchNorm1d(dim)

        nn2 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv2 = GINConv(nn2)
        self.bn2 = torch.nn.BatchNorm1d(dim)

        nn3 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv3 = GINConv(nn3)
        self.bn3 = torch.nn.BatchNorm1d(dim)

        nn4 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv4 = GINConv(nn4)
        self.bn4 = torch.nn.BatchNorm1d(dim)

        nn5 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv5 = GINConv(nn5)
        self.bn5 = torch.nn.BatchNorm1d(dim)

        self.fc1 = Linear(dim, dim)
        self.fc2 = Linear(dim, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = F.relu(self.conv3(x, edge_index))
        x = self.bn3(x)
        x = F.relu(self.conv4(x, edge_index))
        x = self.bn4(x)
        x = F.relu(self.conv5(x, edge_index))
        x = self.bn5(x)
        # Sum node embeddings per graph to get one vector per graph.
        x = global_add_pool(x, batch)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


def train(epoch):
    model.train()

    # Halve the learning rate once, at epoch 51.
    if epoch == 51:
        for param_group in optimizer.param_groups:
            param_group['lr'] = 0.5 * param_group['lr']

    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data.x, data.edge_index, data.batch)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_dataset)


def test(loader):
    model.eval()

    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data.x, data.edge_index, data.batch)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)


for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print('Epoch: {:03d}, Train Loss: {:.7f}, '
          'Train Acc: {:.7f}, Test Acc: {:.7f}'.format(epoch, train_loss,
                                                       train_acc, test_acc))
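
The global_add_pool(x, batch) call in forward() is the step that turns per-node embeddings into one embedding per graph. A minimal plain-Python sketch of that readout, assuming it simply sums the feature vectors of all nodes that share a graph id, could look like this. The function global_add_pool_sketch is a hypothetical stand-in written for illustration, not the library function.

```python
# Illustrative only: a plain-Python stand-in for a sum readout over a batch.

def global_add_pool_sketch(node_features, batch, num_graphs):
    """Sum node feature vectors per graph, using batch[i] as the graph id of node i."""
    dim = len(node_features[0])
    pooled = [[0.0] * dim for _ in range(num_graphs)]
    for features, graph_id in zip(node_features, batch):
        for k, value in enumerate(features):
            pooled[graph_id][k] += value
    return pooled


# Five nodes with 2 features each; the first three belong to graph 0, the rest to graph 1.
node_features = [[1.0, 0.0], [2.0, 1.0], [0.5, 0.5], [3.0, 3.0], [1.0, 2.0]]
batch = [0, 0, 0, 1, 1]
pooled = global_add_pool_sketch(node_features, batch, num_graphs=2)

print(pooled)  # [[3.5, 1.5], [4.0, 5.0]]
```

The output has one row per graph regardless of how many nodes each graph has, which is why the fully connected layers after the pooling step can use a fixed input size.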

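Title: Vectorized Representations of Graphs

Author: ChengXiao

Published: March 12, 2019 - 10:03

Last updated: March 12, 2019 - 11:03

Original link: http://chengxiao19961022.github.io/2019/03/12/图的向量化表示方法/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please retain the original link and author when reposting.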
