Vector Representations of Graphs

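Feature engineering is fundamental to machine learning, and a neural network always takes vectors as input. When the input is a graph, how should the raw data be processed and vectorized so that the network structure and the node and edge features are preserved? This post lists currently popular approaches.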

References:

awesome-graph-embedding

pytorch_geometric

Introduction

PyTorch Geometric (PyG) is a geometric deep learning extension library for PyTorch.

It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers. In addition, it consists of an easy-to-use mini-batch loader, multi gpu-support, a large number of common benchmark datasets (based on simple interfaces to create your own), and helpful transforms, both for learning on arbitrary graphs as well as on 3D meshes or point clouds.

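Installation

Tested on Ubuntu 16.04.6 LTS.
Download Anaconda: get the Linux installer (a .sh file) from https://www.anaconda.com/distribution/, for example with wget on the installer's direct link
Install Anaconda: bash /path/to/.sh
source ~/.bashrc
Switch the conda channel to a mirror:
1. conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
2. conda config --set show_channel_urls yes
Install PyTorch (pick the wheel that matches your hardware; this one is the CPU build for Python 3.7 on 64-bit Linux):
1. pip install https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp37-cp37m-linux_x86_64.whl
2. pip install torchvision
Install pytorch_geometric: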
- $ pip install --upgrade torch-scatter
- $ pip install --upgrade torch-sparse
- $ pip install --upgrade torch-cluster
- $ pip install --upgrade torch-spline-conv (optional)
- $ pip install torch-geometric
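
After the steps above, a quick way to confirm that the packages landed in the active environment is to ask Python whether each module can be found. This is only a sketch: the helper name check_installed is made up for illustration, and the module names are the import names that correspond to the pip packages listed above.

```python
import importlib.util

# Import names corresponding to the pip packages installed above.
PACKAGES = [
    "torch",
    "torch_scatter",
    "torch_sparse",
    "torch_cluster",
    "torch_spline_conv",  # optional
    "torch_geometric",
]


def check_installed(names):
    """Return {module_name: True/False} depending on whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_installed(PACKAGES).items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

A module reported as missing usually means pip installed into a different environment than the one running the script.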

Tested usage (continuously updated)

The Data structure

torch_geometric.data.Data:

data.x: Node feature matrix with shape [num_nodes, num_node_features]
data.edge_index: Graph connectivity in COO format with shape [2, num_edges] and type torch.long
data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]
data.y: Target to train against (may have arbitrary shape)
data.pos: Node position matrix with shape [num_nodes, num_dimensions]
None of these attributes is required; any of them may be left undefined.

import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index.t().contiguous())
>>> Data(x=[3, 1], edge_index=[2, 4])
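
To make the COO layout of edge_index more concrete, here is a small library-free sketch. It turns the same four (source, target) pairs into the [2, num_edges] layout, which is what the .t() call above does, and then reads the neighbors back out. The helper names to_coo and neighbors are invented for this illustration and are not part of PyG.

```python
from collections import defaultdict


def to_coo(edge_pairs):
    """Turn [(src, dst), ...] into the [2, num_edges] layout: [sources, targets]."""
    sources = [src for src, _ in edge_pairs]
    targets = [dst for _, dst in edge_pairs]
    return [sources, targets]


def neighbors(coo):
    """Build an adjacency list {node: [neighbor, ...]} from a COO edge index."""
    adj = defaultdict(list)
    for src, dst in zip(coo[0], coo[1]):
        adj[src].append(dst)
    return dict(adj)


# An undirected edge is stored as two directed edges, one per direction.
edge_pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
coo = to_coo(edge_pairs)

print(coo)             # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(neighbors(coo))  # {0: [1], 1: [0, 2], 2: [1]}
```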

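Graph classification

Mini-batching represents several graphs as one large sparse matrix: the adjacency matrices of the individual graphs are stacked along the diagonal.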

from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader


dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    batch
    >>> Batch(x=[1082, 21], edge_index=[2, 4066], y=[32], batch=[1082])

    batch.num_graphs
    >>> 32

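In the code above, one batch consists of 32 graphs with 1082 nodes in total. y is a tensor of 32 labels, one class per graph, and batch records, for every node, which graph it belongs to.

Building your own dataset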
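
The idea behind this kind of batching can be sketched without any library: shift the node indices of each graph by the number of nodes seen so far, concatenate the edges, and record the owning graph of every node. The function below is an illustration under those assumptions, not PyG's actual implementation; collate_graphs and the dict layout are hypothetical.

```python
def collate_graphs(graphs):
    """Merge graphs into one big disconnected graph.

    Each graph is a dict with 'num_nodes' (int), 'edges' (list of (src, dst)
    pairs using local node indices) and 'y' (the graph-level label).
    """
    edges, batch, labels = [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        # Shift local node indices so they stay unique in the merged graph.
        edges.extend((src + offset, dst + offset) for src, dst in g["edges"])
        # One entry per node, naming the graph that node came from.
        batch.extend([graph_id] * g["num_nodes"])
        labels.append(g["y"])
        offset += g["num_nodes"]
    return {"edges": edges, "batch": batch, "y": labels, "num_graphs": len(graphs)}


g1 = {"num_nodes": 3, "edges": [(0, 1), (1, 2)], "y": 0}
g2 = {"num_nodes": 2, "edges": [(0, 1)], "y": 1}
merged = collate_graphs([g1, g2])

print(merged["edges"])  # [(0, 1), (1, 2), (3, 4)]
print(merged["batch"])  # [0, 0, 0, 1, 1]
print(merged["y"])      # [0, 1]
```

Because no edge connects nodes from different graphs, message passing on the merged graph never mixes information between the original graphs.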

import torch
from torch_geometric.data import Data, DataLoader, InMemoryDataset


class MyOwnDatasetTest(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDatasetTest, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2']
        # pass

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        edge_index = torch.tensor([[0, 1],
                                   [1,2]
                                   ], dtype=torch.long)
        y1 = torch.tensor([0], dtype=torch.long)
        y2 = torch.tensor([1], dtype=torch.long)

        data1 = Data(edge_index=edge_index.t().contiguous(), y=y1)
        data2 = Data(edge_index=edge_index.t().contiguous(), y=y2)

        data_list = [data1, data2, data1, data2]

        if self.pre_filter is not None:
            # Keep only the graphs accepted by the optional pre_filter.
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])


if __name__ == '__main__':
    dataset = MyOwnDatasetTest(root="/tmp/MyOwnDatasetTest")
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in loader:
        # Each batch merges two graphs; batch.y holds one label per graph.
        print(batch)
    print("end")

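The core is the process() function. Collect the individual graphs in a list (data_list); you do not need to merge them yourself, because the library does that when self.collate is called. For graph-level classification, each graph needs only a single label (a tensor with one element), not one label per node. To use the dataset, just pass root: on first use the framework creates processed and raw directories under root and serializes the processed data there, so later runs load the cached file directly.

Training with a neural network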
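
The (data, slices) pair saved by process() can be pictured as one concatenated storage plus the offsets at which each graph begins and ends. The sketch below shows only that idea with plain Python lists; it is not PyG's collate, and the names collate_with_slices and get_graph are hypothetical.

```python
def collate_with_slices(per_graph_values):
    """Concatenate per-graph lists and remember where each graph starts and ends."""
    flat, slices = [], [0]
    for values in per_graph_values:
        flat.extend(values)
        slices.append(len(flat))  # end offset of this graph = start of the next
    return flat, slices


def get_graph(flat, slices, idx):
    """Recover the values of graph `idx` from the concatenated storage."""
    return flat[slices[idx]:slices[idx + 1]]


# Edge lists of three small graphs with different sizes.
edges_per_graph = [[(0, 1), (1, 2)], [(0, 1)], [(0, 1), (1, 2), (2, 0)]]
flat_edges, edge_slices = collate_with_slices(edges_per_graph)

print(edge_slices)                            # [0, 2, 3, 6]
print(get_graph(flat_edges, edge_slices, 2))  # [(0, 1), (1, 2), (2, 0)]
```

Storing one flat structure with offsets is what lets the whole dataset be written to, and read back from, a single file.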

examples

import os.path as osp

import torch
import torch.nn.functional as F
from torch.nn import Sequential, Linear, ReLU
from torch_geometric.datasets import TUDataset
from torch_geometric.data import DataLoader
from torch_geometric.nn import GINConv, global_add_pool

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'MUTAG')
dataset = TUDataset(path, name='MUTAG').shuffle()
test_dataset = dataset[:len(dataset) // 10]
train_dataset = dataset[len(dataset) // 10:]
test_loader = DataLoader(test_dataset, batch_size=128)
train_loader = DataLoader(train_dataset, batch_size=128)


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        num_features = dataset.num_features
        dim = 32

        nn1 = Sequential(Linear(num_features, dim), ReLU(), Linear(dim, dim))
        self.conv1 = GINConv(nn1)
        self.bn1 = torch.nn.BatchNorm1d(dim)

        nn2 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv2 = GINConv(nn2)
        self.bn2 = torch.nn.BatchNorm1d(dim)

        nn3 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv3 = GINConv(nn3)
        self.bn3 = torch.nn.BatchNorm1d(dim)

        nn4 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv4 = GINConv(nn4)
        self.bn4 = torch.nn.BatchNorm1d(dim)

        nn5 = Sequential(Linear(dim, dim), ReLU(), Linear(dim, dim))
        self.conv5 = GINConv(nn5)
        self.bn5 = torch.nn.BatchNorm1d(dim)

        self.fc1 = Linear(dim, dim)
        self.fc2 = Linear(dim, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = F.relu(self.conv3(x, edge_index))
        x = self.bn3(x)
        x = F.relu(self.conv4(x, edge_index))
        x = self.bn4(x)
        x = F.relu(self.conv5(x, edge_index))
        x = self.bn5(x)
        x = global_add_pool(x, batch)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


def train(epoch):
    model.train()

    if epoch == 51:
        for param_group in optimizer.param_groups:
            param_group['lr'] = 0.5 * param_group['lr']

    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data.x, data.edge_index, data.batch)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_dataset)


def test(loader):
    model.eval()

    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data.x, data.edge_index, data.batch)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)


for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print('Epoch: {:03d}, Train Loss: {:.7f}, '
          'Train Acc: {:.7f}, Test Acc: {:.7f}'.format(epoch, train_loss,
                                                       train_acc, test_acc))
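
The step that turns node embeddings into one vector per graph in forward() is global_add_pool(x, batch). As a rough illustration of what such a sum readout computes, the library-free sketch below adds up the feature rows of all nodes that share a graph id. It assumes plain nested lists instead of tensors, and global_add_pool_sketch is a made-up name, not the PyG function.

```python
def global_add_pool_sketch(x, batch, num_graphs):
    """Sum node feature rows per graph.

    x:     list of per-node feature lists, all of the same length
    batch: list with the graph id of every node (same length as x)
    """
    dim = len(x[0])
    out = [[0.0] * dim for _ in range(num_graphs)]
    for row, graph_id in zip(x, batch):
        for k, value in enumerate(row):
            out[graph_id][k] += value
    return out


# Three nodes: the first two belong to graph 0, the last one to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [0, 0, 1]
pooled = global_add_pool_sketch(x, batch, num_graphs=2)

print(pooled)  # [[4.0, 6.0], [5.0, 6.0]]
```

The result has one row per graph, which is why the final linear layers in the network can produce one class prediction per graph.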