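Feature engineering is fundamental to machine learning. The input to a neural network is always a vector representation, so when the input is a graph, how should the raw data be processed and vectorized while preserving the network structure and the node and edge features? This post lists the analysis methods that are currently popular.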
Introduction
PyTorch Geometric (PyG) is a geometric deep learning extension library for PyTorch.
It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers. In addition, it consists of an easy-to-use mini-batch loader, multi-GPU support, a large number of common benchmark datasets (based on simple interfaces to create your own), and helpful transforms, both for learning on arbitrary graphs as well as on 3D meshes or point clouds.
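Installation
- Tested and working on Ubuntu 16.04.6 LTS
- Download Anaconda (the URL below is the distribution page; point wget at the Linux installer link listed there):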
wget https://www.anaconda.com/distribution/
- Install Anaconda:
bash /path/to/.sh
source ~/.bashrc
- Replace the conda channel (here, the Tsinghua mirror):
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes
- Install PyTorch (download the build that matches your hardware; this is the CPU build):
pip install https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp37-cp37m-linux_x86_64.whl
pip install torchvision
- Install pytorch_geometric:
$ pip install --upgrade torch-scatter
$ pip install --upgrade torch-sparse
$ pip install --upgrade torch-cluster
$ pip install --upgrade torch-spline-conv  # optional
$ pip install torch-geometric
Tested usage (continuously updated)
The Data structure
- data.x: Node feature matrix with shape [num_nodes, num_node_features]
- data.edge_index: Graph connectivity in COO format with shape [2, num_edges] and type torch.long
- data.edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]
- data.y: Target to train against (may have arbitrary shape)
- data.pos: Node position matrix with shape [num_nodes, num_dimensions]
None of these attributes is required; any of them can be left undefined.
import torch
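To make the COO layout of edge_index concrete, here is a minimal, library-free sketch. It deliberately uses plain Python lists instead of tensors, and the helper name to_coo and the toy graph are assumptions made for illustration; they are not part of PyG.

```python
# Library-free sketch of the COO layout used for edge_index.
# The helper name `to_coo` and the toy graph are illustrative assumptions.

def to_coo(undirected_edges):
    """Return [sources, targets], storing each undirected edge in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# A path graph 0 - 1 - 2 with one scalar feature per node.
x = [[-1.0], [0.0], [1.0]]             # shape [num_nodes, num_node_features]
edge_index = to_coo([(0, 1), (1, 2)])  # shape [2, num_edges]

num_nodes = len(x)
num_edges = len(edge_index[0])
```

With PyG installed, the same lists would typically be wrapped as torch.tensor(edge_index, dtype=torch.long) and passed to torch_geometric.data.Data(x=..., edge_index=...). Note that an undirected edge occupies two columns, one per direction.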
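Graph classification
Several graphs are represented together as one large sparse matrix (a block-diagonal adjacency matrix, so the graphs stay disconnected from each other).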
from torch_geometric.datasets import TUDataset
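In the code above, one batch consists of 32 graphs with 1082 nodes in total. y is a 1×32 array holding the class of each of the 32 graphs, and batch marks which graph each node belongs to.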
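The way several graphs end up in one batch can be sketched without any library. The code below is only an illustration of the idea (offset the node indices of each graph, concatenate the edges, and record a batch vector); the function name collate and the dict layout are assumptions, not PyG's actual implementation.

```python
# Library-free sketch of mini-batching graphs into one disconnected graph.
# `collate` and the dict keys are illustrative assumptions.

def collate(graphs):
    """Merge graphs given as dicts with 'num_nodes', 'edge_index' and 'y'."""
    sources, targets, batch, y = [], [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        src, dst = g["edge_index"]
        # Shift node indices so that every graph gets its own index range.
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        # batch[i] is the id of the graph that node i belongs to.
        batch += [graph_id] * g["num_nodes"]
        y.append(g["y"])  # one label per graph
        offset += g["num_nodes"]
    return {"edge_index": [sources, targets], "batch": batch,
            "y": y, "num_nodes": offset}

g0 = {"num_nodes": 2, "edge_index": [[0, 1], [1, 0]], "y": 0}
g1 = {"num_nodes": 3, "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "y": 1}
merged = collate([g0, g1])
```

Because no edge connects nodes from different index ranges, the merged adjacency matrix is block-diagonal, and y has one entry per graph while batch has one entry per node.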
Building your own dataset
import torch
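The core is the process() function. Collect the individual graphs in a data list; you do not need to merge them yourself, because the library does that automatically. For graph-level classification, each graph needs only a single 1×1 label, not one label per node. To use the dataset, just pass root. On the first call the framework creates processed and raw folders under root and serializes the data there, so later calls load it directly.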
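The process-once, load-afterwards behaviour can be sketched in plain Python. This is a hypothetical stand-in that only mimics the caching pattern with pickle; the class name CachedGraphDataset, the file name data.pkl, and the toy graphs are assumptions, and PyG's own dataset classes differ in detail.

```python
import os
import pickle
import tempfile

class CachedGraphDataset:
    """Hypothetical stand-in for the process-once, load-afterwards pattern."""

    def __init__(self, root):
        self.processed_path = os.path.join(root, "processed", "data.pkl")
        self.was_processed = False
        if not os.path.exists(self.processed_path):
            os.makedirs(os.path.dirname(self.processed_path), exist_ok=True)
            self.process()
        # Always read from the serialized file, whether fresh or cached.
        with open(self.processed_path, "rb") as f:
            self.data_list = pickle.load(f)

    def process(self):
        # One entry per graph; the label y is a single value per graph.
        data_list = [
            {"edge_index": [[0, 1], [1, 0]], "y": 0},
            {"edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "y": 1},
        ]
        with open(self.processed_path, "wb") as f:
            pickle.dump(data_list, f)
        self.was_processed = True

    def __len__(self):
        return len(self.data_list)

root = tempfile.mkdtemp()
first = CachedGraphDataset(root)   # runs process() and writes the cache
second = CachedGraphDataset(root)  # finds the cache and skips process()
```

The second construction never calls process(), which is why deleting the processed folder is the usual way to force a rebuild after changing the processing logic.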
Training with a neural network
import os.path as osp