Chapter 14. Data handling in PyG

( Reference : https://www.youtube.com/watch?v=JtDgmmQ60x8&list=PLGMXrbDNfqTzqxB1IGgimuhtfAhGd8lHF )

1) Introduction

Two main modules for handling graph data

  • (1) torch_geometric.data
    • classes & methods for creating/managing graphs
    • examples) ( from PyG 2.0 on, the loader and sampler classes are provided under torch_geometric.loader )
      • torch_geometric.data.DataLoader
      • torch_geometric.data.data.Data
      • torch_geometric.data.batch.Batch
      • torch_geometric.data.cluster.ClusterData
      • torch_geometric.data.cluster.ClusterLoader
      • torch_geometric.data.sampler.NeighborSampler
  • (2) torch_geometric.datasets
    • collection of graph datasets
    • examples)
      • torch_geometric.data.Dataset.len()
      • torch_geometric.data.Dataset.get()


2) Data

Let's build a dummy graph.

  • (1) node feature ( embeddings )
  • (2) edge list ( edges )
  • (3) edge feature ( edges_attr )
  • (4) node label ( ys )

\(\rightarrow\) With these four pieces of information, we can create a graph.
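Before building the graph, it can help to check that the four pieces agree in size. The sketch below is plain Python rather than PyG, so it runs without any extra packages. `check_graph_parts` is a hypothetical helper, not a PyG function, and it assumes the edge list is stored as two rows ( sources and targets ), which is the [2, num_edges] layout used in this chapter.

```python
import random

def check_graph_parts(num_nodes, edge_index, edge_attr, labels):
    """Return a list of problems found; an empty list means the parts agree."""
    problems = []
    sources, targets = edge_index
    if len(sources) != len(targets):
        problems.append("edge_index rows differ in length")
    if any(not 0 <= n < num_nodes for n in list(sources) + list(targets)):
        problems.append("edge_index refers to a node that does not exist")
    if len(edge_attr) != len(sources):
        problems.append("edge_attr needs one entry per edge")
    if len(labels) != num_nodes:
        problems.append("labels need one entry per node")
    return problems

# Same sizes as the dummy graph: 100 nodes, 500 edges, 3 edge types, 2 classes.
rng = random.Random(0)
num_nodes, num_edges = 100, 500
edge_index = [[rng.randrange(num_nodes) for _ in range(num_edges)],
              [rng.randrange(num_nodes) for _ in range(num_edges)]]
edge_attr = [rng.randrange(3) for _ in range(num_edges)]
labels = [rng.randrange(2) for _ in range(num_nodes)]

print(check_graph_parts(num_nodes, edge_index, edge_attr, labels))       # consistent parts
print(check_graph_parts(num_nodes, edge_index, edge_attr[:-1], labels))  # one edge_attr entry missing
```

The first call should report nothing, and the second should report the missing edge attribute.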


(1) node feature ( embeddings )

import numpy as np
import torch

embeddings = torch.rand((100, 16), dtype=torch.float)
embeddings[77]
tensor([0.9875, 0.9491, 0.0260, 0.9500, 0.5964, 0.4411, 0.8687, 0.2774, 0.5203,
        0.4657, 0.4585, 0.2110, 0.6028, 0.3588, 0.3847, 0.5088])


(2) edge list ( edges )

rows = np.random.choice(100, 500)
cols = np.random.choice(100, 500)
edges = torch.tensor(np.stack([rows, cols]), dtype=torch.long)  # stack first; edge indices must be int64
edges.shape
torch.Size([2, 500])


(3) edge feature ( edges_attr )

edges_attr = np.random.choice(3,500)
edges_attr.shape
(500,)


(4) node label ( ys )

ys = torch.rand((100)).round().long()
ys
tensor([1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
        1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
        1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
        1, 0, 0, 0])


Shape summary

  • number of nodes : 100
  • node feature dimension : 16
  • number of edges : 500
  • edge feature dimension : 1

print(embeddings.shape)
print(edges.shape)
print(edges_attr.shape)
print(ys.shape)
torch.Size([100, 16])
torch.Size([2, 500])
(500,)
torch.Size([100])


import torch_geometric.data as data

graph = data.Data(x = embeddings, 
                  edge_index = edges, 
                  edge_attr = edges_attr, 
                  y = ys)
graph
Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])


The four pieces of information above can be accessed as follows

# graph.x 
# graph.x.numpy()
# graph.edge_index
# graph.edge_index.numpy()
# graph.edge_attr
# graph.edge_attr.numpy()   # only if edge_attr is a torch.Tensor; here it is already a NumPy array
# graph.y
# graph.y.numpy()


3) Batch

A batch lets us treat several graphs as disconnected subgraphs of one large graph.
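To make the idea concrete, here is a minimal sketch in plain Python of what merging graphs into one big graph involves: node ids of later graphs are shifted, a `batch` vector records which graph each node came from, and `ptr` records where each graph starts. `merge_graphs` is a hypothetical helper for illustration only, not PyG's implementation.

```python
def merge_graphs(graphs):
    """graphs: list of (num_nodes, edge_index) with edge_index = [sources, targets]."""
    sources, targets, batch, ptr = [], [], [], [0]
    for graph_id, (num_nodes, (src, dst)) in enumerate(graphs):
        offset = ptr[-1]                      # nodes already placed before this graph
        sources += [s + offset for s in src]  # shift node ids so graphs do not collide
        targets += [d + offset for d in dst]
        batch += [graph_id] * num_nodes       # which graph each node came from
        ptr.append(offset + num_nodes)        # where the next graph starts
    return [sources, targets], batch, ptr

triangle = (3, [[0, 1, 2], [1, 2, 0]])  # 3 nodes, 3 edges
pair = (2, [[0], [1]])                  # 2 nodes, 1 edge
edge_index, batch, ptr = merge_graphs([triangle, pair])
print(edge_index)  # [[0, 1, 2, 3], [1, 2, 0, 4]]
print(batch)       # [0, 0, 0, 1, 1]
print(ptr)         # [0, 3, 5]
```

With two graphs, `ptr` has three entries and `batch` has one entry per node, which matches the `ptr=[3]` and `batch=[200]` shapes in the printed batch of this section.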

graph2 = graph
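batch = data.Batch.from_data_list([graph, graph2])  # classmethod: no need to instantiate Batch first


print(graph)
print(batch)
Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])
DataBatch(x=[200, 16], edge_index=[2, 1000], edge_attr=[2], y=[200], batch=[200], ptr=[3])

( edge_attr=[2] rather than [1000] : edges_attr was created as a NumPy array, so it is kept as a list of two arrays instead of being concatenated like the tensors )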


print("Number of graphs:",batch.num_graphs)
print("Graph at index 1:",batch[1])
print("Retrieve the list of graphs: ",len(batch.to_data_list()))
Number of graphs: 2
Graph at index 1: Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])
Retrieve the list of graphs:  2


4) Cluster

A cluster here plays the same role as a batch of an ordinary dataset: the graph is split into clusters, and the loader iterates over them.

( batches of clusters )
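As a rough illustration of what splitting a graph into clusters means, the sketch below partitions the nodes and keeps only the edges whose two endpoints fall in the same cluster. It is plain Python with a deliberately naive partition ( contiguous blocks of node ids ), standing in for the METIS partitioning that ClusterData relies on. `partition_graph` is a hypothetical helper.

```python
import random

def partition_graph(num_nodes, edge_index, num_parts):
    """Split nodes into equal contiguous blocks and keep only intra-block edges."""
    part_size = num_nodes // num_parts  # assumes num_nodes is divisible by num_parts
    clusters = [{"num_nodes": part_size, "edge_index": [[], []]} for _ in range(num_parts)]
    for src, dst in zip(*edge_index):
        part = src // part_size
        if part == dst // part_size:  # edges crossing two clusters are dropped
            clusters[part]["edge_index"][0].append(src - part * part_size)  # local node id
            clusters[part]["edge_index"][1].append(dst - part * part_size)
    return clusters

# Random graph with the same sizes as the dummy graph: 100 nodes, 500 edges.
rng = random.Random(0)
edge_index = [[rng.randrange(100) for _ in range(500)],
              [rng.randrange(100) for _ in range(500)]]
clusters = partition_graph(100, edge_index, num_parts=5)
for cluster in clusters:
    print(cluster["num_nodes"], len(cluster["edge_index"][0]))
```

Each cluster holds 20 nodes, and the per-cluster edge counts add up to fewer than 500 because cross-cluster edges are discarded. That is presumably also why the edge_index sizes printed for the real clusters sum to well under 500.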

cluster = data.ClusterData(graph, 5)
clusterloader = data.ClusterLoader(cluster)
for i in clusterloader:
    print(i)
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 43])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 38])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 37])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 26])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 30])


5) Sampler

Sample at most a fixed number of nodes from each neighborhood.

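sampler = data.NeighborSampler(graph.edge_index, sizes=[3,10], batch_size=5,
                               shuffle=False)

  • Meaning of sizes=[3,10] :
    • two convolution layers, one entry per layer
      • (1) neighbors : at most 3 nodes sampled per target node
      • (2) neighbors of neighbors : at most 10 nodes sampled per target node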
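The fan-out described by sizes can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the NeighborSampler implementation: `sample_hops` is a hypothetical helper, the graph is stored as a dict from each node to its incoming neighbors, and the hops are returned innermost first ( the sampler in this section prints the outermost hop first ).

```python
import random

def sample_hops(neighbors, seeds, sizes, rng):
    """neighbors: dict node -> list of incoming neighbors. Returns (nodes, edges_per_hop)."""
    nodes = list(seeds)          # seed nodes come first in the node list
    seen = set(nodes)
    targets = list(seeds)
    edges_per_hop = []
    for size in sizes:           # one entry of `sizes` per hop (per layer)
        hop_edges = []
        for target in targets:
            candidates = neighbors.get(target, [])
            picked = rng.sample(candidates, min(size, len(candidates)))  # at most `size`
            for source in picked:
                hop_edges.append((source, target))
                if source not in seen:
                    seen.add(source)
                    nodes.append(source)
        edges_per_hop.append(hop_edges)
        targets = list(nodes)    # the next hop expands every node collected so far
    return nodes, edges_per_hop

# Random graph with 100 nodes and 500 directed edges.
rng = random.Random(0)
neighbors = {}
for _ in range(500):
    src, dst = rng.randrange(100), rng.randrange(100)
    neighbors.setdefault(dst, []).append(src)  # dst receives a message from src

nodes, (hop1, hop2) = sample_hops(neighbors, seeds=[0, 1, 2, 3, 4], sizes=[3, 10], rng=rng)
print(len(nodes), len(hop1), len(hop2))
```

The first hop yields at most 5 × 3 edges. The second hop starts from every node collected so far, so its edge count depends on how many distinct nodes the first hop reached.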

for batch_size, node_index, edge_info in sampler:
    print(batch_size)
    print(node_index)
    print(edge_info)
    break
5
tensor([ 0,  1,  2,  3,  4, 71, 41,  6, 55, 12, 64, 47, 22, 97, 52, 26, 65, 29,
        69, 67, 63, 54, 80, 86, 46, 15,  5, 92, 57, 56, 35, 78, 23,  7, 87, 48,
         8, 88, 93, 79, 70, 68, 61, 90, 60, 37, 25, 99, 89, 82, 66, 84, 42, 32,
        31, 91, 77, 43, 18,  9, 76, 73, 58, 17, 16])
[EdgeIndex(edge_index=tensor([[ 5,  6,  7,  2,  8,  9, 18, 19, 10, 11, 12, 20, 21, 10, 13, 14, 22, 15,
         16, 17, 23, 24, 17, 23, 25, 26,  7, 27, 28, 29, 30, 16, 23, 31, 32, 33,
         22, 34, 35, 36,  0, 16, 37,  8, 13, 15, 38, 39, 40, 41, 42,  8, 14, 35,
         41, 43, 44, 45, 46,  9, 18, 47, 48, 49, 50, 24, 28, 30, 32, 35, 36, 50,
         33, 51, 52, 53, 54,  3, 45, 55, 56, 57, 58, 59, 28, 57, 60, 61, 62, 63,
          3, 21, 34, 43, 49, 64],
        [ 0,  0,  0,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  3,  3,  3,  3,  4,
          4,  4,  4,  4,  5,  5,  5,  5,  6,  6,  6,  6,  6,  7,  7,  7,  7,  7,
          8,  8,  8,  8,  9,  9,  9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11,
         11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13,
         14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
         17, 17, 17, 17, 17, 17]]), e_id=tensor([238,  60, 117,  76, 396, 459, 402, 179, 262, 484, 137, 343, 245, 195,
         36,  73, 319, 157,   1, 280, 427, 225, 298, 377, 346, 398, 228, 457,
        219, 350, 306,  18, 114, 279, 476, 463, 216, 311, 356, 278, 145, 361,
        201, 297, 149,  11, 116, 121, 321,  54, 409, 227,   7, 150,  85, 231,
         50, 415, 199, 146, 256, 198, 204, 432, 239, 185, 209, 490, 387, 471,
         49, 425, 437,   2, 130, 393, 413, 386, 141, 186,  22,  68, 286, 439,
        135, 384, 168, 392, 407, 187, 418,  88, 212, 257, 162, 229]), size=(65, 18)), EdgeIndex(edge_index=tensor([[ 5,  6,  7,  2,  8,  9, 10, 11, 12, 10, 13, 14, 15, 16, 17],
        [ 0,  0,  0,  1,  1,  1,  2,  2,  2,  3,  3,  3,  4,  4,  4]]), e_id=tensor([238,  60, 117,  76, 396, 459, 262, 484, 137, 195,  36,  73, 157,   1,
        280]), size=(18, 5))]


A closer look

print("Batch size:", batch_size)
print("Number of unique nodes involved in the sampling:",len(node_index))
print("Number of neighbors sampled:", len(edge_info[0].edge_index[0]), 
      len(edge_info[1].edge_index[0]))
Batch size: 5
Number of unique nodes involved in the sampling: 65
Number of neighbors sampled: 96 15


6) Dataset

import torch_geometric.datasets as datasets


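78 kinds of built-in datasets are available ( in the version used here )

  • This does not mean that there are 78 graphs.
  • There are 78 kinds of datasets,
    • each of which contains various data,
      • and a single data item can itself contain several graphs.

len(datasets.__all__) # 78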


Let's look at Cora, a representative one

( Cora is one of the data in the dataset called Planetoid )

name = 'Cora'


As shown below, the data can be loaded with several transformations applied as a pipeline.

import torch_geometric.transforms as transforms

transform = transforms.Compose([
    transforms.RandomNodeSplit('train_rest', num_val=500, num_test=500),
    transforms.TargetIndegree(),
])

cora = datasets.Planetoid('./data', name, 
                          pre_transform=transforms.NormalizeFeatures(), 
                          transform=transform)
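What the three transformations do can be sketched in plain Python. These are hypothetical helpers written for illustration, not the PyG implementations, and they follow the documented behavior as I understand it: NormalizeFeatures scales each feature row to sum to one, RandomNodeSplit with 'train_rest' puts every node that is not in the validation or test set into the training set, and TargetIndegree stores the in-degree of each edge's target node divided by the maximum in-degree as a one-dimensional edge feature. The node count 2708 is Cora's size, which is not stated in the text.

```python
import random

def normalize_rows(features):
    """Scale each row to sum to one; all-zero rows are left unchanged."""
    normalized = []
    for row in features:
        total = sum(row) or 1.0  # avoid dividing by zero
        normalized.append([v / total for v in row])
    return normalized

def random_split(num_nodes, num_val, num_test, rng):
    """Pick validation and test nodes at random; all remaining nodes are for training."""
    order = list(range(num_nodes))
    rng.shuffle(order)
    val, test = set(order[:num_val]), set(order[num_val:num_val + num_test])
    train_mask = [n not in val and n not in test for n in range(num_nodes)]
    val_mask = [n in val for n in range(num_nodes)]
    test_mask = [n in test for n in range(num_nodes)]
    return train_mask, val_mask, test_mask

def target_indegree(edge_index, num_nodes):
    """One feature per edge: in-degree of the target node / maximum in-degree."""
    sources, targets = edge_index
    degree = [0] * num_nodes
    for dst in targets:
        degree[dst] += 1
    max_degree = max(degree) or 1
    return [[degree[dst] / max_degree] for dst in targets]

rng = random.Random(0)
train_mask, val_mask, test_mask = random_split(2708, num_val=500, num_test=500, rng=rng)
print(sum(train_mask), sum(val_mask), sum(test_mask))        # 1708 500 500
print(normalize_rows([[1.0, 3.0], [0.0, 0.0]]))              # [[0.25, 0.75], [0.0, 0.0]]
print(target_indegree([[0, 1, 2], [1, 1, 0]], num_nodes=3))  # [[1.0], [1.0], [0.5]]
```

Note that pre_transform is applied once, before the processed data is saved to disk, whereas transform is applied every time the data is accessed. That is why the random split belongs in transform here.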


Inspecting the Cora data

print("Cora info:")
print('# of graphs:', len(cora))
print('# Classes (nodes)', cora.num_classes)
print('# Edge features', cora.num_edge_features)
print('# Node features', cora.num_node_features)
Cora info:
# of graphs: 1
# Classes (nodes) 7
# Edge features 1
# Node features 1433
