Chapter 14. Data handling in PyG
( Reference: https://www.youtube.com/watch?v=JtDgmmQ60x8&list=PLGMXrbDNfqTzqxB1IGgimuhtfAhGd8lHF )
1) Introduction
Two main modules for handling graph data:
- (1) torch_geometric.data
- classes & methods for creating/managing graphs
- examples)
torch_geometric.data.DataLoader
torch_geometric.data.data.Data
torch_geometric.data.batch.Batch
torch_geometric.data.cluster.ClusterData
torch_geometric.data.cluster.ClusterLoader
torch_geometric.data.sampler.NeighborSampler
- (2) torch_geometric.datasets
- collection of graph datasets
- examples)
torch_geometric.data.Dataset.len()
torch_geometric.data.Dataset.get()
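As a quick orientation, both modules can be imported as below (a minimal sketch; the aliases match the ones used later in this chapter):
import torch_geometric.data as data          # graph creation / batching utilities
import torch_geometric.datasets as datasets  # built-in benchmark datasets

print(data.Data)           # the core graph container class
print(datasets.Planetoid)  # one of the built-in dataset classes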
2) Data
Let's build a dummy graph. We need four pieces of information:
- (1) node feature ( embeddings )
- (2) edge list ( edges )
- (3) edge feature ( edges_attr )
- (4) node label ( ys )
\(\rightarrow\) Using these four pieces of information, we can create the graph.
(1) node feature ( embeddings )
import torch
import numpy as np

embeddings = torch.rand((100, 16), dtype=torch.float)
embeddings[77]
tensor([0.9875, 0.9491, 0.0260, 0.9500, 0.5964, 0.4411, 0.8687, 0.2774, 0.5203,
0.4657, 0.4585, 0.2110, 0.6028, 0.3588, 0.3847, 0.5088])
(2) edge list ( edges )
rows = np.random.choice(100, 500)
cols = np.random.choice(100, 500)
edges = torch.tensor([rows, cols])
edges.shape
torch.Size([2, 500])
(3) edge feature ( edges_attr )
edges_attr = np.random.choice(3, 500)
edges_attr.shape
(500,)
(4) node label ( ys )
ys = torch.rand((100)).round().long()
ys
tensor([1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
1, 0, 0, 0])
Shape summary
- number of nodes: 100
- node feature dimension: 16
- number of edges: 500
- edge feature dimension: 1
print(embeddings.shape)
print(edges.shape)
print(edges_attr.shape)
print(ys.shape)
torch.Size([100, 16])
torch.Size([2, 500])
(500,)
torch.Size([100])
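One caveat: edges_attr is still a NumPy array of shape (500,), while PyG layers generally expect edge_attr as a float tensor of shape [num_edges, num_edge_features]. A hedged conversion sketch (edges_attr_t is an illustrative name; the code below keeps using the raw array, as in the original):
# Optional (not part of the original flow): convert the NumPy edge features
# to the [500, 1] float tensor shape that most PyG layers expect.
edges_attr_t = torch.tensor(edges_attr, dtype=torch.float).unsqueeze(1)
print(edges_attr_t.shape)  # torch.Size([500, 1])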
import torch_geometric.data as data
graph = data.Data(x=embeddings,
                  edge_index=edges,
                  edge_attr=edges_attr,
                  y=ys)
graph
Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])
The four attributes above can be accessed as follows:
# graph.x
# graph.x.numpy()
# graph.edge_index
# graph.edge_index.numpy()
# graph.edge_attr
# graph.edge_attr.numpy()
# graph.y
# graph.y.numpy()
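Beyond these raw attributes, the Data object also exposes derived convenience properties, e.g.:
# Derived properties of the Data object (values follow from the dummy graph above)
print(graph.num_nodes)          # 100
print(graph.num_edges)          # 500
print(graph.num_node_features)  # 16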
3) Batch
Multiple graphs can be viewed as several disconnected subgraphs inside one large graph.
graph2 = graph
batch = data.Batch.from_data_list([graph, graph2])
print(graph)
print(batch)
Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])
DataBatch(x=[200, 16], edge_index=[2, 1000], edge_attr=[2], y=[200], batch=[200], ptr=[3])
print("Number of graphs:",batch.num_graphs)
print("Graph at index 1:",batch[1])
print("Retrieve the list of graphs: ",len(batch.to_data_list()))
Number of graphs: 2
Graph at index 1: Data(x=[100, 16], edge_index=[2, 500], edge_attr=[500], y=[100])
Retrieve the list of graphs: 2
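The batch vector is what graph-level pooling uses to aggregate node features per graph; a minimal sketch using global_mean_pool from torch_geometric.nn:
from torch_geometric.nn import global_mean_pool

# batch.batch maps each of the 200 nodes to its graph index (0 or 1)
graph_embeddings = global_mean_pool(batch.x, batch.batch)
print(graph_embeddings.shape)  # torch.Size([2, 16])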
4) Cluster
The "cluster" here can be viewed as the familiar batch concept from ordinary datasets
( batches of clusters ).
cluster = data.ClusterData(graph, 5)
clusterloader = data.ClusterLoader(cluster)
for i in clusterloader:
    print(i)
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 43])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 38])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 37])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 26])
Data(x=[20, 16], edge_attr=[500], y=[20], edge_index=[2, 30])
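In Cluster-GCN-style training, several clusters are typically merged per step; a sketch of that usage (batch_size and shuffle are standard ClusterLoader keyword arguments):
# Merge 2 clusters per batch, in random order (a sketch of typical usage)
clusterloader2 = data.ClusterLoader(cluster, batch_size=2, shuffle=True)
for sub in clusterloader2:
    # `sub` is a Data object over the merged clusters,
    # with edge_index re-indexed to local node ids
    print(sub.num_nodes, sub.num_edges)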
5) Sampler
Samples at most a fixed number of nodes from each neighborhood.
sampler = data.NeighborSampler(graph.edge_index, sizes=[3, 10],
                               batch_size=5, shuffle=False)
The meaning of sizes=[3, 10] :
- 2 convolution layers
- (1) neighbors : sample 3 nodes
- (2) neighbors of neighbors : sample 10 nodes
for s in sampler:
    batch_size = s[0]
    node_index = s[1]
    edge_info = s[2]
    print(batch_size)
    print(node_index)
    print(edge_info)
    break
5
tensor([ 0, 1, 2, 3, 4, 71, 41, 6, 55, 12, 64, 47, 22, 97, 52, 26, 65, 29,
69, 67, 63, 54, 80, 86, 46, 15, 5, 92, 57, 56, 35, 78, 23, 7, 87, 48,
8, 88, 93, 79, 70, 68, 61, 90, 60, 37, 25, 99, 89, 82, 66, 84, 42, 32,
31, 91, 77, 43, 18, 9, 76, 73, 58, 17, 16])
[EdgeIndex(edge_index=tensor([[ 5, 6, 7, 2, 8, 9, 18, 19, 10, 11, 12, 20, 21, 10, 13, 14, 22, 15,
16, 17, 23, 24, 17, 23, 25, 26, 7, 27, 28, 29, 30, 16, 23, 31, 32, 33,
22, 34, 35, 36, 0, 16, 37, 8, 13, 15, 38, 39, 40, 41, 42, 8, 14, 35,
41, 43, 44, 45, 46, 9, 18, 47, 48, 49, 50, 24, 28, 30, 32, 35, 36, 50,
33, 51, 52, 53, 54, 3, 45, 55, 56, 57, 58, 59, 28, 57, 60, 61, 62, 63,
3, 21, 34, 43, 49, 64],
[ 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
8, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11,
11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13,
14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,
17, 17, 17, 17, 17, 17]]), e_id=tensor([238, 60, 117, 76, 396, 459, 402, 179, 262, 484, 137, 343, 245, 195,
36, 73, 319, 157, 1, 280, 427, 225, 298, 377, 346, 398, 228, 457,
219, 350, 306, 18, 114, 279, 476, 463, 216, 311, 356, 278, 145, 361,
201, 297, 149, 11, 116, 121, 321, 54, 409, 227, 7, 150, 85, 231,
50, 415, 199, 146, 256, 198, 204, 432, 239, 185, 209, 490, 387, 471,
49, 425, 437, 2, 130, 393, 413, 386, 141, 186, 22, 68, 286, 439,
135, 384, 168, 392, 407, 187, 418, 88, 212, 257, 162, 229]), size=(65, 18)), EdgeIndex(edge_index=tensor([[ 5, 6, 7, 2, 8, 9, 10, 11, 12, 10, 13, 14, 15, 16, 17],
[ 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]]), e_id=tensor([238, 60, 117, 76, 396, 459, 262, 484, 137, 195, 36, 73, 157, 1,
280]), size=(18, 5))]
Taking a closer look:
print("Batch size:", batch_size)
print("Number of unique nodes involved in the sampling:",len(node_index))
print("Number of neighbors sampled:", len(edge_info[0].edge_index[0]),
len(edge_info[1].edge_index[0]))
Batch size: 5
Number of unique nodes involved in the sampling: 65
Number of neighbors sampled: 96 15
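For reference, here is a sketch of how this sampler output is usually consumed in a model's forward pass, following the common PyG GraphSAGE pattern (the SAGEConv call is only indicated in a comment, as an assumption about the model):
for batch_size, n_id, adjs in sampler:
    x = embeddings[n_id]  # features of every node touched by the sampling
    for edge_index, e_id, size in adjs:
        # one convolution per hop would be applied here, e.g. with SAGEConv:
        # x = conv((x, x[:size[1]]), edge_index)
        pass
    break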
6) Dataset
import torch_geometric.datasets as datasets
There are 78 kinds of built-in datasets.
- This does not mean there are 78 graphs.
- There are 78 dataset classes,
- each dataset contains multiple pieces of data,
- and even a single piece of data may contain multiple graphs.
len(datasets.__all__) # 78
Let's look at Cora, one of the most widely used of these datasets
( Cora is one of the datasets contained in the Planetoid collection ).
name = 'Cora'
As shown below, the data can be loaded while applying multiple transformations as a pipeline.
import torch_geometric.transforms as transforms

transform = transforms.Compose([
    transforms.RandomNodeSplit('train_rest', num_val=500, num_test=500),
    transforms.TargetIndegree(),
])
cora = datasets.Planetoid('./data', name,
                          pre_transform=transforms.NormalizeFeatures(),
                          transform=transform)
Inspecting the Cora dataset:
print("Cora info:")
print('# of graphs:', len(cora))
print('# Classes (nodes)', cora.num_classes)
print('# Edge features', cora.num_edge_features)
print('# Node features', cora.num_node_features)
Cora info:
# of graphs: 1
# Classes (nodes) 7
# Edge features 1
# Node features 1433
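Since the dataset contains a single graph, it can be indexed directly; a brief sketch (the train/val/test masks are added by the RandomNodeSplit transform above):
g = cora[0]  # the one and only graph in the dataset
print(g)     # Data(...) including train_mask / val_mask / test_mask
print(g.val_mask.sum().item(), g.test_mask.sum().item())  # 500 500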