PyG integration: Sample and export

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client GitHub repository.

For a video presentation on this notebook, see the talk GNNs at Scale With Graph Data Science Sampling and Python Client Integration, which was given at the NODES 2022 conference.

The notebook exemplifies how to use the graphdatascience and PyTorch Geometric (PyG) Python libraries to:

  • Import the CORA dataset directly into GDS

  • Sample a part of CORA using the GDS Random walk with restarts algorithm

  • Export the CORA sample client side

  • Define and train a Graph Convolutional Neural Network (GCN) on the CORA sample

  • Evaluate the GCN on a test set

1. Prerequisites

Running this notebook requires a Neo4j server with a recent GDS version (2.2+) installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

Of course, the following Python libraries are also required:

  • graphdatascience (see docs for installation instructions)

  • PyG (see PyG docs for installation instructions)

2. Setup

We start by importing our dependencies and setting up our GDS client connection to the database.

# Install necessary dependencies
%pip install graphdatascience torch torch_scatter torch_sparse torch_geometric
import os
import pandas as pd
from graphdatascience import GraphDataScience
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.transforms import RandomNodeSplit
import random
import numpy as np
# Set seeds for consistent results
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(2, 2, 0)

3. Sampling CORA

Next we use the built-in CORA loader to get the data into GDS. We will then sample it to get a smaller graph to train on. In a real-world scenario, we would probably project data from a Neo4j database into GDS instead.
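
For illustration, a hypothetical projection from an existing database might look like the sketch below. The "Paper" label, "CITES" relationship type, and property names here are assumptions made for the sake of the example; they are not needed for the built-in loader that we use next.

# Hypothetical projection sketch (kept commented out, since this notebook uses the CORA loader instead)
# G, result = gds.graph.project(
#     "cora_projection",                                   # name of the projected in-memory graph
#     {"Paper": {"properties": ["subject", "features"]}},  # node projection including its properties
#     "CITES",                                             # relationship projection
# )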

G = gds.graph.load_cora()
# Let's make sure we constructed the correct graph
print(f"Metadata for our loaded Cora graph `G`: {G}")
print(f"Node labels present in `G`: {G.node_labels()}")

It looks correct! Now let’s go ahead and sample the graph.

# We use the random walk with restarts sampling algorithm with default values
G_sample, _ = gds.alpha.graph.sample.rwr("cora_sample", G, randomSeed=42, concurrency=1)
# We should have somewhere around 0.15 * 2708 ~ 406 nodes in our sample
print(f"Number of nodes in our sample: {G_sample.node_count()}")

# And let's see how many relationships we got
print(f"Number of relationships in our sample: {G_sample.relationship_count()}")

4. Exporting sampled CORA

We can now export the topology and node properties of the sampled graph that we need to train our model.

# Get the relationship data from our sample
sample_topology_df = gds.beta.graph.relationships.stream(G_sample)
# Let's see what we got:
display(sample_topology_df)

We get the right number of rows, one for each expected relationship. However, the node IDs are rather large, whereas PyG expects consecutive IDs starting from zero. We will now massage the data structure containing the relationships until PyG can use it.

# By using `by_rel_type` we get the topology in a format that can be used as input to several GNN frameworks:
# {"rel_type": [[source_nodes], [target_nodes]]}

sample_topology = sample_topology_df.by_rel_type()
# We should only have the "CITES" key since there's only one relationship type
print(f"Relationship type keys: {sample_topology.keys()}")
print(f"Number of  {len(sample_topology['CITES'])}")

# How many source nodes do we have?
print(len(sample_topology["CITES"][0]))

Great, it looks like we have the format we need to create a PyG edge_index later.

# We also need to export the node properties corresponding to our node labels and features, represented by the
# "subject" and "features" node properties respectively
sample_node_properties = gds.graph.nodeProperties.stream(
    G_sample,
    ["subject", "features"],
    separate_property_columns=True,
)
# Let's make sure we got the data we expected
display(sample_node_properties)

5. Constructing GCN input

Now that we have all the information we need client side, we can construct the PyG Data object we will use as training input. We will remap the node IDs so that they are consecutive and start from zero. We use the ordering of node IDs in sample_node_properties as our remapping, so that the index is aligned with the node properties.

# In order for the node ids used in the `topology` to be consecutive and starting from zero,
# we will need to remap them. This way they will also align with the row numbering of the
# `sample_node_properties` data frame
def normalize_topology_index(new_idx_to_old, topology):
    # Create a reverse mapping based on new idx -> old idx
    old_idx_to_new = dict((v, k) for k, v in new_idx_to_old.items())
    return [[old_idx_to_new[node_id] for node_id in nodes] for nodes in topology]


# We use the ordering of node ids in `sample_node_properties` as our remapping
# The result is: [[mapped_source_nodes], [mapped_target_nodes]]
normalized_topology = normalize_topology_index(dict(sample_node_properties["nodeId"]), sample_topology["CITES"])
# Create the PyG edge index tensor from the remapped topology
edge_index = torch.tensor(normalized_topology, dtype=torch.long)

# We specify the node property "features" as the zero-layer node embeddings
x = torch.tensor(sample_node_properties["features"], dtype=torch.float)

# We specify the node property "subject" as class labels
y = torch.tensor(sample_node_properties["subject"], dtype=torch.long)

data = Data(x=x, y=y, edge_index=edge_index)

print(data)
# Do a random split of the data so that ~10% goes into a test set and the rest is used for training
transform = RandomNodeSplit(num_test=40, num_val=0)
_ = transform(data)

# We can see that our `data` object has been extended with some masks defining the split
print(data)
print(data.train_mask.sum().item())

As a side note, if we had wanted to do some hyperparameter tuning, it would have been useful to keep some data for a validation set as well.
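
A minimal sketch of what such a three-way split might look like is shown below. The sizes are arbitrary, and we apply the transform to a copy of the data object so that the train/test split used for training below stays unchanged.

# Hypothetical three-way split that also reserves some nodes for validation (sketch only)
val_transform = RandomNodeSplit(num_test=40, num_val=40)
data_with_val = val_transform(data.clone())
print(data_with_val.val_mask.sum().item())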

6. Training and evaluating a GCN

Let’s now define and train the GCN using PyG and our sampled CORA as input. We adapt the CORA GCN example from the PyG documentation.

In this example we evaluate the model on a test set of the sampled CORA. Please note, however, that since GCN is an inductive algorithm, we could also have evaluated it on the full CORA dataset, or even on another (similar) graph for that matter.

num_classes = y.unique().shape[0]


# Define the GCN architecture
class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # We use log_softmax and nll_loss instead of softmax output and cross entropy loss
        # for reasons of performance and numerical stability.
        # They are mathematically equivalent
        return F.log_softmax(x, dim=1)
# Prepare training by setting up for the chosen device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Let's see what device was chosen
print(device)
# In standard PyTorch fashion we instantiate our model, and transfer it to the memory of the chosen device
model = GCN().to(device)

# Let's inspect our model architecture
print(model)

# Pass our input data to the chosen device too
data = data.to(device)

# Since hyperparameter tuning is out of scope for this small example, we initialize an
# Adam optimizer with some fixed learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

From inspecting the model we can see that the output size is 7, which looks correct since Cora does indeed have 7 different paper subjects.
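
As a quick sanity check, we can also print the number of classes that we derived from the sample earlier:

# The number of distinct subject classes found in our CORA sample
print(f"Number of classes: {num_classes}")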

# Train the GCN using the CORA sample represented by `data` using the standard PyTorch training loop
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
# Evaluate the trained GCN model on our test set
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())

print(f"Accuracy: {acc:.4f}")

The accuracy looks good. The next step would be to run the GCN model we trained on our subsample on the entire Cora graph. This part is left as an exercise.
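
For orientation, one possible outline of that exercise is sketched below. It simply reuses the export, remapping and tensor construction steps from above on the full graph G. The full_* variable names are new and only used here, and note that the nodes we trained on are included in this evaluation.

# Sketch only: evaluate the trained model on the full CORA graph `G`
full_topology = gds.beta.graph.relationships.stream(G).by_rel_type()
full_node_properties = gds.graph.nodeProperties.stream(
    G, ["subject", "features"], separate_property_columns=True
)

# Remap the node IDs of the full graph, just as we did for the sample
full_edge_index = torch.tensor(
    normalize_topology_index(dict(full_node_properties["nodeId"]), full_topology["CITES"]),
    dtype=torch.long,
)
full_data = Data(
    x=torch.tensor(full_node_properties["features"], dtype=torch.float),
    y=torch.tensor(full_node_properties["subject"], dtype=torch.long),
    edge_index=full_edge_index,
).to(device)

# Score the model on all nodes of the full graph (training nodes included)
model.eval()
full_pred = model(full_data).argmax(dim=1)
full_acc = int((full_pred == full_data.y).sum()) / full_data.num_nodes
print(f"Accuracy on the full graph: {full_acc:.4f}")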

7. Cleanup

We remove the CORA graphs from the GDS graph catalog.

_ = G_sample.drop()
_ = G.drop()