Node classification with HashGNN

This Jupyter notebook is hosted here in the Neo4j Graph Data Science Client GitHub repository.

The notebook exemplifies how to use the graphdatascience library to:

Import an IMDB dataset with Movie, Actor and Director nodes directly into GDS using a convenience data loader
Configure a node classification pipeline with HashGNN embeddings for predicting the genre of Movie nodes
Train the pipeline with autotuning and inspect the results
Make predictions for movie nodes without a specified genre

1. Prerequisites

Running this notebook requires a Neo4j database server with a recent version (2.3 or newer) of the Neo4j Graph Data Science library (GDS) plugin installed. We recommend using Neo4j Desktop with GDS, or AuraDS.

Additionally, the version of the graphdatascience library used must be 1.6 or newer.

# Install necessary Python dependencies
%pip install graphdatascience

2. Setup

We start by importing our dependencies and setting up our GDS client connection to the database.

# Import our dependencies
import os
from graphdatascience import GraphDataScience

# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)

from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(2, 3, 0)

3. Loading the IMDB dataset

Next we use the graphdatascience built-in IMDB loader to get data into our GDS server. This should give us a graph with Movie, Actor and Director nodes, connected by ACTED_IN and DIRECTED_IN relationships.

Note that in a "real world scenario", we would probably project our own data from a Neo4j database into GDS instead, or use gds.graph.construct to create a graph from our own client-side data.

# Run the loading to obtain a `Graph` object representing our GDS projection
G = gds.graph.load_imdb()

Let’s inspect our graph to see what it contains.

print(f"Overview of G: {G}")
print(f"Node labels in G: {G.node_labels()}")
print(f"Relationship types in G: {G.relationship_types()}")

It looks as expected, though we notice that some nodes have an UnclassifiedMovie label. Indeed, these are the nodes whose genre we wish to predict with a node classification model. Let’s look at the node properties present on the various node labels to see this more clearly.

print(f"Node properties per label:\n{G.node_properties()}")

So we see that the Movie nodes have the genre property, which means that we can use these nodes when training our model later. The UnclassifiedMovie nodes, as expected, do not have the genre property, which is exactly what we want to predict.

Additionally, we notice that all nodes have a plot_keywords property. This is a binary "bag-of-words" type feature vector representing which out of 1256 plot keywords describe a certain node. These feature vectors will be used as input to our HashGNN node embedding algorithm later.
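To make the bag-of-words encoding concrete, here is a minimal sketch using a tiny, made-up vocabulary of five keywords. The keyword names and the helper `to_binary_bag_of_words` are purely illustrative assumptions; the real dataset uses 1256 keywords. Each position in the vector is 1 if the corresponding keyword describes the node and 0 otherwise.

```python
# Hypothetical miniature vocabulary; the real dataset has 1256 plot keywords
VOCABULARY = ["alien", "detective", "heist", "love", "revenge"]


def to_binary_bag_of_words(keywords, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in `keywords`."""
    present = set(keywords)
    return [1 if word in present else 0 for word in vocabulary]


# A made-up movie described by two of the five keywords
vector = to_binary_bag_of_words(["heist", "revenge"], VOCABULARY)
print(vector)
```

The vector always has the same length as the vocabulary, regardless of how many keywords apply to a given node.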

4. Configuring the training pipeline

Now that we have loaded and understood the data we want to analyze, we can move on to the tools for actually making the aforementioned genre predictions of the UnclassifiedMovie nodes.

Since we want to predict discrete valued properties of nodes, we will use a node classification pipeline.

# Create an empty node classification pipeline
pipe, _ = gds.beta.pipeline.nodeClassification.create("genre-predictor-pipe")

To be able to compare our score to the current state-of-the-art methods on this dataset, we want to use the same test set size as in the Graph Transformer Networks paper (NeurIPS). We configure our pipeline accordingly.

# Set the test set size to be 79.6 % of the entire set of `Movie` nodes
_ = pipe.configureSplit(testFraction=0.796)

Please note that we would get a much better model by using a more standard train-test split, such as 80/20, which is typically the way to go for real use cases.
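As a rough illustration of what such a split amounts to, the sketch below shuffles a list of hypothetical node ids with a fixed seed and holds out 79.6 % of them as the test set. The node count of 1000 and the helper name `split_node_ids` are assumptions made for the example; GDS performs its own splitting inside the pipeline.

```python
import random


def split_node_ids(node_ids, test_fraction, seed):
    """Shuffle node ids reproducibly and split them into (train, test) lists."""
    shuffled = list(node_ids)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# 1000 is a made-up node count, used only to make the proportions visible
train_ids, test_ids = split_node_ids(range(1000), test_fraction=0.796, seed=42)
print(f"train: {len(train_ids)}, test: {len(test_ids)}")
```

With so few nodes left for training, it becomes clear why a more conventional split would usually give a stronger model.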

5. The HashGNN node embedding algorithm

As the last part of the training pipeline, there will be an ML training algorithm. If we use the plot_keywords directly as our feature input to the ML algorithm, we will not utilize any of the relationship data we have in our graph. Since relationships would likely enrich our features with more valuable information, we will use a node embedding algorithm which takes relationships into account, and use its output as input to the ML training algorithm.

In this case we will use the HashGNN node embedding algorithm which is new in GDS 2.3. Contrary to what the name suggests, HashGNN is not a supervised neural learning model. It is in fact an unsupervised algorithm. Its name comes from the fact that the algorithm design is inspired by that of graph neural networks, in that it does message passing interleaved with transformations on each node. But instead of doing neural transformations like most GNNs, its transformations are done by locality sensitive min-hashing. Since the hash functions used are randomly chosen independent of the input data, there is no need for training.

We will give HashGNN the plot_keywords node properties as input, and it will output new feature vectors for each node that have been enriched by message passing over relationships. Since the plot_keywords vectors are already binary, we don’t have to do any binarization of the input.

Since we have multiple node labels and relationships, we make sure to enable the heterogeneous capabilities of HashGNN by setting heterogeneous=True. Notably we also declare that we want to include all kinds of nodes, not only the Movie nodes we will train on, by explicitly specifying the contextNodeLabels.

Please see the HashGNN documentation for more on this algorithm.
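To give some intuition for the idea of message passing combined with min-hashing, here is a deliberately simplified toy sketch. It is not the GDS implementation: it leaves out details such as neighborInfluence and heterogeneous handling, and all function names, the three-node graph and the hash function family are assumptions made for illustration. In each round, a node pools its own active feature ids with those of its neighbors, and each randomly drawn hash function selects the pooled feature with the smallest hash value.

```python
import random

PRIME = 2_147_483_647  # Large prime used as modulus for the toy hash functions


def make_hash_functions(count, seed):
    """Draw `count` random hash functions of the form (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def min_hash_select(feature_ids, hash_functions):
    """For each hash function, keep the feature id with the smallest hash value."""
    if not feature_ids:
        return set()
    return {
        min(feature_ids, key=lambda f: (a * f + b) % PRIME)
        for a, b in hash_functions
    }


def toy_iteration(node_features, neighbors, hash_functions):
    """One simplified round: pool own and neighbor features, then min-hash them."""
    updated = {}
    for node, own in node_features.items():
        pooled = set(own)
        for neighbor in neighbors.get(node, []):
            pooled |= node_features[neighbor]
        updated[node] = min_hash_select(pooled, hash_functions)
    return updated


# Hypothetical three-node graph; feature ids are indices of active keywords
features = {"movie": {1, 4}, "actor": {2, 4, 7}, "director": {9}}
adjacency = {"movie": ["actor", "director"], "actor": ["movie"], "director": ["movie"]}
hashes = make_hash_functions(count=3, seed=41)

print(toy_iteration(features, adjacency, hashes))
```

Because the hash functions are drawn at random and independently of the data, nothing in this sketch needs to be trained; the number of hash functions plays a role loosely analogous to embeddingDensity, in that it bounds how many features survive each round.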

# Add a HashGNN node property step to the pipeline
_ = pipe.addNodeProperty(
    "beta.hashgnn",
    mutateProperty="embedding",
    iterations=4,
    heterogeneous=True,
    embeddingDensity=512,
    neighborInfluence=0.7,
    featureProperties=["plot_keywords"],
    randomSeed=41,
    contextNodeLabels=G.node_labels(),
)

# Set the embeddings vectors produced by HashGNN as feature input to our ML algorithm
_ = pipe.selectFeatures("embedding")

6. Setting up autotuning

It is time to set up the ML algorithms for the training part of the pipeline.

In this example we will add logistic regression and random forest algorithms as candidates for the final model. Each candidate will be evaluated by the pipeline, and the best one, according to our specified metric, will be chosen.

It is hard to know how much regularization we need so as not to overfit our models on the training dataset, and for this reason we will use the autotuning capabilities of GDS to help us out. The autotuning algorithm will try out several values for the regularization parameters penalty (of logistic regression) and minSplitSize (of random forest) and choose the best ones it finds.

Please see the GDS manual to learn more about autotuning, logistic regression and random forest.
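The idea of tuning a parameter over an interval can be sketched as a plain random search. This is only an illustration under stated assumptions: the search strategy GDS uses internally may differ, and `random_search` and `hypothetical_validation_score` are made-up names, with the latter standing in for the metric a real pipeline would compute on validation data.

```python
import random


def random_search(score_fn, low, high, trials, seed):
    """Sample `trials` values uniformly from [low, high] and return the best one."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


def hypothetical_validation_score(penalty):
    """Stand-in for a validation metric; peaks at penalty = 0.3 by construction."""
    return 1.0 - (penalty - 0.3) ** 2


# Search the same interval we will give to the `penalty` parameter
best_penalty, best_score = random_search(
    hypothetical_validation_score, low=0.1, high=1.0, trials=10, seed=42
)
print(f"best penalty: {best_penalty:.3f}, score: {best_score:.3f}")
```

Supplying an interval rather than a single value is what signals that a parameter should be searched over in this way.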

# Add logistic regression as a candidate ML algorithm for the training
# Provide an interval for the `penalty` parameter to enable autotuning for it
_ = pipe.addLogisticRegression(penalty=(0.1, 1.0), maxEpochs=1000, patience=5, tolerance=0.0001, learningRate=0.01)

# Add random forest as a candidate ML algorithm for the training
# Provide an interval for the `minSplitSize` parameter to enable autotuning for it
_ = pipe.addRandomForest(minSplitSize=(2, 100), criterion="ENTROPY")

7. Training the pipeline

The configuration is done, and we are now ready to kick off the training of our pipeline and see what results we get.

In our training call, we provide what node label and property we want the training to target, as well as the metric that will determine how the best model candidate is chosen.

# Call train on our pipeline object to run the entire training pipeline and produce a model
model, _ = pipe.train(
    G,
    modelName="genre-predictor-model",
    targetNodeLabels=["Movie"],
    targetProperty="genre",
    metrics=["F1_MACRO"],
    randomSeed=42,
)

Let’s inspect the model that was created by the training pipeline.

print(f"F1_MACRO scores of trained model:\n{model.metrics()['F1_MACRO']}")

print(f"Winning ML algorithm candidate config:\n{model.best_parameters()}")

As we can see the best ML algorithm configuration that the autotuning found was logistic regression with penalty=0.159748.

Further, we note that the test set F1 score is 0.59118347, which is really good when compared to the scores of other algorithms on this dataset in the literature. More on this in the Comparison with other methods section below.
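For readers less familiar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so every genre counts equally regardless of how many movies it contains. The following is a minimal sketch of that definition with made-up labels; the helper `macro_f1` is our own and GDS may treat edge cases (such as classes with no predictions) differently.

```python
def macro_f1(true_labels, predicted_labels):
    """Compute the unweighted mean of the per-class F1 scores."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up genre ids for six movies
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
print(f"Macro F1: {macro_f1(actual, predicted):.4f}")
```

In this toy example the per-class F1 scores are 0.5, 0.8 and about 0.667, giving a macro average of roughly 0.656.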

8. Making new predictions

We can now use the model produced by our training pipeline to predict genres of the UnclassifiedMovie nodes.

# Predict `genre` for `UnclassifiedMovie` nodes and stream the results
predictions = model.predict_stream(G, targetNodeLabels=["UnclassifiedMovie"], includePredictedProbabilities=True)

print(f"First predictions of unclassified movie nodes:\n{predictions.head()}")

In this case we streamed the prediction results back to our client application, but we could for example also have mutated our GDS graph represented by G by calling model.predict_mutate instead.

9. Cleaning up

Optionally we can now clean up our GDS state, to free up memory for other tasks.

# Drop the GDS graph represented by `G` from the GDS graph catalog
_ = G.drop()

# Drop the GDS training pipeline represented by `pipe` from the GDS pipeline catalog
_ = pipe.drop()

# Drop the GDS model represented by `model` from the GDS model catalog
_ = model.drop()

10. Conclusion

By using only the GDS library and its client, we were able to train a node classification model using the sophisticated HashGNN node embedding algorithm and logistic regression. Our logistic regression configuration was automatically chosen as the best candidate among a number of other algorithms (like random forest with various configurations) through a process of autotuning. We were able to achieve this with very little code, and with very good scores.

Though we used a convenience method of the graphdatascience library to load an IMDB dataset into GDS, it would be very easy to replace this part with something like a projection from a Neo4j database to create a more realistic production workflow.

11. Comparison with other methods

As mentioned we tried to mimic the setup of the benchmarks in the NeurIPS paper Graph Transformer Networks, in order to compare with the current state of the art methods. A difference from this paper is that they have a predefined train-test set split, whereas we just generate a split (with the same size) uniformly at random within our training pipeline. However, we have no reason to think that the predefined split in the paper was not also generated uniformly at random. Additionally, they use length 64 float embeddings (64 * 32 = 2048 bits), whereas we use length 1256 bit embeddings with HashGNN.
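The embedding size comparison is simple arithmetic, spelled out in the short sketch below. It assumes 32-bit floats for the paper's embeddings, as the figure of 2048 bits above implies.

```python
FLOAT_BITS = 32  # assuming single-precision floats, as in the text above

paper_embedding_bits = 64 * FLOAT_BITS  # length-64 float embeddings
hashgnn_embedding_bits = 1256  # one bit per plot keyword dimension

print(f"Paper embeddings: {paper_embedding_bits} bits")
print(f"HashGNN embeddings: {hashgnn_embedding_bits} bits")
print(f"Ratio: {hashgnn_embedding_bits / paper_embedding_bits:.2f}")
```

So the HashGNN embeddings use a little over 60 % of the bits of the float embeddings.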

The scores they observe are the following:

Algorithm	Test set F1 score (%)
DeepWalk	32.08
metapath2vec	35.21
GCN	56.89
GAT	58.14
HAN	56.77
GTN	60.92

In light of this, it is indeed very impressive that we get a test set F1 score of 59.12 % with HashGNN and logistic regression, especially considering that:

- we use fewer bits to represent the embeddings (1256 vs 2048)
- we use dramatically fewer training parameters in our gradient descent compared to the deep learning models above
- HashGNN is an unsupervised algorithm
- HashGNN runs a lot faster (even without a GPU) and requires a lot less memory

12. Further learning

To learn more about the topics covered in this notebook, please check out the pages of the GDS manual on node classification pipelines, HashGNN, autotuning, logistic regression and random forest.