Graph construct: Import from Pandas


This Jupyter notebook is hosted in the Neo4j Graph Data Science Client GitHub repository.

The notebook shows how to use the gds.graph.construct method (available only in GDS 2.1+) to build an in-memory graph directly from Pandas DataFrames.

If you are using AuraDS, it is currently not possible to write the projected graph back to Neo4j.

1. Setup

We need an environment where Neo4j and GDS are available, for example AuraDS (which comes with GDS preinstalled) or Neo4j Desktop.

Once the credentials to this environment are available, we can install the graphdatascience package and import the client class.

%pip install graphdatascience
import os
from graphdatascience import GraphDataScience

When using a local Neo4j setup, the default connection URI is bolt://localhost:7687:

# Get Neo4j DB URI, credentials and name from environment if applicable
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_AUTH = None
NEO4J_DB = os.environ.get("NEO4J_DB", "neo4j")
if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
    NEO4J_AUTH = (
        os.environ.get("NEO4J_USER"),
        os.environ.get("NEO4J_PASSWORD"),
    )
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB)
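
The environment lookup above can also be wrapped in a small function, which makes the fallback behaviour easy to check without a running database. This is only a sketch: read_neo4j_settings is a hypothetical helper name, not part of the graphdatascience package.

```python
import os
from typing import Mapping, Optional, Tuple


def read_neo4j_settings(
    env: Mapping[str, str],
) -> Tuple[str, Optional[Tuple[str, str]], str]:
    """Return (uri, auth, database) using the same defaults as the cell above."""
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    database = env.get("NEO4J_DB", "neo4j")
    user, password = env.get("NEO4J_USER"), env.get("NEO4J_PASSWORD")
    # Only build credentials when both the user and the password are set
    auth = (user, password) if user and password else None
    return uri, auth, database


uri, auth, database = read_neo4j_settings(os.environ)
```

The returned values can then be passed to GraphDataScience(uri, auth=auth, database=database) exactly as before.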

When using AuraDS, the connection URI is slightly different as it uses the neo4j+s protocol. The client should also include the aura_ds=True flag to enable AuraDS-recommended settings.

# On AuraDS:
#
# gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH, database=NEO4J_DB, aura_ds=True)
from graphdatascience.server_version.server_version import ServerVersion

assert gds.server_version() >= ServerVersion(2, 1, 0)

We also import pandas to create a Pandas DataFrame from the original data source.

import pandas as pd

2. Load the Cora dataset

CORA_CONTENT = "https://data.neo4j.com/cora/cora.content"
CORA_CITES = "https://data.neo4j.com/cora/cora.cites"

We can load each CSV locally as a Pandas DataFrame.

content = pd.read_csv(CORA_CONTENT, header=None)
cites = pd.read_csv(CORA_CITES, header=None)

We need an additional preprocessing step to convert the subject field (a string in the dataset) into an integer, because node properties must be numerical to be projected into a graph. We can use a dictionary that maps each subject name to an integer ID.

SUBJECT_TO_ID = {
    "Neural_Networks": 0,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}
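
Before applying such a mapping, it can be worth checking that every subject string is actually covered, since an unmapped value would otherwise remain a string. The sketch below uses a small made-up Series rather than the real Cora data; SUBJECT_TO_ID_SAMPLE and the sample values are illustrative assumptions.

```python
import pandas as pd

# Illustrative subset of the mapping, applied to toy data (not the Cora file)
SUBJECT_TO_ID_SAMPLE = {"Neural_Networks": 0, "Rule_Learning": 1, "Theory": 4}
subjects = pd.Series(["Theory", "Neural_Networks", "Rule_Learning", "Theory"])

# Series.map yields NaN for any value missing from the dictionary
subject_ids = subjects.map(SUBJECT_TO_ID_SAMPLE)

# Collect the subject names that have no ID, if any
unmapped = subjects[subject_ids.isna()].unique().tolist()

if not unmapped:
    # Safe to store as integers once every value is mapped
    subject_ids = subject_ids.astype("int64")
```

Unlike replace, map turns unknown values into NaN, which makes gaps in the dictionary easy to detect.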

We can now create a new DataFrame with a nodeId field, a list of node labels, and the additional node properties subject (using the SUBJECT_TO_ID mapping) and features (converting all the feature columns to a single array column).

nodes = pd.DataFrame().assign(
    nodeId=content[0],
    labels="Paper",
    subject=content[1].replace(SUBJECT_TO_ID),
    features=content.iloc[:, 2:].apply(list, axis=1),
)
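
To see what this construction produces, the same pattern can be run on a tiny hand-written table. The two rows below are made-up stand-ins for the Cora content file (an ID column, a subject column, then one column per feature); the names raw and sample_nodes are used only for this illustration.

```python
import pandas as pd

# Toy stand-in for the content file: id, subject, then three feature columns
raw = pd.DataFrame(
    [
        [31336, "Neural_Networks", 0, 1, 0],
        [1061127, "Rule_Learning", 1, 0, 0],
    ]
)

sample_nodes = pd.DataFrame().assign(
    nodeId=raw[0],
    labels="Paper",  # a scalar is broadcast to every row
    # Collapse all feature columns of a row into one list-valued cell
    features=raw.iloc[:, 2:].apply(list, axis=1),
)

# Every node should carry a feature list of the same length
feature_lengths = sample_nodes["features"].map(len).unique().tolist()
```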

Let’s check the first 5 rows of the new DataFrame:

nodes.head()

Now we create a new DataFrame containing the relationships between the nodes. To create the equivalent of an undirected graph, we need to add direct and inverse relationships explicitly.

dir_relationships = pd.DataFrame().assign(sourceNodeId=cites[0], targetNodeId=cites[1], relationshipType="CITES")
inv_relationships = pd.DataFrame().assign(sourceNodeId=cites[1], targetNodeId=cites[0], relationshipType="CITES")

relationships = pd.concat([dir_relationships, inv_relationships]).drop_duplicates()
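
The effect of concatenating both directions and dropping duplicates can be illustrated with a toy edge list. The three citation pairs below are made up; note that the pair (1, 2) already appears in both directions, so its mirrored copies are removed.

```python
import pandas as pd

# Toy citation pairs; (1, 2) and (2, 1) are both present already
sample_cites = pd.DataFrame([[1, 2], [2, 1], [2, 3]])

forward = pd.DataFrame().assign(
    sourceNodeId=sample_cites[0], targetNodeId=sample_cites[1], relationshipType="CITES"
)
backward = pd.DataFrame().assign(
    sourceNodeId=sample_cites[1], targetNodeId=sample_cites[0], relationshipType="CITES"
)

# Six rows before deduplication, four distinct rows afterwards
sample_relationships = pd.concat([forward, backward]).drop_duplicates()

# In a symmetric edge list, every (source, target) pair has its mirror
pairs = set(zip(sample_relationships["sourceNodeId"], sample_relationships["targetNodeId"]))
is_symmetric = pairs == {(target, source) for source, target in pairs}
```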

Again, let’s check the first 5 rows of the new DataFrame:

relationships.head()

Finally, we can create the in-memory graph.

G = gds.graph.construct("cora-graph", nodes, relationships)
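
As an optional sanity check before constructing the graph, one might verify that every relationship endpoint refers to a known node ID. The sketch below runs on made-up DataFrames with the same column names as above; it makes no assumption about how gds.graph.construct itself treats such rows.

```python
import pandas as pd

# Toy node and relationship tables; node 4 does not exist
sample_nodes = pd.DataFrame({"nodeId": [1, 2, 3], "labels": "Paper"})
sample_relationships = pd.DataFrame(
    {"sourceNodeId": [1, 2, 4], "targetNodeId": [2, 3, 1], "relationshipType": "CITES"}
)

known_ids = set(sample_nodes["nodeId"])

# A relationship is valid only if both of its endpoints are known nodes
valid = sample_relationships["sourceNodeId"].isin(known_ids) & sample_relationships[
    "targetNodeId"
].isin(known_ids)

# Rows that reference at least one unknown node ID
dangling = sample_relationships[~valid]
```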

3. Use the graph

Let’s check that the new graph has been created:

gds.graph.list()

Let’s also count the nodes in the graph:

G.node_count()

The count matches the number of rows in the Pandas DataFrame:

len(content)

We can stream the value of the subject node property for each node in the graph, showing only the first 10 rows.

gds.graph.nodeProperties.stream(G, ["subject"]).head(10)
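
The streamed values are the integer IDs, so it can be convenient to translate them back into subject names. The sketch below inverts a subset of the mapping and applies it to a made-up result; the column names nodeId, nodeProperty, and propertyValue are an assumption about the shape of the streamed DataFrame and should be checked against the actual output.

```python
import pandas as pd

# Illustrative subset of the mapping, inverted to go from ID back to name
SUBJECT_TO_ID_SAMPLE = {"Neural_Networks": 0, "Rule_Learning": 1, "Reinforcement_Learning": 2}
ID_TO_SUBJECT = {subject_id: name for name, subject_id in SUBJECT_TO_ID_SAMPLE.items()}

# Made-up stand-in for a streamed result (assumed column names)
streamed = pd.DataFrame(
    {
        "nodeId": [31336, 1061127, 1106406],
        "nodeProperty": "subject",
        "propertyValue": [0, 1, 2],
    }
)

# Add a readable column next to the numeric property value
streamed["subjectName"] = streamed["propertyValue"].map(ID_TO_SUBJECT)
```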

4. Cleanup

When the graph is no longer needed, it should be dropped to free up memory:

G.drop()