Machine learning pipelines

The Python client has special support for Link prediction pipelines and pipelines for node property prediction. The GDS pipelines are represented as pipeline objects in the GDS Python Client.

Operating pipelines through the client is based entirely on these pipeline objects. This is a more convenient and pythonic API compared to the Cypher procedure API. Once created, the pipeline objects can be passed as arguments to various methods in the Python client, such as the pipeline catalog operations. Additionally, the pipeline objects have convenience methods allowing for inspection of the represented pipeline without explicitly involving the pipeline catalog.

In the examples below we assume that we have an instantiated GraphDataScience object called gds. Read more about this in Getting started.

1. Node classification

This section outlines how to use the Python client to build, configure and train a node classification pipeline, as well as how to use the model that training produces for predictions.

1.1. Pipeline

To create a new node classification pipeline one would make the following call:

pipe = gds.nc_pipe("my-pipe")

where pipe is a pipeline object.

To then go on to build, configure and train the pipeline we would call methods directly on the node classification pipeline object. Below is a description of the methods on such objects:

Table 1. Node classification pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[1]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[1]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add an MLP model configuration to train as a candidate in the model selection phase. ^[1]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NCPredictionPipeline, Series`	Train the pipeline on the given input graph using given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate training the pipeline on the given input graph using given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the model exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS Pipeline Catalog.
1. Ranges can also be given as length two Tuple`s. I.e. `(x, y) is the same as `{range: [x, y]}`.

There are two main differences when comparing the methods above that map to procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns a NCModel model object that we can run predictions with as well as a pandas Series with the metadata from the training.

Please consult the node classification Cypher documentation for information about what kind of input the methods expect.

1.1.1. Example

Below is a small example of how one could configure and train a very basic node classification pipeline. Note that we don’t configure splits explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", fraudster: 0}),
    (b:Person {name: "Alice", fraudster: 0}),
    (c:Person {name: "Eve", fraudster: 1}),
    (d:Person {name: "Chad", fraudster: 1}),
    (e:Person {name: "Dan", fraudster: 0}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["fraudster"]}}, "KNOWS")

assert G.node_labels() == ["Person"]

pipe, _ = gds.beta.pipeline.nodeClassification.create("my-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over logistic regression
pipe.addLogisticRegression(tolerance=(0.01, 0.1))
pipe.addLogisticRegression(penalty=1.0)

# Train the pipeline targeting node property "class" as label and "ACCURACY" as only metric
fraud_model, train_result = pipe.train(
    G,
    modelName="fraud-model",
    targetProperty="fraudster",
    metrics=["ACCURACY"],
    randomSeed=111
)
assert train_result["trainMillis"] >= 0

A model referred to as "fraud-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

1.2. Model

As we saw in the previous section, node classification models are created when training a node classification pipeline. In addition to inheriting the methods common to all model objects, node classification models have the following methods:

Table 2. Node classification model methods
Name	Arguments	Return type	Description
`classes`	`-`	`List[int]`	List of classes used to train the classification model.
`feature_properties`	`-`	`List[str]`	Node properties used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms used by the pipeline to compute node properties before training.
`metrics`	`-`	`Series`	Values for the metrics specified when training.
`best_parameters`	`-`	`Series`	Training method parameters which performed best across the validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and mutate graph with predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and mutating graph with predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict classes for nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and streaming the results.
`predict_write`	`G: Graph, config: **kwargs`	`Series`	Predict classes for nodes of the input graph and write results back to the database.
`predict_write_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting classes for nodes of the input graph and writing the results back to the database.

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself have this information.

1.2.1. Example (continued)

We now continue the example above using the node classification model trained_pipe_model we trained there.

# Make sure we indeed obtained an accuracy score
metrics = fraud_model.metrics()
assert "ACCURACY" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = fraud_model.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

2. Link prediction

This section outlines how to use the Python client to build, configure and train a link prediction pipeline, as well as how to use the model that training produces for predictions.

2.1. Pipeline

To create a new link prediction pipeline one would make the following call:

pipe = gds.lp_pipe("my-pipe")

where pipe is a pipeline object.

To then go on to build, configure and train the pipeline we would call methods directly on the link prediction pipeline object. Below is a description of the methods on such objects:

Table 3. Link prediction pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`addFeature`	`feature_type: str, config: **kwargs`	`Series`	Add a link feature for model training based on node properties and a feature combiner.
`configureSplit`	`config: **kwargs`	`Series`	Configure the feature-train-test dataset split.
`addLogisticRegression`	`parameter_space: dict[str, any]`	`Series`	Add a logistic regression model configuration to train as a candidate in the model selection phase. ^[2]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[2]
`addMLP`	`parameter_space: dict[str, any]`	`Series`	Add an MLP model configuration to train as a candidate in the model selection phase. ^[2]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`LPPredictionPipeline, Series`	Train the model on the given input graph using given keyword arguments.
`train_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate training the pipeline on the given input graph using given keyword arguments.
`feature_steps`	`-`	`DataFrame`	Returns a list of the selected feature steps for the pipeline.
`exists`	`-`	`bool`	`True` if the model exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS Pipeline Catalog.
2. Ranges can also be given as length two Tuple`s. I.e. `(x, y) is the same as `{range: [x, y]}`.

There are two main differences when comparing the methods above that map to procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns a LPModel model object that we can run predictions with as well as a pandas Series with the metadata from the training.

Please consult the link prediction Cypher documentation for information about what kind of input the methods expect.

2.1.1. Example

Below is a small example of how one could configure and train a very basic link prediction pipeline. Note that we don’t configure training parameters explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob"}),
    (b:Person {name: "Alice"}),
    (c:Person {name: "Eve"}),
    (d:Person {name: "Chad"}),
    (e:Person {name: "Dan"}),
    (f:Person {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", "Person", {"KNOWS": {"orientation":"UNDIRECTED"}})

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.beta.pipeline.linkPrediction.create("lp-pipe")

# Add FastRP as a property step producing "embedding" node properties
pipe.addNodeProperty("fastRP", embeddingDimension=128, mutateProperty="embedding", randomSeed=1337)

# Combine our "embedding" node properties with Hadamard to create link features for training
pipe.addFeature("hadamard", nodeProperties=["embedding"])

# Verify that the features to be used in model training are what we expect
steps = pipe.feature_steps()
assert len(steps) == 1
assert steps["name"][0] == "HADAMARD"

# Specify the fractions we want for our dataset split
pipe.configureSplit(trainFraction=0.2, testFraction=0.2, validationFolds=2)

# Add a random forest model with tuning over `maxDepth`
pipe.addRandomForest(maxDepth=(2, 20))

# Train the pipeline and produce a model named "friend-recommender"
friend_recommender, train_result = pipe.train(
    G,
    modelName="friend-recommender",
    targetRelationshipType="KNOWS",
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

A model referred to as "my-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

2.2. Model

As we saw in the previous section, link prediction models are created when training a link prediction pipeline. In addition to inheriting the methods common to all model objects, link prediction models have the following methods:

Table 4. Link prediction model methods
Name	Arguments	Return type	Description
`link_features`	`-`	`List[LinkFeature]`	The input link features used to train the model.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms used by the pipeline to compute node properties before training.
`metrics`	`-`	`Series`	Values for the metrics specified when training.
`best_parameters`	`-`	`Series`	Training method parameters which performed best across the validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict links between non-neighboring nodes of the input graph and mutate graph with predictions.
`predict_mutate_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting links between non-neighboring nodes of the input graph and mutating graph with predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict links between non-neighboring nodes of the input graph and stream the results.
`predict_stream_estimate`	`G: Graph, config: **kwargs`	`Series`	Estimate predicting links between non-neighboring nodes of the input graph and streaming the results.

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself have this information.

2.2.1. Example (continued)

We now continue the example above using the link prediction model trained_pipe_model we trained there.

# Make sure we indeed obtained an AUCPR score
metrics = friend_recommender.metrics()
assert "AUCPR" in metrics

# Predict on `G` and mutate it with the relationship predictions
mutate_result = friend_recommender.predict_mutate(G, topN=5, mutateRelationshipType="PRED_REL")
assert mutate_result["relationshipsWritten"] == 5 * 2  # Undirected relationships

3. Node regression

This section outlines how to use the Python client to build, configure and train a node regression pipeline, as well as how to use the model that training produces for predictions.

3.1. Pipeline

To create a new node regression pipeline one would make the following call:

pipe = gds.nr_pipe("my-pipe")

where pipe is a pipeline object.

To then go on to build, configure and train the pipeline we would call methods directly on the node regression pipeline object. Below is a description of the methods on such objects:

Table 5. Node regression pipeline methods
Name	Arguments	Return type	Description
`addNodeProperty`	`procedure_name: str, config: **kwargs`	`Series`	Add an algorithm that produces a node property to the pipeline, with optional algorithm-specific configuration.
`selectFeatures`	`node_properties: Union[str, list[str]]`	`Series`	Select node properties to be used as features.
`configureSplit`	`config: **kwargs`	`Series`	Configure the train-test dataset split.
`addLinearRegression`	`parameter_space: dict[str, any]`	`Series`	Add a linear regression model configuration to train as a candidate in the model selection phase. ^[3]
`addRandomForest`	`parameter_space: dict[str, any]`	`Series`	Add a random forest model configuration to train as a candidate in the model selection phase. ^[3]
`configureAutoTuning`	`config: **kwargs`	`Series`	Configure the auto-tuning.
`train`	`G: Graph, config: **kwargs`	`NCPredictionPipeline, Series`	Train the pipeline on the given input graph using given keyword arguments.
`feature_properties`	`-`	`Series`	Returns a list of the selected feature properties for the pipeline.
`exists`	`-`	`bool`	`True` if the model exists in the GDS Pipeline Catalog, `False` otherwise.
`name`	`-`	`str`	The name of the pipeline as it appears in the pipeline catalog.
`type`	`-`	`str`	The type of pipeline.
`creation_time`	`-`	`neo4j.time.Datetime`	Time when the pipeline was created.
`node_property_steps`	`-`	`DataFrame`	Returns the node property steps of the pipeline.
`split_config`	`-`	`Series`	Returns the configuration set up for feature-train-test splitting of the dataset.
`parameter_space`	`-`	`Series`	Returns the model parameter space set up for model selection when training.
`auto_tuning_config`	`-`	`Series`	Returns the configuration set up for auto-tuning.
`drop`	`failIfMissing: Optional[bool]`	`Series`	Removes the pipeline from the GDS Pipeline Catalog.
3. Ranges can also be given as length two Tuple`s. I.e. `(x, y) is the same as `{range: [x, y]}`.

There are two main differences when comparing the methods above that map to procedures of the Cypher API:

As the Python methods are called on the pipeline object, one does not need to provide a name when calling them.
Configuration parameters in the Cypher calls are represented by named keyword arguments in the Python method calls.

Another difference is that the train Python call takes a graph object instead of a graph name, and returns a NRModel model object that we can run predictions with as well as a pandas Series with the metadata from the training.

Please consult the node regression Cypher documentation for information about what kind of input the methods expect.

3.1.1. Example

Below is a small example of how one could configure and train a very basic node regression pipeline. Note that we don’t configure splits explicitly, but rather use the default.

To exemplify this, we introduce a small person graph:

gds.run_cypher(
  """
  CREATE
    (a:Person {name: "Bob", age: 22}),
    (b:Person {name: "Alice", age: 5}),
    (c:Person {name: "Eve", age: 53}),
    (d:Person {name: "Chad", age: 44}),
    (e:Person {name: "Dan", age: 60}),
    (f:UnknownPerson {name: "Judy"}),

    (a)-[:KNOWS]->(b),
    (a)-[:KNOWS]->(c),
    (a)-[:KNOWS]->(d),
    (b)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(d),
    (c)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(e),
    (d)-[:KNOWS]->(f),
    (e)-[:KNOWS]->(f)
  """
)
G, project_result = gds.graph.project("person_graph", {"Person": {"properties": ["age"]}}, "KNOWS")

assert G.relationship_types() == ["KNOWS"]

pipe, _ = gds.alpha.pipeline.nodeRegression.create("nr-pipe")

# Add Degree centrality as a property step producing "rank" node properties
pipe.addNodeProperty("degree", mutateProperty="rank")

# Select our "rank" property as a feature for the model training
pipe.selectFeatures("rank")

# Verify that the features to be used in model training are what we expect
feature_properties = pipe.feature_properties()
assert len(feature_properties) == 1
assert feature_properties[0]["feature"] == "rank"

# Configure the model training to do cross-validation over linear regression
pipe.addLinearRegression(tolerance=(0.01, 0.1))
pipe.addLinearRegression(penalty=1.0)

# Train the pipeline targeting node property "age" as label and "MEAN_SQUARED_ERROR" as only metric
age_predictor, train_result = pipe.train(
    G,
    modelName="age-predictor",
    targetProperty="age",
    metrics=["MEAN_SQUARED_ERROR"],
    randomSeed=42
)
assert train_result["trainMillis"] >= 0

A model referred to as "my-model" in the GDS Model Catalog is produced. In the next section we will go over how to use that model to make predictions.

3.2. Model

As we saw in the previous section, node regression models are created when training a node regression pipeline. In addition to inheriting the methods common to all model objects, node regression models have the following methods:

Table 6. Node regression model methods
Name	Arguments	Return type	Description
`feature_properties`	`-`	`List[str]`	Returns the node properties that were used as input model features.
`node_property_steps`	`-`	`List[NodePropertyStep]`	List of algorithms used by the pipeline to compute node properties before training.
`metrics`	`-`	`Series`	Values for the metrics specified when training.
`best_parameters`	`-`	`Series`	Training method parameters which performed best across the validation sets.
`predict_mutate`	`G: Graph, config: **kwargs`	`Series`	Predict property values for nodes of the input graph and mutate graph with predictions.
`predict_stream`	`G: Graph, config: **kwargs`	`DataFrame`	Predict property values for nodes of the input graph and stream the results.

One can note that the predict methods are indeed very similar to their Cypher counterparts. The three main differences are that:

They take a graph object instead of a graph name.
They have Python keyword arguments representing the keys of the configuration map.
One does not have to provide a "modelName" since the model object used itself have this information.

3.2.1. Example (continued)

We now continue the example above using the node regression model age_predictor we trained there. Suppose that we have a new graph H that we want to run predictions on.

# Make sure we indeed obtained an MEAN_SQUARED_ERROR score
metrics = age_predictor.metrics()
assert "MEAN_SQUARED_ERROR" in metrics

H, project_result = gds.graph.project("full_person_graph", ["Person", "UnknownPerson"], "KNOWS")

# Predict on `H` and stream the results with a specific concurrency of 2
predictions = age_predictor.predict_stream(H, concurrency=2)
assert len(predictions) == H.node_count()

4. The pipeline catalog

The primary way to use pipeline objects is for training models. Additionally, pipeline objects can be used as input to GDS Pipeline Catalog operations. For instance, supposing we have a pipeline object pipe, we could:

exists_result = gds.beta.pipeline.exists(pipe.name())

if exists_result["exists"]:
	gds.beta.pipeline.drop(pipe)  # same as pipe.drop()

A pipeline object that has already been created and is present in the pipeline catalog can be retrieved calling the get method with its name. For example, we can list from the catalog and use the first pipeline name we find to get a pipeline object representing that pipeline, which will be the NodeClassification pipeline we created in the example above.

list_result = gds.beta.pipeline.list()
first_pipeline_name = list_result["pipelineName"][0]
pipe = gds.pipeline.get(first_pipeline_name)
assert pipe.name() == "my-pipe"