The graph object
To use most of the functionality in GDS, you must first project a graph into the GDS Graph Catalog.
When projecting a graph with the Python client, a client-side reference to the projected graph is returned.
We call these references Graph objects.
Once created, Graph objects can be passed as arguments to other methods in the Python client, for example to run algorithms or train machine learning models.
Additionally, Graph objects have convenience methods for inspecting the projected graph they represent without explicitly involving the graph catalog.
In the examples below we assume that we have an instantiated GraphDataScience object called gds.
Read more about this in Getting started.
1. Projecting a graph object
There are several ways of projecting a graph object. The simplest way is to do a native projection:
# We put this simple graph in our database
gds.run_cypher(
    """
    CREATE
      (m: City {name: "Malmö"}),
      (l: City {name: "London"}),
      (s: City {name: "San Mateo"}),
      (m)-[:FLY_TO]->(l),
      (l)-[:FLY_TO]->(m),
      (l)-[:FLY_TO]->(s),
      (s)-[:FLY_TO]->(l)
    """
)

# We estimate required memory of the operation
res = gds.graph.project.estimate(
    ["City"],           # Node projection
    "FLY_TO",           # Relationship projection
    readConcurrency=4   # Configuration parameters
)
assert res["bytesMax"] < 1e12

G, result = gds.graph.project(
    "offices",          # Graph name
    ["City"],           # Node projection
    "FLY_TO",           # Relationship projection
    readConcurrency=4   # Configuration parameters
)
assert G.node_count() == result["nodeCount"]
where G is a Graph object, and result is a pandas Series containing metadata from the underlying procedure call.
Note that all projection syntax variants are supported by specifying a Python dict or list for the node and relationship projection arguments.
To specify configuration parameters corresponding to the keys of the procedure’s configuration map, we give named keyword arguments, like readConcurrency=4 above.
Read more about the syntax in the GDS manual.
As in Cypher, there’s also a corresponding gds.graph.project.estimate method that can be called in an analogous way.
To get a graph object that represents a graph that has already been projected into the graph catalog, one can call the client-side only get method and pass it a name:
G = gds.graph.get("offices")
For users who are GDS admins, gds.graph.get will resolve graph names into Graph objects even when the provided name refers to another user’s graph projection.
In addition to those mentioned above, there are five more methods that create graph objects:
- gds.graph.project.cypher (this is the Legacy Cypher projection; see Projecting a graph using Cypher Projection for the new Cypher projection)
- gds.beta.graph.subgraph
- gds.beta.graph.generate
- gds.graph.sample.rwr
- gds.graph.sample.cnarw
Their Cypher signatures map to Python in much the same way as gds.graph.project above.
2. Projecting a graph using Cypher Projection
The method gds.graph.cypher.project allows for projecting a graph using Cypher projection.
Cypher projection is not a dedicated procedure; rather, it requires writing a Cypher query that calls the gds.graph.project aggregation function.
Read more about Cypher projection in the GDS manual.
The method gds.graph.cypher.project saves you from having to call gds.run_cypher and then follow up with gds.graph.get.
2.1. Syntax
Name | Type | Default | Description |
---|---|---|---|
|
|
|
The Cypher query to be executed. Must end with |
|
|
|
Overrides the target database. The default uses the database from the connection. |
|
|
|
The query parameters as keyword args. |
Unlike gds.run_cypher, but very much like gds.graph.project, it returns a tuple of a Graph object and a pandas Series containing metadata from the Cypher execution.
The method does not modify the Cypher query in any way; all projection configuration must be done in the query itself.
However, the method verifies that the query contains only a single RETURN gds.graph.project(…) clause and that this clause appears at the end of the query.
This means that this method cannot be used to project a graph together with other aggregations in Cypher, nor can the result row be renamed.
If the provided query fails the validation, the fallback is to use gds.run_cypher together with gds.graph.get to achieve the same result.
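The fallback described above can be wrapped in a small helper. This is only a sketch: the name project_with_fallback is hypothetical, and it assumes the query receives its graph name through a $graph_name parameter, which the helper also uses for the lookup.

```python
from typing import Any, Tuple


def project_with_fallback(gds: Any, query: str, graph_name: str, **params: Any) -> Tuple[Any, Any]:
    """Run an arbitrary projection query, then look the graph up by name.

    `gds` is expected to be a GraphDataScience object. The helper name and
    the `$graph_name` query parameter are assumptions made for this sketch.
    """
    # run_cypher executes the query unchanged and returns its result rows
    rows = gds.run_cypher(query, params={"graph_name": graph_name, **params})
    # Resolve the name of the freshly projected graph into a Graph object
    G = gds.graph.get(graph_name)
    return G, rows
```

Unlike gds.graph.cypher.project, the second element returned here is whatever gds.run_cypher produces for the query, so its shape depends on the RETURN clause you wrote.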
# We put this simple graph in our database
gds.run_cypher(
    """
    CREATE
      (m: City {name: "Malmö"}),
      (l: City {name: "London"}),
      (s: City {name: "San Mateo"}),
      (m)-[:FLY_TO]->(l),
      (l)-[:FLY_TO]->(m),
      (l)-[:FLY_TO]->(s),
      (s)-[:FLY_TO]->(l)
    """
)

G, result = gds.graph.cypher.project(
    """
    MATCH (n)-->(m)
    RETURN gds.graph.project($graph_name, n, m, {
        sourceNodeLabels: $label,
        targetNodeLabels: $label,
        relationshipType: $rel_type
    })
    """,                   # Cypher query
    database="neo4j",      # Target database
    graph_name="offices",  # Query parameter
    label="City",          # Query parameter
    rel_type="FLY_TO"      # Query parameter
)
assert G.node_count() == result["nodeCount"]
3. Constructing a graph from DataFrames
In addition to projecting a graph from the Neo4j database, it is also possible to create graphs directly from pandas DataFrame objects.
3.1. Syntax
Name | Type | Default | Description |
---|---|---|---|
|
|
|
Name of the graph to be constructed. |
|
|
|
One or more dataframes containing node data. |
|
|
|
One or more dataframes containing relationship data. |
|
|
|
Number of threads used to construct the graph. |
|
|
|
List of relationship types to be projected as undirected. |
3.2. Example
import pandas

# Node data: one row per node
nodes = pandas.DataFrame(
    {
        "nodeId": [0, 1, 2, 3],
        "labels": ["A", "B", "C", "A"],
        "prop1": [42, 1337, 8, 0],
        "otherProperty": [0.1, 0.2, 0.3, 0.4]
    }
)

# Relationship data: one row per relationship
relationships = pandas.DataFrame(
    {
        "sourceNodeId": [0, 1, 2, 3],
        "targetNodeId": [1, 2, 3, 0],
        "relationshipType": ["REL", "REL", "REL", "REL"],
        "weight": [0.0, 0.0, 0.1, 42.0]
    }
)

G = gds.graph.construct(
    "my-graph",     # Graph name
    nodes,          # One or more dataframes containing node data
    relationships   # One or more dataframes containing relationship data
)
assert "REL" in G.relationship_types()
The above example creates a graph from two DataFrame objects, one for nodes and one for relationships.
The projected graph is equivalent to a graph that the following Cypher query would create in a Neo4j database:
CREATE
  (a:A {prop1: 42, otherProperty: 0.1}),
  (b:B {prop1: 1337, otherProperty: 0.2}),
  (c:C {prop1: 8, otherProperty: 0.3}),
  (d:A {prop1: 0, otherProperty: 0.4}),
  (a)-[:REL {weight: 0.0}]->(b),
  (b)-[:REL {weight: 0.0}]->(c),
  (c)-[:REL {weight: 0.1}]->(d),
  (d)-[:REL {weight: 42.0}]->(a)
The supported format for the node data frames is described in Arrow node schema and the format for the relationship data frames is described in Arrow relationship schema.
3.3. Apache Arrow flight server support
The construct method can utilize the Apache Arrow Flight Server of GDS if it’s enabled.
In particular, this means that:
- The construction of the graph is greatly sped up.
- It is possible to supply more than one data frame, both for nodes and relationships. If multiple node data frames are used, they need to contain distinct node ids across all node data frames.
- Prior to the construct call, a call to GraphDataScience.set_database must have been made to explicitly specify which Neo4j database should be targeted.
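Before passing several node data frames to construct, it can help to check the distinct-id requirement on the client side. The sketch below uses plain lists standing in for the nodeId column of each data frame; find_shared_node_ids is a hypothetical helper and not part of the client.

```python
from typing import Iterable, Set


def find_shared_node_ids(node_id_columns: Iterable[Iterable[int]]) -> Set[int]:
    """Return the node ids that occur in more than one of the given id columns."""
    seen: Set[int] = set()
    shared: Set[int] = set()
    for column in node_id_columns:
        ids = set(column)
        shared |= seen & ids  # ids already seen in an earlier column
        seen |= ids
    return shared


# Each inner list stands in for the "nodeId" column of one node data frame
assert find_shared_node_ids([[0, 1, 2], [3, 4]]) == set()
assert find_shared_node_ids([[0, 1, 2], [2, 3]]) == {2}
```

With pandas, the same function could be fed the nodeId column of each frame, since a column can be iterated like a list.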
3.4. Limitations on community edition
For users of GDS community edition, performance can be impacted for large graphs.
It is possible that the socket connection with the database times out.
If this happens, a possible workaround is to modify the server configuration server.bolt.connection_keep_alive or server.bolt.connection_keep_alive_probes.
However, be aware of the side effects, such as a genuine connection issue now taking longer to be detected.
4. Loading a NetworkX graph
Another way to construct a graph from client-side data is by using the library’s convenience NetworkX loading method.
In order to use this method, one has to install NetworkX support for the graphdatascience library:
pip install graphdatascience[networkx]
The method that exposes the NetworkX dataset loading functionality is called gds.graph.networkx.load.
It returns a Graph object and takes three arguments:
Name | Type | |
---|---|---|
|
|
A graph in the NetworkX format |
|
|
The name of the created GDS graph |
|
|
An optional number of threads to use |
Exactly how the networkx.Graph nx_G maps to a GDS Graph projection is outlined in detail below.
4.1. Example
Let’s look at an example of loading a minimal heterogeneous toy NetworkX graph.
import networkx as nx
# Construct a directed NetworkX graph
nx_G = nx.DiGraph()
nx_G.add_node(1, labels=["Person"], age=52)
nx_G.add_node(42, labels=["Product", "Item"], cost=17.2)
nx_G.add_edge(1, 42, relationshipType="BUYS", quantity=4)
# Load the graph into GDS
G = gds.graph.networkx.load(nx_G, "purchases")
# Verify that the projection is what we expect
assert G.name() == "purchases"
assert G.node_count() == 2
assert set(G.node_labels()) == {"Person", "Product", "Item"}
assert G.node_properties("Person") == ["age"]
assert G.node_properties("Product") == ["cost"]
# A relationship count of 1 (rather than 2) shows that the graph is indeed directed
assert G.relationship_count() == 1
assert G.relationship_types() == ["BUYS"]
assert G.relationship_properties("BUYS") == ["quantity"]
Combined with NetworkX’s functionality for reading various graph formats, one can easily load popular graph formats like edge list and GML into GDS.
Combined with NetworkX’s functionality for generating various kinds of graphs, one can easily load graph types popular in the literature into GDS, such as expanders, lollipop graphs, complete graphs, and more.
4.2. NetworkX schema to GDS schema
There are some rules as to how the NetworkX graph maps to the projected Graph in GDS.
They follow principles similar to those of Constructing a graph from DataFrames, and are outlined in detail in this section.
4.2.1. Node labels
Node labels for the projected GDS graph are taken from attributes on networkx.Graph nodes.
The values of the node attribute key labels dictate what labels nodes are given in the projection.
These values should be either strings or lists of strings.
Either all nodes of the networkx.Graph must have valid labels attributes, or the attribute must be left out of the graph entirely.
That is, a networkx.Graph with no labels node attributes at all is also allowed.
In this latter case, the nodes in the projected Graph will all have the node label N.
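The label rules above can be sketched in plain Python. This is an illustration of the rules as described here, not the client's implementation; resolve_node_labels is a hypothetical name, and its input is a mapping from node id to attribute dict, which is the shape dict(nx_G.nodes(data=True)) produces.

```python
from typing import Any, Dict, List

DEFAULT_LABEL = "N"  # label given to every node when no node sets labels


def resolve_node_labels(node_attrs: Dict[Any, Dict[str, Any]]) -> Dict[Any, List[str]]:
    """Map each node id to its list of labels, following the all-or-nothing rule."""
    has_labels = ["labels" in attrs for attrs in node_attrs.values()]
    if not any(has_labels):
        return {node: [DEFAULT_LABEL] for node in node_attrs}
    if not all(has_labels):
        raise ValueError("Either every node has a 'labels' attribute, or none does")

    resolved: Dict[Any, List[str]] = {}
    for node, attrs in node_attrs.items():
        labels = attrs["labels"]
        if isinstance(labels, str):
            labels = [labels]  # a single string counts as one label
        if not (isinstance(labels, list) and all(isinstance(label, str) for label in labels)):
            raise ValueError(f"Invalid labels for node {node!r}: {labels!r}")
        resolved[node] = labels
    return resolved


labelled = {1: {"labels": "Person", "age": 52}, 42: {"labels": ["Product", "Item"]}}
assert resolve_node_labels(labelled) == {1: ["Person"], 42: ["Product", "Item"]}
assert resolve_node_labels({1: {"age": 52}, 2: {}}) == {1: ["N"], 2: ["N"]}
```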
4.2.2. Node properties
Node properties for the projected GDS graph are taken from attributes on networkx.Graph nodes.
The keys of the attributes will map to property names, and the allowed values must follow the regular guidelines for node property values in GDS.
Please note though that the node attribute key labels is reserved for node labels (previous section) and will not translate to node properties in the projection.
4.2.3. Relationship types
Relationship types for the projected GDS graph are taken from attributes on networkx.Graph edges.
The values of the edge attribute key relationshipType dictate what types relationships are given in the projection.
These values should be either strings or omitted.
Either all edges of the networkx.Graph must have valid relationshipType attributes, or the attribute must be left out of the graph entirely.
That is, a networkx.Graph with no relationshipType edge attributes at all is also allowed.
In this latter case, the relationships in the projected Graph will all have the relationship type R.
4.2.4. Relationship properties
Relationship properties for the projected GDS graph are taken from attributes on networkx.Graph edges.
The keys of the attributes will map to property names, and the allowed values must follow the regular guidelines for relationship property values in GDS.
Please note though that the edge attribute key relationshipType is reserved for relationship types (previous section) and will not translate to relationship properties in the projection.
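How the reserved relationshipType key is separated from ordinary edge attributes can be sketched as follows. This mirrors the rules described above rather than the client's actual code; resolve_relationships is a hypothetical name, and its input is a list of (source, target, attributes) tuples, the shape that nx_G.edges(data=True) yields.

```python
from typing import Any, Dict, List, Tuple

DEFAULT_REL_TYPE = "R"  # type given to every relationship when no edge sets one
RESERVED_KEY = "relationshipType"

Edge = Tuple[Any, Any, Dict[str, Any]]
Relationship = Tuple[Any, Any, str, Dict[str, Any]]


def resolve_relationships(edges: List[Edge]) -> List[Relationship]:
    """Split each edge's attributes into a relationship type and its properties."""
    typed = [RESERVED_KEY in attrs for _, _, attrs in edges]
    if any(typed) and not all(typed):
        raise ValueError("Either every edge has a 'relationshipType' attribute, or none does")

    resolved: List[Relationship] = []
    for source, target, attrs in edges:
        rel_type = attrs.get(RESERVED_KEY, DEFAULT_REL_TYPE)
        if not isinstance(rel_type, str):
            raise ValueError(f"Invalid relationship type: {rel_type!r}")
        # Every attribute except the reserved key becomes a relationship property
        properties = {key: value for key, value in attrs.items() if key != RESERVED_KEY}
        resolved.append((source, target, rel_type, properties))
    return resolved


edges = [(1, 42, {"relationshipType": "BUYS", "quantity": 4})]
assert resolve_relationships(edges) == [(1, 42, "BUYS", {"quantity": 4})]
```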
4.2.5. Relationship direction
The direction (DIRECTED or UNDIRECTED) of the relationships in the projected GDS graph is inferred from the type of networkx.Graph used.
If the given NetworkX graph is directed, that is, a (sub)class of either networkx.DiGraph or networkx.MultiDiGraph, the relationships will be DIRECTED in the projection.
Otherwise, they will be UNDIRECTED.
4.3. Limitations on community edition
For users of GDS community edition, performance can be impacted for large graphs.
It is possible that the socket connection with the database times out.
If this happens, a possible workaround is to modify the server configuration server.bolt.connection_keep_alive or server.bolt.connection_keep_alive_probes.
However, be aware of the side effects, such as a genuine connection issue now taking longer to be detected.
5. Inspecting a graph object
There are convenience methods on the graph object that let us extract information about our projected graph.
Name | Arguments | Return type | Description |
---|---|---|---|
|
|
|
The name of the projected graph. |
|
|
|
Name of the database in which the graph has been projected. |
|
|
|
The node count of the projected graph. |
|
|
|
The relationship count of the projected graph. |
|
|
|
A list of the node labels present in the graph. |
|
|
|
A list of the relationship types present in the graph. |
|
|
|
If label argument given, returns a list of the properties present on the nodes with the provided node label. Otherwise, returns a |
|
|
|
If type argument given, returns a list of the properties present on the relationships with the provided relationship type. Otherwise, returns a |
|
|
|
The average out-degree of generated nodes. |
|
|
|
Density of the graph. |
|
|
|
Number of bytes used in the Java heap to store the graph. |
|
|
|
Human-readable description of |
|
|
|
Returns |
|
|
|
Removes the graph from the GDS Graph Catalog. |
|
|
|
The configuration used to project the graph in memory. |
|
|
|
Time when the graph was projected. |
|
|
|
Time when the graph was last modified. |
For example, to get the node count and node properties of a graph G, we would do the following:
n = G.node_count()
props = G.node_properties("City")
6. Context management
The graph object also implements the context management protocol, i.e., it is usable inside with clauses.
On exiting the with block, the graph projection will be automatically dropped on the server side.
# We use the example graph from the `Projecting a graph object` section
with gds.graph.project(
    "tmp_offices",      # Graph name
    ["City"],           # Node projection
    "FLY_TO",           # Relationship projection
    readConcurrency=4   # Configuration parameters
)[0] as G_tmp:
    assert G_tmp.exists()

# Outside of the with block the Graph does not exist
assert not gds.graph.exists("tmp_offices")["exists"]
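To make the protocol itself concrete, here is a minimal stand-in that needs no database. GraphStandIn is hypothetical and only tracks names in a Python set; it is not how the client's Graph class is implemented, but it shows the same enter and exit behaviour.

```python
class GraphStandIn:
    """Hypothetical stand-in for a Graph object; the 'catalog' is just a set of names."""

    def __init__(self, name, catalog):
        self._name = name
        self._catalog = catalog
        catalog.add(name)  # plays the role of projecting the graph

    def exists(self):
        return self._name in self._catalog

    def drop(self):
        self._catalog.discard(self._name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block ends, whether normally or through an exception
        self.drop()
        return False  # do not swallow exceptions raised inside the block


catalog = set()
with GraphStandIn("tmp_offices", catalog) as G_tmp:
    assert G_tmp.exists()
assert "tmp_offices" not in catalog
```

Because Python calls __exit__ even when the block raises, this pattern suits temporary projections that should not outlive the code that uses them.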
7. Using a graph object
The primary use case for a graph object is to pass it to algorithms, but it’s also the input to most methods of the GDS Graph Catalog.
7.1. Input to algorithms
The Python client syntax for using a Graph as input to an algorithm follows the GDS Cypher procedure API, where the graph is the first parameter passed to the algorithm.
result = gds[.<tier>].<algorithm>.<execution-mode>[.<estimate>](
    G: Graph,
    **configuration: dict[str, any]
)
In this example we run the degree centrality algorithm on a graph G:
result = gds.degree.mutate(G, mutateProperty="degree")
assert "centralityDistribution" in result
7.2. The graph catalog
All procedures of the GDS Graph Catalog have corresponding Python methods in the client.
Of those catalog procedures that take a graph name string as input, their Python client equivalents instead take a Graph object, with the exception of gds.graph.exists, which still takes a graph name string.
Below are some examples of how the GDS Graph Catalog can be used via the client, assuming we inspect the graph G from the example above:
# List graphs in the catalog
list_result = gds.graph.list()
# Check for existence of a graph in the catalog
exists_result = gds.graph.exists("offices")
assert exists_result["exists"]
# Stream the node property 'degree'
result = gds.graph.nodeProperty.stream(G, node_property="degree")
# Drop a graph; same as G.drop()
gds.graph.drop(G)
7.2.1. Streaming properties
The client methods
- gds.graph.nodeProperty.stream (previously gds.graph.streamNodeProperty)
- gds.graph.nodeProperties.stream (previously gds.graph.streamNodeProperties)
- gds.graph.relationshipProperty.stream (previously gds.graph.streamRelationshipProperty)
- gds.graph.relationshipProperties.stream (previously gds.graph.streamRelationshipProperties)
are greatly sped up if the Apache Arrow Flight Server of GDS is enabled.
Additionally, setting the client-only optional keyword parameter separate_property_columns=True (it defaults to False) for gds.graph.streamNodeProperties and gds.graph.streamRelationshipProperties returns a pandas DataFrame in which each property requested has its own column.
Note that this is different from the default behavior, in which there is only one column called propertyValue that contains all requested properties interleaved for each node or relationship.
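The difference between the two layouts can be illustrated without a database. In this sketch, plain dict rows stand in for DataFrame rows; the column names nodeId and nodeProperty are assumptions made for the illustration (only propertyValue is named above), and the helper is hypothetical rather than the client's implementation.

```python
from typing import Any, Dict, List


def separate_property_columns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn one row per (node, property) pair into one row per node."""
    wide: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # Create the node's output row on first sight, then add one column per property
        node_row = wide.setdefault(row["nodeId"], {"nodeId": row["nodeId"]})
        node_row[row["nodeProperty"]] = row["propertyValue"]
    return list(wide.values())


# Default layout: all property values interleaved in a single propertyValue column
interleaved = [
    {"nodeId": 0, "nodeProperty": "degree", "propertyValue": 1.0},
    {"nodeId": 0, "nodeProperty": "population", "propertyValue": 360000},
    {"nodeId": 1, "nodeProperty": "degree", "propertyValue": 2.0},
    {"nodeId": 1, "nodeProperty": "population", "propertyValue": 8800000},
]

# Separate-columns layout: each requested property has its own column
assert separate_property_columns(interleaved) == [
    {"nodeId": 0, "degree": 1.0, "population": 360000},
    {"nodeId": 1, "degree": 2.0, "population": 8800000},
]
```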
7.2.2. Including node properties from Neo4j
Node properties such as names and descriptions are useful to understand the output of an algorithm, even if not needed to run the algorithm itself.
To fetch additional node properties directly from the Neo4j database, you can use the db_node_properties client-only parameter of the gds.graph.nodeProperty.stream and gds.graph.nodeProperties.stream methods.
In the following example, the City nodes have both a numeric and a String property.
The stream method retrieves the values of the database-only name property alongside the values of the projected population property.
gds.run_cypher(
    """
    CREATE
      (m: City {name: "Malmö", population: 360000}),
      (l: City {name: "London", population: 8800000}),
      (s: City {name: "San Mateo", population: 105000}),
      (m)-[:FLY_TO]->(l),
      (l)-[:FLY_TO]->(m),
      (l)-[:FLY_TO]->(s),
      (s)-[:FLY_TO]->(l)
    """
)

G, result = gds.graph.project(
    "offices",
    {
        "City": {
            "properties": ["population"]
        }
    },
    "FLY_TO"
)

gds.graph.nodeProperties.stream(G, node_properties=["population"], db_node_properties=["name"])
7.2.3. Streaming topology by relationship type
The type returned from the Python client method corresponding to gds.beta.graph.relationships.stream is called TopologyDataFrame and inherits from the standard pandas DataFrame.
TopologyDataFrame comes with an additional convenience method named by_rel_type, which takes no arguments and returns a dictionary of the form Dict[str, List[List[int]]].
This dictionary maps relationship types as strings to 2 x m matrices, where m is the number of relationships of the given type.
The first row of each such matrix contains the source node ids of the relationships, and the second row contains the corresponding target node ids.
We can illustrate this transformation with an example using our graph G from the construct example above:
topology_by_rel_type = gds.beta.graph.relationships.stream(G).by_rel_type()
assert list(topology_by_rel_type.keys()) == ["REL"]
assert topology_by_rel_type["REL"][0] == [0, 1, 2, 3]
assert topology_by_rel_type["REL"][1] == [1, 2, 3, 0]
Like the methods described in Streaming properties, gds.beta.graph.relationships.stream is also accelerated if the GDS Apache Arrow Flight Server is enabled.
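The grouping that by_rel_type performs can be sketched in plain Python. The rows below stand in for the rows of a TopologyDataFrame; the column names sourceNodeId, targetNodeId and relationshipType are assumed to match those used in the construct example, and the function is an illustration rather than the client's implementation.

```python
from typing import Any, Dict, List


def by_rel_type(rows: List[Dict[str, Any]]) -> Dict[str, List[List[int]]]:
    """Group relationship rows into one 2 x m matrix per relationship type."""
    result: Dict[str, List[List[int]]] = {}
    for row in rows:
        # Row 0 of the matrix collects source ids, row 1 the matching target ids
        matrix = result.setdefault(row["relationshipType"], [[], []])
        matrix[0].append(row["sourceNodeId"])
        matrix[1].append(row["targetNodeId"])
    return result


# The four REL relationships from the construct example
rows = [
    {"sourceNodeId": 0, "targetNodeId": 1, "relationshipType": "REL"},
    {"sourceNodeId": 1, "targetNodeId": 2, "relationshipType": "REL"},
    {"sourceNodeId": 2, "targetNodeId": 3, "relationshipType": "REL"},
    {"sourceNodeId": 3, "targetNodeId": 0, "relationshipType": "REL"},
]
topology = by_rel_type(rows)
assert topology == {"REL": [[0, 1, 2, 3], [1, 2, 3, 0]]}
```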