Datasets
To support getting up and running quickly and conveniently with graphdatascience
, the library comes with functionality for very easily projecting some interesting datasets into server-side GDS.
Some of these are smaller and built-in to the library, and others are larger and hosted elsewhere and are automatically fetched under the hood upon loading.
1. Built-in datasets
For convenience, the library is shipped with a few smaller built-in datasets that are very fast to load. These are easily imported to GDS to get a graph object representing the dataset.
These datasets comes with a loader method that takes two optional parameters:
graph_name
which assigns a graph name,
undirected
which takes a boolean and will load the graph as undirected if set to true.
If a graph is loaded as undirected = True
, then it will have twice the number of relationships compared to its directed version.
The default value for undirected
varies for each dataset.
For example:
G = gds.graph.load_cora()
assert G.node_count() == 2_708
assert G.node_labels() == ["Paper"]
1.1. Cora
A well known citation network introduced by Automating the Construction of Internet Portals with Machine Learning and used in many node classification or link prediction publications.
The default is to load Cora as undirected = False
.
Name | Value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1.2. Karate club
A well known social network introduced by Zachary.
The default is to load Karate club as undirected = False
.
Name | Value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1.3. IMDB
A heterogeneous graph that is used to benchmark node classification or link prediction models such as Heterogeneous Graph Attention Network, MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding and Graph Transformer Networks. The graph contains Actors, Directors, Movies (and UnclassifiedMovies) as nodes, and relationships between actors and movies that they acted in, and between directors and movies which they directed for.
The default is to load IMDB dataset as undirected = True
. If loaded as directed, it will have half the relationships.
Name | Value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1.4. LastFM
A heterogeneous graph that is used to benchmark link prediction models, for example used by MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding. The data and licensing is from HetRec 2011 The original raw data is from LastFM.
The graph contains User and Artists as nodes, each with a rawId
field that corresponds to the ids from the raw data.
The rawId
could be used, for example, to check the artist names by referring back to the HetRec'11 .dat files.
There are three types of relationships:
-
IS_FRIEND
exists betweenUser
nodes, representing user-to-user friendships. -
LISTEN_TO
exists betweenUser
andArtist
nodes, indicating the artists a user listens to. -
weight
property represents the number of times the user listened to the artist. -
TAGGED
exists betweenUser
andArtist
nodes, indicating artists that a user has tagged. -
tagID
property represents a genre -
day
,month
, andyear
properties represents the time when tagging was created -
timestamp
represents the same date as milliseconds since EPOCH
Note that a user can tag the same artist multiple times with different tagID
.
The default is to load LastFM dataset as undirected = True
.
If loaded as directed, it will have half the relationships.
Name | Value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2. Datasets from the Open Graph Benchmark
There are convenience methods for loading Open Graph Benchmark datasets in the library.
In order to use these methods, one has to install OGB support for the graphdatascience
library:
pip install graphdatascience[ogb]
The two methods that expose the OGB dataset loading functionality are:
-
gds.graph.obgn.load
for loading node property prediction datasets, and -
gds.graph.obgl.load
for loading link property prediction datasets.
These methods both return a Graph
object, and take four arguments:
Name | Type | |
---|---|---|
|
|
The name of the ODB dataset |
|
|
An optional path for where the downloaded OGB dataset will be stored |
|
|
An optional name of the created graph. Defaults to |
|
|
An optional number of threads to use |
The OGB loading methods uses the OGB library under the hood.
As such, when loading a new OGB dataset, the dataset will be downloaded from the internet, hence an internet connection is required.
If the dataset is already found in the |
All OGB datasets are loaded into GDS with orientation |
If the targeted GDS server has an enterprise license and the Arrow Flight Server enabled, the performance of the OGB dataset loading will be very good. If not, the performance will be a lot worse. The loading will take several magnitudes longer time, and require substantially more memory. Indeed, one may need to increase the maximum heap size of the Neo4j database which hosts GDS in order to have enough memory for the loading. |
At this time, the loading methods do not support loading of relationship features that are in the OGB datasets. Whenever an dataset that includes relationship features is loaded, those features will not be included in the projection, and a warning till be issued. |
Depending on the dataset and task, the graphs projected into server-side GDS will look a bit different.
2.1. OGBN graphs
Datasets used for node property prediction.
2.1.1. Homogeneous
These graphs will, when projected into server-side GDS, have:
-
Up to three disjoint node labels representing the dataset split: "Train", "Valid" and "Test"
-
One relationship type "R" for all relationships
-
A "classLabel" property on all nodes, which is represented by a single number
-
If provided by the dataset, a node property "features", which is represented by an array of numbers for all nodes
Let’s see an example of loading and inspecting one of these datasets:
G = gds.graph.ogbn.load("ogbn-arxiv")
assert G.name() == "ogbn-arxiv"
assert G.node_count() == 169_343
assert G.node_labels() == ["Train", "Valid", "Test"]
assert G.node_properties()["Train"] == ["features", "classLabel"]
assert G.relationship_count() == 1_166_243
assert G.relationship_types() == ["R"]
assert G.relationship_properties()["R"] == []
2.1.2. Heterogeneous
These graphs are heterogenous, so by definition will have multiple node labels and relationship types. These labels and types will be named in the graph projection according to the their names in the original dataset. In addition, the projected graph will have:
-
Up to three disjoint node labels representing the dataset split: "Train", "Valid" and "Test". This implies that nodes might have multiple labels
-
A "classLabel" property on the nodes targeted for prediction, which is represented by a single number
-
If provided by the dataset, a node property "features", which is represented by an array of numbers for some or all of the nodes
Let’s see an example of loading and inspecting one of these datasets:
G = gds.graph.ogbn.load("ogbn-mag")
assert G.name() == "ogbn-mag"
assert G.node_count() == 1_939_743
assert set(G.node_labels()) == {
"Train",
"Test",
"Valid",
"institution",
"field_of_study",
"paper",
"author",
}
assert G.node_properties()["paper"] == ["features", "classLabel"]
assert G.node_properties()["institution"] == []
assert G.relationship_count() == 21_111_007
assert G.relationship_types() == ["cites", "writes", "affiliated_with", "has_topic"]
2.2. OGBL graphs
Datasets used for link property prediction.
2.2.1. Homogeneous
These graphs are used for link prediction. When projected into server-side GDS, they will have have:
-
One node label "N" for all nodes
-
Up to six disjoint relationship types representing the dataset split: "TRAIN_POS", "TRAIN_NEG", "VALID_POS", "VALID_NEG", "TEST_POS", "TEST_NEG"
-
If provided by the dataset, a node property "features" which is represented by an array of numbers for all nodes
Let’s see an example of loading and inspecting one of these datasets:
G = gds.graph.ogbl.load("ogbl-ddi")
assert G.name() == "ogbl-ddi"
assert G.node_count() == 4_267
assert G.node_labels() == ["N"]
assert G.node_properties()["N"] == []
assert G.relationship_count() == 1_334_889 + 197_481 # Positive + negative counts
assert G.relationship_types() == ["TRAIN_POS", "VALID_POS", "VALID_NEG", "TEST_POS", "TEST_NEG"]
ogbl-wikikg2 is a homogenous dataset according to the OGB API, but it has multiple relationship types. Hence we load it in the same way as the heterogeneous ones. For example we suffix the relationships with the dataset split type (see below). Please also note that since the number of negative relationships in this particular dataset is so large only positive relationships will be loaded. |
2.2.2. Heterogeneous
These are heterogenous graphs used for knowledge graph completion, so by definition will have multiple node labels and relationship types. The node labels will be named in the graph projection according to the their names in the original dataset. Complementing the naming in the original dataset, the relationship types names will be suffixed with an underscore followed by the kind of set they appear in in the dataset split ("TRAIN", "VALID" or "TEST").
In addition, the projected graph will have:
-
A "classLabel" property on all relationships, which is represented by a single non-negative integer. This integer maps one-to-one with the relationship types of the original dataset
-
If provided by the dataset, a node property "features" which is represented by an array of numbers for some or all of the nodes
Let’s see an example of loading and inspecting one of these datasets:
G = gds.graph.ogbl.load("ogbl-biokg")
assert G.name() == "ogbl-biokg"
assert G.node_count() == 93_773
assert G.node_labels() == ["disease", "protein", "drug", "sideeffect", "function"]
assert G.node_properties()["protein"] == []
assert G.relationship_count() == 5_088_434
# For each of the train, valid and test sets: number of rel types
assert len(G.relationship_types()) == 51 * 3
assert G.relationship_properties()["drug-drug_polycystic_ovary_syndrome_TRAIN"] == ["classLabel"]
2.3. Limitations on community edition
For users of GDS community edition, performance can be impacted for large graphs.
It is possible that socket connection with the database times out.
If this happens, a possible workaround is to modify the server configuration server.bolt.connection_keep_alive
or server.bolt.connection_keep_alive_probes
.
However, be aware of the side effects such as a genuine connection issue now taking longer to be detected.