Setting Up Amazon Neptune Graph Database

Clifford E. D'Souza
12 min read · Oct 16, 2022


What Is Amazon Neptune? — Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Using Amazon Neptune, you can execute graph queries that efficiently navigate those datasets. The Getting Started with Amazon Neptune | AWS Learning Path is a great resource for learning more about Amazon Neptune.

[Image: Abstract graph model artwork]

In this post, I will explain what I did to provision a two-node Amazon Neptune database cluster with Terraform while using the Free Cloud Computing Services — AWS Free Tier. Terraform enables provisioning infrastructure resources in the cloud (in this case, AWS), and this infrastructure can then be used for learning purposes. If you would like to know more about Terraform, you can check out this 8-part mini tutorial: AWS | Terraform | HashiCorp Developer.

I will also show you how to bulk-load data from an AWS S3 bucket into Amazon Neptune. Finally, we will log in to an EC2 compute instance and run some Gremlin queries using the Gremlin Console. Once done, we will destroy all the provisioned infrastructure to prevent any further cloud usage charges. With the AWS Free Tier and the right resource configurations, there shouldn't be any costs anyway.

PRACTICAL GREMLIN: An Apache TinkerPop Tutorial is a great resource to get started with the Gremlin query language.

By the end of this article, you will be familiar not only with provisioning Amazon Neptune from scratch but also with querying a knowledge graph.

Knowledge Graphs

Knowledge graphs are a specific type of graph with an emphasis on contextual understanding. Knowledge graphs are interlinked sets of facts that describe real-world entities, events, or things and their interrelations in a human- and machine-understandable format.

Here is the graph model we will be using. It is based on the domain described in Build a graph application with Amazon Neptune and AWS Amplify | AWS Database Blog.

[Image: Knowledge graph representing a simplified domain of the academic world]

In this domain context of academia, a Person knows other Persons. Research papers are authored by a Person. There is a relationship between a Person and a Product that the person uses. The Product, in turn, is made by an Institution, which is another relationship. A Person is affiliated with an Institution and belongs to a Conference. Note that a graph consists of these concepts (a short Gremlin sketch follows the list):

  • Vertices (nodes) represent the noun terms such as Person or Conference.
  • Edges represent the relationships between the vertices (nodes). A Person that knows another Person is represented with an edge labeled knows.
  • Properties are associated with both the vertices (nodes) and edges.
    For example, a Product vertex could be associated with a property — such as a name.
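
To make these concepts concrete, here is a purely illustrative Gremlin snippet that adds two vertices, an edge between them, and a name property on each vertex. The labels and the name property come from the domain above; the specific values Doctor9 and Inst9 are made up and are not part of the sample data we load later.

gremlin> g.addV('person').property('name', 'Doctor9').as('p').addV('institution').property('name', 'Inst9').as('i').addE('affiliated_with').from('p').to('i').iterate()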

Configuring AWS Environment

This section assumes that we have already installed Terraform as well as the AWS CLI. The latter can be installed by following the instructions in Installing or updating the latest version of the AWS CLI — AWS Command Line Interface.

We will use aws configure (see Configuration and credential file settings — AWS Command Line Interface) to set up the terminal for calling the AWS APIs. Run the aws configure command in the terminal and supply your values at the prompts:

$> aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

If you don't have an access key ID and secret access key, see Create an AWS access key from your AWS Free Tier account. Substitute your own Access Key ID and Secret Access Key in place of those shown here. You can also change the default region name to the one that you plan to use.
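
To confirm that the CLI is picking up your credentials, you can run a quick identity check. This only reads your account identity and makes no changes:

$> aws sts get-caller-identity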

Provisioning Amazon Neptune Database

We are going to use Git to clone some Terraform scripts. These are scripts I created to provision a Neptune cluster and a stand-alone EC2 compute instance. If you need to install Git, refer to Git — Installing Git.

Let us begin by opening a terminal on the local machine. From the terminal, create a directory and change to that directory. Then run the git clone command as shown below:

$> mkdir neptune_dev
$> cd neptune_dev
$> git clone https://github.com/decliffy/amazon-neptune-database.git
$> cd amazon-neptune-database/

The datasets folder contains the vertex and edge CSV files with sample data. The data from these files will be bulk loaded using the Amazon Neptune database bulk loader API endpoint. Take a look at Using the Amazon Neptune Bulk Loader to Ingest Data — Amazon Neptune for details related to the bulk loader endpoint.
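
For reference, Amazon Neptune's Gremlin CSV load format uses reserved column headers such as ~id and ~label for vertices, and ~id, ~from, ~to and ~label for edges, with typed property columns like name:String or weight:Double. The rows below are only an illustrative sketch of that format; the actual columns and values in the repository's vertex.csv and edge.csv may differ.

# vertex.csv (illustrative rows only)
~id,~label,name:String
Doctor1,person,Dr. Example
Inst1,institution,Example Institute

# edge.csv (illustrative rows only)
~id,~from,~to,~label,weight:Double
e1,Doctor1,Inst1,affiliated_with,1.0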

The terraform-aws-instance and terraform-aws-neptune folders contain the HCL scripts to provision an EC2 instance and Amazon Neptune database cluster respectively. These scripts make the concept of Infrastructure as Code come alive.

To set up the Neptune database cluster, we need to run the following commands in the terminal:

$> cd terraform-aws-neptune
$> terraform init

In the terminal, you will see an output similar to this one below:

[Image: Terminal output showing successful initialization after running terraform init]

We are now all set to provision an Amazon Neptune database cluster. Run the following code in the terminal:

$> terraform apply -auto-approve

At this point, Terraform starts to provision the Amazon Neptune cluster. It takes around 15 to 20 minutes to complete this step and get a database cluster up and running. As the HCL scripts execute and complete, you will see the status in the console, similar to the screenshot below.

[Image: Output upon completion of the terraform apply command]

Take note of the value of neptune_endpoint_dns. We will need this value to complete a configuration step later on.
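
If you need the endpoint again later, you don't have to scroll back through the apply logs; Terraform can print it from the saved state (assuming the output is named neptune_endpoint_dns, as in these scripts):

$> terraform output -raw neptune_endpoint_dns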

Log in to the AWS console and search for Neptune in the services box at the top left of the console. Then select Databases from the left-hand side panel. You should see a view similar to the one below:

[Image: AWS console view of the provisioned Amazon Neptune database cluster]

We can see that there are two nodes provisioned on db.t3.medium database instances, with the writer node running in us-west-2a and the reader node in us-west-2b. Hence the cluster is set up across multiple availability zones. While there can be only one writer node present in a cluster at any time, there can be up to 15 read-replica DB instances, providing high availability and read scalability.
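
You can also verify the cluster from the terminal instead of the console. Here is a hedged sketch using the AWS CLI; the --query expression simply trims the response down to a few fields:

$> aws neptune describe-db-clusters \
     --query 'DBClusters[*].[DBClusterIdentifier,Status,Endpoint,ReaderEndpoint]' \
     --output table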

Provisioning EC2 Compute Instance

Let us first generate the keys in order to SSH to the EC2 compute instance that we will be provisioning in this section. Here is how to do it:

$> cd ..
$> mkdir my-key
$> cd my-key
$> ssh-keygen -f my-ec2-ssh-key

When prompted for a passphrase, hit Enter without supplying any input. In the same folder, you will find two newly generated files. The file with the .pub extension is your public key; the other one is the private key. We will come back to this private key file shortly. Run the following command (on Windows, use clip < my-ec2-ssh-key.pub instead). This copies the contents of the public key file to the clipboard:

$> cat my-ec2-ssh-key.pub | pbcopy

In the ec2-instance.tf Terraform script file, paste the public key into the value of the public_key argument:

resource "aws_key_pair" "deployer" {
key_name = "ec2-neptune-keypair-ssh"
public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDB1AYS1OdGeyn15zEihNfztOokDMUTA1nBc+aTyIHRMdcq71VZr5xlCOmO1sORH4c4SSwBkxOtoG7AF0JwQNtB891JqVHB7zZV7HE2ZjhojeoKugLgVlJJMxdTMa2ZKanEPKTTYcXvGncMaUxwXi4cWR9v+Rbsg0YtsORHbjCluKxn7xvnqSaIbFAvAwVPxh0TBgR1GVbFQ0SLIvnB32rm7jJkeFvMqqEX1RYXd4mLvO1c3JxNFsKRzSq1p9ztdNbJgEzrYJDWtYRRBFAo/4mJPKQ/aaVUq7zrVRKSEuBqWqkVf+axhc2qvilPJJBGJ3g+aBpPb4F1Ak94jZ3VAcOOzWM0M+ZRZVwKPiepoP3ANRtQXt9IG9q4s9Y1eJmwKhsEW4lUdVZeKAowoBWVcTLEWcQ418cX7DQfmWQGgn77CIdPwnpFDjJL9h8Au5qNk7bj4gnR8C7+nLdeAVJuV3g5RCKW/wa5jIPMhLeIWdxQlrymc/TgNIJbcG+aokXWmCM= clifforddsouza@Cliffords-iMac.local"
}

Additionally, change the ingress rule CIDR block to match your internet provider's public IP address. This adds a layer of security when connecting remotely. If you run an internet speed test tool such as SPEEDTEST, you will see the public IP address that you can use for the CIDR block (a terminal alternative is shown just after the snippet). Substitute that IP address in the cidr_blocks value.

# Incoming traffic
ingress {
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = ["110.226.x.x/32"]
}
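
If you prefer the terminal to a speed test site, AWS exposes a simple endpoint that echoes your public IP address back to you (the address shown is just a placeholder):

$> curl https://checkip.amazonaws.com
110.226.x.x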

Change to the EC2 Terraform scripts directory and run terraform init and apply once again:

$> cd ../terraform-aws-instance/
$> terraform init
$> terraform apply -auto-approve

Once the script has successfully completed, you will see the public IP address of the provisioned EC2 instance in the console. We are all set to connect remotely to the provisioned EC2 instance now.

[Image: Output upon completion of the terraform apply command]

Note the value of instance_public_ip under Outputs. We will use this value when remotely connecting to the EC2 compute instance from the desktop terminal.

Log in to the AWS console and search for EC2 in the services box at the top left of the console. Then click on Instances from the left-hand side panel. You should see a view similar to the one below:

[Image: AWS console view of the provisioned EC2 compute instance]

Uploading the Vertex and Edge files to AWS S3

We can upload the sample vertex and edge CSV files from the datasets directory to AWS S3. I ran these AWS CLI commands to do so:

$> cd ../datasets
$> aws s3 cp vertex.csv s3://neptune-bulkloader-bucket/vertex.csv
$> aws s3 cp edge.csv s3://neptune-bulkloader-bucket/edge.csv
$> aws s3 ls s3://neptune-bulkloader-bucket --recursive --human-readable --summarize

The last command lists the objects in the bucket. Here you can verify that the uploaded files are indeed present.

We will now use the Amazon Neptune bulk loader (see Using the Amazon Neptune Bulk Loader to Ingest Data — Amazon Neptune) to load the vertex and edge data into the Amazon Neptune database with the following curl commands. Use the neptune_endpoint_dns value we obtained earlier and the AWS region specific to your cluster in these curl commands.

Load the vertices:

$> curl -X POST \
     -H 'Content-Type: application/json' \
     https://neptune-cluster.cluster-cgjkeyxxxxxx.us-west-2.neptune.amazonaws.com:8182/loader -d '
{
  "source" : "s3://neptune-bulkloader-bucket/vertex.csv",
  "format" : "csv",
  "iamRoleArn" : "arn:aws:iam::237414921190:role/neptunes3role",
  "region" : "us-west-2",
  "failOnError" : "FALSE",
  "parallelism" : "MEDIUM",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "TRUE"
}'

Load the edges:

$> curl -X POST \
     -H 'Content-Type: application/json' \
     https://neptune-cluster.cluster-cgjkeyxxxxxx.us-west-2.neptune.amazonaws.com:8182/loader -d '
{
  "source" : "s3://neptune-bulkloader-bucket/edge.csv",
  "format" : "csv",
  "iamRoleArn" : "arn:aws:iam::237414921190:role/neptunes3role",
  "region" : "us-west-2",
  "failOnError" : "FALSE",
  "parallelism" : "MEDIUM",
  "updateSingleCardinalityProperties" : "FALSE",
  "queueRequest" : "TRUE"
}'

Upon successful completion of these curl commands, you should see a status “200 OK” similar to what is seen in the console below:

[Image: curl command used to invoke the Amazon Neptune database bulk loader API endpoint]
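
Each successful loader request also returns a JSON payload containing a loadId. You can poll the loader status endpoint with that id to confirm the job reached the LOAD_COMPLETED status. The sketch below uses a placeholder loadId; substitute the value from your own response:

$> curl -G 'https://neptune-cluster.cluster-cgjkeyxxxxxx.us-west-2.neptune.amazonaws.com:8182/loader/<your-load-id>' \
     --data-urlencode 'details=true'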

Connect to EC2 Compute Remote Instance with SSH

Let us connect to the remote EC2 compute instance. Change to the directory where your private key is stored. Always keep this key in a secure location with the appropriate file permissions, and use it when connecting to the EC2 compute instance. Here are the commands I ran for a new SSH session:

$> chmod 400 my-ec2-ssh-key
$> ssh -i my-ec2-ssh-key ec2-user@35.93.85.60

You can also obtain the public IP address of the EC2 compute instance after logging in to the AWS console and opening the EC2 service. See Amazon Web Services Sign-In. Once connected, you will be greeted with the text 'Amazon Linux 2 AMI' as in the screenshot below:

[Image: Screen showing a successful login to a remote EC2 compute instance]

That’s great. We are in!

Configuring the Gremlin Console

We will be using the Gremlin Console to connect to the Amazon Neptune database from the EC2 compute instance we just provisioned. Both are provisioned within the same VPC. Before we start the Gremlin Console, we need to specify the Amazon Neptune cluster endpoint URL in a configuration file that the Gremlin Console will use. To do this, run these commands from the EC2 SSH session:

$> cd /usr/local/share/applications/apache-tinkerpop-gremlin-console-3.5.2/conf
$> sudo su
$> sed -e "1 s/your-neptune-endpoint/neptune-cluster.cluster-cgjkeyxxxxxx.us-west-2.neptune.amazonaws.com/" neptune-remote.yaml > neptune-remote-cfg.yaml
$> exit
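
Before starting the console, it doesn't hurt to confirm the substitution worked. Since the sed expression replaces the your-neptune-endpoint placeholder on the first line of the file, that line should now contain your cluster DNS name:

$> head -1 neptune-remote-cfg.yaml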

Now that these commands have been executed, we can start the Gremlin console with the commands below:

$> cd ..
$> bin/gremlin.sh

In the console, these are the commands I typed to connect to the Amazon Neptune database cluster I provisioned with Terraform and to query the counts of the loaded vertices and edges.

gremlin> :remote connect tinkerpop.server conf/neptune-remote-cfg.yaml
gremlin> :remote console
gremlin> g.V().groupCount().by(label);
gremlin> g.E().groupCount().by(label);

A screenshot of these commands along with their outputs is shown below:

[Image: Gremlin console configuration and query execution]
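
Before moving on to domain questions, it can also help to inspect a single vertex to see which properties the sample data actually carries (the property names returned depend on the CSV files in the repository):

gremlin> g.V().hasLabel('person').limit(1).elementMap()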

Gremlin Queries

Now that we have connected to the Amazon Neptune database cluster we provisioned, let us run a few queries to learn more about our domain. We will ask a question and get the answer through a Gremlin query. Please refer to the knowledge graph toward the beginning of the article for the domain context.

Which Institutions have affiliated People that use Products made by the Institutions?

gremlin> g.E().hasLabel('made_by').outV().inE().hasLabel('usage').outV().outE().hasLabel('affiliated_with').inV().id().dedup().fold();
==>[Inst1, Inst3, Inst2]

We know that there are five institution vertices in the graph. The above result suggests we need to probe why the other institutions don’t appear on the list.
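
A quick way to see which ones are missing is to list every institution id and compare it with the result above:

gremlin> g.V().hasLabel('institution').id().fold()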

Which Persons, affiliated with Institutions and having published Papers, belong to Conferences?

gremlin> g.V().hasLabel('person').outE().hasLabel('affiliated_with').outV().inE().hasLabel('authored_by').inV().outE().hasLabel('belong_to').outV().id().dedup().fold();
==>[Doctor1, Doctor2, Doctor3, Doctor5, Doctor6]

Looks like Doctor4 is the odd one out.

Which are the Institutions whose Products are used at least 60% of the time?

gremlin> g.V().hasLabel('institution').as('inst').inE().hasLabel('made_by').outV().inE().hasLabel('usage').filter(properties("weight").value().is(gte(0.6))).select('inst').dedup().id().fold();
==>[Inst4, Inst5]

Here we can see that Inst4 and Inst5 make the most popular Products.
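
To sanity-check the 60% threshold, you can also look at the spread of the usage weights directly (this assumes, as the query above already does, that usage edges carry a numeric weight property):

gremlin> g.E().hasLabel('usage').values('weight').min()
gremlin> g.E().hasLabel('usage').values('weight').max()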

We can see all the queries and the search results in the screenshot:

[Image: Gremlin console showing the Gremlin queries and responses]

In order to exit from the Gremlin console, run the following command:

gremlin> :exit

You will be returned to the EC2 compute instance command prompt. Then type this command to log out of the SSH session and return to the local machine's terminal prompt:

$> logout
Connection to 35.86.229.83 closed.

Infrastructure Destroy and Clean-up

When it is time to end the session, we need to run a few commands. I try to practice Gremlin queries whenever I get a block of free time, and a day or two may pass between sessions depending on how busy my schedule is. I want to be able to take down the infrastructure and bring it back later when I am ready for another learning session. Here is how I take down the infrastructure and avoid unnecessary costs.

From the cloned repository directory, run these commands:

$> cd terraform-aws-neptune
$> terraform destroy -auto-approve

These commands will take down the Amazon Neptune database cluster that we had provisioned earlier. Likewise, we can run the commands below to take down the EC2 compute instance.

$> cd ../terraform-aws-instance
$> terraform destroy -auto-approve

[Image: Console logs confirming the Amazon Neptune database cluster has been removed]

You can also check for any costs that may have been incurred by running the AWS CLI command below:

$> aws ce get-cost-and-usage --time-period Start=2022-10-15,End=2022-10-16 --granularity=DAILY --metrics BlendedCost

Here is the console output from running the command above:

[Image: Using the AWS CLI to fetch the cost and usage for a date range]

Since we are using the AWS Free Tier, there is no cost. Please note that AWS allows 12 months of free usage from the date the Free Tier account was created.

Conclusion

Using Terraform scripts saves time standing up the infrastructure required to learn Gremlin queries with the Amazon Neptune database, and it can be done in a quick, repeatable, and consistent manner.

I hope this article has given you a feel for the power of core cloud tools such as Terraform and the AWS CLI, in addition to the Amazon Neptune database.

If you want to try out more things, here are a few possibilities:

  • Enable IAM database authentication for the Amazon Neptune database in the Terraform script and use Signing with AWS Signature Version 4.
  • Create your own vertex and edge files from open data sets such as from Amazon Open Data. Load these into Amazon Neptune database and think of and write gremlin queries to gain interesting insights from the resulting graph.
  • Use an open-source conversational AI framework such as Rasa (rasa.com) to accept plain-text queries from users, call the corresponding mapped Gremlin query, and show the results back to the user.


I do hope you liked reading through this article. Give it a clap if you feel this is the case. Thank you!
