Lab: Graph Databases

In this lab, you will configure a cloud-hosted database, write graph database queries, and explore a large dataset.

Instructions

You should work on this assignment either individually or with a partner.

In this lab, you will explore the Paradise Papers dataset, which contains information about people and organizations involved in offshore investments. The dataset includes information about tens of thousands of people, companies, and other entities, and the various relationships between them, stored in a Neo4j graph database.

Step 1: Background Reading

Start by skimming the Wikipedia article about the Paradise Papers, focusing especially on the introduction and “Background” sections.

Next, skim this blog post about the Neo4j Paradise Papers dataset. Be sure you understand what kinds of information are represented by nodes, and what kinds of information are represented by relationships.
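If the node/relationship distinction feels abstract, the following minimal Python sketch may help. It is only an illustration, not how Neo4j stores data: the Node and Relationship classes and the neighbors helper are hypothetical names, and the relationship type CONNECTED_TO is a placeholder chosen for this example rather than a type confirmed from the dataset. The two node names come from the example used later in this lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # the node's type, e.g. "Officer" or "Entity"
    name: str


@dataclass(frozen=True)
class Relationship:
    start: Node
    rel_type: str  # placeholder type, chosen only for illustration
    end: Node


trustees = Node("Officer", "Trustees of Clark University")
underwriters = Node("Entity", "School, College and University Underwriters, Ltd.")
link = Relationship(trustees, "CONNECTED_TO", underwriters)


def neighbors(node, relationships):
    """Return the nodes directly connected to `node`, ignoring direction."""
    found = []
    for rel in relationships:
        if rel.start == node:
            found.append(rel.end)
        elif rel.end == node:
            found.append(rel.start)
    return found


print([n.name for n in neighbors(trustees, [link])])
```

Nodes carry the "things" (people, companies, addresses), while relationships carry the connections between them.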

Step 2: Set Up Paradise Papers Database

We will host the Paradise Papers database in the cloud.

  1. Create a free account on sandbox.neo4j.com
  2. Create a new project, selecting “Paradise Papers by ICIJ” under “Pre-Built Data”
  3. After the database starts, open the “Connection details” tab, where you should see a “Username,” “Password,” and “IP Address”

Step 3: Set Up Paradise Papers Search App

You will search the Paradise Papers dataset using a Django app you will set up in this step.

Follow the instructions in the paradise-papers-django repo’s README to set up the application using Docker.

Note: If you are curious, here is some additional information about using Neo4j with Django.

Step 4: Search for Nodes

Nodes in the Paradise Papers dataset can represent people, companies, addresses, and law firms. Load the app’s search interface, which you will use to search for nodes.

Change the “By Country” selector to “United States” and click the magnifying glass to search. You should see thousands of entities, officers, etc.

You can also search by name. For the remainder of this lab, you will research a person, organization, etc. of your choice. You could choose from Wikipedia’s “List of people and organisations named in the Paradise Papers.” I will search for “Clark University” in the following examples, but you should choose your own topic to research.

If you click on a result, you will see information about the node, and the other nodes it is connected to. For example, my search for “Clark University” revealed the “Trustees of Clark University” under the “Officer” tab. Clicking on “Trustees of Clark University” shows:

[Image: Clark University]

Clicking on the name of a connected node (e.g., “School, College and University Underwriters, Ltd.”) will show you information about it.

Hint: When searching for a person, you should format their name as last - first. For example, you can locate “Rex Tillerson” by searching for tillerson - rex.
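As a small illustration of the hint, here is a sketch of a helper that rewrites a name into the last - first form. The function name to_search_name is hypothetical, and the sketch assumes the last whitespace-separated word is the surname, which will not hold for every name. Lowercasing simply mirrors the example in the hint.

```python
def to_search_name(full_name: str) -> str:
    """Rewrite "First Last" as "last - first" for the search box.

    Assumption: the final word is the surname. Single-word names
    are returned unchanged apart from lowercasing.
    """
    parts = full_name.strip().split()
    if len(parts) < 2:
        return full_name.strip().lower()
    first, last = " ".join(parts[:-1]), parts[-1]
    return f"{last} - {first}".lower()


print(to_search_name("Rex Tillerson"))  # tillerson - rex
```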

Step 5: Explore the Graph

Graph databases can be queried using the Cypher query language. Learning Cypher is outside the scope of this lab. However, you will practice adapting a simple Cypher query.

Open the data browser in the Neo4j sandbox:

[Image: Open Browser]

Next, connect to the database using the username and password from the “Connection details” tab.

At the top of the screen that loads next, you will see a neo4j$ shell prompt. Paste the following query, and click the ▶️ symbol to run it.

MATCH (n:Officer)
WHERE n.name = "Trustees of Clark University"
RETURN n

The results should show a single node. Click the node once, then click the bottom symbol to show the related nodes.

[Images: Expand child relationships; Child relationships]

You can also reveal the nodes related to each of the related nodes.
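Clicking to expand nodes can also be approximated with a query. The sketch below builds a Cypher query that uses a variable-length pattern to return every path of up to a given number of hops from the example node. The function name related_nodes_query and the upper limit of 3 hops are choices made for this illustration (larger values can return very large graphs), not something prescribed by Neo4j.

```python
def related_nodes_query(depth: int = 1) -> str:
    """Build a Cypher query for paths of 1..depth hops from the example Officer.

    The hop bound is written directly into the pattern text, so it is
    validated first. The cap of 3 is an arbitrary limit for this sketch.
    """
    if type(depth) is not int or not 1 <= depth <= 3:
        raise ValueError("depth must be an integer from 1 to 3")
    return (
        f"MATCH path = (n:Officer)-[*1..{depth}]-(m)\n"
        'WHERE n.name = "Trustees of Clark University"\n'
        "RETURN path"
    )


print(related_nodes_query(2))
```

With depth 1 the query corresponds to expanding the node once; with depth 2 it also includes the nodes related to each of those related nodes.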

Adapt the query to search for the person, organization, etc. you are researching. In the pattern (n:Officer), n is a variable name and Officer is the label that tells the query which type of node to match, so if you are researching an Entity, Intermediary, Address, or Other node, you need to change the label accordingly.
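One way to see which parts of the query change is to generate it. The sketch below is a hypothetical helper (find_node_query is not part of the lab's app) that fills in the node type and name and returns query text you can paste at the neo4j$ prompt. The set of labels is taken from the node types listed above. The helper escapes backslashes and double quotes in the name so the string literal stays valid.

```python
NODE_LABELS = {"Officer", "Entity", "Intermediary", "Address", "Other"}


def find_node_query(label: str, name: str) -> str:
    """Return a Cypher query that matches one node type by exact name."""
    if label not in NODE_LABELS:
        raise ValueError(f"unknown node label: {label!r}")
    # Escape characters that would otherwise end or corrupt the string literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'MATCH (n:{label})\nWHERE n.name = "{escaped}"\nRETURN n'


print(find_node_query("Officer", "Trustees of Clark University"))
```

Calling it with "Officer" and "Trustees of Clark University" reproduces the example query; swapping in "Entity" and a company name gives the adapted version.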

Record the query you used for your research, and take screenshots of the graphs you explored.

Step 6: Research

Per the disclaimer on the International Consortium of Investigative Journalists' website:

There are legitimate uses for offshore companies and trusts. We do not intend to suggest or imply that any people, companies or other entities included in the ICIJ Offshore Leaks Database have broken the law or otherwise acted improperly. Many people and entities have the same or similar names. We suggest you confirm the identities of any individuals or entities located in the database based on addresses or other identifiable information. If you find an error in the database please get in touch with us.

A connection to the Paradise Papers does not necessarily mean that a person or organization was involved in anything nefarious. Perform internet searches to get a more complete picture of the node you have been investigating.

For example, these news articles discuss the “School, College and University Underwriters, Ltd.” entity that Clark is connected to:

Submit

  1. The name of the node you researched (e.g., “Clark University”)
  2. The Cypher query you used to locate this node in the graph
  3. Screenshots of the related nodes
  4. A paragraph summarizing the results of your research
    • For full credit, include at least two links to sources of relevant information

This lab will be graded as part of your assignment grade.

Further Reading

A relational database could have supported the queries we performed in this lab. However, certain types of queries are much easier to write using Cypher than SQL. For example, consider this Cypher query from another Neo4j blog post, which finds the shortest path from the Queen of England to Rex Tillerson, former U.S. secretary of state:

MATCH p=shortestPath((rex:Officer)-[*]-(queen:Officer))
WHERE rex.name = "Tillerson - Rex" AND queen.name = "The Duchy of Lancaster"
RETURN p

It is possible to perform graph operations using SQL in PostgreSQL, typically with recursive common table expressions, but the queries are more difficult to write. There is ongoing work to make graph queries easier to perform in PostgreSQL.
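To get a feel for the difference, here is a rough sketch of a shortest-path search written in SQL with a recursive common table expression. It runs against SQLite through Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the SQL would need some adjustment there (for example, the instr function and the :name parameter style). The edges table, the single-letter node names, and the shortest_path function are all invented for this example, and the sketch assumes node names never contain a "/" character.

```python
import sqlite3

# Toy undirected graph: a-b-c-d-e is a long route, a-f-e is a short one.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "f"), ("f", "e")]

SHORTEST_PATH_SQL = """
WITH RECURSIVE
  links(a, b) AS (                 -- make every edge usable in both directions
    SELECT src, dst FROM edges
    UNION
    SELECT dst, src FROM edges
  ),
  walk(node, path, hops) AS (
    SELECT :start, '/' || :start || '/', 0
    UNION ALL
    SELECT links.b, walk.path || links.b || '/', walk.hops + 1
    FROM walk JOIN links ON links.a = walk.node
    WHERE instr(walk.path, '/' || links.b || '/') = 0   -- never revisit a node
  )
SELECT path, hops FROM walk
WHERE node = :goal
ORDER BY hops
LIMIT 1
"""


def shortest_path(edges, start, goal):
    """Return the node names on one shortest path, or None if unreachable."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
        conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
        row = conn.execute(
            SHORTEST_PATH_SQL, {"start": start, "goal": goal}
        ).fetchone()
    finally:
        conn.close()
    return None if row is None else row[0].strip("/").split("/")


print(shortest_path(EDGES, "a", "e"))  # ['a', 'f', 'e']
```

The SQL version has to track visited nodes and hop counts by hand and enumerates every simple path before picking the shortest, whereas the Cypher query above expresses the same idea in a single shortestPath pattern.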