This repo contains code to generate several different databases containing the same data, for the Cambridge CST IA Databases course.
The data is a subset of the IMDb non-commercial datasets. The list of movies is filtered to only include popular movies, or those with useful properties such as duplicate names. The cast and crew data is then filtered down to only include those people relevant to this movies subset. The aim is to output ~1500 movies, to keep the database accessible even on basic hardware.
Three databases are created; SQLite 3, TinyDB and Neo4j, to demonstrate relational, document and graph databases respectively.
Install Python 3.7 or later, clone this repo and then install the required dependencies:
pip install requests tinydb neo4j
You will also need to install Neo4j, and Java 11 in order to run Neo4j. This code was designed with the Community Server edition in mind, version 4.4 LTS. Examples may not work in the 5.x version of Neo4j.
The main file is make_databases.py
. This script will download the relevant IMDb files, and create the databases.
Run:
python make_databases.py
The SQLite and TinyDB outputs will be created if they do not exist, or emptied and recreated if they do. The script expects a Neo4j database to be already running on localhost
with the default port; credentials should be configured in neo4j/neo4j_credentials.json
in the form {"username": "neo4j", "password": "neo4j"}
. All existing nodes and relations in the database neo4j
will be deleted and the movies data loaded; this is the default and only available database in the community server version.
The script will create output/movies.sqlite
and output/movies.tinydb.json
, as well as loading the data into the neo4j
database in the running Neo4j server.
This script creates the databases themselves. For SQLite and TinyDB, these are conveniently the single-file artefacts needed for someone to create their own version. Some additional artefacts are necessary:
It is useful to have a plain SQL file of CREATE and INSERT statements for use with other relational databases. Once you have run the script, connect to the SQLite database with sqlite3 output/movies.sqlite
and then dump the output to a file:
.output output/movies.sql
.dump
This file is not plain SQL; it will likely contain an SQLite PRAGMA command at the start. In a text editor, remove this line to produce a vendor-agnostic output.
To export the Neo4j database to file, stop the database server and then use the admin command to dump it to a file:
/path/to/neo4j-admin dump --database=neo4j --to=/path/to/output/movies.neo4j.dump
There are tutorials for using the two main databases for the course: relational database tutorial; document database tutorial.