Course page: https://skoda.projekty.ms.mff.cuni.cz/ndbi046/
## Task 1: Data Cubes

### Requirements

- Python 3.10+
### Installation

- Clone this repository.
- Optionally, create a virtual environment (`python -m virtualenv venv`, `source venv/bin/activate`).
- Install the required libraries: `pip install -r requirements.txt`.
- To generate the data cubes, run the respective script in the `cubes` directory:
  - `python cubes/care_providers.py` (output in `out/care_providers.ttl`)
  - `python cubes/population.py` (output in `out/population.ttl`)
- Check the integrity constraints using `python queries.py`.
### Care Providers Data Cube

- Script located in `cubes/care_providers.py`.
- Import the `get_cube` function to use the cube elsewhere (see the sketch below this list).
- If run as the main script, the cube is generated as an RDF Turtle file (`out/care_providers.ttl`).
- Uses the Národní registr poskytovatelů zdravotních služeb (National Registry of Health Service Providers) dataset.
- Dimensions:
  - county
  - region
  - field of care
- Measures:
  - number of care providers
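For example, the cube can be reused from another script. This is a minimal sketch, assuming the project is built on rdflib and that `get_cube()` takes no arguments and returns an `rdflib.Graph`:

```python
from rdflib import Graph

from cubes.care_providers import get_cube

# Build the care providers cube in memory.
# ASSUMPTION: get_cube() takes no arguments and returns an rdflib.Graph.
cube: Graph = get_cube()

# Serialize it to a location of your choice instead of the default output file.
cube.serialize(destination="care_providers_copy.ttl", format="turtle")
print(f"The cube contains {len(cube)} triples.")
```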
### Population Data Cube

- Script located in `cubes/population.py`.
- Import the `get_cube` function to use the cube elsewhere.
- If run as the main script, the cube is generated as an RDF Turtle file (`out/population.ttl`); the output can be queried as shown below the list.
- Uses the Pohyb obyvatel za ČR, kraje, okresy, SO ORP a obce - rok 2021 (population movement in the Czech Republic, its regions, districts, administrative districts of municipalities with extended powers, and municipalities, 2021) dataset.
- Uses the dataset from the care providers data cube to map counties to regions.
- Dimensions:
  - county
  - region
- Measures:
  - mean population
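Once generated, the Turtle file can be inspected with SPARQL through rdflib. This sketch uses only standard `qb:` vocabulary terms, so it does not depend on the cube's own IRIs:

```python
from rdflib import Graph

g = Graph()
g.parse("out/population.ttl", format="turtle")

# List a few observations together with their numeric values.
query = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?obs ?p ?value WHERE {
    ?obs a qb:Observation ;
         ?p ?value .
    FILTER(isNumeric(?value))
}
LIMIT 10
"""
for obs, p, value in g.query(query):
    print(obs, p, value)
```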
### Integrity Constraints

- The script `queries.py` checks the data cube integrity constraints for both cubes.
- Source of the constraints: The RDF Data Cube Vocabulary specification (an example check is sketched below this list).
- Output:
  - `True` = the data cube violates the corresponding constraint
  - `False` = the data cube does not break this constraint
- Ideally, all checks should return `False`.
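Each check is essentially an ASK query taken from the specification. For illustration, IC-1 ("Unique DataSet") can be evaluated with rdflib like this; the query text comes from the specification, but this is only a sketch of the approach, and the actual checks live in `queries.py`:

```python
from rdflib import Graph

# IC-1 (Unique DataSet) from The RDF Data Cube Vocabulary:
# every qb:Observation must belong to exactly one qb:DataSet.
IC1 = """
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  {
    # Observation with no data set at all
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset1 . }
  } UNION {
    # Observation attached to more than one data set
    ?obs a qb:Observation ;
         qb:dataSet ?dataset1, ?dataset2 .
    FILTER (?dataset1 != ?dataset2)
  }
}
"""

g = Graph()
g.parse("out/population.ttl", format="turtle")

# The ASK answer is True when the constraint IS violated,
# matching the True/False convention of queries.py.
violated = bool(g.query(IC1).askAnswer)
print(f"IC-1 violated: {violated}")
```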
## Task 2: Apache Airflow

### Requirements

- Python 3.10+ (tested on 3.10.5), Linux/WSL
### Installation

- Create a virtual environment and activate it (`python -m virtualenv venv`, `source venv/bin/activate`).
- Install Apache Airflow using pip, following the official instructions, or use Docker as an alternative.
- Install the required libraries (`pip install -r requirements.txt`).
- Copy the content of the `airflow/dags` directory into your DAGs folder; check `dags_folder` in `airflow.cfg` (`cp -r airflow/dags/* <dags_folder>`).
- Run the `data-cubes` DAG in the Apache Airflow web interface. You can specify the output directory using the "DAG with Config" option in Airflow; the format is `{"output_path": "./out"}`.
The structure of the data cubes is identical to the previous task. However, the transformation workflow has been improved and incomplete values are now dropped, so there may be minor differences compared to the previous cubes.
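As a rough sketch of how such a DAG can read the run configuration, the task name and the callable body below are illustrative, not the exact contents of `airflow/dags`:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def generate_cubes(**context):
    # Read the output directory from the "DAG with Config" run configuration,
    # falling back to ./out when no config is supplied.
    output_path = context["dag_run"].conf.get("output_path", "./out")
    # ... download the source data, build the cubes, serialize into output_path ...


with DAG(
    dag_id="data-cubes",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually from the web interface
    catchup=False,
):
    PythonOperator(
        task_id="generate_cubes",
        python_callable=generate_cubes,
    )
```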
## Task 3: Provenance

System requirements and installation instructions are the same as for Task 1.

Run `python provenance.py` to generate the provenance file as `out/provenance.trig`.

The resource names for the cubes have been changed in the Task 1 code so that each cube has its own unique one: `NSR.dataCubeInstance` -> `NSR.careProvidersDataCubeInstance`, `NSR.populationDataCubeInstance`.
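A sketch of how such a TriG file can be produced with rdflib's PROV-O namespace; the base IRI and the activity resource are illustrative, not the script's actual contents:

```python
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import PROV, RDF

# Illustrative namespace; the real NSR namespace is defined in the Task 1 code.
NSR = Namespace("https://example.org/resources/")

ds = Dataset()
g = ds.graph(URIRef("https://example.org/provenance"))

# Each cube instance is a prov:Entity generated by the cube-building activity.
for cube in (NSR.careProvidersDataCubeInstance, NSR.populationDataCubeInstance):
    g.add((cube, RDF.type, PROV.Entity))
    g.add((cube, PROV.wasGeneratedBy, NSR.cubeGenerationActivity))
g.add((NSR.cubeGenerationActivity, RDF.type, PROV.Activity))

ds.serialize(destination="out/provenance.trig", format="trig")
```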
## Task 4: SKOS Hierarchy and DCAT Dataset

System requirements and installation instructions are the same as for Task 1.
Run `python vocabs/skos_hierarchy.py` to generate the SKOS hierarchy in `out/skos_hierarchy.ttl`. I have decided to build the SKOS hierarchy in a separate script instead of adding it to the cubes, for improved readability.
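The hierarchy links counties to their regions with `skos:broader` inside a shared concept scheme. A sketch with rdflib, where the concept IRIs and labels are illustrative:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Illustrative base IRI and codes; the real ones come from the source data.
NS = Namespace("https://example.org/concepts/")

g = Graph()
g.bind("skos", SKOS)

scheme = NS.regionsAndCounties
g.add((scheme, RDF.type, SKOS.ConceptScheme))

region = NS.region_CZ010   # Hlavní město Praha
county = NS.county_CZ0100  # Praha

for concept, label in ((region, "Hlavní město Praha"), (county, "Praha")):
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="cs")))
    g.add((concept, SKOS.inScheme, scheme))

# A county sits below its region in the hierarchy.
g.add((county, SKOS.broader, region))
g.add((region, SKOS.narrower, county))

g.serialize(destination="out/skos_hierarchy.ttl", format="turtle")
```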
Run `python vocabs/dcat_dataset.py` to generate the DCAT dataset description for the population data cube in `out/dcat_dataset.ttl`.
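In essence, the DCAT description pairs a `dcat:Dataset` with a `dcat:Distribution` pointing at the Turtle file. A sketch with rdflib, where the IRIs, titles, and download URL are illustrative:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

# Illustrative base IRI; the real one is defined in vocabs/dcat_dataset.py.
NS = Namespace("https://example.org/datasets/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

dataset = NS.population
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Population 2021", lang="en")))
g.add((dataset, DCTERMS.description,
       Literal("Mean population per county in 2021.", lang="en")))

# The generated Turtle file is one distribution of the dataset.
distribution = NS.populationTurtle
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL,
       URIRef("https://example.org/out/population.ttl")))
g.add((distribution, DCAT.mediaType,
       URIRef("https://www.iana.org/assignments/media-types/text/turtle")))
g.add((dataset, DCAT.distribution, distribution))

g.serialize(destination="out/dcat_dataset.ttl", format="turtle")
```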