
gtfs-bench's Introduction

The GTFS-Madrid-Bench

We present GTFS-Madrid-Bench, a benchmark to evaluate declarative KG construction engines that provide access mechanisms to (virtual) knowledge graphs. Our proposal introduces several scenarios that measure the performance and scalability, as well as the query capabilities, of this kind of engine, taking their heterogeneity into account. The data sources used in our benchmark are derived from the GTFS data files of the Madrid subway network. They can be transformed into several formats (CSV, JSON, SQL and XML) and scaled up. The query set addresses a representative number of SPARQL 1.1 features while covering usual queries that data consumers may be interested in.

Main Publication:

David Chaves-Fraga, Freddy Priyatna, Andrea Cimmino, Jhon Toledo, Edna Ruckhaus, & Oscar Corcho (2020). GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain. Journal of Web Semantics, 65. Online

Citing GTFS-Madrid-Bench: If you used GTFS-Madrid-Bench in your work, please cite as:

@article{chaves2020gtfs,
  title={GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain},
  author={Chaves-Fraga, David and Priyatna, Freddy and Cimmino, Andrea and Toledo, Jhon and Ruckhaus, Edna and Corcho, Oscar},
  journal={Journal of Web Semantics},
  volume={65},
  pages={100596},
  year={2020},
  doi={10.1016/j.websem.2020.100596},
  publisher={Elsevier}
}

Results

  • Virtual KGC results can be reproduced through the resources provided in this branch
  • Materialized KGC results can be reproduced through the resources provided in this repo

Requirements:

Docker must be installed locally.

Decide which distributions to use for your testing. They can be:

  • Standard distributions: all data sources are represented in a single format (e.g., GTFS-CSV, GTFS-JSON or GTFS-SQL).
  • Custom distributions: each data source is represented in the format selected by the user (e.g., SHAPES in JSON, CALENDAR in CSV, etc.).

Using GTFS-Madrid-Bench:

  1. Download and run the docker image (always pull to ensure you are using the latest version of the image); a combined sketch is shown after this list.
  • Docker v20.10 or later: docker run --pull always -itv "$(pwd)":/output oegdataintegration/gtfs-bench
  • Earlier versions: docker pull oegdataintegration/gtfs-bench and then docker run -itv "$(pwd)":/output oegdataintegration/gtfs-bench
  2. Choose the data scales and formats of the distributions you want to test. You have to provide, first, the data scales (on one line, separated by commas); then select the standard distributions (from none to all) and, if needed, the configuration for one custom distribution. If you want to generate several custom distributions, you will have to run the generator several times.
  3. Optionally, you can apply a percentage of changes to the original data. A seed value can be provided to generate different changes, simulating multiple changed dumps. The following changes can be generated:
    • Additions: routes and their associated trips, stops, stop times and services are added to the data. Example: 25% additions will add new routes amounting to 25% of the number of routes in the original data.
    • Modifications: service entries for trips are modified. Example: 50% modifications will modify 50% of the service entries in the calendar.
    • Deletions: routes and their associated trips and services are removed from the data. Example: 10% deletions will remove 10% of the routes in the original data, together with their associated data.
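
A minimal end-to-end sketch of the steps above, assuming Docker v20.10 or later (scales, distributions and changes are all chosen at the interactive prompts):

# Run the generator from the directory where the output should land.
docker run --pull always -itv "$(pwd)":/output oegdataintegration/gtfs-bench
# The generator writes result.zip into the mounted directory.
unzip result.zip -d gtfs-bench-output
ls gtfs-bench-output/datasets gtfs-bench-output/queries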

Demo usage: Demo GIF

  4. The result will be available as result.zip in the current working directory. The folder structure is: one folder for the datasets and another for the queries (for virtual KG). Inside the datasets folder there is one folder per distribution (e.g., csv, sql, custom); each distribution folder contains the requested sizes (one folder per size), the mapping associated with the distribution, and the SQL schemas if they are needed. Note that, to avoid repeating resources at the scale level, the mappings and the SQL paths to the data are defined at the distribution level (e.g., "data/AGENCY.csv"), so managing them for a correct evaluation is up to the user (with a script, for example; see the sketch after the tree below). You can visit the utils folder, where we provide some ideas on how to manage this. See the following example:
.
├── datasets
│   ├── csv
│   │   ├── 1
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   ├── 2
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   ├── 3
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   └── mapping.csv.nt
│   ├── json
│   │   ├── 1
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   ├── 2
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   ├── 3
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   └── mapping.json.nt
│   └── sql
│       ├── 1
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── 2
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── 3
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── mapping.sql.nt
│       └── schema.sql
└── queries
    ├── q10.rq
    ├── q11.rq
    ├── q12.rq
    ├── q13.rq
    ├── q14.rq
    ├── q15.rq
    ├── q16.rq
    ├── q17.rq
    ├── q18.rq
    ├── q1.rq
    ├── q2.rq
    ├── q3.rq
    ├── q4.rq
    ├── q5.rq
    ├── q6.rq
    ├── q7.rq
    ├── q8.rq
    └── q9.rq
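
Since the mappings and the SQL paths are defined at the distribution level, a minimal bash sketch for staging one size of one distribution for evaluation could look as follows (the run folder and the target file names are illustrative assumptions, not part of the benchmark output):

#!/bin/bash
# Stage scale 1 of the csv distribution so that the mapping's relative
# data paths (e.g., "data/AGENCY.csv") resolve correctly.
SIZE=1
mkdir -p run/data
cp datasets/csv/"$SIZE"/*.csv run/data/        # data where the mapping expects it
cp datasets/csv/mapping.csv.nt run/mapping.nt  # distribution-level mapping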

Resources

In addition to the generator engine, which provides the data at the desired scales and distributions together with the corresponding mappings and queries, there are also common resources openly available to be modified or used by any practitioner or developer:

  • Folder mappings contains RML mappings for the CSV, XML, JSON and RDB distributions of the input GTFS dataset, an R2RML mapping for RDB, and an xR2RML mapping for MongoDB. It also includes CSVW annotations for the CSV distributions.
  • Folder queries includes 18 queries of varying complexity that cover a representative set of SPARQL 1.1 operators. Additionally, the folder contains 11 simple queries that help test the basic capabilities of virtual KG construction engines (i.e., to check whether an engine translates the SPARQL operators correctly over different GTFS distributions before performance and scalability are measured).
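
Once an engine exposes a SPARQL endpoint over one of the distributions, the provided queries can be issued against it directly. A small sketch using curl (the endpoint URL is an assumption; adapt it to the engine under test):

# Send q1.rq to a (hypothetical) SPARQL endpoint and save the answers.
curl -G http://localhost:8080/sparql \
  --data-urlencode query@queries/q1.rq \
  -H "Accept: application/sparql-results+json" > q1-results.json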

Utils

Our experience testing (virtual) knowledge graph engines has revealed the difficulty of setting up an infrastructure in which many variables and resources are involved: databases, raw data, mappings, queries, data paths, mapping paths, database connections, etc. For that reason, and to make the benchmark easier to use for any developer or practitioner, we provide a set of utils, such as docker-compose templates and evaluation bash scripts, that can reduce the time needed to prepare the testing setup.
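
As an illustration, a minimal evaluation loop in bash could look as follows (run-engine stands for a hypothetical engine CLI; substitute the actual invocation of the engine under test):

#!/bin/bash
# Run every benchmark query and record the wall-clock time per query.
mkdir -p results
for q in queries/q*.rq; do
  start=$(date +%s%N)
  run-engine --mapping run/mapping.nt --query "$q" > "results/$(basename "$q" .rq).out"
  end=$(date +%s%N)
  echo "$q: $(( (end - start) / 1000000 )) ms"
done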

Desirable Metrics:

We highly recommend that KG construction engines (virtualizers or materializers) tested with this benchmark report (at least) the following metrics:

  • Total execution time
  • Number of answers
  • Memory consumption
  • Initial delay
  • Dief@k (only for continuous/streaming behavior)*
  • Dief@t (only for continuous/streaming behavior)*

For virtual knowledge graph systems, we also encourage developers and testers to provide:

  • Loading time
  • Mapping translation time (if applies)
  • Number of requests
  • Source selection time
  • Query generation (or distribution) time
  • Query rewriting time
  • Query translation time
  • Query execution time
  • Results aggregation time

*R package available at https://github.com/dachafra/dief (an extension of https://github.com/maribelacosta/dief); a Python module is also available on PyPI at https://pypi.org/project/diefpy/ (provided by SDM-TIB)
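
On Linux, the total execution time and the memory consumption can be captured with GNU time, and the number of answers counted from the result serialization. A sketch (run-engine is again a hypothetical engine CLI):

# Wall-clock time and peak memory (Maximum resident set size) with GNU time.
/usr/bin/time -v run-engine --mapping run/mapping.nt --query queries/q1.rq \
  > answers.json 2> stats.txt
grep -E 'Elapsed|Maximum resident' stats.txt
# Number of answers, assuming SPARQL JSON results output.
jq '.results.bindings | length' answers.json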

Data License

All the datasets generated by this benchmark have to follow the license of the Consorcio Regional de Transportes de Madrid: https://www.crtm.es/licencia-de-uso?lang=en

Contribute

We know that there are variables and dimensions that we did not take into account in the current version of the benchmark (e.g., transformation functions defined in the mapping rules). If you are interested in collaborating with us on a new version of the benchmark, send us an email or open a new discussion!

Authors

  • David Chaves-Fraga - [email protected]
  • Freddy Priyatna
  • Jhon Toledo
  • Daniel Doña
  • Edna Ruckhaus
  • Andrea Cimmino
  • Oscar Corcho

Ontology Engineering Group, October 2019 - Present


gtfs-bench's Issues

improve q5 vig

The filter in the original q5 uses only a DATE, but the filter in the vig q5 uses a DATETIME.
According to the GTFS spec, date in calendar_dates is a DATE, not a DATETIME.
Shall we drop the time in the vig q5?
Even better, we could use the same filters as in the original, which are:
?serviceRule dct:date "2017-12-25"^^xsd:date .
?serviceRule gtfs:dateAddition "true"^^xsd:boolean .

replace default forward port

Don't use default ports such as 3306/27017 for GTFS1, because they make the docker compose setup incompatible with systems that already have MySQL/MongoDB installed.
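
One way to avoid the clash is to publish the container's default port on a non-default host port, e.g.:

# Expose MySQL on host port 3307 instead of 3306 (values illustrative).
docker run -d -p 3307:3306 -e MYSQL_ROOT_PASSWORD=secret mysql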

Generated JSON format is not correct.

When generating JSON files with the benchmark, they are created in the wrong format: some brackets are missing.

Example of generated JSON:

{"agency_id": "00000000000000000001", "agency_name": "00000000000000000001", "agency_url": "00000000000000000001", "agency_timezone": "00000000000000000001", "agency_lang": "00000000000
000000001", "agency_phone": "00000000000000000001", "agency_fare_url": "00000000000000000001"}

Proper JSON format:

[{"agency_id": "CRTM", "agency_name": "Consorcio Regional de Transportes de Madrid", "agency_url": "http://www.crtm.es", "agency_timezone": "Europe/Madrid", "agency_lang": "es", "agency_phone": "", "agency_fare_url": ""}]

Improve generator's documentation

Improve the current documentation of the generator's output with:

  • explanation of the output folders and how they are organized
  • docker-compose template to support deploying different instances of the generator's output

XML mapping doesn't define iterator.

I have been using the benchmark to generate some source data, as well as the mapping files.

When I'm using XML source files, I'm getting this error:
Semantifying KGcountry...
TM: http://mapping.example.com/map_shapes_0
The attribute shape_id is missing.

It seems the generator is not defining the iterator, so when running SDM-RDFizer locally with the generated mapping, the knowledge graph is not created.

Knowledge graph generated with mapping from generator has problems in properties

The KG is generated incorrectly; it has some mistakes in its properties. For example:
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/phone "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://vocab.gtfs.org/terms#fareUrl "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://vocab.gtfs.org/terms#timeZone "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://purl.org/dc/terms/language "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/name "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/page "00000000000000000001".

For the agency 00000000000000000001, the property foaf:page is given a string value, but in the FOAF vocabulary foaf:page relates a resource to a document, not to a string. The same situation occurs for many other properties. This means that correct data is not available to evaluate the accuracy of the generated knowledge graphs.

Check 10-json and 10-xml datasets

Applying the mappings to the 10-xml and 10-json datasets (the ones provided in the repo) I obtain a different number of triples w.r.t. the ones obtained from 10-csv lifting. The same does not happen for other sizes (e.g. 5-csv, 5-json, 5-xml produce the same number of triples).

Given that the mappings used are the same for every size I think there could be a problem in the datasets. The number of records is correct (number of Record nodes in JSON/XML files is the same as the number of rows in the CSV files), so my guess is that there are some identifiers used for the joins in RML that are not correct in all the JSON/XML files of the 10-json/xml datasets.
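
A sketch of the record-count check described above (the Record element name is taken from this report; paths and header handling are illustrative):

jq length datasets/json/10/AGENCY.json                        # JSON records
xmllint --xpath 'count(//Record)' datasets/xml/10/AGENCY.xml  # XML records
tail -n +2 datasets/csv/10/AGENCY.csv | wc -l                 # CSV rows minus header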

refactor docker/bash scripts

Not everyone has a server on which to set up our benchmark. Ideally, it should be easy to configure, e.g. if one just wants to generate a certain size rather than all sizes.

Generate simple SPARQL queries based on the current ones

Generate a set of simple SPARQL queries based on the current ones, each including only one SPARQL operator. This would help engines discover which operators they do not cover. These queries would work as a set of "test cases" but do not necessarily have to be used to report the performance and scalability of the engines (the complex/original ones have to be used for that), which is the main focus of the benchmark.

Fix column names in mappings

  • L203: in the GTFS specification the field is called feed_publisher_url. Change feed_published_url to feed_publisher_url.
  • L217: in the GTFS specification the field is called shape_dist_traveled. However, the datasets provided in the Madrid-Bench use shape_dist, so this should not be an issue.

error using mapping custom generated by benchmark.

Hi, when I'm using a custom mapping file generated by the benchmark with the SDM tool, I'm getting the following error:

Semantifying KGCase04.nt...
TM: http://mapping.example.com/map_calendar_date_rules_0
TM: http://mapping.example.com/map_calendar_rules_0
TM: http://mapping.example.com/map_trips_0
TM: http://mapping.example.com/map_shapes_0
TM: http://mapping.example.com/map_services2_0
TM: http://mapping.example.com/map_feed_0
TM: http://mapping.example.com/map_services1_0
TM: http://mapping.example.com/map_stoptimes_0
Traceback (most recent call last):
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 309, in config
    self._port = int(config['port'])
ValueError: invalid literal for int() with base 10: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/rdfizer/semantify.py", line 3643, in semantify
    number_triple += executor.submit(semantify_file, triples_map, triples_map_list, ",", output_file_descriptor, wr, config[dataset_i]["name"], data).result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/rdfizer/semantify.py", line 2055, in semantify_file
    db = connector.connect(host=host, port=port, user=user, password=password)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/__init__.py", line 183, in connect
    return MySQLConnection(*args, **kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/connection.py", line 100, in __init__
    self.connect(**kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 733, in connect
    self.config(**kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 314, in config
    raise errors.InterfaceError(
mysql.connector.errors.InterfaceError: TCP/IP port number should be an integer

It can be reproduced by running the latest version of the benchmark, with size 1 and the following file distribution:

Custom Distribution:

? [ Custom distribution ] Select output format for AGENCY TM: JSON
? [ Custom distribution ] Select output format for CALENDAR_DATES TM: XML
? [ Custom distribution ] Select output format for CALENDAR TM: CSV
? [ Custom distribution ] Select output format for FEED_INFO TM: JSON
? [ Custom distribution ] Select output format for FREQUENCIES TM: XML
? [ Custom distribution ] Select output format for ROUTES TM: CSV
? [ Custom distribution ] Select output format for SHAPES TM: JSON
? [ Custom distribution ] Select output format for STOPS TM: XML
? [ Custom distribution ] Select output format for STOP_TIMES TM: CSV
? [ Custom distribution ] Select output format for TRIPS TM: JSON

Then the mapping file is used locally with SDM.

I'm attaching the mapping file, as well as the config file used.
Dropbox.zip

validate q17

Check whether q17 is a valid SPARQL query, or whether it only works in Virtuoso.

update q13

Update the parameter in q13 vig; it seems that VIG does not generate "Nuevos Ministerios".

q9 no results

q9 has no results: having gtfsstop:acc_4_1_1 as the instantiated stop does not produce an answer for this query, because that resource is an access to the stop.

Path sources in mappings

Change the paths of the sources to absolute ones (/data/frequencies.csv instead of data/frequencies.csv).

check q16

The range of the property gtfs:service in the Trip class is gtfs:Service, not gtfs:ServiceDates.

update q2 vig

FILTER (?stopLat > 200.0) .
FILTER (?stopLat < 2000.0) .
FILTER (?stopLong > 400.0) .
FILTER (?stopLong < 4000.0) .

This makes little sense, because these are latitude and longitude, which range from -180 to 180.

My recommendation is to use the same filters as in the original q2, which are:
FILTER (?stopLat > 40.20) .
FILTER (?stopLat < 40.80) .
FILTER (?stopLong > -3.75) .
FILTER (?stopLong < -3.72) .

Generate size specific mapping

The generated mappings are currently size-agnostic, i.e. generic for every distribution. It would be better to generate one mapping per size and distribution.

Add function to process exception_type in calendar_dates GTFS file

L180: exception_type is not of boolean type, since it uses 1 and 2 as values in GTFS. This can cause mixed boolean/string values (1 is converted, 2 is not). A function is required during lifting to match the boolean range of gtfs:dateAddition, or the input files should be changed accordingly.
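
A minimal bash sketch of such a pre-processing step, rewriting exception_type before lifting (the column position assumes the GTFS calendar_dates order service_id,date,exception_type; adjust it to the actual file):

# Map exception_type 1 (service added) to true and 2 (service removed) to false.
awk -F, 'BEGIN{OFS=","} NR==1{print; next} {$3 = ($3 == "1" ? "true" : "false"); print}' \
  CALENDAR_DATES.csv > CALENDAR_DATES.fixed.csv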
