
gtfs-bench's Introduction

The GTFS-Madrid-Bench

We present GTFS-Madrid-Bench, a benchmark to evaluate declarative KG construction engines that provide access mechanisms to (virtual) knowledge graphs. Our proposal introduces several scenarios that measure the performance and scalability, as well as the query capabilities, of this kind of engine, taking their heterogeneity into account. The data sources used in our benchmark are derived from the GTFS data files of the Madrid subway network. They can be transformed into several formats (CSV, JSON, SQL and XML) and scaled up. The query set addresses a representative number of SPARQL 1.1 features while covering usual queries that data consumers may be interested in.

Main Publication:

David Chaves-Fraga, Freddy Priyatna, Andrea Cimmino, Jhon Toledo, Edna Ruckhaus, & Oscar Corcho (2020). GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain. Journal of Web Semantics, 65. Online

Citing GTFS-Madrid-Bench: If you used GTFS-Madrid-Bench in your work, please cite as:

@article{chaves2020gtfs,
  title={GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain},
  author={Chaves-Fraga, David and Priyatna, Freddy and Cimmino, Andrea and Toledo, Jhon and Ruckhaus, Edna and Corcho, Oscar},
  journal={Journal of Web Semantics},
  volume={65},
  pages={100596},
  year={2020},
  doi={10.1016/j.websem.2020.100596},
  publisher={Elsevier}
}

Results

  • Virtual KGC results can be reproduced through the resources provided in this branch
  • Materialized KGC results can be reproduced through the resources provided in this repo

Requirements:

Docker must be installed locally.

Decide which distributions to use for your testing. They can be:

  • Standard distributions: all data sources are represented in a single format (e.g., GTFS-CSV, GTFS-JSON or GTFS-SQL).
  • Custom distributions: each data source is represented in the format selected by the user (e.g., SHAPES in JSON, CALENDAR in CSV, etc.).

Using GTFS-Madrid-Bench:

  1. Download and run the docker image (always pull to ensure you are using the latest version of the image); a combined sketch is shown after this list.
  • Docker v20.10 or later: docker run --pull always -itv "$(pwd)":/output oegdataintegration/gtfs-bench
  • Earlier versions: docker pull oegdataintegration/gtfs-bench and then docker run -itv "$(pwd)":/output oegdataintegration/gtfs-bench
  2. Choose the data scales and formats of the distributions you want to test. You have to provide, first, the data scales (on one line, separated by commas); then select the standard distributions (from none to all) and, if needed, the configuration for one custom distribution. If you want to generate several custom distributions, you will have to run the generator several times.
  3. Optionally, you can apply a percentage of changes to the original data. A seed value can be provided to generate different changes, simulating multiple changed dumps. The following changes can be generated:
    • Additions: routes and their associated trips, stops, stop times and services are added to the data. Example: 25% additions will add new routes amounting to 25% of the number of routes in the original data.
    • Modifications: service entries for trips are modified. Example: 50% modifications will modify 50% of the service entries in the calendar.
    • Deletions: routes and their associated trips and services are removed from the data. Example: 10% deletions will remove 10% of the routes in the original data, together with their associated data.
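
A minimal end-to-end sketch of the steps above, assuming Docker v20.10 or later (scales, distributions and changes are all chosen at the interactive prompts):

# Run the generator from the directory where the output should land.
docker run --pull always -itv "$(pwd)":/output oegdataintegration/gtfs-bench
# The generator writes result.zip into the mounted directory.
unzip result.zip -d gtfs-bench-output
ls gtfs-bench-output/datasets gtfs-bench-output/queries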

Demo usage: Demo GIF

  4. The result will be available as result.zip in the current working directory. The folder structure is: one folder for the datasets and another for the queries (for virtual KG). Inside the datasets folder there is one folder per distribution (e.g., csv, sql, custom); each distribution folder contains the requested sizes (one folder per size), the mapping associated with the distribution, and the SQL schemas if they are needed. Note that, to avoid repeating resources at the scale level, the mappings and the SQL paths to the data are defined at the distribution level (e.g., "data/AGENCY.csv"), so managing them for a correct evaluation is up to the user (with a script, for example; see the sketch after the tree below). You can visit the utils folder, where we provide some ideas on how to manage this. See the following example:
.
├── datasets
│   ├── csv
│   │   ├── 1
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   ├── 2
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   ├── 3
│   │   │   ├── AGENCY.csv
│   │   │   ├── CALENDAR.csv
│   │   │   ├── CALENDAR_DATES.csv
│   │   │   ├── FEED_INFO.csv
│   │   │   ├── FREQUENCIES.csv
│   │   │   ├── ROUTES.csv
│   │   │   ├── SHAPES.csv
│   │   │   ├── STOPS.csv
│   │   │   ├── STOP_TIMES.csv
│   │   │   └── TRIPS.csv
│   │   └── mapping.csv.nt
│   ├── json
│   │   ├── 1
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   ├── 2
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   ├── 3
│   │   │   ├── AGENCY.json
│   │   │   ├── CALENDAR_DATES.json
│   │   │   ├── CALENDAR.json
│   │   │   ├── FEED_INFO.json
│   │   │   ├── FREQUENCIES.json
│   │   │   ├── ROUTES.json
│   │   │   ├── SHAPES.json
│   │   │   ├── STOPS.json
│   │   │   ├── STOP_TIMES.json
│   │   │   └── TRIPS.json
│   │   └── mapping.json.nt
│   └── sql
│       ├── 1
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── 2
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── 3
│       │   ├── AGENCY.csv
│       │   ├── CALENDAR.csv
│       │   ├── CALENDAR_DATES.csv
│       │   ├── FEED_INFO.csv
│       │   ├── FREQUENCIES.csv
│       │   ├── ROUTES.csv
│       │   ├── SHAPES.csv
│       │   ├── STOPS.csv
│       │   ├── STOP_TIMES.csv
│       │   └── TRIPS.csv
│       ├── mapping.sql.nt
│       └── schema.sql
└── queries
    ├── q10.rq
    ├── q11.rq
    ├── q12.rq
    ├── q13.rq
    ├── q14.rq
    ├── q15.rq
    ├── q16.rq
    ├── q17.rq
    ├── q18.rq
    ├── q1.rq
    ├── q2.rq
    ├── q3.rq
    ├── q4.rq
    ├── q5.rq
    ├── q6.rq
    ├── q7.rq
    ├── q8.rq
    └── q9.rq
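
Since the mappings and the SQL paths are defined at the distribution level, a minimal bash sketch for staging one size of one distribution for evaluation could look as follows (the run folder and the target file names are illustrative assumptions, not part of the benchmark output):

#!/bin/bash
# Stage scale 1 of the csv distribution so that the mapping's relative
# data paths (e.g., "data/AGENCY.csv") resolve correctly.
SIZE=1
mkdir -p run/data
cp datasets/csv/"$SIZE"/*.csv run/data/        # data where the mapping expects it
cp datasets/csv/mapping.csv.nt run/mapping.nt  # distribution-level mapping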

Resources

In addition to the generator engine, which provides the data at the desired scales and distributions together with the corresponding mappings and queries, there are also common resources openly available to be modified or used by any practitioner or developer:

  • Folder mappings contains RML mappings for the CSV, XML, JSON and RDB distributions of the input GTFS dataset, an R2RML mapping for RDB, and an xR2RML mapping for MongoDB. It also includes CSVW annotations for the CSV distributions.
  • Folder queries includes 18 queries of varying complexity that cover a representative set of SPARQL 1.1 operators. Additionally, the folder contains 11 simple queries that help test the basic capabilities of virtual KG construction engines (i.e., to check whether an engine translates the SPARQL operators correctly over different GTFS distributions before performance and scalability are measured).
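
Once an engine exposes a SPARQL endpoint over one of the distributions, the provided queries can be issued against it directly. A small sketch using curl (the endpoint URL is an assumption; adapt it to the engine under test):

# Send q1.rq to a (hypothetical) SPARQL endpoint and save the answers.
curl -G http://localhost:8080/sparql \
  --data-urlencode query@queries/q1.rq \
  -H "Accept: application/sparql-results+json" > q1-results.json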

Utils

Our experience testing (virtual) knowledge graph engines has revealed the difficulty of setting up an infrastructure in which many variables and resources are involved: databases, raw data, mappings, queries, data paths, mapping paths, database connections, etc. For that reason, and to make the benchmark easier to use for any developer or practitioner, we provide a set of utils, such as docker-compose templates and evaluation bash scripts, that can reduce the time needed to prepare the testing setup.
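
As an illustration, a minimal evaluation loop in bash could look as follows (run-engine stands for a hypothetical engine CLI; substitute the actual invocation of the engine under test):

#!/bin/bash
# Run every benchmark query and record the wall-clock time per query.
mkdir -p results
for q in queries/q*.rq; do
  start=$(date +%s%N)
  run-engine --mapping run/mapping.nt --query "$q" > "results/$(basename "$q" .rq).out"
  end=$(date +%s%N)
  echo "$q: $(( (end - start) / 1000000 )) ms"
done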

Desirable Metrics:

We highly recommend that KG construction engines (virtualizers or materializers) tested with this benchmark report (at least) the following metrics:

  • Total execution time
  • Number of answers
  • Memory consumption
  • Initial delay
  • Dief@k (only for continuous/streaming behavior)*
  • Dief@t (only for continuous/streaming behavior)*

For virtual knowledge graph systems, we also encourage developers and testers to provide:

  • Loading time
  • Mapping translation time (if applies)
  • Number of requests
  • Source selection time
  • Query generation (or distribution) time
  • Query rewriting time
  • Query translation time
  • Query execution time
  • Results aggregation time

*R package available at https://github.com/dachafra/dief (an extension of https://github.com/maribelacosta/dief); a Python module is also available on PyPI at https://pypi.org/project/diefpy/ (provided by SDM-TIB)
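
On Linux, the total execution time and the memory consumption can be captured with GNU time, and the number of answers counted from the result serialization. A sketch (run-engine is again a hypothetical engine CLI):

# Wall-clock time and peak memory (Maximum resident set size) with GNU time.
/usr/bin/time -v run-engine --mapping run/mapping.nt --query queries/q1.rq \
  > answers.json 2> stats.txt
grep -E 'Elapsed|Maximum resident' stats.txt
# Number of answers, assuming SPARQL JSON results output.
jq '.results.bindings | length' answers.json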

Data License

All the datasets generated by this benchmark have to follow the license of the Consorcio Regional de Transportes de Madrid: https://www.crtm.es/licencia-de-uso?lang=en

Contribute

We know that there are variables and dimensions that we did not take into account in the current version of the benchmark (e.g., transformation functions defined in the mapping rules). If you are interested in collaborating with us on a new version of the benchmark, send us an email or open a new discussion!

Authors

  • David Chaves-Fraga - [email protected]
  • Freddy Priyatna
  • Jhon Toledo
  • Daniel Doña
  • Edna Ruckhaus
  • Andrea Cimmino
  • Oscar Corcho

Ontology Engineering Group, October 2019 - Present


gtfs-bench's Issues

improve q5 vig

The filter in the original q5 uses only a DATE, but the filter in the vig q5 uses a DATETIME.
According to the GTFS spec, date in calendar_dates is a DATE, not a DATETIME.
Shall we drop the time in the vig q5?
Even better, we could use the same filters as in the original, which are:
?serviceRule dct:date "2017-12-25"^^xsd:date .
?serviceRule gtfs:dateAddition "true"^^xsd:boolean .

replace default forward port

Don't use default ports such as 3306/27017 for GTFS1, because they make the docker compose setup incompatible with systems that already have MySQL/MongoDB installed.
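
One way to avoid the clash is to publish the container's default port on a non-default host port, e.g.:

# Expose MySQL on host port 3307 instead of 3306 (values illustrative).
docker run -d -p 3307:3306 -e MYSQL_ROOT_PASSWORD=secret mysql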

Generated JSON format is not correct.

When generating JSON files with the benchmark, they are created in the wrong format: some brackets are missing.

Example of generated JSON:

{"agency_id": "00000000000000000001", "agency_name": "00000000000000000001", "agency_url": "00000000000000000001", "agency_timezone": "00000000000000000001", "agency_lang": "00000000000
000000001", "agency_phone": "00000000000000000001", "agency_fare_url": "00000000000000000001"}

Proper JSON format:

[{"agency_id": "CRTM", "agency_name": "Consorcio Regional de Transportes de Madrid", "agency_url": "http://www.crtm.es", "agency_timezone": "Europe/Madrid", "agency_lang": "es", "agency_phone": "", "agency_fare_url": ""}]

Improve generator's documentation

Improve the current documentation of the generator's output with:

  • explanation of the output folders and how they are organized
  • docker-compose template to support deploying different instances of the generator's output

XML mapping doesn't define iterator.

I have been using the benchmark to generate some source data, as well as the mapping files.

When I'm using XML source files, I'm getting this error:
Semantifying KGcountry...
TM: http://mapping.example.com/map_shapes_0
The attribute shape_id is missing.

It seems the generator is not defining the iterator, so when running SDM-RDFizer locally with the generated mapping, the knowledge graph is not created.

Knowledge graph generated with mapping from generator has problems in properties

The KG is generated incorrectly; it has some mistakes in its properties. For example:
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/phone "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://vocab.gtfs.org/terms#fareUrl "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://vocab.gtfs.org/terms#timeZone "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://purl.org/dc/terms/language "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/name "00000000000000000001".
http://transport.linkeddata.es/madrid/agency/00000000000000000001 http://xmlns.com/foaf/0.1/page "00000000000000000001".

For the agency 00000000000000000001, the property foaf:page is given a string value, but in the FOAF vocabulary foaf:page relates a resource to a document, not to a string. The same situation occurs for many other properties. This means that correct data is not available to evaluate the accuracy of the generated knowledge graphs.

Check 10-json and 10-xml datasets

Applying the mappings to the 10-xml and 10-json datasets (the ones provided in the repo) I obtain a different number of triples w.r.t. the ones obtained from 10-csv lifting. The same does not happen for other sizes (e.g. 5-csv, 5-json, 5-xml produce the same number of triples).

Given that the mappings used are the same for every size I think there could be a problem in the datasets. The number of records is correct (number of Record nodes in JSON/XML files is the same as the number of rows in the CSV files), so my guess is that there are some identifiers used for the joins in RML that are not correct in all the JSON/XML files of the 10-json/xml datasets.
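
A sketch of the record-count check described above (the Record element name is taken from this report; paths and header handling are illustrative):

jq length datasets/json/10/AGENCY.json                        # JSON records
xmllint --xpath 'count(//Record)' datasets/xml/10/AGENCY.xml  # XML records
tail -n +2 datasets/csv/10/AGENCY.csv | wc -l                 # CSV rows minus header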

refactor docker/bash scripts

Not everyone has a server on which to set up our benchmark. Ideally, it should be easy to configure, e.g. if one just wants to generate a certain size rather than all sizes.

Generate simple SPARQL queries based on the current ones

Generate a set of simple SPARQL queries based on the current ones, each including only one SPARQL operator. This would help engines discover which operators they do not cover. These queries would work as a set of "test cases" but do not necessarily have to be used to report the performance and scalability of the engines (the complex/original ones have to be used for that), which is the main focus of the benchmark.

Fix column names in mappings

  • L203: in the GTFS specification the field is called feed_publisher_url. Change feed_published_url to feed_publisher_url.
  • L217: in the GTFS specification the field is called shape_dist_traveled. However, the datasets provided in the Madrid-Bench use shape_dist, so this should not be an issue.

error using mapping custom generated by benchmark.

Hi, when I'm using a custom mapping file generated by the benchmark with the SDM tool, I'm getting the following error:

Semantifying KGCase04.nt...
TM: http://mapping.example.com/map_calendar_date_rules_0
TM: http://mapping.example.com/map_calendar_rules_0
TM: http://mapping.example.com/map_trips_0
TM: http://mapping.example.com/map_shapes_0
TM: http://mapping.example.com/map_services2_0
TM: http://mapping.example.com/map_feed_0
TM: http://mapping.example.com/map_services1_0
TM: http://mapping.example.com/map_stoptimes_0
Traceback (most recent call last):
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 309, in config
    self._port = int(config['port'])
ValueError: invalid literal for int() with base 10: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/rdfizer/semantify.py", line 3643, in semantify
    number_triple += executor.submit(semantify_file, triples_map, triples_map_list, ",", output_file_descriptor, wr, config[dataset_i]["name"], data).result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/fborrero/SDM-RDFizer/SDM-RDFizer/rdfizer/rdfizer/semantify.py", line 2055, in semantify_file
    db = connector.connect(host=host, port=port, user=user, password=password)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/__init__.py", line 183, in connect
    return MySQLConnection(*args, **kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/connection.py", line 100, in __init__
    self.connect(**kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 733, in connect
    self.config(**kwargs)
  File "/home/fborrero/env/lib/python3.8/site-packages/mysql/connector/abstracts.py", line 314, in config
    raise errors.InterfaceError(
mysql.connector.errors.InterfaceError: TCP/IP port number should be an integer

It can be reproduced by running the latest version of the benchmark, with size 1 and the following file distribution:

Custom Distribution:

? [ Custom distribution ] Select output format for AGENCY TM: JSON
? [ Custom distribution ] Select output format for CALENDAR_DATES TM: XML
? [ Custom distribution ] Select output format for CALENDAR TM: CSV
? [ Custom distribution ] Select output format for FEED_INFO TM: JSON
? [ Custom distribution ] Select output format for FREQUENCIES TM: XML
? [ Custom distribution ] Select output format for ROUTES TM: CSV
? [ Custom distribution ] Select output format for SHAPES TM: JSON
? [ Custom distribution ] Select output format for STOPS TM: XML
? [ Custom distribution ] Select output format for STOP_TIMES TM: CSV
? [ Custom distribution ] Select output format for TRIPS TM: JSON

Then the mapping file is used locally with SDM.

I'm attaching the mapping file, as well as the config file used.
Dropbox.zip

validate q17

Check whether q17 is a valid SPARQL query, or whether it only works in Virtuoso.

update q13

Update the parameter in q13 vig; it seems that VIG does not generate "Nuevos Ministerios".

q9 no results

q9 has no results: having gtfsstop:acc_4_1_1 as the instantiated stop does not produce an answer for this query, because that resource is an access to the stop.

Path sources in mappings

Change the paths of the sources to absolute ones (/data/frequencies.csv instead of data/frequencies.csv).

check q16

The range of the property gtfs:service in the Trip class is gtfs:Service, not gtfs:ServiceDates.

update q2 vig

FILTER (?stopLat > 200.0) .
FILTER (?stopLat < 2000.0) .
FILTER (?stopLong > 400.0) .
FILTER (?stopLong < 4000.0) .

This makes little sense, because these are latitude and longitude, which range from -180 to 180.

My recommendation is to use the same filters as in the original q2, which are:
FILTER (?stopLat > 40.20) .
FILTER (?stopLat < 40.80) .
FILTER (?stopLong > -3.75) .
FILTER (?stopLong < -3.72) .

Generate size specific mapping

The generated mappings are currently size-agnostic, i.e. generic for every distribution. It would be better to generate one mapping per size and distribution.

Add function to process exception_type in calendar_dates GTFS file

L180: exception_type is not of boolean type, since it uses 1 and 2 as values in GTFS. This can cause mixed boolean/string values (1 is converted, 2 is not). A function is required during lifting to match the boolean range of gtfs:dateAddition, or the input files should be changed accordingly.
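
A minimal bash sketch of such a pre-processing step, rewriting exception_type before lifting (the column position assumes the GTFS calendar_dates order service_id,date,exception_type; adjust it to the actual file):

# Map exception_type 1 (service added) to true and 2 (service removed) to false.
awk -F, 'BEGIN{OFS=","} NR==1{print; next} {$3 = ($3 == "1" ? "true" : "false"); print}' \
  CALENDAR_DATES.csv > CALENDAR_DATES.fixed.csv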
