pdf2gtfs

pdf2gtfs can be used to extract schedule data from PDF timetables and turn it into valid GTFS.

It was created as a Bachelor's project and thesis at the Chair of Algorithms and Data Structures at the University of Freiburg.

The Bachelor's thesis, which goes into more detail and adds an evaluation, can be found here. A (shorter) blogpost detailing its usage can be found here, though some parts are outdated.

Getting started

The master branch contains all the latest changes and is unstable. The release branch usually points to the latest tag, though it may contain some additional fixes.

Prerequisites

  • Linux (Windows should work as well, but I currently do not test this)
  • python3.10 or higher (required)
  • ghostscript >= 9.56.1-1 (recommended)

Older versions may work as well, but only the versions given above are officially supported.

Installation and Usage

Note: Using pip won't install those dependencies that are required only for development.

1. Clone the repository (with submodules):

git clone --recursive https://github.com/heijul/pdf2gtfs.git
cd pdf2gtfs

2. (Optional) Create a venv and activate it (more info):

python3.11 -m venv venv
source venv/bin/activate

Under Windows, you have to activate the venv using `venv\Scripts\activate` instead.

3. Install pdf2gtfs using pip or poetry.

Note: With pip you will have to manually install the development requirements (defined in pyproject.toml).

Note: If pip/poetry complains that no pyproject.toml exists for custom_conf, you forgot to add the --recursive flag. To fix this, simply run git submodule update --init --recursive.

Using pip:

pip install .

Using poetry (requires poetry, of course):

poetry install

Using poetry and also installing the development requirements:

poetry install --with=dev

4. (Optional) Run the tests.

Using unittest:

python -m unittest discover test

Using pytest:

pytest test

5. Run pdf2gtfs.

pdf2gtfs -h

This will provide help on the usage of pdf2gtfs.

Configuration

pdf2gtfs reads the provided config files in order: the default configuration is read first, and any given config files are read in the order they were provided. Later configurations override earlier ones.

For more information on the config keys and their possible values, check out the default configuration.
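
As a rough sketch of this layering behaviour (the function and key names below are illustrative, not pdf2gtfs's actual API):

```python
# Hypothetical sketch: each config is applied in order, so keys set by
# later configs override earlier ones.
def merge_configs(*configs: dict) -> dict:
    """Merge config dicts left to right; later values win."""
    merged: dict = {}
    for config in configs:
        merged.update(config)
    return merged


default_config = {"routetype": "Tram", "max_row_distance": 3}
user_config = {"max_row_distance": 5}
print(merge_configs(default_config, user_config))
# {'routetype': 'Tram', 'max_row_distance': 5}
```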

Examples

The following examples can be run from the examples directory and show how some config values change the accuracy of the detected locations, as well as whether the PDF can be read at all. The base.yaml config only contains some basic output settings, used by all examples.

Before you run these, switch to the examples directory: cd examples

Example 1: Tram Line 1 of the VAG

Uses the default configuration, with the exception of the routetype.

pdf2gtfs --config=base.yaml --config=vag_1.yaml vag_1.pdf

Example 2: Subway Line S1 of the KVV

The max_row_distance needs to be adjusted to read this PDF properly.

pdf2gtfs --config=base.yaml --config=kvv_s1.yaml kvv_s1.pdf

Example 3: RegionalExpress Lines RE2/RE3 of the GVH

The close_node_check needs to be disabled, because it incorrectly disregards valid locations that seem too far away.

Note: This example uses the legacy table extraction, because the new one (currently) results in errors.

pdf2gtfs --config=base.yaml --config=gvh_re2_re3.yaml gvh_re2_re3.pdf

Example 4: Bus Line 680 of the Havelbus

Here, disabling the close_node_check leads to far better results as well. Note that the config also contains some other settings, which lead to a similar result.

Note: This example uses the legacy table extraction, because the new one (currently) results in errors.

pdf2gtfs --config=base.yaml --config=havelbus_680.yaml havelbus_680.pdf

Example 5: Line G10 of the RMV

Reading page 4 currently fails, and reading more than one page leads to worse results in the location detection. This can happen because the average of all locations for a specific stop is used.

pdf2gtfs --config=base.yaml --config=rmv_g10.yaml rmv_g10.pdf

How does it work

In principle, pdf2gtfs works in three steps:

  1. Extract the timetable data from the PDF.
  2. Create the GTFS in memory.
  3. Detect the locations of the stops using the stop names and their order.

Finally, after adding the locations, the GTFS feed is saved to disk.

The following sections give a rough description of how each of these steps is performed.

Extract the timetable data from the PDF

  1. Use ghostscript to remove all images and vector drawings from the PDF
  2. Use pdfminer.six to extract the text from the PDF
  3. Split the LTTextLine objects of pdfminer.six into words
  4. Detect the words that are times using the time_format config-key
  5. Define the body of the table using the times
  6. Add cells to the table that overlap with its rows/columns
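
Step 4, detecting time words, might be sketched like this, assuming time_format is a strftime-style pattern (the function name is illustrative, not pdf2gtfs's actual implementation):

```python
from datetime import datetime

def is_time(word: str, time_format: str = "%H:%M") -> bool:
    """Return True if the word parses as a time in the given format."""
    try:
        datetime.strptime(word, time_format)
        return True
    except ValueError:
        return False

# Only words matching the time format are kept as table body candidates.
words = ["Hauptbahnhof", "12:30", "ab", "7.05"]
print([word for word in words if is_time(word)])  # ['12:30']
```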

Create the GTFS in memory

  1. If an agency.txt is given using the input_files option and it contains a single entry, use that agency by default. If it contains multiple entries, ask the user to choose which agency should be used.
  2. If a stops.txt is given using the input_files option, search it for the stops.
  3. Create a basic skeleton of the required GTFS files.
  4. In case the tables contain annotations, create a new calendar.txt entry for each annotation and date combination.
    • Ask the user to input dates on which there is an exception in the service; these are added to calendar_dates.txt.
  5. Iterate through the TimeTableEntries of all TimeTables and create a new entry in stop_times.txt for each.
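
Step 5 might be sketched as follows; the row fields follow the stop_times.txt spec, while the input structure and function name are illustrative, not pdf2gtfs's internal types:

```python
def entries_to_stop_times(trip_id: str,
                          entries: list[tuple[str, str]]) -> list[dict]:
    """Create one stop_times.txt row per (stop_id, time) entry of a trip."""
    return [
        {
            "trip_id": trip_id,
            "arrival_time": time,
            "departure_time": time,
            "stop_id": stop_id,
            "stop_sequence": seq,
        }
        for seq, (stop_id, time) in enumerate(entries, start=1)
    ]

rows = entries_to_stop_times("trip_1", [("s1", "08:00:00"), ("s2", "08:04:00")])
print([row["stop_sequence"] for row in rows])  # [1, 2]
```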

Detect the locations of the stops using the stop names and their order

This is only done if there is no stops.txt input file, or if the given file does not contain all necessary stops.

  1. Get a list of all stop locations (nodes) along with their name, type and some attributes from OpenStreetMap (OSM) using QLever.
  2. Normalize the names of the nodes by stripping any non-letter symbols and expanding any abbreviations
  3. For each stop of the detected tables, find those nodes that contain every word of the (normalized) stop name.
  4. Add basic costs:
    • Name costs, based on the difference in length between a stop's name and any of the node's names. (This works because of the normalization.)
    • Node costs, based on the selected gtfs_routetype and the attributes of the node.
  5. Use Dijkstra's algorithm to find the nodes with the lowest cost. The cost of a node is simply the sum of its name, node, and travel cost. The travel cost is calculated using either a "closer-is-better" approach or a "closer-to-expected-distance-is-better" approach.
  6. If any of the stops was found in the stops.txt file (if given), its location will be used instead of checking the OSM data.
  7. If the location of a stop was not found, it is interpolated using the surrounding stop locations.

The first two steps are generally the slowest part of the location detection. Therefore, we cache the result and use the cache if possible.
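
The cost model of step 5 can be summarized as the sum of the three partial costs, with the simple travel cost following the "closer-is-better" approach. The names and the distance scale below are illustrative, not pdf2gtfs's actual implementation:

```python
def simple_travel_cost(distance_km: float) -> float:
    """Closer-is-better: the cost grows with the distance travelled."""
    return distance_km

def total_cost(name_cost: float, node_cost: float,
               distance_km: float) -> float:
    """A node's cost is the sum of its name, node, and travel cost."""
    return name_cost + node_cost + simple_travel_cost(distance_km)

# With equal name and node costs, the closer candidate wins:
print(total_cost(2, 1, 0.3) < total_cost(2, 1, 1.2))  # True
```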

More information

The new table extraction, as well as the overall process and evaluation of pdf2gtfs are detailed in my Bachelor's thesis. There is also a blogpost, which describes the previously used table extraction and provides a shorter overview on how pdf2gtfs works.

Bugs and suggestions

If something is not working or is missing, feel free to create an issue.

License

Copyright 2022 Julius Heinzinger

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pdf2gtfs's Issues

Make list of things not working with different examples

There are a lot of different timetable formats, each with its own caveats, and many of them currently cannot be read.
If we have a file that outlines the problems of the different non-working examples, we can more easily prioritize which problems to work on.

CalendarDates generation

When generating the calendar_dates, this may result in service being falsely disabled (possibly only when Sundays/holidays use different timetables).

If this is not a problem, add documentation explaining why not.

Replace StopPosition by something else

When adding optional keys, StopPosition and all its usages need to be updated. If we instead use a class, we can easily add default values.
Note that StopPosition was originally created so we can use itertuples, which, as far as I know, is faster than iterrows.

Cleanup stop_times creation

Generating the stop_times currently is a big mess.

  • Split into multiple methods, maybe outsourcing some of the code to stop_times.py, if that makes sense.
  • Add more comments explaining it, if it is still convoluted
  • Fix tests/Add new tests

Reverse route during location search

Reversing the route and searching in both directions might lead to improved results, because different StartNodes will be used. This may help if the last stop of the route can be found more easily on OSM than the first one.

The only downside I see is worse performance.

  • Check if this would make a difference (probably based on simple_travel_cost_calculation) and how much it would impact performance
  • If it does improve the results:
    • Implement it
    • Document how/why
    • Add a new config key to enable/disable this function. The default should depend on how much the result is improved and how much performance is impacted.
    • Test everything

Colorize logging output

Colors rock. Use them.

We need to differentiate between terminals that support color and those that do not (basically Windows), though.
There is also the question of whether to use something like colorlog or do it ourselves.

List annotations

Provide a table view of all annotations added so far, either at the end or when a certain input is recognized. Maybe we can reuse the agency table for this.

Nodemap location

The HTML for the display_nodes should be located in some temp_dir. Its filename should also show the route_id and which view the map is about (all nodes/final route/etc.).
That way, the map can be opened at a later time (or reloaded). If necessary, we can then also add a CLI arg to export these maps as well.

Use geopy

Maybe use geopy for the distance calculation, instead of doing it manually and approximately.

  • Check the performance impact of doing this.
  • If it is not too big:
    • Switch to geopy
    • Add tests
    • Add documentation
  • Otherwise:
    • Improve the testing on our own implementation
    • Document why we don't use geopy

Overwrite files options

When asking whether a file should be overwritten, adding "overwrite [a]ll/never [o]verwrite" could make sense.
At the same time, this part was originally written when the output consisted of different .txt files instead of a proper feed. In case this function is only needed for the output feed, there is no need to change it (though we should add a comment explaining the reasoning).

Add an agency cli arg

When using --non-interactive, or when it is clear which agency should be used, having to input the agency once reading is complete is a hassle.
Therefore, we should add a CLI arg --agency, where the user can supply the agency id to use or, in case a DummyAgency should be created, an empty string/-1/something similar.
We also need to raise an error if the given agency id does not exist in the given agency.txt.

Adjust map when displaying nodes

When using display_nodes, the zoom is fixed and the location uses the average of all node locations.
Instead, we should find a zoom level such that all nodes are visible and move the map to the correct position (which may be the same as the one currently used).

Autodetect annotation dates

Search the rows that were dropped, as well as Rows/Columns of type Other, for dates.
We could then suggest these in the annotation part, to make things easier for the user.

Some notes, which need checking

I made these notes at some point during (later) development. Need to check if they still apply:

Not sure what exactly I meant by this...

  • 2. Merge consecutive stop columns
    I'm not sure if this applies, because afaik the stop_column has to be the first column of a table.

  • 3. Fix no header

Not entirely sure, but at some point the header was not read properly, though this may very well have been an issue with the options used (max_row_distance, etc.).
This may be part of the cleanup_tables step.

  • 4. Add fix_split_stopnames/Do not add stopname for search on split stops

I assume, this was about the cleanup_tables step. Not sure about the second part though.

Improve logging

The loglevels INFO/DEBUG are probably used the wrong way.
We could also add more loglevels with different Formatters, for example for simple console output or for verbose output.

Warn/Do something about colliding IDs

When multiple input_files are being used, their IDs may clash. We should at least warn the user about this and abort if it happens. We could also simply merge the data if that is easily possible, or add a suffix to existing colliding IDs.

Use dropped rows to enhance the tables

Instead of just dropping rows that have too much distance to the next, we should add them as "extra rows" to the surrounding tables. In particular, they could be used to provide the suggestions for #48.

Extract help text from properties/default_config

The help text for the CLI arguments is currently a truncated version of the help in the default_config. We could instead read the default config's comments before checking CLI arguments.
This has the obvious problem that, if the reading fails, no help can be displayed properly. However, if that happens we have a whole different problem, so in general this should be fine(TM).

Node cost calculation is arbitrary and obscure

There are a few issues with how the node costs are calculated.

  1. First and foremost, the function currently used is arbitrary at best.
    (Currently uses (route_type_score + node_has_this_many_optional_values) ** 2 // 20)

  2. We should also change where the node costs are calculated, instead of doing something like this.

In other words, we probably need to

  • Move everything node cost related to a single place
  • Find a function that has some fixed limit and increases fast for smaller values. (log?)
    It should also take the optional values into account, though it should not increase linearly based on them.
  • Document how the function works and why it is a good function
  • Test it
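
For reference, the current function quoted above amounts to the following (the parameter names are illustrative):

```python
def node_cost(route_type_score: int, optional_value_count: int) -> int:
    """The current, admittedly arbitrary, node cost function."""
    return (route_type_score + optional_value_count) ** 2 // 20

print(node_cost(5, 3))  # (5 + 3) ** 2 // 20 == 3
```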

Best node selection is obscure and not working properly

Currently, the best node is simply the one that was selected for most routes.
This seems to work most of the time, but probably selects non-optimal nodes sometimes.
The best route selection should also work on multiple, detached routes.

  • Check if the way we currently select the best nodes works properly
  • If it does:
    • Document how and why
    • Recalculate the node scores for the full route.
  • Otherwise:
    • Find a better way
    • Document how/why this works better/at all.

Repeat strategy "autodetect"

Most of the time, it should be clear which repeat strategy should be used. We simply need to check for "-" and "/" and set the repeat_strategy to mean in that case.
This requires, however, that there are no symbols with special meaning that were removed during preprocessing. Hence, we should add a new default repeat strategy "autodetect", while keeping the other two in case autodetection fails.

Add .cache fallback

In case the default cache directory cannot be read, use ./.pdf2gtfs_cache or p2g_dir/.cache or something similar (maybe a TempDir with a fixed name?).

Create a single temp dir for testing

This way, it is immediately clear which test run a given test belongs to.
The name of the tempdir should probably be something like pdf2gtfs_test_{date}_{time}.

Advanced repeat columns

Some agencies have resorted to "interesting" ways of using repeat columns (e.g. line 1 of the Leipziger Verkehrsbetriebe).
The problem is twofold:

  • First we need to properly detect the columns, as well as the start-/end-stops for that particular repeat interval.
  • To properly convert these into normal columns, we basically need to start with the shortest interval and whenever (i * interval) > (k * other_interval) (or something similar), change the route accordingly.

Right now, the first part of this seems to be the bigger problem, while the second part looks like it could end up being a hack.

On the other hand, we could ask the user to provide stop ids and the interval to use between them, similar to how we ask for the agency. This could also be done in conjunction with the above: the detection part is input by the user or some file, while the conversion part is done automatically.
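
The simple single-interval case that the conversion would build on could be sketched as follows (names are illustrative):

```python
def expand_repeats(start_minute: int, end_minute: int,
                   interval: int) -> list[int]:
    """Expand a "repeats every N minutes" column into concrete
    departure times, given as minutes after midnight."""
    return list(range(start_minute, end_minute + 1, interval))

# Departures every 7 minutes between 08:00 (480) and 08:20 (500):
print(expand_repeats(480, 500, 7))  # [480, 487, 494]
```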

Improve optional key handling

Adding new optional keys should be as easy as

  • Add name to OPT_KEYS/OPT_OSM_KEYS
  • Add logic to opt_keys_to_int, if the key should impact the node cost
  • Add a new field to the associated GTFS object and update the required defaults/logic when creating such a GTFS object
  • Add a new function called here, before the GTFS feed is written

Currently, this is (also because of #25) not the case.

CLI needs improvement

The cli can be improved in some ways:

  • We need a shorthand for frequent commands.
  • The help text could be added to the properties instead (#33).
  • #29
  • Some arguments need to be required. (e.g. routetype)
    Also check, if there are any conditionally required arguments.
  • Check, which properties would make good arguments as well, but currently are not.

Check timetable validity

For each timetable, check if it is valid before generating the GTFS, and add an is_valid property to easily check this from outside. This should then also be used to raise an error/warning.
