pdf2gtfs

pdf2gtfs can be used to extract schedule data from PDF timetables and turn it into valid GTFS.

It was created as a Bachelor's project and thesis at the Chair of Algorithms and Data Structures at the University of Freiburg.

The Bachelor's thesis, which goes into more detail and adds an evaluation, can be found here. A (shorter) blogpost detailing its usage can be found here, though some parts are outdated.

Getting started

The master branch contains all the latest changes and is unstable. The release branch usually points to the latest tag, though it may contain some additional fixes.

Prerequisites

  • Linux (Windows should work as well, but I currently do not test this)
  • python3.10 or higher (required)
  • ghostscript >= 9.56.1-1 (recommended)

Older versions may work as well, but only the versions given above are officially supported.

Installation and Usage

Note: Using pip won't install those dependencies that are required only for development.

1. Clone the repository (with submodules):

git clone --recursive https://github.com/heijul/pdf2gtfs.git
cd pdf2gtfs

2. (Optional) Create a venv and activate it (more info):

python3.11 -m venv venv
source venv/bin/activate

Under Windows, you have to activate the venv using `venv\Scripts\activate` instead.

3. Install pdf2gtfs using pip or poetry.

Note: With pip you will have to manually install the development requirements (defined in pyproject.toml).

Note: If pip/poetry complains that no pyproject.toml exists for custom_conf, you forgot to add the --recursive flag. To fix this, simply run git submodule update --init --recursive.

Using pip:

pip install .

Using poetry (requires poetry, of course):

poetry install

Using poetry and also installing the development requirements:

poetry install --with=dev

4. (Optional) Run the tests.

Using unittest:

python -m unittest discover test

Using pytest:

pytest test

5. Run pdf2gtfs.

pdf2gtfs -h

This will provide help on the usage of pdf2gtfs.

Configuration

pdf2gtfs reads the provided config files in order: the default configuration is read first, and any given config files are read in the order they were provided. Later configurations override earlier ones.

For more information on the config keys and their possible values, check out the default configuration.
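
As a rough sketch of this layering behaviour (the function and key names below are illustrative, not pdf2gtfs's actual API):

```python
# Hypothetical sketch: each config is applied in order, so keys set by
# later configs override earlier ones.
def merge_configs(*configs: dict) -> dict:
    """Merge config dicts left to right; later values win."""
    merged: dict = {}
    for config in configs:
        merged.update(config)
    return merged


default_config = {"routetype": "Tram", "max_row_distance": 3}
user_config = {"max_row_distance": 5}
print(merge_configs(default_config, user_config))
# {'routetype': 'Tram', 'max_row_distance': 5}
```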

Examples

The following examples can be run from the examples directory and show how some config values change the accuracy of the detected locations, as well as whether the PDF can be read at all. The base.yaml config only contains some basic output settings, used by all examples.

Before you run these, switch to the examples directory: cd examples

Example 1: Tram Line 1 of the VAG

Uses the default configuration, with the exception of the routetype.

pdf2gtfs --config=base.yaml --config=vag_1.yaml vag_1.pdf

Example 2: Subway Line S1 of the KVV

The max_row_distance needs to be adjusted to read this PDF properly.

pdf2gtfs --config=base.yaml --config=kvv_s1.yaml kvv_s1.pdf

Example 3: RegionalExpress Lines RE2/RE3 of the GVH

The close_node_check needs to be disabled, because it incorrectly disregards valid locations that seem too far away.

Note: This example uses the legacy table extraction, because the new one (currently) results in errors.

pdf2gtfs --config=base.yaml --config=gvh_re2_re3.yaml gvh_re2_re3.pdf

Example 4: Bus Line 680 of the Havelbus

Here, disabling the close_node_check leads to far better results as well. Note that the config also contains some other settings, which lead to a similar result.

Note: This example uses the legacy table extraction, because the new one (currently) results in errors.

pdf2gtfs --config=base.yaml --config=havelbus_680.yaml havelbus_680.pdf

Example 5: Line G10 of the RMV

Reading page 4 currently fails, and reading more than one page leads to worse results in the location detection. This can happen because the average of all locations for a specific stop is used.

pdf2gtfs --config=base.yaml --config=rmv_g10.yaml rmv_g10.pdf

How does it work

In principle, pdf2gtfs works in three steps:

  1. Extract the timetable data from the PDF.
  2. Create the GTFS in memory.
  3. Detect the locations of the stops using the stop names and their order.

Finally, after adding the locations, the GTFS feed is saved to disk.

The following sections give a rough description of how each of these steps is performed.

Extract the timetable data from the PDF

  1. Use ghostscript to remove all images and vector drawings from the PDF
  2. Use pdfminer.six to extract the text from the PDF
  3. Split the LTTextLine objects of pdfminer.six into words
  4. Detect the words that are times using the time_format config-key
  5. Define the body of the table using the times
  6. Add cells to the table that overlap with its rows/columns
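
Step 4, detecting time words, might be sketched like this, assuming time_format is a strftime-style pattern (the function name is illustrative, not pdf2gtfs's actual implementation):

```python
from datetime import datetime

def is_time(word: str, time_format: str = "%H:%M") -> bool:
    """Return True if the word parses as a time in the given format."""
    try:
        datetime.strptime(word, time_format)
        return True
    except ValueError:
        return False

# Only words matching the time format are kept as table body candidates.
words = ["Hauptbahnhof", "12:30", "ab", "7.05"]
print([word for word in words if is_time(word)])  # ['12:30']
```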

Create the GTFS in memory

  1. If an agency.txt is given using the input_files option and it contains a single entry, use that agency by default. If it contains multiple entries, ask the user to choose which agency should be used.
  2. If a stops.txt is given using the input_files option, search it for the stops.
  3. Create a basic skeleton of the required GTFS files.
  4. In case the tables contain annotations, create a new calendar.txt entry for each annotation and date combination.
    • Ask the user to input dates on which there is an exception in the service; these are added to calendar_dates.txt.
  5. Iterate through the TimeTableEntries of all TimeTables and create a new entry in stop_times.txt for each.
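
Step 5 might be sketched as follows; the row fields follow the stop_times.txt spec, while the input structure and function name are illustrative, not pdf2gtfs's internal types:

```python
def entries_to_stop_times(trip_id: str,
                          entries: list[tuple[str, str]]) -> list[dict]:
    """Create one stop_times.txt row per (stop_id, time) entry of a trip."""
    return [
        {
            "trip_id": trip_id,
            "arrival_time": time,
            "departure_time": time,
            "stop_id": stop_id,
            "stop_sequence": seq,
        }
        for seq, (stop_id, time) in enumerate(entries, start=1)
    ]

rows = entries_to_stop_times("trip_1", [("s1", "08:00:00"), ("s2", "08:04:00")])
print([row["stop_sequence"] for row in rows])  # [1, 2]
```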

Detect the locations of the stops using the stop names and their order

This is only done if there is no stops.txt input file, or if the given file does not contain all necessary stops.

  1. Get a list of all stop locations (nodes) along with their name, type and some attributes from OpenStreetMap (OSM) using QLever.
  2. Normalize the names of the nodes by stripping any non-letter symbols and expanding any abbreviations
  3. For each stop of the detected tables, find those nodes that contain every word of the (normalized) stop name.
  4. Add basic costs:
    • Name costs, based on the difference in length between a stop's name and any of the node's names. (This works because of the normalization.)
    • Node costs, based on the selected gtfs_routetype and the attributes of the node.
  5. Use Dijkstra's algorithm to find the nodes with the lowest cost. The cost of a node is simply the sum of its name, node, and travel cost. The travel cost is calculated using either a "closer-is-better" approach or a "closer-to-expected-distance-is-better" approach.
  6. If any of the stops was found in the stops.txt file (if given), its location will be used instead of checking the OSM data.
  7. If the location of a stop was not found, it is interpolated using the surrounding stop locations.

The first two steps are generally the slowest part of the location detection. Therefore, we cache the result and use the cache if possible.
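
The cost model of step 5 can be summarized as the sum of the three partial costs, with the simple travel cost following the "closer-is-better" approach. The names and the distance scale below are illustrative, not pdf2gtfs's actual implementation:

```python
def simple_travel_cost(distance_km: float) -> float:
    """Closer-is-better: the cost grows with the distance travelled."""
    return distance_km

def total_cost(name_cost: float, node_cost: float,
               distance_km: float) -> float:
    """A node's cost is the sum of its name, node, and travel cost."""
    return name_cost + node_cost + simple_travel_cost(distance_km)

# With equal name and node costs, the closer candidate wins:
print(total_cost(2, 1, 0.3) < total_cost(2, 1, 1.2))  # True
```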

More information

The new table extraction, as well as the overall process and evaluation of pdf2gtfs are detailed in my Bachelor's thesis. There is also a blogpost, which describes the previously used table extraction and provides a shorter overview on how pdf2gtfs works.

Bugs and suggestions

If something is not working or is missing, feel free to create an issue.

License

Copyright 2022 Julius Heinzinger

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pdf2gtfs's Issues

Make list of things not working with different examples

There are a lot of different timetable formats, each with its own caveats, and many of them currently cannot be read.
If we have a file that outlines the problems of the different non-working examples, we can more easily prioritize which problems to work on.

CalendarDates generation

When generating the calendar_dates, this may result in service being falsely disabled (possibly only when Sundays/holidays use different timetables).

If this is not a problem, add documentation explaining why not.

Replace StopPosition by something else

When adding optional keys, StopPosition and all its usages need to be updated. If we instead use a class, we can easily add default values.
Note that StopPosition was originally created so we can use itertuples, which, as far as I know, is faster than iterrows.

Cleanup stop_times creation

Generating the stop_times currently is a big mess.

  • Split into multiple methods, maybe outsourcing some of the code to stop_times.py, if that makes sense.
  • Add more comments explaining it, if it is still convoluted
  • Fix tests/Add new tests

Reverse route during location search

Reversing the route and searching in both directions might lead to improved results, because different StartNodes will be used. This may help if the last stop of the route can be found more easily on OSM than the first one.

The only downside I see is worse performance.

  • Check if this would make a difference (probably based on simple_travel_cost_calculation) and how much it would impact performance
  • If it does improve the results:
    • Implement it
    • Document how/why
    • Add a new config key to enable/disable this function. The default should depend on how much the result is improved and how much performance is impacted.
    • Test everything

Colorize logging output

Colors rock. Use them.

We need to differentiate between terminals that support color and those that do not (basically Windows), though.
There is also the question of whether to use something like colorlog or do it ourselves.

List annotations

Provide a table view of all annotations added so far, either at the end or when a certain input is recognized. Maybe we can reuse the agency table for this.

Nodemap location

The HTML for the display_nodes should be located in some temp_dir. Its filename should also show the route_id and which view the map is about (all nodes/final route/etc.).
That way, the map can be opened at a later time (or reloaded). If necessary, we can then also add a CLI arg to export these maps as well.

Use geopy

Maybe use geopy for the distance calculation, instead of doing it manually and approximately.

  • Check the performance impact of doing this.
  • If it is not too big:
    • Switch to geopy
    • Add tests
    • Add documentation
  • Otherwise:
    • Improve the testing on our own implementation
    • Document why we don't use geopy

Overwrite files options

When asking whether a file should be overwritten, adding "overwrite [a]ll/never [o]verwrite" could make sense.
At the same time, this part was originally written when the output consisted of different .txt files instead of a proper feed. In case this function is only needed for the output feed, there is no need to change it (though we should add a comment explaining the reasoning).

Add an agency cli arg

When using --non-interactive, or when it is clear which agency should be used, having to input the agency once reading is complete is a hassle.
Therefore, we should add a CLI arg --agency, where the user can supply the agency id to use or, in case a DummyAgency should be created, an empty string/-1/something similar.
We also need to raise an error if the given agency id does not exist in the given agency.txt.

Adjust map when displaying nodes

When using display_nodes, the zoom is fixed and the location uses the average of all node locations.
Instead, we should find a zoom level such that all nodes are visible and move the map to the correct position (which may be the same as the one currently used).

Autodetect annotation dates

Search the rows that were dropped, as well as Rows/Columns of type Other, for dates.
We could then suggest these in the annotation part, to make things easier for the user.

Some notes, which need checking

I made these notes at some point during (later) development. Need to check if they still apply:

Not sure what exactly I meant by this...

  • 2. Merge consecutive stop columns
    I'm not sure if this applies, because afaik the stop_column has to be the first column of a table.

  • 3. Fix no header

Not entirely sure, but at some point the header was not read properly, though this may very well have been an issue with the options used (max_row_distance, etc.).
This may be part of the cleanup_tables step.

  • 4. Add fix_split_stopnames/Do not add stopname for search on split stops

I assume, this was about the cleanup_tables step. Not sure about the second part though.

Improve logging

The loglevels INFO/DEBUG are probably used the wrong way.
We could also add more loglevels with different Formatters, for example for simple console output or for verbose output.

Warn/Do something about colliding IDs

When multiple input_files are being used, their IDs may clash. We should at least warn the user about this and abort if it happens. We could also simply merge the data if that is easily possible, or add a suffix to existing colliding IDs.

Use dropped rows to enhance the tables

Instead of just dropping rows that have too much distance to the next, we should add them as "extra rows" to the surrounding tables. In particular, they could be used to provide the suggestions for #48.

Extract help text from properties/default_config

The help text for the CLI arguments is currently a truncated version of the help in the default_config. We could instead read the default config's comments before checking CLI arguments.
This has the obvious problem that, if the reading fails, no help can be displayed properly. However, if that happens we have a whole different problem, so in general this should be fine(TM).

Node cost calculation is arbitrary and obscure

There are a few issues with how the node costs are calculated.

  1. First and foremost, the function currently used is arbitrary at best.
    (Currently uses (route_type_score + node_has_this_many_optional_values) ** 2 // 20)

  2. We should also change where the node costs are calculated, instead of doing something like this.

In other words, we probably need to

  • Move everything node cost related to a single place
  • Find a function that has some fixed limit and increases fast for smaller values. (log?)
    It should also take the optional values into account, though it should not increase linearly based on them.
  • Document how the function works and why it is a good function
  • Test it
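
For reference, the current function quoted above amounts to the following (the parameter names are illustrative):

```python
def node_cost(route_type_score: int, optional_value_count: int) -> int:
    """The current, admittedly arbitrary, node cost function."""
    return (route_type_score + optional_value_count) ** 2 // 20

print(node_cost(5, 3))  # (5 + 3) ** 2 // 20 == 3
```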

Best node selection is obscure and not working properly

Currently, the best node is simply the one that was selected for most routes.
This seems to work most of the time, but probably selects non-optimal nodes sometimes.
The best route selection should also work on multiple, detached routes.

  • Check if the way we currently select the best nodes works properly
  • If it does:
    • Document how and why
    • Recalculate the node scores for the full route.
  • Otherwise:
    • Find a better way
    • Document how/why this works better/at all.

Repeat strategy "autodetect"

Most of the time, it should be clear which repeat strategy should be used. We simply need to check for "-" and "/" and set the repeat_strategy to mean in that case.
This requires, however, that there are no symbols with special meaning that were removed during preprocessing. Hence, we should add a new default repeat strategy "autodetect", while keeping the other two in case autodetection fails.

Add .cache fallback

In case the default cache directory cannot be read, use ./.pdf2gtfs_cache or p2g_dir/.cache or something similar (maybe a TempDir with a fixed name?).

Create a single temp dir for testing

This way, it is immediately clear which test run a given test belongs to.
The name of the tempdir should probably be something like pdf2gtfs_test_{date}_{time}.

Advanced repeat columns

Some agencies have resorted to "interesting" ways of using repeat columns (e.g. line 1 of the Leipziger Verkehrsbetriebe).
The problem is twofold:

  • First we need to properly detect the columns, as well as the start-/end-stops for that particular repeat interval.
  • To properly convert these into normal columns, we basically need to start with the shortest interval and whenever (i * interval) > (k * other_interval) (or something similar), change the route accordingly.

Right now, the first part of this seems to be the bigger problem, while the second part looks like it could end up being a hack.

On the other hand, we could ask the user to provide stop ids and the interval to use between them, similar to how we ask for the agency. This could also be done in conjunction with the above: the detection part is input by the user or some file, while the conversion part is done automatically.
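
The simple single-interval case that the conversion would build on could be sketched as follows (names are illustrative):

```python
def expand_repeats(start_minute: int, end_minute: int,
                   interval: int) -> list[int]:
    """Expand a "repeats every N minutes" column into concrete
    departure times, given as minutes after midnight."""
    return list(range(start_minute, end_minute + 1, interval))

# Departures every 7 minutes between 08:00 (480) and 08:20 (500):
print(expand_repeats(480, 500, 7))  # [480, 487, 494]
```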

Improve optional key handling

Adding new optional keys should be as easy as

  • Add name to OPT_KEYS/OPT_OSM_KEYS
  • Add logic to opt_keys_to_int, if the key should impact the node cost
  • Add a new field to the associated GTFS object and update the required defaults/logic when creating such a GTFS object
  • Add a new function called here, before the GTFS feed is written

Currently, this is (also because of #25) not the case.

CLI needs improvement

The cli can be improved in some ways:

  • We need a shorthand for frequent commands.
  • The help text could be added to the properties instead (#33).
  • #29
  • Some arguments need to be required. (e.g. routetype)
    Also check, if there are any conditionally required arguments.
  • Check, which properties would make good arguments as well, but currently are not.

Check timetable validity

For each timetable, check if it is valid before generating the GTFS, and add an is_valid property to easily check this from outside. This should then also be used to raise an error/warning.
