osm2geojson

Introduction

Osm2geojson is a little project that ties together several of my other GitHub projects to convert OpenStreetMap XML into a more usable, GeoJSON-like format.

Why and How?

The problem with the OSM XML is that it is basically a database dump of the three tables OSM has for nodes, ways, and relations. Most interesting applications probably require these tables to be joined.

This project merges the three into JSON blobs that have all the relevant information embedded. It's similar to loading everything into a database, doing a gigantic join, and converting the output. The advantage of this approach is that it doesn't require a database or an index; instead it simply works by sorting and merging files.
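
To make the idea concrete, here is a minimal sketch of a sort-merge join over two files of sorted "id;value" lines. This is a simplified illustration, not the project's actual code; the class name, the separator, and the fixed-width id assumption are all mine.

import java.io.*;

// Minimal sketch: join two files of "id;value" lines, both sorted by id,
// by advancing two readers in lockstep. No database or index required.
public class SortMergeJoinSketch {
    public static void join(BufferedReader left, BufferedReader right, Writer out) throws IOException {
        String l = left.readLine();
        String r = right.readLine();
        while (l != null && r != null) {
            String leftId = l.substring(0, l.indexOf(';'));
            String rightId = r.substring(0, r.indexOf(';'));
            // note: string comparison only works if ids are zero-padded to a
            // fixed width; real code would compare them numerically
            int cmp = leftId.compareTo(rightId);
            if (cmp == 0) {
                // matching ids: emit the joined record and advance both sides
                out.write(l + ";" + r.substring(r.indexOf(';') + 1) + "\n");
                l = left.readLine();
                r = right.readLine();
            } else if (cmp < 0) {
                l = left.readLine();   // left is behind: advance it
            } else {
                r = right.readLine();  // right is behind: advance it
            }
        }
    }
}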

OsmJoin

OsmJoin is the tool that joins the OSM nodes, ways, and relations into more usable JSON equivalents. No attempt is made to filter the data, and all tags are preserved.

A separate tool that takes the output of OsmJoin and produces GeoJSON is provided as well (see OsmPostProcess below). The latter tool requires interpreting the meaning of OSM tags, which, given the inconsistencies and ambiguities, is hardly an exact science.

Building from source

It's a Maven project, so checking it out and running mvn clean install should do the trick. You should always get the latest version from GitHub and build it yourself; I'm not currently releasing binaries to Maven Central for this.

Should anyone like this licensed differently, please contact me.

If anyone wants to fix stuff just send me a pull request.

Alternatively, you can exercise your rights under the license and simply copy and adapt as needed. The license (https://github.com/jillesvangurp/geogeometry/blob/master/LICENSE) allows you to do this and I have no problem with it, although I do appreciate attribution.

Usage

After setting this up, you should be able to run mvn clean install on this project. It will compile and then put the required libraries in target/lib; these are needed to run the osmjoin.sh script. Alternatively, you can run it from your IDE.

Make sure to assign enough heap to fit the bucket size (a constant in the source code, defaulting to 1M), and make sure you have enough disk space: OSM is a big data set.

Performance, memory, file handles, and disk usage

I've run the OsmJoin tool on full-world OSM dumps. You'll want the planet OSM XML dump in bz2 format, which is about 30GB in size. Do not expand it; there is no reason to ;-).

While running, the tool produces various .gz files with id, json pairs or id,id pairs on each line. These files are sorted and merged in several steps. Additionally, a temp directory is created where so-called bucket files are stored while the tool is running. You should ensure you have enough disk space for all of this.

I've provided a list of the different files that are generated:

-rw-r--r-- 1 localstream root 410M Sep 10 13:27 adrress_nodes.gz
-rw-r--r-- 1 localstream root  26G Sep  5 21:58 nodeid2rawnodejson.gz
-rw-r--r-- 1 localstream root  13M Sep  5 18:49 nodeid2relid.gz
-rw-r--r-- 1 localstream root 7.9G Sep  5 20:20 nodeid2wayid.gz
-rw-r--r-- 1 localstream root 1.3G Sep  6 05:07 relid2completejson.gz
-rw-r--r-- 1 localstream root 141M Sep  6 04:17 relid2jsonwithnodes.gz
-rw-r--r-- 1 localstream root  81M Sep  6 04:16 relid2nodejson.gz
-rw-r--r-- 1 localstream root 161M Sep  5 18:50 relid2rawreljson.gz
-rw-r--r-- 1 localstream root 5.8G Sep  6 04:57 relid2wayjson.gz
-rw-r--r-- 1 localstream root  28G Sep  6 04:01 wayid2completejson.gz
-rw-r--r-- 1 localstream root  25G Sep  6 01:01 wayid2nodejson.gz
-rw-r--r-- 1 localstream root 9.3G Sep  5 19:23 wayid2rawwayjson.gz
-rw-r--r-- 1 localstream root  78M Sep  5 18:49 wayid2relid.gz
-rw-r--r-- 1 localstream root   20 Sep  5 12:06 wqyid2completejson.gz

You should expect to use a bit more than twice the total space of these files (which sum to roughly 100GB for the planet). So, a few hundred GB of free space should be sufficient for the planet OSM file, the temp directory with bucket files, and the generated gz files.

As you can see from the creation timestamps, the whole process takes some time to run. In this case it ran for approximately 12 hours on a quad-core server with a 5GB heap and a RAID 1 disk. The first file is not created until several hours into the process, since the first step (parsing the XML into several sorted files) is also the most expensive one. Your mileage may vary. The files of interest after running are:

  • nodeid2rawnodejson.gz: the JSON for each node; this includes things like POIs.
  • wayid2completejson.gz: the JSON for each way, with the node JSON for the referenced nodes merged in; this includes streets.
  • relid2completejson.gz: the JSON for each relation, with node and way JSON merged in.

The process uses a lot of memory; the later steps especially are memory intensive. The configuration is hard coded in the OsmJoin class. The key parameter there is the bucketSize that is used for merge sorting the files. Each bucket is created in memory in a sorted data structure, and then stored when filled to the specified limit.

A smaller bucketSize means less memory is used. However, it also means more file handles are used during the merge, and that the merge process has to do more work. With billions of ways and nodes, you need to be careful to stay under any file-handle limits imposed by the OS. You may need to increase this limit; on e.g. Ubuntu it defaults to a very conservative 1024, which is by no means enough unless you have tens of GB of heap to spare. To change this, modify /etc/security/limits.conf:

# this fixes ridiculously low file handle limit in Linux
root soft nofile 64000
root hard nofile 64000
* soft nofile 64000
* hard nofile 64000
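
For reference, the bucketed sorting itself boils down to something like the sketch below: buffer key/value pairs in a sorted in-memory structure, flush the bucket to a temp file whenever it reaches bucketSize, and k-way merge all bucket files at the end. This is a hypothetical simplification, not the actual SortingWriter.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of a bucketed external sort. A smaller bucketSize
// lowers heap usage but produces more bucket files, each of which costs a
// file handle during the final k-way merge.
public class BucketedSorterSketch {
    private final int bucketSize;
    // kept sorted by key; a real implementation would also allow duplicate keys
    private final TreeMap<String, String> bucket = new TreeMap<>();
    private final List<Path> bucketFiles = new ArrayList<>();

    public BucketedSorterSketch(int bucketSize) {
        this.bucketSize = bucketSize;
    }

    public void put(String key, String value) throws IOException {
        bucket.put(key, value);
        if (bucket.size() >= bucketSize) {
            flushBucket();
        }
    }

    private void flushBucket() throws IOException {
        // write the sorted bucket to a temp file and start a fresh bucket
        Path file = Files.createTempFile("bucket", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, String> e : bucket.entrySet()) {
                w.write(e.getKey() + ";" + e.getValue());
                w.newLine();
            }
        }
        bucketFiles.add(file);
        bucket.clear();
    }

    // closing would k-way merge bucketFiles with a PriorityQueue of readers;
    // every bucket file is open simultaneously, hence the nofile limits above.
}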

OsmPostProcess

The goal of this step is to take the output files of OsmJoin and filter, transform, and normalize them into GeoJSON for the purpose of indexing in Elasticsearch.

The process involves interpreting what the OSM tags mean, categorizing, reconstructing polygons, linestrings, etc., and filtering out the stuff that cannot be easily categorized.
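
For example, reconstructing a way's geometry mostly comes down to emitting its ordered node coordinates as a GeoJSON LineString, or as a Polygon when the way is closed. A minimal sketch of that idea (hypothetical code, not the actual OsmPostProcessor):

// Sketch: turn a way's ordered (longitude, latitude) pairs into GeoJSON.
// The real post-processor also has to interpret tags to decide whether a
// closed way is an area (e.g. a building) or a ring-shaped road.
public class WayGeometrySketch {
    public static String toGeoJson(double[][] coords) {
        boolean closed = coords.length >= 4
                && coords[0][0] == coords[coords.length - 1][0]
                && coords[0][1] == coords[coords.length - 1][1];
        StringBuilder points = new StringBuilder();
        for (double[] c : coords) {
            if (points.length() > 0) {
                points.append(',');
            }
            points.append('[').append(c[0]).append(',').append(c[1]).append(']');
        }
        return closed
                ? "{\"type\":\"Polygon\",\"coordinates\":[[" + points + "]]}"
                : "{\"type\":\"LineString\",\"coordinates\":[" + points + "]}";
    }
}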

Inevitably this step is lossy. The current version recovers about 25M ways and 5M POIs worldwide with names and sensible categories. This includes most relevant streets, restaurants, transport stops, and other POIs. A lot of what remains either lacks a name (e.g. many buildings don't have names) or is part of some less interesting map feature like a forest or a lake.

Currently, relations are not processed, for reasons of complexity and the limited amount of data (only a few hundred thousand relations exist). A preliminary breakdown based on grepping through the file suggests the following can be recovered from relations:

350K relations:

  • admin_level multi-polygons (60K)
  • public transport routes (62K)
  • associated street (30K)
  • TMC (17K): some traffic metadata
  • traffic restrictions (153K)
  • other (34K): a mix of all kinds of uncategorized metadata

Of these, the admin_levels and routes are potentially of interest.

The good news is that the post-processing step is easy to customise: all it does is iterate over the joined JSON from the OsmJoin step.
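
A custom pass can therefore be as simple as streaming over one of the gzipped output files line by line. The sketch below assumes id and JSON are separated by a semicolon on each line, which is an assumption on my part; check the actual output before relying on it.

import java.io.*;
import java.util.zip.GZIPInputStream;

// Sketch of a custom post-processing pass over OsmJoin output. The file name
// is one of the outputs listed above; the ';' separator is an assumption.
public class CustomPostProcessSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("nodeid2rawnodejson.gz")), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf(';');
                if (sep < 0) {
                    continue; // skip lines that don't look like id;json pairs
                }
                String id = line.substring(0, sep);
                String json = line.substring(sep + 1);
                // filter however you like, e.g. keep only nodes tagged as amenities
                if (json.contains("\"amenity\"")) {
                    System.out.println(id + " " + json);
                }
            }
        }
    }
}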

Misc thoughts on OSM

One cannot help but wonder why the OSM data is so messy, inconsistent, and poorly structured. For a community effort to catalogue the world, the format is surprisingly sloppy. A project like this shows that it is possible to mine and recover a wealth of information; if only the tagging were more consistent, it could be exported in a much more usable format.

The current format is a near-unusable database dump that effectively assumes a relational database.

Should anybody involved with OSM care about my recommendations, I would suggest the following:

  • Evolve internal storage towards a denormalized view of the world. My recent experience suggests that a document store combined with powerful indexing, such as provided by Elasticsearch, might be more appropriate than database lookups and joins. Also, the raw compressed bzip2 XML is only a few GB smaller than the joined JSON equivalent, which makes you wonder what purpose storing the data that way serves. Any reasonable use of the data in a relational store involves doing lots of joins anyway, and any processing of the data takes hours simply because of the way it is stored. A pre-joined data set is much more efficient to use.
  • Introduce a standardized categorization that is validated. Apply this categorization in an automated fashion to all data and deprecate the use of free-form tags in OSM applications.
  • Crosslink the data with other data sets. For example, embedding geonames ids, woe_ids, Wikipedia links, etc. would be hugely valuable, as would cross-linking with e.g. Facebook's Open Graph. Unlike embedding the metadata, linking the data should be safer from a legal point of view.
  • Eliminate region-specific variations of tag combinations as much as possible.
  • Introduce a standardized address format. See also the relevant work on this in the W3C and other open data groups.
  • Standardize the naming of things and align with GeoNames and GeoPlanet on things like language codes, name translations, etc.

This is a massive undertaking, but it would make curating OSM more rewarding for contributors and open up applications of its data beyond rendering map tiles.


Issues

World wide and memory problems

I'm running OsmJoin with 14GB of heap and a bucketSize of 100,000, but I'm still getting OutOfMemoryErrors. Here are the stack traces:

19:17:56.799 [concurrentProcessingIterableThread_3] INFO  c.g.j.mergesort.SortingWriter - sort buckets wayid2nodejson.gz: 50000000 lines
Exception in thread "concurrentProcessingIterableThread_3" 19:18:28.124 [concurrentProcessingIterableThread_4] INFO  c.g.j.mergesort.SortingWriter - sort buc
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuilder.append(StringBuilder.java:132)
        at com.github.jillesvangurp.mergesort.SortingWriter.flushBucket(SortingWriter.java:107)
        at com.github.jillesvangurp.mergesort.SortingWriter.put(SortingWriter.java:71)
        at com.github.jillesvangurp.osm2geojson.OsmJoin$2.process(OsmJoin.java:265)
        at com.github.jillesvangurp.osm2geojson.OsmJoin$2.process(OsmJoin.java:259)
        at com.jillesvangurp.iterables.ConcurrentProcessingIterable$3.run(ConcurrentProcessingIterable.java:120)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Exception in thread "concurrentProcessingIterableThread_0" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:2694)
        at java.lang.String.<init>(String.java:203)
        at java.lang.String.substring(String.java:1877)
        at com.github.jillesvangurp.mergesort.EntryParsingProcessor.process(EntryParsingProcessor.java:17)
        at com.github.jillesvangurp.mergesort.EntryParsingProcessor.process(EntryParsingProcessor.java:8)
        at com.jillesvangurp.iterables.ProcessingIterable$1.next(ProcessingIterable.java:33)
        at com.jillesvangurp.iterables.PeekableIterator.next(PeekableIterator.java:35)
        at com.jillesvangurp.iterables.PeekableIterator.peek(PeekableIterator.java:52)
        at com.github.jillesvangurp.osm2geojson.EntryJoiningIterable$1.hasNext(EntryJoiningIterable.java:76)   
        at com.jillesvangurp.iterables.ConcurrentProcessingIterable$2.run(ConcurrentProcessingIterable.java:85)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Exception in thread "concurrentProcessingIterableThread_1" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuilder.append(StringBuilder.java:132)
        at com.github.jillesvangurp.mergesort.SortingWriter.flushBucket(SortingWriter.java:107)
        at com.github.jillesvangurp.mergesort.SortingWriter.put(SortingWriter.java:71)
        at com.github.jillesvangurp.osm2geojson.OsmJoin$2.process(OsmJoin.java:265)

and

19:21:24.587 [main] INFO  c.g.j.osm2geojson.OsmJoin - started 3. create wayid2completejson.gz at 2013-12-10T18:21:24Z
19:21:24.587 [main] INFO  c.g.j.mergesort.SortingWriter - started sort buckets wayid2completejson.gz at 2013-12-10T18:21:24Z
19:25:22.233 [concurrentProcessingIterableThread_1] INFO  c.g.j.mergesort.SortingWriter - sort buckets wayid2completejson.gz: 1000000 lines
Exception in thread "concurrentProcessingIterableThread_3" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2367)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
        at java.lang.StringBuilder.append(StringBuilder.java:132)
        at com.github.jillesvangurp.mergesort.SortingWriter.flushBucket(SortingWriter.java:107)
        at com.github.jillesvangurp.mergesort.SortingWriter.put(SortingWriter.java:71)
        at com.github.jillesvangurp.osm2geojson.OsmJoin$3.process(OsmJoin.java:300)
        at com.github.jillesvangurp.osm2geojson.OsmJoin$3.process(OsmJoin.java:279)
        at com.jillesvangurp.iterables.ConcurrentProcessingIterable$3.run(ConcurrentProcessingIterable.java:120)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

What could I tune to reduce memory consumption? These are my current settings:

private int bucketSize = 100000;
private int blockSize = 100;
private int threadPoolSize = 6;    
private int queueSize = 100000;

Categories as map?

Would it make sense to make the categories a map instead of an array? That way they would be easier to search and filter:

osmCategories.put(tagName, value);
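
For comparison, the two shapes would look something like this; a hypothetical illustration, not the current output format:

import java.util.*;

// Hypothetical illustration of the two representations discussed above.
public class CategoryShapes {
    public static void main(String[] args) {
        // as an array: ["restaurant", "italian"] - filtering requires a scan
        List<String> asArray = Arrays.asList("restaurant", "italian");

        // as a map: {"amenity": "restaurant", "cuisine": "italian"}
        Map<String, String> asMap = new LinkedHashMap<>();
        asMap.put("amenity", "restaurant");
        asMap.put("cuisine", "italian");

        // with the map, a filter like amenity=restaurant is a direct lookup
        System.out.println(asArray.contains("restaurant"));
        System.out.println("restaurant".equals(asMap.get("amenity")));
    }
}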

Is it safe to ignore nodes without metadata?

In this line

https://github.com/jillesvangurp/osm2geojson/blob/master/src/main/java/com/github/jillesvangurp/osm2geojson/OsmPostProcessor.java#L375

you ignore all JSON snippets smaller than 50 characters. But a simple node without tags is smaller than that and could still be referenced by a way. So, is it really safe to ignore these? At least for nodes the threshold should probably be reduced to 20 or so. And what is the speed advantage of skipping the parsing?

merge sorting 0 elements fails

Hi Jilles,

I finally found some time to look a bit deeper into osm2geojson, and I have to say that I really like it; the concurrent processing works amazingly well. Still, running it on the planet file I get an exception:

00:35:32.290 [main] INFO c.g.j.mergesort.SortingWriter - merge buckets into nodeid2rawnodejson.gz: 2113400000 lines
00:35:32.562 [main] INFO c.g.j.mergesort.SortingWriter - stopped merge buckets into nodeid2rawnodejson.gz at 2013-12-13T23:35:32Z duration 8291 seconds
00:35:32.562 [main] INFO c.g.j.mergesort.SortingWriter - completed merge buckets into nodeid2rawnodejson.gz: 2113463057 lines
Exception in thread "main" java.lang.IllegalStateException: java.lang.IllegalArgumentException
at com.github.jillesvangurp.osm2geojson.OsmJoin.splitAndEmit(OsmJoin.java:151)
at com.github.jillesvangurp.osm2geojson.OsmJoin.processAll(OsmJoin.java:433)
at com.github.jillesvangurp.osm2geojson.OsmJoin.main(OsmJoin.java:466)
Caused by: java.lang.IllegalArgumentException
at java.util.PriorityQueue.<init>(PriorityQueue.java:152)
at com.github.jillesvangurp.mergesort.MergingEntryIterable.iterator(MergingEntryIterable.java:29)
at com.github.jillesvangurp.mergesort.SortingWriter.close(SortingWriter.java:137)
at com.github.jillesvangurp.osm2geojson.OsmJoin.splitAndEmit(OsmJoin.java:145)
... 2 more
Suppressed: java.lang.IllegalArgumentException
at java.util.PriorityQueue.<init>(PriorityQueue.java:152)
at com.github.jillesvangurp.mergesort.MergingEntryIterable.iterator(MergingEntryIterable.java:29)
at com.github.jillesvangurp.mergesort.SortingWriter.close(SortingWriter.java:137)
at com.github.jillesvangurp.osm2geojson.OsmJoin.splitAndEmit(OsmJoin.java:146)
... 2 more
Suppressed: java.lang.IllegalArgumentException
at java.util.PriorityQueue.<init>(PriorityQueue.java:152)
at com.github.jillesvangurp.mergesort.MergingEntryIterable.iterator(MergingEntryIterable.java:29)
at com.github.jillesvangurp.mergesort.SortingWriter.close(SortingWriter.java:137)
at com.github.jillesvangurp.osm2geojson.OsmJoin.splitAndEmit(OsmJoin.java:147)
... 2 more

Running it on a clean test file (e.g. Spain from Geofabrik) works fine. Can you give me a hint, from looking at the exception, as to what the problem is? Otherwise I'll start debugging and see if I can find the source of this.

Thanks in advance!
