Hi, I'm having a strange issue with ogr2osm. We are using it to pr

processing never finishes on small dataset about ogr2osm HOT 5 CLOSED

pnorman commented on July 24, 2024

processing never finishes on small dataset

Comments (5)

roelderickx commented on July 24, 2024

The bottleneck is on line 23 of geom.py, which is only executed for a duplicate node:
Geometry.geometries.remove(self)
When there are few or no duplicates the performance penalty isn't really noticeable, even if you have millions of nodes. But in this case you have around 1 million duplicates, each of which must be searched for in a non-hashed list before removal.
On my computer it takes on average 0.035 seconds to process each unique coordinate (whether or not it has duplicates). That may not seem like much, but with 2.5 million unique coordinates the process ends up taking more than 24 hours.
Which brings us to a more important question: why add elements to a list if you want to remove them later, without ever having used them? I am convinced that mergePoints can be integrated into parseData, which should significantly improve the performance.
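
A minimal sketch of that idea, not the actual ogr2osm code: each point is registered in a dictionary keyed by its coordinates while parsing, so a duplicate is found with a hash lookup and is never added to a list that has to be searched later. The names `Point`, `PointRegistry` and `get_or_create` are hypothetical, and exact float equality is assumed for the key; a real tool would probably normalise coordinates first.

```python
class Point:
    """Hypothetical stand-in for a node geometry; not the ogr2osm class."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointRegistry:
    """Keeps one Point per coordinate pair; dict lookups are O(1) on average."""

    def __init__(self):
        self._by_coord = {}

    def get_or_create(self, x, y):
        # Look the coordinate up first, so a duplicate is never created.
        key = (x, y)
        point = self._by_coord.get(key)
        if point is None:
            point = Point(x, y)
            self._by_coord[key] = point
        return point

    def __len__(self):
        return len(self._by_coord)


registry = PointRegistry()
a = registry.get_or_create(4.35, 50.85)
b = registry.get_or_create(4.35, 50.85)  # duplicate coordinate: same object returned
c = registry.get_or_create(3.72, 51.05)
```

Here `a` and `b` refer to the same object, so nothing has to be removed afterwards. By contrast, `list.remove` scans the list from the start on every call, which is what becomes expensive with about a million duplicates.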

roelderickx commented on July 24, 2024

The changes work in roelderickx/ogr2pbf; processing time is down to around 5 minutes. I'll try to backport the changes to ogr2osm and create a pull request.
However, I see you are using a fork now where mergePoints is disabled, which seems to work for what you want to do. In that case you are probably affected by issue #51 as well.

gplv2 commented on July 24, 2024

Forgot to mention, the output goes as far as:

l.debug("Checking list")

So it must be happening after this message, which of course you can deduce from the backtrace above, so perhaps this was not needed.

gplv2 commented on July 24, 2024

I've been debugging this a bit further. Python is not my forte, but I've added some debug statements. It turns out that in mergePoints we have:

Total points user : 3 508 945 (count of points variable)
Total points coord: 2 527 003 (count of pointcoords variable)

It takes a very long time to process the first 5000 points, unusually long imho:

for (location, pointsatloc) in pointcoords.items():

There are also quite a few duplicates in this dataset, so it has to work hard. Still, it doesn't make a lot of sense that it is so slow. I'll hack on this a bit more to find out where the performance hog is.
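
For illustration, the two counts above can be reproduced with a rough sketch like the following. It is not the code from ogr2osm: `group_by_location` and the sample coordinates are made up, and points are assumed to be plain `(x, y)` tuples.

```python
from collections import defaultdict


def group_by_location(points):
    """Map each (x, y) location to the list of points found there."""
    pointcoords = defaultdict(list)
    for p in points:
        pointcoords[(p[0], p[1])].append(p)
    return pointcoords


# Made-up sample: one location occurs three times.
points = [(4.35, 50.85), (4.35, 50.85), (3.72, 51.05), (4.35, 50.85)]
pointcoords = group_by_location(points)

total = len(points)          # corresponds to "Total points user"
unique = len(pointcoords)    # corresponds to "Total points coord"
duplicates = total - unique  # points that share a location with an earlier one

print(f"Total points user : {total}")
print(f"Total points coord: {unique}")
print(f"Duplicates        : {duplicates}")
```

With the counts reported above, the same subtraction gives 3 508 945 - 2 527 003 = 981 942 duplicate points.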

When we parse the road database, it contains a lot more points:

Merging points
Total points user : 8082689
Making list
Total points coord 8082689

But it seems we don't have any duplicates there, so it goes really fast according to the debug logging. The memory footprint, however, is exactly the same as when we parse the address data.

gplv2 commented on July 24, 2024

Hey, thanks a lot for looking into this and bringing #51 to my attention. It's been a while since I hacked on this, although the tool still exists and is in use. Really cool that you took the time for this.

ogr2pbf is one of the tools in the chain that prepares data for human-assisted import into OSM via JOSM.

https://staging.grbosm.site/#/ (zoom low enough and on north part of Belgium for the layer to get pulled from postgres)

Afaik, I solved it by just living with the duplicates; they got resolved later on in the preprocessing chain, but I don't remember exactly how.

Anyway, pretty soon I'll be doing a fresh data processing run, which is in fact entirely automated. I will give it a go once it's backported and replace my fork, so it gets tested. The whole preprocessing of the data takes about 6 hours on a decent Google Cloud node.

Big thanks Roel.
