
Comments (5)

roelderickx commented on July 24, 2024

The bottleneck is on line 23 of geom.py, which is only executed for a duplicate node:
Geometry.geometries.remove(self)
When there are few or no duplicates the performance penalty isn't really noticeable, even if you have millions of nodes. But in this case you have around 1 million duplicates, each of which has to be found by a linear search through an unhashed list before it can be removed.
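To illustrate why removal from a plain list gets expensive, here is a minimal sketch. It is not the ogr2osm code: the `Point` class and the helper functions are hypothetical stand-ins, and the dictionary keyed by `id()` is just one possible hashed alternative.

```python
import time


class Point:
    """Hypothetical stand-in for a geometry object (not the ogr2osm class)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def remove_from_list(items, victims):
    # list.remove() scans from the start of the list on every call.
    for victim in victims:
        items.remove(victim)


def remove_from_dict(items_by_id, victims):
    # A dict lookup is hashed, so each removal takes roughly constant time.
    for victim in victims:
        del items_by_id[id(victim)]


def time_removal(n):
    """Remove the second half of n points from a list and from a dict."""
    points = [Point(i, i) for i in range(n)]
    victims = points[n // 2:]
    as_list = list(points)
    as_dict = {id(p): p for p in points}

    start = time.perf_counter()
    remove_from_list(as_list, victims)
    list_seconds = time.perf_counter() - start

    start = time.perf_counter()
    remove_from_dict(as_dict, victims)
    dict_seconds = time.perf_counter() - start

    return len(as_list), len(as_dict), list_seconds, dict_seconds
```

Calling `time_removal` with increasing values of `n` should show the list timing growing much faster than the dict timing, although the exact numbers depend on the machine.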
On my computer it takes on average 0.035 seconds to process each unique coordinate (whether or not it has duplicates). That may not seem like much, but with 2.5 million unique coordinates (roughly 2,500,000 × 0.035 s ≈ 87,500 s) the process ends up taking more than 24 hours.
Which brings us to a more important question: why add elements to a list if you want to remove them later, without ever having used them? I am convinced that mergePoints can be integrated into parseData, which should significantly improve the performance.
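The idea of merging while parsing could look roughly like the sketch below. This is an assumption about the general approach, not the actual change made in ogr2pbf or ogr2osm: `Point`, `PointRegistry`, `get_or_create` and the rounding `precision` are all hypothetical names chosen for illustration.

```python
class Point:
    """Hypothetical stand-in for a point geometry (not the ogr2osm class)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointRegistry:
    """Hands out one shared Point per coordinate, so duplicates never exist."""

    def __init__(self, precision=7):
        # precision is an assumed rounding level used to build the lookup key.
        self._precision = precision
        self._by_coord = {}

    def get_or_create(self, x, y):
        key = (round(x, self._precision), round(y, self._precision))
        point = self._by_coord.get(key)
        if point is None:
            # First time this coordinate is seen: create and remember it.
            point = Point(x, y)
            self._by_coord[key] = point
        return point

    def __len__(self):
        return len(self._by_coord)
```

With this shape the parser would call `get_or_create` instead of constructing a point directly, so there is nothing left to remove afterwards and no list search is needed.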

from ogr2osm.

roelderickx commented on July 24, 2024

The changes work in roelderickx/ogr2pbf; processing time is down to around 5 minutes. I'll try to backport the changes to ogr2osm and create a pull request.
However, I see you are using a fork now where mergePoints is disabled, which seems to work for what you want to do. In that case you are probably affected by issue #51 as well.


gplv2 commented on July 24, 2024

Forgot to mention, the output goes as far as:

l.debug("Checking list")

So it must be happening after this message, which of course you can deduce from the backtrace above, so perhaps this was not needed.


gplv2 commented on July 24, 2024

I've been debugging this a bit further. Python is not my forte, though I've added some debug statements. It turns out that in mergePoints we have:

Total points user : 3 508 945 (count of points variable)
Total points coord: 2 527 003 (count of pointcoords variable)

It takes a very long time to process the first 5000 points, unusually long imho:

for (location, pointsatloc) in pointcoords.items():

There are also quite a few duplicates present in this dataset, so it has to work hard. It doesn't make a lot of sense that it is so slow. I'll hack on this a bit more to find out where the performance hog is.

When we parse the road database, it contains a lot more points:

Merging points
Total points user : 8082689
Making list
Total points coord 8082689

But it seems we don't have any duplicates there, so it goes really fast according to the debug logging. The memory footprint, however, is exactly the same as when we parse the address data.


gplv2 commented on July 24, 2024

Hey, thanks a lot for looking into this and bringing #51 to my attention. It's been a while since I hacked on this, although the tool still exists and is in use. Really cool that you took the time for this.

ogr2pbf is one of the tools in the chain to prepare data for human assisted import into osm via josm.

https://staging.grbosm.site/#/ (zoom low enough and on north part of Belgium for the layer to get pulled from postgres)

Afaik, I solved it by just living with the duplicates, and later on in the preprocessing chain they got resolved, but I don't remember exactly how.

Anyway, pretty soon I'll be doing a fresh data processing run, which is in fact entirely automated. I will give it a go once it's backported and replace my fork, so it gets tested. The whole preprocessing of the data takes about 6 hrs on a decent Google Cloud node.

Big thanks Roel.

