changesetmd's People

Contributors

jlevente, kristenkam, pnorman, spatialhast, toebee, yohanboniface

changesetmd's Issues

PCDATA invalid Char value

$ time python changesetmd.py -r -d changesets
latest sequence from the database: 1659606
latest sequence on OSM server: 1664404
server has new sequence. commencing replication
opening replication file at http://planet.osm.org/replication/changesets/001/659/607.osm.gz
Traceback (most recent call last):
  File "changesetmd.py", line 190, in <module>
    md.doReplication(conn)
  File "changesetmd.py", line 150, in doReplication
    self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
  File "changesetmd.py", line 72, in parseFile
    for action, elem in context:
  File "iterparse.pxi", line 207, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:126122)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11, column 294

This is repeatable.
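
One possible workaround, sketched here under the assumption that the offending data is a control character that XML 1.0 forbids (the "Char value 1" in the error suggests a U+0001 inside a changeset tag), is to strip such characters from the decompressed text before it reaches the parser. The function name `strip_invalid_xml_chars` is hypothetical and not part of ChangesetMD.

```python
import re

# Characters outside the XML 1.0 "Char" production: C0 controls other than
# tab, LF and CR, the surrogate range, and U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 does not allow removed."""
    return _INVALID_XML_CHARS.sub("", text)
```

This silently drops data, so logging the affected changeset would probably be worthwhile in a real fix.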

support schemas

ChangesetMD should support schemas.

This would be with a SQL command like SET search_path TO myschema;
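
A minimal sketch of how the statement could be built from a user-supplied schema name. The helper `search_path_statement` and the idea of passing the schema on the command line are assumptions, not existing ChangesetMD features. It only accepts plain unquoted identifiers so the name can be interpolated safely.

```python
import re

# Accept only simple identifiers; anything needing quoting is rejected.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def search_path_statement(schema):
    """Return a SET search_path statement for a plain schema name."""
    if not _IDENTIFIER.match(schema):
        raise ValueError("unsupported schema name: %r" % schema)
    return "SET search_path TO %s;" % schema
```

The resulting string would be executed once on the connection before any other statement.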

spatial support

It would be nice to store changeset bounding boxes as BOX2D to be able to run queries against them.

Wrong diff fetched

opening replication file at https://planet.openstreetmap.org/replication/changesets/001/999/999.osm.gz
parsing complete
parsed 21
opening replication file at https://planet.openstreetmap.org/replication/changesets/002/1000/000.osm.gz
Traceback (most recent call last):
  File "changesetmd.py", line 190, in <module>
    md.doReplication(conn)
  File "changesetmd.py", line 150, in doReplication
    self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
  File "changesetmd.py", line 124, in fetchReplicationFile
    replicationFile = urllib2.urlopen(fileUrl)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

002/1000/000 is the wrong diff path; sequence 2000000 should map to 002/000/000.

If I clear the state flag and try again I get:

latest sequence from the database: 1999999
latest sequence on OSM server: 2000469
server has new sequence. commencing replication
opening replication file at https://planet.openstreetmap.org/replication/changesets/002/1000/000.osm.gz

I'm a few versions behind, but I don't see any recent commits that would fix this.
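
For reference, a sketch of the path mapping the logs above imply: the sequence number is zero-padded to nine digits and split into three groups of three. The function names are hypothetical, and this is not the code in changesetmd.py.

```python
def sequence_to_path(sequence):
    """Format a replication sequence number as AAA/BBB/CCC."""
    padded = "%09d" % sequence
    return "/".join((padded[0:3], padded[3:6], padded[6:9]))


def replication_url(base_url, sequence):
    """Build the URL of the gzipped diff for one sequence number."""
    return "%s/%s.osm.gz" % (base_url.rstrip("/"), sequence_to_path(sequence))
```

With this, sequence 1999999 gives 001/999/999 and sequence 2000000 gives 002/000/000.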

Discussions branch duplicate comments

The discussions branch imports the same comment repeatedly when a changeset discussion contains multiple comments, e.g.

changesets=# select * from osm_changeset_comment where comment_changeset_id = 31067942;
-[ RECORD 1 ]--------+-----------------------------------------------------------------------
comment_changeset_id | 31067942
comment_user_id      | 1240849
comment_user_name    | ediyes
comment_date         | 2015-06-12 14:50:39
comment_text         | Hi,
                     | zeromap, thanks for your feedback, I was fixing overlapping buildings.
                     |
                     | I will be more careful in the future.
                     |
                     | Edith.
-[ RECORD 2 ]--------+-----------------------------------------------------------------------
comment_changeset_id | 31067942
comment_user_id      | 1240849
comment_user_name    | ediyes
comment_date         | 2015-06-12 14:50:39
comment_text         | Hi,
                     | zeromap, thanks for your feedback, I was fixing overlapping buildings.
                     |
                     | I will be more careful in the future.
                     |
                     | Edith.

The changeset actually contains one comment from zeromap and one from ediyes.

Schema changes

Some suggested changes, based on writing a lot of queries:

  • Normalize user IDs to have an id -> name mapping in their own table. I do a lot of queries on distinct users, and have to write SELECT DISTINCT user_id, user_name ...
  • Drop comment_ prefix from osm_changeset_comment column names
  • Perhaps drop the osm_ prefix from all table names

Add bzip2 support

Adding bzip2 support would eliminate the need to decompress the file externally, which takes time and extra disk space.
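
A rough sketch of what this could look like, assuming Python 3 (where `bz2.open` is available) and picking the decompressor from the file extension. `open_changeset_file` is a hypothetical helper, not an existing function.

```python
import bz2
import gzip


def open_changeset_file(path):
    """Open a changeset dump for binary reading, decompressing on the fly."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")
```

The returned file object can be passed straight to a streaming XML parser, so no uncompressed copy needs to be written to disk.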

Zero-area changeset

A changeset that only touches a node without moving it can have min_lat=max_lat and min_lon=max_lon.

These are currently encoded as polygons, but zero-area polygons are hard to construct by normal means. For example, this SQL doesn't work on those changesets:

UPDATE osm_changeset_40m
  SET geom = ST_SetSRID(ST_Makebox2d(ST_MakePoint(min_lon, min_lat), ST_MakePoint(max_lon, max_lat)),4326)::geometry(Polygon,4326)
  WHERE geom IS NULL;

Would it be better to have these as points and change the geometry constraint to geometry(Geometry,4326)?
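
To illustrate the idea on the import side, here is a sketch that picks the geometry type from the bounding box before it is sent to PostGIS. It assumes the column constraint has been relaxed to geometry(Geometry,4326) as proposed. The function `bbox_to_ewkt` is hypothetical, and the LINESTRING case for boxes collapsed along one axis is an extra assumption beyond what the issue describes.

```python
def bbox_to_ewkt(min_lon, min_lat, max_lon, max_lat, srid=4326):
    """Return EWKT for a bounding box, using the simplest geometry that fits."""
    same_lon = min_lon == max_lon
    same_lat = min_lat == max_lat
    if same_lon and same_lat:
        # Zero-area box around a single node: store a point.
        return "SRID=%d;POINT(%r %r)" % (srid, min_lon, min_lat)
    if same_lon or same_lat:
        # Box collapsed along one axis: store the two end points.
        return "SRID=%d;LINESTRING(%r %r, %r %r)" % (
            srid, min_lon, min_lat, max_lon, max_lat)
    ring = [
        (min_lon, min_lat), (min_lon, max_lat),
        (max_lon, max_lat), (max_lon, min_lat),
        (min_lon, min_lat),  # repeat the first vertex to close the ring
    ]
    coords = ", ".join("%r %r" % point for point in ring)
    return "SRID=%d;POLYGON((%s))" % (srid, coords)
```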

Add Time Cost

Report how long each operation takes when it is run from the command line.
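
One way this might be done, as a sketch only: a small context manager that wraps an operation and writes the elapsed time when it finishes. The name `report_elapsed` and the output format are assumptions.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def report_elapsed(label, stream=sys.stderr):
    """Write how long the wrapped block took, even if it raised."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        stream.write("%s took %.1f s\n" % (label, elapsed))
```

Usage would be along the lines of `with report_elapsed("parsing"): ...` around the parse or replication step.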

[Errno 104] Connection reset by peer

I am trying out the replication and getting an error:

parsed 22
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/033.osm.gz
parsing complete
parsed 18
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/034.osm.gz
parsing complete
parsed 18
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/035.osm.gz
parsing complete
parsed 14
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/036.osm.gz
error during replication
[Errno 104] Connection reset by peer

Does anybody know why this happens or how to solve it?
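
A transient reset like this can usually be handled by retrying the download. The sketch below assumes Python 3, where `ConnectionResetError` and the urllib error classes derive from `OSError`. The `fetch` callable stands in for whatever opens the URL, and `fetch_with_retries` is a hypothetical name.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with a doubling delay."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= 2
```

A real version would probably not retry permanent failures such as an HTTP 404.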

Bumping minimum postgres to 9.5

I'd like to add a user_name table for #19, and the easy way to do this would be an INSERT ... ON CONFLICT ... statement, but that requires 9.5. Thoughts on making that the minimum?
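
To show the shape of the statement, here is a sketch that runs an upsert against an in-memory SQLite database as a stand-in, since SQLite 3.24+ accepts the same ON CONFLICT ... DO UPDATE form. The column names `user_id` and `name` are assumptions, and with psycopg2 the placeholders would be `%s` rather than `?`.

```python
import sqlite3

# Insert a user, or update the stored name if the id already exists.
UPSERT_SQL = (
    "INSERT INTO user_name (user_id, name) VALUES (?, ?) "
    "ON CONFLICT (user_id) DO UPDATE SET name = excluded.name"
)


def record_user(conn, user_id, name):
    """Store the latest known name for a user id."""
    conn.execute(UPSERT_SQL, (user_id, name))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_name (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
record_user(conn, 1, "old_name")
record_user(conn, 1, "new_name")  # same id: updates instead of failing
rows = conn.execute("SELECT user_id, name FROM user_name").fetchall()
```

After the two calls the table holds a single row for user 1 with the newer name.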

Attempt to reconnect if necessary

When performing an incremental update I got the following error:

Traceback (most recent call last):
  File "./changesetmd.py", line 77, in <module>
    parser.parse(args.fileName)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/home/pnorman/osm/changesets/ChangesetMD/changesethandler.py", line 55, in endElement
    self.changeset.numChanges, self.changeset.userName,self.tags))
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

PostgreSQL may have restarted during the skipping stage, but it would be nice not to need to re-skip.

synchronous_commit

If the server running PostgreSQL crashes, there is no way to resume an interrupted upload. It is therefore safe to turn synchronous_commit off, as the data is lost anyway.

synchronous_commit should be set to off, with SQL like SET synchronous_commit TO OFF;

Error with replication branch

opening replication file at http://planet.osm.org/replication/changesets/001/507/867.osm.gz
Traceback (most recent call last):
  File "changesetmd.py", line 190, in <module>
    md.doReplication(conn)
  File "changesetmd.py", line 150, in doReplication
    self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
  File "changesetmd.py", line 71, in parseFile
    action, root = context.next()
  File "iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:131322)
lxml.etree.XMLSyntaxError: no element found

On the next run:

$ python changesetmd.py -d changesets -r
concurrent update in progress. Bailing out!

The contents of this replication diff are empty.
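
A possible guard, sketched with the standard library rather than lxml: check whether the decompressed diff has any content before parsing, and treat an empty one as containing no changesets. `parse_replication_bytes` is a hypothetical helper and assumes Python 3.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_replication_bytes(compressed):
    """Return changeset elements from a gzipped diff; [] if it is empty."""
    data = gzip.decompress(compressed) if compressed else b""
    if not data.strip():
        # Nothing to parse: skip this sequence instead of raising.
        return []
    root = ET.fromstring(data)
    return root.findall("changeset")
```

Handling the empty case without an exception would also avoid leaving the update-in-progress flag set.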

Switch XML parsers

The XML parser being used right now seems to be pretty slow. Incremental updates take an hour to just parse the file. I'm sure there is a faster parser somewhere in pythonland.
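
One candidate worth benchmarking is a streaming `iterparse` loop that discards elements once they are handled. The sketch below uses the standard library's `xml.etree.ElementTree`. lxml offers an `iterparse` with a similar interface. The generator name and the returned shape are assumptions for illustration, and no speed-up is claimed without measurement.

```python
import xml.etree.ElementTree as ET


def iter_changesets(stream):
    """Yield (attributes, tags) per changeset without keeping the whole tree."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "changeset":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            yield dict(elem.attrib), tags
            root.clear()  # drop processed children so memory stays bounded
```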
