toebee / changesetmd
Simple XML parser to shove OpenStreetMap changeset metadata dump files into a postgres database
Upgrade to Python 3, e.g. 3.7.
$ time python changesetmd.py -r -d changesets
latest sequence from the database: 1659606
latest sequence on OSM server: 1664404
server has new sequence. commencing replication
opening replication file at http://planet.osm.org/replication/changesets/001/659/607.osm.gz
Traceback (most recent call last):
File "changesetmd.py", line 190, in <module>
md.doReplication(conn)
File "changesetmd.py", line 150, in doReplication
self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
File "changesetmd.py", line 72, in parseFile
for action, elem in context:
File "iterparse.pxi", line 207, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:126122)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 1, line 11, column 294
This is repeatable.
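One possible workaround for the "invalid Char value 1" error, sketched here as an assumption rather than as what ChangesetMD does, is to strip the control bytes that XML 1.0 forbids before handing the data to the parser. The helper name `strip_invalid_xml_bytes` and the sample document are hypothetical, and the sketch uses the standard library parser instead of lxml. It only helps when the bad character is present as a raw byte, not as a character reference.

```python
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids: everything below 0x20 except tab, LF and CR.
# In UTF-8 these values never occur inside a multi-byte sequence, so removing
# them byte-wise does not damage other characters.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_bytes(raw: bytes) -> bytes:
    """Return raw with the control bytes that XML 1.0 disallows removed."""
    return _INVALID_XML_BYTES.sub(b"", raw)


# Hypothetical changeset fragment with a stray 0x01 byte in a tag value.
sample = b'<osm><changeset id="1"><tag k="comment" v="bad\x01char"/></changeset></osm>'
root = ET.fromstring(strip_invalid_xml_bytes(sample))
cleaned_value = root.find("changeset/tag").get("v")
```

Whether silently dropping such characters is acceptable depends on how faithfully the changeset comments need to be preserved.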
ChangesetMD should support schemas.
This would be with a SQL command like SET search_path TO myschema;
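A minimal sketch of how the statement could be built from a user-supplied schema name. The function name `search_path_statement` is hypothetical; the only thing taken from the issue is the `SET search_path TO ...` command. The schema name is quoted as an identifier, which also makes it case-sensitive.

```python
def search_path_statement(schema: str) -> str:
    """Build a SET search_path statement with the schema quoted as an identifier."""
    # Double any embedded double quotes, then wrap the name in double quotes.
    quoted = '"' + schema.replace('"', '""') + '"'
    return f"SET search_path TO {quoted};"


statement = search_path_statement("myschema")
```

The resulting string would presumably be executed once per connection, before any tables are created or queried.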
It would be nice to store changeset bounding boxes as BOX2D to be able to run queries against them.
opening replication file at https://planet.openstreetmap.org/replication/changesets/001/999/999.osm.gz
parsing complete
parsed 21
opening replication file at https://planet.openstreetmap.org/replication/changesets/002/1000/000.osm.gz
Traceback (most recent call last):
File "changesetmd.py", line 190, in <module>
md.doReplication(conn)
File "changesetmd.py", line 150, in doReplication
self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
File "changesetmd.py", line 124, in fetchReplicationFile
replicationFile = urllib2.urlopen(fileUrl)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
002/1000/000 is the wrong diff path; the file after 001/999/999 should be 002/000/000.
If I clear the state flag and try again I get
latest sequence from the database: 1999999
latest sequence on OSM server: 2000469
server has new sequence. commencing replication
opening replication file at https://planet.openstreetmap.org/replication/changesets/002/1000/000.osm.gz
I'm a few versions behind, but I don't see any recent commits that would fix this.
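The path layout seen in the logs (001/999/999 for sequence 1999999) suggests that the sequence number is zero-padded to nine digits and split into three groups of three. A minimal sketch of that mapping follows; the function names are hypothetical and this is not taken from the ChangesetMD source, only the base URL comes from the log output above.

```python
BASE_URL = "https://planet.openstreetmap.org/replication/changesets/"


def replication_path(sequence: int) -> str:
    """Map a sequence number to its AAA/BBB/CCC replication path."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence must fit in nine digits")
    padded = f"{sequence:09d}"  # e.g. 2000000 -> "002000000"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}"


def replication_url(sequence: int) -> str:
    """Build the full URL of the gzipped replication file."""
    return f"{BASE_URL}{replication_path(sequence)}.osm.gz"
```

Padding the whole number first, rather than formatting each group separately, avoids a middle group like 1000 when the sequence rolls over at a multiple of one million.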
The discussions branch imports the same comment over and over again when a changeset discussion contains multiple comments, e.g.
changesets=# select * from osm_changeset_comment where comment_changeset_id = 31067942;
-[ RECORD 1 ]--------+-----------------------------------------------------------------------
comment_changeset_id | 31067942
comment_user_id | 1240849
comment_user_name | ediyes
comment_date | 2015-06-12 14:50:39
comment_text | Hi,
| zeromap, thanks for your feedback, I was fixing overlapping buildings.
|
| I will be more careful in the future.
|
| Edith.
-[ RECORD 2 ]--------+-----------------------------------------------------------------------
comment_changeset_id | 31067942
comment_user_id | 1240849
comment_user_name | ediyes
comment_date | 2015-06-12 14:50:39
comment_text | Hi,
| zeromap, thanks for your feedback, I was fixing overlapping buildings.
|
| I will be more careful in the future.
|
| Edith.
whereas the changeset actually contains one comment from zeromap and one from ediyes.
Some suggested changes, based on writing a lot of queries:
Drop the comment_ prefix from the osm_changeset_comment column names.
Adding bzip support would eliminate having to decompress externally, which takes time and extra disk space.
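A minimal sketch of reading a bzip2-compressed dump directly with the standard library, so no decompressed copy has to be written to disk first. The helper name `open_changeset_dump` and the choice to switch on the `.bz2` file extension are assumptions, not part of ChangesetMD.

```python
import bz2
import os
import tempfile


def open_changeset_dump(path: str):
    """Open a dump for binary reading, decompressing on the fly for .bz2 files."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")


# Small demonstration with a throwaway file standing in for a real dump.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "changesets.osm.bz2")
    with bz2.open(demo_path, "wb") as f:
        f.write(b"<osm></osm>")
    with open_changeset_dump(demo_path) as f:
        demo_content = f.read()
```

The returned object is file-like, so it could be passed to an incremental XML parser in place of a plain file.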
A changeset that only touches a node without moving it can have min_lat=max_lat and min_lon=max_lon.
These are currently encoded as polygons, but zero-area polygons are very hard to create by normal means. For example, this SQL doesn't work on those changesets:
UPDATE osm_changeset_40m
SET geom = ST_SetSRID(ST_Makebox2d(ST_MakePoint(min_lon, min_lat), ST_MakePoint(max_lon, max_lat)),4326)::geometry(Polygon,4326)
WHERE geom IS NULL;
Would it be better to have these as points and change the geometry constraint to geometry(Geometry,4326)?
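To illustrate the proposal, here is a rough sketch that picks a geometry type from the bounding box: a point when both extents collapse, a line when only one does, and a polygon otherwise. The function name `bbox_to_wkt` is hypothetical, and storing the mixed results would need the generic geometry(Geometry,4326) constraint suggested above.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat) -> str:
    """Return WKT for a bounding box, degrading to a point or line if it is flat."""
    if min_lon == max_lon and min_lat == max_lat:
        # Zero width and zero height: a single location.
        return f"POINT({min_lon} {min_lat})"
    if min_lon == max_lon or min_lat == max_lat:
        # Zero width or zero height: a horizontal or vertical segment.
        return f"LINESTRING({min_lon} {min_lat}, {max_lon} {max_lat})"
    # Proper box: closed ring that starts and ends at the same corner.
    return (
        f"POLYGON(({min_lon} {min_lat}, {min_lon} {max_lat}, "
        f"{max_lon} {max_lat}, {max_lon} {min_lat}, {min_lon} {min_lat}))"
    )
```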
Calculate the time cost of executing the operation at command line.
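One way this could look, as a sketch only: a small context manager that measures wall-clock time around the import and reports it when the block ends. The name `timed` and the message format are assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, report=print):
    """Report how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        report(f"{label} took {elapsed:.2f}s")


# Hypothetical usage: collect the message instead of printing it.
messages = []
with timed("parsing", report=messages.append):
    sum(range(1000))  # stand-in for the real work
```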
I am trying the replication out and getting an error
parsed 22
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/033.osm.gz
parsing complete
parsed 18
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/034.osm.gz
parsing complete
parsed 18
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/035.osm.gz
parsing complete
parsed 14
opening replication file at http://planet.openstreetmap.org/replication/changesets/002/065/036.osm.gz
error during replication
[Errno 104] Connection reset by peer
Does anybody know why this happens or how to solve it?
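The cause of the reset is not clear from the log, but transient network errors like this are often handled by retrying with a growing delay. The following is a generic sketch, not ChangesetMD code; `fetch_with_retry`, its parameters, and the flaky demo function are all hypothetical.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on OSError with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:  # ConnectionResetError is a subclass of OSError
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a function that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return b"payload"


result = fetch_with_retry(flaky_fetch, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the demonstration instant; a real run would use the default `time.sleep`.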
I'd like to add a user_name table for #19, and the easy way to do this would be an INSERT ... ON CONFLICT ... statement, but that requires PostgreSQL 9.5. Thoughts on making that the minimum?
When performing an incremental update I got the following error:
Traceback (most recent call last):
File "./changesetmd.py", line 77, in <module>
parser.parse(args.fileName)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/home/pnorman/osm/changesets/ChangesetMD/changesethandler.py", line 55, in endElement
self.changeset.numChanges, self.changeset.userName,self.tags))
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
PostgreSQL may have restarted during the skipping stage, but it would be nice not to need to re-skip.
If the server running Postgres crashes, there is no way to resume an interrupted upload. It is therefore safe to turn off synchronous_commit, as the data is lost anyway.
synchronous_commit should be set to off, with SQL like SET synchronous_commit TO OFF;
I think it would be beneficial to pin the dependencies, e.g. using Poetry. PR incoming.
opening replication file at http://planet.osm.org/replication/changesets/001/507/867.osm.gz
Traceback (most recent call last):
File "changesetmd.py", line 190, in <module>
md.doReplication(conn)
File "changesetmd.py", line 150, in doReplication
self.parseFile(connection, self.fetchReplicationFile(currentSequence), True)
File "changesetmd.py", line 71, in parseFile
action, root = context.next()
File "iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:131322)
lxml.etree.XMLSyntaxError: no element found
On next run
$ python changesetmd.py -d changesets -r
concurrent update in progress. Bailing out!
The contents of this replication diff are empty
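Two things seem to go wrong here: the empty file makes the parser raise, and the "update in progress" marker is left set, so the next run bails out. A rough sketch of handling both follows. It uses the standard library parser rather than lxml, and the names `update_lock`, `count_changesets`, and the dictionary standing in for the state stored in the database are all hypothetical.

```python
import contextlib
import io
import xml.etree.ElementTree as ET


@contextlib.contextmanager
def update_lock(state):
    """Set an in-progress flag and always clear it again, even on errors."""
    if state.get("update_in_progress"):
        raise RuntimeError("concurrent update in progress")
    state["update_in_progress"] = True
    try:
        yield
    finally:
        state["update_in_progress"] = False


def count_changesets(stream) -> int:
    """Count changeset elements, treating an empty replication file as zero."""
    data = stream.read()
    if not data.strip():
        return 0  # nothing to parse, not an error
    root = ET.fromstring(data)
    return sum(1 for _ in root.iter("changeset"))


state = {}
with update_lock(state):
    empty_count = count_changesets(io.BytesIO(b""))
```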
The XML parser being used right now seems to be pretty slow. Incremental updates take an hour to just parse the file. I'm sure there is a faster parser somewhere in pythonland.
Defaults to port 5432.
Have script default to port 5432 if a port is not specified at commandline.
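With argparse this would amount to giving the port option a default value. A minimal sketch, where the option names `-P`/`--port` and the function name `build_parser` are assumptions rather than the script's confirmed interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose port option falls back to the PostgreSQL default."""
    parser = argparse.ArgumentParser(description="Import OSM changeset metadata")
    parser.add_argument(
        "-P", "--port",
        type=int,
        default=5432,  # used when no port is given on the command line
        help="database port (default: 5432)",
    )
    return parser


default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(["-P", "5433"])
```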
Kristen