philipmat / discogs-xml2db
Imports the discogs.com monthly XML dumps into databases
License: Apache License 2.0
The process crashed when it hit the "masters" section. Here's the output:
File "discogsparser.py", line 241, in
main(sys.argv[1:])
File "discogsparser.py", line 236, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 171, in parseMasters
parser.parse(master_file)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/local/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
self._cont_handler.endElement(name)
File "/home/thephltr/webapps/who_pro/discogs_importer/discogsmasterparser.py", line 173, in endElement
self.exporter.storeMaster(self.master)
File "/home/thephltr/webapps/who_pro/discogs_importer/postgresexporter.py", line 323, in storeMaster
(img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
I was stuck at the step "python discogsparser.py". The following is the run information:
python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" -d 20140803
Namespace(data_quality=None, date='20140803', file=[], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" discogs_20140803_artists.xml
Namespace(data_quality=None, date=None, file=['discogs_20140803_artists.xml'], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')
It looks like the command didn't execute. I don't know what's wrong with it; I am a Python newbie.
Thank you,
Ying
I'm trying to import data into MongoDB. As you mention in the README, direct import takes longer than the other option (XML to JSON plus mongoimport). However, even when I tried the other option, I couldn't easily import it. The releases.xml file is ~9 GB and it took more than 24 hours to export the JSON file. After that process the releases.json file was corrupted, so I couldn't actually import it.
I tried the script on an Amazon EC2 micro instance and on another VPS with a dual-core processor. The results were the same: the files were ~8 GB and corrupted. The other problem is that the script causes CPU overload, so leaving discogsparser.py running for 24 hours isn't really an option. I couldn't figure out what the problem is. Is xml.sax slow, or do the systems I used simply not have enough resources?
While parsing the 20121201 masters file I got the following stack trace:
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 218, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 153, in parseMasters
parser.parse(master_file)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/Users/burc/Documents/Development/discogs-xml2db/discogsmasterparser.py", line 170, in endElement
self.exporter.storeMaster(self.master)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 250, in storeMaster
self.execute('masters', master)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 203, in execute
uniq, md5 = self._is_uniq(collection, what.id, json_string)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 181, in _is_uniq
return self._quick_uniq.is_uniq(collection, id, json_string)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 110, in is_uniq
self._load(collection)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 88, in _load
if self._hashes[name] is None:
KeyError: 'masters'
Without studying the code in detail, changing line 84 of mongodbexporter.py to
self._hashes = {'artists': None, 'labels': None, 'releases': None, 'masters': None }
fixed the issue.
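For reference, a more defensive shape for that fix is to create the per-collection hash slot on demand instead of hard-coding the collection names. The sketch below is illustrative only; the method names echo the traceback (_load, is_uniq) but the body is not the project's actual code.

# Minimal sketch, not the project's actual code: create the per-collection
# hash slot on demand so an unexpected collection such as 'masters' never
# raises KeyError.
class QuickUniqSketch(object):
    def __init__(self):
        # no hard-coded collection names here
        self._hashes = {}

    def _load(self, name):
        # setdefault returns the existing dict or installs an empty one
        return self._hashes.setdefault(name, {})

    def is_uniq(self, collection, doc_id, md5sum):
        seen = self._load(collection)
        if seen.get(doc_id) == md5sum:
            return False           # identical document already recorded
        seen[doc_id] = md5sum      # record the new or changed document
        return True

u = QuickUniqSketch()
print(u.is_uniq('masters', 1, 'abc'))   # True: first time seen
print(u.is_uniq('masters', 1, 'abc'))   # False: duplicate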
Currently the image type is stored in the image table, but it is actually a function of the relationship between the image and each release/label/artist it is linked to. There are two problems with the current approach.
Solution: Move the image type column from the image table to the releases_images, artists_images, and labels_images tables.
Error: Unknown Release element 'sub_tracks' when building the release table from the 01072014 dump; need to handle or ignore the new sub_tracks entity.
in fix_db.sql
--Remove duplicate master rows
insert into masters_images
select distinct t1.image_uri, t1.type, t1.master_id
from tmp_masters_images t1
left join masters_images t2
on t1.image_uri = t2.image_uri
and t1.type = t2.type
and t1.master_id = t2.master_id
where t2.image_uri is null
;
fails with
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
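A possible adjustment (a sketch only, assuming psycopg2 and the table and column names used in the fix_db.sql snippet above): relax the NOT NULL constraint on masters_images.type and make the duplicate check NULL-safe with IS NOT DISTINCT FROM, so rows whose type is NULL still de-duplicate correctly.

# Sketch: NULL-safe variant of the de-duplication step, run via psycopg2.
# The connection parameters and explicit column list are assumptions.
import psycopg2

FIX_SQL = """
ALTER TABLE masters_images ALTER COLUMN type DROP NOT NULL;

INSERT INTO masters_images (image_uri, type, master_id)
SELECT DISTINCT t1.image_uri, t1.type, t1.master_id
FROM tmp_masters_images t1
LEFT JOIN masters_images t2
  ON  t1.image_uri = t2.image_uri
  AND t1.type      IS NOT DISTINCT FROM t2.type
  AND t1.master_id = t2.master_id
WHERE t2.image_uri IS NULL;
"""

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:   # commits on success
    cur.execute(FIX_SQL)
conn.close()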
Hello,
While playing with md5, the following was discovered.
I tested with the following data files:
discogs_20141001_artists.xml (main file)
discogs_20150301_artists.xml (md5 for upsert)
~]# du -h artists.*
1.2G discogs_20141001_artists.json
315M discogs_20150301_artists.json
147M artists.md5
~]# wc -l *
3523086 discogs_20141001_artists.json
799269 discogs_20150301_artists.json
3769950 artists.md5
~]# time mongoimport -d dataset discogs_20150301_artists.json
2015-06-17T15:38:46.106+0300 no collection specified
2015-06-17T15:38:46.106+0300 using filename 'artists' as collection
2015-06-17T15:38:46.108+0300 connected to: localhost
2015-06-17T15:38:49.107+0300 [#.......................] dataset.artists 69.1 MB/1.1 GB (6.0%)
2015-06-17T15:38:52.107+0300 [##......................] dataset.artists 136.4 MB/1.1 GB (11.9%)
2015-06-17T15:38:55.109+0300 [####....................] dataset.artists 197.2 MB/1.1 GB (17.3%)
2015-06-17T15:38:58.107+0300 [#####...................] dataset.artists 248.6 MB/1.1 GB (21.8%)
2015-06-17T15:39:01.107+0300 [######..................] dataset.artists 308.1 MB/1.1 GB (27.0%)
2015-06-17T15:39:04.108+0300 [#######.................] dataset.artists 359.5 MB/1.1 GB (31.5%)
2015-06-17T15:39:07.107+0300 [########................] dataset.artists 397.7 MB/1.1 GB (34.8%)
2015-06-17T15:39:10.107+0300 [#########...............] dataset.artists 439.6 MB/1.1 GB (38.5%)
2015-06-17T15:39:13.107+0300 [##########..............] dataset.artists 496.0 MB/1.1 GB (43.4%)
2015-06-17T15:39:16.107+0300 [###########.............] dataset.artists 546.8 MB/1.1 GB (47.8%)
2015-06-17T15:39:19.108+0300 [############............] dataset.artists 601.0 MB/1.1 GB (52.6%)
2015-06-17T15:39:22.107+0300 [#############...........] dataset.artists 638.6 MB/1.1 GB (55.9%)
2015-06-17T15:39:25.107+0300 [##############..........] dataset.artists 694.1 MB/1.1 GB (60.7%)
2015-06-17T15:39:28.107+0300 [###############.........] dataset.artists 741.0 MB/1.1 GB (64.8%)
2015-06-17T15:39:31.107+0300 [################........] dataset.artists 766.8 MB/1.1 GB (67.1%)
2015-06-17T15:39:34.107+0300 [#################.......] dataset.artists 815.8 MB/1.1 GB (71.4%)
2015-06-17T15:39:37.107+0300 [##################......] dataset.artists 873.0 MB/1.1 GB (76.4%)
2015-06-17T15:39:40.107+0300 [###################.....] dataset.artists 920.1 MB/1.1 GB (80.5%)
2015-06-17T15:39:43.107+0300 [####################....] dataset.artists 971.5 MB/1.1 GB (85.0%)
2015-06-17T15:39:46.107+0300 [#####################...] dataset.artists 1.0 GB/1.1 GB (89.8%)
2015-06-17T15:39:49.108+0300 [######################..] dataset.artists 1.1 GB/1.1 GB (94.7%)
2015-06-17T15:39:52.107+0300 [#######################.] dataset.artists 1.1 GB/1.1 GB (98.5%)
2015-06-17T15:39:53.895+0300 imported 3523086 documents
real 1m7.806s
~]# time mongoimport --upsert --upsertFields 'id' -d dataset discogs_20150301_artists.json
2015-06-17T15:40:35.177+0300 no collection specified
2015-06-17T15:40:35.177+0300 using filename 'artists' as collection
2015-06-17T15:40:35.179+0300 connected to: localhost
2015-06-17T15:40:38.178+0300 [........................] dataset.artists 7.6 MB/314.7 MB (2.4%)
[... the 2.4% line repeats every 3 seconds until 15:41:41 ...]
2015-06-17T15:41:44.178+0300 [#.......................] dataset.artists 14.0 MB/314.7 MB (4.4%)
[... the 4.4% line repeats every 3 seconds until 15:45:11 ...]
2015-06-17T15:45:14.178+0300 [#.......................] dataset.artists 20.1 MB/314.7 MB (6.4%)
[... the 6.4% line repeats every 3 seconds until 15:50:05 ...]
2015-06-17T15:50:05.178+0300 [#.......................] dataset.artists 20.1 MB/314.7 MB (6.4%)
^C
real 9m31.152s
Should have a way to test that new discogs XML releases do not break the parsing.
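One possible shape for such a test (a sketch, not existing project code): keep a tiny XML fixture per dump format and run it through the same xml.sax machinery the importer uses, so a renamed or brand-new element fails a unit test instead of a multi-hour import. The fixture and the KNOWN_TAGS set below are illustrative.

# Sketch of a parsing regression test; the fixture and handler here are
# illustrative stand-ins, not the project's real parser classes.
import io
import unittest
import xml.sax


class CollectingHandler(xml.sax.ContentHandler):
    """Records every element name so unexpected tags can be asserted on."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.seen = set()

    def startElement(self, name, attrs):
        self.seen.add(name)


SAMPLE_RELEASE_XML = b"""<releases>
  <release id="1" status="Accepted">
    <title>Stockholm</title>
    <tracklist><track><position>A1</position></track></tracklist>
  </release>
</releases>"""

KNOWN_TAGS = {"releases", "release", "title", "tracklist", "track", "position"}


class TestDumpFormat(unittest.TestCase):
    def test_no_unknown_elements(self):
        handler = CollectingHandler()
        xml.sax.parse(io.BytesIO(SAMPLE_RELEASE_XML), handler)
        self.assertEqual(handler.seen - KNOWN_TAGS, set())


if __name__ == "__main__":
    unittest.main()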
The Postgres database is missing indexes; there don't seem to be any, which makes querying the data slower than it needs to be.
There is some interesting work done on some forks. Merge them back:
No need to have 'discogs' in the name; everything here is to do with discogs.
No need for 'pgsql' in the name; the .sql suffix already covers it.
Longer names mean more typing.
Renamed as follows:
discogs-indexes-pgsql.sql -> create_indexes.sql
discogs-pgsql.sql -> create_tables.sql
discogs-fixdb-pgsql.sql -> fix_db.sql
'Various' is referred to as the artist with id 194 on the website, i.e. clicking a 'Various' link in the database will take you to http://www.discogs.com/artist/194. However, that artist id doesn't actually exist in the data dumps, and hence not in the database once imported.
This is problematic because it means relational links such as
SELECT r.release_id,
       a.id AS artistId,
       a.name
FROM releases_artists AS r
INNER JOIN artist a ON r.artist_name = a.name
will not return results for Various artists. After the data has been imported we should create a row for 'Various' so that queries on the database don't have to special-case Various artists.
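A minimal post-import step along those lines could look like the sketch below; it assumes psycopg2 and an artist table with id and name columns (the real schema may differ), and uses the id 194 mentioned above.

# Sketch: add a synthetic 'Various' artist row after the import, so joins
# on artist no longer need to special-case Various.  Table and column names
# are assumptions based on the query above.
import psycopg2

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO artist (id, name)
        SELECT 194, 'Various'
        WHERE NOT EXISTS (SELECT 1 FROM artist WHERE id = 194)
        """
    )
conn.close()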
I created a database in PostgreSQL and imported the database schema. Then I tried to run discogsparser.py with the following command:
python discogsparser.py -o pgsql -p "host=localhost dbname=discogs user=[user] password=[pass]" discogs_20120501_releases.xml.gz
I also tried the json format but the result doesn't change. I'm getting something like this:
Namespace(data_quality=None, date=None, file=['discogs_20120501_releases.xml.gz'], ignore_unknown_tags=False, n=None, output='pgsql', params='host=localhost dbname=discogs user=[user] password=[pass]')
and discogsparser.py stops working without raising any exception.
In the original XML the formats are listed in order, so for a multi-format release the first format listed relates to the first tracks on the release. But we lose this order once the data is imported into the Postgres database, because there is no order column in the release_formats table. This makes it impossible to accurately map the track list to a series of mediums and set the medium format accordingly.
Running create_indexes gives
ERROR: index row size 2888 exceeds maximum 2712 for index "releases_extraartists_name_idx"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
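Following the HINT, the oversized btree entry can be avoided with an expression index over md5() of the value. The sketch below is only a guess at how that could be applied here: the column (shown as name) is a placeholder for whichever text column the original releases_extraartists index covers.

# Sketch: swap the failing plain index for an md5() expression index, per the
# Postgres HINT.  The column 'name' is a placeholder, not a verified name.
import psycopg2

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS releases_extraartists_name_idx")
    cur.execute("CREATE INDEX releases_extraartists_name_idx "
                "ON releases_extraartists (md5(name))")
conn.close()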
When exporting the 20131201_releases.xml file I found that some entries don't have the artist attribute.
I'm not sure whether this issue is unique to the 20131201_releases.xml file, and so whether it should be considered a bug in the data or in the code.
Namespace(data_quality=None, date='20131201', file=[], ignore_unknown_tags=True, n=None, output='mongo', params='file://output')
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 217, in main
parseReleases(parser, exporter)
File "discogsparser.py", line 128, in parseReleases
parser.parse(release_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/home/kenan/workspace/discogs-xml2db/discogsreleaseparser.py", line 309, in endElement
self.exporter.storeRelease(self.release)
File "/home/kenan/workspace/discogs-xml2db/mongodbexporter.py", line 243, in storeRelease
if release.artist:
AttributeError: Release instance has no attribute 'artist'
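A defensive workaround on the exporter side is to treat a missing artist attribute as empty rather than assuming it exists. The helper below is a self-contained sketch; only the idea of guarding release.artist comes from the traceback, everything else is illustrative.

# Sketch: a small helper that normalises a possibly-missing attribute, so
# `if release.artist:` style checks can't raise AttributeError.
def safe_attr(obj, name, default=''):
    """Return obj.<name> if present, otherwise a default (here: empty string)."""
    return getattr(obj, name, default)


class FakeRelease(object):
    """Stand-in for a parsed release that lacks an artist element."""
    id = 12345


release = FakeRelease()
if safe_attr(release, 'artist'):
    print('has artist')
else:
    print('no artist element on release %s' % release.id)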
Turn the existing structure into a package, so it can be run with python -m.
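A sketch of what that could look like (layout and package name are a suggestion, not committed files): move the modules under a package directory and add a __main__.py so everything runs as python -m discogs_xml2db.

# Illustrative layout (names are a suggestion, not existing files):
#
#   discogs_xml2db/
#       __init__.py
#       __main__.py          <- enables `python -m discogs_xml2db ...`
#       discogsparser.py
#       postgresexporter.py
#       mongodbexporter.py
#
# discogs_xml2db/__main__.py
import sys

from .discogsparser import main   # assumes discogsparser keeps its main(args)

if __name__ == '__main__':
    main(sys.argv[1:])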
The Discogs data dump now only has image metadata but no URLs, so with the current code the only data we put into the image table is height and width. Because there is no image URI, there is no way to link releases_images and artists_images to the image table. So we need to modify releases_images (artists_images, etc.) to drop the uri columns and add height and width columns, and then drop the image table, which is no longer useful.
Fix-xml needs amending: the dump files are no longer missing the outer XML tags, so if applied it ends up adding duplicate start and end tags.
The Track table has a position column, but for many releases this is not a simple ascending number but something much more difficult to parse (e.g. for vinyl it could be A1, A2, B1). What we really need is an additional number column that is set to 1 for the first track, 2 for the second track, and so on. I assume tracks are in the correct order in the original XML dump file, so this would be a relatively easy fix to postgresimporter.py (but I'm not a Python programmer myself, so I'm not sure where to start).
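Since the tracks come out of the dump in document order, a 1-based number can simply be assigned as they are collected; the sketch below shows only the idea, with made-up data structures rather than the project's Track objects.

# Sketch: derive a simple 1-based track number from document order, keeping
# the original free-form position string (A1, B2, ...) alongside it.
def number_tracks(tracks):
    """tracks: list of dicts in dump order; adds a 'number' key to each."""
    for number, track in enumerate(tracks, start=1):
        track['number'] = number
    return tracks


tracks = [{'position': 'A1'}, {'position': 'A2'}, {'position': 'B1'}]
for t in number_tracks(tracks):
    print(t['number'], t['position'])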
Duplicate records in releases_labels prevent us from adding a primary key, slow down access in queries, and are bad DB design.
jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
Column | Type | Modifiers
------------+---------+-----------
label | text |
release_id | integer |
catno | text |
Indexes:
"releases_labels_catno_idx" btree (catno)
"releases_labels_name_idx" btree (label)
Foreign-key constraints:
"foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)
jthinksearch=> select * from releases_labels where release_id=6155;
label | release_id | catno
--------------+------------+------------
Warp Records | 6155 | WAP 39 CDR
Warp Records | 6155 | WAP 39 CDR
(2 rows)
Add support for artist name variation for release extra artists and track extra artists.
Add option to restrict imports based on the values of data_quality.
Values based on the available entries at http://www.discogs.com/help/voting-guidelines.html
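A sketch of how such an option could look (the option name mirrors the existing data_quality field visible in the Namespace output; the list of levels is my reading of the voting-guidelines page, and the rest is illustrative):

# Sketch: an argparse option restricting imports by data_quality, matched
# case-insensitively against the values from the voting guidelines page.
import argparse

QUALITY_LEVELS = [
    'Entirely Incorrect', 'Needs Major Changes', 'Needs Minor Changes',
    'Needs Vote', 'Correct', 'Complete and Correct',
]

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data-quality',
    action='append',
    choices=[q.lower() for q in QUALITY_LEVELS],
    type=str.lower,
    help='only import records whose data_quality matches (repeatable)')

args = parser.parse_args(['--data-quality', 'correct',
                          '--data-quality', 'complete and correct'])

def should_import(record_quality, wanted):
    """Import everything when no filter is given, else compare case-insensitively."""
    return not wanted or record_quality.lower() in wanted

print(should_import('Correct', args.data_quality))      # True
print(should_import('Needs Vote', args.data_quality))   # False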
Got this output, and the import stopped, after trying to import the 20141001 release data.
Traceback (most recent call last):
File "discogsparser.py", line 241, in
main(sys.argv[1:])
File "discogsparser.py", line 236, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 171, in parseMasters
parser.parse(master_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
self._cont_handler.endElement(name)
File "/home//Discogs_Importer/discogs-xml2db-master/discogsmasterparser.py", line 173, in endElement
self.exporter.storeMaster(self.master)
File "/home//Discogs_Importer/discogs-xml2db-master/postgresexporter.py", line 323, in storeMaster
(img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
Output as follows
INSERT 0 1
INSERT 0 9522448
ERROR: insert or update on table "artists_images" violates foreign key constraint "artists_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/A-1-1138987958.jpeg) is not present in table "image".
ERROR: insert or update on table "labels_images" violates foreign key constraint "labels_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/L-58127-1255729347.jpeg) is not present in table "image".
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
INSERT 0 9520322
My specific requirement for this is so that I can load the Discogs and MusicBrainz tables into the same database and then run queries involving tables from both datasets. But I think this is a good general improvement anyway.
The very good news is that the discogs dumps are now stored on Amazon S3, which is much quicker and more reliable, but it does mean this script doesn't work (view the source of http://data.discogs.com/).
You can simply download files like this:
wget http://data.discogs.com.s3-us-west-2.amazonaws.com/data/discogs_20151001_artists.xml.gz
......
but I don't know if there is a way to fix the script so that it will automatically get the latest dumps.
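One possible approach (a sketch, Python 3): data.discogs.com serves an S3 bucket listing, so the newest dump date can be scraped out of the listing and turned into download URLs. The download URL layout is copied from the wget example above; the listing format is an assumption.

# Sketch: discover the newest dump date by reading the S3 bucket listing that
# data.discogs.com returns, then print the download URLs for each entity.
import re
import urllib.request

LISTING_URL = 'http://data.discogs.com/'
BASE_URL = 'http://data.discogs.com.s3-us-west-2.amazonaws.com/'

listing = urllib.request.urlopen(LISTING_URL).read().decode('utf-8')
dates = sorted(set(re.findall(r'discogs_(\d{8})_artists\.xml\.gz', listing)))
if dates:
    latest = dates[-1]
    for entity in ('artists', 'labels', 'masters', 'releases'):
        print(BASE_URL + 'data/discogs_{}_{}.xml.gz'.format(latest, entity))
else:
    print('no dumps found in listing')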
I am loading the XML files into PostgreSQL. I downloaded the 20140803 and 20140701 versions; with both of them I hit errors when loading the releases XML file.
For 20140803 version:
xml.sax._exceptions.SAXParseException: discogs_20140803_releases.xml:53155:784: not well-formed (invalid token)
For 20140701 version:
xml.sax._exceptions.SAXParseException: discogs_20140701_releases.xml:3763466:690: not well-formed (invalid token)
Which version of discogs did you load into PostgreSQL successfully?
Thank you,
Ying
The master_id field on the discogs.release table is always empty when loading release data into Postgres. The data does exist in the original XML file.
(Apologies if I've misunderstood this, but this is how I'm seeing it.)
Shouldn't the artist join field be stored as a column in track_artists and release_artists, rather than having the separate tracks_artists_joins and release_artist_joins tables?
Certainly, looking at the XML returned by the web service http://api.discogs.com/release/3, join data is stored with the artist itself. Whereas with the database tables I'm not sure how I'm meant to retrieve the data: I can look up the artists for a track using track_artists, then I think if the number of artists for one track is more than one I have to look up track_artists_joins by track id and find the right row by comparing artist1 and artist2 with the rows from the first query. This isn't too bad when a song has just two artists, but if the song has four artists how do I know whether artist1/artist2 goes with the 1st and 2nd, 2nd and 3rd, 1st and 3rd, etcetera, because the artist order is not defined anywhere.
It would be better if track_artists had a position column (to signify the position of the artist on the track) and a join column (which would always be empty for the last artist or when the track is by only one artist). The same logic applies to release artists.
The November 11 2011 dump of artists contains a new numeric id field, e.g. <artist>...<id>123</id>...</artist>.
If this becomes a permanent fixture, consider adding an artist_id to the PGSQL artist table.
I asked this question in the Discogs forum but they said that they don't provide any solution: http://www.discogs.com/help/forums/topic/340340
Do you have any manual solution to this frustrating problem?
When I try to import the artists XML file with
discogsparser.py -i -o mongo -p "file:///home/ubuntu/discogs/?uniq=md5" discogs_20120501_artists.xml
it returns the following output.
Namespace(data_quality=None, date=None, file=['discogs_20120501_artists.xml'], ignore_unknown_tags=True, n=None, output='mongo', params='file:///home/ubuntu/discogs/?uniq=md5')
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 215, in main
parseArtists(parser, exporter)
File "discogsparser.py", line 73, in parseArtists
parser.parse(artist_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/home/ubuntu/discogs-xml2db/discogsartistparser.py", line 117, in endElement
self.exporter.storeArtist(self.artist)
File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 240, in storeArtist
self.execute('artists', artist)
File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 206, in execute
doc.updated_on = "%s" % date.today()
AttributeError: 'dict' object has no attribute 'updated_on'
select * from releases_artists where release_id=2;
gives
"Mr. James Barth & A.D.";2
"JTS Studios";2
"MPO";2
But the release artist is "Mr. James Barth & A.D."; the other two entries are just associated companies:
http://api.discogs.com/release/2
http://www.discogs.com/release/2
I don't think they should be ending up in this table, as it makes it impossible to identify who the actual release artist is.
Hello,
It looks like the md5 files described in the README can't be created.
$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101
# this results in 2 files created for each class, e.g. an artists.json file and an artists.md5 file
I'm using Python 2.7.5
The labels XML seems to have gotten an ID tag. Maybe it's time for a refresh of the script to take into account the latest versions of each XML.
Since November 2011 discogs have added monthly dumps for masters.
http://www.discogs.com/data/
Is there a mechanism in place for updating a PostgreSQL database from the latest XML dumps? I couldn't find anything other than what the README mentioned about MongoDB.
I want to do a direct (or not) import into MongoDB.
I have downloaded and extracted the Discogs releases file.
I execute this and receive an error:
$ ./discogsparser.py -i -o mongo -p "mongodb://user:pass@localhost/discogs?uniq=md5" -d 20120901
File "./discogsparser.py", line 74
except ParserStopError as pse:
^
SyntaxError: invalid syntax
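The except ParserStopError as pse: form only exists on Python 2.6 and newer; Python 2.5 and older report exactly this SyntaxError, so the likely cause is the shebang picking up an old interpreter. A quick check (a sketch, not project code):

# Quick sanity check: 'except X as e' needs Python 2.6+.
import sys

print(sys.version)
assert sys.version_info >= (2, 6), \
    "discogsparser.py uses 'except ... as ...', which needs Python 2.6 or newer"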
To be able to compute which records MongoDB needs to re-index, I need an updated_on field that should reflect the date (or the dump file) the record originated from, either as a new import or as an update to a previous import.
Need to make sure it doesn't interfere with MD5 calculations.
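One way to keep the new field from perturbing the hashes (a sketch: updated_on is the field proposed above, the md5-over-JSON idea mirrors the uniq=md5 mode, and the helper itself is illustrative) is to strip volatile bookkeeping fields before hashing.

# Sketch: compute the md5 of a document *without* volatile bookkeeping fields,
# so adding updated_on does not make every record look changed.
import hashlib
import json

VOLATILE_FIELDS = ('updated_on',)

def stable_md5(doc):
    """md5 of the document with volatile fields removed and keys sorted."""
    trimmed = {k: v for k, v in doc.items() if k not in VOLATILE_FIELDS}
    payload = json.dumps(trimmed, sort_keys=True)
    return hashlib.md5(payload.encode('utf-8')).hexdigest()


a = {'id': 1, 'name': 'Disintegrator', 'updated_on': '2015-06-17'}
b = {'id': 1, 'name': 'Disintegrator', 'updated_on': '2015-07-01'}
print(stable_md5(a) == stable_md5(b))   # True: only updated_on differs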
Hi guys,
What are the best ideas for speeding up processing with discogs-xml2db?
Is it possible to run it in parallel?
Right now it occupies only 1 CPU core, and processing the latest 20 GB XML takes ~3 hours (for a mongo file dump):
$ time discogsparser.py -i -o mongo -p "file:///HDD1/2015-06" /HDD2/2015-06/discogs_20150601_releases.xml
real 191m27.478s
user 189m22.749s
sys 2m0.714s
I am playing with http://www.gnu.org/software/parallel/ now, but maybe there are some other options.
I've also tried omitting the md5 hashes to speed things up a bit, but the overall processing time didn't change much.
Any other suggestions are more than welcome!
Thanks!
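Since the parser itself is single-threaded, the simplest win is coarse-grained: run one parser process per dump file (artists, labels, masters, releases) at the same time. A sketch using only the standard library; the paths and options are copied from the command above and are just examples.

# Sketch: launch one discogsparser.py process per dump file so each gets its
# own CPU core.  The paths and options here are only examples.
import subprocess

FILES = [
    '/HDD2/2015-06/discogs_20150601_artists.xml',
    '/HDD2/2015-06/discogs_20150601_labels.xml',
    '/HDD2/2015-06/discogs_20150601_masters.xml',
    '/HDD2/2015-06/discogs_20150601_releases.xml',
]

procs = [
    subprocess.Popen(['python', 'discogsparser.py', '-i', '-o', 'mongo',
                      '-p', 'file:///HDD1/2015-06', f])
    for f in FILES
]
for p in procs:
    p.wait()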
Currently, if the XML feed contains a release extra artist with multiple roles, they are added as one row in the release_extraartist table with an array of roles. Fair enough; however, the same artist is often also listed as a separate extra artist for the same release (e.g. http://api.discogs.com/release/2), so we end up with multiple rows in the table for the same artist and release anyway, negating the advantage of putting the roles in an array.
(I have changed my code so that the release_extraartist table defines role as a simple text field and adds a new row for each release/artist/role combination; the same logic applies to track_extraartists.)
Import of the release data is slow, not least because it processes each record one by one, sequentially adding them to the database. I think the bottleneck is in the code; the database could cope with multiple statements being fired at the same time.
I think the code could be sped up by parallelizing the import of the data; I wonder whether just manually splitting the file into three chunks and running the import on the three files in parallel would work.
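Independent of parallelism, batching the INSERTs usually helps on its own. The sketch below shows the idea with psycopg2's executemany and a buffered flush; the table and column names are placeholders, not the project's actual exporter code.

# Sketch: buffer rows and flush them to Postgres in batches instead of issuing
# one INSERT per record.  Table and column names are placeholders.
import psycopg2

BATCH_SIZE = 1000

class BatchedWriter(object):
    def __init__(self, conn):
        self.conn = conn
        self.rows = []

    def add(self, release_id, title):
        self.rows.append((release_id, title))
        if len(self.rows) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO release (id, title) VALUES (%s, %s)", self.rows)
        self.conn.commit()
        self.rows = []


conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
writer = BatchedWriter(conn)
writer.add(1, 'Stockholm')
writer.flush()
conn.close()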
The script currently overrides the release.anv property for each anv node it finds - last one wins. So, first, the Extraartist should support anv (change the PGSQL scripts). That would allow the anv property to be attached to the proper nodes: release.extraartists or track.extraartist.
Rows in the releases_artists table do not match artists in the artist table if the release artist is an artist name variation. This query applied to the Postgres DB shows the problem (Various artists are filtered out because that different problem is raised in a separate issue):
select distinct r1.release_id, r1.artist_name from
releases_artists r1
left join artist r2
on r1.artist_name= r2.name
where r2.name is null
and r1.artist_name!='Various'
order by r1.artist_name
An example: this release
http://api.discogs.com/release/2294510
has the artist 'Jürgen Von Manger', but the actual artist name uses a lowercase 'v' in 'von':
http://www.discogs.com/artist/566712
Jürgen von Manger
The way to fix this would be, when loading the database from the release dump file, to populate the releases_artists table with artist_id instead of artist_name (as both are included in the dump). Clearly this requires changes to the database, which I can do, but I'm struggling to fix the Python code.
I guess we have the same problem with track_artists as well.
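The Python side of that change is mostly about capturing the <id> child of each release artist while parsing. The sketch below is a self-contained SAX handler illustrating the idea; the element names follow the release XML as I understand it and are not copied from the project's parser.

# Sketch: collect (artist id, artist name) pairs for a release so the importer
# could populate releases_artists with artist_id instead of artist_name.
import io
import xml.sax

SAMPLE = b"""<release id="2294510">
  <artists>
    <artist><id>566712</id><name>J\xc3\xbcrgen von Manger</name></artist>
  </artists>
</release>"""


class ReleaseArtistHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_artists = False
        self.field = None
        self.current = {}
        self.artists = []          # list of {'id': ..., 'name': ...}

    def startElement(self, name, attrs):
        if name == 'artists':
            self.in_artists = True
        elif self.in_artists and name in ('id', 'name'):
            self.field = name
            self.current.setdefault(name, '')

    def characters(self, content):
        if self.field:
            self.current[self.field] += content

    def endElement(self, name):
        if name == 'artists':
            self.in_artists = False
        elif name == 'artist' and self.in_artists:
            self.artists.append(self.current)
            self.current = {}
        elif name == self.field:
            self.field = None


handler = ReleaseArtistHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.artists)   # [{'id': '566712', 'name': 'Jürgen von Manger'}]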
Replace --ignore-unknown-tags with a --strict-tags option.
The project should have better documentation and be on https://readthedocs.org.
If a group has members, the members section now contains artist ids, not just their names; this breaks the Postgres artist parsing, causing it to consider each member an additional artist in their own right.
January 2015 Dump:
grep "22387" discogs_20150101_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)Carlos VasquezJohn SelwayOliver Chesler
February Dump:
grep "22387" discogs_20150201_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)241853Carlos Vasquez17John Selway4563Oliver Chesler