philipmat / discogs-xml2db
Imports the discogs.com monthly XML dumps into databases
License: Apache License 2.0
The process crashed when it hit the "masters" section. Here's the output:
File "discogsparser.py", line 241, in
main(sys.argv[1:])
File "discogsparser.py", line 236, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 171, in parseMasters
parser.parse(master_file)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/local/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/local/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
self._cont_handler.endElement(name)
File "/home/thephltr/webapps/who_pro/discogs_importer/discogsmasterparser.py", line 173, in endElement
self.exporter.storeMaster(self.master)
File "/home/thephltr/webapps/who_pro/discogs_importer/postgresexporter.py", line 323, in storeMaster
(img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
I was stuck at the step "python discogsparser.py". The following is the run information:
python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" -d 20140803
Namespace(data_quality=None, date='20140803', file=[], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')

python discogsparser.py -o pgsql -p "dbname=discogs user=discogs password=discogs" discogs_20140803_artists.xml
Namespace(data_quality=None, date=None, file=['discogs_20140803_artists.xml'], ignore_unknown_tags=False, n=None, output='pgsql', params='dbname=discogs user=discogs password=discogs')
It looks like the command didn't execute. I don't know what's wrong with it; I am a Python newbie.
Thank you,
Ying
I'm trying to import data into MongoDB. As you mention in the README, direct import takes longer than the other option (XML to JSON plus mongoimport). However, even when I tried the other option, I couldn't easily import it. The releases.xml file is ~9 GB and it took more than 24 hours to export the JSON file. After that process the releases.json file was corrupted, so I couldn't actually import it.
I tried the script on an Amazon EC2 micro instance and on another VPS with a dual-core processor. The results were the same: the files were ~8 GB and corrupted. The other problem is that the script causes CPU overload, so leaving discogsparser.py running for 24 hours isn't really an option. I couldn't figure out what the problem is. Is xml.sax slow, or do the systems I used simply not have enough resources?
While parsing the 20121201 masters file I got the following stack trace:
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 218, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 153, in parseMasters
parser.parse(master_file)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/Users/burc/Documents/Development/discogs-xml2db/discogsmasterparser.py", line 170, in endElement
self.exporter.storeMaster(self.master)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 250, in storeMaster
self.execute('masters', master)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 203, in execute
uniq, md5 = self._is_uniq(collection, what.id, json_string)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 181, in _is_uniq
return self._quick_uniq.is_uniq(collection, id, json_string)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 110, in is_uniq
self._load(collection)
File "/Users/burc/Documents/Development/discogs-xml2db/mongodbexporter.py", line 88, in _load
if self._hashes[name] is None:
KeyError: 'masters'
Without studying the code in detail, changing line 84 of mongodbexporter.py to
self._hashes = {'artists': None, 'labels': None, 'releases': None, 'masters': None }
fixed the issue.
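For reference, a more defensive shape for that fix is to create the per-collection hash slot on demand instead of hard-coding the collection names. The sketch below is illustrative only; the method names echo the traceback (_load, is_uniq) but the body is not the project's actual code.

# Minimal sketch, not the project's actual code: create the per-collection
# hash slot on demand so an unexpected collection such as 'masters' never
# raises KeyError.
class QuickUniqSketch(object):
    def __init__(self):
        # no hard-coded collection names here
        self._hashes = {}

    def _load(self, name):
        # setdefault returns the existing dict or installs an empty one
        return self._hashes.setdefault(name, {})

    def is_uniq(self, collection, doc_id, md5sum):
        seen = self._load(collection)
        if seen.get(doc_id) == md5sum:
            return False           # identical document already recorded
        seen[doc_id] = md5sum      # record the new or changed document
        return True

u = QuickUniqSketch()
print(u.is_uniq('masters', 1, 'abc'))   # True: first time seen
print(u.is_uniq('masters', 1, 'abc'))   # False: duplicate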
Currently the image type is stored in the image table, but it is actually a function of the relationship between the image and each release/label/artist it is linked to. There are two problems with the current approach.
Solution: Move the image type column from the image table to the releases_images, artists_images, and labels_images tables.
Error: Unknown Release element 'sub_tracks' when building the release table from the 01072014 dump; need to handle or ignore the new sub_tracks entity.
in fix_db.sql
--Remove duplicate master rows
insert into masters_images
select distinct t1.image_uri, t1.type, t1.master_id
from tmp_masters_images t1
left join masters_images t2
on t1.image_uri = t2.image_uri
and t1.type = t2.type
and t1.master_id = t2.master_id
where t2.image_uri is null
;
fails with
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
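A possible adjustment (a sketch only, assuming psycopg2 and the table and column names used in the fix_db.sql snippet above): relax the NOT NULL constraint on masters_images.type and make the duplicate check NULL-safe with IS NOT DISTINCT FROM, so rows whose type is NULL still de-duplicate correctly.

# Sketch: NULL-safe variant of the de-duplication step, run via psycopg2.
# The connection parameters and explicit column list are assumptions.
import psycopg2

FIX_SQL = """
ALTER TABLE masters_images ALTER COLUMN type DROP NOT NULL;

INSERT INTO masters_images (image_uri, type, master_id)
SELECT DISTINCT t1.image_uri, t1.type, t1.master_id
FROM tmp_masters_images t1
LEFT JOIN masters_images t2
  ON  t1.image_uri = t2.image_uri
  AND t1.type      IS NOT DISTINCT FROM t2.type
  AND t1.master_id = t2.master_id
WHERE t2.image_uri IS NULL;
"""

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:   # commits on success
    cur.execute(FIX_SQL)
conn.close()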
Hello,
While playing with md5, the following was discovered.
I tested with the following data files:
discogs_20141001_artists.xml (main file)
discogs_20150301_artists.xml (md5 for upsert)
~]# du -h artists.*
1.2G discogs_20141001_artists.json
315M discogs_20150301_artists.json
147M artists.md5
~]# wc -l *
3523086 discogs_20141001_artists.json
799269 discogs_20150301_artists.json
3769950 artists.md5
~]# time mongoimport -d dataset discogs_20150301_artists.json
2015-06-17T15:38:46.106+0300 no collection specified
2015-06-17T15:38:46.106+0300 using filename 'artists' as collection
2015-06-17T15:38:46.108+0300 connected to: localhost
2015-06-17T15:38:49.107+0300 [#.......................] dataset.artists 69.1 MB/1.1 GB (6.0%)
2015-06-17T15:38:52.107+0300 [##......................] dataset.artists 136.4 MB/1.1 GB (11.9%)
2015-06-17T15:38:55.109+0300 [####....................] dataset.artists 197.2 MB/1.1 GB (17.3%)
2015-06-17T15:38:58.107+0300 [#####...................] dataset.artists 248.6 MB/1.1 GB (21.8%)
2015-06-17T15:39:01.107+0300 [######..................] dataset.artists 308.1 MB/1.1 GB (27.0%)
2015-06-17T15:39:04.108+0300 [#######.................] dataset.artists 359.5 MB/1.1 GB (31.5%)
2015-06-17T15:39:07.107+0300 [########................] dataset.artists 397.7 MB/1.1 GB (34.8%)
2015-06-17T15:39:10.107+0300 [#########...............] dataset.artists 439.6 MB/1.1 GB (38.5%)
2015-06-17T15:39:13.107+0300 [##########..............] dataset.artists 496.0 MB/1.1 GB (43.4%)
2015-06-17T15:39:16.107+0300 [###########.............] dataset.artists 546.8 MB/1.1 GB (47.8%)
2015-06-17T15:39:19.108+0300 [############............] dataset.artists 601.0 MB/1.1 GB (52.6%)
2015-06-17T15:39:22.107+0300 [#############...........] dataset.artists 638.6 MB/1.1 GB (55.9%)
2015-06-17T15:39:25.107+0300 [##############..........] dataset.artists 694.1 MB/1.1 GB (60.7%)
2015-06-17T15:39:28.107+0300 [###############.........] dataset.artists 741.0 MB/1.1 GB (64.8%)
2015-06-17T15:39:31.107+0300 [################........] dataset.artists 766.8 MB/1.1 GB (67.1%)
2015-06-17T15:39:34.107+0300 [#################.......] dataset.artists 815.8 MB/1.1 GB (71.4%)
2015-06-17T15:39:37.107+0300 [##################......] dataset.artists 873.0 MB/1.1 GB (76.4%)
2015-06-17T15:39:40.107+0300 [###################.....] dataset.artists 920.1 MB/1.1 GB (80.5%)
2015-06-17T15:39:43.107+0300 [####################....] dataset.artists 971.5 MB/1.1 GB (85.0%)
2015-06-17T15:39:46.107+0300 [#####################...] dataset.artists 1.0 GB/1.1 GB (89.8%)
2015-06-17T15:39:49.108+0300 [######################..] dataset.artists 1.1 GB/1.1 GB (94.7%)
2015-06-17T15:39:52.107+0300 [#######################.] dataset.artists 1.1 GB/1.1 GB (98.5%)
2015-06-17T15:39:53.895+0300 imported 3523086 documents
real 1m7.806s
~]# time mongoimport --upsert --upsertFields 'id' -d dataset discogs_20150301_artists.json
2015-06-17T15:40:35.177+0300 no collection specified
2015-06-17T15:40:35.177+0300 using filename 'artists' as collection
2015-06-17T15:40:35.179+0300 connected to: localhost
2015-06-17T15:40:38.178+0300 [........................] dataset.artists 7.6 MB/314.7 MB (2.4%)
[... the 2.4% line repeats every 3 seconds until 15:41:41 ...]
2015-06-17T15:41:44.178+0300 [#.......................] dataset.artists 14.0 MB/314.7 MB (4.4%)
[... the 4.4% line repeats every 3 seconds until 15:45:11 ...]
2015-06-17T15:45:14.178+0300 [#.......................] dataset.artists 20.1 MB/314.7 MB (6.4%)
[... the 6.4% line repeats every 3 seconds until 15:50:05 ...]
2015-06-17T15:50:05.178+0300 [#.......................] dataset.artists 20.1 MB/314.7 MB (6.4%)
^C
real 9m31.152s
Should have a way to test that new discogs XML releases do not break the parsing.
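One possible shape for such a test (a sketch, not existing project code): keep a tiny XML fixture per dump format and run it through the same xml.sax machinery the importer uses, so a renamed or brand-new element fails a unit test instead of a multi-hour import. The fixture and the KNOWN_TAGS set below are illustrative.

# Sketch of a parsing regression test; the fixture and handler here are
# illustrative stand-ins, not the project's real parser classes.
import io
import unittest
import xml.sax


class CollectingHandler(xml.sax.ContentHandler):
    """Records every element name so unexpected tags can be asserted on."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.seen = set()

    def startElement(self, name, attrs):
        self.seen.add(name)


SAMPLE_RELEASE_XML = b"""<releases>
  <release id="1" status="Accepted">
    <title>Stockholm</title>
    <tracklist><track><position>A1</position></track></tracklist>
  </release>
</releases>"""

KNOWN_TAGS = {"releases", "release", "title", "tracklist", "track", "position"}


class TestDumpFormat(unittest.TestCase):
    def test_no_unknown_elements(self):
        handler = CollectingHandler()
        xml.sax.parse(io.BytesIO(SAMPLE_RELEASE_XML), handler)
        self.assertEqual(handler.seen - KNOWN_TAGS, set())


if __name__ == "__main__":
    unittest.main()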
The Postgres database is missing indexes; there don't seem to be any, which makes querying the data slower than it needs to be.
There is some interesting work done on some forks. Merge them back:
No need to have 'discogs' in the name; everything here is to do with discogs.
No need for 'pgsql' in the name; the .sql suffix already covers it.
Longer names mean more typing.
Renamed as follows:
discogs-indexes-pgsql.sql -> create_indexes.sql
discogs-pgsql.sql -> create_tables.sql
discogs-fixdb-pgsql.sql -> fix_db.sql
'Various' is referred to as the artist with id 194 on the website, i.e. clicking a 'Various' link in the database will take you to http://www.discogs.com/artist/194. However, that artist id doesn't actually exist in the data dumps, and hence not in the database once imported.
This is problematic because it means relational links such as
SELECT r.release_id,
       a.id AS artistId,
       a.name
FROM releases_artists AS r
INNER JOIN artist a ON r.artist_name = a.name
will not return results for Various artists. After the data has been imported we should create a row for 'Various' so that queries on the database don't have to special-case Various artists.
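A minimal post-import step along those lines could look like the sketch below; it assumes psycopg2 and an artist table with id and name columns (the real schema may differ), and uses the id 194 mentioned above.

# Sketch: add a synthetic 'Various' artist row after the import, so joins
# on artist no longer need to special-case Various.  Table and column names
# are assumptions based on the query above.
import psycopg2

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO artist (id, name)
        SELECT 194, 'Various'
        WHERE NOT EXISTS (SELECT 1 FROM artist WHERE id = 194)
        """
    )
conn.close()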
I created a database in PostgreSQL and imported the database schema. Then I tried to run discogsparser.py with the following command:
python discogsparser.py -o pgsql -p "host=localhost dbname=discogs user=[user] password=[pass]" discogs_20120501_releases.xml.gz
I also tried the json format but the result doesn't change. I'm getting something like this:
Namespace(data_quality=None, date=None, file=['discogs_20120501_releases.xml.gz'], ignore_unknown_tags=False, n=None, output='pgsql', params='host=localhost dbname=discogs user=[user] password=[pass]')
and discogsparser.py stops working without raising any exception.
In the original XML the formats are listed in order, so for a multi-format release the first format listed relates to the first tracks on the release. But we lose this order once the data is imported into the Postgres database, because there is no order column in the release_formats table. This makes it impossible to accurately map the track list to a series of mediums and set the medium format accordingly.
Running create_indexes gives
ERROR: index row size 2888 exceeds maximum 2712 for index "releases_extraartists_name_idx"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
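Following the HINT, the oversized btree entry can be avoided with an expression index over md5() of the value. The sketch below is only a guess at how that could be applied here: the column (shown as name) is a placeholder for whichever text column the original releases_extraartists index covers.

# Sketch: swap the failing plain index for an md5() expression index, per the
# Postgres HINT.  The column 'name' is a placeholder, not a verified name.
import psycopg2

conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS releases_extraartists_name_idx")
    cur.execute("CREATE INDEX releases_extraartists_name_idx "
                "ON releases_extraartists (md5(name))")
conn.close()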
When exporting the 20131201_releases.xml file I found that some entries don't have the artist attribute.
I'm not sure whether this issue is unique to the 20131201_releases.xml file, and so whether it should be considered a bug in the data or in the code.
Namespace(data_quality=None, date='20131201', file=[], ignore_unknown_tags=True, n=None, output='mongo', params='file://output')
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 217, in main
parseReleases(parser, exporter)
File "discogsparser.py", line 128, in parseReleases
parser.parse(release_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/home/kenan/workspace/discogs-xml2db/discogsreleaseparser.py", line 309, in endElement
self.exporter.storeRelease(self.release)
File "/home/kenan/workspace/discogs-xml2db/mongodbexporter.py", line 243, in storeRelease
if release.artist:
AttributeError: Release instance has no attribute 'artist'
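A defensive workaround on the exporter side is to treat a missing artist attribute as empty rather than assuming it exists. The helper below is a self-contained sketch; only the idea of guarding release.artist comes from the traceback, everything else is illustrative.

# Sketch: a small helper that normalises a possibly-missing attribute, so
# `if release.artist:` style checks can't raise AttributeError.
def safe_attr(obj, name, default=''):
    """Return obj.<name> if present, otherwise a default (here: empty string)."""
    return getattr(obj, name, default)


class FakeRelease(object):
    """Stand-in for a parsed release that lacks an artist element."""
    id = 12345


release = FakeRelease()
if safe_attr(release, 'artist'):
    print('has artist')
else:
    print('no artist element on release %s' % release.id)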
Turn the existing structure into a package, so it can be run with python -m.
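A sketch of what that could look like (layout and package name are a suggestion, not committed files): move the modules under a package directory and add a __main__.py so everything runs as python -m discogs_xml2db.

# Illustrative layout (names are a suggestion, not existing files):
#
#   discogs_xml2db/
#       __init__.py
#       __main__.py          <- enables `python -m discogs_xml2db ...`
#       discogsparser.py
#       postgresexporter.py
#       mongodbexporter.py
#
# discogs_xml2db/__main__.py
import sys

from .discogsparser import main   # assumes discogsparser keeps its main(args)

if __name__ == '__main__':
    main(sys.argv[1:])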
The Discogs data dump now only has image metadata but no URLs, so with the current code the only data we put into the image table is height and width. Because there is no image URI, there is no way to link releases_images and artists_images to the image table. So we need to modify releases_images (artists_images, etc.) to drop the uri columns and add height and width columns, and then drop the image table, which is no longer useful.
Fix-xml needs amending: the dump files are no longer missing the outer XML tags, so if applied it ends up adding duplicate start and end tags.
The Track table has a position column, but for many releases this is not a simple ascending number but something much more difficult to parse (e.g. for vinyl it could be A1, A2, B1). What we really need is an additional number column that is set to 1 for the first track, 2 for the second track, and so on. I assume tracks are in the correct order in the original XML dump file, so this would be a relatively easy fix to postgresimporter.py (but I'm not a Python programmer myself, so I'm not sure where to start).
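Since the tracks come out of the dump in document order, a 1-based number can simply be assigned as they are collected; the sketch below shows only the idea, with made-up data structures rather than the project's Track objects.

# Sketch: derive a simple 1-based track number from document order, keeping
# the original free-form position string (A1, B2, ...) alongside it.
def number_tracks(tracks):
    """tracks: list of dicts in dump order; adds a 'number' key to each."""
    for number, track in enumerate(tracks, start=1):
        track['number'] = number
    return tracks


tracks = [{'position': 'A1'}, {'position': 'A2'}, {'position': 'B1'}]
for t in number_tracks(tracks):
    print(t['number'], t['position'])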
Duplicate records in releases_labels prevent us from adding a primary key, slow down access in queries, and are bad DB design.
jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
Column | Type | Modifiers
------------+---------+-----------
label | text |
release_id | integer |
catno | text |
Indexes:
"releases_labels_catno_idx" btree (catno)
"releases_labels_name_idx" btree (label)
Foreign-key constraints:
"foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)
jthinksearch=> select * from releases_labels where release_id=6155;
label | release_id | catno
--------------+------------+------------
Warp Records | 6155 | WAP 39 CDR
Warp Records | 6155 | WAP 39 CDR
(2 rows)
Add support for artist name variation for release extra artists and track extra artists.
Add option to restrict imports based on the values of data_quality.
Values based on the available entries at http://www.discogs.com/help/voting-guidelines.html
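A sketch of how such an option could look (the option name mirrors the existing data_quality field visible in the Namespace output; the list of levels is my reading of the voting-guidelines page, and the rest is illustrative):

# Sketch: an argparse option restricting imports by data_quality, matched
# case-insensitively against the values from the voting guidelines page.
import argparse

QUALITY_LEVELS = [
    'Entirely Incorrect', 'Needs Major Changes', 'Needs Minor Changes',
    'Needs Vote', 'Correct', 'Complete and Correct',
]

parser = argparse.ArgumentParser()
parser.add_argument(
    '--data-quality',
    action='append',
    choices=[q.lower() for q in QUALITY_LEVELS],
    type=str.lower,
    help='only import records whose data_quality matches (repeatable)')

args = parser.parse_args(['--data-quality', 'correct',
                          '--data-quality', 'complete and correct'])

def should_import(record_quality, wanted):
    """Import everything when no filter is given, else compare case-insensitively."""
    return not wanted or record_quality.lower() in wanted

print(should_import('Correct', args.data_quality))      # True
print(should_import('Needs Vote', args.data_quality))   # False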
Got this output, and the import stopped, after trying to import the 20141001 release data.
Traceback (most recent call last):
File "discogsparser.py", line 241, in
main(sys.argv[1:])
File "discogsparser.py", line 236, in main
parseMasters(parser, exporter)
File "discogsparser.py", line 171, in parseMasters
parser.parse(master_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 210, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 307, in end_element
self._cont_handler.endElement(name)
File "/home//Discogs_Importer/discogs-xml2db-master/discogsmasterparser.py", line 173, in endElement
self.exporter.storeMaster(self.master)
File "/home//Discogs_Importer/discogs-xml2db-master/postgresexporter.py", line 323, in storeMaster
(img.uri, img.image_type, master.id))
AttributeError: ImageInfo instance has no attribute 'image_type'
Output as follows
INSERT 0 1
INSERT 0 9522448
ERROR: insert or update on table "artists_images" violates foreign key constraint "artists_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/A-1-1138987958.jpeg) is not present in table "image".
ERROR: insert or update on table "labels_images" violates foreign key constraint "labels_images_image_uri_fkey"
DETAIL: Key (image_uri)=(http://api.discogs.com/image/L-58127-1255729347.jpeg) is not present in table "image".
ERROR: null value in column "type" violates not-null constraint
DETAIL: Failing row contains (http://api.discogs.com/image/R-1-1193812031.jpeg, null, 5427).
INSERT 0 9520322
My specific requirement for this is so that I can load the Discogs and MusicBrainz tables into the same database and then run queries involving tables from both datasets. But I think this is a good general improvement anyway.
The very good news is that the discogs dumps are now stored on Amazon S3, which is much quicker and more reliable, but it does mean this script doesn't work (view the source of http://data.discogs.com/).
You can simply download files like this:
wget http://data.discogs.com.s3-us-west-2.amazonaws.com/data/discogs_20151001_artists.xml.gz
......
but I don't know if there is a way to fix the script so that it will automatically get the latest dumps.
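One possible approach (a sketch, Python 3): data.discogs.com serves an S3 bucket listing, so the newest dump date can be scraped out of the listing and turned into download URLs. The download URL layout is copied from the wget example above; the listing format is an assumption.

# Sketch: discover the newest dump date by reading the S3 bucket listing that
# data.discogs.com returns, then print the download URLs for each entity.
import re
import urllib.request

LISTING_URL = 'http://data.discogs.com/'
BASE_URL = 'http://data.discogs.com.s3-us-west-2.amazonaws.com/'

listing = urllib.request.urlopen(LISTING_URL).read().decode('utf-8')
dates = sorted(set(re.findall(r'discogs_(\d{8})_artists\.xml\.gz', listing)))
if dates:
    latest = dates[-1]
    for entity in ('artists', 'labels', 'masters', 'releases'):
        print(BASE_URL + 'data/discogs_{}_{}.xml.gz'.format(latest, entity))
else:
    print('no dumps found in listing')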
I am loading the XML files into PostgreSQL. I downloaded the 20140803 and 20140701 versions; with both of them I hit errors when loading the releases XML file.
For 20140803 version:
xml.sax._exceptions.SAXParseException: discogs_20140803_releases.xml:53155:784: not well-formed (invalid token)
For 20140701 version:
xml.sax._exceptions.SAXParseException: discogs_20140701_releases.xml:3763466:690: not well-formed (invalid token)
Which version of discogs did you load into PostgreSQL successfully?
Thank you,
Ying
The master_id field on the discogs.release table is always empty when loading release data into Postgres. The data does exist in the original XML file.
(Apologies if I've misunderstood this, but this is how I'm seeing it.)
Shouldn't the artist join field be stored as a column in track_artists and release_artists, rather than having the separate tracks_artists_joins and release_artist_joins tables?
Certainly, looking at the XML returned by the web service http://api.discogs.com/release/3, join data is stored with the artist itself. Whereas with the database tables I'm not sure how I'm meant to retrieve the data: I can look up the artists for a track using track_artists, then I think if the number of artists for one track is more than one I have to look up track_artists_joins by track id and find the right row by comparing artist1 and artist2 with the rows from the first query. This isn't too bad when a song has just two artists, but if the song has four artists how do I know whether artist1/artist2 goes with the 1st and 2nd, 2nd and 3rd, 1st and 3rd, etcetera, because the artist order is not defined anywhere.
It would be better if track_artists had a position column (to signify the position of the artist on the track) and a join column (which would always be empty for the last artist or when the track is by only one artist). The same logic applies to release artists.
The November 11 2011 dump of artists contains a new numeric id field, e.g. <artist>...<id>123</id>...</artist>.
If this becomes a permanent fixture, consider adding an artist_id to the PGSQL artist table.
I asked this question in the Discogs forum but they said that they don't provide any solution: http://www.discogs.com/help/forums/topic/340340
Do you have any manual solution to this frustrating problem?
When I try to import the artists XML file with
discogsparser.py -i -o mongo -p "file:///home/ubuntu/discogs/?uniq=md5" discogs_20120501_artists.xml
it returns the following output.
Namespace(data_quality=None, date=None, file=['discogs_20120501_artists.xml'], ignore_unknown_tags=True, n=None, output='mongo', params='file:///home/ubuntu/discogs/?uniq=md5')
Traceback (most recent call last):
File "discogsparser.py", line 223, in <module>
main(sys.argv[1:])
File "discogsparser.py", line 215, in main
parseArtists(parser, exporter)
File "discogsparser.py", line 73, in parseArtists
parser.parse(artist_file)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
self._cont_handler.endElement(name)
File "/home/ubuntu/discogs-xml2db/discogsartistparser.py", line 117, in endElement
self.exporter.storeArtist(self.artist)
File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 240, in storeArtist
self.execute('artists', artist)
File "/home/ubuntu/discogs-xml2db/mongodbexporter.py", line 206, in execute
doc.updated_on = "%s" % date.today()
AttributeError: 'dict' object has no attribute 'updated_on'
select * from releases_artists where release_id=2;
gives
"Mr. James Barth & A.D.";2
"JTS Studios";2
"MPO";2
But the release artist is "Mr. James Barth & A.D."; the other two entries are just associated companies:
http://api.discogs.com/release/2
http://www.discogs.com/release/2
I don't think they should be ending up in this table, as it makes it impossible to identify who the actual release artist is.
Hello,
It looks like the md5 files described in the README can't be created.
$ discogsparser.py -i -o mongo -p "file:///tmp/discogs/?uniq=md5" -d 20111101
# this results in 2 files created for each class, e.g. an artists.json file and an artists.md5 file
I'm using Python 2.7.5
The labels XML seems to have gotten an ID tag. Maybe it's time for a refresh of the script to take into account the latest versions of each XML.
Since November 2011 discogs have added monthly dumps for masters.
http://www.discogs.com/data/
Is there a mechanism in place for updating a PostgreSQL database from the latest XML dumps? I couldn't find anything other than what the README mentioned about MongoDB.
I want to do a direct (or not) import into MongoDB.
I have downloaded and extracted the Discogs releases file.
I execute this and receive an error:
$ ./discogsparser.py -i -o mongo -p "mongodb://user:pass@localhost/discogs?uniq=md5" -d 20120901
File "./discogsparser.py", line 74
except ParserStopError as pse:
^
SyntaxError: invalid syntax
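The except ParserStopError as pse: form only exists on Python 2.6 and newer; Python 2.5 and older report exactly this SyntaxError, so the likely cause is the shebang picking up an old interpreter. A quick check (a sketch, not project code):

# Quick sanity check: 'except X as e' needs Python 2.6+.
import sys

print(sys.version)
assert sys.version_info >= (2, 6), \
    "discogsparser.py uses 'except ... as ...', which needs Python 2.6 or newer"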
To be able to compute which records MongoDB needs to re-index, I need an updated_on field that should reflect the date (or the dump file) the record originated from, either as a new import or as an update to a previous import.
Need to make sure it doesn't interfere with MD5 calculations.
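One way to keep the new field from perturbing the hashes (a sketch: updated_on is the field proposed above, the md5-over-JSON idea mirrors the uniq=md5 mode, and the helper itself is illustrative) is to strip volatile bookkeeping fields before hashing.

# Sketch: compute the md5 of a document *without* volatile bookkeeping fields,
# so adding updated_on does not make every record look changed.
import hashlib
import json

VOLATILE_FIELDS = ('updated_on',)

def stable_md5(doc):
    """md5 of the document with volatile fields removed and keys sorted."""
    trimmed = {k: v for k, v in doc.items() if k not in VOLATILE_FIELDS}
    payload = json.dumps(trimmed, sort_keys=True)
    return hashlib.md5(payload.encode('utf-8')).hexdigest()


a = {'id': 1, 'name': 'Disintegrator', 'updated_on': '2015-06-17'}
b = {'id': 1, 'name': 'Disintegrator', 'updated_on': '2015-07-01'}
print(stable_md5(a) == stable_md5(b))   # True: only updated_on differs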
Hi guys,
What are the best ideas for speeding up processing with discogs-xml2db?
Is it possible to run it in parallel?
Right now it occupies only 1 CPU core, and processing the latest 20 GB XML takes ~3 hours (for a mongo file dump):
$ time discogsparser.py -i -o mongo -p "file:///HDD1/2015-06" /HDD2/2015-06/discogs_20150601_releases.xml
real 191m27.478s
user 189m22.749s
sys 2m0.714s
I am playing with http://www.gnu.org/software/parallel/ now, but maybe there are some other options.
I've also tried omitting the md5 hashes to speed things up a bit, but the overall processing time didn't change much.
Any other suggestions are more than welcome!
Thanks!
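Since the parser itself is single-threaded, the simplest win is coarse-grained: run one parser process per dump file (artists, labels, masters, releases) at the same time. A sketch using only the standard library; the paths and options are copied from the command above and are just examples.

# Sketch: launch one discogsparser.py process per dump file so each gets its
# own CPU core.  The paths and options here are only examples.
import subprocess

FILES = [
    '/HDD2/2015-06/discogs_20150601_artists.xml',
    '/HDD2/2015-06/discogs_20150601_labels.xml',
    '/HDD2/2015-06/discogs_20150601_masters.xml',
    '/HDD2/2015-06/discogs_20150601_releases.xml',
]

procs = [
    subprocess.Popen(['python', 'discogsparser.py', '-i', '-o', 'mongo',
                      '-p', 'file:///HDD1/2015-06', f])
    for f in FILES
]
for p in procs:
    p.wait()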
Currently, if the XML feed contains a release extra artist with multiple roles, they are added as one row in the release_extraartist table with an array of roles. Fair enough; however, the same artist is often also listed as a separate extra artist for the same release (e.g. http://api.discogs.com/release/2), so we end up with multiple rows in the table for the same artist and release anyway, negating the advantage of putting the roles in an array.
(I have changed my code so that the release_extraartist table defines role as a simple text field and adds a new row for each release/artist/role combination; the same logic applies to track_extraartists.)
Import of the release data is slow, not least because it processes each record one by one, sequentially adding them to the database. I think the bottleneck is in the code; the database could cope with multiple statements being fired at the same time.
I think the code could be sped up by parallelizing the import of the data; I wonder whether just manually splitting the file into three chunks and running the import on the three files in parallel would work.
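Independent of parallelism, batching the INSERTs usually helps on its own. The sketch below shows the idea with psycopg2's executemany and a buffered flush; the table and column names are placeholders, not the project's actual exporter code.

# Sketch: buffer rows and flush them to Postgres in batches instead of issuing
# one INSERT per record.  Table and column names are placeholders.
import psycopg2

BATCH_SIZE = 1000

class BatchedWriter(object):
    def __init__(self, conn):
        self.conn = conn
        self.rows = []

    def add(self, release_id, title):
        self.rows.append((release_id, title))
        if len(self.rows) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with self.conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO release (id, title) VALUES (%s, %s)", self.rows)
        self.conn.commit()
        self.rows = []


conn = psycopg2.connect("dbname=discogs user=discogs password=discogs")
writer = BatchedWriter(conn)
writer.add(1, 'Stockholm')
writer.flush()
conn.close()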
The script currently overrides the release.anv property for each anv node it finds - last one wins. So, first, the Extraartist should support anv (change the PGSQL scripts). That would allow the anv property to be attached to the proper nodes: release.extraartists or track.extraartist.
Rows in the releases_artists table do not match artists in the artist table if the release artist is an artist name variation. This query applied to the Postgres DB shows the problem (Various artists are filtered out because that different problem is raised in a separate issue):
select distinct r1.release_id, r1.artist_name from
releases_artists r1
left join artist r2
on r1.artist_name= r2.name
where r2.name is null
and r1.artist_name!='Various'
order by r1.artist_name
An example: this release
http://api.discogs.com/release/2294510
has the artist 'Jürgen Von Manger', but the actual artist name uses a lowercase 'v' in 'von':
http://www.discogs.com/artist/566712
Jürgen von Manger
The way to fix this would be, when loading the database from the release dump file, to populate the releases_artists table with artist_id instead of artist_name (as both are included in the dump). Clearly this requires changes to the database, which I can do, but I'm struggling to fix the Python code.
I guess we have the same problem with track_artists as well.
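The Python side of that change is mostly about capturing the <id> child of each release artist while parsing. The sketch below is a self-contained SAX handler illustrating the idea; the element names follow the release XML as I understand it and are not copied from the project's parser.

# Sketch: collect (artist id, artist name) pairs for a release so the importer
# could populate releases_artists with artist_id instead of artist_name.
import io
import xml.sax

SAMPLE = b"""<release id="2294510">
  <artists>
    <artist><id>566712</id><name>J\xc3\xbcrgen von Manger</name></artist>
  </artists>
</release>"""


class ReleaseArtistHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_artists = False
        self.field = None
        self.current = {}
        self.artists = []          # list of {'id': ..., 'name': ...}

    def startElement(self, name, attrs):
        if name == 'artists':
            self.in_artists = True
        elif self.in_artists and name in ('id', 'name'):
            self.field = name
            self.current.setdefault(name, '')

    def characters(self, content):
        if self.field:
            self.current[self.field] += content

    def endElement(self, name):
        if name == 'artists':
            self.in_artists = False
        elif name == 'artist' and self.in_artists:
            self.artists.append(self.current)
            self.current = {}
        elif name == self.field:
            self.field = None


handler = ReleaseArtistHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.artists)   # [{'id': '566712', 'name': 'Jürgen von Manger'}]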
Replace --ignore-unknown-tags with a --strict-tags option.
The project should have better documentation and be on https://readthedocs.org.
If a group has members, the members section now contains artist ids, not just their names; this breaks the Postgres artist parsing, causing it to consider each member an additional artist in their own right.
January 2015 Dump:
grep "22387" discogs_20150101_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)Carlos VasquezJohn SelwayOliver Chesler
February Dump:
grep "22387" discogs_20150201_artists.xml
22387DisintegratorOliver Chesler & John SelwayThe collaboration of New York City based DJ/Producers Oliver Chesler and John Selway.<data_quality>Correct</data_quality>DesintegratorDisintergratorKoenig CylindersMachines (8)241853Carlos Vasquez17John Selway4563Oliver Chesler