stadt-karlsruhe / ckanext-extractor
A full text and metadata extractor for CKAN
License: GNU Affero General Public License v3.0
Do you have any plans to support the new background jobs system soon?
I have a few custom ckan tag vocabularies for my datasets. It looks like when the worker extracts the text, the vocabulary tags are removed from the dataset.
I haven't looked into the worker code yet and I'm still on [email protected]
Basically the only thing I have using celery (yes, still on celery despite you upgrading it to work with redis on my request, sorry) is this.
When I create a dataset, I assign a couple of vocabulary tags to it.
When I add a PDF resource to it and programmatically request the package immediately afterward, the tags are still set correctly.
A few seconds later they're not set any more.
If I stop the celery worker, the tags will stay in place until I start the worker again.
Any idea why this might be? I'll dive into the worker code ASAP but it's taken me a day or so to track this down to this plugin so it might not be tomorrow.
As always, I'm a huge fan of this extension and appreciate it very much. Just posting here in the meantime in case you can spot the cause right away. I'll update when I know more.
I think this has been hidden in the past because I used a script that updated (and fixed) the package each time I added a resource. Also, I generally add XLS resources after adding PDF resources to the same datasets, and I have extractor configured to only extract PDF resources.
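For context, restricting extraction to particular formats is done in the CKAN config file. A sketch of the relevant production.ini lines, assuming the option names from the extractor README (the field list shown here is an example, not a recommendation):

```ini
# Only extract resources whose format matches one of these (space-separated).
ckanext.extractor.indexed_formats = pdf

# Which extracted metadata fields to store and index.
ckanext.extractor.indexed_fields = fulltext
```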
Celery threads don't seem to go away after running "extract all", which makes the machine run out of memory pretty quickly.
Have you seen this before?
I have a big dataset with 800,000 records. When I run the extraction, it fails with the following error:
[2017-10-10 14:43:08,491: ERROR/MainProcess] Task extractor.extract[401c7ccc-7a3c-455e-a5f4-f23b804ae43d] raised unexpected: SearchIndexError('Solr returned an error: (u"Connection to server 'http://solr_server/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='#########', port=8983): Read timed out. (read timeout=60)",)',)
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 94, in extract
index_for('package').update_dict(pkg_dict)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
self.index_package(pkg_dict, defer_commit)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 295, in index_package
raise SearchIndexError(msg)
SearchIndexError: Solr returned an error: (u"Connection to server 'http://XXXXXXXXXXXXXXXXX/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='xxxxxxxxxx', port=8983): Read timed out. (read timeout=60)",)
Did anyone else have the same issue? Can anyone please let me know how to fix it?
Thanks in advance!
I have a problem installing this extension on Ubuntu 14.04.
You wrote that Solr packages are broken for Ubuntu and that people should download the correct jar files. I downloaded them then added the appropriate lines to solrconfig.xml. However, I still have issues with Solr extracting metadata.
Here's the error I get:
Traceback (most recent call last):
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 62, in extract
extracted = download_and_extract(res_dict['url'])
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/lib.py", line 43, in download_and_extract
data = pysolr.Solr(config['solr_url']).extract(f, extractFormat='text')
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 979, in extract
files={'file': (file_obj.name, file_obj)})
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 394, in _send_request
raise SolrError(error_message % (resp.status_code, solr_message))
SolrError: Solr responded with an error (HTTP 500): [Reason: None]
Any advice you can give would be appreciated.
I'm not sure if I understand the handling correctly. As far as I can tell, thanks to celeryd all new uploads (e.g. of a PDF file) are automatically extracted. But then the result is not yet present in the search index. So to actually make use of the extracted fulltext for the search, I have to rebuild the index.
Is this correct? Or should the index eventually be updated?
When I attempt to create a resource with a PDF file I am receiving the following error:
raised unexpected: SearchIndexError("Solr returned an error: (u'Solr responded with an error (HTTP 400): [Reason: ERROR: [doc=6b8e5b3b06fb3097149ebc2caffa7ffa] multiple values encountered for non multiValued field ckanext-extractor_b7e8d049-9b51-4c98-8d09-33b4879a45d7_x-parsed-by: [org.apache.tika.parser.DefaultParser, org.apache.tika.parser.pdf.PDFParser]]',)",)
Any thoughts?
Hello, and thanks for your work putting this extension together! I have CKAN running on a Windows machine, but the search index does not contain the full text of the assets stored in CKAN. Is Redis required for the plugin to work? The Celery error I get is:
[2017-04-05 11:09:18,331: INFO/MainProcess] Received task: extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300]
[2017-04-05 11:09:31,660: ERROR/MainProcess] Task extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300] raised unexpected: HTTPError('404 Client Error: Not Found',)
Traceback (most recent call last):
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\tasks.py", line 63, in extract
extracted = download_and_extract(res_dict['url'])
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\lib.py", line 38, in download_and_extract
r.raise_for_status()
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\requests\models.py", line 851, in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found
Thank you for your help.
Have you seen any need for showing search match snippets?
We'd like to show why a search result matched. Snippets from the usual indexed fields as well as the full-text field would probably be very useful for us. Just thought I'd raise it here, although it should probably be done as a separate plugin.
One way of implementing it might just be to configure solr to store the fulltext field and enable highlighting. And then have the package search API include the highlighting results in the response somehow.
I'm keen to hear your thoughts.
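The store-and-highlight approach above could be prototyped without touching CKAN, by querying Solr's select handler directly with highlighting parameters. A minimal sketch; the field pattern is an assumption based on how extractor names its metadata fields (ckanext-extractor_<resource-id>_<key>), and the fulltext field would need stored="true" in the Solr schema for highlighting to work:

```python
from urllib.parse import urlencode


def highlight_query_url(solr_url, query):
    """Build a Solr select URL that asks for highlighted match snippets.

    The hl.fl wildcard is an assumption about the extractor's field
    naming scheme; adjust it to match your Solr schema.
    """
    params = {
        'q': query,
        'hl': 'true',                               # enable highlighting
        'hl.fl': 'ckanext-extractor_*_fulltext',    # fields to snippet (assumed name)
        'hl.snippets': 2,                           # snippets per field
        'hl.fragsize': 120,                         # snippet length in chars
        'wt': 'json',
    }
    return solr_url.rstrip('/') + '/select?' + urlencode(params)
```

A package-search wrapper in a plugin could then merge the response's "highlighting" section into the result dicts before returning them.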
https://extensions.ckan.org/extension/extractor/ still contains references to Celery, even though the extension has since been updated.
When ckan-worker runs to pull the metadata out using ckanext-extractor, I'm getting the error below every time.
datasetthumbnail is a separate plugin that has nothing to do with extractor and is properly installed. If I disable the thumbnail plugin in prod.ini, then it works and I don't get this error in ckan-worker.
Why is the extractor ckan-worker doing this? I've seen it happen with other plugins too, but I've never found a solution other than disabling whichever plugin it complains about, and it's always a seemingly random one. Everything is properly installed.
2018-10-14 13:57:43,667 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job fba8e776-22fe-4de9-99e1-77660fdada72 from queue "default"
2018-10-14 13:57:43,669 INFO [rq.worker]
2018-10-14 13:57:43,669 INFO [rq.worker] *** Listening on ckan:default:default...
2018-10-14 13:57:48,763 INFO [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'cache_url': None, u'mimetype_inner': None, u'hash': u'', u'description': u'', u'format': u'CSV', u'url': u'http://127.0.0.1/dataset/086935f5-0bd0-4171-83a6-91076e4fdfb1/resource/3f07b9d5-be73-40f1-a6b6-108f2069c332/download/5000-cc-records.csv', u'created': '2018-10-14T20:57:42.620174', u'state': u'active', u'package_id': u'086935f5-0bd0-4171-83a6-91076e4fdfb1', u'last_modified': '2018-10-14T20:57:42.592766', u'mimetype': u'text/csv', u'url_type': u'upload', u'position': 5, u'revision_id': u'2d9f79b7-ab4c-431e-878b-51bcd364d98b', u'size': 460482L, u'datastore_active': True, u'id': u'3f07b9d5-be73-40f1-a6b6-108f2069c332', u'resource_type': None, u'name': u'5000 CC Records.csv'}) (a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8)
2018-10-14 13:57:48,763 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 starts job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,128 ERROR [ckan.lib.jobs] Job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 on worker rq:worker:localhost.13874 raised an exception: datasetthumbnail
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
rv = job.perform()
File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/job.py", line 498, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/tasks.py", line 67, in extract
load_config(ini_path)
File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/config.py", line 71, in load_config
load_environment(conf.global_conf, conf.local_conf)
File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 99, in load_environment
p.load_all()
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 139, in load_all
load(*plugins)
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 153, in load
service = _get_service(plugin)
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
raise PluginNotFoundException(plugin_name)
PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 ERROR [rq.worker] PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 WARNI [rq.worker] Moving job to u'failed' queue
2018-10-14 13:57:49,135 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,136 INFO [rq.worker]
2018-10-14 13:57:49,137 INFO [rq.worker] *** Listening on ckan:default:default...
How do we handle resources that are uploaded via the API to datastore_create? The worker just logs the error below, saying it can't find the URL:
2018-09-07 20:15:41,106 INFO [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'package_id': u'045626ce-96c5-4eb5-a248-d3b1e5a9eb2f', u'datastore_active': True, u'id': u'ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9', u'size': None, u'restricted': u'{"allowed_users": "", "level": "public"}', u'state': u'active', u'hash': u'', u'description': u'', u'format': u'data dictionary', u'mimetype_inner': None, u'url_type': None, u'mimetype': None, u'cache_url': None, u'name': u'rees', u'created': '2018-09-07T20:14:21.774267', u'url': u'', u'last_modified': None, u'position': 7, u'revision_id': u'11d499ed-19f0-491c-a2bb-482b74c3cdca', u'tag_string_resource': u'', u'resource_type': u''}) (a92631fc-7197-416e-bd0e-1df1b7d5e421)
2018-09-07 20:15:41,109 INFO [ckan.lib.jobs] Worker rq:worker:MECALDDMPCKN01.19289 starts job a92631fc-7197-416e-bd0e-1df1b7d5e421 from queue "default"
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:49,236 WARNI [ckanext.extractor.tasks] Failed to download resource data from "": Invalid URL '': No schema supplied. Perhaps you meant http://?
2018-09-07 20:15:49,306 DEBUG [ckanext.extractor.logic.action] extractor_show 53fc14dd-3ffb-4407-88ea-a66feeef87e0
2018-09-07 20:15:49,314 DEBUG [ckanext.extractor.logic.action] extractor_show 990d609a-1231-4ed8-8d95-a2e53311cf6d
2018-09-07 20:15:49,330 DEBUG [ckanext.extractor.logic.action] extractor_show 9de52d49-2e53-45fb-b5be-c287b10d3cb3
2018-09-07 20:15:49,339 DEBUG [ckanext.extractor.logic.action] extractor_show 849c6583-f795-4de6-89b8-b5130fb1e3e9
2018-09-07 20:15:49,346 DEBUG [ckanext.extractor.logic.action] extractor_show 3e67f293-0cbb-417e-949c-7ceb7116829a
2018-09-07 20:15:49,354 DEBUG [ckanext.extractor.logic.action] extractor_show 45b35abb-4f05-4fed-85a5-8b198e58e578
2018-09-07 20:15:49,361 DEBUG [ckanext.extractor.logic.action] extractor_show 207db850-c5e2-4508-887b-4107d8a7684a
2018-09-07 20:15:49,368 DEBUG [ckanext.extractor.logic.action] extractor_show ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9
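As the log shows, datastore-only resources can have an empty url, so the download step fails with "No schema supplied". One way to sidestep this, sketched here as a hypothetical helper (the function name and the idea of filtering before enqueueing the job are mine, not part of the extension):

```python
def is_extractable(res_dict, formats=('pdf',)):
    """Return True only for resources worth submitting for extraction.

    Resources created via datastore_create may have url == '' (upload-less,
    datastore-only), which cannot be downloaded; skip those up front
    instead of letting the download fail inside the worker.
    """
    url = (res_dict.get('url') or '').strip()
    fmt = (res_dict.get('format') or '').lower()
    return bool(url) and fmt in formats
```

A hook that triggers extraction could call this before enqueueing, and fall back to fetching the data from the DataStore API for url-less resources if their contents should still be indexed.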
I have a before_index method in an extension that takes a multivalued field and prepares it for Solr, so that tags appear individually instead of as one serialized list.
Basically, business_area looks like ["tag1", "tag2", "tag3"].
If there are no tags, it's just an empty list, "[]".
def before_index(self, data_dict):
    print(data_dict)
    # print(json.loads(data_dict.get('business_area', '[]')))
    if data_dict.get('business_area'):
        data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
    return data_dict
The problem is that whenever extractor runs, it hits the stack trace below on every push, saying:
TypeError: expected string or buffer
If I remove the before_index method altogether, it works fine and I don't get this error.
Why does extractor keep calling this method from a separate extension? Is there any way to stop it from erroring, and why does it error at all? The method works fine except when extractor triggers it.
Traceback (most recent call last):
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
rv = job.perform()
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 202, in extract
index_for('package').update_dict(pkg_dict)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
self.index_package(pkg_dict, defer_commit)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 278, in index_package
pkg_dict = item.before_index(pkg_dict)
File "/usr/lib/ckan/default/src/ckanext-datasettheme/ckanext/datasettheme/plugin.py", line 99, in before_index
data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
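The traceback suggests that when extractor re-indexes the package, before_index is called with business_area already deserialized to a list, and json.loads chokes on it. A defensive sketch (not the extension's actual code) that tolerates both shapes; written with str for Python 3, where the original Python 2 code would check basestring:

```python
import json


def before_index(self, data_dict):
    """Tolerate both a JSON string (normal indexing) and an
    already-parsed list (re-indexing triggered by other plugins,
    such as extractor), so json.loads is only called on strings."""
    value = data_dict.get('business_area')
    if isinstance(value, str):
        data_dict['business_area'] = json.loads(value or '[]')
    return data_dict
```

before_index is part of CKAN's IPackageController interface and runs on every index update of the package, regardless of which plugin triggered it, so it has to be robust to whatever shape the field is in at that point.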
And if not, any guidance would be appreciated, as I'd like to take a crack at it and contribute it back.
Consider integrating with http://api.reegle.info/ to automatically create tags for concepts, places, and other entities.
Some more context: I actually prototyped an integration with Semantic MediaWiki, and it worked surprisingly well even for non-cleantech content. It was still able to recognize generic concepts, places, and people. And the best part is that the auto-tagging API is free.
I am seeing a NotAuthorized error when the Celery daemon receives a task.