stadt-karlsruhe / ckanext-extractor
A full text and metadata extractor for CKAN
License: GNU Affero General Public License v3.0
Do you have any plans to support the new background jobs system soon?
I have a few custom ckan tag vocabularies for my datasets. It looks like when the worker extracts the text, the vocabulary tags are removed from the dataset.
I haven't looked into the worker code yet and I'm still on [email protected]
Basically the only thing I have using celery (yes, still on celery despite you upgrading it to work with redis on my request, sorry) is this.
When I create a dataset, I assign a couple of vocabulary tags to it.
When I add a PDF resource to it and programmatically request the package immediately afterward, the tags are still set correctly.
A few seconds later they're not set any more.
If I stop the celery worker, the tags will stay in place until I start the worker again.
Any idea why this might be? I'll dive into the worker code ASAP but it's taken me a day or so to track this down to this plugin so it might not be tomorrow.
As always, I'm a huge fan of this extension and appreciate it very much. Just posting here in the meantime in case you can spot the cause right away. I'll update when I know more.
I think this has been hidden in the past because I used a script that updated (and fixed) the package each time I added a resource. Also, I generally add XLS resources after adding PDF resources to the same datasets, and I have extractor configured to only extract PDF resources.
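For context, restricting extraction to particular formats is done in the CKAN config file. A sketch of the relevant production.ini lines, assuming the option names from the extractor README (the field list shown here is an example, not a recommendation):

```ini
# Only extract resources whose format matches one of these (space-separated).
ckanext.extractor.indexed_formats = pdf

# Which extracted metadata fields to store and index.
ckanext.extractor.indexed_fields = fulltext
```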
Celery threads don't seem to go away after running "extract all", which makes the machine run out of memory pretty quickly.
Have you seen this before?
I have a big dataset with 800,000 records. When I run the extraction, it fails with the following error:
[2017-10-10 14:43:08,491: ERROR/MainProcess] Task extractor.extract[401c7ccc-7a3c-455e-a5f4-f23b804ae43d] raised unexpected: SearchIndexError('Solr returned an error: (u"Connection to server 'http://solr_server/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='#########', port=8983): Read timed out. (read timeout=60)",)',)
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 94, in extract
index_for('package').update_dict(pkg_dict)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
self.index_package(pkg_dict, defer_commit)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 295, in index_package
raise SearchIndexError(msg)
SearchIndexError: Solr returned an error: (u"Connection to server 'http://XXXXXXXXXXXXXXXXX/solr/ckan/update/?commit=true' timed out: HTTPConnectionPool(host='xxxxxxxxxx', port=8983): Read timed out. (read timeout=60)",)
Did anyone else have the same issue? Can anyone please let me know how to fix it?
Thanks in advance!
I have a problem installing this extension on Ubuntu 14.04.
You wrote that Solr packages are broken for Ubuntu and that people should download the correct jar files. I downloaded them then added the appropriate lines to solrconfig.xml. However, I still have issues with Solr extracting metadata.
Here's the error I get:
Traceback (most recent call last):
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 62, in extract
extracted = download_and_extract(res_dict['url'])
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/lib.py", line 43, in download_and_extract
data = pysolr.Solr(config['solr_url']).extract(f, extractFormat='text')
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 979, in extract
files={'file': (file_obj.name, file_obj)})
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pysolr.py", line 394, in _send_request
raise SolrError(error_message % (resp.status_code, solr_message))
SolrError: Solr responded with an error (HTTP 500): [Reason: None]
Any advice you can give would be appreciated.
I'm not sure if I understand the handling correctly. As far as I can tell, thanks to celeryd all new uploads (e.g. of a PDF file) are automatically extracted. But then the result is not yet present in the search index. So to actually make use of the extracted fulltext for the search, I have to rebuild the index.
Is this correct? Or should the index eventually be updated?
When I attempt to create a resource with a PDF file I am receiving the following error:
raised unexpected: SearchIndexError("Solr returned an error: (u'Solr responded with an error (HTTP 400): [Reason: ERROR: [doc=6b8e5b3b06fb3097149ebc2caffa7ffa] multiple values encountered for non multiValued field ckanext-extractor_b7e8d049-9b51-4c98-8d09-33b4879a45d7_x-parsed-by: [org.apache.tika.parser.DefaultParser, org.apache.tika.parser.pdf.PDFParser]]',)",)
Any thoughts?
Hello, and thanks for your work putting this extension together! I have CKAN running on a Windows machine, but the search index does not contain the full text of the assets stored in CKAN. Is Redis required for the plugin to work? The Celery error I get is:
[2017-04-05 11:09:18,331: INFO/MainProcess] Received task: extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300]
[2017-04-05 11:09:31,660: ERROR/MainProcess] Task extractor.extract[d5b7a2fb-31a5-4395-adf6-0e58c076a300] raised unexpected: HTTPError('404 Client Error: Not Found',)
Traceback (most recent call last):
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\celery\app\trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\tasks.py", line 63, in extract
extracted = download_and_extract(res_dict['url'])
File "c:\users\sbarn_000\source\repos\ckan\ckanenv32-2.7\src\ckanext-extractor\ckanext\extractor\lib.py", line 38, in download_and_extract
r.raise_for_status()
File "C:\Users\sbarn_000\Source\Repos\ckan\ckanenv32-2.7\lib\site-packages\requests\models.py", line 851, in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found
Thank you for your help.
Have you seen any need for showing search match snippets?
We'd like to show why a search result matched. Snippets from the usual indexed fields as well as the full-text field would probably be very useful for us. Just thought I'd raise it here, although it should probably be done as a separate plugin.
One way of implementing it might just be to configure solr to store the fulltext field and enable highlighting. And then have the package search API include the highlighting results in the response somehow.
I'm keen to hear your thoughts.
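The store-and-highlight approach above could be prototyped without touching CKAN, by querying Solr's select handler directly with highlighting parameters. A minimal sketch; the field pattern is an assumption based on how extractor names its metadata fields (ckanext-extractor_<resource-id>_<key>), and the fulltext field would need stored="true" in the Solr schema for highlighting to work:

```python
from urllib.parse import urlencode


def highlight_query_url(solr_url, query):
    """Build a Solr select URL that asks for highlighted match snippets.

    The hl.fl wildcard is an assumption about the extractor's field
    naming scheme; adjust it to match your Solr schema.
    """
    params = {
        'q': query,
        'hl': 'true',                               # enable highlighting
        'hl.fl': 'ckanext-extractor_*_fulltext',    # fields to snippet (assumed name)
        'hl.snippets': 2,                           # snippets per field
        'hl.fragsize': 120,                         # snippet length in chars
        'wt': 'json',
    }
    return solr_url.rstrip('/') + '/select?' + urlencode(params)
```

A package-search wrapper in a plugin could then merge the response's "highlighting" section into the result dicts before returning them.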
https://extensions.ckan.org/extension/extractor/ still contains references to Celery, even though the extension has since been updated.
When ckan-worker runs to pull the metadata out using ckanext-extractor, I'm getting the error below every time.
datasetthumbnail is a separate plugin that has nothing to do with extractor and is properly installed. If I disable the thumbnail plugin in prod.ini, then it works and I don't get this error in ckan-worker.
Why is the extractor ckan-worker doing this? I've seen it happen with other plugins too, but I've never found a solution other than disabling whichever plugin it complains about, and it's always a seemingly random one. Everything is properly installed.
2018-10-14 13:57:43,667 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job fba8e776-22fe-4de9-99e1-77660fdada72 from queue "default"
2018-10-14 13:57:43,669 INFO [rq.worker]
2018-10-14 13:57:43,669 INFO [rq.worker] *** Listening on ckan:default:default...
2018-10-14 13:57:48,763 INFO [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'cache_url': None, u'mimetype_inner': None, u'hash': u'', u'description': u'', u'format': u'CSV', u'url': u'http://127.0.0.1/dataset/086935f5-0bd0-4171-83a6-91076e4fdfb1/resource/3f07b9d5-be73-40f1-a6b6-108f2069c332/download/5000-cc-records.csv', u'created': '2018-10-14T20:57:42.620174', u'state': u'active', u'package_id': u'086935f5-0bd0-4171-83a6-91076e4fdfb1', u'last_modified': '2018-10-14T20:57:42.592766', u'mimetype': u'text/csv', u'url_type': u'upload', u'position': 5, u'revision_id': u'2d9f79b7-ab4c-431e-878b-51bcd364d98b', u'size': 460482L, u'datastore_active': True, u'id': u'3f07b9d5-be73-40f1-a6b6-108f2069c332', u'resource_type': None, u'name': u'5000 CC Records.csv'}) (a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8)
2018-10-14 13:57:48,763 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 starts job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,128 ERROR [ckan.lib.jobs] Job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 on worker rq:worker:localhost.13874 raised an exception: datasetthumbnail
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
rv = job.perform()
File "/usr/lib/ckan/default/lib/python2.7/site-packages/rq/job.py", line 498, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/tasks.py", line 67, in extract
load_config(ini_path)
File "/usr/lib/ckan/default/src/ckanext-extracter/ckanext/extractor/config.py", line 71, in load_config
load_environment(conf.global_conf, conf.local_conf)
File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 99, in load_environment
p.load_all()
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 139, in load_all
load(*plugins)
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 153, in load
service = _get_service(plugin)
File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
raise PluginNotFoundException(plugin_name)
PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 ERROR [rq.worker] PluginNotFoundException: datasetthumbnail
2018-10-14 13:57:49,129 WARNI [rq.worker] Moving job to u'failed' queue
2018-10-14 13:57:49,135 INFO [ckan.lib.jobs] Worker rq:worker:localhost.13874 has finished job a0d3da6b-a7f6-4199-a6b4-76bdb81b12d8 from queue "default"
2018-10-14 13:57:49,136 INFO [rq.worker]
2018-10-14 13:57:49,137 INFO [rq.worker] *** Listening on ckan:default:default...
How do we handle resources that are uploaded via the API to datastore_create? The worker just logs the error below, saying it can't find the URL:
2018-09-07 20:15:41,106 INFO [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'package_id': u'045626ce-96c5-4eb5-a248-d3b1e5a9eb2f', u'datastore_active': True, u'id': u'ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9', u'size': None, u'restricted': u'{"allowed_users": "", "level": "public"}', u'state': u'active', u'hash': u'', u'description': u'', u'format': u'data dictionary', u'mimetype_inner': None, u'url_type': None, u'mimetype': None, u'cache_url': None, u'name': u'rees', u'created': '2018-09-07T20:14:21.774267', u'url': u'', u'last_modified': None, u'position': 7, u'revision_id': u'11d499ed-19f0-491c-a2bb-482b74c3cdca', u'tag_string_resource': u'', u'resource_type': u''}) (a92631fc-7197-416e-bd0e-1df1b7d5e421)
2018-09-07 20:15:41,109 INFO [ckan.lib.jobs] Worker rq:worker:MECALDDMPCKN01.19289 starts job a92631fc-7197-416e-bd0e-1df1b7d5e421 from queue "default"
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:49,236 WARNI [ckanext.extractor.tasks] Failed to download resource data from "": Invalid URL '': No schema supplied. Perhaps you meant http://?
2018-09-07 20:15:49,306 DEBUG [ckanext.extractor.logic.action] extractor_show 53fc14dd-3ffb-4407-88ea-a66feeef87e0
2018-09-07 20:15:49,314 DEBUG [ckanext.extractor.logic.action] extractor_show 990d609a-1231-4ed8-8d95-a2e53311cf6d
2018-09-07 20:15:49,330 DEBUG [ckanext.extractor.logic.action] extractor_show 9de52d49-2e53-45fb-b5be-c287b10d3cb3
2018-09-07 20:15:49,339 DEBUG [ckanext.extractor.logic.action] extractor_show 849c6583-f795-4de6-89b8-b5130fb1e3e9
2018-09-07 20:15:49,346 DEBUG [ckanext.extractor.logic.action] extractor_show 3e67f293-0cbb-417e-949c-7ceb7116829a
2018-09-07 20:15:49,354 DEBUG [ckanext.extractor.logic.action] extractor_show 45b35abb-4f05-4fed-85a5-8b198e58e578
2018-09-07 20:15:49,361 DEBUG [ckanext.extractor.logic.action] extractor_show 207db850-c5e2-4508-887b-4107d8a7684a
2018-09-07 20:15:49,368 DEBUG [ckanext.extractor.logic.action] extractor_show ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9
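As the log shows, datastore-only resources can have an empty url, so the download step fails with "No schema supplied". One way to sidestep this, sketched here as a hypothetical helper (the function name and the idea of filtering before enqueueing the job are mine, not part of the extension):

```python
def is_extractable(res_dict, formats=('pdf',)):
    """Return True only for resources worth submitting for extraction.

    Resources created via datastore_create may have url == '' (upload-less,
    datastore-only), which cannot be downloaded; skip those up front
    instead of letting the download fail inside the worker.
    """
    url = (res_dict.get('url') or '').strip()
    fmt = (res_dict.get('format') or '').lower()
    return bool(url) and fmt in formats
```

A hook that triggers extraction could call this before enqueueing, and fall back to fetching the data from the DataStore API for url-less resources if their contents should still be indexed.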
I have a before_index method in an extension that takes a multivalued field and prepares it for Solr, so that tags appear individually instead of as one serialized list.
Basically, business_area looks like ["tag1", "tag2", "tag3"].
If there are no tags, it's just an empty list, "[]".
def before_index(self, data_dict):
    print(data_dict)
    # print(json.loads(data_dict.get('business_area', '[]')))
    if data_dict.get('business_area'):
        data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
    return data_dict
The problem is that whenever extractor runs, it hits the stack trace below on every push, saying:
TypeError: expected string or buffer
If I remove the before_index method altogether, it works fine and I don't get this error.
Why does extractor keep calling this method from a separate extension? Is there any way to stop it from erroring, and why does it error at all? The method works fine except when extractor triggers it.
Traceback (most recent call last):
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/worker.py", line 588, in perform_job
rv = job.perform()
File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rq/job.py", line 498, in perform
self._result = self.func(*self.args, **self.kwargs)
File "/usr/lib/ckan/default/src/ckanext-extractor/ckanext/extractor/tasks.py", line 202, in extract
index_for('package').update_dict(pkg_dict)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 101, in update_dict
self.index_package(pkg_dict, defer_commit)
File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/index.py", line 278, in index_package
pkg_dict = item.before_index(pkg_dict)
File "/usr/lib/ckan/default/src/ckanext-datasettheme/ckanext/datasettheme/plugin.py", line 99, in before_index
data_dict['business_area'] = json.loads(data_dict.get('business_area', '[]'))
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
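The traceback suggests that when extractor re-indexes the package, before_index is called with business_area already deserialized to a list, and json.loads chokes on it. A defensive sketch (not the extension's actual code) that tolerates both shapes; written with str for Python 3, where the original Python 2 code would check basestring:

```python
import json


def before_index(self, data_dict):
    """Tolerate both a JSON string (normal indexing) and an
    already-parsed list (re-indexing triggered by other plugins,
    such as extractor), so json.loads is only called on strings."""
    value = data_dict.get('business_area')
    if isinstance(value, str):
        data_dict['business_area'] = json.loads(value or '[]')
    return data_dict
```

before_index is part of CKAN's IPackageController interface and runs on every index update of the package, regardless of which plugin triggered it, so it has to be robust to whatever shape the field is in at that point.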
And if not, any guidance would be appreciated, as I'd like to take a crack at it and contribute it back.
Consider integrating with http://api.reegle.info/ to automatically create tags for concepts, places, and other entities.
Some more context: I actually prototyped an integration with Semantic MediaWiki, and it worked surprisingly well even for non-cleantech content. It was still able to recognize generic concepts, places, and people. And the best part is that the auto-tagging API is free.
I am seeing a NotAuthorized error when the Celery daemon receives a task.