Comments (8)
Thanks for your report, @jbothma!
Your description does indeed suggest a connection to ckanext-extractor. However, I currently don't have an idea how ckanext-extractor could influence your tags: the metadata extracted by ckanext-extractor is stored in separate database tables and ckanext-extractor isn't supposed to modify the dataset/resource data itself.
Obviously that doesn't mean that ckanext-extractor isn't the problem, but simply that this needs further investigation 😉 I'll look into it, but am currently busy with other things. If you can spare some time to investigate on your own then that would be a big help.
from ckanext-extractor.
Looks like the tag vocabulary fields (financial_year and sphere) are still populated in the index document, except inside the validated_data_dict field, where they are empty lists. That suggests it has something to do with the package data cached in Solr.
Someone's discussed disabling that for quicker iteration on their schema: ckan/ckan#3226
Perhaps there's something wrong with my schema: https://github.com/OpenUpSA/ckanext-satreasury/blob/master/ckanext/satreasury/plugin.py#L111
{
"data_dict":"{\"license_title\": \"License not specified\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [\"national\"], \"financial_year\": [\"2019-20\"], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"resources\": [{\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"Annexure_A_-_Individual_Investor.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T13:47:17.740234\", \"position\": 0, 
\"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"url_type\": \"upload\", \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"size\": 52558}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T14:35:45.663218\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"url_type\": \"upload\", \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:11:28.867650\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:31:07.133643\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"size\": 241245}], \"title\": \"whatever\", \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
"site_id":"default",
"financial_year":["2019-20"],
"id":"e50c37e5-cec5-40d2-b55b-e6bd512c8d71",
"metadata_created":"2018-11-13T11:39:31.160Z",
"capacity":"public",
"metadata_modified":"2018-11-13T15:31:07.163Z",
"res_format":["PDF",
"PDF",
"PDF",
"PDF"],
"state":"active",
"license_id":"notspecified",
"res_url":["http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf",
"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf"],
"entity_type":"package",
"title":"whatever",
"dataset_type":"dataset",
"validated_data_dict":"{\"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [], \"financial_year\": [], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"title\": \"whatever\", \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"license_title\": \"License not specified\", \"resources\": [{\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T13:47:17.740234\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"size\": 52558, \"datastore_active\": false, \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"name\": \"Annexure_A_-_Individual_Investor.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T14:35:45.663218\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"size\": 241245, \"datastore_active\": false, \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:11:28.867650\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:31:07.133643\", \"mimetype\": 
\"application/pdf\", \"url_type\": \"upload\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}], \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
"res_name":["Annexure_A_-_Individual_Investor.pdf",
"vote-10-public-service-and-administration.pdf",
"vote-10-public-service-and-administration.pdf",
"vote-10-public-service-and-administration.pdf"],
"name":"whatever",
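The mismatch in the document above can be checked mechanically. The following is a minimal sketch, assuming the index document has already been fetched as a Python dict whose data_dict and validated_data_dict values are JSON strings, as in the dump. The helper name diff_cached_dicts is hypothetical, and the sample document is trimmed to three fields taken from the dump.

```python
import json


def diff_cached_dicts(index_doc, fields):
    """Return the fields whose values differ between the two cached dicts.

    Hypothetical helper: both cached dicts are stored as JSON strings
    inside the index document, so they are decoded before comparing.
    """
    data = json.loads(index_doc["data_dict"])
    validated = json.loads(index_doc["validated_data_dict"])
    return {
        field: (data.get(field), validated.get(field))
        for field in fields
        if data.get(field) != validated.get(field)
    }


# Trimmed-down version of the index document shown above.
doc = {
    "data_dict": json.dumps(
        {"sphere": ["national"], "financial_year": ["2019-20"], "province": []}
    ),
    "validated_data_dict": json.dumps(
        {"sphere": [], "financial_year": [], "province": []}
    ),
}

mismatches = diff_cached_dicts(doc, ["sphere", "financial_year", "province"])
for field, (cached, validated) in mismatches.items():
    print(field, "data_dict:", cached, "validated_data_dict:", validated)
```

Only sphere and financial_year are reported here, since province is empty in both dicts.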
from ckanext-extractor.
The pkg_dict that ckanext-extractor's worker gets from package_show on https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L62 already has vocabulary tags converted.
So when ckanext-extractor's worker calls index_for('package').update_dict(pkg_dict) on https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L110, there aren't any ('tag', ..., ...) keys in the data argument to the converters.convert_from_tags callable https://github.com/ckan/ckan/blob/master/ckan/logic/converters.py#L93
Since converters.convert_from_tags overwrites data[key], the worker's index call ends up triggering a second conversion of the tag vocabulary fields and setting them to empty lists.
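The effect can be reproduced without CKAN. The sketch below is a simplified stand-in, not CKAN's actual converter: make_convert_from_tags, the flattened key layout, and the vocabulary id are all assumptions made for illustration. It only mirrors the behaviour described above, namely that the converter collects matching tag entries and then overwrites data[key] unconditionally.

```python
def make_convert_from_tags(vocab_id):
    """Simplified stand-in for a convert_from_tags-style converter.

    Not CKAN's code: the real converter works on CKAN's flattened data
    dict and looks the vocabulary up in the database.
    """
    def convert(key, data):
        names = [
            value["name"]
            for k, value in data.items()
            if k[0] == "tags"
            and isinstance(value, dict)
            and value.get("vocabulary_id") == vocab_id
        ]
        # Unconditional overwrite: with no tag entries this becomes [].
        data[key] = names
    return convert


convert = make_convert_from_tags("vocab-financial-year")  # hypothetical id

# First pass: the data still carries the tag entry, so the field is filled.
raw = {
    ("tags", 0): {"name": "2019-20", "vocabulary_id": "vocab-financial-year"},
    ("financial_year",): None,
}
convert(("financial_year",), raw)
first = raw[("financial_year",)]

# Second pass: an already-converted dict has no tag entries left,
# so the same converter wipes the field.
converted = {("financial_year",): ["2019-20"]}
convert(("financial_year",), converted)
second = converted[("financial_year",)]

print(first, second)
```

The first pass yields the vocabulary value and the second pass yields an empty list, which matches the empty sphere and financial_year lists in validated_data_dict.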
I think the following are reasonable options, but I'm not sure what the best one is and would like input. I'll cross-post to the ckan-dev list:
- the worker should be operating on a pre-converted pkg_dict so that converting it has the expected result
  - in this case, how? Is there a context flag to package_show that can give an unconverted dict?
- index_for('package').update_dict(pkg_dict) should handle an already-converted dict safely
  - how? Its only optional argument is defer_commit
- converters should be idempotent, in which case this is a CKAN bug
  - unlikely - it sounds weird and there isn't really enough metadata to support this safely, I don't think
from ckanext-extractor.
Thanks for the detailed investigation, @jbothma!
Perhaps we can avoid this issue completely by using ckan.lib.search.rebuild instead of ckan.lib.search.index_for('package').update_dict. Could you please try the following:
In the file ckanext/extractor/tasks.py, replace the line index_for('package').update_dict(pkg_dict) with the following lines:
from ckan.lib import search
search.rebuild(package_id=res_dict['package_id'])
That would leave all the details of handling the package dict to CKAN core.
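For trying the change in isolation, the two lines can be wrapped in a small helper. This is only a sketch: reindex_package and its rebuild parameter are hypothetical names that are not part of ckanext-extractor, and the parameter exists solely so the helper can be exercised outside a CKAN environment. The search.rebuild(package_id=...) call itself is the one suggested above.

```python
def reindex_package(res_dict, rebuild=None):
    """Ask CKAN to rebuild the search index entry for a resource's package.

    Hypothetical helper. `rebuild` defaults to ckan.lib.search.rebuild,
    which is imported lazily because it needs a CKAN environment.
    """
    if rebuild is None:
        from ckan.lib import search
        rebuild = search.rebuild
    # CKAN core loads and converts the package itself, so no
    # already-converted pkg_dict is passed to the indexer.
    rebuild(package_id=res_dict["package_id"])


# Usage outside CKAN, with a stand-in that records what it was asked to do.
calls = []
reindex_package(
    {"package_id": "e50c37e5-cec5-40d2-b55b-e6bd512c8d71"},
    rebuild=lambda package_id: calls.append(package_id),
)
print(calls)
```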
from ckanext-extractor.
That seems to work perfectly, thanks!
I've made a pull request.
from ckanext-extractor.
The mentioned change has been committed in cbf1cae.
@jbothma: Do you need this backported to 3.1?
from ckanext-extractor.
No need, thanks - I took the opportunity to upgrade (and drop celery) while debugging.
Thanks for fitting this into your schedule - much appreciated.
from ckanext-extractor.