Comments (8)

torfsen commented on May 19, 2024

Thanks for your report, @jbothma!

Your description does indeed suggest a connection to ckanext-extractor. However, I currently don't have an idea how ckanext-extractor could influence your tags: the metadata extracted by ckanext-extractor is stored in separate database tables and ckanext-extractor isn't supposed to modify the dataset/resource data itself.

Obviously that doesn't mean that ckanext-extractor isn't the problem, but simply that this needs further investigation 😉 I'll look into it, but am currently busy with other things. If you can spare some time to investigate on your own then that would be a big help.

from ckanext-extractor.

jbothma commented on May 19, 2024

Looks like the tag vocabulary fields (financial_year and sphere) are still populated in the index document, except inside validated_data_dict, where both are empty lists. That suggests the problem has something to do with the package data cached in Solr.

Someone has discussed disabling that cache for quicker iteration on their schema: ckan/ckan#3226

Perhaps there's something wrong with my schema: https://github.com/OpenUpSA/ckanext-satreasury/blob/master/ckanext/satreasury/plugin.py#L111

      {
        "data_dict":"{\"license_title\": \"License not specified\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [\"national\"], \"financial_year\": [\"2019-20\"], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"resources\": [{\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"Annexure_A_-_Individual_Investor.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T13:47:17.740234\", 
\"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"url_type\": \"upload\", \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"size\": 52558}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T14:35:45.663218\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"url_type\": \"upload\", \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:11:28.867650\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"size\": 241245}, {\"mimetype\": \"application/pdf\", \"cache_url\": null, \"hash\": \"\", \"description\": \"\", \"name\": \"vote-10-public-service-and-administration.pdf\", \"format\": \"PDF\", 
\"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"datastore_active\": false, \"cache_last_updated\": null, \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"mimetype_inner\": null, \"last_modified\": \"2018-11-13T15:31:07.133643\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"url_type\": \"upload\", \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"size\": 241245}], \"title\": \"whatever\", \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
        "site_id":"default",
        "financial_year":["2019-20"],
        "id":"e50c37e5-cec5-40d2-b55b-e6bd512c8d71",
        "metadata_created":"2018-11-13T11:39:31.160Z",
        "capacity":"public",
        "metadata_modified":"2018-11-13T15:31:07.163Z",
        "res_format":["PDF",
          "PDF",
          "PDF",
          "PDF"],
        "state":"active",
        "license_id":"notspecified",
        "res_url":["http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf",
          "http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf"],
        "entity_type":"package",
        "title":"whatever",
        "dataset_type":"dataset",
        "validated_data_dict":"{\"owner_org\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"maintainer\": \"\", \"relationships_as_object\": [], \"notes_short\": \"\", \"private\": false, \"maintainer_email\": \"\", \"num_tags\": 2, \"sphere\": [], \"financial_year\": [], \"id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"metadata_created\": \"2018-11-13T11:39:31.160030\", \"functions\": [], \"dimensions\": [], \"metadata_modified\": \"2018-11-13T15:31:07.163809\", \"author\": \"\", \"author_email\": \"\", \"state\": \"active\", \"methodology\": \"\", \"version\": null, \"usage\": \"\", \"license_id\": \"notspecified\", \"type\": \"dataset\", \"use_for\": \"\", \"province\": [], \"num_resources\": 4, \"title\": \"whatever\", \"groups\": [], \"creator_user_id\": \"f5406233-1dc3-42e3-804e-2579a57b3cdd\", \"relationships_as_subject\": [], \"key_points\": \"\", \"name\": \"whatever\", \"isopen\": false, \"url\": \"\", \"notes\": \"\", \"license_title\": \"License not specified\", \"resources\": [{\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/246f57e2-2f62-4706-98e7-46cace33f0c8/download/annexure_a_-_individual_investor.pdf\", \"created\": \"2018-11-13T13:47:17.832194\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T13:47:17.740234\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 0, \"revision_id\": \"ee85ac6e-1ce3-44b1-a3b7-37eac4b82038\", \"size\": 52558, \"datastore_active\": false, \"id\": \"246f57e2-2f62-4706-98e7-46cace33f0c8\", \"resource_type\": null, \"name\": \"Annexure_A_-_Individual_Investor.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": 
\"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/a53b88f8-0052-4622-8aa4-417366bbddc3/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T14:35:45.738843\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T14:35:45.663218\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 1, \"revision_id\": \"dda39d69-19a6-49bd-a211-b17ee4817f57\", \"size\": 241245, \"datastore_active\": false, \"id\": \"a53b88f8-0052-4622-8aa4-417366bbddc3\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/5a752ff7-b8eb-4d17-96d4-835473287d60/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:11:28.943298\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:11:28.867650\", \"mimetype\": \"application/pdf\", \"url_type\": \"upload\", \"position\": 2, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"5a752ff7-b8eb-4d17-96d4-835473287d60\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}, {\"cache_last_updated\": null, \"cache_url\": null, \"mimetype_inner\": null, \"hash\": \"\", \"description\": \"\", \"format\": \"PDF\", \"url\": \"http://ckan:5000/dataset/e50c37e5-cec5-40d2-b55b-e6bd512c8d71/resource/4950f8ca-6def-47c0-a538-28667afdcc7c/download/vote-10-public-service-and-administration.pdf\", \"created\": \"2018-11-13T15:31:07.194859\", \"state\": \"active\", \"package_id\": \"e50c37e5-cec5-40d2-b55b-e6bd512c8d71\", \"last_modified\": \"2018-11-13T15:31:07.133643\", \"mimetype\": 
\"application/pdf\", \"url_type\": \"upload\", \"position\": 3, \"revision_id\": \"73afea75-e020-43b2-b319-76e555c8b01a\", \"size\": 241245, \"datastore_active\": false, \"id\": \"4950f8ca-6def-47c0-a538-28667afdcc7c\", \"resource_type\": null, \"name\": \"vote-10-public-service-and-administration.pdf\"}], \"organization\": {\"description\": \"\", \"created\": \"2018-05-21T17:41:56.378776\", \"title\": \"My organization\", \"name\": \"my-organization\", \"is_organization\": true, \"state\": \"active\", \"image_url\": \"\", \"revision_id\": \"21d6f9e8-518a-4e73-8718-7e2383fcba01\", \"type\": \"organization\", \"id\": \"97815896-5fde-42c9-a2a5-dda28270a831\", \"approval_status\": \"approved\"}, \"revision_id\": \"15f83f1b-c6dd-47eb-ada3-4435a428406e\"}",
        "res_name":["Annexure_A_-_Individual_Investor.pdf",
          "vote-10-public-service-and-administration.pdf",
          "vote-10-public-service-and-administration.pdf",
          "vote-10-public-service-and-administration.pdf"],
        "name":"whatever",
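
The mismatch between the two cached dicts can also be checked programmatically. The following is only a minimal sketch, assuming the Solr document has already been fetched and parsed into a Python dict; the helper name diverging_vocab_fields and the trimmed-down sample document are hypothetical and only mirror the two fields from the dump above.

```python
import json

# Vocabulary-backed fields from the schema discussed in this thread.
VOCAB_FIELDS = ("financial_year", "sphere")


def diverging_vocab_fields(index_doc, fields=VOCAB_FIELDS):
    """Return {field: (cached_value, validated_value)} for fields that differ.

    Both data_dict and validated_data_dict are stored in the index
    document as JSON-encoded strings, so they are decoded first.
    """
    cached = json.loads(index_doc["data_dict"])
    validated = json.loads(index_doc["validated_data_dict"])
    return {
        field: (cached.get(field), validated.get(field))
        for field in fields
        if cached.get(field) != validated.get(field)
    }


# Hypothetical, heavily trimmed stand-in for the index document above.
doc = {
    "data_dict": json.dumps(
        {"sphere": ["national"], "financial_year": ["2019-20"]}
    ),
    "validated_data_dict": json.dumps({"sphere": [], "financial_year": []}),
}

print(diverging_vocab_fields(doc))
```

An empty result would mean both copies agree on the vocabulary fields; for the document above, both fields would be reported.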

jbothma commented on May 19, 2024

The pkg_dict that ckanext-extractor's worker gets from package_show at https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L62 already has its vocabulary tags converted.

So when ckanext-extractor's worker calls index_for('package').update_dict(pkg_dict) at https://github.com/stadt-karlsruhe/ckanext-extractor/blob/master/ckanext/extractor/tasks.py#L110, there aren't any ('tag', ..., ...) keys in the data argument passed to the converters.convert_from_tags callable (https://github.com/ckan/ckan/blob/master/ckan/logic/converters.py#L93).

Since converters.convert_from_tags overwrites data[key], the worker's index call ends up triggering a second convert on the tag vocabulary fields and setting them to empty lists.
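
The effect can be reproduced without CKAN. What follows is a deliberately simplified stand-in, not CKAN's actual converters.convert_from_tags: the function name convert_from_tags_sketch, the flattened key layout and the vocabulary id "financial-years" are all assumptions made for illustration. It only shows why a converter that unconditionally overwrites data[key] empties the field when run on an already-converted dict.

```python
def convert_from_tags_sketch(vocabulary_id):
    """Build a converter that collects tag names of one vocabulary.

    Simplified illustration only; the real CKAN converter has a
    different signature and looks the vocabulary up in the database.
    """

    def convert(key, data):
        names = [
            value["name"]
            for flat_key, value in data.items()
            if flat_key[0] == "tags"
            and value.get("vocabulary_id") == vocabulary_id
        ]
        # Unconditional overwrite: with no tag entries left in `data`,
        # this replaces a previously converted value with [].
        data[key] = names

    return convert


convert = convert_from_tags_sketch("financial-years")

# First pass: raw tag entries are still present, so the field is filled.
raw = {("tags", 0): {"name": "2019-20", "vocabulary_id": "financial-years"}}
convert(("financial_year",), raw)
first_pass = raw[("financial_year",)]

# Second pass: the dict is already converted and has no tag entries.
already_converted = {("financial_year",): ["2019-20"]}
convert(("financial_year",), already_converted)
second_pass = already_converted[("financial_year",)]

print(first_pass, second_pass)
```

In this sketch the second pass corresponds to the worker's index call: the input looks valid, but the field is wiped because the converter finds nothing to collect.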

I think the following are reasonable options, but I'm not sure what the best one is and would like input. I'll cross-post to the ckan-dev list:

  • the worker should operate on a pkg_dict that has not been converted yet, so that converting it has the expected result
    • in this case, how? Is there a context flag for package_show that returns an unconverted dict?
  • index_for('package').update_dict(pkg_dict) should handle an already-converted dict safely
    • how? Its only optional argument is defer_commit
  • converters should be idempotent, in which case this is a CKAN bug
    • unlikely: it sounds odd, and I don't think there is enough metadata to support this safely

torfsen commented on May 19, 2024

Thanks for the detailed investigation, @jbothma!

Perhaps we can avoid this issue completely by using ckan.lib.search.rebuild instead of ckan.lib.search.index_for('package').update_dict. Could you please try the following:

In the file ckanext/extractor/tasks.py, replace the line index_for('package').update_dict(pkg_dict) with the following lines:

    from ckan.lib import search
    search.rebuild(package_id=res_dict['package_id'])

That would leave all the details of handling the package dict to CKAN core.

jbothma commented on May 19, 2024

That seems to work perfectly, thanks!

I've made a pull request.

torfsen commented on May 19, 2024

The mentioned change has been committed in cbf1cae.

@jbothma: Do you need this backported to 3.1?

jbothma commented on May 19, 2024

No need, thanks - I took the opportunity to upgrade (and drop celery) while debugging.

Thanks for fitting this into your schedule - much appreciated.
