callahantiff commented on June 3, 2024

Thanks for the issue @sanyabt and for the awesome information! I actually think I know what's causing this. Let me confirm this evening and get back to you.

from pheknowlator.

callahantiff commented on June 3, 2024

Hey @sanyabt! I found and resolved the bug. The funny part is that all 2,008 nodes that had an issue in the label were those that were pulled from the Cell Line Ontology. This matters because as it turns out, this ontology includes labels, synonyms, and definitions in multiple languages. While I think it's fine (and maybe even useful) to include the different languages in the synonyms, it's less helpful to have non-English labels and definitions. So, here is what I have done...

  • I created a PR (#119) and added checks to all label and definition queries to ensure that we are only keeping English labels (when other languages are available)
  • Patched the release and pushed it to PyPI -- the current version is now v3.0.2
  • See the attached JSON file (bad_node_patch.json). This file contains a nested dictionary: each outer key is an entity_uri, and each outer value is an inner dictionary whose keys are label and description/definition and whose values are the updated strings without foreign characters. An example of this dictionary is shown below:
>>> key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'
>>> print(bad_node_patch[key])
{'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
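
A minimal sketch of how such a patch could be applied, assuming the file has been saved locally and that your node labels live in a dictionary keyed by the same entity_uri strings. The names `load_patch`, `apply_patch`, and `node_labels` are hypothetical and are not part of PheKnowLator.

```python
import json


def load_patch(path):
    """Read the nested patch dictionary from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def apply_patch(node_labels, patch):
    """Return a copy of node_labels with labels replaced from the patch.

    node_labels: hypothetical dict mapping entity_uri -> label.
    patch: dict mapping entity_uri -> {'label': ..., 'description/definition': ...}.
    """
    fixed = dict(node_labels)
    for uri, metadata in patch.items():
        # Only touch nodes that are present in both structures.
        if uri in fixed and "label" in metadata:
            fixed[uri] = metadata["label"]
    return fixed


# In-memory example using the entry shown above; the "before" label is a placeholder.
key = "<http://purl.obolibrary.org/obo/UBERON_0000468>"
example_patch = {
    key: {
        "label": "multicellular organism",
        "description/definition": (
            "Anatomical structure that is an individual member of a species "
            "and consists of more than one cell."
        ),
    }
}
node_labels = {key: "placeholder for a non-English label"}
patched = apply_patch(node_labels, example_patch)
print(patched[key])
```

With the real file, `example_patch` would be replaced by `load_patch("bad_node_patch.json")`.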

Hope this will be useful to you to get around the error. All future builds should no longer contain this error.
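
The English-only check can be illustrated with a small standalone sketch. This is not the code from PR #119; it assumes the ontology is available as RDF/XML, uses only the standard library, and the sample class and its labels are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up class with one Spanish and one English label.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/EX_0000001">
    <rdfs:label xml:lang="es">ejemplo de línea celular</rdfs:label>
    <rdfs:label xml:lang="en">example cell line</rdfs:label>
  </owl:Class>
</rdf:RDF>"""


def pick_label(labels):
    """Prefer an English label, then an untagged one, then whatever is first."""
    for text, lang in labels:
        if lang and lang.lower().startswith("en"):
            return text
    for text, lang in labels:
        if not lang:
            return text
    return labels[0][0]


def english_labels(root):
    """Map each named class URI to its preferred label."""
    result = {}
    for cls in root.iter(f"{{{OWL}}}Class"):
        uri = cls.get(f"{{{RDF}}}about")
        labels = [(el.text, el.get(XML_LANG)) for el in cls.findall(f"{{{RDFS}}}label")]
        if uri is not None and labels:
            result[uri] = pick_label(labels)
    return result


labels_by_uri = english_labels(ET.fromstring(SAMPLE))
print(labels_by_uri)
```

Falling back to an untagged label keeps classes from ontologies that do not use language tags at all.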


Thanks so much again for pointing out this error! Let me know if there is anything else that is needed with respect to this issue! 💪 🙇‍♀️ 😄

sanyabt commented on June 3, 2024

Thank you!! This is super helpful and I really appreciate you doing it so quickly 😄

Do we know if it was random nodes from the Cell Line Ontology or all of them? It's weird that my NodeLabels file had different nodes with foreign character labels (close to 1900 nodes) but not all of them were the same as the 2008 we found above (example SO_0000704:gene). I might need to fix those as well till I can get to the next build. Thanks again!
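
One rough way to compare the two sets of nodes is sketched below. It assumes that "foreign characters" can be approximated by "contains non-ASCII characters", which will also flag legitimate labels such as ones with Greek letters. The dictionaries and the non-English label are made up for illustration.

```python
def non_ascii_label_nodes(node_labels):
    """Return the URIs whose label contains at least one non-ASCII character."""
    return {uri for uri, label in node_labels.items() if not label.isascii()}


# Hypothetical NodeLabels content: entity URI -> label.
node_labels = {
    "http://purl.obolibrary.org/obo/SO_0000704": "gène",  # made-up non-English label
    "http://purl.obolibrary.org/obo/UBERON_0000468": "multicellular organism",
}

# Hypothetical set of URIs already covered by the patch file.
patched_uris = {"http://purl.obolibrary.org/obo/UBERON_0000468"}

flagged = non_ascii_label_nodes(node_labels)
still_unpatched = flagged - patched_uris
print(sorted(still_unpatched))
```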

callahantiff commented on June 3, 2024

No trouble at all! Actually, I have a different idea. Give me a few hours and I will send you a new file. Does that sound OK?

callahantiff commented on June 3, 2024

OK, go ahead and download this file (bad_node_patch.json). Please note that this file contains all ontology classes, not just those with problematic labels, from the base set of merged ontologies. It should cover all of your nodes (I ran a check on my end using your code snippet above to verify).

sanyabt commented on June 3, 2024

Thank you, this is perfect!

sanyabt commented on June 3, 2024

Hi @callahantiff, sorry to open this again but I thought it would be better addressed on this thread - the PR namespace nodes have label = 'N/A' in the bad_node_patch.json file 😅

Examples:
'http://purl.obolibrary.org/obo/PR_Q6V1P9',
'http://purl.obolibrary.org/obo/PR_Q9NZV6-1',
'http://purl.obolibrary.org/obo/PR_Q9GZL7',
'http://purl.obolibrary.org/obo/PR_O76083-5',
'http://purl.obolibrary.org/obo/PR_A0A087WT02'...

About 54,903 nodes have 'N/A' labels, and 54,844 of them are from the PR namespace.
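
A count like this can be reproduced with a short sketch along the following lines. It assumes the patch has the nested structure described earlier in the thread and that the namespace is the part of the URI's last segment before the first underscore. The function names are hypothetical and the three-entry dictionary is only a stand-in for the real file.

```python
from collections import Counter


def namespace_of(uri):
    """Return the OBO prefix of a URI, e.g. 'PR' for .../obo/PR_Q6V1P9."""
    local_name = uri.strip("<>").rsplit("/", 1)[-1]
    return local_name.split("_", 1)[0]


def count_missing_by_namespace(patch, missing="N/A"):
    """Count the nodes whose label equals the missing marker, per namespace."""
    return Counter(
        namespace_of(uri)
        for uri, metadata in patch.items()
        if metadata.get("label") == missing
    )


# Stand-in for the loaded bad_node_patch.json content.
example_patch = {
    "http://purl.obolibrary.org/obo/PR_Q6V1P9": {"label": "N/A"},
    "http://purl.obolibrary.org/obo/PR_Q9NZV6-1": {"label": "N/A"},
    "<http://purl.obolibrary.org/obo/UBERON_0000468>": {"label": "multicellular organism"},
}

counts = count_missing_by_namespace(example_patch)
print(counts.most_common())
```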

callahantiff commented on June 3, 2024

haha oh boy, that's what I get for not running tests. OK, re-checking now. Be back in touch soon! 😄

callahantiff commented on June 3, 2024

I finally figured out what was wrong, sorry for the delay. The protein ontology nodes that were missing were the result of changes that the PRO Consortium has made to their endpoint. Prior to October, there was no limit on the number of rows that you could return when querying their system. Now, they only allow you to download 10,000 rows at one time. This impacted the most recent build: the truncated query results introduced all of the missing values, so these nodes were not fully added to the merged core set of ontologies (i.e., they appeared as a class, but were missing all of their associated metadata).
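
For illustration only, a generic way to cope with a per-request row cap is to page through the results until a short page comes back. This is not how the issue was resolved in PheKnowLator. The `fetch_page` callable is a hypothetical stand-in for whatever issues the query (for SPARQL, typically by adding LIMIT and OFFSET clauses), and whether a given endpoint supports paging this way is an assumption.

```python
def fetch_all(fetch_page, page_size=10_000):
    """Collect every row by requesting consecutive pages of at most page_size rows.

    fetch_page: hypothetical callable taking limit and offset keyword arguments
    and returning a list of rows.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return rows
        offset += page_size


# Fake backend with 25 rows, used here instead of a real endpoint.
ALL_ROWS = list(range(25))


def fake_fetch_page(limit, offset):
    return ALL_ROWS[offset:offset + limit]


collected = fetch_all(fake_fetch_page, page_size=10)
print(len(collected))
```

A silent cap is the dangerous case: a result whose length equals the cap exactly is a hint that rows may have been dropped.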

I have fixed the code so that we no longer depend on the SPARQL endpoint, which will make for more stable builds in the future (#120). I also reverted the output in the current_builds directory on GCS to the September build, which does not have this error. I will also re-trigger an updated October build later this weekend, which should be available by the end of next week at the latest.

In the meantime, I also updated the bad_node_patch.json file so that it contains only non-foreign characters, this time using the core set of ontologies from the September build, which is not missing values for the labels. I hope this will work for you for the next few days until the October build is refreshed, which should be free from both the foreign characters and the weird missing data bug.

Thanks for helping me get to the bottom of this, you are awesome! 😄

sanyabt commented on June 3, 2024

Ohh that makes so much sense! No wonder I couldn't see the issue with my previous build's NodeLabels file 😂

A huge thank you for fixing it so quickly and creating the JSON file again! I won't start a new build till at least mid-November so take your time :) Have lots to update you about the natural product-drug interactions KG too! Closing the issue now

callahantiff commented on June 3, 2024

> Thank you!! This is super helpful and I really appreciate you doing it so quickly 😄
>
> Do we know if it was random nodes from the Cell Line Ontology or all of them? It's weird that my NodeLabels file had different nodes with foreign character labels (close to 1900 nodes) but not all of them were the same as the 2008 we found above (example SO_0000704:gene). I might need to fix those as well till I can get to the next build. Thanks again!

Interesting! 🤔 Are all of the nodes that are missing from the dictionary I sent from the SO namespace? I am happy to help you recover those. Code-wise we should still be covered for future builds, but I want to make sure you are covered for the current build too.

sanyabt commented on June 3, 2024

I don't think so - I found nodes with GO and UBERON namespaces too. Should I share the file with you?

Sorry for the extra trouble - I can run the recent build if that is easier.

callahantiff commented on June 3, 2024

haha, ignore the re-opening and closing of the issue, my computer just freaked out! Sorry about that 😊 .

sanyabt commented on June 3, 2024

Sounds good! Haha no worries 😄

callahantiff commented on June 3, 2024

Thank you for pointing out this bug!

callahantiff commented on June 3, 2024

OK, so this is not a bug per se, but rather a bad assumption on my part. I assumed that all ontologies would provide labels for the classes they define. This is unfortunately not always the case (as you proved, hehe 😱 ). I will need to think through where to add code to address this specifically, i.e., importing labels from UniProt. It should be no problem since we download most of this data anyway; I just need to extend the UniProt query.

In case you want to verify this, take a look at the http://purl.obolibrary.org/obo/pr.owl file. You can search for the URIs and confirm that the ontology does not include rdfs:label information for many of the classes that retain the UniProt identifiers.
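
A minimal sketch of such a check, assuming the file is RDF/XML and that named classes appear as owl:Class elements with an rdf:about attribute. It uses only the standard library and streams the input, since pr.owl is large. The embedded sample is hand-written for illustration and is not an excerpt from the real file.

```python
import io
import xml.etree.ElementTree as ET

OWL_CLASS = "{http://www.w3.org/2002/07/owl#}Class"
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"

# Hand-written sample: one class without a label, one made-up class with a label.
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/PR_Q6V1P9"/>
  <owl:Class rdf:about="http://example.org/EX_0000001">
    <rdfs:label>example class with a label</rdfs:label>
  </owl:Class>
</rdf:RDF>"""


def classes_without_label(source):
    """Return the URIs of named classes that have no rdfs:label child.

    source: a file path or a binary file-like object containing RDF/XML.
    """
    missing = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == OWL_CLASS:
            uri = elem.get(RDF_ABOUT)
            if uri is not None and elem.find(RDFS_LABEL) is None:
                missing.append(uri)
            elem.clear()  # free the parsed content of classes already checked
    return missing


unlabeled = classes_without_label(io.BytesIO(SAMPLE))
print(unlabeled)
```

For the real ontology, the `io.BytesIO(SAMPLE)` argument would be replaced by the path to a downloaded copy of pr.owl.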

So, I will think through how to extend the current functionality to catch nodes whose ontology does not provide a label. Updates to come!
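
One possible shape for that fallback, sketched under the assumption that labels from a secondary source (for example, names pulled from UniProt) are available as a dictionary keyed by the same URIs. `resolve_label` and both dictionaries are hypothetical, and the fallback name is a placeholder rather than a real protein name.

```python
def resolve_label(uri, ontology_labels, fallback_labels, missing="N/A"):
    """Use the ontology's own label when present, otherwise a fallback source."""
    label = ontology_labels.get(uri)
    if label and label != missing:
        return label
    return fallback_labels.get(uri, missing)


# Hypothetical inputs: the ontology provides no label for the PR class.
ontology_labels = {
    "http://purl.obolibrary.org/obo/PR_Q6V1P9": "N/A",
    "http://purl.obolibrary.org/obo/UBERON_0000468": "multicellular organism",
}
fallback_labels = {
    "http://purl.obolibrary.org/obo/PR_Q6V1P9": "placeholder protein name",
}

resolved = {
    uri: resolve_label(uri, ontology_labels, fallback_labels)
    for uri in ontology_labels
}
print(resolved)
```

Nodes that are missing from both sources keep the 'N/A' marker, so they remain easy to find later.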

sanyabt commented on June 3, 2024

Sounds good, thank you so much! Let me know if you need any help :)

callahantiff commented on June 3, 2024

Sounds great! I am really looking forward to hearing those updates!! 😄
