Comments (7)

cmungall commented on August 23, 2024

This has been running for 1 day and 10 hours on pan so far:

ontobio-parse-assocs.py -r target/go-ontology.json -f target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp-src.gaf -o target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.gaf -m target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.report validate

from ontobio.

dougli1sqrd commented on August 23, 2024

So the line-by-line file reading that's currently in place does not, in fact, load the whole file into memory. The memory-intensive part is more likely the parsing itself: as we parse the GAF file, we store the associations in a list, and we store both the associations and any messages in the Report object. If the GAF is large, these become correspondingly large.

Perhaps we should instead think about the CLI and decide which parts of the parsing infrastructure to keep for a "raw" parse when we're not interested in the associations. What command represents merely a parse, with no associations? What command represents building the associations? If we're storing those associations, we could buffer them and write them out as we create them to save memory.

Maybe the Report object should be more of a traditional logger, so that someone who is curious about the parse details could examine the log. Currently, I believe we print only a subset of the messages saved to the report, as a summary. If we want a summary, we could generate a Summary object or similar that holds on to only a subset of the data at a time.

In conclusion, let's chat about the specific use cases so we can work out where the memory savings should come from, especially since iterating through the file already reads line by line rather than reading the whole file at once.
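
A minimal sketch of the "write associations out as we create them" idea, for illustration only. It is not ontobio's parser: `iter_assoc_rows` and `stream_rows` are hypothetical names, the rows are plain tab-split lists rather than association objects, and the sample input uses far fewer columns than a real GAF line. The only GAF detail assumed is that comment lines start with `!`.

```python
import io
from typing import Iterator, List, TextIO


def iter_assoc_rows(src: TextIO) -> Iterator[List[str]]:
    """Yield one tab-split row at a time (hypothetical stand-in for a parser)."""
    for line in src:
        # GAF comment/header lines start with "!"; blank lines are ignored too.
        if line.startswith("!") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")


def stream_rows(src: TextIO, out: TextIO) -> int:
    """Write each row as soon as it is parsed; no list of rows is built."""
    written = 0
    for cols in iter_assoc_rows(src):
        out.write("\t".join(cols) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    sample = io.StringIO(
        "!gaf-version: 2.1\n"
        "UniProtKB\tP12345\tGO:0005515\n"
        "UniProtKB\tQ99999\tGO:0008150\n"
    )
    result = io.StringIO()
    print(stream_rows(sample, result), "rows written")
```

Because the generator hands over one row at a time, peak memory depends on the longest line rather than on the size of the file.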

cmungall commented on August 23, 2024

We can easily add a stream option to the report objects.

If the report is too large then I think something else is going on.
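
One way a stream option could look, as a rough sketch. `StreamingReport` and its `message` method are hypothetical and do not reflect ontobio's actual Report class: when a stream is supplied, each message is written through immediately and only per-level counts are retained; without a stream, messages are kept in memory as before.

```python
import io
from collections import Counter
from typing import List, Optional, TextIO, Tuple


class StreamingReport:
    """Hypothetical report that can forward messages instead of storing them."""

    def __init__(self, stream: Optional[TextIO] = None) -> None:
        self.stream = stream
        self.messages: List[Tuple[str, str]] = []  # filled only without a stream
        self.counts: Counter = Counter()           # always maintained

    def message(self, level: str, text: str) -> None:
        self.counts[level] += 1
        if self.stream is not None:
            # Streaming mode: write and forget.
            self.stream.write(f"{level}\t{text}\n")
        else:
            self.messages.append((level, text))


if __name__ == "__main__":
    buf = io.StringIO()
    report = StreamingReport(stream=buf)
    report.message("ERROR", "invalid taxon")
    report.message("WARNING", "obsolete term")
    print(dict(report.counts))
```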

dougli1sqrd commented on August 23, 2024

Proposal:
Report doesn't actually have to keep track of every message or association, since we only ever use the report as a summary. For example, we print out only 10 taxa, subjects, and objects in the summary, so we only need the first 10 and we're done. We do print out each message, but honestly it's the same sort of format as a logger. Why don't we just use a logger for messages instead of the report? The logger can be configured in the same way (file vs. stdout vs. stderr) as the current Report.

We should also be clear whether or not we need the associations. If we don't need them, let's not build them. If we do need them, let's write them out in a buffered manner.

All of these should be configured by subcommands/options in the CLI.
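
A sketch of how the CLI could make that choice explicit, using `argparse` subcommands. The program name, the `check` and `export` subcommands, and the `needs_associations` helper are all hypothetical; they illustrate the proposal and are not the existing options of `ontobio-parse-assocs.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="parse-assocs-sketch")
    parser.add_argument("-f", "--file", required=True, help="input association file")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical: parse and summarize only; no associations are built.
    sub.add_parser("check", help="parse and report, building no associations")

    # Hypothetical: build associations and write them out as they are produced.
    export = sub.add_parser("export", help="parse and write associations")
    export.add_argument("-o", "--outfile", required=True, help="output file")
    return parser


def needs_associations(args: argparse.Namespace) -> bool:
    """Only the export path requires association objects at all."""
    return args.command == "export"


if __name__ == "__main__":
    args = build_parser().parse_args(["-f", "in.gaf", "check"])
    print(args.command, needs_associations(args))
```

Tying the decision to the subcommand means the parser can skip building associations entirely on the report-only path.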

cmungall commented on August 23, 2024

> Why don't we just use a logger for messages instead of the report?

We need the abstraction layer so we can easily capture aggregate stats, such as counts by type of error.

We want to leave open the option of formatting as HTML, etc., in the future.

> We should also be clear whether or not we need the associations

We do in some contexts; we don't in others.

dougli1sqrd commented on August 23, 2024

We can capture aggregate stats about the parsing without holding onto each message in memory. The aggregate, which we can still call Report, could still be formatted however we want.
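
A rough sketch of such an aggregate, assuming a hypothetical `Summary` class (not ontobio's Report): it counts errors by type with a `Counter`, keeps at most the first 10 distinct taxa as examples, and renders text on demand, so its memory use does not grow with the number of messages.

```python
from collections import Counter
from typing import List


class Summary:
    """Hypothetical bounded aggregate of parse results."""

    MAX_EXAMPLES = 10  # mirrors the "first 10" shown in the printed summary

    def __init__(self) -> None:
        self.lines_seen = 0
        self.error_counts: Counter = Counter()
        self.taxa: List[str] = []

    def record_line(self, taxon: str) -> None:
        self.lines_seen += 1
        # Keep only the first few distinct taxa as examples.
        if len(self.taxa) < self.MAX_EXAMPLES and taxon not in self.taxa:
            self.taxa.append(taxon)

    def record_error(self, kind: str) -> None:
        # Only the count per error type is kept, not the message itself.
        self.error_counts[kind] += 1

    def to_text(self) -> str:
        out = [f"lines: {self.lines_seen}", "taxa: " + ", ".join(self.taxa)]
        for kind, count in self.error_counts.most_common():
            out.append(f"{kind}: {count}")
        return "\n".join(out)


if __name__ == "__main__":
    summary = Summary()
    summary.record_line("NCBITaxon:9606")
    summary.record_error("invalid id")
    print(summary.to_text())
```

A different renderer (HTML, for example) could be added next to `to_text` without changing what is stored.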

As for associations, I meant we should be clear about which contexts need them and which don't.

dougli1sqrd commented on August 23, 2024

#76