Add a new method called something like parse_iter , th


Association parsers should return generators (ontobio, CLOSED)

cmungall commented on August 23, 2024

Association parsers should return generators


Comments (7)

cmungall commented on August 23, 2024

This has been running for 1 day and 10 hours on pan so far:

ontobio-parse-assocs.py -r target/go-ontology.json -f target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp-src.gaf -o target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.gaf -m target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.report validate


dougli1sqrd commented on August 23, 2024

The current line-by-line file reading does not load the whole file into memory. The memory-intensive part is more likely what happens as we parse the GAF file: we store the associations in a list, and we store them and any messages in the Report object. If the GAF is large, these grow correspondingly.

Perhaps we should instead think about the CLI and decide which parts of the parsing infrastructure to keep for a "raw" parse, when we're not interested in the associations. What command represents merely a parse, with no associations? What command represents building the associations? If we're storing those associations, we could buffer and write them out as we create them to save memory.

Maybe the Report object should be more of a traditional logger, so that someone curious about the parse details could examine the log. Currently I believe we only print a subset of the messages saved to the report, as a summary. If we want a summary, we can generate a Summary object or similar that only holds on to a subset of the data at a time.

In conclusion, let's chat about the specific use cases so we can work out where the memory savings should come from, especially since iterating through the file already reads line by line rather than reading the whole file at once.
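To make the distinction concrete, here is a minimal sketch of the two shapes being discussed. It is not ontobio's code: `parse_line`, `parse`, and `parse_iter` are hypothetical names, and the two-column tab-separated input is a simplification of real GAF. Both versions read the file line by line; only the eager one holds every association until the end.

```python
import io
from typing import Iterator, List, TextIO


def parse_line(line: str) -> dict:
    """Hypothetical stand-in for real GAF line parsing (two columns only)."""
    cols = line.rstrip("\n").split("\t")
    return {"subject": cols[0], "object": cols[1]}


def parse(file: TextIO) -> List[dict]:
    """Eager version: reads lazily, but keeps every association in a list."""
    return [parse_line(line) for line in file if not line.startswith("!")]


def parse_iter(file: TextIO) -> Iterator[dict]:
    """Lazy version: yields one association at a time and retains nothing."""
    for line in file:
        if line.startswith("!"):  # GAF header/comment lines begin with '!'
            continue
        yield parse_line(line)


sample = (
    "!gaf-version: 2.1\n"
    "UniProtKB:P12345\tGO:0005634\n"
    "UniProtKB:Q99999\tGO:0005737\n"
)
assocs = parse_iter(io.StringIO(sample))
first = next(assocs)  # only the first data line has been parsed so far
```

A caller that really needs everything in memory can still write `list(parse_iter(f))`, so the generator form does not take anything away.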


cmungall commented on August 23, 2024

We can easily add a stream option to the report objects.
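A stream option could look roughly like the sketch below. `StreamingReport` and its `message` method are hypothetical and are not the existing Report API; the point is only that when a stream is supplied, each message is written immediately instead of being appended to a list.

```python
import io
from typing import List, Optional, TextIO


class StreamingReport:
    """Hypothetical report class, not ontobio's actual Report."""

    def __init__(self, stream: Optional[TextIO] = None) -> None:
        self.stream = stream
        self.messages: List[str] = []  # only filled when no stream is given
        self.n_messages = 0

    def message(self, level: str, text: str) -> None:
        self.n_messages += 1
        line = f"{level}: {text}"
        if self.stream is not None:
            self.stream.write(line + "\n")  # emit now, keep nothing
        else:
            self.messages.append(line)  # accumulate in memory


out = io.StringIO()
report = StreamingReport(stream=out)
report.message("ERROR", "invalid taxon")
report.message("WARNING", "obsolete term")
```

In real use `out` would be an open file or `sys.stderr` rather than a `StringIO`.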

If the report is too large then I think something else is going on.


dougli1sqrd commented on August 23, 2024

Proposal:
Report doesn't actually have to keep track of every message or association, since we only ever use the report as a summary. For example, we print only 10 taxa, subjects, and objects in the summary, so we only need the first 10 and we're done. We do print each message, but that is essentially the same format a logger produces. Why don't we just use a logger for messages instead of the report? A logger can be configured the same way the current Report is (file vs. stdout vs. stderr).
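As a rough illustration of that proposal, the sketch below keeps at most 10 distinct subjects for the summary and sends per-line messages to the standard `logging` module. The `Summary` class, the logger name, and the synthetic subject IDs are all made up for the example; none of them come from ontobio.

```python
import io
import logging
from typing import List


class Summary:
    """Hypothetical summary: remembers at most `limit` distinct subjects."""

    def __init__(self, limit: int = 10) -> None:
        self.limit = limit
        self.subjects: List[str] = []
        self.n_lines = 0

    def add(self, subject: str) -> None:
        self.n_lines += 1
        if len(self.subjects) < self.limit and subject not in self.subjects:
            self.subjects.append(subject)


# Messages go to a logger; the handler decides file vs stdout vs stderr.
log_buffer = io.StringIO()  # stands in for a log file in this sketch
logger = logging.getLogger("assoc_parse_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_buffer))

summary = Summary(limit=10)
for i in range(1000):  # synthetic input lines
    subject = f"UniProtKB:P{i:05d}"
    summary.add(subject)
    if i % 250 == 0:
        logger.warning("line %d: questionable annotation for %s", i + 1, subject)
```

After 1000 lines the summary still holds only 10 subjects, and the messages live in the log rather than in the parser's memory.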

We should also be clear whether or not we need the associations. If we don't need them, let's not build them. If we do need them, let's write them out in a buffered manner.

All of these should be configured by subcommands/options in the CLI.
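One way the "need them or not" split might look, assuming the parser exposes associations as an iterator: the consumer either writes each association out as it arrives or only counts them. `fake_assocs` and `consume` are hypothetical helpers, and the two-column output is a simplification, not the GAF format.

```python
import io
from typing import Iterable, Iterator, Optional, TextIO


def fake_assocs(n: int) -> Iterator[dict]:
    """Hypothetical association source standing in for a parser generator."""
    for i in range(n):
        yield {"subject": f"UniProtKB:P{i:05d}", "object": "GO:0005634"}


def consume(assocs: Iterable[dict], out: Optional[TextIO] = None) -> int:
    """Write each association as soon as it is produced, or just count them.

    With out=None nothing is written or retained (a validate-only run).
    """
    count = 0
    for assoc in assocs:
        if out is not None:
            out.write(f"{assoc['subject']}\t{assoc['object']}\n")
        count += 1
    return count


sink = io.StringIO()  # a real run would pass a file opened with open(path, "w")
written = consume(fake_assocs(3), out=sink)
validated = consume(fake_assocs(3))  # no output, no list of associations
```

Files returned by Python's `open()` are buffered by default, so writing one association per call does not mean one disk write per association.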


cmungall commented on August 23, 2024

> Why don't we just use a logger for messages instead of the report?

We need the abstraction layer so we can easily capture aggregate stats by type of error, etc.

We want to leave open the option of formatting in HTML etc. in the future.

> We should also be clear whether or not we need the associations

We do in some contexts; we don't in others.


dougli1sqrd commented on August 23, 2024

We can capture aggregate stats about the parsing without holding onto each message in memory. The aggregate, which we can still call Report, could still be formatted however we want.

As for Associations, I meant we should be clear about which contexts need them and which don't.
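For instance, a counter keyed by level and error type grows with the number of distinct error types, not with the number of lines. The sketch below is an assumption about how such an aggregate could look; `AggregateReport`, `record`, `to_text`, and `to_html` are hypothetical names, not the existing Report interface.

```python
from collections import Counter


class AggregateReport:
    """Hypothetical aggregate: stores counts per (level, kind), no messages."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, level: str, kind: str) -> None:
        self.counts[(level, kind)] += 1

    def to_text(self) -> str:
        """Tab-separated rows: level, kind, count."""
        return "\n".join(
            f"{level}\t{kind}\t{n}"
            for (level, kind), n in sorted(self.counts.items())
        )

    def to_html(self) -> str:
        """Same data as an HTML table, built from the same counts."""
        rows = "".join(
            f"<tr><td>{level}</td><td>{kind}</td><td>{n}</td></tr>"
            for (level, kind), n in sorted(self.counts.items())
        )
        return f"<table>{rows}</table>"


agg = AggregateReport()
agg.record("ERROR", "invalid taxon")
agg.record("ERROR", "invalid taxon")
agg.record("WARNING", "obsolete term")
```

Because the formatting methods read only the counts, adding another output format later does not require keeping the individual messages around.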


dougli1sqrd commented on August 23, 2024

#76
