Comments (7)
This has been running for 1 day and 10 hours on pan so far:
ontobio-parse-assocs.py -r target/go-ontology.json -f target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp-src.gaf -o target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.gaf -m target/groups/goa_uniprot_gcrp/goa_uniprot_gcrp.report validate
from ontobio.
So the line-by-line file reading that's currently present does not in fact load the whole file into memory. The memory-intensive part is more likely this: as we parse the GAF file, we store the associations in a list, and we store them and any messages in the Report object. If the GAF is large, these become correspondingly large. Perhaps we should instead think about the CLI and decide which parts of the parsing infrastructure to keep for a "raw" parse when we're not interested in the associations. What command represents merely a parse, with no associations? What command represents building the associations? If we're storing those associations, we could buffer them and write them out as we create them to save memory. Maybe the Report object should be more of a traditional logger, so that someone curious about the parse details could examine the log. Currently I believe we only print a subset of the messages saved to the report, as a summary. If we want a summary, we could generate a Summary or something similar that only holds on to a subset of the data at a time.
In conclusion, let's chat about the specific use cases so we can work out where the memory savings should come from, especially since iterating through the file already reads one line at a time rather than the whole file at once.
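To make the "write them out as we create them" idea concrete, here is a minimal sketch. It is not ontobio's actual API: `parse_line`, `iter_associations`, `stream_validate`, and the column threshold are all hypothetical stand-ins, and the real parser does far more validation.

```python
import io
from typing import Iterable, Iterator, Optional, TextIO

MIN_COLUMNS = 15  # illustrative threshold only, not the real validation rule


def parse_line(line: str) -> Optional[str]:
    """Hypothetical stand-in for the real per-line GAF parse.

    Returns a normalized row, or None for comments, blanks, and short rows.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    cols = line.split("\t")
    if len(cols) < MIN_COLUMNS:
        return None
    return "\t".join(col.strip() for col in cols)


def iter_associations(lines: Iterable[str]) -> Iterator[str]:
    """Yield one parsed row at a time instead of building a list."""
    for line in lines:
        assoc = parse_line(line)
        if assoc is not None:
            yield assoc


def stream_validate(src: TextIO, out: TextIO) -> int:
    """Write each row as soon as it is parsed; keep only a counter in memory."""
    written = 0
    for assoc in iter_associations(src):
        out.write(assoc + "\n")
        written += 1
    return written


# Tiny in-memory demonstration; real use would pass open file handles.
demo_out = io.StringIO()
demo_count = stream_validate(
    io.StringIO("!comment\n" + "\t".join(["x"] * 17) + "\n"), demo_out
)
```

With this shape, memory use depends on the longest line rather than on the size of the GAF, at the cost of not having the full association list available afterwards.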
from ontobio.
We can easily add a stream option to the report objects.
If the report is too large, then I think something else is going on.
from ontobio.
Proposal:
Report doesn't actually have to keep track of every message or association, since we only ever use the report as a summary. For example, we print out only 10 taxons, subjects and objects in the summary, so we only need the first 10 and we're done. We do print out each message, but honestly that's the same sort of format a logger produces. Why don't we just use a logger for messages instead of the report? The logger can be configured the same way the current Report can (file vs stdout vs stderr).
We should also be clear whether or not we need the associations. If we don't need them, let's not build them. If we do need them, let's write them out in a buffered manner.
All of these should be configured by subcommands/options in the CLI.
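A rough sketch of what the proposal could look like, assuming a hypothetical `Summary` class and logger name (neither exists in ontobio as far as this thread shows). It keeps only the first 10 distinct taxa and sends each message to the standard `logging` module instead of storing it.

```python
import logging
from typing import List

logger = logging.getLogger("gaf_parse")  # hypothetical logger name


class Summary:
    """Hypothetical bounded summary: holds at most `limit` sample values."""

    def __init__(self, limit: int = 10) -> None:
        self.limit = limit
        self.taxa: List[str] = []
        self.lines_skipped = 0

    def note_taxon(self, taxon: str) -> None:
        # Keep only the first `limit` distinct taxa; later ones are ignored.
        if len(self.taxa) < self.limit and taxon not in self.taxa:
            self.taxa.append(taxon)

    def warn(self, line_no: int, message: str) -> None:
        # The message goes to the logger and is not retained in memory.
        self.lines_skipped += 1
        logger.warning("line %d: %s", line_no, message)
```

Where the messages end up (file, stdout, stderr) would then be a matter of attaching a `logging.FileHandler` or `logging.StreamHandler` from the CLI options.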
from ontobio.
"Why don't we just use a logger for messages instead of the report?"
We need the abstraction layer so we can easily capture aggregate stats, such as counts per type of error.
We also want to leave open the option of formatting in HTML etc. in the future.
"We should also be clear whether or not we need the associations"
We do in some contexts; we don't in others.
from ontobio.
We can capture aggregate stats about the parsing without holding onto each message in memory. The aggregate, which we can still call Report, could still be formatted however we want.
As for Associations, I meant we should be clear about which contexts need them and which don't.
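As a sketch of that idea, assuming a hypothetical `AggregateReport` class rather than the existing Report: it stores one counter per (level, error type) pair instead of every message, and can still render as plain text or HTML.

```python
from collections import Counter
from html import escape
from typing import Tuple


class AggregateReport:
    """Hypothetical aggregate: counts per (level, error type), no messages kept."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, level: str, error_type: str) -> None:
        key: Tuple[str, str] = (level, error_type)
        self.counts[key] += 1

    def to_text(self) -> str:
        # One tab-separated line per (level, error type), sorted for stability.
        return "\n".join(
            f"{level}\t{etype}\t{n}"
            for (level, etype), n in sorted(self.counts.items())
        )

    def to_html(self) -> str:
        rows = "".join(
            f"<tr><td>{escape(level)}</td><td>{escape(etype)}</td><td>{n}</td></tr>"
            for (level, etype), n in sorted(self.counts.items())
        )
        return f"<table>{rows}</table>"
```

Memory here grows with the number of distinct error types, not with the number of lines in the GAF.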
from ontobio.