mozi-ai / annotation-scheme Goto Github PK
View Code? Open in Web Editor NEWHuman Gene annotation service backend
License: GNU General Public License v3.0
Human Gene annotation service backend
License: GNU General Public License v3.0
After PR #43, the output interactions between genes that interact with the main gene are missing.
Pull req #125 introduces a new bug, and ignores an existing bug... both should be fixed. See comments there for details.
So, following the README, I loaded datasets/smpdb_chebi_wname.scm
and then
(gene-go-annotation '("IGF1") "biological_process molecular_function cellular_component" 0 #:id "")
which gives
;;; <stdin>:18:0: warning: possibly wrong number of arguments to `gene-go-annotation'
ice-9/boot-9.scm:1655:16: In procedure raise-exception:
Invalid keyword: 0
So the leading single-quote seems wrong, single quotes are short-handle for (quote stuff)
Breakage due to pull req #172 -- see discussion there for work-around. I get test-suite.log
with contents:
test-name: annotate-genes
location: /home/ubuntu/src/annotation-scheme/tests/main-tests.scm:33
source:
+ (test-assert
+ "annotate-genes"
+ (string?
+ (annotate-genes (list "IGF1") "Dfaer" req)))
actual-value: #f
actual-error:
+ (wrong-type-arg
+ "string->list"
+ "Wrong type argument in position ~A (expecting ~A): ~S"
+ (1
+ "string"
+ ("data"
+ ("id" . "Uniprot:Q01094")
+ ("name" . "E2F1")
+ ("definition"
+ .
+ "https://www.uniprot.org/uniprot/Q01094")
Sorry to pester you with so many issues, but .. I can't keep track of them all without writing them down.
I ran the following:
(gene-pathway-annotation gene-list "my-path-results"
#:pathway "reactome smpdb"
#:namespace "biological_process molecular_function cellular_component"
#:parents 0)
It took 60 hours to compute (2.5 days) During the entire time, the guile heap was at 150MBytes -- I know because I printed the heap size after each gene. Then it went to write the results file: this took 5 minutes to write, and the guile heap exploded to 16GB during those five minutes.
The file write itself wrote 743843 results (this was the length of the "results" list) The results file xgene-pathway.scm
was 3974636585 bytes long (that's 4GBytes in size).
I conclude that there's something pathological about the file-writes; and perhaps things would go better if you wrote incrementally to a pipe, and then piped that pipe to a file.
I am trying to annotate BRCA1 genes and when I run the annotate-genes
function, it fails with the following error:
In procedure cog-outgoing-set: Wrong type argument in position 1 (expecting opencog atom): #<Invalid handle>
The error happens in the run-query
function. It looks like the issue is due the pubmed id caching and may it is referring to a delete SetLink
(I could be wrong though)
To reproduce the issue, load the datasets and run annotate-genes
function with the following arguments
(annotate-genes (list "TSPAN6") "agingSymbols" "[{\"function_name\": \"gene-pathway-annotation\", \"filters\": [{\"filter\": \"pathway\", \"value\": \"smpdb reactome\"},{\"filter\": \"include_prot\", \"value\": \"True\"}, {\"filter\": \"include_sm\", \"value\": \"False\"},{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"}, {\"filter\": \"biogrid\", \"value\": \"0\"}]}, {\"function_name\": \"gene-go-annotation\", \"filters\": [{\"filter\": \"namespace\", \"value\": \"biological_process cellular_component molecular_function\"}, {\"filter\": \"parents\", \"value\": \"0\"}, {\"filter\": \"protein\", \"value\": \"True\"}]}, {\"function_name\": \"include-rna\", \"filters\": [{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"},{\"filter\": \"protein\", \"value\": \"1\"}]},{\"function_name\": \"biogrid-interaction-annotation\", \"filters\": [{\"filter\": \"interaction\", \"value\": \"Proteins\"}]}]")
@linas can you check this please?
The macro GUILE_SITE_GO_DIR
is undefined but referenced in configure.ac.
If the intent is to determine the target directory for compiled Guile modules the macro GUILE_SITE_DIR
should be sufficient as it sets the variable GUILE_SITE_CCACHE
.
The generate-result
returns list of many things at the same time. I think it should be broken up into smaller functions that each return a single result. This can help us avoid having every huge lists in the result.
The simple caching below gives a 1.5x speedup for biogrid annotation. I'm just cut-n-pasting the patch here, you have to adapt it to your own tastes.
--- a/annotation/functions.scm
+++ b/annotation/functions.scm
@@ -1149,7 +1149,19 @@ rv)
rv))
(define-public (find-pubmed-id gene-a gene-b)
- (let ([pub (cog-outgoing-set (cog-execute!
+ (cache-find-pubmed-id (Set gene-a gene-b)))
+
+(define cache-find-pubmed-id
+ (make-afunc-cache do-find-pubmed-id))
+
+(define (do-find-pubmed-id gene-set)
+"
+ This is expecting a (SetLink (Gene \"a\") (Gene \"b\")
+"
+ (let* (
+ [gene-a (cog-outgoing-atom gene-set 0)]
+ [gene-b (cog-outgoing-atom gene-set 1)]
+ [pub (run-query
(GetLink
(VariableNode "$pub")
(EvaluationLink
@@ -1165,9 +1177,9 @@ rv)
)
)
- )))])
+ ))])
(if (null? pub)
- (set! pub (cog-outgoing-set (cog-execute!
+ (set! pub (run-query
(GetLink
(VariableNode "$pub")
(EvaluationLink
@@ -1182,10 +1194,11 @@ rv)
(VariableNode "$pub")
)
)
- )))
+ ))
))
pub
))
+
(define-public (find-crna gene protein)
(cog-execute! (BindLink
(VariableList
So if I do grep True annotation
I find a dozen uses of (if (equal? prot "True") ...)
and similar. Instead of using the string "True", just say #t and that way you can write (if prot ...)
The idea being that in scheme, #t and #f denote true and false...
The current code doesn't have any tests. This is a problem. Without tests, we cannot be confident the code does that it was intended to do. Adding unit-tests is a task that is long overdue.
We can start by adding tests to the three annotation functions at first and add more to the pattern matching functions.
this relates to #23.
I get a warning and a crash on build:
$ make
./env /usr/local/bin/guild compile -Wunbound-variable -Warity-mismatch -Wformat -o "annotation/util.go" "../annotation/util.scm"
;;; note: auto-compilation is enabled, set GUILE_AUTO_COMPILE=0
;;; or pass the --no-auto-compile argument to disable.
;;; compiling /usr/local/bin/guild
====> Attention!
====> Obsolete! You do not need this module any longer.
====> Please use `(opencog exec)` and `cog-execute!` instead.
Backtrace:
In ice-9/boot-9.scm:
1722:10 19 (with-exception-handler _ _ #:unwind? _ # _)
In system/base/compile.scm:
59:11 18 (_)
155:11 17 (_ #<closed: file 5632d31f97e0>)
175:2 16 (read-and-compile #<input: annotation/util.scm 13> # _ # …)
183:32 15 (compile-fold (#<procedure compile-tree-il (x e opts)>) …)
In ice-9/boot-9.scm:
2792:4 14 (save-module-excursion _)
In language/scheme/compile-tree-il.scm:
31:15 13 (_)
In ice-9/psyntax.scm:
1241:36 12 (expand-top-sequence _ _ _ #f _ _ _)
1233:19 11 (parse _ (("placeholder" placeholder)) ((top) #(# # …)) …)
285:10 10 (parse _ (("placeholder" placeholder)) (()) _ c&e (# #) #)
In ice-9/eval.scm:
293:34 9 (_ #<module (#{ g409}#) 5632d2ec08c0>)
In ice-9/boot-9.scm:
3366:4 8 (define-module* _ #:filename _ #:pure _ #:version _ # _ …)
2551:24 7 (call-with-deferred-observers _)
3379:24 6 (_)
222:29 5 (map1 (((opencog)) ((opencog query)) ((opencog exec)) …))
222:29 4 (map1 (((opencog query)) ((opencog exec)) ((# #)) (#) …))
222:29 3 (map1 (((opencog exec)) ((opencog bioscience)) ((#)) # …))
222:17 2 (map1 (((opencog bioscience)) ((json)) ((ice-9 #)) (#) …))
3286:6 1 (resolve-interface (opencog bioscience) #:select _ # _ # …)
1655:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1655:16: In procedure raise-exception:
no code for module (opencog bioscience)
Makefile:1082: recipe for target 'annotation/util.go' failed
make: *** [annotation/util.go] Error 1
I'll submit a fix for the warning shortly, and then go hunting for the missing module. ... the configure.ac
should check for dependencies...
The scheme result of the annotation-service is supposed to be used for Pattern miner and PLN works. As the bio-atomspace is too big to work with, we wanted to have a smaller version by annotating the ~800 genes from the mosses model. The problem with the annotation result is it has additional ListLink's ( which we used For example to return >1 hypergraphs on certain condition
check here
The ListLink causes error (when the scheme result is used as a dataset for the Pattern miner code here https://github.com/ngeiswei/reasoning-bio-as-xp)
a ListLink is considered not not have many outgoing nodes. What should we do to get rid of these ListLinks?
i get this trying to install:
checking for Guile site directory... /usr/share/guile/site/2.2
checking for Guile site-ccache directory using pkgconfig... /usr/lib/x86_64-linux-gnu/guile/2.2/site-ccache
checking for Guile extensions directory... /usr/lib/x86_64-linux-gnu/guile/2.2/extensions
checking if (json) is available... yes
checking if (opencog bioscience) is available... no
configure: error: The (opencog bioscience) module is needed. See https://github.com/opencog/agi-bio for details.
even though it's installed:
$ ls /usr/share/guile/site/2.2/opencog/
...
bioscience logger.scm persist.scm test-runner.scm
...
i have (use-modules (opencog))
in ~/.guile but guile still has problems finding it:
$ guile
scheme@(guile-user)> (use-modules (opencog bioscience))
While compiling expression:
In procedure dynamic-link: file: "libbioscience-types", message: "file not found"
scheme@(guile-user)> (use-modules (opencog))
scheme@(guile-user)> (use-modules (opencog bioscience))
scheme@(guile-user)>
what am i doing wrong?
While loading the current datasets, I get an error: In procedure unquote: expression not valid outside of quasiquote
its not clear which dataset provoked this. Investigating ...
Currently, the tests
directory contains examples, but they are not executable, so they cannot be used to automatically verify that the code still behaves as expected.
I suggest using srfi-64 to define executable tests. See here for an example: https://git.savannah.gnu.org/cgit/gwl.git/tree/tests/processes.scm
The following patch speeds up my code by another factor of 1.5x for pathway searches. I'm posting here rather than as a pull req so that you can adjust it to your coding style. See issue #98 for performance measurements.
--- a/annotation/util.scm
+++ b/annotation/util.scm
@@ -94,7 +94,10 @@
;; Find node name and description
-(define-public (node-info node)
+(define-public node-info
+ (make-afunc-cache xnode-info))
+
+(define-public (xnode-info node)
; (format #t "duude node-info=~A\n" node)
(if (cog-node? node)
(list
@@ -201,8 +204,26 @@
)
)
+(define (run-query QUERY)
+"
+ Call (cog-execute! QUERY), return results, delete the SetLink.
+ This avoids a memory leak of SetLinks
+"
+ ; Run the query
+ (define set-link (cog-execute! QUERY))
+ ; Get the query results
+ (define results (cog-outgoing-set set-link))
+ ; Delete the SetLink
+ (cog-delete set-link)
+ ; Return the results.
+ results
+)
+
+(define-public find-name
+ (make-afunc-cache probe-find-name))
+
;;finds go name for parser function
-(define-public (find-name a)
+(define (probe-find-name a)
(find-name-ctr #:enter? #t)
(let ((rv (oldfind-name a)))
(find-name-ctr #:enter? #f)
@@ -214,8 +235,8 @@
[predicate (if (regexp-match? (string-match "GO:[0-9]+" (cog-name atom))) "GO_name" "has_name")]
)
(get-name
- (cog-outgoing-set
- (cog-execute! ; cog-execat! ; cog-execute!
+ (run-query
+ ; cog-execute! ; cog-execat! ; cog-execute!
(GetLink
(VariableNode "$name")
@@ -224,7 +245,7 @@
(ListLink
atom
(VariableNode "$name")
- )))))))))
+ ))))))))
;;Given an atom and list of namespaces finds the parents of that atom in the specified namespaces
@@ -324,17 +345,19 @@
))
)
-(define-public (locate-node a)
+(define-public locate-node
+ (make-afunc-cache probe-locate-node))
+
+(define (probe-locate-node a)
(locate-node-ctr #:enter? #t)
(let ((rv (xlocate-node a)))
(locate-node-ctr #:enter? #f)
rv))
-
-(define-public xlocate-node
+(define xlocate-node
(lambda(node)
; (format #t "duude in locate node=~A\n" node)
- (let ([loc (cog-outgoing-set (cog-execute!
+ (let ([loc (run-query
(BindLink
(VariableNode "$go")
(AndLink
@@ -354,12 +377,12 @@
(VariableNode "$go")
))
)
- ))
+ )
])
; (format #t "duude in locate loc=~A\n" loc)
(if (null? loc)
(set! loc
- (cog-outgoing-set (cog-execute!
+ (run-query
(BindLink
(VariableNode "$loc")
(EvaluationLink
@@ -373,7 +396,7 @@
node
(VariableNode "$loc")))
)
- )))
+ ))
)
loc
)
this will fix the 'N/A' name given to uniprot proteins added when the pathway protein switch is invoked, make the reactome uniprot names consistent (currently some are symbols, some are phrases, some have indicators of post translational modifications, etc). it will also permit removal of 'has name' predicates from the knowledgebase.
OK, got build and install to work. Now make check
fails:
$ make check
make check-TESTS
make[1]: Entering directory '/home/ubuntu/src/annotation-scheme/build'
make[2]: Entering directory '/home/ubuntu/src/annotation-scheme/build'
FAIL: tests/parser-tests.scm
Makefile:779: recipe for target 'tests/util-tests.log' failed
make[2]: *** [tests/util-tests.log] Error 1
So
$ cat tests/util-tests.log
Backtrace:
1 (primitive-load-path "tests/util-tests.scm")
0 (primitive-load "tests/sample_dataset.scm")
ERROR: In procedure primitive-load:
In procedure open-file: No such file or directory: "tests/sample_dataset.scm"
Pursuant to comment #98 (comment) and similar to issue #103 the speed of biogrid annotation can be more than doubled, with the code below. Viz, before, the patch, perf is this:
Time: 2945.8490 secs. calls: 75 avg: 39277986.8 usec/call for find-output-interactors
Time: 2067.2802 secs. calls: 46366 avg: 44586.1 usec/call for generate-result
after patching, its this:
Time: 1409.3709 secs. calls: 75 avg: 18791612.0 usec/call for find-output-interactors
Time: 330.18767 secs. calls: 45971 avg: 7182.5 usec/call for generate-result
The patch is this (i'm not making a pull req, because it requires additional cleanup for production code) Also, plenty more optimization opportunities are readily available in this code-block.
(define-public (xgenerate-result gene-a gene-b prot go)
(let* (
[output (find-pubmed-id gene-a gene-b)]
[res (filter
(lambda (ggg) (not (reported-genes ggg)))
(list gene-a gene-b))]
[pairs (reported-grid-pairs (Set gene-a gene-b))]
[interaction (if (= 1 (string->number (cog-name prot)))
(ListLink
(build-interaction gene-a gene-b output "interacts_with")
(build-interaction (find-protein-form gene-a) (find-protein-form gene-b) output "inferred_interaction"))
(build-interaction gene-a gene-b output "interacts_with"))]
[namespace (if (null? (cog-outgoing-set go)) '() (car (cog-outgoing-set go)))]
[parent (if (null? (cog-outgoing-set go)) '() (cadr (cog-outgoing-set go)))]
)
and the rest of the function is more-or-less same as before, except:
biogrid-pairs
and biogrid-genes
they are not neededGeneNode
its already an atom. So this gives: (if (null? interaction)
(ListLink)
)
(match res
((a b)
(begin
(if (= 1 (string->number (cog-name prot)))
(let ([coding-prot-a (find-protein-form a)]
[coding-prot-b (find-protein-form b)])
(if (or (equal? coding-prot-a (ListLink)) (equal? coding-prot-b (ListLink)))
(ListLink)
(ListLink
interaction
(Evaluation (Predicate "expresses") (ListLink a coding-prot-a))
(node-info a)
(node-info coding-prot-a)
(locate-node coding-prot-a)
(EvaluationLink (PredicateNode "expresses") (ListLink b coding-prot-b))
(node-info b)
(node-info coding-prot-b)
(locate-node coding-prot-b)) ;;; <<< bug fix here
))
(ListLink
interaction
(node-info a)
(locate-node a)
(node-info b)
(locate-node b)
(if (not (null? namespace))
(ListLink
(ConceptNode "gene-go-annotation")
; bug fix here!!!
(find-go-term a (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))
(find-go-term b (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))
(ListLink (ConceptNode "biogrid-interaction-annotation"))
)
'() )
)
))
)
((a)
(begin
(if (= 1 (string->number (cog-name prot)))
(let ([coding-prot (find-protein-form a)])
(if (equal? coding-prot (ListLink))
(ListLink)
(ListLink
interaction
(EvaluationLink (PredicateNode "expresses") (ListLink a coding-prot))
(node-info a)
(node-info coding-prot)
(locate-node coding-prot))
))
(ListLink
interaction
(node-info a)
(locate-node a)
(if (not (null? namespace))
(ListLink (ConceptNode "gene-go-annotation")
(find-go-term a
(string-split (cog-name namespace) #\ )
(string->number (cog-name parent)))
(ListLink (ConceptNode "biogrid-interaction-annotation"))
)
'()
))
))
)
(()
(if pairs
(ListLink)
(ListLink
interaction
)
)
)
)
)
)
Note that the above also contains three bug fixes: two to find-go-term
and one to (locate-node coding-prot-b)) ;;; <<< bug fix here
@tanksha and @Habush I have a generic idea I want to explore. How many of the biome datasets are available in json format?
I'm thinking that it might be worth creating a generic json->atomese importer, and a generic exporter. However, this would work only if
the input json dataset formats are not badly/insanely designed (i.e. the resulting imported atomese would have to look reasonable enough to be useable)
There are not too many conversions needed to obtain something that can be easily reasoned on or pattern-mined. So, for example, the import process might be (a) generic json->atomese (b) run a BindLink
to convert this generic-atomese into something friendlier for mining/pln/moses (c) actually run mining/pln/moses. (so maybe @ngeiswei you'd have an idea about this?)
What is not clear is whether step (b) above is easier than just writing a custom json importer. If it's not easier, then this generic-import idea seems like a not-very-good idea. But I can't really tell...
(follow up to issue #164)
As of commit 679ae5b there are a bunch of new test failures:
edge-count
)opencog_deps
no longer existsparse-request
, annotate-genes
, and protein-goterm
.There were no test failures at commit 7a79385.
I would like to contribute to this project, but I cannot seem to find any license. Could you please clarify under what license (if any) this project is released?
Loading the current datasets, and running this:
(gene-pathway-annotation gene-list
"my-path-results"
#:pathway "reactome smpdb"
#:namespace "biological_process molecular_function cellular_component"
#:parents 0))
results in a crash:
Backtrace:
In ice-9/boot-9.scm:
260:13 19 (for-each #<procedure 7f22f81a39b0 at annotation/gene-pathway.…> …)
260:13 18 (for-each #<procedure 7f22f81a3000 at annotation/gene-pathway.…> …)
In annotation/gene-pathway.scm:\n 44:10 17 (_ _)
109:24 16 (reactome \"NDUFAF7\" \"True\" \"True\" _ # 1)
In ice-9/boot-9.scm:
222:17 15 (map1 (#))
In annotation/gene-pathway.scm:
115:32 14 (_ _)
In annotation/functions.scm:
301:22 13 (find-pathway-genes _ _)
In unknown file:
12 (opencog-extension cog-execute! (#))
In ice-9/boot-9.scm:
1722:10 11 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
10 (apply-smob/0 #<thunk 7f22ea068b60>)
In ice-9/boot-9.scm:
1722:10 9 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
8 (apply-smob/0 #<thunk 7f22ea068b20>)
In annotation/functions.scm:
339:124 7 (add-pathway-genes _ _ _)
125:2 6 (find-go-term \"ECSIT\" (\"biological_process\" \"molecular_func…\" …) …)
71:6 5 (find-memberln \"ECSIT\" (\"biological_process\" \"molecular_func…\" …))
In srfi/srfi-1.scm:
634:9 4 (for-each #<procedure 7f22ea06a020 at annotation/functions.scm…> …)
In annotation/functions.scm:
76:14 3 (_ \"biological_process\")
In unknown file:
2 (cog-new-link 81 \"ECSIT\" (VariableNode \"$a\")
)
In ice-9/boot-9.scm:
1655:16 1 (raise-exception _ #:continuable? _)
In unknown file:
0 (apply-smob/1 #<exception-handler 7f22ea068b00> #<&compound-exc…>)
ERROR: In procedure apply-smob/1:
In procedure cog-new-link: Wrong type argument in position 2 (expecting opencog atom): \"ECSIT\"
ABORT: wrong-type-arg
(/home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:1066)
Function args:
((BindLink
(VariableList
(TypedVariableLink
(VariableNode \"$p\")
(TypeNode \"MoleculeNode\")
)
(TypedVariableLink
(VariableNode \"$g\")
(TypeNode \"GeneNode\")
)
)
(AndLink
(MemberLink
(VariableNode \"$p\")
(ConceptNode \"R-HSA-6799198\")
) (EvaluationLink
(PredicateNode \"expresses\")
(ListLink
(VariableNode \"$g\")
(VariableNode \"$p\")
)
)
)
(ExecutionOutputLink
(GroundedSchemaNode \"scm: add-pathway-genes\")
(ListLink
(ConceptNode \"R-HSA-6799198\")
(VariableNode \"$g\")
(ListLink
(ConceptNode \"biological_process molecular_function cellular_component\")
(NumberNode \"0\")
)
)
)
)
)")'.
I suspect that find-similar-gene
is performing it's tests in an exactly backwards manner... for example: (find-similar-gene "SH")
returns the empty list, even though there are several genes that have names starting with SH
. By contrast, (find-similar-gene "SHISA5asdfasdf")
returns a list of one gene: ((GeneNode "SHISA5"))
... surely, this is not what you wanted? I'm thinking the search pattern and the match are exactly reversed ...
If I reverse them I get a reasonable result: (reverse-fsg "SH")
gives ((GeneNode "SHC1") (GeneNode "SHISA5"))
This is a performance optimization: find-pathway-name
performs a very simple search, using a Update: this previous statement might be completely false. I have not measured. The proposal below and in #97 might be a bad idea. I will try to measure, and get a better feel for what is going on. In the mean time, I'm leaving this issue open.GetLink
. it would almost surely be much faster if it was written out, long-hand.
Why might a hand-written version be faster? Because of three things:
Flipside: doing it "by hand", in scheme, requires multiple entries in and out of the guile interpreter, each of which has significant overhead, that might outweigh the losses mentioned above.
To add injury to insult: the SetLink is never removed, so it now clutters up the atomspace forever, slowing down future searches. Also, the GetLink is never removed, which also clutters up the atomspace. Both of these can be avoided by using a temp atomspace (I can provide code for that, it's in the learning repo, but maybe should be moved to the main repo). But, for such simple searches, its best to just avoid using GetLink, and reserve that for complicated searches.
Reading through the code for is-cellular-component?
suggests its inefficient and probably buggy.
Buggy: it performs a loop, and sets a response for every acceptable item in the loop. But only the last response is ever kept! All the other loop iterations are discarded! Do you really want to discard all answers, keeping only the last one?
Inefficient: the loop could be run in reverse, and terminated upon the first acceptable answer. This avoids running it for longer than needed.
In the current version, all the annotation functions are executed sequentially. However, since there is a little dependency across the functions, it will be great to add multithreading support to improve performance.
All three annotation should run in their separate threads. But one catch here is, whether the parsing should be done in parallel or after all the annotations are done. If it is done in parallel along with the annotations, then some data variables need to be synchronized.
In Biogrid annotation,
(biogrid-interaction-annotation (list "IGF1") "DD" #:interaction "Proteins" #:namespace "biological_process")
The result don't include a GO cross annotation (for genes from Biogrid). it works when the interaction type is of "Genes" but not when "Proteins".
The current sample_dataset.scm
is too big and takes a bit of time to load for unit-tests that test a specific functionality. So to reduce the test run times, it is better if each test dataset is tailored for that unit-test.
This is another dataset issue bug report. Gene interactions should be symmetric. But not all of them are marked as such. The current dataset has 537839 interacting gene-pairs. Of these, approx 193653 do not have a symmetric partner. There are two possible fixes for this: (1) use a SetLink not a ListLink (2) use a script to generate the missing ones. I am using the following script to clean this up:
(define (symmetrize-gene-interactions)
"
The gene interactions use the asymmetric ListLink to denote
gene pairs. But gene interactions are symmetrix, so they should
have used the SetLink. Oh well. At this time, it is convenient
to use the ListLink during pattern searches; but in this case,
the interactions should be symmetrized. That's what this does.
"
(define interact-q
(Bind
(VariableList
(TypedVariable (Variable "g1") (Type 'GeneNode))
(TypedVariable (Variable "g2") (Type 'GeneNode)))
(Present
(Evaluation (Predicate "interacts_with")
(List (Variable "g1") (Variable "g2"))))
(Evaluation (Predicate "interacts_with")
(List (Variable "g2") (Variable "g1")))))
(define sym-set (cog-execute! interact-q))
(format #t "Found ~A gene interactions\n"
(length (cog-outgoing-set sym-set)))
(cog-delete sym-set)
)
I now get
PASS: tests/parser-tests.scm
PASS: tests/util-tests.scm
PASS: tests/pathway-tests.scm
Makefile:789: recipe for target 'tests/main-tests.log' failed
make[2]: *** [tests/main-tests.log] Error 1
make[2]: Leaving directory '/home/ubuntu/src/annotation-scheme/build'
Makefile:768: recipe for target 'check-TESTS' failed
make[1]: *** [check-TESTS] Error 2
make[1]: Leaving directory '/home/ubuntu/src/annotation-scheme/build'
Makefile:960: recipe for target 'check-am' failed
make: *** [check-am] Error 2
$ cat tests/main-tests.log
Backtrace:
1 (primitive-load-path "tests/main-tests.scm")
0 (primitive-load "tests/sample_dataset.scm")
ERROR: In procedure primitive-load:
In procedure open-file: No such file or directory: "tests/sample_dataset.scm"
These were working a week ago ...
Currently, the files in the "fuctions" directory are just a collection of procedure definitions. Actually using them requires loading them from that file. The Guile way is to have files define modules, so that they can be used with (use-modules (my module))
as long as they are on the GUILE_LOAD_PATH.
Using modules would also unlock compilation, which provides a significant performance boost.
File loading performance could be increased by writing, at the start of the file
(define loc (PredicateNode "has_location"))
and then using that: (Evaluation loc (List ...))
Also doing this for each gene, pathway, protein would also speed things a little bit:
(define muniA0A075B6P5 (Molecule "Uniprot:A0A075B6P5"))
(define rhsa166663 (Concept "R-HSA-166663"))
By using such defines, you save a tiny amount of time to drop into the C++ code and atomspace, to perform the lookup. Also, the files get slightly smaller.
Originally posted by @linas in #85 (comment)
Based on the results described in this comment:
it appears that the two functions pathway-hierarchy
and check-pathway
are responsible for the explosion in the guile heap usage, as well as being giant CPU-time losers. To further ttrace this and fix this, I am using the following alternate implementation:
(define-public (run-query QUERY)
"
Call (cog-execute! QUERY), return results, delete the SetLink.
This avoids a memory leak of SetLinks
"
; Run the query
(define set-link (cog-execute! QUERY))
; Get the query results
(define results (cog-outgoing-set set-link))
; Delete the SetLink
(cog-delete set-link)
; Return the results.
results
)
(define-public (pathway-hierarchy pw lst)
(format #t "Enter pathway-hierarchy sizeof lst: ~A\n" (length lst))
(let* (
[parents (run-query
(Get (Variable "$parentpw")
(Inheritance pw (Variable "$parentpw"))))]
[junk (format #t "pathway-hierarchy found parents=~A\n" (length parents))]
[res-parent (map
(lambda (parent-pw) (check-pathway pw parent-pw lst))
parents)]
[childs (run-query
(Get (Variable "$childpw")
(Inheritance (Variable "$childpw") pw)))]
[jank (format #t "pathway-hierarchy found childs=~A\n" (length childs))]
[res-child (map
(lambda (child-pw) (check-pathway child-pw pw lst))
childs)]
)
(append res-parent res-child)
))
(define-public (check-pathway pw parent-pw lst)
(if (and (member parent-pw lst) (member pw lst))
(Inheritance pw parent-pw)
))
Commentary:
run-query
as defined above, instead of the existing (cog-outgoing-set (cog-execute!
style, which leaks memory. (Technically, it doesn't actually "leak", its just creating database records which aren't ever needed again.)check-pathway
to no wrap the InheritanceLink in a ListLink. This seemed like the right thing to do.There are two possible performance enhancements to the above code:
(member pw lst)
check before starting the search, instead of after it.member
to determine membership. The hash-set runs in O(1) time, vs. O(n) for member
.As before, I don't think I'll do a pull req for this; because (a) I'm still experimenting (b) I have no way to test. (c) it might be better for you to work out the details as needed.
if the pathway protein switch is on and biogrid is selected, pathway proteins should be linked to a protein expressed by a biogrid gene
Based on the instrumented code results described here: #98 (comment) it seems that a significant amount of CPU time is spent in generate-interactors
. Reading through the code, it appears that biogrid-pairs-pathway
is a set of pairs of strings. If a pair is NOT in the set, then it is "output". To determine set-membership, find
is used to perform an exhaustive search. This can be optimized in several ways: one is to use scheme/guile hash-sets, which provides O(1) results instead of O(n) results. However, it might be even easier to just do this:
(cog-link 'EvaluationLink (PredicateNode "interacts_with") (ListLink var1 var2))
which will return that link, if it exists already (and so you do not need to record it again) or it will return the empty list. This is an O(1) search of the atomspace (because the atomspace is using hash tables).
I'm assuming that this idea works, because I am assuming that initially there are zero instances of "interacts_with". Perhaps this is not the case?
I'm reading annotation/functions.scm
and I notice the use of delete-duplicates
. If you are trying to doelete duplicate atoms, instead of other things, then using delete-dup-atoms
will be usually be much faster, at least if there are more than 10 atoms in the list.
So, in rna.scm
I see this: (list (cog-outgoing-set ...
This is pointless, because the outgoing set is already a list Wrapping it in a list again does nothing at all except waste cpu time.
That block of code would also run more efficiently if the if-tests were outside of the map
instead of inside of it. That way, the if-test wold be performed only once, instead of hundreds (or millions) of times.
Comment no longer applies, the code is no longer structured like this.
It is better to avoid mutation with set!
whenever possible. In most cases where set!
is currently used one could instead use simple let bindings or use higher-order functions such as fold
(defined in (srfi srfi-1)
).
Sometimes the first argument to find-go-term
is passed as a string, and sometimes as an atom. It cannot possibly be both ... well, maybe it can be, but that would be confusing and inefficient..
For example, here it is a string:
annotation-scheme/annotation/functions.scm
Lines 714 to 722 in 7fc4538
because (GeneNode a)
can only work if a
is a string. Similarly here:
annotation-scheme/annotation/functions.scm
Line 337 in 7fc4538
because cog-name
returns a string.
However, here:
annotation-scheme/annotation/gene-go.scm
Line 39 in 7fc4538
we can clearly see that the first argument is an atom.
So the code is either buggy, or contains obsolete, non-functional code-blocks ...
OK, this is not a bug report, but rather a status update (and a request that someone carry out the work mentioned below)
Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS
error when running gene-pathway-annotation
. I am currently retrying with a version of bdwgc with ./configure --enable-large-config
set to see if that helps.Too many heap sections
is due to severe memory fragmentation. In particular, the error showed up while guile was using "only" 6GB RAM, of which probably 98% of that is in the atomspace, and not guile itself. (gc-stats)
shows 4GB in the guile heap, and 0.5GB of the guile heap free. GroundedSchemaNodes
is the root cause of the memory fragmentation. (Update: now I'm not so sure)I'm going to ponder if there is some really clever way to avoid this. However, It might be worthwhile to pre-emptively redesign the code to be flatter, and less recursive. That means replacing the BindLink/GroundedSchema combination with GetLink and srfi-1 map. Just use GetLink to get lists of things, and use map
to apply whatever processing is needed to the results. I'm guessing this will solve this problem, and maybe even improve runtime. Not sure. This is an interesting alternative design to explore, and it might (or might not) make things smaller, faster. I'm not sure.
Here's one problem with GroundedSchema
: every time that it is called, the guile subsystem has to be re-entered again, and there is some non-trivial number of microseconds (maybe 50?) needed to do this. It also creates a half-dozen C stack frames, and so you risk over-flowing the C stack, if not careful. Many years ago, the atomspace code was tuned to minimize both of these effects, and so maybe they are not as severe as I remember them being, but still -- its a potential trouble area. Or not. Maybe all this is OK as is.
See also next comment
I see that in the Dockerfile a forked variant of guile-json is installed. Why is the fork necessary? I see that in the fork a bunch of fields are made mutable. Mutation is pretty rare in idiomatic Scheme code, so I wonder why this is needed.
Since the fork is pretty young it would be great if we could undo it, perhaps by avoiding mutation in annotation-scheme.
What do you think?
The dataset includes this: (MoleculeNode "ChEBI:nan")
which is clearly wrong; we expect an integer here, e.g. (MoleculeNode "ChEBI:15846")
This atom appears in 70766 different links; many of these are MemberLinks to pathways.
Search performance could be improved by creating and using more biospecifc link and node types. For example:
PathwayNode "R-HSA-166663" ;; -- holds taged pathways
This would allow quicker discovery of all pathways.
(LocationLink (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region"))
The above is instead of
(Evaluation (Predicate "has_location")
(List (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region")))
Maybe even
LocationNode "extracellular region"
for specific named spatial locations
Even simple nodes for different tags would help:
(SMPNode "SMP0000055") ;; instead of (ConceptNode "SMP0000055")
(UniNode "Uniprot:P80404") ;; instead of (MoleculeNode "Uniprot:P80404")
(HSANode "R-HSA-114608") ;; instead of (ConceptNode "R-HSA-114608")
(BioGridNode "Bio:121885"); instead of (ConceptNode "Bio:121885")
The above would solve the need for regex searches that are currently used to find these things. But also one could have a RegexNode
as described in opencog/atomspace#2474
As a general rule, any time one has a frequently-used EvaluationLink of the form
(EvaluationLink (Predicate "foo") (ListLink ...stuff..))
it would probably be an overall win to define a custom (FooLink ... stuff...)
instead. Mostly this makes the atomspace a little smaller (fewer atoms) and pattern searches a little faster (less to explore). Whether or not this is worth it, I can't say. Maybe it would add extra complexity to other processing stages...
After the latest improvements to the parser by @rekado, the annotate-genes
function is returning empty list of nodes and edges. To reproduce this issue, load the sample_dataset.scm
and run the annotation for IGF1
as follows:
(annotate-genes (list "IGF1") "test-res" "[{\"function_name\": \"gene-pathway-annotation\", \"filters\": [{\"filter\": \"pathway\", \"value\": \"smpdb reactome\"},{\"filter\": \"include_prot\", \"value\": \"True\"}, {\"filter\": \"include_sm\", \"value\": \"False\"},{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"}, {\"filter\": \"biogrid\", \"value\": \"0\"}]}, {\"function_name\": \"gene-go-annotation\", \"filters\": [{\"filter\": \"namespace\", \"value\": \"biological_process cellular_component molecular_function\"}, {\"filter\": \"parents\", \"value\": \"0\"}, {\"filter\": \"protein\", \"value\": \"True\"}]}, {\"function_name\": \"include-rna\", \"filters\": [{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"},{\"filter\": \"protein\", \"value\": \"1\"}]},{\"function_name\": \"biogrid-interaction-annotation\", \"filters\": [{\"filter\": \"interaction\", \"value\": \"Proteins\"}]}]")
The above returns the following json:
"{\"nodes\":{},\"edges\":{}}"
The current dataset contains 2945 self-interacting genes. i.e. genes in the form of
(Evaluation
(Predicate "interacts_with")
(List (Gene "FOO") (Gene "FOO")))
This is a dataset issue, not a codebase issue; I'm reporting it here because I don't know where else to report it.
I am using the code below to manually clean this up.
(define (delete-self-interaction)
"
Many genes are marked as interacting with themselves.
Delete these, they screw up the topology of the searches.
"
(define selfie-q
(Get (List (Variable "$x") (Variable "$x"))))
(define selfie-set (cog-execute! selfie-q))
(define selfies (cog-outgoing-set selfie-set))
(cog-delete selfie-set)
(for-each
(lambda (gene) (cog-delete-recursive (List gene gene)))
selfies))
This is not an issue report; rather, its a performance diary, tracking performance from where #98 left off.
Instrumented code to be measured as follows:
git checkout master
git pull
git checkout -b perf
git pull https://github.com/MOZI-AI/annotation-scheme.git measure
git commit
autoreconf -vif
mkdir build; cd build
../configure
make
sudo make install
Then, start guile, ideally w/ hugetlb enabled:
$ HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/libhugetlbfs.so.0 guile
Then start cogserver:
guile> (use-modules (opencog cogserver))
guile> (start-cogserver)
Then load your dataset, and run whatever job you wish to profile. Meanwhile, in another terminal:
$ rlwrap telnet localhost 17001
guile> (report)
This will print out the current profile of the currently-running code.
Alternately, if you don't have the cogserver, then just wait for your job to finish, and then run (report)
at the guile REPL prompt.
(Note to self: for me, do this: $cd work/src; (load "annotate-all.scm")
)
Follow up to discussions in issue #66 This comment: #66 (comment)_ I think everything will be cleaner and better if the unit tests live in thier own directory, and build-aux is managed by the build tools (i.e. remove build-aux) I'll try this shortly.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.