mozi-ai / annotation-scheme Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 4.0 4.11 MB

Human Gene annotation service backend

License: GNU General Public License v3.0

Scheme 99.84% Dockerfile 0.01% Makefile 0.01% Shell 0.01% M4 0.14%

annotation-scheme's People

Contributors

Stargazers

Watchers

Forkers

tanksha habush linas leungmanhin

annotation-scheme's Issues

Fix the output interaction missing link in biogrid

After PR #43, the output interactions between genes that interact with the main gene are missing.

Fix bugs in pull req #125

Pull req #125 introduces a new bug, and ignores an existing bug... both should be fixed. See comments there for details.

Run issues

So, following the README, I loaded datasets/smpdb_chebi_wname.scm and then

(gene-go-annotation '("IGF1") "biological_process molecular_function cellular_component" 0 #:id "")

which gives

;;; <stdin>:18:0: warning: possibly wrong number of arguments to `gene-go-annotation'
ice-9/boot-9.scm:1655:16: In procedure raise-exception:
Invalid keyword: 0

So the leading single-quote seems wrong, single quotes are short-handle for (quote stuff)

Unit tests break on `string->list`

Breakage due to pull req #172 -- see discussion there for work-around. I get test-suite.log with contents:

test-name: annotate-genes
location: /home/ubuntu/src/annotation-scheme/tests/main-tests.scm:33
source:
+ (test-assert
+   "annotate-genes"
+   (string?
+     (annotate-genes (list "IGF1") "Dfaer" req)))
actual-value: #f
actual-error:
+ (wrong-type-arg
+   "string->list"
+   "Wrong type argument in position ~A (expecting ~A): ~S"
+   (1
+    "string"
+    ("data"
+     ("id" . "Uniprot:Q01094")
+     ("name" . "E2F1")
+     ("definition"
+      .
+      "https://www.uniprot.org/uniprot/Q01094")

guile heap blow-up on file write

Sorry to pester you with so many issues, but .. I can't keep track of them all without writing them down.
I ran the following:

(gene-pathway-annotation gene-list  "my-path-results"
      #:pathway "reactome smpdb"
      #:namespace "biological_process molecular_function cellular_component"
      #:parents 0)

It took 60 hours to compute (2.5 days) During the entire time, the guile heap was at 150MBytes -- I know because I printed the heap size after each gene. Then it went to write the results file: this took 5 minutes to write, and the guile heap exploded to 16GB during those five minutes.

The file write itself wrote 743843 results (this was the length of the "results" list) The results file xgene-pathway.scm was 3974636585 bytes long (that's 4GBytes in size).

I conclude that there's something pathological about the file-writes; and perhaps things would go better if you wrote incrementally to a pipe, and then piped that pipe to a file.

Error while trying to run annotation

I am trying to annotate BRCA1 genes and when I run the annotate-genes function, it fails with the following error:

In procedure cog-outgoing-set: Wrong type argument in position 1 (expecting opencog atom): #<Invalid handle>

The error happens in the run-query function. It looks like the issue is due the pubmed id caching and may it is referring to a delete SetLink (I could be wrong though)

To reproduce the issue, load the datasets and run annotate-genes function with the following arguments

(annotate-genes (list "TSPAN6") "agingSymbols" "[{\"function_name\": \"gene-pathway-annotation\", \"filters\": [{\"filter\": \"pathway\", \"value\": \"smpdb reactome\"},{\"filter\": \"include_prot\", \"value\": \"True\"}, {\"filter\": \"include_sm\", \"value\": \"False\"},{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"}, {\"filter\": \"biogrid\", \"value\": \"0\"}]}, {\"function_name\": \"gene-go-annotation\", \"filters\": [{\"filter\": \"namespace\", \"value\": \"biological_process cellular_component molecular_function\"}, {\"filter\": \"parents\", \"value\": \"0\"}, {\"filter\": \"protein\", \"value\": \"True\"}]}, {\"function_name\": \"include-rna\", \"filters\": [{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"},{\"filter\": \"protein\", \"value\": \"1\"}]},{\"function_name\": \"biogrid-interaction-annotation\", \"filters\": [{\"filter\": \"interaction\", \"value\": \"Proteins\"}]}]")

@linas can you check this please?

GUILE_SITE_GO_DIR is undefined

The macro GUILE_SITE_GO_DIR is undefined but referenced in configure.ac.
If the intent is to determine the target directory for compiled Guile modules the macro GUILE_SITE_DIR should be sufficient as it sets the variable GUILE_SITE_CCACHE.

Refactor generate-result function

The generate-result returns list of many things at the same time. I think it should be broken up into smaller functions that each return a single result. This can help us avoid having every huge lists in the result.

PATCH: Cache pubmed-id lookup.

The simple caching below gives a 1.5x speedup for biogrid annotation. I'm just cut-n-pasting the patch here, you have to adapt it to your own tastes.

--- a/annotation/functions.scm
+++ b/annotation/functions.scm
@@ -1149,7 +1149,19 @@ rv)
        rv))
        
 (define-public (find-pubmed-id gene-a gene-b)
- (let ([pub (cog-outgoing-set (cog-execute!
+       (cache-find-pubmed-id (Set gene-a gene-b)))
+       
+(define cache-find-pubmed-id
+       (make-afunc-cache do-find-pubmed-id))
+
+(define (do-find-pubmed-id gene-set)
+"
+       This is expecting a (SetLink (Gene \"a\") (Gene \"b\")
+"
+ (let* (
+       [gene-a (cog-outgoing-atom gene-set 0)]
+       [gene-b (cog-outgoing-atom gene-set 1)]
+       [pub (run-query
      (GetLink
        (VariableNode "$pub")
        (EvaluationLink
@@ -1165,9 +1177,9 @@ rv)
            )
          )
 
-   )))])
+   ))])
    (if (null? pub)
-     (set! pub (cog-outgoing-set (cog-execute!
+     (set! pub (run-query
      (GetLink
        (VariableNode "$pub")
        (EvaluationLink
@@ -1182,10 +1194,11 @@ rv)
              (VariableNode "$pub")
            )
          )
-   )))
+   ))
    ))
    pub
 ))
+
 (define-public (find-crna gene protein)
   (cog-execute! (BindLink
   (VariableList

Use #t and #f not string "True"

So if I do grep True annotation I find a dozen uses of (if (equal? prot "True") ...) and similar. Instead of using the string "True", just say #t and that way you can write (if prot ...) The idea being that in scheme, #t and #f denote true and false...

Test failure dueto these changes

@Habush Tests are failing with the error
(scm-error misc-error #f "~A ~S" ("vector-map: expected vector, got" (#<hash-table 557692c662c0 2/31> #<hash-table 557692c719a0 2/31> #<hash-table 557692c7a860 …> …)) #)

Following the changes at PR #143
Please check them out.

Add unit tests for annotation functions

The current code doesn't have any tests. This is a problem. Without tests, we cannot be confident the code does that it was intended to do. Adding unit-tests is a task that is long overdue.

We can start by adding tests to the three annotation functions at first and add more to the pattern matching functions.
this relates to #23.

build failures

I get a warning and a crash on build:

$ make
./env /usr/local/bin/guild compile -Wunbound-variable -Warity-mismatch -Wformat -o "annotation/util.go" "../annotation/util.scm"
;;; note: auto-compilation is enabled, set GUILE_AUTO_COMPILE=0
;;;       or pass the --no-auto-compile argument to disable.
;;; compiling /usr/local/bin/guild

====> Attention!
====> Obsolete! You do not need this module any longer.
====> Please use `(opencog exec)` and `cog-execute!` instead.

Backtrace:
In ice-9/boot-9.scm:
  1722:10 19 (with-exception-handler _ _ #:unwind? _ # _)
In system/base/compile.scm:
    59:11 18 (_)
   155:11 17 (_ #<closed: file 5632d31f97e0>)
    175:2 16 (read-and-compile #<input: annotation/util.scm 13> # _ # …)
   183:32 15 (compile-fold (#<procedure compile-tree-il (x e opts)>) …)
In ice-9/boot-9.scm:
   2792:4 14 (save-module-excursion _)
In language/scheme/compile-tree-il.scm:
    31:15 13 (_)
In ice-9/psyntax.scm:
  1241:36 12 (expand-top-sequence _ _ _ #f _ _ _)
  1233:19 11 (parse _ (("placeholder" placeholder)) ((top) #(# # …)) …)
   285:10 10 (parse _ (("placeholder" placeholder)) (()) _ c&e (# #) #)
In ice-9/eval.scm:
   293:34  9 (_ #<module (#{ g409}#) 5632d2ec08c0>)
In ice-9/boot-9.scm:
   3366:4  8 (define-module* _ #:filename _ #:pure _ #:version _ # _ …)
  2551:24  7 (call-with-deferred-observers _)
  3379:24  6 (_)
   222:29  5 (map1 (((opencog)) ((opencog query)) ((opencog exec)) …))
   222:29  4 (map1 (((opencog query)) ((opencog exec)) ((# #)) (#) …))
   222:29  3 (map1 (((opencog exec)) ((opencog bioscience)) ((#)) # …))
   222:17  2 (map1 (((opencog bioscience)) ((json)) ((ice-9 #)) (#) …))
   3286:6  1 (resolve-interface (opencog bioscience) #:select _ # _ # …)
  1655:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1655:16: In procedure raise-exception:
no code for module (opencog bioscience)
Makefile:1082: recipe for target 'annotation/util.go' failed
make: *** [annotation/util.go] Error 1

I'll submit a fix for the warning shortly, and then go hunting for the missing module. ... the configure.ac should check for dependencies...

Avoid the use of ListLink with many outgoing nodes

@linas @Habush

The scheme result of the annotation-service is supposed to be used for Pattern miner and PLN works. As the bio-atomspace is too big to work with, we wanted to have a smaller version by annotating the ~800 genes from the mosses model. The problem with the annotation result is it has additional ListLink's ( which we used For example to return >1 hypergraphs on certain condition

check here

The ListLink causes error (when the scheme result is used as a dataset for the Pattern miner code here https://github.com/ngeiswei/reasoning-bio-as-xp)

a ListLink is considered not not have many outgoing nodes. What should we do to get rid of these ListLinks?

installation fails, installation script can't find installed bioscience module.

i get this trying to install:

checking for Guile site directory... /usr/share/guile/site/2.2
checking for Guile site-ccache directory using pkgconfig... /usr/lib/x86_64-linux-gnu/guile/2.2/site-ccache
checking for Guile extensions directory... /usr/lib/x86_64-linux-gnu/guile/2.2/extensions
checking if (json) is available... yes
checking if (opencog bioscience) is available... no
configure: error: The (opencog bioscience) module is needed. See https://github.com/opencog/agi-bio for details.

even though it's installed:

$ ls /usr/share/guile/site/2.2/opencog/
...
bioscience            logger.scm        persist.scm       test-runner.scm
...

i have (use-modules (opencog)) in ~/.guile but guile still has problems finding it:

$ guile
scheme@(guile-user)> (use-modules (opencog bioscience))
While compiling expression:
In procedure dynamic-link: file: "libbioscience-types", message: "file not found"
scheme@(guile-user)> (use-modules (opencog))
scheme@(guile-user)> (use-modules (opencog bioscience))
scheme@(guile-user)>

what am i doing wrong?

loading current datasets generates error

While loading the current datasets, I get an error: In procedure unquote: expression not valid outside of quasiquote its not clear which dataset provoked this. Investigating ...

Make tests executable

Currently, the tests directory contains examples, but they are not executable, so they cannot be used to automatically verify that the code still behaves as expected.

I suggest using srfi-64 to define executable tests. See here for an example: https://git.savannah.gnu.org/cgit/gwl.git/tree/tests/processes.scm

PATCH: memoize locate-node, etc. for 1.5x perf

The following patch speeds up my code by another factor of 1.5x for pathway searches. I'm posting here rather than as a pull req so that you can adjust it to your coding style. See issue #98 for performance measurements.

--- a/annotation/util.scm
+++ b/annotation/util.scm
@@ -94,7 +94,10 @@
 
 ;; Find node name and description
 
-(define-public (node-info node)
+(define-public node-info
+       (make-afunc-cache xnode-info))
+
+(define-public (xnode-info node)
 ; (format #t "duude node-info=~A\n" node)
   (if (cog-node? node)
     (list
@@ -201,8 +204,26 @@
   )
 )
 
+(define (run-query QUERY)
+"
+  Call (cog-execute! QUERY), return results, delete the SetLink.
+  This avoids a memory leak of SetLinks
+"
+   ; Run the query
+   (define set-link (cog-execute! QUERY))
+   ; Get the query results
+   (define results (cog-outgoing-set set-link))
+   ; Delete the SetLink
+   (cog-delete set-link)
+   ; Return the results.
+   results
+)
+
+(define-public find-name
+       (make-afunc-cache probe-find-name))
+
 ;;finds go name for parser function
-(define-public (find-name a)
+(define (probe-find-name a)
        (find-name-ctr #:enter? #t)
        (let ((rv (oldfind-name a)))
        (find-name-ctr #:enter? #f)
@@ -214,8 +235,8 @@
           [predicate (if (regexp-match? (string-match "GO:[0-9]+" (cog-name atom))) "GO_name" "has_name")]
         )
       (get-name
-       (cog-outgoing-set
-        (cog-execute! ; cog-execat!  ; cog-execute!
+       (run-query
+        ; cog-execute! ; cog-execat!  ; cog-execute!
          (GetLink
           (VariableNode "$name")
 
@@ -224,7 +245,7 @@
            (ListLink
             atom
             (VariableNode "$name")
-           )))))))))
+           ))))))))
 
 ;;Given an atom and list of namespaces finds the parents of that atom in the specified namespaces
@@ -324,17 +345,19 @@
     ))
 )
 
-(define-public (locate-node a)
+(define-public locate-node
+       (make-afunc-cache probe-locate-node))
+
+(define (probe-locate-node a)
    (locate-node-ctr #:enter? #t)
    (let ((rv (xlocate-node a)))
    (locate-node-ctr #:enter? #f)
    rv))
 
-
-(define-public xlocate-node
+(define xlocate-node
   (lambda(node)
 ; (format #t "duude in locate node=~A\n" node)
-      (let ([loc (cog-outgoing-set (cog-execute!
+      (let ([loc (run-query
         (BindLink
         (VariableNode "$go")
         (AndLink
@@ -354,12 +377,12 @@
             (VariableNode "$go")
           ))
           )
-      ))
+      )
       ])
 ; (format #t "duude in locate loc=~A\n" loc)
       (if (null? loc)
       (set! loc 
-      (cog-outgoing-set (cog-execute!
+      (run-query
         (BindLink
           (VariableNode "$loc")
           (EvaluationLink
@@ -373,7 +396,7 @@
                 node
                 (VariableNode "$loc")))
           )
-        )))
+        ))
       )
       loc
       )

make the name of (MoleculeNode 'Uniprot:xxxx') the symbol of it's expressing gene

this will fix the 'N/A' name given to uniprot proteins added when the pathway protein switch is invoked, make the reactome uniprot names consistent (currently some are symbols, some are phrases, some have indicators of post translational modifications, etc). it will also permit removal of 'has name' predicates from the knowledgebase.

test failure

OK, got build and install to work. Now make check fails:

$ make check
make  check-TESTS
make[1]: Entering directory '/home/ubuntu/src/annotation-scheme/build'
make[2]: Entering directory '/home/ubuntu/src/annotation-scheme/build'
FAIL: tests/parser-tests.scm
Makefile:779: recipe for target 'tests/util-tests.log' failed
make[2]: *** [tests/util-tests.log] Error 1

$ cat tests/util-tests.log
Backtrace:
           1 (primitive-load-path "tests/util-tests.scm")
           0 (primitive-load "tests/sample_dataset.scm")

ERROR: In procedure primitive-load:
In procedure open-file: No such file or directory: "tests/sample_dataset.scm"

PATCH: optimize generate-result

Pursuant to comment #98 (comment) and similar to issue #103 the speed of biogrid annotation can be more than doubled, with the code below. Viz, before, the patch, perf is this:

Time: 2945.8490 secs. calls: 75 avg: 39277986.8 usec/call for find-output-interactors
Time: 2067.2802 secs. calls: 46366 avg:  44586.1 usec/call for generate-result

after patching, its this:

Time: 1409.3709 secs. calls: 75 avg: 18791612.0 usec/call for find-output-interactors
Time: 330.18767 secs. calls: 45971 avg:   7182.5 usec/call for generate-result

The patch is this (i'm not making a pull req, because it requires additional cleanup for production code) Also, plenty more optimization opportunities are readily available in this code-block.

(define-public (xgenerate-result gene-a gene-b prot go)
      (let* (
            [output (find-pubmed-id gene-a gene-b)]
            [res (filter
               (lambda (ggg) (not (reported-genes ggg)))
               (list gene-a gene-b))]
            [pairs (reported-grid-pairs (Set gene-a gene-b))]

            [interaction (if (= 1 (string->number (cog-name prot)))
                (ListLink
                  (build-interaction gene-a gene-b output "interacts_with")
                  (build-interaction (find-protein-form gene-a) (find-protein-form gene-b) output "inferred_interaction"))
                (build-interaction gene-a gene-b output "interacts_with"))]
            [namespace (if (null? (cog-outgoing-set go)) '() (car (cog-outgoing-set go)))]
            [parent (if (null? (cog-outgoing-set go)) '() (cadr (cog-outgoing-set go)))]
          )

and the rest of the function is more-or-less same as before, except:

delete biogrid-pairs and biogrid-genes they are not needed
delete GeneNode its already an atom. So this gives:

          (if (null? interaction)
            (ListLink)
          )
          (match res
              ((a b)
                  (begin
                      (if (= 1 (string->number (cog-name prot)))
                        (let ([coding-prot-a (find-protein-form a)]
                              [coding-prot-b (find-protein-form b)])
                        (if (or (equal? coding-prot-a (ListLink)) (equal? coding-prot-b (ListLink)))
                          (ListLink)
                          (ListLink
                            interaction
                            (Evaluation (Predicate "expresses") (ListLink a coding-prot-a))
                            (node-info a)
                            (node-info coding-prot-a)
                            (locate-node coding-prot-a)
                            (EvaluationLink (PredicateNode "expresses") (ListLink b coding-prot-b))
                            (node-info b)
                            (node-info coding-prot-b)
                            (locate-node coding-prot-b))  ;;; <<< bug fix here
                        ))
                      (ListLink
                          interaction
                          (node-info a)
                          (locate-node  a)
                          (node-info b)
                          (locate-node  b)
                          (if (not (null? namespace))
                          (ListLink
                            (ConceptNode "gene-go-annotation")
                            ; bug fix here!!! 
                            (find-go-term a (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))
                            (find-go-term b (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))
                            (ListLink (ConceptNode "biogrid-interaction-annotation"))
                            )
                            '() )
                      )
                  ))
              )
              ((a)
                  (begin
                      (if (= 1 (string->number (cog-name prot)))
                        (let ([coding-prot (find-protein-form a)])
                        (if (equal? coding-prot (ListLink))
                          (ListLink)
                          (ListLink
                            interaction
                            (EvaluationLink (PredicateNode "expresses") (ListLink a coding-prot))
                            (node-info a)
                            (node-info coding-prot)
                            (locate-node coding-prot))
                        ))
                      (ListLink
                          interaction
                          (node-info a)
                          (locate-node  a)
                          (if (not (null? namespace))
                          (ListLink (ConceptNode "gene-go-annotation")
                             (find-go-term a
                             (string-split (cog-name namespace) #\ )
                             (string->number (cog-name parent)))
                          (ListLink (ConceptNode "biogrid-interaction-annotation"))
                          )
                          '()
                          ))
                  ))
              )
              (()
                  (if pairs
                    (ListLink)
                    (ListLink
                          interaction
                      )
                  )
              )
          )
      )
)

Note that the above also contains three bug fixes: two to find-go-term and one to (locate-node coding-prot-b)) ;;; <<< bug fix here

atomese->json and json->atomese

@tanksha and @Habush I have a generic idea I want to explore. How many of the biome datasets are available in json format?

I'm thinking that it might be worth creating a generic json->atomese importer, and a generic exporter. However, this would work only if

the input json dataset formats are not badly/insanely designed (i.e. the resulting imported atomese would have to look reasonable enough to be useable)
There are not too many conversions needed to obtain something that can be easily reasoned on or pattern-mined. So, for example, the import process might be (a) generic json->atomese (b) run a BindLink to convert this generic-atomese into something friendlier for mining/pln/moses (c) actually run mining/pln/moses. (so maybe @ngeiswei you'd have an idea about this?)

What is not clear is whether step (b) above is easier than just writing a custom json importer. If it's not easier, then this generic-import idea seems like a not-very-good idea. But I can't really tell...

(follow up to issue #164)

New test failures since recent merges

As of commit 679ae5b there are a bunch of new test failures:

tests/parser-tests.scm has one new test failure (namely edge-count)
tests/main-tests.scm cannot be run because the file opencog_deps no longer exists
when the above problem is fixed, three new test failures are revealed: parse-request, annotate-genes, and protein-goterm.

There were no test failures at commit 7a79385.

License missing

I would like to contribute to this project, but I cannot seem to find any license. Could you please clarify under what license (if any) this project is released?

Crash in gene-pathway-annotation when namespace is set.

Loading the current datasets, and running this:

(gene-pathway-annotation gene-list
      "my-path-results"
      #:pathway "reactome smpdb"
      #:namespace "biological_process molecular_function cellular_component"
      #:parents 0))

results in a crash:

Backtrace:
In ice-9/boot-9.scm:
   260:13 19 (for-each #<procedure 7f22f81a39b0 at annotation/gene-pathway.…> …)
   260:13 18 (for-each #<procedure 7f22f81a3000 at annotation/gene-pathway.…> …)
In annotation/gene-pathway.scm:\n    44:10 17 (_ _)
   109:24 16 (reactome \"NDUFAF7\" \"True\" \"True\" _ # 1)
In ice-9/boot-9.scm:
   222:17 15 (map1 (#))
In annotation/gene-pathway.scm:
   115:32 14 (_ _)
In annotation/functions.scm:
   301:22 13 (find-pathway-genes _ _)
In unknown file:
          12 (opencog-extension cog-execute! (#))
In ice-9/boot-9.scm:
  1722:10 11 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
          10 (apply-smob/0 #<thunk 7f22ea068b60>)
In ice-9/boot-9.scm:
  1722:10  9 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
           8 (apply-smob/0 #<thunk 7f22ea068b20>)
In annotation/functions.scm:
  339:124  7 (add-pathway-genes _ _ _)
    125:2  6 (find-go-term \"ECSIT\" (\"biological_process\" \"molecular_func…\" …) …)
     71:6  5 (find-memberln \"ECSIT\" (\"biological_process\" \"molecular_func…\" …))
In srfi/srfi-1.scm:
    634:9  4 (for-each #<procedure 7f22ea06a020 at annotation/functions.scm…> …)
In annotation/functions.scm:
    76:14  3 (_ \"biological_process\")
In unknown file:
           2 (cog-new-link 81 \"ECSIT\" (VariableNode \"$a\")
)
In ice-9/boot-9.scm:
  1655:16  1 (raise-exception _ #:continuable? _)
In unknown file:
           0 (apply-smob/1 #<exception-handler 7f22ea068b00> #<&compound-exc…>)

ERROR: In procedure apply-smob/1:
In procedure cog-new-link: Wrong type argument in position 2 (expecting opencog atom): \"ECSIT\"
ABORT: wrong-type-arg
 (/home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:1066)
Function args:
((BindLink
   (VariableList
      (TypedVariableLink
         (VariableNode \"$p\")
         (TypeNode \"MoleculeNode\")
      )
      (TypedVariableLink
         (VariableNode \"$g\")
         (TypeNode \"GeneNode\")
      )
   )
   (AndLink
      (MemberLink
         (VariableNode \"$p\")
         (ConceptNode \"R-HSA-6799198\")
      )      (EvaluationLink
         (PredicateNode \"expresses\")
         (ListLink
            (VariableNode \"$g\")
            (VariableNode \"$p\")
         )
      )
   )
   (ExecutionOutputLink
      (GroundedSchemaNode \"scm: add-pathway-genes\")
      (ListLink
         (ConceptNode \"R-HSA-6799198\")
         (VariableNode \"$g\")
         (ListLink
            (ConceptNode \"biological_process molecular_function cellular_component\")
            (NumberNode \"0\")
         )
      )
   )
)
)")'.

find-similar-gene seems to be incorrect?

I suspect that find-similar-gene is performing it's tests in an exactly backwards manner... for example: (find-similar-gene "SH") returns the empty list, even though there are several genes that have names starting with SH. By contrast, (find-similar-gene "SHISA5asdfasdf") returns a list of one gene: ((GeneNode "SHISA5")) ... surely, this is not what you wanted? I'm thinking the search pattern and the match are exactly reversed ...

If I reverse them I get a reasonable result: (reverse-fsg "SH") gives ((GeneNode "SHC1") (GeneNode "SHISA5"))

find-pathway-name should be written out

This is a performance optimization: find-pathway-name performs a very simple search, ~~using a GetLink. it would almost surely be much faster if it was written out, long-hand.~~ Update: this previous statement might be completely false. I have not measured. The proposal below and in #97 might be a bad idea. I will try to measure, and get a better feel for what is going on. In the mean time, I'm leaving this issue open.

Why might a hand-written version be faster? Because of three things:

Creating the GetLink incurs overhead, of .. creating it and inserting it into the atomspace.
The GetLink constructor performs a pattern compilation, which takes up some non-trivial of CPU, even when the pattern is very easy.
Running the GetLink places the search results into a SetLink. Adding that SetLink to the AtomSpace incurs CPU overhead.
All of the above could be avoided by just writing out the fairly simple vertex-edge graph chase. I think that the base opencog library already has utils for this kind of stuff.

Flipside: doing it "by hand", in scheme, requires multiple entries in and out of the guile interpreter, each of which has significant overhead, that might outweigh the losses mentioned above.

To add injury to insult: the SetLink is never removed, so it now clutters up the atomspace forever, slowing down future searches. Also, the GetLink is never removed, which also clutters up the atomspace. Both of these can be avoided by using a temp atomspace (I can provide code for that, it's in the learning repo, but maybe should be moved to the main repo). But, for such simple searches, its best to just avoid using GetLink, and reserve that for complicated searches.

is-cellular-component? appears to be buggy?

Reading through the code for is-cellular-component? suggests its inefficient and probably buggy.

Buggy: it performs a loop, and sets a response for every acceptable item in the loop. But only the last response is ever kept! All the other loop iterations are discarded! Do you really want to discard all answers, keeping only the last one?

Inefficient: the loop could be run in reverse, and terminated upon the first acceptable answer. This avoids running it for longer than needed.

Add Multithreading support

In the current version, all the annotation functions are executed sequentially. However, since there is a little dependency across the functions, it will be great to add multithreading support to improve performance.

All three annotation should run in their separate threads. But one catch here is, whether the parsing should be done in parallel or after all the annotations are done. If it is done in parallel along with the annotations, then some data variables need to be synchronized.

Incorrect biogrid result

In Biogrid annotation,

(biogrid-interaction-annotation (list "IGF1") "DD" #:interaction "Proteins" #:namespace "biological_process")

The result don't include a GO cross annotation (for genes from Biogrid). it works when the interaction type is of "Genes" but not when "Proteins".

Create smaller datasets for unit-tests

The current sample_dataset.scm is too big and takes a bit of time to load for unit-tests that test a specific functionality. So to reduce the test run times, it is better if each test dataset is tailored for that unit-test.

Dataset issues - non-symmetric intractions

This is another dataset issue bug report. Gene interactions should be symmetric. But not all of them are marked as such. The current dataset has 537839 interacting gene-pairs. Of these, approx 193653 do not have a symmetric partner. There are two possible fixes for this: (1) use a SetLink not a ListLink (2) use a script to generate the missing ones. I am using the following script to clean this up:

(define (symmetrize-gene-interactions)
"
  The gene interactions use the asymmetric ListLink to denote
  gene pairs. But gene interactions are symmetrix, so they should
  have used the SetLink. Oh well. At this time, it is convenient
  to use the ListLink during pattern searches; but in this case,
  the interactions should be symmetrized. That's what this does.
"
   (define interact-q
      (Bind
         (VariableList
            (TypedVariable (Variable "g1") (Type 'GeneNode))
            (TypedVariable (Variable "g2") (Type 'GeneNode)))
         (Present
            (Evaluation (Predicate "interacts_with")
               (List (Variable "g1") (Variable "g2"))))
         (Evaluation (Predicate "interacts_with")
            (List (Variable "g2") (Variable "g1")))))

   (define sym-set (cog-execute! interact-q))
   (format #t "Found ~A gene interactions\n"
      (length (cog-outgoing-set sym-set)))
   (cog-delete sym-set)
)

Unit tests broken.

I now get

PASS: tests/parser-tests.scm
PASS: tests/util-tests.scm
PASS: tests/pathway-tests.scm
Makefile:789: recipe for target 'tests/main-tests.log' failed
make[2]: *** [tests/main-tests.log] Error 1
make[2]: Leaving directory '/home/ubuntu/src/annotation-scheme/build'
Makefile:768: recipe for target 'check-TESTS' failed
make[1]: *** [check-TESTS] Error 2
make[1]: Leaving directory '/home/ubuntu/src/annotation-scheme/build'
Makefile:960: recipe for target 'check-am' failed
make: *** [check-am] Error 2
$ cat tests/main-tests.log
Backtrace:
           1 (primitive-load-path "tests/main-tests.scm")
           0 (primitive-load "tests/sample_dataset.scm")

ERROR: In procedure primitive-load:
In procedure open-file: No such file or directory: "tests/sample_dataset.scm"

These were working a week ago ...

Use modules

Currently, the files in the "fuctions" directory are just a collection of procedure definitions. Actually using them requires loading them from that file. The Guile way is to have files define modules, so that they can be used with (use-modules (my module)) as long as they are on the GUILE_LOAD_PATH.

Using modules would also unlock compilation, which provides a significant performance boost.

Data-file size and loading enhancements

File loading performance could be increased by writing, at the start of the file

(define loc (PredicateNode "has_location"))

and then using that: (Evaluation loc (List ...))

Also doing this for each gene, pathway, protein would also speed things a little bit:

(define muniA0A075B6P5 (Molecule "Uniprot:A0A075B6P5"))
(define rhsa166663 (Concept "R-HSA-166663"))

By using such defines, you save a tiny amount of time to drop into the C++ code and atomspace, to perform the lookup. Also, the files get slightly smaller.

Originally posted by @linas in #85 (comment)

PATCH: Fixes for pathway-hierarchy/check-pathway

Based on the results described in this comment:

#98 (comment)

it appears that the two functions pathway-hierarchy and check-pathway are responsible for the explosion in the guile heap usage, as well as being giant CPU-time losers. To further ttrace this and fix this, I am using the following alternate implementation:

(define-public (run-query QUERY)
"
  Call (cog-execute! QUERY), return results, delete the SetLink.
  This avoids a memory leak of SetLinks
"
   ; Run the query
   (define set-link (cog-execute! QUERY))
   ; Get the query results
   (define results (cog-outgoing-set set-link))
   ; Delete the SetLink
   (cog-delete set-link)
   ; Return the results.
   results
)

(define-public (pathway-hierarchy pw lst)
   (format #t "Enter pathway-hierarchy sizeof lst: ~A\n" (length lst))
   (let* (
         [parents (run-query
            (Get (Variable "$parentpw")
               (Inheritance pw (Variable "$parentpw"))))]
         [junk (format #t "pathway-hierarchy found parents=~A\n" (length parents))]
         [res-parent (map
               (lambda (parent-pw) (check-pathway pw parent-pw lst))
               parents)]
         [childs (run-query
            (Get (Variable "$childpw")
               (Inheritance (Variable "$childpw") pw)))]
         [jank (format #t "pathway-hierarchy found childs=~A\n" (length childs))]
         [res-child (map
               (lambda (child-pw) (check-pathway child-pw pw lst))
               childs)]
      )
      (append res-parent res-child)
))

(define-public (check-pathway pw parent-pw lst)
   (if (and (member parent-pw lst) (member pw lst))
      (Inheritance pw parent-pw)
   ))

Commentary:

Pretty much all of the code in this repo should use run-query as defined above, instead of the existing (cog-outgoing-set (cog-execute! style, which leaks memory. (Technically, it doesn't actually "leak", its just creating database records which aren't ever needed again.)
I changed check-pathway to no wrap the InheritanceLink in a ListLink. This seemed like the right thing to do.

There are two possible performance enhancements to the above code:

Perform the (member pw lst) check before starting the search, instead of after it.
Use an atom hash-set instead of member to determine membership. The hash-set runs in O(1) time, vs. O(n) for member.

As before, I don't think I'll do a pull req for this; because (a) I'm still experimenting (b) I have no way to test. (c) it might be better for you to work out the details as needed.

link pathway proteins to biogrid genes

if the pathway protein switch is on and biogrid is selected, pathway proteins should be linked to a protein expressed by a biogrid gene

optimize generate-interactors

Based on the instrumented code results described here: #98 (comment) it seems that a significant amount of CPU time is spent in generate-interactors. Reading through the code, it appears that biogrid-pairs-pathway is a set of pairs of strings. If a pair is NOT in the set, then it is "output". To determine set-membership, find is used to perform an exhaustive search. This can be optimized in several ways: one is to use scheme/guile hash-sets, which provides O(1) results instead of O(n) results. However, it might be even easier to just do this:

(cog-link 'EvaluationLink (PredicateNode "interacts_with") (ListLink var1 var2))

which will return that link, if it exists already (and so you do not need to record it again) or it will return the empty list. This is an O(1) search of the atomspace (because the atomspace is using hash tables).

I'm assuming that this idea works, because I am assuming that initially there are zero instances of "interacts_with". Perhaps this is not the case?

Use delete-dup-atoms instead of delete-duplicates

I'm reading annotation/functions.scm and I notice the use of delete-duplicates. If you are trying to doelete duplicate atoms, instead of other things, then using delete-dup-atoms will be usually be much faster, at least if there are more than 10 atoms in the list.

General optimizations...

~~So, in rna.scm I see this: (list (cog-outgoing-set ... This is pointless, because the outgoing set is already a list Wrapping it in a list again does nothing at all except waste cpu time.~~

That block of code would also run more efficiently if the if-tests were outside of the map instead of inside of it. That way, the if-test wold be performed only once, instead of hundreds (or millions) of times.

Comment no longer applies, the code is no longer structured like this.

Avoid the use of `set!`

It is better to avoid mutation with set! whenever possible. In most cases where set! is currently used one could instead use simple let bindings or use higher-order functions such as fold (defined in (srfi srfi-1)).

find-go-term is used inconsistently

Sometimes the first argument to find-go-term is passed as a string, and sometimes as an atom. It cannot possibly be both ... well, maybe it can be, but that would be confusing and inefficient..

For example, here it is a string:

annotation-scheme/annotation/functions.scm

Lines 714 to 722 in 7fc4538

 (node-info (GeneNode a)) 

 (locate-node (GeneNode a)) 

 (node-info (GeneNode b)) 

 (locate-node (GeneNode b)) 

 (if (not (null? namespace)) 

 (ListLink 

 (ConceptNode "gene-go-annotation") 

 (find-go-term a (string-split (cog-name namespace) #\ ) (string->number (cog-name parent))) 

 (find-go-term b (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))

because (GeneNode a) can only work if a is a string. Similarly here:

annotation-scheme/annotation/functions.scm

Line 337 in 7fc4538

 (ListLink (ConceptNode "gene-go-annotation") (find-go-term (cog-name gene) (string-split (cog-name namespace) #\ ) (string->number (cog-name parent))) 

because cog-name returns a string.

However, here:

annotation-scheme/annotation/gene-go.scm

Line 39 in 7fc4538

(find-go-term (GeneNode gene) (string-split namespace #\ ) parents)

we can clearly see that the first argument is an atom.

So the code is either buggy, or contains obsolete, non-functional code-blocks ...

Performance Status

OK, this is not a bug report, but rather a status update (and a request that someone carry out the work mentioned below)

I've queued up 9 pull reqs and 11 open issues .. please deal with these.
After all of the above, I finally reproduced the Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS error when running gene-pathway-annotation. I am currently retrying with a version of bdwgc with ./configure --enable-large-config set to see if that helps.
I'm fairly certain that the Too many heap sections is due to severe memory fragmentation. In particular, the error showed up while guile was using "only" 6GB RAM, of which probably 98% of that is in the atomspace, and not guile itself. ~~(Right now, on a re-run, the process is using 4.1GB RAM, and guile itself has only 26MBytes in the heap, which seems modest enough to me.)~~
~~So RAM consumption per-se does not seem to be an issue, neither by the atomspace or guile.~~ Correction: just a few minutes ago, guile went into GC overdrive. Its now trying to GC every10 seconds or so, and it's using almost all CPU cores to do the GC. This happened after about 5 hours runtime, when everything was steady, stable. top/ps aux shows RAM usage suddenly exploded to 10GB RSS and 19GB virt. Currently (gc-stats) shows 4GB in the guile heap, and 0.5GB of the guile heap free. ~~No clue what/why/where in the code would cause such a monstrous demand on the heap.~~ Update: this is now fixed in #105. I still don't understand the root-cause of the explosion, but the fix does fix it.
~~I'm fairly certain that the excessive usage of deeply nested GroundedSchemaNodes is the root cause of the memory fragmentation. (Update: now I'm not so sure)~~ I am now certain that this is not the issue, at all.

~~I'm going to ponder if there is some really clever way to avoid this. However,~~ It might be worthwhile to pre-emptively redesign the code to be flatter, and less recursive. That means replacing the BindLink/GroundedSchema combination with GetLink and srfi-1 map. Just use GetLink to get lists of things, and use map to apply whatever processing is needed to the results. ~~I'm guessing this will solve this problem, and maybe even improve runtime. Not sure.~~ This is an interesting alternative design to explore, and it might (or might not) make things smaller, faster. I'm not sure.

Here's one problem with GroundedSchema: every time that it is called, the guile subsystem has to be re-entered again, and there is some non-trivial number of microseconds (maybe 50?) needed to do this. It also creates a half-dozen C stack frames, and so you risk over-flowing the C stack, if not careful. Many years ago, the atomspace code was tuned to minimize both of these effects, and so maybe they are not as severe as I remember them being, but still -- its a potential trouble area. Or not. Maybe all this is OK as is.

Why fork guile-json?

I see that in the Dockerfile a forked variant of guile-json is installed. Why is the fork necessary? I see that in the fork a bunch of fields are made mutable. Mutation is pretty rare in idiomatic Scheme code, so I wonder why this is needed.

Since the fork is pretty young it would be great if we could undo it, perhaps by avoiding mutation in annotation-scheme.

What do you think?

dataset issues: `(MoleculeNode "ChEBI:nan")`

The dataset includes this: (MoleculeNode "ChEBI:nan") which is clearly wrong; we expect an integer here, e.g. (MoleculeNode "ChEBI:15846") This atom appears in 70766 different links; many of these are MemberLinks to pathways.

Implement more link types.

Search performance could be improved by creating and using more biospecifc link and node types. For example:

PathwayNode "R-HSA-166663"  ;; -- holds taged pathways

This would allow quicker discovery of all pathways.

(LocationLink (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region"))

The above is instead of

(Evaluation (Predicate "has_location")
 (List (Molecule "Uniprot:A0A075B6P5") (Concept "extracellular region")))

Maybe even

LocationNode "extracellular region"

for specific named spatial locations

Even simple nodes for different tags would help:

(SMPNode "SMP0000055") ;; instead of (ConceptNode "SMP0000055")
(UniNode "Uniprot:P80404") ;; instead of (MoleculeNode "Uniprot:P80404")
(HSANode "R-HSA-114608") ;; instead of (ConceptNode "R-HSA-114608")
(BioGridNode "Bio:121885"); instead of (ConceptNode "Bio:121885")

The above would solve the need for regex searches that are currently used to find these things. But also one could have a RegexNode as described in opencog/atomspace#2474

As a general rule, any time one has a frequently-used EvaluationLink of the form

(EvaluationLink (Predicate "foo") (ListLink ...stuff..))

it would probably be an overall win to define a custom (FooLink ... stuff...) instead. Mostly this makes the atomspace a little smaller (fewer atoms) and pattern searches a little faster (less to explore). Whether or not this is worth it, I can't say. Maybe it would add extra complexity to other processing stages...

annotate-genes returning empty list of nodes and edges.

After the latest improvements to the parser by @rekado, the annotate-genes function is returning empty list of nodes and edges. To reproduce this issue, load the sample_dataset.scm and run the annotation for IGF1 as follows:

(annotate-genes (list "IGF1") "test-res"  "[{\"function_name\": \"gene-pathway-annotation\", \"filters\": [{\"filter\": \"pathway\", \"value\": \"smpdb reactome\"},{\"filter\": \"include_prot\", \"value\": \"True\"}, {\"filter\": \"include_sm\", \"value\": \"False\"},{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"}, {\"filter\": \"biogrid\", \"value\": \"0\"}]}, {\"function_name\": \"gene-go-annotation\", \"filters\": [{\"filter\": \"namespace\", \"value\": \"biological_process cellular_component molecular_function\"}, {\"filter\": \"parents\", \"value\": \"0\"}, {\"filter\": \"protein\", \"value\": \"True\"}]}, {\"function_name\": \"include-rna\", \"filters\": [{\"filter\": \"coding\", \"value\": \"True\"},{\"filter\": \"noncoding\", \"value\": \"True\"},{\"filter\": \"protein\", \"value\": \"1\"}]},{\"function_name\": \"biogrid-interaction-annotation\", \"filters\": [{\"filter\": \"interaction\", \"value\": \"Proteins\"}]}]")

The above returns the following json:

"{\"nodes\":{},\"edges\":{}}"

Dataset issues - self-interaction

The current dataset contains 2945 self-interacting genes. i.e. genes in the form of

(Evaluation
    (Predicate "interacts_with")
    (List (Gene "FOO") (Gene "FOO")))

This is a dataset issue, not a codebase issue; I'm reporting it here because I don't know where else to report it.

I am using the code below to manually clean this up.

(define (delete-self-interaction)
"
  Many genes are marked as interacting with themselves.
  Delete these, they screw up the topology of the searches.
"
   (define selfie-q
      (Get (List (Variable "$x") (Variable "$x"))))
   (define selfie-set (cog-execute! selfie-q))
   (define selfies (cog-outgoing-set selfie-set))
   (cog-delete selfie-set)
   (for-each
      (lambda (gene) (cog-delete-recursive (List gene gene)))
      selfies))

Performance diary, continued.

This is not an issue report; rather, its a performance diary, tracking performance from where #98 left off.

Instrumented code to be measured as follows:

git checkout master
git pull
git checkout -b perf
git pull https://github.com/MOZI-AI/annotation-scheme.git measure
git commit
autoreconf -vif
mkdir build; cd build
../configure
make
sudo make install

Then, start guile, ideally w/ hugetlb enabled:

$ HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/libhugetlbfs.so.0 guile

Then start cogserver:

guile> (use-modules (opencog cogserver))
guile> (start-cogserver)

Then load your dataset, and run whatever job you wish to profile. Meanwhile, in another terminal:

$ rlwrap telnet localhost 17001
guile> (report)

This will print out the current profile of the currently-running code.

Alternately, if you don't have the cogserver, then just wait for your job to finish, and then run (report) at the guile REPL prompt.

(Note to self: for me, do this: $cd work/src; (load "annotate-all.scm"))

Remove build-aux directory

Follow up to discussions in issue #66 This comment: #66 (comment)_ I think everything will be cleaner and better if the unit tests live in thier own directory, and build-aux is managed by the build tools (i.e. remove build-aux) I'll try this shortly.

	(node-info (GeneNode a))
	(locate-node (GeneNode a))
	(node-info (GeneNode b))
	(locate-node (GeneNode b))
	(if (not (null? namespace))
	(ListLink
	(ConceptNode "gene-go-annotation")
	(find-go-term a (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))
	(find-go-term b (string-split (cog-name namespace) #\ ) (string->number (cog-name parent)))