Hi, First of all thanks for developing and maintaining Gapseq! From what I've

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Exploit gapfilled model for metabolic comparison about gapseq HOT 5 OPEN

mgabriell1 commented on July 22, 2024

Exploit gapfilled model for metabolic comparison

from gapseq.

Comments (5)

jotech commented on July 22, 2024

Hi @mgabriell1

Thank you for your interest in gapseq!

What you are proposing is indeed an interesting application! My first guess would be to check out the reaction attributes table of the final model after gap-filling. Using that table, you can trace the origin of each reaction (added because of sequence homology, gap filling, etc.), which can then be linked back to the pathway level.
More details can be found here: https://gapseq.readthedocs.io/en/latest/tutorials/traceability.html

Let me know if this can work for you.

from gapseq.

mgabriell1 commented on July 22, 2024

Hi,
Thanks for the quick reply!

That table seems to contain all the info that I need, but I am not completely sure on how to link it to meta metaCYC pathways. Using the ecoli.RDS test model I see in the reaction attribute table there are 3029 reactions which are distributed in 560 pathways, while in the all-Pathways.tbl file there are 2900 pathways. So in the model a lot are excluded (makes sense), but I would also like to know which pathways are not complete.

Looking at the source file I found the file meta_rea_pwy-gapseq.tbl which seems to contain the mapping between pathways and reactions which could be used to estimate the pathway completeness, but I'm having troubles linking that file with the reactions attribute table.
For example, \|PWY-7184\| (a random pwy) has a completeness of 88% in all-Pathways.tbl and the rxn DCDPKIN-RXN is among the ones found. However, reaction attribute table this rxn ID is not present in the column rxn, but rather in the column biocyc ID. So if I would use the column rxn I would not detect this. Conversely, column biocyc ID is often empty and so I cannot use that column as well. Other rxn ids are present in the column gapseq in meta_rea_pwy-gapseq.tbl, but often these are empty and so again not useful to map the two tables.
Given that the information seems to be linked across different columns, I'm not sure if I'm using the correct information. Could you confirm that this seems to be the right track?
If so, I guess I can check the different columns and then check whether any of them has a match.

Another question that I have is: gapseq estimates pwy completeness just by dividing the number of genes present over the total number of genes? I'm asking this to understand whether I wonder to consider "parallel" pwy routes or that has already be taken into account in metaCYC.

I hope this was clear enough.
Thanks again for the support!

from gapseq.

Waschina commented on July 22, 2024

Hi @mgabriell1

I would not use the meta_rea_pwy-gapseq.tbl. Instead, the problem could be addressed for example like this:

library(sybil)
library(data.table)

# read model data
mod_filled <- readRDS("yogurt/ldel.RDS")
mod_draft  <- readRDS("yogurt/ldel-draft.RDS")
gs_findR   <- fread("yogurt/ldel-all-Reactions.tbl")
gs_findP   <- fread("yogurt/ldel-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("~/Software/gapseq/dat/meta_pwy.tbl"),
               fread("~/Software/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

I hope the code comments make it clear what happens at each step. The examples use gapseq data from here: https://github.com/Waschina/gapseq.tutorial.data/tree/master/yogurt

I noticed that there seem to be some inconsistencies in the way the pathway completeness is calculated in the ...-all-Pathways.tbl. I'll investigate this.

from gapseq.

Waschina commented on July 22, 2024

Small addition: With the example above, I noticed that there is a small inconsistency in how the completeness reported in the ...-all-pathways.tbl is calculated. See issue #216 . We are working on a fix

from gapseq.

mgabriell1 commented on July 22, 2024

Hi Silvio,
Thanks a lot for the code!
I've tried it with the tutorial file and one of my genomes (attached testData.zip) and I've noticed that a few pathways have a new completeness estimation above 100%. I found a couple of reasons while that was the case:

The spontaneous reactions were included both in the number of newly added reactions by gapfilling and while "discounting" them for pathway completeness estimation (as you wrote here: #216). My solution was to remove them from the count of newly added reactions during completeness estimation
The other issue was that in a few cases the column newReactionsFound included some of the ones present in ReactionsFound. I haven't understood completely why (maybe some mismatch caused by trailing whitespaces?), but I've implemented a loop to remove these duplicated reactions
Finally, I've noticed that some reactions attributed in mod_filled@react_attr and thus in ReactionsFound were not included in pwyDB (e..g., "ACETOACETATE-DECARBOXYLASE-RXN" is found in "|PWY-5451|" even though this is not found here (https://metacyc.org/pathway?orgid=META&id=PWY-5451) or in pwyDB). I've opted to remove those from ReactionsFound, but this might indicate that there are other issues in other parts of the pipeline

With these edits I've managed to reduce the number of pathways more than 100% complete, but in few cases this still occurred. All these pathways present vague reactions and their number (derived from -all-Pathways.tbl) is higher than the number of reactions in pwyDB (e.g., "|PWY-7695|" is indicated with 12 vague reactions, but only 7 reactions in pwyDB). Could it be an issue with the reactions database?

Here is the edited code (not the prettiest, but it seems to do the job):



###
library(sybil)
library(data.table)

# read model data
# mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel.RDS")
# mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-draft.RDS")
# gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Reactions.tbl")
# gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Pathways.tbl")

mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/Legio00005_manualCheck-contigs.RDS")
mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/model_drafts/Legio00005_manualCheck-contigs-draft.RDS")
gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_reactions/Legio00005_manualCheck-contigs-all-Reactions.tbl.gz")
gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_pathways/Legio00005_manualCheck-contigs-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/meta_pwy.tbl"),
               fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# # Remove duplicated entries and ones not present in pwyDB
for (i in 1:nrow(gs_findP)){
  #print(c(i, gs_findP[i, "newReactionsFound"]))

  newReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "newReactionsFound"]), " "))
  oldReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "ReactionsFound"]), " "))
  gs_findP[i, "newReactionsFound"] <- paste0(newReacts_tmp[!(newReacts_tmp %in% oldReacts_tmp)], collapse  = " ")
  
  allReacts_tmp <- unlist(strsplit(pwyDB[ id == as.character(gs_findP[i, ID]), reaId], ","))
  gs_findP[i, "ReactionsFound"] <- paste0(oldReacts_tmp[(oldReacts_tmp %in% allReacts_tmp)], collapse  = " ")
}

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew_spont := length(Reduce(intersect, list(unlist(strsplit(newReactionsFound, " ")), unlist(strsplit(spont, ","))))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

gs_findP[,diff_C := C_new - C_old]
gs_findP[,C_new_corr := (Nold + Nnew - Nnew_spont + VagueReactions)/(reaNr - spontNr)*100] # Remove new spontaneous reactions

Thanks again for the help!

from gapseq.

Exploit gapfilled model for metabolic comparison about gapseq HOT 5 OPEN

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent