Hi Rcdk team I have a small question regarding isotope annotation fo

Rcdk isotopes in generate.formula.iter() about cdkr HOT 24 CLOSED

cdk-r commented on August 15, 2024

Rcdk isotopes in generate.formula.iter()

from cdkr.

Comments (24)

lauperbe commented on August 15, 2024

minimal_working_example.txt

from cdkr.

rajarshi commented on August 15, 2024

This looks like an issue on the CDK side, where it explicitly ignores mass numbers when generating the string formula

from cdkr.

rajarshi commented on August 15, 2024

If you update to the latest github versions of rcdklibs and rcdk, the resultant formulae strings are now labeled with mass numbers.

Also generate.formula.iter is updated to take the same elements argument as generate.formula. So your updated code looks like

library(rJava)
library(reshape2)
library(stringi)
library(rcdk)
library(RMassBank)

#parent substance (known)
target_name <- "3,5-dibromo-4-hydroxybenzoic_acid_804"
target_formula <- "C7H4Br2O3"

#masses of peaks to analyze (first is monoisotopic parent, next two are isotopologues and last is a possible in-source-fragment)
target_peaks <- c(292.8454296, 294.8432039, 296.8411527, 135.0452735)

subformula <- c()
elements <- lapply(formulastring.to.list(target_formula), range, 0)   #gives me a list to limit the formula generation

for (i in names(elements)) {
    tmp <- elements[[i]]
    tmp <- c(i, tmp)
    elements[[i]] <- tmp
}

elements[[5]] <- c("Br",0,2,81)

results <- lapply(target_peaks, function(tp) {
    mit <- generate.formula.iter(target_peaks[tp], window = 0.05, elements, charge = 1, as.string=FALSE) 
    hit <- itertools::ihasNext(mit)
    as.list(hit)
})


result <- c()
for (j in 1:length(target_peaks)){
  result<-c()
  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=TRUE) 
  hit <- itertools::ihasNext(mit)
  while (itertools::hasNext(hit))
    result <-  iterators::nextElem(hit) 

  
  if(!is.null(result)){ # writes found formulae into vector
    subformula[j] <- result
  }else{subformula[j]<-NA}
  
}
subformula

from cdkr.

lauperbe commented on August 15, 2024

Thank you for your answer.
This was exactly what I was looking for. Unfortunately, the output of the function is now not compatible anymore with further rcdk analysis.
If I run the updated script, the output reads (> subformula
[1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )

But if I now try to generate a Rcdk formula element from the output via
get.formula(subformula[1],1)
I get the error: Error in .jcall(manipulator, "Lorg/openscience/cdk/interfaces/IMolecularFormula;", :
java.lang.NullPointerException

minimal_working_example_updated_isotopes.txt

from cdkr.

rajarshi commented on August 15, 2024

Ah yes - the rest of the CDK code doesn't recognize the mass number annotated formulae. One workaround for now is to tell the generator to return formula objects rather than strings.

  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE)

So, doing this gives you something like

> subformula <- list()
> for (j in 1:length(target_peaks)){
+   result<-c()
+   mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE) 
+   hit <- itertools::ihasNext(mit)
+   while (itertools::hasNext(hit))
+     result <-  iterators::nextElem(hit) 
+   if(is.null(result)==F){ # writes found formulae into vector
+       subformula[[j]] <- result
+   }else{
+       subformula[[j]]<- "none"
+   }
+ }
> subformula
[[1]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@67f89fa3}"

[[2]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@4ac68d3e}"

[[3]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@277c0f21}"

[[4]]
[1] "none"

You could then manipulate the formula objects using CDK classes/methods via .jcall. It's a bit klunky, but until we update the CDK side of things, this would be the best way

from cdkr.

rajarshi commented on August 15, 2024

Also, on a somewhat unrelated note, the elements list looks like

> elements
$C
[1] "C" "0" "7"

$H
[1] "H" "0" "4"

$Br
[1] "Br" "0"  "2" 

$O
[1] "O" "0" "3"

[[5]]
[1] "Br" "0"  "2"  "81"

For the entries where mass number is not specified, is it expected or assumed that the major isotope is to be used?

from cdkr.

lauperbe commented on August 15, 2024

For the not specified mass numbers, the major isotope is assumed and cdk also uses it like this.

Thanks for the answers

from cdkr.

schymane commented on August 15, 2024

How is the major isotope defined? The isotope of maximum intensity, or a la the InChI definition (the rounded average atomic mass of the element) – which is not the same and leads to confusion for elements such as Se and Sn? For Sn this would mean either 120Sn (max abundance of 32%) or 119Sn (rounded from average mass of 118.71) is the reference. Is there a way to make the isotope explicit to avoid confusion/assumptions?

from cdkr.

schymane commented on August 15, 2024

@lauperbe you may also want to check out how compatible this all is with enviPat. This surely handles isotopes because they have to; we should try and get rcdk and enviPat annotating formulas in consistent ways that follow clearly defined chemical conventions if possible (a lot of my/our code uses both packages) (> subformula [1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )

from cdkr.

trljcl commented on August 15, 2024

If I can jump in here, I also thought the standard notation for identifying less common isotopes is an integer value wrapped in square brackets prefixing the elemental symbol, e.g. in the example [12C]7[1H]3[79Br]2[16O]3 as 79Br is the only element with a less common isotope, this should be implicitly written as C7H3[79]Br2O3 enviPat and (and also commercial MS software packages with formula generators / interpreters) recognize this. They do not recognize C7H3[79Br]2O3 thanks Tony

…

On 20 April 2018 at 08:05, Emma Schymanski ***@***.***> wrote: @lauperbe you may also want to check out how compatible this all is with enviPat. This surely handles isotopes because they have to; we should try and get rcdk and enviPat annotating formulas in consistent ways that follow clearly defined chemical conventions if possible (a lot of my/our code uses both packages) (> subformula [1] <https://maps.google.com/?q=1%5D++%5B12C%5D7&entry=gmail&source=g>" [12C]7 <https://maps.google.com/?q=1%5D++%5B12C%5D7&entry=gmail&source=g>[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" ) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#65 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADeo0l6Kc0RmClUGILEpi7s7zmY_W2Ynks5tqYi5gaJpZM4TWWKm> .

-- Dr. Tony R. Larson Head of Metabolomics & Proteomics, Department of Biology, Area 15 University of York Wentworth Way Heslington York YO10 5DD UK Tel: +44(0)1904 328 733 (office) Tel: +44(0)7833 471 685 (mobile) [email protected] http://scholar.google.com/citations?user=9hLFka4AAAAJ www.york.ac.uk/biology/technology-facility/proteomics/ www.york.ac.uk/mass-spectrometry

from cdkr.

schymane commented on August 15, 2024

@trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague if he knows any (as we just debated this at great length for InChI specs)

from cdkr.

schymane commented on August 15, 2024

@ChemConnector we need the ACS Style guide open! ... I can't find info in Wikipedia and the InChI specs don't cover this ...
This does not cover computational representations for formulae well as far as I can see:
https://en.wikipedia.org/wiki/Chemical_formula

from cdkr.

trljcl commented on August 15, 2024

No, sorry - I am a primarily a small molecule biologist who uses various packages to annotate monoisotopic MS data. All software I've come across use the [i]symbol notation for less common isotopes - so this is at least pragmatically the best way! It would be nice if rcdk recognized these as inputs for mass calculations, but AFAIK the underlying cdk java code assumes natural isotope abundance distributions for input elemental symbols to calculate monoisotopic and average masses. This would need some tweaking to effectively hard-set the isotope distributions. Going the other way (i.e. defining isotope limits to generate a formula from an input mass, with outputs containing annotated isotope symbols) is relatively easier given the underlying cdk java code already does those calculations, and Raj recently very helpfully updated the rcdk code to incorporate this. Tony

…

On 20 April 2018 at 08:37, Emma Schymanski ***@***.***> wrote: @trljcl <https://github.com/trljcl> thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague if he knows any (as we just debated this at great length for InChI specs) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#65 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADeo0oIA4WkEon-d3s0AvT95YMWYT6Miks5tqZBBgaJpZM4TWWKm> .

from cdkr.

rajarshi commented on August 15, 2024

Thanks for all the feedback - especially since the usage you're all discussing is pretty far from my expertise!

If you can point me to docs regarding standardized (or even commonly accepted) format for mass number annotation in a formula, I can look at that on the CDK side. Personally, the current version (adding mass number to every element) is ugly, and it appears to also be incompatible.

@schymane re the definition of major isotope, one way around it is to manually specify the desired mass number in the element list, which forces rcdk to employ that specific isotope, rather than go with a major isotope (however that is defined)

from cdkr.

schymane commented on August 15, 2024

I will post any documentation if/when I find any :-) Re: “however it is defined” … if it is clearly defined/documented somewhere in the CDK how it is defined, that’s fine (I find it very difficult to navigate CDK documentation…). Otherwise when I get a chance I will do some number crunching with the tricky cases (Sn, Se) and see what happens … it would be nice not to have to define things explicitly every time. If you add some simple cases to your documentation I can try out some trickier cases and add them to your rcdk documentation if/once I get the chance so it’s clear. Btw thanks for getting this moving, we’ve been doing workaround functions for years, but now that we really have to handle isotopes properly and now that CDK2.0 is out I think it’s a great chance to get this fixed!

from cdkr.

rajarshi commented on August 15, 2024

My bad - the CDK Javadocs do define what the major isotope is. See here

Returns the most abundant (major) isotope with a given atomic number.
The isotope's abundance is for atoms with atomic number 60 and smaller defined 
as a number that is proportional to the 100 of the most abundant isotope. For atoms 
with higher atomic numbers, the abundance is defined as a percentage.

from cdkr.

lauperbe commented on August 15, 2024

Just checked with enviPat:
It uses the notation of []Brackets before the element symbols for none-main Isotopes and no brackets for major isotopes. It also always needs an atom count, even if it is 1.
Ex:
[15]N1H3

But in enviPat one can always define new isotopes with whatever nomenclature one wants by simply appending to their isotope list.

from cdkr.

hunter-moseley commented on August 15, 2024

Emma asked me if I had any thoughts.

My only recommendation for mass calculations is that "[##]Ee" refer to specific isotope mass and that "Ee" refer to elemental natural abundance mass. If done this way, then a combination of exact isotope mass and/or elemental natural abundance mass can be used to calculate molecular mass.

from cdkr.

schymane commented on August 15, 2024

Thanks @hunter-moseley - If we know the exact assumption that CDK uses for defining the major isotope, then surely we can do both natural abundance and exact isotope mass implicitly if [##] is missing for the major isotope? It will be rather ugly to have to deal with explicitly-defined numbers in every formula...and this is not something I would like to e.g. see annotated in MassBank records, ideally we'd be able to have a compact and readable molecular formula / fragment annotation (and hide the details behind the scenes)
See e.g. PK$ANNOTATION here:
https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=AU169406&dsn=UOA

@rajarshi I find the CDK's definition rather strange re 100 vs % above atomic number 60 ... can't quite visualize the consequences but haven't had a chance to crunch the numbers. Is there a reason for such a disjoint definition? Does the >atomic number 60 definition overlap with the way it is defined here?
http://www.sisweb.com/referenc/source/exactmas.htm

from cdkr.

schymane commented on August 15, 2024

The original display chosen by @rajarshi is consistent with the SMILES annotation ... but I think we should still aim for consistency between other software approaches? The square brackets in SMILES capture different/additional information in a different way that is not relevant to us.
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Smiles	Name
[12C]	carbon-12
[13C]	carbon-13
[C]	carbon (unspecified mass)
[13CH4]	C-13 methane

from cdkr.

rajarshi commented on August 15, 2024

@schymane re the CDK definition of major isotope - I unfortunately don't know why it was chosen. The Javadocs indicate it was written by Chris Steinbeck, so I guess he could shed some light.

Re the string representation - yes, the SMILES approach influenced me, but given that molecular formulae strings are not SMILES, I don't think we have to be stick to that, and rather go with the more accepted representation used by this community.

from cdkr.

rajarshi commented on August 15, 2024

Interestingly, looking at the Java sources suggests that the major isotope is simply the most abundant isotope for any element (and no consideration is made for atomic numbers < 60 or > 60).

from cdkr.

hunter-moseley commented on August 15, 2024

The problem is that "major" isotope loses some of its meaning when the percentage drop below 50%. Take molybdenum for instance: https://en.wikipedia.org/wiki/Isotopes_of_molybdenum .
I suggest that you have two interpretation of mass based on the definitions of "nominal mass" and "most abundant mass".

Definition of "nominal mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Nominal_mass
This would be in contrast to the definition of "most abundant mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Most_abundant_mass

from cdkr.

rajarshi commented on August 15, 2024

So currently, CDK's getMajorIsotope corresponds to the Most abundant mass definition. I guess we'd have to add an annotation for stable isotopes to be able to return the Nominal mass result.

For the case of Mo, it seems that the nominal and most abundant masses correspond to the same isotope?

from cdkr.

Rcdk isotopes in generate.formula.iter() about cdkr HOT 24 CLOSED

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent