Comments (24)

lauperbe commented on August 15, 2024

minimal_working_example.txt

from cdkr.

rajarshi commented on August 15, 2024

This looks like an issue on the CDK side, where it explicitly ignores mass numbers when generating the string formula

rajarshi commented on August 15, 2024

If you update to the latest GitHub versions of rcdklibs and rcdk, the resulting formula strings are now labeled with mass numbers.

Also, generate.formula.iter has been updated to take the same elements argument as generate.formula. So your updated code looks like:

library(rJava)
library(reshape2)
library(stringi)
library(rcdk)
library(RMassBank)

#parent substance (known)
target_name <- "3,5-dibromo-4-hydroxybenzoic_acid_804"
target_formula <- "C7H4Br2O3"

#masses of peaks to analyze (first is monoisotopic parent, next two are isotopologues and last is a possible in-source-fragment)
target_peaks <- c(292.8454296, 294.8432039, 296.8411527, 135.0452735)

subformula <- c()
elements <- lapply(formulastring.to.list(target_formula), range, 0)   #gives me a list to limit the formula generation

for (i in names(elements)) {
    tmp <- elements[[i]]
    tmp <- c(i, tmp)
    elements[[i]] <- tmp
}

elements[[5]] <- c("Br",0,2,81)

results <- lapply(target_peaks, function(tp) {
    # tp is the peak mass itself, not an index into target_peaks
    mit <- generate.formula.iter(tp, window = 0.05, elements, charge = 1, as.string=FALSE)
    hit <- itertools::ihasNext(mit)
    as.list(hit)
})


for (j in seq_along(target_peaks)){
  result <- c()
  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=TRUE)
  hit <- itertools::ihasNext(mit)
  # step through all candidates; only the last one generated is kept
  while (itertools::hasNext(hit))
    result <- iterators::nextElem(hit)

  
  if(!is.null(result)){ # writes found formulae into vector
    subformula[j] <- result
  }else{subformula[j]<-NA}
  
}
subformula

lauperbe commented on August 15, 2024

Thank you for your answer.
This was exactly what I was looking for. Unfortunately, the output of the function is no longer compatible with further rcdk analysis.
If I run the updated script, the output reads:
> subformula
[1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none"

But if I now try to generate an rcdk formula object from the output via
get.formula(subformula[1],1)
I get the error:
Error in .jcall(manipulator, "Lorg/openscience/cdk/interfaces/IMolecularFormula;", :
java.lang.NullPointerException

minimal_working_example_updated_isotopes.txt
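
One possible stopgap, sketched here in base R only, is to strip the mass-number annotations from the generated string and merge isotopes of the same element, so that the result is a plain formula string again. The helper names `parse_isotope_formula` and `strip_mass_numbers` are hypothetical and not part of rcdk. The isotope information is lost in the process, so this only helps when the downstream step needs the elemental composition and nothing more.

```r
# Split a string such as "[12C]7[1H]3[79Br][81Br][16O]3" into one row
# per isotope token (hypothetical helper, not part of rcdk).
parse_isotope_formula <- function(s) {
  tokens <- regmatches(s, gregexpr("\\[[0-9]+[A-Z][a-z]?\\][0-9]*", s))[[1]]
  pattern <- "^\\[([0-9]+)([A-Z][a-z]?)\\]([0-9]*)$"
  count <- sub(pattern, "\\3", tokens)
  count[count == ""] <- "1"          # a missing count means one atom
  data.frame(
    mass_number = as.integer(sub(pattern, "\\1", tokens)),
    element = sub(pattern, "\\2", tokens),
    count = as.integer(count),
    stringsAsFactors = FALSE
  )
}

# Drop the mass numbers and sum the counts per element, keeping the
# elements in order of first appearance.
strip_mass_numbers <- function(s) {
  p <- parse_isotope_formula(s)
  totals <- tapply(p$count, factor(p$element, levels = unique(p$element)), sum)
  paste0(names(totals), ifelse(totals == 1, "", totals), collapse = "")
}

strip_mass_numbers("[12C]7[1H]3[79Br][81Br][16O]3")   # "C7H3Br2O3"
```

The plain string should then be in the same form as target_formula at the top of the script. Whether that is acceptable depends on whether the later analysis needs to distinguish 79Br from 81Br.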

rajarshi commented on August 15, 2024

Ah yes - the rest of the CDK code doesn't recognize the mass number annotated formulae. One workaround for now is to tell the generator to return formula objects rather than strings.

  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE) 

So, doing this gives you something like

> subformula <- list()
> for (j in 1:length(target_peaks)){
+   result<-c()
+   mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE) 
+   hit <- itertools::ihasNext(mit)
+   while (itertools::hasNext(hit))
+     result <-  iterators::nextElem(hit) 
+   if(is.null(result)==F){ # writes found formulae into vector
+       subformula[[j]] <- result
+   }else{
+       subformula[[j]]<- "none"
+   }
+ }
> subformula
[[1]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@67f89fa3}"

[[2]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@4ac68d3e}"

[[3]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@277c0f21}"

[[4]]
[1] "none"

You could then manipulate the formula objects using CDK classes/methods via .jcall. It's a bit clunky, but until we update the CDK side of things, this would be the best way.

rajarshi commented on August 15, 2024

Also, on a somewhat unrelated note, the elements list looks like

> elements
$C
[1] "C" "0" "7"

$H
[1] "H" "0" "4"

$Br
[1] "Br" "0"  "2" 

$O
[1] "O" "0" "3"

[[5]]
[1] "Br" "0"  "2"  "81"

For the entries where mass number is not specified, is it expected or assumed that the major isotope is to be used?

lauperbe commented on August 15, 2024

Where the mass number is not specified, the major isotope is assumed, and CDK treats it the same way.

Thanks for the answers

schymane commented on August 15, 2024

schymane commented on August 15, 2024

trljcl commented on August 15, 2024

schymane commented on August 15, 2024

@trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague to ask whether he knows of any (as we just debated this at great length for the InChI specs).

schymane commented on August 15, 2024

@ChemConnector we need the ACS Style guide open! ... I can't find info in Wikipedia and the InChI specs don't cover this ...
This does not cover computational representations for formulae well as far as I can see:
https://en.wikipedia.org/wiki/Chemical_formula

trljcl commented on August 15, 2024

rajarshi commented on August 15, 2024

Thanks for all the feedback - especially since the usage you're all discussing is pretty far from my expertise!

If you can point me to docs regarding standardized (or even commonly accepted) format for mass number annotation in a formula, I can look at that on the CDK side. Personally, the current version (adding mass number to every element) is ugly, and it appears to also be incompatible.

@schymane re the definition of major isotope, one way around it is to manually specify the desired mass number in the element list, which forces rcdk to employ that specific isotope, rather than go with a major isotope (however that is defined)

schymane commented on August 15, 2024

rajarshi commented on August 15, 2024

My bad - the CDK Javadocs do define what the major isotope is. See here

Returns the most abundant (major) isotope with a given atomic number.
The isotope's abundance is for atoms with atomic number 60 and smaller defined 
as a number that is proportional to the 100 of the most abundant isotope. For atoms 
with higher atomic numbers, the abundance is defined as a percentage.

lauperbe commented on August 15, 2024

Just checked with enviPat:
It puts the mass number in square brackets before the element symbol for non-major isotopes and uses no brackets for major isotopes. It also always needs an atom count, even if it is 1.
Example:
[15]N1H3

But in enviPat one can always define new isotopes with whatever nomenclature one wants by simply appending to their isotope list.
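
As a rough illustration of that notation, the following base R sketch writes a formula in the bracket style described above: the mass number goes in brackets before the element for non-major isotopes, major isotopes get no brackets, and every token carries an explicit count. The function `to_envipat_style` and the small `major_isotope` lookup are hypothetical, cover only the elements mentioned in this thread, and do not call enviPat itself.

```r
# Assumed major isotopes for the elements discussed here (illustrative only)
major_isotope <- c(H = 1L, C = 12L, N = 14L, O = 16L, Br = 79L)

# Build a bracket-style formula string from parallel vectors of element
# symbols, mass numbers and atom counts (hypothetical helper).
to_envipat_style <- function(element, mass_number, count) {
  stopifnot(all(element %in% names(major_isotope)))
  is_major <- mass_number == major_isotope[element]
  prefix <- ifelse(is_major, "", paste0("[", mass_number, "]"))
  paste0(prefix, element, count, collapse = "")   # count is always written
}

to_envipat_style(c("N", "H"), c(15L, 1L), c(1L, 3L))
# "[15]N1H3"

to_envipat_style(c("C", "H", "Br", "Br", "O"),
                 c(12L, 1L, 79L, 81L, 16L),
                 c(7L, 3L, 1L, 1L, 3L))
# "C7H3Br1[81]Br1O3"
```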

hunter-moseley commented on August 15, 2024

Emma asked me if I had any thoughts.

My only recommendation for mass calculations is that "[##]Ee" refer to specific isotope mass and that "Ee" refer to elemental natural abundance mass. If done this way, then a combination of exact isotope mass and/or elemental natural abundance mass can be used to calculate molecular mass.
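
A minimal sketch of that convention in base R, assuming a "[##]Ee" token carries the exact isotope mass and a bare "Ee" token carries the natural-abundance (average) mass. The function `mixed_mass` is hypothetical, and the two lookup tables hold rounded reference values for a handful of elements only.

```r
# Rounded masses in u, for illustration; extend as needed
isotope_mass <- c("1H" = 1.007825, "2H" = 2.014102,
                  "12C" = 12.000000, "13C" = 13.003355,
                  "15N" = 15.000109, "16O" = 15.994915,
                  "79Br" = 78.918338, "81Br" = 80.916290)
average_mass <- c(H = 1.008, C = 12.011, N = 14.007, O = 15.999, Br = 79.904)

# Sum the mass of a formula in which "[15]N" means the exact isotope
# and "N" means the natural-abundance element (hypothetical helper).
mixed_mass <- function(formula) {
  tokens <- regmatches(formula,
    gregexpr("(\\[[0-9]+\\])?[A-Z][a-z]?[0-9]*", formula))[[1]]
  pattern <- "^(\\[([0-9]+)\\])?([A-Z][a-z]?)([0-9]*)$"
  mass_number <- sub(pattern, "\\2", tokens)   # "" when no isotope is given
  element <- sub(pattern, "\\3", tokens)
  count <- sub(pattern, "\\4", tokens)
  count[count == ""] <- "1"
  per_atom <- ifelse(mass_number == "",
                     average_mass[element],
                     isotope_mass[paste0(mass_number, element)])
  stopifnot(!anyNA(per_atom))                  # fail on unknown entries
  sum(per_atom * as.integer(count))
}

mixed_mass("[15]N1H3")   # 15.000109 + 3 * 1.008 = 18.024109
```

Here the nitrogen is pinned to 15N while the hydrogens use the average atomic mass, which is the kind of combination described above.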

schymane commented on August 15, 2024

Thanks @hunter-moseley - If we know the exact assumption that CDK uses for defining the major isotope, then surely we can do both natural abundance and exact isotope mass implicitly if [##] is missing for the major isotope? It will be rather ugly to have to deal with explicitly-defined numbers in every formula...and this is not something I would like to e.g. see annotated in MassBank records, ideally we'd be able to have a compact and readable molecular formula / fragment annotation (and hide the details behind the scenes)
See e.g. PK$ANNOTATION here:
https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=AU169406&dsn=UOA

@rajarshi I find the CDK's definition rather strange re 100 vs % above atomic number 60 ... can't quite visualize the consequences but haven't had a chance to crunch the numbers. Is there a reason for such a disjoint definition? Does the >atomic number 60 definition overlap with the way it is defined here?
http://www.sisweb.com/referenc/source/exactmas.htm

schymane commented on August 15, 2024

The original display chosen by @rajarshi is consistent with the SMILES annotation ... but I think we should still aim for consistency between other software approaches? The square brackets in SMILES capture different/additional information in a different way that is not relevant to us.
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Smiles Name
[12C] carbon-12
[13C] carbon-13
[C] carbon (unspecified mass)
[13CH4] C-13 methane

rajarshi commented on August 15, 2024

@schymane re the CDK definition of major isotope - I unfortunately don't know why it was chosen. The Javadocs indicate it was written by Chris Steinbeck, so I guess he could shed some light.

Re the string representation - yes, the SMILES approach influenced me, but given that molecular formula strings are not SMILES, I don't think we have to stick to that, and would rather go with the more accepted representation used by this community.

rajarshi commented on August 15, 2024

Interestingly, looking at the Java sources suggests that the major isotope is simply the most abundant isotope for any element (and no consideration is made for atomic numbers < 60 or > 60).

hunter-moseley commented on August 15, 2024

The problem is that "major" isotope loses some of its meaning when the percentage drops below 50%. Take molybdenum for instance: https://en.wikipedia.org/wiki/Isotopes_of_molybdenum .
I suggest that you have two interpretations of mass, based on the definitions of "nominal mass" and "most abundant mass".

Definition of "nominal mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Nominal_mass
This would be in contrast to the definition of "most abundant mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Most_abundant_mass

rajarshi commented on August 15, 2024

So currently, CDK's getMajorIsotope corresponds to the Most abundant mass definition. I guess we'd have to add an annotation for stable isotopes to be able to return the Nominal mass result.

For the case of Mo, it seems that the nominal and most abundant masses correspond to the same isotope?
