Comments (24)
from cdkr.
This looks like an issue on the CDK side, where it explicitly ignores mass numbers when generating the string formula.
from cdkr.
If you update to the latest GitHub versions of `rcdklibs` and `rcdk`, the resultant formula strings are now labeled with mass numbers.
Also, `generate.formula.iter` is updated to take the same `elements` argument as `generate.formula`. So your updated code looks like
library(rJava)
library(reshape2)
library(stringi)
library(rcdk)
library(RMassBank)
#parent substance (known)
target_name <- "3,5-dibromo-4-hydroxybenzoic_acid_804"
target_formula <- "C7H4Br2O3"
#masses of peaks to analyze (first is monoisotopic parent, next two are isotopologues and last is a possible in-source-fragment)
target_peaks <- c(292.8454296, 294.8432039, 296.8411527, 135.0452735)
subformula <- c()
# Build per-element limits c(symbol, min, max) to constrain the formula generation
elements <- lapply(formulastring.to.list(target_formula), range, 0)
for (i in names(elements)) {
  elements[[i]] <- c(i, elements[[i]])
}
# Extra entry for bromine with an explicit mass number (81) as the fourth value
elements[[5]] <- c("Br", 0, 2, 81)
# All candidate formula objects per peak; tp is the peak mass itself, not an index
results <- lapply(target_peaks, function(tp) {
  mit <- generate.formula.iter(tp, window = 0.05, elements, charge = 1, as.string = FALSE)
  hit <- itertools::ihasNext(mit)
  as.list(hit)
})
# Keep the last formula string found for each peak, or NA if nothing matched
for (j in seq_along(target_peaks)) {
  result <- c()
  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string = TRUE)
  hit <- itertools::ihasNext(mit)
  while (itertools::hasNext(hit)) {
    result <- iterators::nextElem(hit)
  }
  if (!is.null(result)) { # writes found formulae into vector
    subformula[j] <- result
  } else {
    subformula[j] <- NA
  }
}
subformula
from cdkr.
Thank you for your answer.
This was exactly what I was looking for. Unfortunately, the output of the function is no longer compatible with further rcdk analysis.
If I run the updated script, the output reads:
> subformula
[1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none"
But if I now try to generate an rcdk formula object from the output via
get.formula(subformula[1],1)
I get the error:
Error in .jcall(manipulator, "Lorg/openscience/cdk/interfaces/IMolecularFormula;", :
java.lang.NullPointerException
minimal_working_example_updated_isotopes.txt
from cdkr.
Ah yes - the rest of the CDK code doesn't recognize the mass number annotated formulae. One workaround for now is to tell the generator to return formula objects rather than strings.
mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE)
So, doing this gives you something like
> subformula <- list()
> for (j in 1:length(target_peaks)){
+ result<-c()
+ mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE)
+ hit <- itertools::ihasNext(mit)
+ while (itertools::hasNext(hit))
+ result <- iterators::nextElem(hit)
+ if(is.null(result)==F){ # writes found formulae into vector
+ subformula[[j]] <- result
+ }else{
+ subformula[[j]]<- "none"
+ }
+ }
> subformula
[[1]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@67f89fa3}"
[[2]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@4ac68d3e}"
[[3]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@277c0f21}"
[[4]]
[1] "none"
You could then manipulate the formula objects using CDK classes/methods via `.jcall`. It's a bit klunky, but until we update the CDK side of things, this would be the best way.
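If plain strings are still needed downstream, one possible stopgap is to collapse the mass-annotated strings back to an ordinary elemental formula. The sketch below is base R only; `strip_mass_numbers` is a hypothetical helper, not part of rcdk, and it assumes the `[<mass number><symbol>]<count>` layout shown in the output earlier in this thread. It discards the isotope information, so it only helps when the next step needs the elemental composition alone.

```r
# Hypothetical helper (not an rcdk function): turn "[12C]7[1H]3[79Br][81Br][16O]3"
# into "C7H3Br2O3" by dropping mass numbers and summing counts per element.
strip_mass_numbers <- function(formula) {
  # One token per isotope entry, e.g. "[79Br]2" or "[81Br]"
  pattern <- "\\[[0-9]+([A-Z][a-z]?)\\]([0-9]*)"
  tokens <- regmatches(formula, gregexpr(pattern, formula))[[1]]
  if (length(tokens) == 0) return(NA_character_)  # e.g. the "none" placeholder
  symbols <- sub(pattern, "\\1", tokens)
  counts <- sub(pattern, "\\2", tokens)
  counts[counts == ""] <- "1"                      # a missing count means one atom
  counts <- as.integer(counts)
  # Sum counts per element, keeping the order of first appearance
  totals <- tapply(counts, factor(symbols, levels = unique(symbols)), sum)
  paste0(names(totals), ifelse(totals == 1, "", totals), collapse = "")
}

annotated <- c("[12C]7[1H]3[79Br]2[16O]3",
               "[12C]7[1H]3[79Br][81Br][16O]3",
               "none")
plain <- vapply(annotated, strip_mass_numbers, character(1), USE.NAMES = FALSE)
print(plain)
```

Entries that contain no bracketed isotope (such as the `"none"` placeholder) come back as `NA`, so they can be filtered out before any further parsing.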
from cdkr.
Also, on a somewhat unrelated note, the `elements` list looks like
> elements
$C
[1] "C" "0" "7"
$H
[1] "H" "0" "4"
$Br
[1] "Br" "0" "2"
$O
[1] "O" "0" "3"
[[5]]
[1] "Br" "0" "2" "81"
For the entries where mass number is not specified, is it expected or assumed that the major isotope is to be used?
from cdkr.
Where no mass number is specified, the major isotope is assumed, and CDK handles it the same way.
Thanks for the answers
from cdkr.
@trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague if he knows any (as we just debated this at great length for InChI specs)
from cdkr.
@ChemConnector we need the ACS Style guide open! ... I can't find info in Wikipedia and the InChI specs don't cover this ...
This does not cover computational representations for formulae well as far as I can see:
https://en.wikipedia.org/wiki/Chemical_formula
from cdkr.
Thanks for all the feedback - especially since the usage you're all discussing is pretty far from my expertise!
If you can point me to docs regarding a standardized (or even commonly accepted) format for mass number annotation in a formula, I can look at that on the CDK side. Personally, I find the current version (adding a mass number to every element) ugly, and it also appears to be incompatible.
@schymane re the definition of major isotope, one way around it is to manually specify the desired mass number in the element list, which forces `rcdk` to employ that specific isotope, rather than go with a major isotope (however that is defined).
from cdkr.
My bad - the CDK Javadocs do define what the major isotope is:
Returns the most abundant (major) isotope with a given atomic number.
The isotope's abundance is for atoms with atomic number 60 and smaller defined
as a number that is proportional to the 100 of the most abundant isotope. For atoms
with higher atomic numbers, the abundance is defined as a percentage.
from cdkr.
Just checked with enviPat:
It puts the mass number in square brackets before the element symbol for non-major isotopes and uses no brackets for major isotopes. It also always needs an atom count, even if it is 1.
Ex:
[15]N1H3
But in enviPat one can always define new isotopes with whatever nomenclature one wants by simply appending to their isotope list.
from cdkr.
Emma asked me if I had any thoughts.
My only recommendation for mass calculations is that "[##]Ee" refer to specific isotope mass and that "Ee" refer to elemental natural abundance mass. If done this way, then a combination of exact isotope mass and/or elemental natural abundance mass can be used to calculate molecular mass.
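A minimal sketch of that convention in base R, assuming the enviPat-style layout mentioned above (mass number in brackets before the symbol, e.g. `[15]N1H3`). The function name `formula_mass` and both lookup tables are illustrative assumptions: the tables hold only a few rounded values, and a real tool would load a vetted isotope table instead.

```r
# Illustrative, rounded values only (assumption: replace with a vetted table)
isotope_mass <- c("12C" = 12.000000, "13C" = 13.003355, "1H" = 1.007825,
                  "14N" = 14.003074, "15N" = 15.000109)
average_mass <- c(C = 12.011, H = 1.008, N = 14.007)

# Hypothetical helper: "[##]Ee" uses the exact isotope mass,
# bare "Ee" uses the natural-abundance (average) atomic mass.
formula_mass <- function(formula) {
  tokens <- regmatches(formula,
                       gregexpr("(\\[[0-9]+\\])?[A-Z][a-z]?[0-9]*", formula))[[1]]
  if (paste(tokens, collapse = "") != formula) stop("cannot parse: ", formula)
  has_iso <- grepl("^\\[", tokens)
  mass_no <- ifelse(has_iso, sub("^\\[([0-9]+)\\].*$", "\\1", tokens), NA)
  symbol <- sub("^(\\[[0-9]+\\])?([A-Z][a-z]?)[0-9]*$", "\\2", tokens)
  count <- sub("^(\\[[0-9]+\\])?[A-Z][a-z]?", "", tokens)
  count[count == ""] <- "1"                       # a missing count means one atom
  per_atom <- unname(ifelse(has_iso,
                            isotope_mass[paste0(mass_no, symbol)],
                            average_mass[symbol]))
  if (anyNA(per_atom)) stop("unknown isotope or element in: ", formula)
  sum(per_atom * as.integer(count))
}

labelled <- formula_mass("[15]N1H3")    # exact 15N plus average-mass hydrogen
unlabelled <- formula_mass("N1H3")      # average masses throughout
print(c(labelled = labelled, unlabelled = unlabelled))
```

Because each token is resolved independently, exact isotope masses and natural-abundance masses can be mixed freely within one formula, which is the property the recommendation is after.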
from cdkr.
Thanks @hunter-moseley - If we know the exact assumption that CDK uses for defining the major isotope, then surely we can do both natural abundance and exact isotope mass implicitly if [##] is missing for the major isotope? It will be rather ugly to have to deal with explicitly-defined numbers in every formula...and this is not something I would like to e.g. see annotated in MassBank records, ideally we'd be able to have a compact and readable molecular formula / fragment annotation (and hide the details behind the scenes)
See e.g. PK$ANNOTATION here:
https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=AU169406&dsn=UOA
@rajarshi I find the CDK's definition rather strange re 100 vs % above atomic number 60 ... can't quite visualize the consequences but haven't had a chance to crunch the numbers. Is there a reason for such a disjoint definition? Does the >atomic number 60 definition overlap with the way it is defined here?
http://www.sisweb.com/referenc/source/exactmas.htm
from cdkr.
The original display chosen by @rajarshi is consistent with the SMILES annotation ... but I think we should still aim for consistency between other software approaches? The square brackets in SMILES capture different/additional information in a different way that is not relevant to us.
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
| SMILES | Name |
|---|---|
| [12C] | carbon-12 |
| [13C] | carbon-13 |
| [C] | carbon (unspecified mass) |
| [13CH4] | C-13 methane |
from cdkr.
@schymane re the CDK definition of major isotope - I unfortunately don't know why it was chosen. The Javadocs indicate it was written by Chris Steinbeck, so I guess he could shed some light.
Re the string representation - yes, the SMILES approach influenced me, but given that molecular formula strings are not SMILES, I don't think we have to stick to that, and should rather go with the more accepted representation used by this community.
from cdkr.
Interestingly, looking at the Java sources suggests that the major isotope is simply the most abundant isotope for any element (and no consideration is made for atomic numbers < 60 or > 60).
from cdkr.
The problem is that "major" isotope loses some of its meaning when the percentage drops below 50%. Take molybdenum for instance: https://en.wikipedia.org/wiki/Isotopes_of_molybdenum .
I suggest that you have two interpretations of mass based on the definitions of "nominal mass" and "most abundant mass".
Definition of "nominal mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Nominal_mass
This would be in contrast to the definition of "most abundant mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Most_abundant_mass
from cdkr.
So currently, CDK's `getMajorIsotope` corresponds to the "most abundant mass" definition. I guess we'd have to add an annotation for stable isotopes to be able to return the "nominal mass" result.
For the case of Mo, it seems that the nominal and most abundant masses correspond to the same isotope?
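For a single element the two definitions pick the same isotope, but for a whole molecule they can diverge. A small base R sketch using the dibromo compound from the start of this thread: it assumes approximate natural abundances for bromine (79Br about 50.69%, 81Br about 49.31%) and ignores the minor isotopes of C, H and O, so it is an illustration rather than a full isotope pattern calculation.

```r
# Approximate natural abundances of bromine (assumed, rounded values)
br_abundance <- c("79Br" = 0.5069, "81Br" = 0.4931)
n_br <- 2                                   # C7H4Br2O3 contains two Br atoms

# Probability of 0, 1 or 2 atoms being 81Br (binomial distribution)
n_heavy <- 0:n_br
prob <- dbinom(n_heavy, size = n_br, prob = br_abundance[["81Br"]])
names(prob) <- paste0("[79Br]", n_br - n_heavy, "[81Br]", n_heavy)

# The nominal / monoisotopic choice takes the major isotope of each element
monoisotopic <- names(prob)[1]
# The most abundant isotopologue is whichever combination is most probable
most_abundant <- names(prob)[which.max(prob)]

print(round(prob, 4))
print(c(monoisotopic = monoisotopic, most_abundant = most_abundant))
```

Under these assumptions the mixed 79Br/81Br species is the most probable one, even though the per-element major isotope would select two 79Br atoms, which is why the two mass definitions need to be kept apart for formulae.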
from cdkr.