massbank / massbank-data

Official repository of open data MassBank records

massbank mass-spectrometry repository library

massbank-data's Introduction

MassBank-data introduction

This repo contains all MassBank records and uses GitHub Actions to validate the content of all records with the Validator from MassBank-web.

Documentation can be found at https://massbank.github.io/MassBank-documentation.

massbank-data's Issues

Error in DTXSID mapping?

Bug report from external user:

First of all, thank you for the massive effort in developing and maintaining MassBank! I was very pleased to see in the News that all the records were linked to Comptox (if registered), so I gave it a go: the first record I randomly tested was MSJ01067 (Acetamiprid; GC-EI-Q; MS; Positive; M+), I clicked the Comptox link (DTXSID60861331) and...the substance ID does not exist - Acetamiprid ID is DTXSID0034300.

I therefore tested many other records which were all ok, so I assume that I was really unlucky (or an excellent proof-reader) :-)

I don't know if it's an isolated case, but give it a check.

Follow-up:
Indeed that DTXSID doesn't appear to exist in the public Dashboard, nor do I get a match for that InChIKey. If this is a name match, it's wrong ...
https://massbank.eu/MassBank/RecordDisplay.jsp?id=MSJ01067

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID60861331

https://comptox.epa.gov/dashboard/dsstoxdb/results?search=WCXDHFDTOYPNIE-UHFFFAOYSA-N

This is the correct match:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID0034300
and is also found by name:
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=Acetamiprid

Any ideas what went wrong here @meier-rene @ChemConnector ?
PubChem link looks fine
https://pubchem.ncbi.nlm.nih.gov/compound/213021

Upload and link raw mass spectral data

#MetSoc2019, Towards FAIR Spectral Libraries workshop. There is a request to make the raw mass spectral files associated with MassBank records publicly available via any repository. They should be in the vendor's format, not mzML.

Webhook

The people from MoNA asked for a webhook a while ago. We should add one to make it easier to announce changes in the records.

Problem with command line validator

Hi, my validator does not work correctly. There is maybe a problem with the illegal method?

root@massbank2:/MassBank-data# mvn -q -f .scripts/MassBank-web/MassBank-Project/MassBank-lib/pom.xml install
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
root@massbank2:/MassBank-data# mc
MobaXterm X11 proxy: Unsupported authorisation protocol

root@massbank2:/MassBank-data# mc
MobaXterm X11 proxy: Unsupported authorisation protocol

root@massbank2:/MassBank-data#
root@massbank2:/MassBank-data# ./.scripts/validate.sh ./
Validating 43 files
Exception in thread "main" java.io.IOException: File 'AAFC' exists but is a directory
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:291)
at org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1805)
at massbank.Validator.main(Validator.java:230)
root@massbank2:/MassBank-data# ./.scripts/validate.sh /MassBank-data/
Validating 43 files
Exception in thread "main" java.io.IOException: File 'AAFC' exists but is a directory
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:291)
at org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1805)
at massbank.Validator.main(Validator.java:230)
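
The trace suggests that the directory name `AAFC` reached a step that expects a record file. A minimal Python sketch of expanding directories into record files before validation follows. It assumes records are `.txt` files; `collect_record_files` is a hypothetical helper and not part of `validate.sh` or the Validator.

```python
from pathlib import Path


def collect_record_files(paths):
    """Expand a mix of files and directories into a sorted list of .txt record files."""
    records = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Descend into the directory instead of trying to read it as a file.
            records.extend(p for p in path.rglob("*.txt") if p.is_file())
        elif path.is_file() and path.suffix == ".txt":
            records.append(path)
    return sorted(set(records))
```

Note that `rglob` also descends into hidden directories such as `.scripts`, so a real script would probably want to exclude those.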

UF40260X are simazine not desethylterbutylazine

UF402601, UF402602, UF402603 and UF402604 are simazine and not desethylterbutylazine, the SPLASHes are identical with the simazine spectra and Martin Krauss has confirmed the retention times also match simazine and not desethylterbutylazine. All (simazine and desethylterbutylazine) spectra were flagged as simazine by Herbert Oberacher.
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF402603

@meier-rene can you update the compound information (CH$ fields) of the UF40260X records with the compound information from here:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF404103

Pls let me know if you need more information, or if you want me to do the updates instead, thanks!

add ChemSpider IDs with new API on MassBank-data side

With new API and token rules, it will be easier to add ChemSpider IDs to new records on the MassBank-data side rather than via RMassBank. Posting this as a result of discussions on MassBank/RMassBank#192

Note from Dave at RSC:
I believe that there is a way to ensure that Travis has access to an API token but it is encrypted so that the token is only accessible to the test runner: https://stackoverflow.com/questions/9338428/using-secret-api-keys-on-travis-ci

Incorrect permissions in validator

The validator cannot run some scripts, so either the permissions in the script folder must be set or the script should be run with sudo.

Contributors table is inconsistent

Will be fixed with registration of VETIST

Strange naming in old records

I'm not sure how this got through the validator but it doesn't look like these meet the Record Format requirements? @meier-rene can you look into this? Thanks!

Multiple names in the title field, including names that are clearly wrong (e.g. including the metal salt - sodium and lithium)

This was the query I ran:
https://massbank.eu/MassBank/Result.jsp?compound=&op1=and&mz=&tol=0.3&op2=and&formula=C4H6O3&type=quick&searchType=keyword&sortKey=not&sortAction=1&pageNo=1&exec=&inst_grp=ESI&inst=CE-ESI-TOF&inst=ESI-ITFT&inst=ESI-ITTOF&inst=ESI-QIT&inst=ESI-QTOF&inst=ESI-TOF&inst=LC-ESI-IT&inst=LC-ESI-ITFT&inst=LC-ESI-ITTOF&inst=LC-ESI-Q&inst=LC-ESI-QFT&inst=LC-ESI-QIT&inst=LC-ESI-QQ&inst=LC-ESI-QQQ&inst=LC-ESI-QTOF&inst=LC-ESI-TOF&ms=MS2&ion=0

Spectrum search

Would it be possible to search the database using the m/z vs. absolute intensity format? When I copy/paste a spectrum directly from MassLynx, it gives me the m/z intensity format, which the database apparently does not support.

ISAS_Dortmund records derive server error

@meier-rene

Type Exception Report

Message java.lang.ArrayIndexOutOfBoundsException

Description The server encountered an unexpected condition that prevented it from fulfilling the request.

Exception

org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:598)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:514)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)

Root Cause

java.lang.ArrayIndexOutOfBoundsException

Note The full stack trace of the root cause is available in the server logs.

MassBank Accession / InChIKey CC0 dump for Wikidata

@meier-rene can we get a CC0 dump of MassBank Accession IDs with InChIKey mappings for @egonw to add to WikiData? He's registering this property now ;-)
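
A dump of this kind could be sketched in Python as follows. The sketch assumes each record carries an `ACCESSION:` line and a `CH$LINK: INCHIKEY <key>` line (analogous to the `CH$LINK: COMPTOX` entry in the Record Format); the function names and the TSV layout are illustrative only.

```python
import re

ACCESSION_RE = re.compile(r"^ACCESSION:\s*(\S+)", re.MULTILINE)
INCHIKEY_RE = re.compile(r"^CH\$LINK:\s*INCHIKEY\s+(\S+)", re.MULTILINE)


def accession_inchikey_rows(record_texts):
    """Yield (accession, inchikey) pairs for records that carry both fields."""
    for text in record_texts:
        accession = ACCESSION_RE.search(text)
        inchikey = INCHIKEY_RE.search(text)
        if accession and inchikey:
            yield accession.group(1), inchikey.group(1)


def to_tsv(rows):
    """Render the pairs as tab-separated lines with a header row."""
    lines = ["accession\tinchikey"]
    lines.extend(f"{acc}\t{key}" for acc, key in rows)
    return "\n".join(lines) + "\n"
```

Records without an InChIKey are skipped silently here; a real dump might want to report them instead.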

UF41490X spectra are not 17beta estradiol

The UF414901-04 spectra are not 17-beta-estradiol, as this does not ionise with the method used. Confirmed by Martin Krauss; flagged by Herbert Oberacher as potentially being 19-norandrostenedione. @tsufz can you confirm what compound information should be associated with this record, or indicate whether @meier-rene should deprecate?
The 03 and 04 records in particular appear to be good-quality spectra; 01 and 02 are close to noise levels.
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF414904

Missing CONFIDENCE value

There are 10 records having the tag
COMMENT: CONFIDENCE
without any confidence value. I think this should not be valid, so please create a rule for the validator and correct the confidence values.

This applies to:
AAFC/AC000433
AAFC/AC000427
AAFC/AC000428
AAFC/AC000432
AAFC/AC000429
AAFC/AC000425
AAFC/AC000431
AAFC/AC000430
AAFC/AC000434
AAFC/AC000426
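
The requested rule could look roughly like the following Python sketch. It is not the Validator's actual implementation (the Validator is written in Java); it only illustrates the check, assuming the value follows the tag on the same line.

```python
import re

# Matches a COMMENT: CONFIDENCE line and captures whatever follows the tag.
CONFIDENCE_RE = re.compile(r"^COMMENT:\s*CONFIDENCE\b[ \t]*(.*)$", re.MULTILINE)


def has_empty_confidence(record_text):
    """Return True if any COMMENT: CONFIDENCE line carries no value."""
    return any(not m.group(1).strip() for m in CONFIDENCE_RE.finditer(record_text))
```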

mzML as export format for MassBank-data

Hi, the OpenMS team has a prototype for conversion of MassBank records to a single mzML file
in https://github.com/OpenMS/MassBankUpdate/

This can now be done with eba8b81d5819e996a4f115be45b127271bb4e6fa
in https://github.com/sneumann/MassBankUpdate/ and eventually https://github.com/OpenMS/MassBankUpdate

Yours, Steffen

New accession specification and contributor naming guidelines -- DISCUSSION

I would like to develop some guidelines for new contributors on how to name their accessions and how and when to create new directories. This has become urgent due to some email discussion about new contributors and, in particular, some new contributions, like #82.

There are different demands, for which we need to find a compromise.

I, as a maintainer of the whole project, would like to keep the data compact and uncluttered. Some directories are desirable, but not too many. Technically, we only support one level of directories at the moment.

There are also demands from contributors. They would like to separate their contributions by contributing group, but sometimes also by the specific project that supported the creation of the records. I expect that an entry in the COMMENT section does not suffice. For them it is most likely also a matter of public image. Sometimes this separation is not an issue, because there is just one contributing person/group at a particular institution. In other cases more "separation" or "distinguishability" is desired. You all know that everyone has to justify his/her projects somehow...

Possible solutions (technical view):
- Easiest way for contributors: allow subdirectories in the institution directories. This creates major headaches on my side, because it would mean a lot of adjustments to the codebase.
- Use a directory naming scheme like the one that is already in the current data and resulted in this discussion issue. Examples of the scheme could be RIKEN_IMS, RIKEN_NPDepo... This is an easy solution because it works right now. The only drawback is the increasing number of directories, which makes the project view a bit more confusing.
- Ease the requirement on accession naming. That's easy to implement but might not provide sufficient "distinguishability".

Besides directory naming we also have the question of accession naming.

Standardize voltages

Mix of voltage representations (2000 V, 2 kV). I suggest standardising from V to kV:

2000 V -> 2 kV
160 V -> 0.16 kV
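
The conversion could be sketched in Python as follows. The sketch assumes values are written as a number followed by `V`; `volts_to_kilovolts` is a hypothetical helper. `Decimal` is used so that 160 V becomes exactly 0.16 kV without floating-point artefacts.

```python
import re
from decimal import Decimal

# A number followed by a bare "V"; "kV" and "eV" do not match.
VOLT_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*V\b")


def volts_to_kilovolts(value):
    """Rewrite values such as '2000 V' as '2 kV'; values already in kV are left alone."""
    def convert(match):
        kilovolts = Decimal(match.group(1)) / Decimal(1000)
        # format(..., "f") avoids scientific notation; then trim trailing zeros.
        text = format(kilovolts, "f")
        if "." in text:
            text = text.rstrip("0").rstrip(".")
        return f"{text} kV"
    return VOLT_RE.sub(convert, value)
```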

UF41570X should be Cortisone

The UF415701-04 records should be Cortisone and not Prednisolone:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF415701
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF415702
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF415703
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF415704

@meier-rene can you update these records with the compound information from https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF415401 ?
Again, the analytical information is correct, just the CH$ entries are for the incorrect compound. The SPLASHes are identical for UF415701 and UF415401.

.mlb export

Dear MassBank team,
I would very much like to use the MassBank data in .mlb format so that I can use Bruker's Library Editor to review and manipulate the library. In the next step I then would like to use sub-libraries of compounds that are more likely to be contained in my samples when I process my data using Bruker's DataAnalysis and Metaboscape.
Thanks for considering my request.
Best regards,
Joerg

Relicensing Waters records

According to Mr. Katsutoshi Nagase, Waters Japan, Tokyo, Japan,
all of the 2,992 Waters_Japan records currently on
https://massbank.eu/MassBank/jsp/Result.jsp?type=rcdidx&idxtype=site&srchkey=14&sortKey=name&sortAction=1&pageNo=1&exec=
can be re-licensed under the CC License ("CC BY-NC") and integrated into MassBank-data.
Note the -NC clause of the re-licensing. Yours, Steffen

JEOL_Ltd/JEL00034.txt contact wanted

Dear all,
after implementing an automatic test to check for problems similar to #13, I found a problem with JEOL_Ltd/JEL00034.txt. The peak list might contain two different spectra.

PK$PEAK: m/z int. rel.int.
  14.10826 36475 13
  15.09583 222688 81
  26.10593 73559 27
  27.11639 449105 163
  28.11312 374331 136
  29.13347 788534 286
  29.80832 47011 17
  30.12487 59097 21
  38.12244 31758 12
  39.13071 316878 115
  40.12874 84849 31
  41.14179 616729 224
  42.13915 415303 151
  43.1387 708455 257
  44.14659 172260 63
  45.08199 202944 74
  46.0992 77463 28
  47.10616 159675 58
  53.11076 111360 40
  54.13147 37750 14
  55.14271 199477 72
  56.13508 95786 35
  57.17133 494598 180
  58.15374 126469 46
  59.12688 177267 64
  67.13793 60353 22
  68.13357 792919 288
  69.15435 126760 46
  70.14899 84057 31
  71.17394 238975 87
  72.14914 30018 11
  73.25028 69003 25
  74.11656 271765 99
  83.17146 123791 45
  85.10854 174341 63
  91.13495 34416 12
  93.12413 60088 22
  95.15051 40090 15
  96.16142 183119 66
  97.18768 36897 13
  99.11687 79397 29
  105.66012 30252 11
  110.13925 82023 30
  111.16113 52938 19
  113.15047 55551 20
  113.67169 175718 64
  116.12024 66879 24
  123.15651 56371 20
  127.08559 60564 22
  136.15661 36760 13
  138.14722 298408 108
  139.15477 141469 51
  140.13857 35555 13
  142.10486 66036 24
  143.10454 32046 12
  143.73454 31880 12
  144.13416 35561 13
  155.11388 52262 19
  156.11776 46508 17
  157.10419 77936 28
  158.10339 42136 15
  166.17478 32350 12
  170.10137 271991 99
  171.10492 79761 29
  180.17858 45502 17
  184.11898 161572 59
  185.12064 285129 104
  186.11901 2751962 999
  195.1689 202761 74
  198.09722 94862 34
  200.03948 47050 17
  210.11173 60566 22
  212.08434 73157 27
  224.08631 45595 17
  226.07326 619225 225
  227.09087 87787 32              <-last line of first spectrum
  129.02778 18900 7                <-first line of second spectrum
  130.11347 42343 15
  136.1427 26075 9
  137.10019 25130 9
  138.11953 677826 246
  139.12352 80523 29
  140.11027 61803 22
  141.08029 95339 35
  142.08115 73738 27
  144.09759 159144 58
  149.20379 26721 10
  150.03273 24138 9
  152.1384 108578 39
  153.02915 38416 14
  154.01107 18706 7
  155.08419 422515 153
  156.11311 78871 29
  157.07738 86180 31
  164.15452 56191 20
  165.102 19225 7
  166.13771 231682 84
  167.14993 933539 339
  168.13903 29477 11
  169.04063 34261 12
  170.06958 564067 205
  171.04781 136229 49
  179.9786 14559 5
  182.03469 57553 21
  183.02114 26768 10
  184.0618 224532 82
  185.06897 394963 143
  186.03256 113767 41
  189.03165 18790 7
  196.06806 44105 16
  198.04149 1279391 464
  199.05639 342464 124
//

If I compare this peak list with JEOL_Ltd/JEL00033.txt, I find essentially the first spectrum (one peak is missing). That's why I suppose that this error is a copy-and-paste artifact and only the second spectrum is valid for JEOL_Ltd/JEL00034.txt.
Is there any way to contact the owner of these records?
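
The automatic test is not shown in the issue. One way such a check might work is to split the peak list wherever m/z decreases, as in this Python sketch. It assumes peaks are expected in ascending m/z order; `split_at_mz_drops` is a hypothetical name.

```python
def split_at_mz_drops(peaks):
    """Split a list of (mz, intensity, rel_intensity) tuples wherever m/z decreases."""
    segments = []
    current = []
    previous_mz = None
    for peak in peaks:
        mz = peak[0]
        if previous_mz is not None and mz < previous_mz:
            # m/z went down: treat this as the start of another spectrum.
            segments.append(current)
            current = []
        current.append(peak)
        previous_mz = mz
    if current:
        segments.append(current)
    return segments
```

A record whose peak list yields more than one segment would then be flagged for manual review.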

CASMI2016 records with compound/spectrum mismatch

User reported that SM858902 and SM858951 contain spectral data from acetylsulfamethoxazole but are labeled diphenhydramine (thank you!). Upon closer inspection we seem to have had an ID/Precursor&peaks mismatch for 3 IDs / 4 records in a series, surrounded by records that look OK; series "broken" due to missing IDs in the middle. We also need to find the cause in https://github.com/MassBank/RMassBank

This should not be passing any form of validation; a screening of the entire CASMI2016 database would be extremely useful for debugging the cause and flagging how and how many records to fix, thank you @meier-rene in advance if you can :-)

From what I can see:
**this one looks OK.
ACCESSION: SM858203
RECORD_TITLE: Cetirizine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C21H25ClN2O3
CH$EXACT_MASS: 388.15537
MS$FOCUSED_ION: PRECURSOR_M/Z 389.1626
389.1626 C21H26ClN2O3+ 1 389.1626 -0.05

**this one looks OK.
ACCESSION: SM858353
RECORD_TITLE: 2-Hydroxycarbamazepine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C15H12N2O2
CH$EXACT_MASS: 252.08988
MS$FOCUSED_ION: PRECURSOR_M/Z 251.0826
251.0827 C15H11N2O2- 1 251.0826 0.4

[no records with IDs between 8583 and 8588]

** here something has gone wrong
ACCESSION: SM858801
RECORD_TITLE: Finasteride; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C23H36N2O2
CH$EXACT_MASS: 372.27768
MS$FOCUSED_ION: PRECURSOR_M/Z 256.1696

** here something has gone wrong
ACCESSION: SM858902
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 296.07

** still wrong ... it's using the same (wrong) exact mass to get equivalent wrong precursor
ACCESSION: SM858951
RECORD_TITLE: Diphenhydramine; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M-H]-
CH$FORMULA: C17H21NO
CH$EXACT_MASS: 255.16231
MS$FOCUSED_ION: PRECURSOR_M/Z 294.0554

** still wrong:
ACCESSION: SM859002
RECORD_TITLE: Acetyl-sulfamethoxazole; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C12H13N3O4S
CH$EXACT_MASS: 295.06268
MS$FOCUSED_ION: PRECURSOR_M/Z 325.1711
325.171 C20H22FN2O+ 1 325.1711 -0.17 <= we have F annotations!!!!!

[no 8591]

** and now everything seems OK again ...
ACCESSION: SM859203
RECORD_TITLE: Amitriptyline; LC-ESI-QFT; MS2; CE: 35 NCE; R=35000; [M+H]+
CH$FORMULA: C20H23N
CH$EXACT_MASS: 277.18305
MS$FOCUSED_ION: PRECURSOR_M/Z 278.1903
278.1904 C20H24N+ 1 278.1903 0.42
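
A screening of the kind requested could start from a simple consistency check between `CH$EXACT_MASS`, the adduct in the record title, and `PRECURSOR_M/Z`. The Python sketch below covers only the two adducts quoted above. The tolerance of 0.01 is an arbitrary assumption, and `precursor_matches` is a hypothetical helper, not part of RMassBank or the Validator.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da

# Only the two adducts that occur in the records quoted above are covered.
ADDUCT_SHIFT = {"[M+H]+": PROTON_MASS, "[M-H]-": -PROTON_MASS}


def precursor_matches(exact_mass, adduct, precursor_mz, tolerance=0.01):
    """Return True if precursor_mz agrees with exact_mass shifted by the adduct."""
    expected = exact_mass + ADDUCT_SHIFT[adduct]
    return abs(expected - precursor_mz) <= tolerance
```

Applied to the values above, SM858203, SM858353 and SM859203 pass, while SM858801, SM858902, SM858951 and SM859002 fail. Under the same assumptions, the precursor of SM858801 (256.1696) agrees with the exact mass listed for SM858902 (255.16231), and the precursors of SM858902 and SM858951 agree with the exact mass listed for SM859002 (295.06268). That would be consistent with the precursor and peak data being shifted against the compound information.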

GitHub web upload seems to choke at 100

Following instructions from @sneumann:
I just realised that there is a simpler way to upload files if the directory already exists. In that case, You need 1) and 2), but 3)-7) can be replaced by going to https://github.com/schymane/MassBank-data/upload/master/UniLu
and using the "Upload files" button :-)

I get:
Yowza, that’s a lot of files. Try again with fewer than 100 files.

Since I have 950 files and don't (yet) fancy doing 9.5 commits ... I am trying another way!

Missing or mismatching MassBank directories

I'm trying to reconcile what I see on massbank.eu and massbank.jp with the directories in this repo. The following seem to be missing (num. records MBEU / num. records MBJP):
AAFC (292/0)
CASMI2016 (622/622)
Env Anal Chem, U Tuebingen (119/119)
European MassBank (1/0) <= not a major concern as this is a dummy entry
UPAO (12/12)

The following are present in this repo but not online:
CASMI_2012

The following folders are in the OpenData here: http://www.massbank.jp/SVN/OpenData/record/
CASMI_2016, Env_Anal_Chem_U_Tuebingen, UPAO

I cannot see AAFC there.

Check Creatinine records

There appear to be some creatinine records that are noise only or close to it; we should also consider removing these poor records.
This one has good intensity, few peaks but looks fine, I've used it to compare:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=SM868004&dsn=CASMI_2016
PK$PEAK: m/z int. rel.int.
57.0575 160462.5 2
58.0653 4760539.5 69
70.0653 149661.5 2
72.0445 776245.8 11
86.0713 4838806 71
114.0662 67953792 999

These ones are very low intensity: two have peaks that are clearly noise only, and two have peaks that are in the peak list above but are still close to noise and are missing other main peaks. I recommend removing all four UF records ... again, these failed QC by Herbert Oberacher.

https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF412504&dsn=UFZ
one or two genuine peaks, rest noise, low I
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF412501&dsn=UFZ
noise only?
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF412503&dsn=UFZ
one or two genuine peaks, rest noise, low I
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF412502&dsn=UFZ
noise only?

@meier-rene @tsufz

Inconsistency in data for records on MassBank vs MoNA

I am comparing the MoNA record at http://mona.fiehnlab.ucdavis.edu/spectra/display/BSU00002 with the MassBank record at https://massbank.eu/MassBank/RecordDisplay.jsp?id=BSU00002

I see stereochem in the structure depiction on MoNA but not in the MassBank record. I assume that InChIs are the basis of the stereo on MoNA but the SMILES has no stereochem on MassBank. The inconsistency is confusing. Is there a StereoSMILES in MassBank that is not displayed?

Happy release!

Thanks for your efforts @meier-rene. However, we need a release policy. Changes and updates to the data occur from time to time rather than frequently. Often an upload is also related to reporting or the publication of a paper.

Hence, a fixed release frequency is not an appropriate way to go. I suggest a very open release policy, meaning that we release on request or when a larger set of new spectra is uploaded. I would expect that, as a contributor, I want to see my spectra online ASAP and don't want to wait for weeks (as in paper publishing....).

UFZ additional specs validation: Incorrect number of peaks in peaklist

Hi,
currently the last directory failing is
https://travis-ci.org/MassBank/MassBank-data/jobs/368060988

Incorrect number of peaks in peaklist. 16 peaks are declared in PK$NUM_PEAK line, but 13 peaks are found.
PK$PEAK: m/z int. rel.int.
^
Error in UA008702.txt
Incorrect number of peaks in peaklist. 13 peaks are declared in PK$NUM_PEAK line, but 11 peaks are found.
PK$PEAK: m/z int. rel.int.
^
Error in UA008703.txt
Incorrect number of peaks in peaklist. 8 peaks are declared in PK$NUM_PEAK line, but 7 peaks are found.
PK$PEAK: m/z int. rel.int.
^
Error in UA008704.txt

for e.g. https://github.com/MassBank/MassBank-data/blob/master/UFZ_Additional_Specs/UA008704.txt

Yours,
Steffen
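
For checking records locally before pushing, the comparison behind this error message can be approximated in Python. This is an illustrative sketch, not the Validator's code. It assumes that peak lines follow the `PK$PEAK:` header and that the record ends with `//`.

```python
def declared_and_found_peaks(record_text):
    """Return (declared, found) peak counts, or None if either part is missing."""
    declared = None
    found = None
    for line in record_text.splitlines():
        if line.startswith("PK$NUM_PEAK:"):
            declared = int(line.split(":", 1)[1].strip())
        elif line.startswith("PK$PEAK:"):
            found = 0  # peak lines follow this header
        elif line.startswith("//"):
            break  # end-of-record marker
        elif found is not None and line.strip():
            found += 1
    if declared is None or found is None:
        return None
    return declared, found
```

A record is consistent when both numbers in the returned pair are equal.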

Addition of ion mobility information

How can ion mobility data be added to MassBank records?

Need of AC$ION_MOBILITY with subtags

Discussion with @sneumann, @tsufz, @schymane in Dagstuhl January 2020?

Improve release cycle

Here are the notes and to-dos from our web meeting with @laurentheirendt

make a dev branch
improve contribution guidelines (checkout the dev branch, ...)
push / merge against the dev branch
merge with master when all tests succeed
ARTENOLIS could be an alternative to Jenkins CI: https://arxiv.org/pdf/1712.05236.pdf https://prince.lcsb.uni.lu/jenkins/
hands on:
- create branch dev and push to it
- GitHub Branch properties for master, dev: require status checks to pass before merging; include administrators
- Open pull request > edit > branch is changeable; pass tests; merge
- tag the 'merge into master' with a release version and comments of the changes

Waters records

Hi, the records of Nihon Waters are not covered by an open access license. I suggest removing them from the repository.

CC-BY => CC0 in MassBank-data

Suggestion by @egonw to change CC-BY to CC0 by default, as this is the more applicable and most flexible license. Agree?

InChIKey and matching DTXSID dump for MassBank

@meier-rene are you able to produce a dump file with all InChIKeys in MassBank and, where they have them, the corresponding DTXSIDs? I need all the InChIKeys for one file, and all the DTXSIDs for another.
I've browsed and found several variants of such files, but not one containing exactly this information paired. If you have one already that I missed, please point me to it ;-)
Thanks!

Create file for PubChem deposition at every release

It would be great if we could auto-create a file to deposit in PubChem with every stable release of MassBank-data.
To discuss: compound information only (=> relatively easy), mappings with spectral IDs (slightly more info needed), or actual spectra as well (more work on our side).
Shall we start with getting a deposit file for compound information only? Then we need e.g.:

PUBCHEM_EXT_DATASOURCE_REGID <= InChIKey, or any unique identifier our side
PUBCHEM_EXT_DATASOURCE_SMILES <= SMILES
PUBCHEM_EXT_DATASOURCE_CID <= PubChem CID (if available)
PUBCHEM_SUBSTANCE_COMMENT <= here we could e.g. provide accession IDs, collapsed
PUBCHEM_SUBSTANCE_SYNONYM <= any names our side (can have multiple columns, but maybe e.g. max 3 would be sensible)

@meier-rene @sneumann @tsufz what do you think?
If yes, who will look after the file?
I would contact PubChem to get us a MassBank login for deposition, so credit goes to MassBank(EU) and we can track our submissions.
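
A compound-only file with the columns listed above could be generated along these lines. The CSV layout, the input dictionary keys (`inchikey`, `smiles`, `cid`, `accessions`, `name`) and the function name are assumptions made for illustration; the format PubChem actually accepts would need to be confirmed with them.

```python
import csv
import io

COLUMNS = [
    "PUBCHEM_EXT_DATASOURCE_REGID",
    "PUBCHEM_EXT_DATASOURCE_SMILES",
    "PUBCHEM_EXT_DATASOURCE_CID",
    "PUBCHEM_SUBSTANCE_COMMENT",
    "PUBCHEM_SUBSTANCE_SYNONYM",
]


def deposition_csv(compounds):
    """Build CSV text with one row per compound; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for compound in compounds:
        writer.writerow({
            "PUBCHEM_EXT_DATASOURCE_REGID": compound["inchikey"],
            "PUBCHEM_EXT_DATASOURCE_SMILES": compound.get("smiles", ""),
            "PUBCHEM_EXT_DATASOURCE_CID": compound.get("cid", ""),
            # Accession IDs collapsed into a single comment cell.
            "PUBCHEM_SUBSTANCE_COMMENT": "; ".join(compound.get("accessions", [])),
            "PUBCHEM_SUBSTANCE_SYNONYM": compound.get("name", ""),
        })
    return buffer.getvalue()
```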

sqlite as export format for MassBank-data

Hi,
@Tomnl has updated his code to convert MassBank records to a sqlite database:

I have tidied up the MSP to SQLite python code and included it as
separate python package maintained in pip, see docs https://msp2db.readthedocs.io/en/latest/
and code https://github.com/computational-metabolomics/msp2db
The code can be used as CLI or API to create an SQLite database from MSP files
By default, it can work with either the MSP format found in the MassBank
GitHub repository or the one from MoNA. You just need to assign either "massbank" or
"mona" to the schema parameter.
...
I have updated the documentation https://bioconductor.org/packages/devel/bioc/vignettes/msPurity/inst/doc/msPurity-spectral-matching-vignette.html
for msPurity in Bioconductor (development branch)
Includes reference to msp2db documentation and details the databases
in more detail
I have created SQLite databases locally from MassBank and MoNA
I am in the process of getting a suitably sized updated SQLite file
for msPurityData Bioconductor data package.
Please let me know if you have any questions. And I will keep you
informed of any other developments.

It would be great to distribute snapshots of MassBank-data in such a format.
Yours, Steffen

External report: issues with conflicting stereochemistry in identifiers

Copy-paste from email received; @meier-rene are you able to follow-up? Thx!

Comparing data from different databases, I found some discrepancies in your data. For the mentioned entry of your database (https://massbank.eu/MassBank/RecordDisplay.jsp?id=OUF00136), the chemical structure indicates that the configuration of the double bond is not defined. This configuration is defined in other databases as InChIKey CWVRJTMFETXNAD-NCZKRNLISA-N:

See:

PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/9476
ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:95271
ChEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3186431/
EPA: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID3024786

Could you please check whether the definition of your entry is correct, and whether the chemical structure is the correct one or the structural identifiers are wrong?

The problem is the same for other entries like FIO00619, JP000136, FIO00623... where the chemical structure is not correct compared to the stereoconfiguration at the origin of InChIKey CWVRJTMFETXNAD-JUHZACGLSA-N. This InChIKey requires the definition of the 4 chiral carbons on the ring. Please see:

ChEBI: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16112
CHEMBL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL284616/

Incorrect m/z values

Reporting upstream as suggested in
https://bitbucket.org/fiehnlab/mona/issues/209/incorrect-m-z-values

Apparently, multiple MoNA GC/MS spectra have incorrect entries in MASS SPECTRAL PEAKS.

To reproduce:
http://mona.fiehnlab.ucdavis.edu/spectra/display/JP011674

Spectrum window displays multiple peaks with m/z > 1000 Da, which is definitely out-of-range (molecule mass is 436.2438).

Closer examination reveals that MASS SPECTRAL PEAKS contains multiple entries which are out of order, and exceed the previous entry by factor of >10:

Luckily, all these should be easy to fix, by removing the extra trailing zeroes.
(Were these markers of some sort the operator failed to remove before submitting, or OCR bugs?)

The attached list.txt contains all automatically flagged suspicious records; more detailed information is available in log.txt.

Please note that there are some false positives due to rare strange ordering of MASS SPECTRAL PEAKS items:
http://mona.fiehnlab.ucdavis.edu/spectra/display/JP011672
(what natsort?!)

However, only <400 records have non-conventional ordering, so it might be feasible to review them all.

[SOLVED] RIKEN_NPDepo - wrong usage of comment field for synonyms

@zzjl20, I found that synonyms are annotated as COMMENT: Synonyms:, for example in
https://massbank.eu/MassBank/RecordDisplay.jsp?id=NGA00625&dsn=RIKEN_NPDepo.

In accordance with our specification, synonyms should be annotated in CH$NAME with one entry per synonym. The first CH$NAME entry should be the preferred name.

For example:
CH$NAME: (S)-Luteanine
CH$NAME: Artabotrine
CH$NAME: Luteanine

The synonyms in COMMENT are not stored as chemical names, so a search for, e.g., Artabotrine returns no result.

May I ask you to edit the respective records and to resubmit them?

Thanks a lot and best wishes
Tobias

Update records to be compliant with Record Format 2.4

The following fields and entries do not comply with the latest Record Format and should be changed in the data for harmonisation.

These tags are used incorrectly, or their terms have changed over the last few years:

AC$MASS_SPECTROMETRY: FRAGMENTATION_METHOD -> AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE

AC$MASS_SPECTROMETRY: FRAGMENT_VOLTAGE -> AC$MASS_SPECTROMETRY: IONIZATION_ENERGY

AC$MASS_SPECTROMETRY: IONIZATION_POTENTIAL -> AC$MASS_SPECTROMETRY: IONIZATION_ENERGY

AC$MASS_SPECTROMETRY: RESOLUTION_SETTING -> AC$MASS_SPECTROMETRY: RESOLUTION

AC$MASS_SPECTROMETRY: ION_SOURCE_TEMPERATURE -> AC$MASS_SPECTROMETRY: SOURCE_TEMPERATURE

AC$CHROMATOGRAPHY: CAPILLARY_VOLTAGE -> AC$MASS_SPECTROMETRY: CAPILLARY_VOLTAGE

AC$CHROMATOGRAPHY: INJECTION_TEMPERATURE -> AC$MASS_SPECTROMETRY: SOURCE_TEMPERATURE

AC$CHROMATOGRAPHY: RETENTION_INDEX -> AC$CHROMATOGRAPHY: KOVATS_RTI
I checked all entries; they are in the 1000s range, so they are very probably KOVATS_RTI values.

AC$CHROMATOGRAPHY: OVEN_TEMPERATURE -> AC$CHROMATOGRAPHY: COLUMN_TEMPERATURE_GRADIENT

MS$FOCUSED_ION: PRECURSOR_M/Z -> MS$FOCUSED_ION: PRECURSOR_MZ
@meowcat mentioned that we should avoid slashes in the tags.

This is just a typo:
AC$CHROMATOGRAPHY: TRANSFARLINE_TEMPERATURE -> AC$CHROMATOGRAPHY: TRANSFERLINE_TEMPERATURE

The following tags can be merged into AC$MASS_SPECTROMETRY: MASS_RANGE_MZ
AC$MASS_SPECTROMETRY: MASS_RANGE_M/Z
AC$MASS_SPECTROMETRY: SCAN_RANGE_M/Z
AC$MASS_SPECTROMETRY: SCANNING_RANGE
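
The renames listed above could be applied line by line with a lookup table, as in this Python sketch. The mapping is copied from the list; `rename_line` is a hypothetical helper and assumes the `TAG: SUBTAG value` layout shown in the records.

```python
MS = "AC$MASS_SPECTROMETRY"
LC = "AC$CHROMATOGRAPHY"

# (old tag, old subtag) -> (new tag, new subtag), copied from the list above.
RENAMES = {
    (MS, "FRAGMENTATION_METHOD"): (MS, "FRAGMENTATION_MODE"),
    (MS, "FRAGMENT_VOLTAGE"): (MS, "IONIZATION_ENERGY"),
    (MS, "IONIZATION_POTENTIAL"): (MS, "IONIZATION_ENERGY"),
    (MS, "RESOLUTION_SETTING"): (MS, "RESOLUTION"),
    (MS, "ION_SOURCE_TEMPERATURE"): (MS, "SOURCE_TEMPERATURE"),
    (LC, "CAPILLARY_VOLTAGE"): (MS, "CAPILLARY_VOLTAGE"),
    (LC, "INJECTION_TEMPERATURE"): (MS, "SOURCE_TEMPERATURE"),
    (LC, "RETENTION_INDEX"): (LC, "KOVATS_RTI"),
    (LC, "OVEN_TEMPERATURE"): (LC, "COLUMN_TEMPERATURE_GRADIENT"),
    ("MS$FOCUSED_ION", "PRECURSOR_M/Z"): ("MS$FOCUSED_ION", "PRECURSOR_MZ"),
    (LC, "TRANSFARLINE_TEMPERATURE"): (LC, "TRANSFERLINE_TEMPERATURE"),
    (MS, "MASS_RANGE_M/Z"): (MS, "MASS_RANGE_MZ"),
    (MS, "SCAN_RANGE_M/Z"): (MS, "MASS_RANGE_MZ"),
    (MS, "SCANNING_RANGE"): (MS, "MASS_RANGE_MZ"),
}


def rename_line(line):
    """Return the line with its tag and subtag renamed if they appear in RENAMES."""
    tag, sep, rest = line.partition(": ")
    if not sep:
        return line
    subtag, space, value = rest.partition(" ")
    new = RENAMES.get((tag, subtag))
    if new is None:
        return line
    return f"{new[0]}: {new[1]}{space}{value}"
```

Lines that move from AC$CHROMATOGRAPHY to AC$MASS_SPECTROMETRY may also need to be reordered within the record, which this sketch does not attempt.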

Switch bump-version.sh to one step validation with parallel Validator

Just a reminder.

GC-APCI-QTOF spectra to MassBank

Hi @meier-rene @tsufz,

There is a project in NORMAN joint program of activities to upload GC-APCI-QTOF mass spectra in MassBank. I prepared for you one massbank record (https://www.dropbox.com/s/93z76o9bx243lll/AU230117.txt?dl=0), so that you apply the needed modifications to MassBank (if any).

Let me know if all is okay with the sample record, so that I give the signal for production of GC-APCI-QTOF records.

Thanks!
Nikiforos

Duplicated records

I just stumbled over two records which seem to be duplicates. The metadata as well as the spectrum are exactly the same.
https://massbank.eu/MassBank/RecordDisplay.jsp?id=TY000228&dsn=Univ_Toyama
https://massbank.eu/MassBank/RecordDisplay.jsp?id=TY000237&dsn=Univ_Toyama
Maybe it is worth searching MassBank globally for such cases.
I guess we will have to contact the contributors in any case.

How to tackle this? I suggest to introduce a "DEPRECATED" tag for records which are duplicated (this issue) or noisy (e.g. #51) or otherwise erroneous (#9).
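
A global search for exact duplicates could be sketched in Python by hashing each record with its ACCESSION line left out. This only finds records that are otherwise byte-identical (ignoring trailing whitespace); records that differ in, for example, a date field would need a looser comparison. `find_duplicates` is a hypothetical helper.

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """Group accession IDs whose record text is identical apart from the ACCESSION line."""
    groups = defaultdict(list)
    for accession, text in records.items():
        body = "\n".join(
            line.rstrip() for line in text.splitlines()
            if not line.startswith("ACCESSION:")
        )
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(accession)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```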

BS003840/41/42/43/44/45 mix-up in InChI, SMILES, MolecularFormula and exact mass

The metadata of the Kaempferol-7-O-glucoside spectra in MassBank are not consistent. We have a mix-up of the protonated and deprotonated form in InChI, SMILES, MolecularFormula and exact mass.

Validator - false errors

Hi,
RMassBank, or one of the databases it uses, sometimes provides chemical names that differ only in capitalization. Because the validator is not case sensitive, it reports these names as duplicates. I think the validator should be less picky and should not complain about duplicates that differ only in case.

Alternatively, we need general rules about capitalisation, implemented in both the validator and RMassBank.

Yours,
Tobias
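One possible shape for such a general rule, sketched under assumptions: names that differ only in capitalization are collapsed before the record is written, keeping the first spelling seen. The function name `dedupe_names` is hypothetical and this is not how the validator or RMassBank currently behave.

```python
def dedupe_names(names):
    """Drop names that repeat an earlier name when compared case-insensitively."""
    seen = set()
    unique = []
    for name in names:
        key = name.casefold()  # case-insensitive comparison key
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique
```

Applied to the list of CH$NAME values of a record, this leaves one spelling per name, so a case-insensitive duplicate check would no longer fire.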

Chemspider URL InChI ACCESSION mapping

To update the links on ChemSpider to the new MassBank URL, and to include new records as well, we need a mapping file InChI <--> URL.
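A rough sketch of how such a mapping file could be generated from the record files. The assumptions here are that the InChI is taken from the CH$IUPAC line, that the record URL follows the RecordDisplay.jsp pattern used elsewhere in these issues, and that a tab-separated file is an acceptable format; the helper names are invented for this example.

```python
import csv
import io

# Assumed URL pattern, copied from record links quoted in other issues.
BASE_URL = "https://massbank.eu/MassBank/RecordDisplay.jsp?id="


def field(record_text, tag):
    """Return the value of the first line starting with 'TAG: ', or None."""
    prefix = tag + ": "
    for line in record_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def mapping_rows(records):
    """Build (InChI, URL) pairs from an iterable of record texts."""
    rows = []
    for text in records:
        accession = field(text, "ACCESSION")
        inchi = field(text, "CH$IUPAC")
        # Skip records without an accession or without a real InChI string.
        if accession and inchi and inchi.startswith("InChI="):
            rows.append((inchi, BASE_URL + accession))
    return rows


def write_tsv(rows):
    """Serialise the pairs as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["InChI", "URL"])
    writer.writerows(rows)
    return buf.getvalue()
```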

Remove three noisy Clarithromycin records

We currently have 28 Clarithromycin records and three records from UFZ appear to be noise only and I recommend we remove them:
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF408502&dsn=UFZ
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF408504&dsn=UFZ
https://massbank.eu/MassBank/RecordDisplay.jsp?id=UF408503&dsn=UFZ

All three have peaks only around 522, incredibly close to the noise level of the Orbi, and they fail quality control checks by Herbert Oberacher, also with other MassBank entries.

PK$PEAK: m/z int. rel.int.
522.4671 2511.9 254
522.9462 9873.2 999
522.9945 3734.4 377

@tsufz @meier-rene

Add DTXSIDs to all MassBank records with InChIKey match

@meier-rene @Treutler the EPA has set up a basic service that should allow retrieval of DTXSIDs by InChIKey. Can you look into implementing this on the database end, to add DTXSIDs to all records with matching entries for now? I will post a separate issue to get this into RMassBank and linked up in MassBank-web.
It's already in our Record format as
CH$LINK: COMPTOX DTXSID50274017
(https://github.com/MassBank/MassBank-web/blob/master/Documentation/MassBankRecordFormat.md)

https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.json?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N
https://actorws.epa.gov/actorws/chemIdentifier/v01/resolve.xml?identifier=IKHGUXGNUITLKF-UHFFFAOYSA-N

Please send any feedback on the service to @ChemConnector.

Thanks!

KO002066 and KO002063

External comment to massbankEU mail:
I think that the following two data are mistakenly replaced.
Could you confirm it? Thank you.

https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=KO002063&dsn=Keio
L-Aspartic acid; LC-ESI-QQ; MS2; CE:10 V; [M+H]+

https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=KO002066&dsn=Keio
L-Aspartic acid; LC-ESI-QQ; MS2; CE:40 V; [M+H]+

ES comment: I have asked for more detail; the CE10 spectrum looks more like CE40 and vice versa, but this is not entirely clear, especially with one high-mass peak (noise?).

Curation of entries related to mixtures

Some "compounds" we measure are not single compounds, but mixtures of isomers or similar compounds. Often, however, only the mixture is reported, for example Nystatin, which contains Nystatin A1, A2 and A3.

An example is https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=EQ314001&dsn=Eawag
The name is Nystatin (the mixture), but the compound shown is Nystatin A1, https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID80872323, related to https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID80872323#related-substances.

Proxy compounds are also used for the measurement (for example, in the case of surfactant mixtures with homologues, or nonylphenol).

However, from a pure data science / machine point of view this relation is wrong without additional information that the given compound is a proxy. PubChem also gives only the proxy; DTX is better, of course.

Therefore, we should implement a structure to handle this situation:

Add a mixture tag that includes a link to an external source (PubChem / DTX)
Curate records with mixtures so that they represent the correct compound used to generate the mass spectra (e.g. not Nystatin, but Nystatin A1)
Insert the mixture tag into those records

Implement a JSON representation of MassBank records

@hunter-moseley suggested implementing a JSON representation of the records to enhance machine readability and to account for some future transitions.
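To make the idea concrete, here is a minimal sketch of what a conversion could look like. It is not a proposed schema: the key names (`mz`, `intensity`, `rel_intensity`) are invented for illustration, every tag is stored as a list because tags such as CH$NAME can repeat, peak rows are assumed to be indented lines after PK$PEAK, and other multi-line blocks (e.g. annotation rows) are ignored.

```python
import json


def record_to_dict(record_text):
    """Parse a MassBank record text into a JSON-serialisable dict (sketch)."""
    record = {}
    peaks = []
    in_peaks = False
    for line in record_text.splitlines():
        if line.startswith("//"):  # record terminator
            break
        if in_peaks:
            if line.startswith(" "):  # assumed: peak rows are indented
                mz, intensity, rel = line.split()
                peaks.append({
                    "mz": float(mz),
                    "intensity": float(intensity),
                    "rel_intensity": int(rel),
                })
                continue
            in_peaks = False
        tag, sep, value = line.partition(": ")
        if not sep:
            continue  # continuation rows of other blocks are skipped here
        if tag == "PK$PEAK":
            in_peaks = True
            continue
        record.setdefault(tag, []).append(value)
    record["PK$PEAK"] = peaks
    return record


def record_to_json(record_text):
    """Serialise the parsed record as a JSON string."""
    return json.dumps(record_to_dict(record_text), indent=2)
```

Keeping the original tag names as keys makes the mapping back to the text format trivial; a real design would probably want typed, nested fields instead.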
