


dafdiscovery's Introduction

DAFdiscovery

DAFdiscovery is meant to disseminate the STOCSY calculation to natural-products (NP) scientists, enabling data fusion and the discovery of compounds of interest from correlation calculations. The use of .csv files allows users to apply their processing methods of choice for MS and/or NMR data. This workflow was developed with .csv files exported from MNova and MZmine.

Basics on Jupyter Notebook

  • Jupyter Notebooks are completely free and users can download them on their own or as part of Anaconda (recommended)

  • Install Jupyter Notebooks through Anaconda at https://anaconda.org/

  • Start Jupyter Notebooks from the Anaconda suite or using the 'Anaconda Prompt' (installed together with Anaconda)

    • we recommend adding the 'Anaconda prompt' to favorites (or Start in Windows) for easy access
    • from the 'Anaconda prompt', use jupyter notebook to start Jupyter Notebook
      • use jupyter notebook D: to start Jupyter Notebook on drive D: (e.g. if Jupyter Notebook is installed on C: and the data are on D:)
  • Once Jupyter Notebook is open, navigate to the directory where DAFdiscovery was saved to open the notebooks (.ipynb files)

  • Dependencies required

    • Open the 'Anaconda Prompt' command line and install the packages (a single combined install command is sketched after this list):
      • NUMPY: type "pip install -U numpy"
      • PANDAS: type "pip install -U pandas"
      • IPYMPL: type "pip install -U ipympl"
      • PLOTLY: type "pip install -U plotly"
      • KALEIDO: type "pip install -U kaleido"
      • MATPLOTLIB: type "pip install -U matplotlib"
      • SCIKIT-LEARN: type "pip install -U scikit-learn"
      • SEABORN: type "pip install -U seaborn"
  • TIPS to know:

    • Jupyter is a web application used to create documents that integrate code, explanatory text, and results
      • each notebook document (.ipynb) is divided into cells that can hold code, text, and other content
      • Markdown: when a cell is set as Markdown, users can add text in a markup language to provide information for other users
      • Code: when a cell is set as Code, users can add lines of code to run
    • use %pwd to print the current directory
    • use %ls to show all files in the current directory
    • use %who to list all variables
    • to run a cell with code lines, use "Ctrl + Enter" (or the "Run" button)
    • Remember to always shut down the kernel after using each notebook
      • Kernel --> Shutdown
    • Have fun!
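As an alternative to installing each package one by one, all dependencies listed above can be installed from a single notebook cell (or the equivalent single pip command in the 'Anaconda Prompt'); a minimal sketch:

```python
# Install all DAFdiscovery dependencies in one go from a Jupyter Notebook cell.
# Equivalently, run "pip install -U numpy pandas ipympl plotly kaleido matplotlib scikit-learn seaborn"
# from the 'Anaconda Prompt'.
%pip install -U numpy pandas ipympl plotly kaleido matplotlib scikit-learn seaborn
```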

Get Organized: Each project can be represented as its own directory inside the main directory of DAFdiscovery

  • Download the main DAFdiscovery directory to a location of choice on your computer

    • using the dropdown button "Code", click on "Download ZIP" to download the full content
  • Each project should be organized as a new directory inside the main DAFdiscovery directory with all necessary inputs (an example layout is sketched after this list)

  • The STOCSY.py function is located in this main DAFdiscovery directory

  • The .ipynb files (notebooks) should also be located in this main directory

    • users should copy and rename each .ipynb file (notebook) according to their projects
    • users should delete the unused options from each .ipynb file (notebook) according to their projects
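A hypothetical layout following these recommendations might look like the sketch below (STOCSY.py and dafDiscovery_General.ipynb ship with DAFdiscovery; the project name and input file names are illustrative):

```
DAFdiscovery/
├── STOCSY.py
├── dafDiscovery_General.ipynb
├── MyProject_1.ipynb          # copied and renamed from dafDiscovery_General.ipynb
└── MyProject_1/               # one directory per project, holding all inputs
    ├── Metadata.csv
    ├── MS.csv
    ├── NMR.csv
    └── BioActivity.csv
```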

Metadata (.csv) must include the following column headers

  • Samples - sample codes

  • MS_filename - the MS filenames as they appear in the MS-derived .csv file (if available), where rows are the MS features and columns are the samples

  • NMR_filename - the NMR filenames as they appear in the NMR-derived .csv file (if available), where rows are the chemical shift (ppm) values and columns are the samples

  • BioAct_filename - the filenames/identifiers as reported for the BioActivity readout, if available

Important: Users are directed to different options according to the datasets available, as made explicit in the Metadata. For instance, if the Metadata contains the columns Samples, MS_filename, and BioAct_filename, the user will be pointed to option 4 (combining data from MS and Bioactivity).
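As a rough illustration of this routing logic (this is not the notebooks' actual code; the file path is hypothetical, while the column names follow the conventions above), the optional columns present in Metadata.csv determine which datasets can be fused:

```python
import pandas as pd

# Read the Metadata file for the project in use (illustrative path).
metadata = pd.read_csv("MyProject_1/Metadata.csv")

# Which optional columns are present decides which datasets can be combined.
has_ms = "MS_filename" in metadata.columns
has_nmr = "NMR_filename" in metadata.columns
has_bio = "BioAct_filename" in metadata.columns

available = [name for name, ok in
             [("MS", has_ms), ("NMR", has_nmr), ("BioActivity", has_bio)] if ok]
print("Datasets available for fusion:", ", ".join(available))
# e.g. Samples + MS_filename + BioAct_filename -> the MS + Bioactivity option (option 4)
```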

Input files: .csv files were chosen as input because they are vendor-agnostic, are easily handled by text editors, and we do not intend to re-create peak-processing methods for MS or NMR. Thus, users can process their data with their platform of choice (e.g., MZmine for MS or MNova for NMR).

  • Metadata.csv - presents a list of the filenames used for each dataset (MS, NMR, or BioActivity). Those filenames should be the same as exported from each specific processing platform (e.g., MZmine for MS or MNova for NMR); they should list, in order (according to the Samples column), all the column headers from the other input .csv files (a quick consistency check is sketched after this list)

  • MS.csv - the .csv file exported from the MS processing platform (e.g., MZmine), where MS-features are listed as rows and samples as columns, with filenames as headers for each column [tip: this can/should be the same *quant.csv file used in FBMN from GNPS]

  • NMR.csv - the .csv file exported from the NMR processing platform (e.g., MNova), where chemical shifts are listed as rows and samples as columns, with filenames as headers for each column

  • BioActivity.csv - the .csv file produced from the Bioactivity readout, where samples are listed as headers for each column

    • the Bioactivity data must be a gradient of values (avoid binary labels such as 'active vs. inactive' or '1 vs. 0'), since the potency of the activity is a consequence of certain compound concentrations across samples
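Because the filenames listed in Metadata.csv must match the column headers of the other input files, a quick consistency check such as the sketch below (hypothetical paths, MS data assumed) can save debugging time:

```python
import pandas as pd

metadata = pd.read_csv("MyProject_1/Metadata.csv")
ms = pd.read_csv("MyProject_1/MS.csv")  # MS features as rows, samples as columns

# Every MS filename listed in the Metadata must appear as a column header in MS.csv.
missing = set(metadata["MS_filename"]) - set(ms.columns)
if missing:
    print("MS.csv is missing columns listed in Metadata.csv:", missing)
else:
    print("Metadata.csv and MS.csv headers are consistent.")
```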

[Figure: general input organization]

Running DAFdiscovery

  • We recommend that users first open the notebook file "dafDiscovery_General.ipynb"

    • Notebooks open in a new tab
  • All notebooks are divided into cells, and Code cells are the ones that should be used to run the code

    • Markdown cells are used for information only, not to run code
  • To run a Code cell, users can use 'Ctrl + Enter' or the "Run" button

  1. Dependencies to install

  2. Import section

    2.a. Set the project to use

    2.b. Import the Metadata.csv file to establish filename order according to Sample IDs

     * this will define the option to be used regarding the available data. 
    
     * e.g., if the user has 'MS_filename' and 'NMR_filename' only, this will direct the user to option 2 (working with MS and NMR data)
    

We recommend that users delete the other options and keep only the one that fits the project in use

Save the notebook file with another name (related to the project in use)

  3. Data fusion

    3.a. In this cell, we will (a minimal pandas sketch of steps 3-4 is given after this walkthrough):

     * reorganize the filename order in each dataset according to the Sample IDs in the Metadata,
    
     * merge the datasets accordingly
    
     * save the new merged dataset as a .csv file for the user to keep if necessary
    
  4. STOCSY calculations

    4.a. In this cell, we finally apply the STOCSY function to calculate the covariance and the correlation data from the selected driver peak

     * a driver peak must be chosen by the user; it reflects the user's interest in a certain peak
    
     * an NMR peak can be used as the driver if NMR data are available
    
     * an MS-feature can be used as the driver if MS data are available
    
     * the bioactivity-derived "peak" will be chosen when bioactivity data are available
     	* this is because, when we have bioactivity data, we want to look for (MS or NMR or both) peaks that are correlated with it
    
     * the resulting correlation values for each MS-feature (according to the selected driver) are saved in a .csv file named MSinfo_corr_*driver*.csv in the data directory created inside the project directory in use
    

    4.b. To run the STOCSY calculation more than once, users can duplicate the cell (Insert -> Insert Cell Below, or 'Esc + B') by copying and pasting the code into the new cell

     * the user will need to set a new driver
    

    4.c. Plots

     * NMR STOCSY plots are drawn so that the intensities correspond to the covariance values for each variable (according to the driver peak) and the colors correspond to the correlation values for each variable (according to the driver peak); the x-axis is kept as chemical shifts (ppm)
    
     	* the color map spans -1 to +1: values with an absolute value close to 1 indicate strong correlations, while the sign indicates negative or positive correlation
    
     * MS STOCSY plots are scatter plots of Retention Time vs m/z (this can be easily modified, e.g., to Retention Time vs Row ID or Scan ID)
    
     * MS-features are represented as circular marks with variable sizes and colors
    
     	* the sizes of the circular marks reflect the correlation values for each variable according to the driver
    
     	* the color map is the same as described above for the NMR STOCSY plot
    

    4.d. From DAFdiscovery to Cytoscape for Molecular Networks, when FBMN-GNPS data are available

     * the file MSinfo_corr_*driver*.csv produced after the application of the STOCSY function in a dataset that contains MS data can be imported as node attributes (in Cytoscape: File -> Import -> Table from File...)
    
     * this is possible because every correlation value is indexed to the 'row ID' which characterizes each deconvoluted MS-feature
    
     * this applies when the user has an FBMN workflow and the required data available, for instance the *_quant.csv file
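The sketch below illustrates, with pandas/NumPy only, the gist of steps 3-4. It is not the STOCSY.py implementation shipped with DAFdiscovery: file paths, variable names, and the output location are assumptions, and the actual notebooks handle the different data-availability options. Datasets are reordered to the Sample order given in Metadata.csv, stacked into one variable-by-sample matrix, and each variable's covariance and Pearson correlation with the chosen driver are computed.

```python
import pandas as pd
import numpy as np

# --- Data fusion (step 3): align every dataset to the Sample order in Metadata.csv ---
metadata = pd.read_csv("MyProject_1/Metadata.csv")        # illustrative path
nmr = pd.read_csv("MyProject_1/NMR.csv", index_col=0)     # rows: ppm values, columns: samples
ms = pd.read_csv("MyProject_1/MS.csv", index_col=0)       # rows: MS features, columns: samples
bio = pd.read_csv("MyProject_1/BioActivity.csv", index_col=0)

nmr = nmr[metadata["NMR_filename"]]   # reorder columns to match the Metadata row order
ms = ms[metadata["MS_filename"]]
bio = bio[metadata["BioAct_filename"]]

# Relabel columns with the Sample IDs and stack the blocks into one (variables x samples) matrix.
for block in (nmr, ms, bio):
    block.columns = metadata["Samples"].values
fused = pd.concat([nmr, ms, bio], axis=0)
fused.to_csv("MyProject_1/MergedDataset.csv")             # assumed output location

# --- STOCSY (step 4): covariance and correlation of every variable against the driver ---
driver = fused.loc[bio.index[0]]      # e.g. the bioactivity readout as the driver "peak"
X = fused.to_numpy(dtype=float)       # shape: (n_variables, n_samples)
d = driver.to_numpy(dtype=float)

cov = ((X - X.mean(axis=1, keepdims=True)) * (d - d.mean())).mean(axis=1)
corr = cov / (X.std(axis=1) * d.std())   # Pearson correlation, in [-1, +1]

result = pd.DataFrame({"covariance": cov, "correlation": corr}, index=fused.index)
# In the notebooks, the correlation values of the MS features are exported to a
# MSinfo_corr_<driver>.csv file, which can then be loaded into Cytoscape as node attributes.
```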
    

References:

https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/epdf/10.1002/pca.3178

https://pubmed.ncbi.nlm.nih.gov/15732908/

https://pubs.acs.org/doi/10.1021/ac051444m

https://pubmed.ncbi.nlm.nih.gov/35098600/

https://pubs.rsc.org/en/content/articlelanding/2019/fd/c8fd00227d

Video walk-through: https://youtu.be/_lQGzfK5V-k

Video in Portuguese (IPPN Seminar, 25-07-2022): https://youtu.be/Jy0jjX0bL40


dafdiscovery's Issues

Questions: Using DAFdiscovery with non-target screening data

Hello,

Thank you very much for this promising tool. I have a few questions before using it for my research.

Do you think that DAFdiscovery would be relevant for data from non-target screening (where we collect MS1 and MS2 data from all the molecules we can ionize in our samples)?

If so, do you think the pipeline can be used with MS data collected in different ionization modes, and maybe with MS data coming from different instruments (like LC and GC)? If I am not mistaken, I think I only saw MS data collected in one ionization mode in the examples with the tutorials.

With that in mind, do you think the data should be scaled before calculating the correlation between MS signals and bioactivity? I am thinking that scaling the MS areas to unit variance before running the script would keep easily ionized molecules from having too much weight in the model compared to other molecules with a less intense signal.

One last thing: in the script for case II (MS data and bioactivity), I noticed that the last correlation plot of MS features was not loading properly because the column "corr_Bioact" was not renamed in the corrDF file. I fixed it by adding corrDF2.columns.values[1] = "corr_BioAct" before exporting MSinfo_corr.

Thanks for the help.
Jean
