
dbparser's Introduction

dbparser

Project Status: Active – The project has reached a stable, usable state and is being actively developed. Lifecycle: stable.

Overview

Drug databases vary widely in format and structure, which makes analysing their data difficult. Even working with just two databases together, such as DrugBank and KEGG, takes considerable effort.

Hence, the dbparser package aims to parse different public drug databases, such as DrugBank or KEGG, into a single, unified R object called dvobject (short for drugverse object).

That should help in:

  • working with a single data object instead of multiple databases in different formats,
  • applying R's analysis capabilities to drug data easily,
  • transferring data between researchers easily after performing the required analysis on the dvobject and storing the results in the same object.

dvobject Structure

dvobject introduces a unified and compressed format for drug data. It is an R list object that contains one or more of the following sub-lists:

  • drugs: list of data.frames that contain drug information (e.g. synonyms, classifications, …); it is the only mandatory sub-list
  • salts: data.frame that contains drug salt information
  • products: data.frame of commercially available drug products worldwide
  • references: data.frame of articles, links and textbooks about drugs or CETT data
  • cett: list of data.frames that contain target, enzyme, carrier and transporter information
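
To make the layout concrete, here is a minimal sketch of what such a list could look like, built by hand with toy data. The table names inside drugs (general_information, synonyms) and all column names and values are assumptions made for illustration; they are not taken from the package.

```r
# Toy dvobject: a named list whose only mandatory element is `drugs`.
# Every table name, column name and value below is made up for illustration.
dvobject <- list(
  drugs = list(
    general_information = data.frame(
      drugbank_id = c("DB00001", "DB00002"),
      name        = c("Drug A", "Drug B")
    ),
    synonyms = data.frame(
      drugbank_id = c("DB00001", "DB00001"),
      synonym     = c("Alias 1", "Alias 2")
    )
  ),
  # Optional sub-list: a single data.frame of salts.
  salts = data.frame(drugbank_id = "DB00002", salt_name = "Salt X")
)

# Which sub-lists are present, and which tables `drugs` holds.
names(dvobject)
names(dvobject$drugs)
```

Because everything lives in one list, the whole object can be saved with saveRDS() and passed to another researcher as a single file.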

Drug Databases

Parsers are available for the following databases (the list is still growing):

DrugBank

DrugBank database is a comprehensive, freely accessible, online database containing information on drugs and drug targets. As both a bioinformatics and a cheminformatics resource, DrugBank combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. More information about DrugBank can be found here.

In its raw form, the DrugBank database is a single XML file. Users must create an account with DrugBank and request permission to download the database. Note that this may take a couple of days.

The dbparser package parses the DrugBank XML database into R tibbles that can be explored and analyzed by the user. Check this tutorial for more details.

If you are waiting for access to the DrugBank database, or do not intend to do a deep dive with the data, you may wish to use the dbdataset package, which contains the DrugBank database already parsed into a dvobject. Note that this is a large package that exceeds the size limit set by CRAN, so it is only available on GitHub.

dbparser has been tested successfully against DrugBank versions 5.1.0 through 5.1.10. If you find errors with these versions or any other version, please submit an issue here.

Installation

You can install the released version of dbparser from CRAN with:

install.packages("dbparser")

Or you can install the latest updates directly from the repo:

library(devtools)
devtools::install_github("ropensci/dbparser")

Code of Conduct

Please note that the ‘dbparser’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributing Guide

👍🎉 First off, thanks for taking the time to contribute! 🎉👍 Please review our Contributing Guide.

Share the love ❤️

Think dbparser is useful? Let others discover it, by telling them in person, via Twitter or a blog post.

Using dbparser for a paper you are writing? Consider citing it

citation("dbparser")
#> 
#> To cite dbparser in publications use:
#> 
#>   Mohammed Ali, Ali Ezzat ().  dbparser: DrugBank Database XML Parser.
#>   R package version 2.0.0.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {DrugBank Database XML Parser},
#>     author = {Mohammed Ali and Ali Ezzat},
#>     organization = {Interstellar for Consultinc inc.},
#>     note = {R package version 2.0.0},
#>     url = {https://CRAN.R-project.org/package=dbparser},
#>   }

dbparser's People

Contributors

agenius-mohammed-ali, alizat, amrrs, emmamendelsohn, mohammedfcis, noamross


dbparser's Issues

Consistency of CLASS of Data Frames Returned by dbparser

The data frames that are returned by the dbparser package (using the parse_drug_all() function) do not all have the same class/type. Observe the following examples:

> get_xml_db_rows('full database.xml')

> myall <- parse_drug_all()

> class(myall$drug_targets)
[1] "tbl_df"     "tbl"        "data.frame"

> class(myall$drug_targets_actions)
[1] "data.frame"

Please ensure that all the returned data frames consistently have the same class/type:
"tbl_df" "tbl" "data.frame" (i.e. dplyr's version of data frame, tibble)

Vignette Content May Need to Be Modified

The vignette's figures require the full DrugBank database to be displayed. Since the full database would be too big to include in the package, this prevents the vignette from being self-contained (which is a bad thing).

This issue is about coming up with another couple of EDA examples to add to the vignette that do not require the full database. The purpose of this is to avoid relying on the full DrugBank database so that the vignette may be self-contained.

Count variables in 'drugs' data frame to have same prefix

When using the dbparser::parse_drug() function, the returned data frame has many variables ending with the _count suffix. In RStudio, when one is looking at the different elements of the data frame (by adding a $ sign and pressing TAB), it can be a bit annoying to find these count variables cluttering the elements list.

Proposed enhancement:
Instead of a ..._count suffix, let these variables have a count_... prefix so that they are bundled together; i.e. make use of the fact that the elements are ordered alphabetically in the list returned when pressing TAB.

Bonus:
In the list returned by the tidyverse's glimpse() function, it would be great if the count elements were moved to the end of that list.
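
The proposal can be sketched in base R as follows. This is an illustration on a toy data frame, not the package's code; the helper name move_count_to_prefix() and the column names are hypothetical.

```r
# Toy stand-in for the data frame returned by parse_drug();
# the column names are hypothetical.
drugs <- data.frame(
  synonyms_count = 3L,
  name           = "Drug A",
  products_count = 12L
)

# Turn a trailing "_count" into a leading "count_".
move_count_to_prefix <- function(df) {
  names(df) <- sub("^(.*)_count$", "count_\\1", names(df))
  df
}

renamed <- move_count_to_prefix(drugs)

# Bonus: move the count columns to the end, keeping the other columns first.
is_count <- startsWith(names(renamed), "count_")
renamed  <- renamed[, c(which(!is_count), which(is_count)), drop = FALSE]

names(renamed)
```

With the prefix in place, the count columns sort together in RStudio's TAB completion, and the reordering step puts them last in glimpse()-style listings.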

Add Parsed Data to Kaggle as a Contributed Dataset

Gather all the data frames that are returned by dbparser's parse_drug_all() function, and submit to Kaggle as a dataset. This dataset is to be updated quarterly with each new release of the DrugBank dataset.

Prerequisites:

  • Ask permission from DrugBank and make sure it is okay to do such a thing.
  • Homework: Can we contribute the dataset as a team (Dainanahan)?
  • Challenge: The DrugBank data is >1GB in size. How to upload such a dataset?
    • Suggested Solution: Rent an online VM for a day. From that VM, download and parse DrugBank XML, then upload to Kaggle.
  • The dbparser website has to be finalized.
    • Meaning of the word, 'finalized', to be discussed and specified later, but I am mainly referring to tutorials, usage examples, blogs and inspiration from our previously submitted DataCamp proposal.
  • Prior to contributing the dataset, there needs to be enough tutorial/blog material for Kagglers to know what to do with the data (what analyses to perform? what interesting questions could be answered with the dataset? perhaps contribute a Kaggle kernel of our own immediately after the dataset is posted online? etc.)
  • Homework: We have to make sure that the names of the variables (columns) of the data frames (returned by dbparser) are acceptable and in the way that we want them (i.e. self-explanatory without much needed elaboration from our side).
    • This will require a good look at the data frames' column names.
    • The parent_key variable in particular will be revisited (e.g. wherever parent_key refers to a drug, change column name to drugbank_id?).

parse_drug_all issues

GO_Classifiers_Polypeptide_Target_Drug is returned twice, and drug_carriers_polypeptides_go_classifiers does not follow the naming standard.

Incorrect Data Frame Names Returned by parse_drug_all()

dbparser's parse_drug_all() function returns all the parsed data in the form of a list of 72 data frames. All the data frames have the prefix drug_ except two of them:

  • experimental_properties: please change to drug_experimental_properties
  • classificationsparse_drug_classifications: please change to drug_classification

Additionally, please make sure that the names of the data frames returned by the individual parsing functions, parse_drug_experimental_properties() and parse_drug_classifications(), are also correct.

Finally, I believe that parse_drug_classifications() should be parse_drug_classification() (i.e. remove the 's') to correctly correspond with the <classification> tag in the XML (in the same way that parse_drug() corresponds to the <drug> tag).

Add Function for "Subsetting" the Data with a Certain Drug(s)

Using dbparser's parse_drug_all() function, the user is able to obtain the entire data residing within the XML file containing the DrugBank database. The parsed data is in the form of a list of data frames. This is fine in the case where the user is interested in doing an analysis that involves the entire data.

This issue is about the case when the user is only interested in a specific drug (or group of drugs). Let's refer to the list of data frames (returned by parse_drug_all()) as dblist.

A function called filter_drugs() is to be added that would be used as follows:

  • First, the user would programmatically filter just the drugs data frame from dblist using whatever criteria they wish (e.g. via the dplyr::filter() function).
  • After the user is satisfied with the set of drugs in dblist$drugs, dblist is then passed to filter_drugs().
  • As a result, all the other data frames in dblist would be filtered by the set of drugs in dblist$drugs.
  • dblist is then returned to the user after having been filtered.

Bonus: The function is preferred to have error-handling and/or warning messages for when unexpected things happen (e.g. a data frame from dblist is missing).
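
A minimal base R sketch of the proposed filter_drugs() is shown below. It assumes that the drug ID is stored in primary_key in dblist$drugs and in parent_key everywhere else, and that parent_key always refers to a drug; neither holds for every data frame, so a real implementation would need a per-table mapping. The argument names drug_key and ref_key are hypothetical.

```r
# Sketch only: filter every other data frame in dblist by the drugs
# that remain in dblist$drugs.
filter_drugs <- function(dblist, drug_key = "primary_key", ref_key = "parent_key") {
  if (!is.data.frame(dblist$drugs)) {
    stop("dblist has no 'drugs' data frame", call. = FALSE)
  }
  keep <- dblist$drugs[[drug_key]]
  for (nm in setdiff(names(dblist), "drugs")) {
    df <- dblist[[nm]]
    if (!is.data.frame(df) || !(ref_key %in% names(df))) {
      # Leave the element untouched, but tell the user about it.
      warning("Skipping '", nm, "': no '", ref_key, "' column", call. = FALSE)
      next
    }
    dblist[[nm]] <- df[df[[ref_key]] %in% keep, , drop = FALSE]
  }
  dblist
}

# Toy data with made-up IDs and values.
dblist <- list(
  drugs = data.frame(primary_key = c("DB00001", "DB00002", "DB00003")),
  drug_synonyms = data.frame(
    parent_key = c("DB00001", "DB00002", "DB00002", "DB00003"),
    synonym    = c("a", "b", "c", "d")
  )
)

# Step 1: the user filters the drugs data frame with any criteria.
dblist$drugs <- dblist$drugs[dblist$drugs$primary_key != "DB00002", , drop = FALSE]

# Step 2: all other data frames follow.
filtered <- filter_drugs(dblist)
filtered$drug_synonyms
```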

parse_drug(): Main Drug ID: primary_key --> parent_key

The data frame returned by parse_drug() has, as the main drug ID, the column named 'primary_key'. However, when the drug IDs appear in data frames returned by other parse*() functions, it is typically under a column named 'parent_key'.

Here, I am suggesting to modify the 'primary_key' column name in the parse_drug() data frame to 'parent_key' so that it would match the other data frames. This would mainly be convenient in joining operations between data frames, which will be done quite frequently in the future.
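
The convenience can be seen in a small base R join on toy data (the IDs, names and synonyms are made up):

```r
# Toy stand-ins for two parsed data frames.
drugs <- data.frame(
  primary_key = c("DB00001", "DB00002"),
  name        = c("Drug A", "Drug B")
)
synonyms <- data.frame(
  parent_key = c("DB00001", "DB00001", "DB00002"),
  synonym    = c("a1", "a2", "b1")
)

# Today: the two key names have to be mapped explicitly in every join.
joined_now <- merge(drugs, synonyms, by.x = "primary_key", by.y = "parent_key")

# After the proposed rename: a single shared key name is enough.
names(drugs)[names(drugs) == "primary_key"] <- "parent_key"
joined_after <- merge(drugs, synonyms, by = "parent_key")
```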

Warnings are displayed with any parse*() function

The following warning is displayed upon using any parse*() function:
"In bind_rows_(x, .id) : binding character and factor vector, coercing into character vector"

It is my impression that CRAN wants there to be no such warning messages for typical/standard/default usage of any of dbparser's functions. The above message shows up when running the function on the full DrugBank database XML.

Misleading Error Message While Attempting to Read Non-Existing XML File

When attempting to read a non-existing XML file using the get_xml_db_rows('<FILENAME.xml>') function of the dbparser package, an error is given (which makes sense). However, the error produced in this scenario is:

Error: XML content does not seem to be XML: '<FILENAME.xml>'

The issue here is that the error message is incorrect since the actual problem is that the file does not exist in the first place (and not that the file exists and its contents are non-XML). The error message is currently misleading and is supposed to be something like the one below instead:

Could not find the file: "<FILENAME.xml>". Please ensure that the file name is entered correctly and that it exists at the specified location.

Naturally, code that checks for the file's existence will need to be added, and if indeed the file doesn't exist, produce the error message above.
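
One way such a check could look is sketched below. The helper name check_xml_path() is hypothetical; where exactly the check belongs inside get_xml_db_rows() is left open.

```r
# Sketch of an existence check to run before any XML parsing is attempted.
check_xml_path <- function(path) {
  if (!file.exists(path)) {
    stop(
      "Could not find the file: \"", path, "\". ",
      "Please ensure that the file name is entered correctly ",
      "and that it exists at the specified location.",
      call. = FALSE
    )
  }
  invisible(path)
}

# A missing file now produces the clearer message.
missing_path <- file.path(tempdir(), "no_such_file.xml")
msg <- tryCatch(
  check_xml_path(missing_path),
  error = function(e) conditionMessage(e)
)
msg
```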

Parsing Properties from Small Molecule Element

Is your feature request related to a problem? Please describe.
Extracting SMILES and similar attributes from Small Molecule drugs

Describe the solution you'd like
Under Drugs element:
<property> <kind>SMILES</kind> <value>CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@H](CCC(O)=O)NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)[C@H](CC(O)=O)NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=O)[C@@H]1CCCN1C(=O)[C@H](N)CC1=CC=CC=C1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H](CC1=CC=C(O)C=C1)C(=O)N[C@@H](CC(C)C)C(O)=O</value> <source>ChemAxon</source> </property>

Many Duplicate Records (Targets, Enzymes, Transporters, Carriers)

Observation
There are many duplicate records in the data frames returned by parse_drug_all() that start with the following prefixes:

  • drug_targets_
  • drug_enzymes_
  • drug_transporters_
  • drug_carriers_

How to Reproduce Observation
Observe the following examples:

> library(dbparser)
> get_xml_db_rows('full database.xml')
> my_all <- parse_drug_all()
>
>
>
> my_all$drug_targets_polypeptide %>% nrow
[1] 18639
> my_all$drug_targets_polypeptide %>% unique %>% nrow
[1] 4763
>
> my_all$drug_targets_polypeptides_external_identifiers %>% nrow
[1] 107715
> my_all$drug_targets_polypeptides_external_identifiers %>% unique %>% nrow
[1] 21517
>
> my_all$drug_targets_actions %>% nrow
[1] 7671
> my_all$drug_targets_actions %>% unique %>% nrow
[1] 2953
>
>
>
> my_all$drug_enzymes_actions %>% nrow
[1] 5286
> my_all$drug_enzymes_actions %>% unique %>% nrow
[1] 517
>
>
>
> my_all$drug_carriers_actions %>% nrow
[1] 251
> my_all$drug_carriers_actions %>% unique %>% nrow
[1] 85

What to Do?
For any of the data frames with the above-mentioned prefixes, please apply the unique() function to them in order to remove any duplicate rows before returning them to the user.

Suggestion
It may even be a good idea to apply the unique() function to ALL data frames produced by dbparser before returning them to the user. However, I will let you be the judge of that. :)
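
As a sketch of the request, the snippet below applies unique() to every element of a result list whose name starts with one of the four prefixes. The list and its contents are toy data; only the prefixes come from this issue.

```r
# Toy stand-in for the list returned by parse_drug_all().
parsed <- list(
  drug_targets_actions = data.frame(
    parent_key = c("T1", "T1", "T2"),
    action     = c("inhibitor", "inhibitor", "agonist")
  ),
  drug_enzymes_actions = data.frame(
    parent_key = c("E1", "E1"),
    action     = c("substrate", "substrate")
  ),
  drug_synonyms = data.frame(
    parent_key = c("DB00001", "DB00001"),
    synonym    = c("a", "a")
  )
)

prefixes <- c("drug_targets_", "drug_enzymes_", "drug_transporters_", "drug_carriers_")
pattern  <- paste0("^(", paste(prefixes, collapse = "|"), ")")
is_cett  <- grepl(pattern, names(parsed))

# Drop duplicate rows only in the matching data frames.
parsed[is_cett] <- lapply(parsed[is_cett], unique)

# Row counts after de-duplication.
vapply(parsed, nrow, integer(1))
```

Applying it to all data frames, as suggested above, would only require replacing is_cett with a vector of TRUE values.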

Function that returns everything!

Add a function, e.g. parse_everything(), that parses all the data and returns the parsed data in the form of a list of dataframes.

Note that if save_table=TRUE, all the parsed data should be saved successfully in the DB.

Ability to Read Compressed XML files (and Download from Internet Locations?)

It would be nice if the XML file containing the DrugBank database could be parsed while still in its compressed form. Specifically, if the file ends in .zip, the file is to be uncompressed automatically before parsing.

Inspiration for this issue came from the file argument of the read_csv() function of the readr package.

Bonus: If dbparser is able to deal with other compressed formats (.gz, .bz2, .xz), that would be great!

Bonus: If the filename starts with http://, https://, ftp://, or ftps://, then additionally download the file prior to uncompressing and parsing.
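
The decompression part could be sketched in base R as below. The helper name decompress_if_needed() is hypothetical, downloading from URLs is not covered, and reading the whole file with readLines() is a simplification that would be memory-hungry for the full DrugBank XML.

```r
# Sketch: return the path of a plain XML file, decompressing first if needed.
decompress_if_needed <- function(path) {
  ext <- tolower(tools::file_ext(path))

  if (ext == "zip") {
    # unzip() returns the paths of the extracted files.
    extracted <- utils::unzip(path, exdir = tempfile("dbxml_"))
    xml_files <- extracted[grepl("\\.xml$", extracted, ignore.case = TRUE)]
    if (length(xml_files) != 1L) {
      stop("Expected exactly one XML file in the archive", call. = FALSE)
    }
    return(xml_files)
  }

  if (ext %in% c("gz", "bz2", "xz")) {
    # In read mode, gzfile() also handles bzip2 and xz compressed files.
    out <- tempfile(fileext = ".xml")
    con <- gzfile(path, open = "rt")
    on.exit(close(con))
    writeLines(readLines(con), out)
    return(out)
  }

  # Anything else is assumed to be uncompressed already.
  path
}
```

The returned path can then be handed to the existing parsing entry point unchanged, so the rest of the package does not need to know whether the input was compressed.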

Badges

Happy to send a PR with badges if interested!

CRAN submission dbparser 1.0.0 Comments

  • Please write package names, software names and API names in
    single quotes (e.g. 'DrugBank') in Title and Description.
  • Please add a URL for 'DrugBank' in your Description text in the form
    http:... or https:...
    with angle brackets for auto-linking and no space after 'http:' and
    'https:'.
  • Please replace \dontrun{} by \donttest{} or unwrap the examples if they
    can be executed in less than 5 sec per Rd-file.

Python Wrapper for dbparser Package

It would be convenient for Python users who are interested in the dbparser package if there exists a Python wrapper for it. Such a wrapper may, for example, be uploaded to Anaconda Cloud or to something that is like BioConductor but within the Python universe (e.g. BioPython).

Rename datasets more properly

Is your feature request related to a problem? Please describe.
Applying a naming standard to the retrieved datasets will make them much easier to work with in dependent packages and apps; please refer to DrugBank Browser.

Describe the solution you'd like

  • Name should be DatasetName_Parent(s)Name(s), e.g. Drugs.csv, Names_Drugs.csv, etc.
  • Ignore references and interactions in names

Misspelled File Names "*durg*"

Three misspelled file names were found in the dbparser package:

  • dbparser/tests/testthat/
    • test_durg_main_node_parser.R
    • test_durg_all_nodes_parser.R
  • dbparser/R/
    • durg_main_node_parser.R

Please, fix the file names and make sure there are no similar misspellings inside the above files.

DrugBank Ids

Drugs sometimes have more than 3 IDs:

<drugbank-id primary="true">DB00006</drugbank-id>
  <drugbank-id>BTD00076</drugbank-id>
  <drugbank-id>EXPT03302</drugbank-id>
  <drugbank-id>BIOD00076</drugbank-id>
  <drugbank-id>DB02351</drugbank-id>

The parser needs to handle any number of DrugBank IDs.

Typo in one of the elements of targets data

When using the dbparser::parse_drug_targets_polypeptides() function, the returned data frame has a misspelled variable, amindo_acid_format.

Proposed Solution:
amindo_acid_format --> amino_acid_format

Bonus:
See if you can find similar typos, and fix them too!

Datasets Features Names

Is your feature request related to a problem? Please describe.
When I used the data frames generated by dbparser, I had to rename their features so that they would be displayed properly.

Describe the solution you'd like
Rename features such as text to something more descriptive, and convert lowercase feature names to camel case.

Saving to database not to be the default

Currently, save_table is TRUE by default for all parse*() functions. Let it be FALSE by default. The change is to be reflected in all functions' documentations (i.e. the kind of help page that displays when adding a '?' before the function name in the console and pressing Enter).

Title in dbparser's documentation page

The documentation page for dbparser shows up when you type the following and press Enter:
help(package = dbparser)

In the page that is displayed, the title reads:
Drug bank xml db parser and database saver

I prefer the title to be:
DrugBank Database XML Parser

An overall check of the entire documentation (including all the parse*() functions) would be welcome (for typos and clarity), but the title is enough for now.

Remove *Count* features from drug data set

Is your feature request related to a problem? Please describe.
These features currently have no use, as they will be replaced by the drugQuery package.

Describe the solution you'd like
A new package called drugQuery will replace these features and provide more complex and advanced queries on DrugBank.

Missing datasets

Describe the bug
Two datasets are missing from the parser: salts and international-brands.
