ropensci / dbparser

Source code for the R package, "dbparser" (i.e. DrugBank Parser)

Home Page: https://docs.ropensci.org/dbparser

License: Other

R 100.00%

dbparser's Introduction

dbparser

Overview

Drug databases vary widely in format and structure, which makes related data analysis difficult. Working with just two databases together, such as DrugBank and KEGG, already takes considerable effort.

Hence, the dbparser package aims to parse different public drug databases, such as DrugBank or KEGG, into a single, unified R object called dvobject (short for drugverse object).

That should help in:

working with a single data object instead of multiple databases in different formats,
applying R's analysis capabilities to drug data easily,
transferring data between researchers easily after performing the required analysis on the dvobject and storing the results in the same object.

dvobject Structure

dvobject introduces a unified and compressed format for drug data. It is an R list object that contains one or more of the following sub-lists:

drugs: a list of data.frames that contain drug information (e.g. synonyms, classifications, …); it is the only mandatory list
salts: a data.frame that contains drug salt information
products: a data.frame of commercially available drug products worldwide
references: a data.frame of articles, links and textbooks about drugs or CETT data
cett: a list of data.frames that contain target, enzyme, carrier and transporter information
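
As a rough illustration of this layout, the sketch below builds a toy list with some of the same top-level names using base R only. The IDs, column names and values are made up for the example and are not the actual dbparser schema.

```r
# Toy stand-in for a dvobject: a named list whose only mandatory element is
# `drugs`. All IDs, column names and values below are illustrative assumptions.
toy_dvobject <- list(
  drugs = list(
    general_information = data.frame(
      drug_id = c("TOY_1", "TOY_2"),
      name = c("Drug A", "Drug B"),
      stringsAsFactors = FALSE
    ),
    synonyms = data.frame(
      drug_id = c("TOY_1", "TOY_1"),
      synonym = c("Alias 1", "Alias 2"),
      stringsAsFactors = FALSE
    )
  ),
  salts = data.frame(
    drug_id = "TOY_2",
    salt_name = "Salt of B",
    stringsAsFactors = FALSE
  )
)

# Inspect which sub-lists are present and how large each drugs table is.
names(toy_dvobject)
vapply(toy_dvobject$drugs, nrow, integer(1))
```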

Drug Databases

Parsers are available for the following databases (the list is a work in progress):

DrugBank

DrugBank database is a comprehensive, freely accessible, online database containing information on drugs and drug targets. As both a bioinformatics and a cheminformatics resource, DrugBank combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. More information about DrugBank can be found here.

In its raw form, the DrugBank database is a single XML file. Users must create an account with DrugBank and request permission to download the database. Note that this may take a couple of days.

The dbparser package parses the DrugBank XML database into R tibbles that can be explored and analyzed by the user; see this tutorial for more details.

If you are waiting for access to the DrugBank database, or do not intend to do a deep dive with the data, you may wish to use the dbdataset package, which contains the DrugBank database already parsed into a dvobject. Note that it is a large package that exceeds the size limit set by CRAN, so it is only available on GitHub.

dbparser has been tested successfully against DrugBank versions 5.1.0 through 5.1.10. If you find errors with these versions or any other version, please submit an issue here.

Installation

You can install the released version of dbparser from CRAN with:

install.packages("dbparser")

or you can install the latest updates directly from the repo:

library(devtools)
devtools::install_github("ropensci/dbparser")

Code of Conduct

Please note that the ‘dbparser’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Contributing Guide

👍🎉 First off, thanks for taking the time to contribute! 🎉👍 Please review our Contributing Guide.

Share the love ❤️

Think dbparser is useful? Let others discover it, by telling them in person, via Twitter or a blog post.

Using dbparser for a paper you are writing? Consider citing it

citation("dbparser")
#> 
#> To cite dbparser in publications use:
#> 
#>   Mohammed Ali, Ali Ezzat ().  dbparser: DrugBank Database XML Parser.
#>   R package version 2.0.0.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {DrugBank Database XML Parser},
#>     author = {Mohammed Ali and Ali Ezzat},
#>     organization = {Interstellar for Consultinc inc.},
#>     note = {R package version 2.0.0},
#>     url = {https://CRAN.R-project.org/package=dbparser},
#>   }

dbparser's Issues

Consistency of CLASS of Data Frames Returned by dbparser

The data frames that are returned by the dbparser package (using the parse_drug_all() function) do not all have the same class/type. Observe the following example:

> get_xml_db_rows('full database.xml')

> myall <- parse_drug_all()

> class(myall$drug_targets)
[1] "tbl_df"     "tbl"        "data.frame"

> class(myall$drug_targets_actions)
[1] "data.frame"
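
A check like the one above can be run over every element at once. The sketch below uses base R only; the toy list and the `non_tibble_names()` helper are hypothetical and only mimic the class vectors shown in the output.

```r
# Report which elements of a list of tables are not tibbles.
# `tables` is any named list of data frames (a toy list is built below).
non_tibble_names <- function(tables) {
  is_tbl <- vapply(tables, inherits, logical(1), what = "tbl_df")
  names(tables)[!is_tbl]
}

plain <- data.frame(parent_key = "T1", action = "inhibitor",
                    stringsAsFactors = FALSE)
# Mimic a tibble's class vector without requiring the tibble package.
tibble_like <- structure(plain, class = c("tbl_df", "tbl", "data.frame"))

toy_tables <- list(drug_targets = tibble_like, drug_targets_actions = plain)
non_tibble_names(toy_tables)
```

With the tibble package available, tibble::as_tibble() is one way to bring the remaining plain data frames to the tibble class.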

Please ensure that all the returned data frames consistently have the same class/type:
"tbl_df" "tbl" "data.frame" (i.e. dplyr's version of data frame, tibble)

Download DrugBank database from the official site and parse it

Is your feature request related to a problem? Please describe.
Instead of the user downloading the file and passing it to the package, the package could download it directly, given a DrugBank user name and password.

Github wiki

Vignette Content May Need to Be Modified

The vignette's figures require the full DrugBank database to be displayed. Since the full database would be too big to include in the package, this prevents the vignette from being self-contained (which is a bad thing).

This issue is about coming up with another couple of EDA examples to add to the vignette that do not require the full database. The purpose of this is to avoid relying on the full DrugBank database so that the vignette may be self-contained.

Build package website

Count variables in 'drugs' data frame to have same prefix

When using the dbparser::parse_drug() function, the returned data frame has many variables ending with the _count suffix. In RStudio, when one is looking at the different elements of the data frame (by adding a $ sign and pressing TAB), it can be a bit annoying to find these count variables cluttering the elements list.

Proposed enhancement:
Instead of a ..._count suffix, let these variables have a count_... prefix so that they are bundled together; i.e. make use of the fact that the elements are ordered alphabetically in the list returned when pressing TAB.

Bonus:
In the list returned by the tidyverse's glimpse() function, it would be great if the count elements were moved to the end of that list.
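
The proposed renaming and the bonus reordering can be sketched in base R as follows. The helper name and the toy column names are assumptions made for illustration, not the actual columns returned by parse_drug().

```r
# Turn a `<name>_count` suffix into a `count_<name>` prefix.
count_suffix_to_prefix <- function(col_names) {
  sub("^(.+)_count$", "count_\\1", col_names)
}

# Toy data frame; the column names are illustrative assumptions.
drugs <- data.frame(
  primary_key = "D1",
  synonyms_count = 3L,
  name = "Drug A",
  targets_count = 2L,
  stringsAsFactors = FALSE
)

names(drugs) <- count_suffix_to_prefix(names(drugs))

# Bonus: move the count columns to the end, keeping the original order otherwise.
is_count <- startsWith(names(drugs), "count_")
drugs <- drugs[, c(which(!is_count), which(is_count)), drop = FALSE]
names(drugs)
```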

Release 1.0.1 preparation

resolve any issue from package checks before submitting to CRAN

Add Parsed Data to Kaggle as a Contributed Dataset

Gather all the data frames that are returned by dbparser's parse_drug_all() function, and submit to Kaggle as a dataset. This dataset is to be updated quarterly with each new release of the DrugBank dataset.

Prerequisites:

Ask permission from DrugBank and make sure it is okay to do such a thing.
Homework: Can we contribute the dataset as a team (Dainanahan)?
Challenge: The DrugBank data is >1GB in size. How to upload such a dataset?
- Suggested Solution: Rent an online VM for a day. From that VM, download and parse DrugBank XML, then upload to Kaggle.
The dbparser website has to be finalized.
- Meaning of the word, 'finalized', to be discussed and specified later, but I am mainly referring to tutorials, usage examples, blogs and inspiration from our previously submitted DataCamp proposal.
Prior to contributing the dataset, there needs to be enough tutorial/blog material for Kagglers to know what to do with the data (what analyses to perform? what interesting questions could be answered with the dataset? perhaps contribute a Kaggle kernel of our own immediately after the dataset is posted online? etc.)
Homework: We have to make sure that the names of the variables (columns) of the data frames (returned by dbparser) are acceptable and in the way that we want them (i.e. self-explanatory without much needed elaboration from our side).
- This will require a good look at the data frames' column names.
- The parent_key variable in particular will be revisited (e.g. wherever parent_key refers to a drug, change column name to drugbank_id?).

Support Reading DrugBank RDF Version

http://download.bio2rdf.org/release/3/drugbank/drugbank.html

parse_drug_all issues

GO_Classifiers_Polypeptide_Target_Drug is returned twice and drug_carriers_polypeptides_go_classifiers is not following name standard

Incorrect Data Frame Names Returned by `parse_drug_all()`

dbparser's parse_drug_all() function returns all the parsed data in the form of a list of 72 data frames. All the data frames have the prefix drug_ except two of them:

experimental_properties: please change to drug_experimental_properties
classificationsparse_drug_classifications: please change to drug_classification

Additionally, please make sure that the names of the data frames returned by the individual parsing functions, parse_drug_experimental_properties() and parse_drug_classifications(), are also correct.

Finally, I believe that parse_drug_classifications() should be parse_drug_classification() (i.e. remove the 's') to correctly correspond with the <classification> tag in the XML (in the same way that parse_drug() corresponds to the <drug> tag).

Add Function for "Subsetting" the Data with a Certain Drug(s)

Using dbparser's parse_drug_all() function, the user is able to obtain the entire data residing within the XML file containing the DrugBank database. The parsed data is in the form of a list of data frames. This is fine in the case where the user is interested in doing an analysis that involves the entire data.

This issue is about the case when the user is only interested in a specific drug (or group of drugs). Let's refer to the list of data frames (returned by parse_drug_all()) as dblist.

A function called filter_drugs() is to be added that would be used as follows:

First, the user would programmatically filter just the drugs data frame from dblist using whatever criteria they wish (e.g. via the dplyr::filter() function).
After the user is satisfied with the set of drugs in dblist$drugs, dblist is then passed to filter_drugs().
As a result, all the other data frames in dblist would be filtered by the set of drugs in dblist$drugs.
dblist is then returned to the user after having been filtered.

Bonus: The function is preferred to have error-handling and/or warning messages for when unexpected things happen (e.g. a data frame from dblist is missing).
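
The steps above could look roughly like the following base R sketch. It assumes that the drugs table is keyed by a `primary_key` column and that every other table refers to drugs through a `parent_key` column; the toy data and the exact signature are hypothetical.

```r
# Sketch of the proposed filter_drugs(): keep, in every other table, only the
# rows whose drug key still appears in dblist$drugs.
# Assumptions: drugs are keyed by `primary_key`, other tables by `parent_key`.
filter_drugs <- function(dblist, drug_key = "primary_key",
                         ref_key = "parent_key") {
  if (is.null(dblist$drugs)) {
    stop("dblist has no 'drugs' data frame to filter by.")
  }
  keep_ids <- dblist$drugs[[drug_key]]
  for (table_name in setdiff(names(dblist), "drugs")) {
    tbl <- dblist[[table_name]]
    if (!ref_key %in% names(tbl)) {
      warning("Skipping '", table_name, "': no '", ref_key, "' column.")
      next
    }
    dblist[[table_name]] <- tbl[tbl[[ref_key]] %in% keep_ids, , drop = FALSE]
  }
  dblist
}

# Toy data for illustration only.
dblist <- list(
  drugs = data.frame(primary_key = c("D1", "D2", "D3"),
                     type = c("biotech", "small molecule", "biotech"),
                     stringsAsFactors = FALSE),
  drug_synonyms = data.frame(parent_key = c("D1", "D2", "D2", "D3"),
                             synonym = c("a", "b", "c", "d"),
                             stringsAsFactors = FALSE)
)

# Step 1: the user filters the drugs table with any criteria they like.
dblist$drugs <- dblist$drugs[dblist$drugs$type == "biotech", ]
# Step 2: propagate that selection to every other table.
dblist <- filter_drugs(dblist)
dblist$drug_synonyms$parent_key
```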

Descriptive statistical tables

update tutorial on how to reproduce it

Review package returned data against drugbank site

support latest DrugBank database

[1] "Parsed Synonyms_Polypeptide_Carrier_Drug, 30/74"
Error in xmlChildren(p[["pfams"]])[[1]] : subscript out of bounds

Branding

Example of branding on RStudio blog
https://blog.rstudio.com/2018/01/29/sparklyr-0-7/

parse_drug(): Main Drug ID: primary_key --> parent_key

The data frame returned by parse_drug() has, as the main drug ID, the column named 'primary_key'. However, when the drug IDs appear in data frames returned by other parse*() functions, it is typically under a column named 'parent_key'.

Here, I am suggesting renaming the 'primary_key' column in the parse_drug() data frame to 'parent_key' so that it matches the other data frames. This would mainly be convenient in joining operations between data frames, which will be done quite frequently in the future.
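
To make the convenience argument concrete, here is a small base R sketch of the same join before and after such a rename. The key column names follow the issue text; the tables and their other columns are made up.

```r
# Toy tables; the key column names follow the issue text, the rest is made up.
drugs <- data.frame(primary_key = c("D1", "D2"), name = c("Drug A", "Drug B"),
                    stringsAsFactors = FALSE)
targets <- data.frame(parent_key = c("D1", "D1", "D2"),
                      target = c("T1", "T2", "T3"),
                      stringsAsFactors = FALSE)

# Today: the two key columns must be spelled out on every join.
joined_now <- merge(drugs, targets, by.x = "primary_key", by.y = "parent_key")

# After the proposed rename: a single shared key name is enough.
names(drugs)[names(drugs) == "primary_key"] <- "parent_key"
joined_renamed <- merge(drugs, targets, by = "parent_key")

nrow(joined_now) == nrow(joined_renamed)
```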

Warnings are displayed with any parse*() function

The following warning is displayed upon using any parse*() function:
"In bind_rows_(x, .id) : binding character and factor vector, coercing into character vector"

It is my impression that CRAN wants there to be no such warning messages for typical/standard/default usage of any of dbparser's functions. The above message shows up when running the function on the full DrugBank database XML.

Misleading Error Message While Attempting to Read Non-Existing XML File

When attempting to read a non-existing XML file using the get_xml_db_rows('<FILENAME.xml>') function of the dbparser package, an error is given (which makes sense). However, the error produced in this scenario is:

Error: XML content does not seem to be XML: '<FILENAME.xml>'

The issue here is that the error message is incorrect since the actual problem is that the file does not exist in the first place (and not that the file exists and its contents are non-XML). The error message is currently misleading and is supposed to be something like the one below instead:

Could not find the file: "<FILENAME.xml>". Please ensure that the file name is entered correctly and that it exists at the specified location.

Naturally, code that checks for the file's existence will need to be added, and if indeed the file doesn't exist, produce the error message above.
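
A minimal sketch of such a check, in base R, might look like this. The helper name `assert_xml_file_exists()` is hypothetical and is not part of dbparser; it only shows where the existence test and the proposed message would go.

```r
# Hypothetical helper: fail early with a clear message when the file is missing.
# It is not part of dbparser; it only illustrates the proposed check.
assert_xml_file_exists <- function(path) {
  if (!file.exists(path)) {
    stop(
      "Could not find the file: \"", path, "\". ",
      "Please ensure that the file name is entered correctly ",
      "and that it exists at the specified location.",
      call. = FALSE
    )
  }
  invisible(path)
}

# Usage: a path that does not exist triggers the new message.
missing_path <- file.path(tempdir(), "no_such_database.xml")
result <- tryCatch(assert_xml_file_exists(missing_path),
                   error = function(e) conditionMessage(e))
result
```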

Parsing Properties from Small Molecule Element

Is your feature request related to a problem? Please describe.
Extracting SMILES and similar attributes from Small Molecule drugs

Describe the solution you'd like
Under Drugs element:
<property> <kind>SMILES</kind> <value>CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@H](CCC(O)=O)NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)[C@H](CC(O)=O)NC(=O)CNC(=O)[C@H](CC(N)=O)NC(=O)CNC(=O)CNC(=O)CNC(=O)CNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=O)[C@@H]1CCCN1C(=O)[C@H](N)CC1=CC=CC=C1)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H](CC1=CC=C(O)C=C1)C(=O)N[C@@H](CC(C)C)C(O)=O</value> <source>ChemAxon</source> </property>

Many Duplicate Records (Targets, Enzymes, Transporters, Carriers)

Observation
There are many duplicate records in the data frames returned by parse_drug_all() that start with the following prefixes:

drug_targets_
drug_enzymes_
drug_transporters_
drug_carriers_

How to Reproduce Observation
Observe the following examples:

> library(dbparser)
> get_xml_db_rows('full database.xml')
> my_all <- parse_drug_all()
>
>
>
> my_all$drug_targets_polypeptide %>% nrow
[1] 18639
> my_all$drug_targets_polypeptide %>% unique %>% nrow
[1] 4763
>
> my_all$drug_targets_polypeptides_external_identifiers %>% nrow
[1] 107715
> my_all$drug_targets_polypeptides_external_identifiers %>% unique %>% nrow
[1] 21517
>
> my_all$drug_targets_actions %>% nrow
[1] 7671
> my_all$drug_targets_actions %>% unique %>% nrow
[1] 2953
>
>
>
> my_all$drug_enzymes_actions %>% nrow
[1] 5286
> my_all$drug_enzymes_actions %>% unique %>% nrow
[1] 517
>
>
>
> my_all$drug_carriers_actions %>% nrow
[1] 251
> my_all$drug_carriers_actions %>% unique %>% nrow
[1] 85

What to Do?
For any of the data frames with the above-mentioned prefixes, please apply the unique() function to them in order to remove any duplicate rows before returning them to the user.

Suggestion
It may even be a good idea to apply the unique() function to ALL data frames produced by dbparser before returning them to the user. However, I will let you be the judge of that. :)
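
As a sketch of the request, the snippet below applies unique() to every table whose name starts with one of the listed prefixes. The toy list stands in for the output of parse_drug_all(); its contents are invented for illustration.

```r
# Toy list of tables with repeated rows (values are illustrative).
dblist <- list(
  drug_targets_actions = data.frame(
    parent_key = c("T1", "T1", "T2", "T2", "T2"),
    action = c("inhibitor", "inhibitor", "agonist", "agonist", "binder"),
    stringsAsFactors = FALSE
  ),
  drug_carriers_actions = data.frame(
    parent_key = c("C1", "C1"),
    action = c("binder", "binder"),
    stringsAsFactors = FALSE
  )
)

# Apply unique() only to tables with the affected prefixes.
prefixes <- c("drug_targets_", "drug_enzymes_", "drug_transporters_",
              "drug_carriers_")
affected <- grepl(paste0("^(", paste(prefixes, collapse = "|"), ")"),
                  names(dblist))
dblist[affected] <- lapply(dblist[affected], unique)

vapply(dblist, nrow, integer(1))
```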

Function that returns everything!

Add a function, e.g. parse_everything(), that parses all the data and returns the parsed data in the form of a list of dataframes.

Note that if save_table=TRUE, all the parsed data should be successfully saved in the database.

Ability to Read Compressed XML files (and Download from Internet Locations?)

It would be nice if the XML file containing the DrugBank database could be parsed while still in its compressed form. Specifically, if the file ends in .zip, the file is to be uncompressed automatically before parsing.

Inspiration for this issue came from the file argument of the read_csv() function of the readr package.

Bonus: If dbparser is able to deal with other compressed formats (.gz, .bz2, .xz), that would be great!

Bonus: If the filename starts with http://, https://, ftp://, or ftps://, then additionally download the file prior to uncompressing and parsing.
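
One possible shape for this behaviour, using base R only, is sketched below. `resolve_xml_path()` is a hypothetical helper, not a dbparser function. It reads the whole file into memory, so it is meant to show the dispatch on URL scheme and file extension rather than to handle a database of DrugBank's size.

```r
# Hypothetical helper (not a dbparser function): return the path of a local,
# uncompressed copy of `path`, downloading and decompressing as needed.
resolve_xml_path <- function(path) {
  # Download first when the input looks like a URL.
  if (grepl("^(https?|ftps?)://", path)) {
    local_copy <- file.path(tempdir(), basename(path))
    utils::download.file(path, local_copy, mode = "wb")
    path <- local_copy
  }
  ext <- tolower(tools::file_ext(path))
  if (ext == "zip") {
    # unzip() returns the extracted paths; assume the archive holds one XML.
    return(utils::unzip(path, exdir = tempdir())[1])
  }
  if (ext %in% c("gz", "bz2", "xz")) {
    # gzfile() detects gzip, bzip2 and xz input when reading.
    out <- tempfile(fileext = ".xml")
    con <- gzfile(path, open = "rt")
    on.exit(close(con))
    writeLines(readLines(con), out)
    return(out)
  }
  path
}

# Usage with a small gzip-compressed file created on the fly.
gz_path <- tempfile(fileext = ".xml.gz")
gz_con <- gzfile(gz_path, open = "wt")
writeLines("<drugbank></drugbank>", gz_con)
close(gz_con)
readLines(resolve_xml_path(gz_path))
```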

Create Rest API for the package

https://www.rplumber.io/

Support saving parsed data frames as CSV

Need to update parse all as well, to not parse an element with an existing csv.
Support overriding the written csv if user wants to
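
A minimal sketch of the skip-or-override behaviour, assuming one CSV file per table named after the list element. `save_tables_csv()` and its `overwrite` argument are hypothetical names chosen for this example.

```r
# Hypothetical helper: write each table to <dir>/<name>.csv, skipping files
# that already exist unless `overwrite = TRUE`.
save_tables_csv <- function(tables, dir, overwrite = FALSE) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  written <- character(0)
  for (table_name in names(tables)) {
    csv_path <- file.path(dir, paste0(table_name, ".csv"))
    if (file.exists(csv_path) && !overwrite) next
    utils::write.csv(tables[[table_name]], csv_path, row.names = FALSE)
    written <- c(written, table_name)
  }
  written  # names of the tables actually written
}

# Toy usage in a fresh temporary directory.
out_dir <- file.path(tempdir(), "dbparser_csv_demo")
unlink(out_dir, recursive = TRUE)
toy <- list(drugs = data.frame(primary_key = "D1", stringsAsFactors = FALSE))
first_run <- save_tables_csv(toy, out_dir)
second_run <- save_tables_csv(toy, out_dir)
forced_run <- save_tables_csv(toy, out_dir, overwrite = TRUE)
```

The same existence test could let a parse-all routine skip elements whose CSV is already on disk.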

create function that returns list of tibbles of parsed dataframes

Add parser to missing attributes

There are some attributes that still need to be parsed such as:

Calculated Properties.
average-mass
etc...

Badges

Happy to send a PR with badges if interested!

Update package documentation

Review should be for semantic and syntax errors

Build error

The latest commit by you seems to cause a build error, please check
https://travis-ci.org/Dainanahan/dbparser/builds/474200685

remove "byValue" parameter in drug_sub_df method

It is no longer used.

Submit to Bioconductor

Vignette Figures Look Ugly

The figures in the vignette look unsatisfactory. Specifically, the figures are left-centered and tiny. A quick comparison between the figures in dbparser's vignette and another example vignette from tidytext clearly shows that work is needed in dbparser's figures.

CRAN submission dbparser 1.0.0 Comments

please write package names, software names and API names in
single quotes (e.g. 'DrugBank') in Title and Description.
Please add an URL for 'DrugBank' in your Description text in the form
http:... or https:...
with angle brackets for auto-linking and no space after 'http:' and
'https:'.
Please replace \dontrun{} by \donttest{} or unwrap the examples if they
can be executed in less than 5 sec per Rd-file.

Python Wrapper for dbparser Package

It would be convenient for Python users who are interested in the dbparser package if there exists a Python wrapper for it. Such a wrapper may, for example, be uploaded to Anaconda Cloud or to something that is like BioConductor but within the Python universe (e.g. BioPython).

submit to CRAN

Rename datasets more properly

Is your feature request related to a problem? Please describe.
Applying a naming standard to the retrieved datasets will make dealing with them much easier in dependent packages and apps; please refer to DrugBank Browser.

Describe the solution you'd like

Names should follow the pattern DatasetName_Parent(s)Name(s), e.g. Drugs.csv, Names_Drugs.csv, etc.
Ignore references and interactions in names

Misspelled File Names "durg"

Three misspelled file names were found in the dbparser package:

dbparser/tests/testthat/
- test_durg_main_node_parser.R
- test_durg_all_nodes_parser.R
dbparser/R/
- durg_main_node_parser.R

Please, fix the file names and make sure there are no similar misspellings inside the above files.

Add option to save the parsed dataset into csv files

Currently, the only option is to save the dataset into a database; we could also save it to the user's hard disk.

DrugBank Ids

Drugs sometimes have more than three IDs:

<drugbank-id primary="true">DB00006</drugbank-id>
  <drugbank-id>BTD00076</drugbank-id>
  <drugbank-id>EXPT03302</drugbank-id>
  <drugbank-id>BIOD00076</drugbank-id>
  <drugbank-id>DB02351</drugbank-id>

Need to handle any number of drugbank ids
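
One way to accommodate any number of IDs is a long table with one row per ID. The base R sketch below extracts the IDs from the snippet above with regular expressions, which is a simplification for illustration only: a real implementation would walk the XML tree with an XML parser, and the `parent_key` and `primary` column names are assumptions.

```r
# Toy extraction with regular expressions; a real implementation would walk
# the XML tree with an XML parser instead.
drug_xml <- paste(
  '<drugbank-id primary="true">DB00006</drugbank-id>',
  '<drugbank-id>BTD00076</drugbank-id>',
  '<drugbank-id>EXPT03302</drugbank-id>',
  '<drugbank-id>BIOD00076</drugbank-id>',
  '<drugbank-id>DB02351</drugbank-id>',
  sep = "\n"
)

tags <- regmatches(
  drug_xml,
  gregexpr("<drugbank-id[^>]*>[^<]+</drugbank-id>", drug_xml)
)[[1]]

# One row per ID, so any number of IDs per drug fits the same table.
ids <- data.frame(
  id = sub("^<drugbank-id[^>]*>([^<]+)</drugbank-id>$", "\\1", tags),
  primary = grepl('primary="true"', tags, fixed = TRUE),
  stringsAsFactors = FALSE
)
ids$parent_key <- ids$id[ids$primary][1]
ids
```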

Create UI

Typo in one of the elements of targets data

When using the dbparser::parse_drug_targets_polypeptides() function, the returned data frame has a misspelled variable, amindo_acid_format.

Proposed Solution:
amindo_acid_format --> amino_acid_format

Bonus:
See if you can find similar typos, and fix them too!

Datasets Features Names

Is your feature request related to a problem? Please describe.
When I used the data frames generated by dbparser, I had to rename their features so that they would be displayed properly.

Describe the solution you'd like
Rename features such as text to something more descriptive, and convert lower-case feature names to camel case.

Saving to database not to be the default

Currently, save_table is TRUE by default for all parse*() functions. Let it be FALSE by default. The change is to be reflected in all functions' documentations (i.e. the kind of help page that displays when adding a '?' before the function name in the console and pressing Enter).

Get current db version and release date

complete current documentation

Title in dbparser's documentation page

The documentation page for dbparser shows up when you type the following and press Enter:
help(package = dbparser)

In the page that is displayed, the title reads:
Drug bank xml db parser and database saver

I prefer the title to be:
DrugBank Database XML Parser

An overall check of the entire documentation (including all the parse*() functions) would be welcome (for typos and clarity), but the title is enough for now.

Remove Count features from drug data set

Is your feature request related to a problem? Please describe.
These features currently have no use, as they will be replaced by the drugQuery package.

Describe the solution you'd like
A new package called drugQuery will replace these features and provide more complex and advanced queries on DrugBank.

Missing datasets

Describe the bug
Two datasets are missing from the parser: salts and international-brands.