
medrxivr's Introduction

medrxivr

Project Status: Active – The project has reached a stable, usable state and is being actively developed.
[Badges: rOpenSci peer review, DOI, CRAN downloads, R build status, Travis build status, Codecov test coverage]

An increasingly important source of health-related bibliographic content is preprints - preliminary versions of research articles that have yet to undergo peer review. The two preprint repositories most relevant to health-related sciences are medRxiv and bioRxiv, both of which are operated by the Cold Spring Harbor Laboratory.

The goal of the medrxivr R package is two-fold. In the first instance, it provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. The package also provides access to a maintained static snapshot of the medRxiv repository (see Data sources). Secondly, medrxivr provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.
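
Once the package is installed (see below), a minimal end-to-end sketch of that workflow looks like this; each step is covered in detail in the sections that follow:

# Minimal end-to-end sketch: import, search, export, download
preprint_data <- mx_snapshot()                    # import the medRxiv snapshot
results <- mx_search(data = preprint_data,
                     query = "dementia")          # search the records
mx_export(data = results, file = "results.bib")   # export matches to a .BIB file
mx_download(results, "pdf/", create = TRUE)       # download full-text PDFs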

Installation

To install the stable version of the package from CRAN:

install.packages("medrxivr")
library(medrxivr)

Alternatively, to install the development version from GitHub, use the following code:

install.packages("devtools")
devtools::install_github("ropensci/medrxivr")
library(medrxivr)

Data sources

medRxiv data

medrxivr provides two ways to access medRxiv data:

  • mx_api_content(server = "medrxiv") creates a local copy of all data available from the medRxiv API at the time the function is run.
# Get a copy of the database from the live medRxiv API endpoint
preprint_data <- mx_api_content()  
  • mx_snapshot() provides access to a static snapshot of the medRxiv database. The snapshot is created each morning at 6am using mx_api_content() and is stored as a CSV file in the medrxivr-data repository. This method does not rely on the API (which can become unavailable during peak usage times) and is usually faster (as it reads data from a CSV rather than having to re-extract it from the API). Discrepancies between the most recent static snapshot and the live database can be assessed using mx_crosscheck(), as shown below.
# Get a copy of the database from the daily snapshot
preprint_data <- mx_snapshot()  
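
To see how closely the latest snapshot tracks the live database, mx_crosscheck() can be run on its own (shown here without arguments, as in the simplest case):

# Compare the most recent snapshot against the live medRxiv API
mx_crosscheck()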

The relationship between the two methods for the medRxiv database is summarised in the figure below:

bioRxiv data

Only one data source exists for the bioRxiv repository:

  • mx_api_content(server = "biorxiv") creates a local copy of all data available from the bioRxiv API endpoint at the time the function is run. Note: due to its size, downloading a complete copy of the bioRxiv repository in this manner takes a long time (~1 hour); a date-limited alternative is sketched below the example.
# Get a copy of the database from the live bioRxiv API endpoint
preprint_data <- mx_api_content(server = "biorxiv")
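
If you only need part of the bioRxiv repository, restricting the date range avoids the long full download. A minimal sketch, assuming the from_date and to_date arguments accept dates in YYYY-MM-DD format (as in the raw-API example further down):

# Download only one week of bioRxiv records rather than the full repository
preprint_data <- mx_api_content(server = "biorxiv",
                                from_date = "2020-01-01",
                                to_date   = "2020-01-07")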

Performing your search

Once you have created a local copy of either the medRxiv or bioRxiv preprint database, you can pass this object (preprint_data in the examples above) to mx_search() to search the preprint records using an advanced search strategy.

# Import the medrxiv database
preprint_data <- mx_snapshot()
#> Using medRxiv snapshot - 2021-01-28 09:31

# Perform a simple search
results <- mx_search(data = preprint_data,
                     query ="dementia")
#> Found 192 record(s) matching your search.

# Perform an advanced search
topic1  <- c("dementia","vascular","alzheimer's")  # Combined with Boolean OR
topic2  <- c("lipids","statins","cholesterol")     # Combined with Boolean OR
myquery <- list(topic1, topic2)                    # Combined with Boolean AND

results <- mx_search(data = preprint_data,
                     query = myquery)
#> Found 70 record(s) matching your search.

You can also explore which search terms are contributing most to your search by setting report = TRUE:

results <- mx_search(data = preprint_data,
                     query = myquery,
                     report = TRUE)
#> Found 70 record(s) matching your search.
#> Total topic 1 records: 1078
#> dementia: 192
#> vascular: 917
#> alzheimer's: 0
#> Total topic 2 records: 203
#> lipids: 74
#> statins: 25
#> cholesterol: 136
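
Because search terms are matched as regular expressions, a single term can also cover spelling variants. A minimal sketch, reusing the snapshot loaded above:

# Use a regular expression to match both "randomisation" and "randomization"
results <- mx_search(data = preprint_data,
                     query = "randomi[sz]ation")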

Further functionality

Export records identified by your search to a .BIB file

Pass the results of your search above (the results object) to mx_export() to export references for preprints matching your search results to a .BIB file so that they can be easily imported into a reference manager (e.g. Zotero, Mendeley).

mx_export(data = results,
          file = "mx_search_results.bib")

Download PDFs for records returned by your search

Pass the results of your search above (the results object) to the mx_download() function to download a copy of the PDF for each record found by your search.

mx_download(results,        # Object returned by mx_search(), above
            "pdf/",         # Directory to save PDFs to 
            create = TRUE)  # Create the directory if it doesn't exist

Accessing the raw API data

By default, the mx_api_*() functions clean the data returned by the API for use with other medrxivr functions.

To access the raw data returned by the API, the clean argument should be set to FALSE:

mx_api_content(to_date = "2019-07-01", clean = FALSE)

See this article for more details.

Detailed guidance

Detailed guidance, including advice on how to design complex search strategies, is available on the medrxivr website.

Linked repositories

See here for the code used to take the daily snapshot and the code that powers the medrxivr web app.

Other tools/packages for working with medRxiv/bioRxiv data

The focus of medrxivr is on providing tools that allow users to import and then search medRxiv and bioRxiv data. Below is a list of complementary packages that provide distinct but related functionality when working with medRxiv and bioRxiv data:

Code of conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Disclaimer

This package and the data it accesses/returns are provided “as is”, with no guarantee of accuracy. Please be sure to check the accuracy of the data yourself (and do let me know if you find an issue so I can fix it for everyone!)

medrxivr's People

Contributors

danielskatz, mcguinlu


medrxivr's Issues

Release medrxivr 0.0.3

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • devtools::build_readme()
  • usethis::use_cran_comments()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran(env_vars=c(R_COMPILE_AND_INSTALL_PACKAGES = "always"))
  • Update cran-comments.md
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('major')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Update install instructions in README
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Error in count/100 : non-numeric argument to binary operator

Session Info
 setting  value
 version  R version 4.2.2 (2022-10-31)
 os       macOS Ventura 13.2.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Toronto
 date     2024-03-20
 rstudio  2023.09.1+494 Desert Sunflower (desktop)
 pandoc   NA
> mx_api_content(
+     from_date = "2013-01-01",
+     to_date = as.character(Sys.Date()),
+     clean = TRUE,
+     server = "medrxiv",
+     include_info = FALSE
+ )
Error in count/100 : non-numeric argument to binary operator
> mx_data <- mx_api_content(from_date = "2020-01-01",
+                           to_date = "2020-01-07")
Error in count/100 : non-numeric argument to binary operator
> if(interactive()){
+     mx_data <- mx_api_content(from_date = "2020-01-01",
+                               to_date = "2020-01-07")
+ }
Error in count/100 : non-numeric argument to binary operator

> preprint_data <- mx_api_content(server = "biorxiv")
Error in count/100 : non-numeric argument to binary operator
> preprint_data <- mx_api_content()
Error in count/100 : non-numeric argument to binary operator

Offhand, I can see that this is caused by a bad value of count, probably NA, NULL, or 0. I believe the same error has been included automatically in the docs.

I am guessing either something changed server-side or in a dependency that invalidated the lib's logic. I am writing a Python version now; will let you know if I find the problem and solution.
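
For what it's worth, a guard along these lines (a hypothetical sketch, not the package's actual code; count stands for the total-record value parsed from the API metadata) would turn the failure into an informative error:

# Hypothetical guard before computing the number of API pages to fetch
check_count <- function(count) {
  count <- suppressWarnings(as.numeric(count))
  if (length(count) != 1 || is.na(count) || count <= 0) {
    stop("The API returned an invalid total record count - it may be down or its format may have changed.")
  }
  ceiling(count / 100)  # number of 100-record pages to request
}

check_count(148231)  # 1483 pages
check_count(NULL)    # clear error instead of "non-numeric argument to binary operator"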

Add function to report number of "hits" per individual search term.

It is often useful to be able to see the number of "hits" (records returned) by each individual element of the search, so that when designing the search strategy you can interrogate which elements are influencing the returned records the most. For example, if the search is:

topic1 <- c("dementia", "Alzheimer's")  # Combined with Boolean "OR"
topic2 <- c("lipids", "cholesterol")    # Combined with Boolean "OR"

query <- list(topic1, topic2)           # Combined with Boolean "AND"

results <- mx_search(mx_snapshot(), query)

Then passing query to the proposed mx_reporter() function would return something like the below:

# Total number of records found by your search: XX

# Total topic 1 records: XX
# - dementia: XX
# - Alzheimer's: XX

# Total Topic 2 records: XX
# - lipids: XX
# - cholesterol: XX
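
A minimal sketch of the per-term counting (a hypothetical helper, assuming the snapshot has title and abstract columns and that matching should be case-insensitive):

# Hypothetical sketch: count records matching each individual search term
count_hits <- function(data, terms) {
  text <- paste(data$title, data$abstract)
  vapply(terms, function(term) sum(grepl(term, text, ignore.case = TRUE)), integer(1))
}

count_hits(mx_snapshot(), c("dementia", "Alzheimer's"))

This is essentially what the report = TRUE argument to mx_search(), shown in the README example above, now prints.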

mx_api_content() fails if the last page doesn't contain any records

This seems to be due to the fact that the number of records given by the "total" metadata is more than the total number of records actually available.

As of 14:39 on 04/01/2021, the number of records given by the "total" field is 148231. However, if you set the counter to any record within 31 of this figure (e.g. https://api.biorxiv.org/details/biorxiv/2013-01-01/2021-01-04/148201), you get a "No posts found" message. As medrxivr uses the "total" metadata field to work out how many pages it needs to cycle through to download the whole database, this sometimes leads to an error when the last page, expected by medrxivr based on the "total" field, is empty.

Note that as more records are added to the API, the hardcoded figures above will no longer demonstrate the issue.
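
One defensive approach (a sketch only, not the package's actual implementation; it assumes the jsonlite package and that the /details endpoint returns its records in a "collection" field) is to stop paging as soon as an empty page comes back, rather than trusting the "total" field:

library(jsonlite)

base   <- "https://api.biorxiv.org/details/biorxiv/2013-01-01/2021-01-04/"
cursor <- 0
pages  <- list()

repeat {
  page <- fromJSON(paste0(base, cursor))
  # Stop on an empty page, even if the "total" metadata suggests more records
  if (is.null(page$collection) || NROW(page$collection) == 0) break
  pages[[length(pages) + 1]] <- page$collection
  cursor <- cursor + 100
  if (cursor >= 300) break  # cap for this illustration only
}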

Add support for plain text Boolean/common search functions

For example: if a user searches for "randomi*ation", convert this to "randomi([[:alpha:]])ation", where the ([[:alpha:]]) element matches any single alphabetic character - in this case, the regex will find both randomisation and randomization.

The idea is to prevent users from having to use unfamiliar regex terms, in favour of common MEDLINE/EMBASE/Ovid syntax.
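
A minimal sketch of the conversion for the single-character * wildcard described above (a hypothetical helper, not part of the package):

# Hypothetical: translate the Ovid-style "*" wildcard into its regex equivalent
expand_wildcard <- function(term) {
  gsub("*", "([[:alpha:]])", term, fixed = TRUE)
}

expand_wildcard("randomi*ation")
#> [1] "randomi([[:alpha:]])ation"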

Fix pkgdown configuration for reference

Some topics are missing from the configuration file.

 Error in check_missing_topics(rows, pkg) : 
  All topics must be included in reference index
  Missing topics: medrxivr, mx_caps

Note that for topics you do not want to include in the index, you can create an "internal" section: https://pkgdown.r-lib.org/reference/build_reference.html?q=internal#missing-topics

You can also use the @keywords internal tag and redocument for, say, the package manual page.

To check all topics are listed, after editing the configuration file you can run pkgdown::check_pkgdown().
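
For the @keywords internal route mentioned above, the package-level manual page might be tagged like this (a roxygen2 sketch; the description text is illustrative):

#' medrxivr: Access and search medRxiv and bioRxiv preprint records
#'
#' Package-level documentation page, kept out of the pkgdown reference index.
#'
#' @keywords internal
"_PACKAGE"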

pkgdown is failing

The pkgdown GitHub Action is failing, but the Jenkins version works fine. It gives the error message:

-- Building function reference -------------------------------------------------
Error in check_missing_topics(rows, pkg) : 
  Topics missing from index: medrxivr
  • Remove the pkgdown.yml GitHub actions file.
