fulltext's Introduction

  _____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Get full text articles from (almost) anywhere

rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

fulltext makes it easy to do text-mining by supporting the following steps:

  • Search for articles
  • Fetch articles
  • Get links for full text articles (xml, pdf)
  • Extract text from articles / convert formats
  • Collect bits of articles that you actually need
  • Download supplementary materials from papers

Additional steps we hope to include in future versions:

  • Analysis enabled via the tm package and friends, and via Spark-R to handle especially large jobs
  • Visualization

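Data sources in fulltext include PLoS, BMC, Crossref, Entrez, arXiv, bioRxiv, and Europe PMC (the sources reported in the ft_search() output in the Search section).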

Authorization: A number of publishers require authorization via API key, and some use even more draconian authorization processes that involve checking IP addresses. We are working on supporting the various authorization schemes used by different publishers, but all the open access (OA) content is already easily available.

We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

Extraction tools

The ft_extract() function currently offers two options for extracting text from PDFs: xpdf and ghostscript.

Search

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'plos')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 29751; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0]

Get full text links

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_26478753 ID_26475579 ID_26474754 ID_26474753 ID_26474751 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_26478753 ID_26475579 ID_26474754 ID_26474753 ID_26474751 ...

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.1371/journal.pone.0086169', from = 'plos')
#> <fulltext text>
#> [Docs] 1 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0086169 ...

Extract chunks

library("rplos")
(dois <- searchplos(q = "*:*", fl = 'id',
   fq = list('doc_type:full',"article_type:\"research article\""), limit = 5)$data$id)
#> [1] "10.1371/journal.pbio.0050316" "10.1371/journal.pone.0030133"
#> [3] "10.1371/journal.pone.0012724" "10.1371/journal.pbio.0050323"
#> [5] "10.1371/journal.pone.0030142"
x <- ft_get(dois, from = "plos")
x %>% chunks("publisher") %>% tabularize()
#> $plos
#>                                               publisher
#> 1 Public Library of Science\n        San Francisco, USA
#> 2 Public Library of Science\n        San Francisco, USA
#> 3         Public Library of Science\nSan Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5 Public Library of Science\n        San Francisco, USA
x %>% chunks(c("doi","publisher")) %>% tabularize()
#> $plos
#>                            doi
#> 1 10.1371/journal.pbio.0050316
#> 2 10.1371/journal.pone.0030133
#> 3 10.1371/journal.pone.0012724
#> 4 10.1371/journal.pbio.0050323
#> 5 10.1371/journal.pone.0030142
#>                                               publisher
#> 1 Public Library of Science\n        San Francisco, USA
#> 2 Public Library of Science\n        San Francisco, USA
#> 3         Public Library of Science\nSan Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5 Public Library of Science\n        San Francisco, USA

Use dplyr to munge the data

library("dplyr")
x %>%
 chunks(c("doi", "publisher", "permissions")) %>%
 tabularize() %>%
 .$plos %>%
 select(-permissions.license)
#>                            doi
#> 1 10.1371/journal.pbio.0050316
#> 2 10.1371/journal.pone.0030133
#> 3 10.1371/journal.pone.0012724
#> 4 10.1371/journal.pbio.0050323
#> 5 10.1371/journal.pone.0030142
#>                                               publisher
#> 1 Public Library of Science\n        San Francisco, USA
#> 2 Public Library of Science\n        San Francisco, USA
#> 3         Public Library of Science\nSan Francisco, USA
#> 4 Public Library of Science\n        San Francisco, USA
#> 5 Public Library of Science\n        San Francisco, USA
#>   permissions.copyright.year permissions.copyright.holder
#> 1                       2007                  Miall et al
#> 2                       2012              Wulfmeyer et al
#> 3                       2010                 Sorbye et al
#> 4                       2007                  Xiang et al
#> 5                       2012                Kreakie et al
#>   permissions.license_url
#> 1                    <NA>
#> 2                    <NA>
#> 3                    <NA>
#> 4                    <NA>
#> 5                    <NA>
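The publisher values in the output above carry embedded newlines and runs of spaces. A minimal sketch of tidying them with base R follows. It uses a small hand-built data frame that mimics two rows of the tabularize() output; the data frame `df` and the helper `squish_ws()` are illustrative assumptions and are not part of fulltext.

```r
# Toy data frame mimicking two rows of the tabularize() output shown above
df <- data.frame(
  doi = c("10.1371/journal.pbio.0050316", "10.1371/journal.pone.0012724"),
  publisher = c("Public Library of Science\n        San Francisco, USA",
                "Public Library of Science\nSan Francisco, USA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse any run of whitespace (including newlines)
# into a single space, then trim leading and trailing whitespace
squish_ws <- function(x) trimws(gsub("[[:space:]]+", " ", x))

df$publisher <- squish_ws(df$publisher)
unique(df$publisher)
```

After this step the differently indented publisher strings compare as equal, which matters if you later group or deduplicate by publisher.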

Supplementary materials

Grab supplementary materials for (re-)analysis of data.

ft_get_si() accepts article identifiers as well as output from ft_search() and ft_get().

catching.crabs <- read.csv(ft_get_si("10.6084/m9.figshare.979288", 2))
head(catching.crabs)
#>   trap.no. length.deployed no..crabs
#> 1        1          10 sec         0
#> 2        2          10 sec         0
#> 3        3          10 sec         0
#> 4        4          10 sec         0
#> 5        5          10 sec         0
#> 6        1           1 min         0
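Once the supplementary file is loaded, re-analysis usually starts with a little cleanup. As a sketch, the text column length.deployed can be converted to seconds. The data frame below retypes the six rows printed above rather than downloading anything, the helper `to_seconds()` is hypothetical, and it assumes only the "sec" and "min" units seen in those rows (any other unit becomes NA).

```r
# The six rows printed by head(catching.crabs) above, retyped by hand
crabs <- data.frame(
  trap.no. = c(1, 2, 3, 4, 5, 1),
  length.deployed = c(rep("10 sec", 5), "1 min"),
  no..crabs = 0,
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "<number> <unit>" and scale to seconds
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), " ", fixed = TRUE)
  value <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit  <- vapply(parts, `[`, character(1), 2)
  multiplier <- c(sec = 1, min = 60)  # assumption: only these units occur
  unname(value * multiplier[unit])
}

crabs$seconds <- to_seconds(crabs$length.deployed)
crabs
```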

Cache

Full text data accumulates quickly and can take a long time to download, which is where caching comes in. Once you have pulled down a large amount of data within an R session, you don't want to lose it if the session crashes. When searching, you will be able to (this is not ready yet) optionally cache the raw JSON/XML/etc. of each request locally. Repeating that exact search then returns the local data, unless you ask for fresh data.

ft_get('10.1371/journal.pone.0086169', from='plos', cache=TRUE)
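To illustrate the idea behind caching (not how fulltext implements it internally), here is a minimal sketch in base R that stores the result of a request on disk and reuses it on the next identical request. The helper `cached_fetch()`, its `refresh` argument, and the stand-in `slow_fetch()` are all hypothetical names introduced for this example.

```r
# Hypothetical helper: run fetch(key) once, then serve the saved result
cached_fetch <- function(key, fetch, cache_dir, refresh = FALSE) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Build a file-system-safe file name from the key (e.g. a DOI).
  # Note: distinct keys could collide after sanitizing; fine for a sketch.
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", key), ".rds"))
  if (!refresh && file.exists(path)) {
    return(readRDS(path))  # cache hit: no request is made
  }
  result <- fetch(key)     # cache miss or forced refresh: do the real work
  saveRDS(result, path)
  result
}

# Stand-in for a slow network call; counts how often it actually runs
n_calls <- 0
slow_fetch <- function(doi) {
  n_calls <<- n_calls + 1
  paste("full text for", doi)
}

cache  <- tempfile("ft_cache_")  # fresh, session-local cache directory
doi    <- "10.1371/journal.pone.0086169"
first  <- cached_fetch(doi, slow_fetch, cache)
second <- cached_fetch(doi, slow_fetch, cache)  # served from disk
n_calls
```

Passing `refresh = TRUE` skips the saved copy and overwrites it, which corresponds to the "unless you want new data" case described above.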

Extract text from PDFs

Some results from ft_search() have full text available as plain text, XML, or other machine-readable formats, while others are open access but available only as PDF. The package includes a set of convenience functions for extracting text from PDFs, both locally and remotely.

Locally, using code adapted from the tm package and various PDF-to-text parsing backends:

pdf <- system.file("examples", "example2.pdf", package = "fulltext")

Using ghostscript

(res_gs <- ft_extract(pdf, "gs"))
#> <document>/Library/Frameworks/R.framework/Versions/3.2/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Using xpdf

(res_xpdf <- ft_extract(pdf, "xpdf"))
#> <document>/Library/Frameworks/R.framework/Versions/3.2/Resources/library/fulltext/examples/example2.pdf
#>   Pages: 10
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Or extract directly into a tm Corpus

paths <- sapply(paste0("example", 2:5, ".pdf"), function(x) system.file("examples", x, package = "fulltext"))
(corpus_xpdf <- ft_extract_corpus(paths, "xpdf"))
#> $meta
#>           names                           class
#> 1 content, meta PlainTextDocument, TextDocument
#> 2 content, meta PlainTextDocument, TextDocument
#> 3 content, meta PlainTextDocument, TextDocument
#> 4 content, meta PlainTextDocument, TextDocument
#> 
#> $data
#> <<VCorpus>>
#> Metadata:  corpus specific: 0, document level (indexed): 0
#> Content:  documents: 4
#> 
#> attr(,"class")
#> [1] "xpdf"

Extract text from a PDF remotely, using a web service called PDFX

pdf5 <- system.file("examples", "example5.pdf", package = "fulltext")
pdfx(file = pdf5)
#> $meta
#> $meta$job
#> [1] "34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044"
#>
#> $meta$base_name
#> [1] "5kpp"
#>
#> $meta$doi
#> [1] "10.7554/eLife.03640"
#>
#>
#> $data
#> <?xml version="1.0" encoding="UTF-8"?>
#> <pdfx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://pdfx.cs.man.ac.uk/static/article-schema.xsd">
#>   <meta>
#>     <job>34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044</job>
#>     <base_name>5kpp</base_name>
#>     <doi>10.7554/eLife.03640</doi>
#>   </meta>
#>    <article>
#>  .....

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for fulltext: citation(package = 'fulltext')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Contributors

sckott, willpearse, emhart, dwinter, karthik
