
samara's Introduction

samara*

A commandline tool to extract plant trait data from open data sources.


Plant trait data from recent automated scrapes are available at APSNet or ARS-GRIN.

For more information about the APSNet scraping process, including name matching, please go here.

*A samara is a winged achene, a type of fruit in which a flattened wing of fibrous, papery tissue develops from the ovary wall (from https://en.wikipedia.org/wiki/Samara_(fruit), accessed 2016-06-10).

prerequisites

sbt 0.13.8+, Java JDK 8+, git, Maven 3.3+

build/test

  1. clone this repo
  2. build the jar with sbt assembly: a stand-alone jar samara-assembly-[version].jar will be created in target/scala-2.11/
  3. run the tests with sbt test

download

Don't like building your own jar? Go to releases, pick a release and download the jar from there.

usage

  1. list available sources with java -jar samara-assembly-[version].jar list
  2. scrape a source called apsnet and write the results to apsnet.tsv with java -jar samara-assembly-[version].jar scrape apsnet > apsnet.tsv


samara's Issues

Kew Garden Seed Information Database not (yet) open data

In reviewing the suitability of using the Kew Garden Seed Information Database, I stumbled across its collaboration page. The text (see below) seems to indicate that the entire database is available to collaborators after signing a data transfer agreement. This indicates that the data is not (yet) available under an open data license (http://opendefinition.org/od/).

From http://data.kew.org/sid/collaborations.html, accessed 6 Dec 2017:


Wherever possible, we will collaborate with other researchers and owners of large comparative datasets. Collaboration involves activities like us running species lists against SID for matches (more convenient than record by record on-line). We have previously shared large amounts of data from the database with collaborators (and are keen to do so in future) under the terms of a data transfer agreement, and with appropriate arrangements for acknowledgement and, where appropriate, authorship in any papers arising.

A couple of examples of papers resulting from previous collaborations are:

Moles et al; 2005; A Brief History of Seed Size; Science; 307(5709): 576-580

Tweddle et al; 2003; Ecological Aspects of Seed Desiccation Sensitivity; Journal of Ecology; 91(2)294-304

grin scraper: after first connection timeout, all requests fail: throttling?

From https://build.berkeleybop.org/job/extract-grin-traits/25/ , it appears that after downloading about 400k pages from ars-grin, a connection timeout occurred, followed by further connection timeouts and other connectivity issues; no successful download/scrape occurs after that, leaving about 13k failed downloads and 41 crops unexamined (starting from Sorghum and ending with Zinnia). This explains why the result of https://build.berkeleybop.org/job/extract-grin-traits/25/ is about 1.2GB, whereas the successful scrape before that (e.g. https://build.berkeleybop.org/job/extract-grin-traits/5/) was 2.5GB.

I've added some extra logging to better understand the connection errors. This extra info will hopefully tell us whether the scrape process is actively blocked or whether the failures can be fixed on our side.

Here's the transition from successful scrapes to the timeouts:

[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1409315] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1409315] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409330] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409330] download failed because of [Read timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409333] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409334] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409334] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409348] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409348] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409360] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409360] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409366] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409366] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409385] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409385] download failed because of [npgsweb.ars-grin.gov] .

possible manual transcription of apsnet turfgrasses http://www.apsnet.org/publications/commonnames/Pages/Turfgrasses.aspx

After inspecting some more preliminary results of the apsnet scraper, I found that the turfgrasses page shows some particularly creative usage of abbreviations and name references. It seems that building a scraper for this page would take more time than either (a) convincing the authors of the page to be a little more explicit or (b) manually transcribing the page using a format similar to samara's apsnet output.

http://ecoport.org - The name "EcoPort" is a composite acronym derived from the words 'Ecology and Portal'.

from http://ecoport.org - The name "EcoPort" is a composite acronym derived from the words 'Ecology and Portal'.

EcoPort is a single, open-society, contiguous, communal, wiki integrated with a Relational Database and its Management System (RDBMS) on the Internet. This is why we refer to it as a relational wiki.

The last news event was reported in 2006, and the pages seem to be intended for human consumption. Machine-readable datasets associated with EcoPort have yet to be discovered.

Identify evidence code for GAF2

As discussed in #11

The evidence code (column 7) is shown as "EXP (static)" in the wiki. This is not necessarily the case for all traits. The evidence code will change based on how the data was collected. Some traits are inferred, or calculated from other measurements, and thus will require a different evidence code. It will be difficult to decide on these codes automatically without human intervention.

grin scraper fails on parsing descriptordetails page

For some reason, descriptor detail 310094 at https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094 displays an error page. This causes the grin parser to misbehave.

From https://build.berkeleybop.org/job/extract-grin-traits/27/console :

[https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094] downloaded.
Exception in thread "main" java.util.NoSuchElementException
	at java.util.ArrayList$Itr.next(ArrayList.java:854)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
	at scala.collection.IterableLike.head(IterableLike.scala:106)
	at scala.collection.IterableLike.head$(IterableLike.scala:105)
	at net.ruippeixotog.scalascraper.model.LazyElementQuery.head(ElementQuery.scala:40)
	at net.ruippeixotog.scalascraper.scraper.ContentExtractors$.$anonfun$element$1(HtmlExtractor.scala:84)
	at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:62)
	at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$anonfun$extract$1(ScrapingOps.scala:16)
	at scalaz.Monad.$anonfun$map$2(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.point(Id.scala:20)
	at scalaz.Monad.$anonfun$map$1(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.bind(Id.scala:22)
	at scalaz.Monad.map(Monad.scala:14)
	at scalaz.Monad.map$(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.map(Id.scala:19)
	at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:10)
	at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
	at 

North American Plant Protection Organization's (NAPPO) Phytosanitary Alert System

Accessible through https://pestalert.org, the North American Plant Protection Organization's (NAPPO) Phytosanitary Alert System facilitates awareness, detection, prevention and management of exotic pest species in North America.

Pestalert.org claims that pest alerts/reports are intended to comply with the International Plant Protection Convention’s Standard on Pest Reporting (ISPM 17: 2002).

From International Plant Protection Convention’s Standard on Pest Reporting (ISPM 17: 2002) accessed at https://www.ippc.int/static/media/files/publication/en/2017/06/ISPM_17_2002_En_2017-05-25_PostCPM12_InkAm.pdf :
Pest reports should not be confidential. However, national systems for surveillance, domestic reporting, verification, and analysis may contain confidential information.

https://ipmdata.ipmcenters.org - Integrated Pest Management database for commodities grown in the United States

https://ipmdata.ipmcenters.org aims to be "the cornerstone of integrated pest management."

Organized in regional centers, the initiative seems to put out guidelines on how to create documents like plans, profiles, timelines and elements (e.g., https://ipmdata.ipmcenters.org/pmsp_guidelines.pdf). These documents are geared toward human consumption; a machine-readable version of the documents has yet to be discovered.

National Agricultural Pest Information System (NAPIS)

Accessible through http://pest.ceris.purdue.edu/index.php, the "National Agricultural Pest Information System (NAPIS): Public Access Site" (hosted by the U.S. Department of Agriculture, Animal and Plant Health Inspection Service, and Purdue University's Entomology Department, Center for Environmental and Regulatory Information Systems) provides information about pests through initiatives like the Cooperative Agricultural Pest Survey (http://caps.ceris.purdue.edu/pest-lists): "The National CAPS Committee will approve annually a 'Priority Pest List' that will include the commodity and taxon pests and the pests on the AHP Prioritized List (Appendix G), and be based on input by PPQ, the States, the Center for Plant Health Science and Technology (CPHST) (i.e. pest ranking, feasibility of survey, and pest identification), and commodity organizations. States will select from this list to complete the Priority Survey portion of CAPS."

The lists are published as xlsx matrices with scientific names for pests (but not for crops). These seem suitable for transcription.

Resource links include:

ars-grin scrape fails on network outage

In a recent grin scrape (see https://build.berkeleybop.org/job/extract-grin-traits/20), a (temporary?) network outage caused the scrape to halt after generating 1.3 GB of trait data over 1 day and 18 hours of running.

@cmungall was this a planned outage on bbop's jenkins?

@austinmeier Let me know if you think this is a reason to implement retry mechanisms to be a little more resilient against intermittent network outages (a rough sketch follows the stack trace below).

[https://npgsweb.ars-grin.gov/gringlobal/methodaccession.aspx?id1=51081&id2=494186] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/methodaccession.aspx?id1=51081&id2=11002] downloading ...
Exception in thread "main" java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:209)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:512)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
	at org.planteome.samara.ResourceUtil$class.get(ResourceUtil.scala:15)
	at org.planteome.samara.ScraperGrin$.get(ScraperGrin.scala:6)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7$$anonfun$apply$8.apply(ScraperGrin.scala:51)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7$$anonfun$apply$8.apply(ScraperGrin.scala:50)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7.apply(ScraperGrin.scala:50)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7.apply(ScraperGrin.scala:47)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1.apply(ScraperGrin.scala:47)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1.apply(ScraperGrin.scala:44)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$.getAccessionIds(ScraperGrin.scala:44)
	at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:14)
	at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:13)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.planteome.samara.ScraperGrin$.scrape(ScraperGrin.scala:13)
	at org.planteome.samara.Samara$.delayedEndpoint$org$planteome$samara$Samara$1(Samara.scala:41)
	at org.planteome.samara.Samara$delayedInit$body.apply(Samara.scala:6)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.planteome.samara.Samara$.main(Samara.scala:6)
	at org.planteome.samara.Samara.main(Samara.scala)
Build step 'Execute shell' marked build as failure
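
If retries turn out to be worthwhile, here is a minimal sketch of the kind of wrapper that could go around the page downloads (a sketch only; the names, attempt count and pause length are illustrative and not part of samara):

import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Retry an arbitrary block up to maxAttempts times, pausing between attempts.
  def withRetry[T](maxAttempts: Int, pauseMillis: Long = 5000L)(block: => T): T =
    Try(block) match {
      case Success(result) => result
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(pauseMillis)
        withRetry(maxAttempts - 1, pauseMillis)(block)
      case Failure(e) => throw e
    }
}

// usage sketch: RetrySketch.withRetry(maxAttempts = 3) { Jsoup.connect(url).get() }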

ars-grin jenkins job fails after upgrade to scala 2.12

The scrape job set up at https://build.berkeleybop.org/job/extract-grin-traits/23/console failed due to a change in the Scala version. @cmungall any chance you can update the start script to java -jar 'target/scala-2.12/samara-assembly-*.jar' scrape grin?

snippet from jenkins console:

+ java -jar 'target/scala-2.11/samara-assembly-*.jar' scrape grin
Error: Unable to access jarfile target/scala-2.11/samara-assembly-*.jar
Build step 'Execute shell' marked build as failure

introduce grin identifiers to facilitate in term mappings

@cmungall suggested introducing GRIN id prefixes in preparation for (non-trivial) term mappings to TO / NCBITaxon etc.

Here's my suggestion for prefixes:

GRIN taxon id: prefix GRINTaxon:, e.g. GRINTaxon:300359 (https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=300359)
GRIN descriptor id: prefix GRINDesc:, e.g. GRINDesc:68104 (https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=68104)
GRIN method id: prefix GRINMethod:, e.g. GRINMethod:391002 (https://npgsweb.ars-grin.gov/gringlobal/method.aspx?id=391002)
GRIN accession id: prefix GRINAccess:, e.g. GRINAccess:1140225 (https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1140225)
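
A minimal sketch of how such prefixed identifiers could be derived from the numeric ids in the GRIN detail-page URLs (purely illustrative; only the prefix names above are part of the proposal):

object GrinCurieSketch {
  // Map a GRIN detail page name to its proposed prefix.
  val prefixForPage: Map[String, String] = Map(
    "taxonomydetail"   -> "GRINTaxon",
    "descriptordetail" -> "GRINDesc",
    "method"           -> "GRINMethod",
    "AccessionDetail"  -> "GRINAccess"
  )

  // e.g. toCurie("taxonomydetail", 300359) == Some("GRINTaxon:300359")
  def toCurie(page: String, id: Int): Option[String] =
    prefixForPage.get(page).map(prefix => s"$prefix:$id")
}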

@austinmeier @cmungall curious to hear your thoughts.

resolve PubMed Ids for grin references

as suggested by @austinmeier

resolve PubMed Ids for grin references like:
D.Z. Skinner. 1999. Non random chloroplast DNA hypervariability in Medicago sativa. Theor Appl Genet Theoretical and applied genetics; international journal of b.

and

D.H. Basigalup, D.K. Barnes, and R.E. Stucker. 1995. Development of a Core Collection for Perennial Medicago Plant Introductions. Crop Sci 35:1163-1168.
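
One possible route is NCBI's E-utilities esearch endpoint, which accepts free-text queries against PubMed. A rough sketch follows (the endpoint is NCBI's public API; the query construction, and the idea of feeding it raw citation strings, are only a starting point and would need refinement to pick a single PMID reliably):

import java.net.URLEncoder
import scala.io.Source

object PubMedLookupSketch {
  // Query NCBI E-utilities esearch for a citation string and return the raw JSON response.
  def search(citation: String): String = {
    val query = URLEncoder.encode(citation, "UTF-8")
    val url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" +
      s"?db=pubmed&retmode=json&term=$query"
    val source = Source.fromURL(url)
    try source.mkString finally source.close()
  }
}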

eurisco dataset includes geospatial-temporal-taxonomic information but no traits

I looked at the EURISCO dataset at https://eurisco.ipk-gatersleben.de/apex/f?p=103:47:::NO and found that no traits are associated with the seed crops. The dataset includes host institutions and lat/lng/elevation. I'll attempt to contact the EURISCO folks to see whether any traits for the respective crops have been associated somewhere else.

Please see the attached data files for specific examples: first20.tsv (first 20 records) and last20.tsv (last 20 records). The dataset includes about 2M records.

first20.tsv.txt
last20.tsv.txt

add source citation apsnet

Currently, no source is specified in samara's scrape of APSNET.

Desired is to make the data source explicit and provide a citation by appending something like:

... sourceCitation sourceUrl accessedAt
... Wick, R., & Dicklow, B. (2000). Diseases of African Daisy(Gerbera jamesonii H. Bolus ex J. D. Hook). Retrieved August 03, 2016, from http://www.apsnet.org/publications/commonnames/Pages/AfricanDaisy.aspx http://www.apsnet.org/publications/commonnames/Pages/AfricanDaisy.aspx 2016-08-03

@austinmeier Any suggestions for the citation format?

missing method id/name in output

Various accession descriptor values are retrieved using multiple methods. Currently, no columns for methods (aka studies/environments) exist, so the values appear to be duplicates.

Suggest adding method id and name columns to the tsv output.

descriptor name does not match descriptor id and definition

It appears that descriptor names in the GRIN scrape do not correspond to their associated descriptor ids and definitions.

For example the line:

GRINTaxon:104918 Medicago sativa L. subsp. falcata (L.) Arcang. GRINDesc:68104 Winter injury (WINTERINJ) In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein. GRINMethod:391002 ALFALFA.PROTBYPASS.93.VOLENEC 79 GRINAccess:1305140 PI 405064

has Winter injury (WINTERINJ) as the descriptor name for GRINDesc:68104, with definition "In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein."

Expected is that GRINDesc:68104 has the descriptor name By pass protein (PROTBYPASS).

@austinmeier hope this helps... also, can you send me the full grin scrape that you did on your infrastructure?

provide GAF examples

as discussed in the July 5, 2015 meeting:

Provide some manually constructed GAF examples for a few scraped grin (and apsnet?) observations.

With these examples, we can write unit tests to automate and test the transformation from tabular (tsv) to GAF format.

apsnet name resolving : resolve suspicious name mappings

samara's scraper of apsnet attempts to extract names of crops and diseases. However, because apsnet pages are far from homogeneous and because the apsnet scraper is not perfect, some of the extracted names are not taxon descriptors. In addition, other names are taxon names that cannot be resolved through various name resolvers. Because samara's view on apsnet is included in GloBI (http://globalbioticinteractions.org), I was able to see that a bunch of names needed some attention using the GloBI status page (see attached image and name report).

Looking into ways to map these names without spending too much time on it. Happy to hear your thoughts @austinmeier @cmungall .

apsnet
apsnet-suspicious-names.tsv.txt

parse host descriptions in APSNET

Currently, host descriptions scraped from APSNET contain non-taxonomic elements, e.g. Diseases of Coffee (Coffea arabica L. - arabica coffee) (Coffea canephora Pierre ex Froehner - robusta coffee).

Expected is that only taxonomic names appear in the host column, e.g. Coffea arabica|Coffea canephora.
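
A minimal sketch of one way to pull the binomials out of such a description (a naive regex over the parenthesized segments; illustrative only and certainly not robust against all APSNET formatting):

object HostNameSketch {
  // Capture the leading "Genus species" of each parenthesized segment,
  // e.g. "(Coffea arabica L. - arabica coffee)" yields "Coffea arabica".
  private val binomial = """\(([A-Z][a-z]+ [a-z]+)""".r

  def extractHosts(description: String): String =
    binomial.findAllMatchIn(description).map(_.group(1)).mkString("|")
}

// HostNameSketch.extractHosts("Diseases of Coffee (Coffea arabica L. - arabica coffee) (Coffea canephora Pierre ex Froehner - robusta coffee)")
// yields "Coffea arabica|Coffea canephora"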

parse failure on ars-grin

as reported by @cmungall from https://build.berkeleybop.org/job/extract-grin-traits/3/console

[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1180774] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1180776] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1180776] downloaded.
Exception in thread "main" java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:854)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.IterableLike$class.head(IterableLike.scala:107)
    at net.ruippeixotog.scalascraper.model.LazyElementQuery.head(ElementQuery.scala:40)
    at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$4.apply(HtmlExtractor.scala:89)
    at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$4.apply(HtmlExtractor.scala:89)
    at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:63)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps$$anonfun$extract$1.apply(ScrapingOps.scala:16)
    at scalaz.Monad$$anonfun$map$1$$anonfun$apply$2.apply(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.point(Id.scala:20)
    at scalaz.Monad$$anonfun$map$1.apply(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.bind(Id.scala:22)
    at scalaz.Monad$class.map(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.map(Id.scala:19)
    at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:10)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$greater$greater(ScrapingOps.scala:20)
    at org.planteome.samara.ParserGrin.parseTaxonInAccessionDetails(ParserGrin.scala:75)
    at org.planteome.samara.ScraperGrin$.getObservationsForAccession(ScraperGrin.scala:58)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply$mcVI$sp(ScraperGrin.scala:16)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply(ScraperGrin.scala:15)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply(ScraperGrin.scala:15)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:15)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:13)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.planteome.samara.ScraperGrin$.scrape(ScraperGrin.scala:13)
    at org.planteome.samara.Samara$.delayedEndpoint$org$planteome$samara$Samara$1(Samara.scala:42)
    at org.planteome.samara.Samara$delayedInit$body.apply(Samara.scala:11)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at org.planteome.samara.Samara$.main(Samara.scala:11)
    at org.planteome.samara.Samara.main(Samara.scala)
Build step 'Execute shell' marked build as failure
Archiving artifacts

ars-grin - some scraped records are missing GRINTaxon entries

When experimenting with https://github.com/samara-datasets/grin-to-trait-ontology, I found that some rows in a grin scrape do not contain GRINTaxon entries.

Here's an example related to rows containing GRINAccess:1322495. The last line does not appear to have a GRINTaxon associated with it. This concerns about 43k out of 6.7M records; the full list is attached.
no_grintaxa.zip

verbatim_taxon_id verbatim_taxon_name resolved_taxon_id descriptor_id descriptor_name descriptor_definition method_id method_name observed_value accession_id accession_number accession_name collected_from citations NONE descriptor_id descriptor_name
GRINTaxon:40598 Triticum monococcum L. subsp. aegilopoides (Link) Thell. NCBITaxon:52163 GRINDesc:65059 Core Subset A flag to indicate the accession is part of the core subset GRINMethod:280 WHEAT.CORE.95 Y - YES, THE ACCESSION IS PART OF THE CORE SUBSET GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE GRINDesc:65059 Core Subset
GRINTaxon:40598 Triticum monococcum L. subsp. aegilopoides (Link) Thell. NCBITaxon:52163 GRINDesc:65003 Days to Anthesis Days from January 1 (Julian) when 50% of the spikes are fully exserted from the boot. See also related descriptor Days to Flowering. GRINMethod:282 WHEAT.AGRON.MARICOPA.95 117 GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE GRINDesc:65003 Days to Anthesis
  GRINMethod:490583 WHEAT.LAB.MARICOPA.95 10 GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE 10 GRINAccess:1322495          

grin scraper retrieves taxon details page for each accession

Here's an example of the same taxon page being downloaded repeatedly. Introducing a caching mechanism of sorts might help make the scrape a little more efficient (a rough sketch follows the log below).

[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460686] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460686] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1460686] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1460686] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460687] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460687] downloaded.
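
A minimal caching sketch along those lines, assuming some download function performs the actual HTTP fetch (names are illustrative, not samara's actual API):

import scala.collection.mutable

object CachedDownloadSketch {
  private val cache = mutable.Map.empty[String, String]

  // Download a page at most once per run; later requests for the same URL
  // return the cached body. `download` stands in for the real HTTP fetch.
  def getCached(url: String)(download: String => String): String =
    cache.getOrElseUpdate(url, download(url))
}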

Header row has too many columns

I just noticed that when I split the resulting Grin.tsv file using tabs, I end up with 25 columns in the header row... as opposed to the 13 in the data rows. Might be a formatting thing. I looked into it, and I cannot seem to figure out why it's happening.

Command I was using: cat newest_GRIN_scrape.tsv |awk '{FS="\t";print NF}'
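
One thing that might explain part of the discrepancy: with awk, assigning FS inside the main action only takes effect for the next record, so the first (header) line is still split on whitespace rather than tabs. Something like awk -F'\t' '{print NF}' newest_GRIN_scrape.tsv should report the tab-delimited field count for every row, including the header.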

provide GAF examples

as discussed in the July 5, 2015 meeting:

@austinmeier to provide some manually constructed GAF examples for a few scraped grin (and apsnet?) observations.

With these examples, we can write unit tests to automate and test the transformation from tabular (tsv) to GAF format.

parse list of genus/species pathogens into separate interactions

Some names mentioned in apsnet (e.g. Genus Allexivirus; Garlic viruses A-D (GVA, GVB, GVC, GVD), Garlic virus X (GVX), Garlic mite-borne mosaic virus (GMbMV), Shallot virus X (ShVX) from http://www.apsnet.org/publications/commonnames/Pages/OnionandGarlic.aspx) are actually lists of species of a specific genus.

Currently the list is interpreted as a single taxon.

Desired is to parse the list such that species (and their genus) are separated line by line.

So, Genus Allexivirus; Garlic viruses A-D (GVA, GVB, GVC, GVD), Garlic virus X (GVX), Garlic mite-borne mosaic virus (GMbMV), Shallot virus X (ShVX) would turn into:

Genus Species
Allexivirus Garlic viruses A-D (GVA, GVB, GVC, GVD)
Allexivirus Garlic virus X (GVX)
Allexivirus Garlic mite-borne mosaic virus (GMbMV)
Allexivirus Shallot virus X (ShVX)
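
A rough parsing sketch for this particular pattern: peel off the leading "Genus X;" and then split the remainder on commas that fall outside parentheses (illustrative only, tuned to the example above rather than to apsnet pages in general):

object GenusSpeciesSketch {
  // "Genus Allexivirus; A (a, b), C" yields List(("Allexivirus", "A (a, b)"), ("Allexivirus", "C"))
  private val GenusList = """Genus\s+(\S+);\s*(.*)""".r

  def parse(verbatim: String): List[(String, String)] = verbatim match {
    case GenusList(genus, rest) =>
      rest.split(""",(?![^(]*\))""").toList.map(s => (genus, s.trim)).filter(_._2.nonEmpty)
    case _ => List(("", verbatim))
  }
}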

@austinmeier please advise

strange (incorrect) mappings of pathogens in APS

We have found a few oddities in the name matching for the APS scrape. Some of the pathogens are being identified as "EST" entries from NCBI. An example will help illustrate this:

mapped pathogen: NCBITaxon:1585532 (Beta vulgaris/Cercospora beticola mixed EST library)

The verbatim name: Beet curly top virus (BCTV)

The correct pathogen: NCBITaxon:10840 (Beet curly top virus)

It appears to me that the algorithm is being greedy in some way, stopping after recognizing "Beet" and mapping to beet (Beta vulgaris). But I have no idea why it maps to the mixed EST library instead of mapping to just plain beet.

Here is the offending line from the scrape:

Curly top Beet curly top virus (BCTV) Beet curly top virus (BCTV) NCBITaxon:1585532 pathogen of http://purl.obolibrary.org/obo/RO_0002556 Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others) Citrullus spp., Cucumis spp., Cucurbita spp., and s NCBITaxon:3653 R. D. Martyn, M. E. Miller and B. D. Bruton, primary collators (last update 2/19/93). Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others). The American Phytopathological Society. Accessed on 2016-09-07 at http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx 2016-09-07

create scraper for article titles of APS Journal Plant Disease

I noticed that the publication "Plant Disease" of the American Phytopathological Society contains article titles that are structured pretty consistently. For example, see articles from http://apsjournals.apsnet.org/toc/pdis/99/3:

First Report of Bacterial Blight of Crucifers Caused by Pseudomonas cannabina pv. alisalensis in Minnesota on Arugula (Eruca vesicaria subsp. sativa) see http://apsjournals.apsnet.org/doi/abs/10.1094/PDIS-07-14-0672-PDN .

or

First Report of Late Blight Caused by Phytophthora infestans Clonal Lineage US-23 on Potato in Idaho see http://apsjournals.apsnet.org/doi/abs/10.1094/PDIS-02-14-0196-PDN .
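
If this lead were pursued, a first pass at a title parser could look something like the sketch below (a naive regex for the "... Caused by <pathogen> ..." shape of the two examples above; illustrative only):

object DiseaseTitleSketch {
  // Pulls (disease, pathogen) out of titles shaped like
  // "First Report of <disease> Caused by <pathogen> on/in <host or place> ...".
  private val Pattern = """First Report of (.+?) Caused by (.+?)(?: on | in ).*""".r

  def parse(title: String): Option[(String, String)] = title match {
    case Pattern(disease, pathogen) => Some((disease.trim, pathogen.trim))
    case _ => None
  }
}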

@austinmeier @cmungall @jaiswalp does this sound like an interesting lead? Are there any other publications that we can mine for abstracts/titles?

Trait name/descriptor label from GRIN scrape

We would love the output to include a column with the descriptor name, perhaps between 'descriptor id' and 'descriptor definition'.

Example: https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=68104
Returns
descriptor ID = 68104
descriptor definition = In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein.

but lacks the actual descriptor name: By pass protein (PROTBYPASS)

Again, if this is not clear, let me know.
Thanks!

Add NCBI taxon ID to APS scrape?

Is it possible to search the pathogens and hosts that have been scraped from APSnet against the NCBI taxonomy and return the NCBITaxonID in the scraped file?

Millennium Seed Bank not (yet) open

In reviewing potential usage of the Millennium Seed Bank, I noticed the following page: http://brahmsonline.kew.org/msbp/Legal . The page (see text below) seems to indicate that the database is not open and that special agreements need to be put in place to use the data.

retrieved from http://brahmsonline.kew.org/msbp/Legal on 6 Dec 2017


The data contained in the MSBP Data Warehouse is available to registered members of the MSBP Data Warehouse User Group. It is our intention to make seed collection information resources freely available for the seed conservation and research community.

You may make copies, including electronic copies, of the data held within this database provided that it is for your own personal use or for use within your organisation. If you use the data in published works then please use the attribution: "Data sourced via Millennium Seed Bank Partnership Data Warehouse http://brahmsonline.kew.org/msbp".

Following a search, summary details can be downloaded directly from this website for up to 1000 records. If you wish to download a larger dataset or make use of the data in other ways then please contact the MSBP Data Warehouse administrator.

While every effort has been taken to ensure that the information held in the database is reliable, RBG Kew is not responsible for any errors or omissions in the data or any damages arising from the use of the data.

include references in grin scrape

Currently, the grin scrape does not include references to scientific publications.

As discussed in #11, including the references in the scrape is desirable.
