
samara's Introduction

samara*

A commandline tool to extract plant trait data from open data sources.


Plant trait data from recent automated scrapes are available at APSNet or ARS-GRIN.

For more information about the APSNet scraping process, including name matching, please go here.

*A samara is a winged achene, a type of fruit in which a flattened wing of fibrous, papery tissue develops from the ovary wall (from https://en.wikipedia.org/wiki/Samara_(fruit), accessed 2016-06-10).

prerequisites

sbt 0.13.8+, Java JDK 8+, git, Maven 3.3+

build/test

  1. clone this repo
  2. build the jar with sbt assembly: a stand-alone jar samara-assembly-[version].jar will be created in target/scala-2.11/
  3. run the tests with sbt test

download

Don't like building your own jar? Go to releases, pick a release and download the jar from there.

usage

  1. list available sources with java -jar samara-assembly-[version].jar list
  2. scrape a source called apsnet and write the results to apsnet.tsv with java -jar samara-assembly-[version].jar scrape apsnet > apsnet.tsv


samara's Issues

Kew Garden Seed Information Database not (yet) open data

In reviewing the suitability of using the Kew Garden Seed Information Database, I stumbled across its collaboration page. The text (see below) seems to indicate that the entire database is available to collaborators after signing a data transfer agreement. This indicates that the data is not (yet) available under an open data license (http://opendefinition.org/od/).

From http://data.kew.org/sid/collaborations.html, accessed 6 Dec 2017:


Wherever possible, we will collaborate with other researchers and owners of large comparative datasets. Collaboration involves activities like us running species lists against SID for matches (more convenient than record by record on-line). We have previously shared large amounts of data from the database with collaborators (and are keen to do so in future) under the terms of a data transfer agreement, and with appropriate arrangements for acknowledgement and, where appropriate, authorship in any papers arising.

A couple of examples of papers resulting from previous collaborations are:

Moles et al; 2005; A Brief History of Seed Size; Science; 307(5709): 576-580

Tweddle et al; 2003; Ecological Aspects of Seed Desiccation Sensitivity; Journal of Ecology; 91(2)294-304

grin scraper: after first connection timeout, all requests fail: throttling?

From https://build.berkeleybop.org/job/extract-grin-traits/25/ , it appears that after downloading about 400k pages from ars-grin, a connection timeout occurred, followed by further connection timeouts and other connectivity issues; no successful download/scrape occurs after that, leaving about 13k failed downloads and 41 crops unexamined (starting from Sorghum and ending with Zinnia). This explains why the result of https://build.berkeleybop.org/job/extract-grin-traits/25/ is about 1.2GB, whereas the successful scrape before that (e.g. https://build.berkeleybop.org/job/extract-grin-traits/5/) was 2.5GB.

I've added some extra logging to better understand the connection errors. This extra info will hopefully tell us whether the scrape process is actively blocked or whether the failures can be fixed on our side.

Here's the transition from successful scrapes to the timeouts:

[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1409315] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1409315] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409330] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409330] download failed because of [Read timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409333] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409334] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409334] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409348] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409348] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409360] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409360] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409366] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409366] download failed because of [connect timed out] .
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409385] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1409385] download failed because of [npgsweb.ars-grin.gov] .

possible manual transcription of apsnet turfgrasses http://www.apsnet.org/publications/commonnames/Pages/Turfgrasses.aspx

After inspecting some more preliminary results of the apsnet scraper, I found that the turfgrasses page shows some particularly creative usage of abbreviations and name references. It seems that building a scraper for this page would take more time than either (a) convincing the authors of the page to be a little more explicit or (b) manually transcribing the page using a format similar to samara's apsnet output.

http://ecoport.org - The name "EcoPort" is a composite acronym derived from the words 'Ecology and Portal'.

from http://ecoport.org - The name "EcoPort" is a composite acronym derived from the words 'Ecology and Portal'.

EcoPort is a single, open-society, contiguous, communal, wiki integrated with a Relational Database and its Management System (RDBMS) on the Internet. This is why we refer to it as a relational wiki.

The last news event was reported in 2006, and the pages seem to be intended for human consumption. Machine-readable datasets associated with EcoPort have yet to be discovered.

Identify evidence code for GAF2

As discussed in #11

The evidence code (column 7) is shown as "EXP (static)" in the wiki. This is not necessarily the case for all traits. The evidence code will change based on how the data was collected. Some traits are inferred, or calculated from other measurements, and thus will require a different evidence code. It will be difficult to decide on these codes automatically without human intervention.

grin scraper fails on parsing descriptordetails page

For some reason, descriptor detail 310094 at https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094 displays an error page. This causes the grin parser to misbehave.

From https://build.berkeleybop.org/job/extract-grin-traits/27/console :

[https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=310094] downloaded.
Exception in thread "main" java.util.NoSuchElementException
	at java.util.ArrayList$Itr.next(ArrayList.java:854)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:40)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
	at scala.collection.IterableLike.head(IterableLike.scala:106)
	at scala.collection.IterableLike.head$(IterableLike.scala:105)
	at net.ruippeixotog.scalascraper.model.LazyElementQuery.head(ElementQuery.scala:40)
	at net.ruippeixotog.scalascraper.scraper.ContentExtractors$.$anonfun$element$1(HtmlExtractor.scala:84)
	at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:62)
	at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$anonfun$extract$1(ScrapingOps.scala:16)
	at scalaz.Monad.$anonfun$map$2(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.point(Id.scala:20)
	at scalaz.Monad.$anonfun$map$1(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.bind(Id.scala:22)
	at scalaz.Monad.map(Monad.scala:14)
	at scalaz.Monad.map$(Monad.scala:14)
	at scalaz.IdInstances$$anon$1.map(Id.scala:19)
	at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:10)
	at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
	at 

North American Plant Protection Organization's (NAPPO) Phytosanitary Alert System

Accessible through https://pestalert.org, the North American Plant Protection Organization's (NAPPO) Phytosanitary Alert System facilitates awareness, detection, prevention and management of exotic pest species in North America.

Pestalert.org claims that pest alerts/reports are intended to comply with the International Plant Protection Convention’s Standard on Pest Reporting (ISPM 17: 2002).

From International Plant Protection Convention’s Standard on Pest Reporting (ISPM 17: 2002) accessed at https://www.ippc.int/static/media/files/publication/en/2017/06/ISPM_17_2002_En_2017-05-25_PostCPM12_InkAm.pdf :
Pest reports should not be confidential. However, national systems for surveillance, domestic reporting, verification, and analysis may contain confidential information.

https://ipmdata.ipmcenters.org - Integrated Pest Management database for commodities grown in the United States

https://ipmdata.ipmcenters.org aims to be "the cornerstone of integrated pest management."

Organized in regional centers, the initiative seems to put out guidelines on how to create documents like plans, profiles, timelines and elements (e.g., https://ipmdata.ipmcenters.org/pmsp_guidelines.pdf). These documents are geared toward human consumption; a machine-readable version of the documents has yet to be discovered.

National Agricultural Pest Information System (NAPIS)

Accessible through http://pest.ceris.purdue.edu/index.php, the "National Agricultural Pest Information System (NAPIS): Public Access Site" (hosted by the U.S. Department of Agriculture, Animal and Plant Health Inspection Service, and Purdue University's Entomology Department, Center for Environmental and Regulatory Information Systems) provides information about pests through initiatives like the Cooperative Agricultural Pest Survey (http://caps.ceris.purdue.edu/pest-lists): "The National CAPS Committee will approve annually a 'Priority Pest List' that will include the commodity and taxon pests and the pests on the AHP Prioritized List (Appendix G), and be based on input by PPQ, the States, the Center for Plant Health Science and Technology (CPHST) (i.e. pest ranking, feasibility of survey, and pest identification), and commodity organizations. States will select from this list to complete the Priority Survey portion of CAPS."

The lists are published as xlsx matrices with scientific names for pests (but not for crops). These seem suitable for transcription.

Resource links include:

ars-grin scrape fails on network outage

In a recent grin scrape (see https://build.berkeleybop.org/job/extract-grin-traits/20), a (temporary?) network outage caused the scrape to halt after generating 1.3 GB of trait data over 1 day and 18 hours of running.

@cmungall was this a planned outage on bbop's jenkins?

@austinmeier Let me know if you think this is a reason to implement retry mechanisms to be a little more resilient against intermittent network outages (a rough sketch follows the stack trace below).

[https://npgsweb.ars-grin.gov/gringlobal/methodaccession.aspx?id1=51081&id2=494186] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/methodaccession.aspx?id1=51081&id2=11002] downloading ...
Exception in thread "main" java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:209)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
	at sun.security.ssl.InputRecord.read(InputRecord.java:503)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:512)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
	at org.planteome.samara.ResourceUtil$class.get(ResourceUtil.scala:15)
	at org.planteome.samara.ScraperGrin$.get(ScraperGrin.scala:6)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7$$anonfun$apply$8.apply(ScraperGrin.scala:51)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7$$anonfun$apply$8.apply(ScraperGrin.scala:50)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7.apply(ScraperGrin.scala:50)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1$$anonfun$apply$7.apply(ScraperGrin.scala:47)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1.apply(ScraperGrin.scala:47)
	at org.planteome.samara.ScraperGrin$$anonfun$getAccessionIds$1.apply(ScraperGrin.scala:44)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.planteome.samara.ScraperGrin$.getAccessionIds(ScraperGrin.scala:44)
	at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:14)
	at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:13)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.planteome.samara.ScraperGrin$.scrape(ScraperGrin.scala:13)
	at org.planteome.samara.Samara$.delayedEndpoint$org$planteome$samara$Samara$1(Samara.scala:41)
	at org.planteome.samara.Samara$delayedInit$body.apply(Samara.scala:6)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at org.planteome.samara.Samara$.main(Samara.scala:6)
	at org.planteome.samara.Samara.main(Samara.scala)
Build step 'Execute shell' marked build as failure
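
If retries turn out to be worthwhile, here is a minimal sketch of the kind of wrapper that could go around the page downloads (a sketch only; the names, attempt count and pause length are illustrative and not part of samara):

import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Retry an arbitrary block up to maxAttempts times, pausing between attempts.
  def withRetry[T](maxAttempts: Int, pauseMillis: Long = 5000L)(block: => T): T =
    Try(block) match {
      case Success(result) => result
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(pauseMillis)
        withRetry(maxAttempts - 1, pauseMillis)(block)
      case Failure(e) => throw e
    }
}

// usage sketch: RetrySketch.withRetry(maxAttempts = 3) { Jsoup.connect(url).get() }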

ars-grin jenkins job fails after upgrade to scala 2.12

The scrape job set up at https://build.berkeleybop.org/job/extract-grin-traits/23/console failed due to a change in the Scala version. @cmungall any chance you can update the start script to java -jar 'target/scala-2.12/samara-assembly-*.jar' scrape grin?

snippet from jenkins console:

+ java -jar 'target/scala-2.11/samara-assembly-*.jar' scrape grin
Error: Unable to access jarfile target/scala-2.11/samara-assembly-*.jar
Build step 'Execute shell' marked build as failure

introduce grin identifiers to facilitate in term mappings

@cmungall suggested introducing GRIN id prefixes in preparation for (non-trivial) term mappings to TO / NCBITaxon etc.

Here's my suggestion for prefixes:

GRIN taxon id: prefix GRINTaxon:, e.g. GRINTaxon:300359 (https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=300359)
GRIN descriptor id: prefix GRINDesc:, e.g. GRINDesc:68104 (https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=68104)
GRIN method id: prefix GRINMethod:, e.g. GRINMethod:391002 (https://npgsweb.ars-grin.gov/gringlobal/method.aspx?id=391002)
GRIN accession id: prefix GRINAccess:, e.g. GRINAccess:1140225 (https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1140225)
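
A minimal sketch of how such prefixed identifiers could be derived from the numeric ids in the GRIN detail-page URLs (purely illustrative; only the prefix names above are part of the proposal):

object GrinCurieSketch {
  // Map a GRIN detail page name to its proposed prefix.
  val prefixForPage: Map[String, String] = Map(
    "taxonomydetail"   -> "GRINTaxon",
    "descriptordetail" -> "GRINDesc",
    "method"           -> "GRINMethod",
    "AccessionDetail"  -> "GRINAccess"
  )

  // e.g. toCurie("taxonomydetail", 300359) == Some("GRINTaxon:300359")
  def toCurie(page: String, id: Int): Option[String] =
    prefixForPage.get(page).map(prefix => s"$prefix:$id")
}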

@austinmeier @cmungall curious to hear your thoughts.

resolve PubMed Ids for grin references

as suggested by @austinmeier

resolve PubMed Ids for grin references like:
D.Z. Skinner. 1999. Non random chloroplast DNA hypervariability in Medicago sativa. Theor Appl Genet Theoretical and applied genetics; international journal of b.

and

D.H. Basigalup, D.K. Barnes, and R.E. Stucker. 1995. Development of a Core Collection for Perennial Medicago Plant Introductions. Crop Sci 35:1163-1168.
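
One possible route is NCBI's E-utilities esearch endpoint, which accepts free-text queries against PubMed. A rough sketch follows (the endpoint is NCBI's public API; the query construction, and the idea of feeding it raw citation strings, are only a starting point and would need refinement to pick a single PMID reliably):

import java.net.URLEncoder
import scala.io.Source

object PubMedLookupSketch {
  // Query NCBI E-utilities esearch for a citation string and return the raw JSON response.
  def search(citation: String): String = {
    val query = URLEncoder.encode(citation, "UTF-8")
    val url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" +
      s"?db=pubmed&retmode=json&term=$query"
    val source = Source.fromURL(url)
    try source.mkString finally source.close()
  }
}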

eurisco dataset includes geospatial-temporal-taxonomic information but no traits

I looked at the EURISCO dataset at https://eurisco.ipk-gatersleben.de/apex/f?p=103:47:::NO and found that no traits are associated with the seed crops. The dataset includes host institutions and lat/lng/elevation. I'll attempt to contact the EURISCO folks to see whether any traits for the respective crops have been associated somewhere else.

Please see the attached data files for specific examples: first20.tsv (first 20 records) and last20.tsv (last 20 records). The dataset includes about 2M records.

first20.tsv.txt
last20.tsv.txt

add source citation apsnet

Currently, no source is specified in samara's scrape of APSNET.

Desired is to make the data source explicit and provide a citation by appending something like:

... sourceCitation sourceUrl accessedAt
... Wick, R., & Dicklow, B. (2000). Diseases of African Daisy(Gerbera jamesonii H. Bolus ex J. D. Hook). Retrieved August 03, 2016, from http://www.apsnet.org/publications/commonnames/Pages/AfricanDaisy.aspx http://www.apsnet.org/publications/commonnames/Pages/AfricanDaisy.aspx 2016-08-03

@austinmeier Any suggestions for the citation format?

missing method id/name in output

Various accession descriptor values are retrieved using multiple methods. Currently, no columns for methods (aka studies/environments) exist, so the values appear to be duplicates.

Suggest adding method id and name columns to the tsv output.

descriptor name does not match descriptor id and definition

It appears that descriptor names in the GRIN scrape do not correspond to their associated descriptor ids and definitions.

For example the line:

GRINTaxon:104918 Medicago sativa L. subsp. falcata (L.) Arcang. GRINDesc:68104 Winter injury (WINTERINJ) In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein. GRINMethod:391002 ALFALFA.PROTBYPASS.93.VOLENEC 79 GRINAccess:1305140 PI 405064

has Winter injury (WINTERINJ) as the descriptor name for GRINDesc:68104, with definition "In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein."

Expected is that GRINDesc:68104 has the descriptor name By pass protein (PROTBYPASS).

@austinmeier hope this helps... also, can you send me the full grin scrape that you did on your infrastructure?

provide GAF examples

as discussed in the July 5, 2015 meeting:

Provide some manually constructed GAF examples for a few scraped grin (and apsnet?) observations.

With these examples, we can write unit tests to automate and test the transformation from tabular (tsv) to GAF format.

apsnet name resolving : resolve suspicious name mappings

samara's scraper of apsnet attempts to extract names of crops and diseases. However, because apsnet pages are far from homogeneous and because the apsnet scraper is not perfect, some of the extracted names are not taxon descriptors. In addition, other names are taxon names that cannot be resolved through various name resolvers. Because samara's view on apsnet is included in GloBI (http://globalbioticinteractions.org), I was able to see that a bunch of names needed some attention using the GloBI status page (see attached image and name report).

Looking into ways to map these names without spending too much time on it. Happy to hear your thoughts @austinmeier @cmungall .

apsnet
apsnet-suspicious-names.tsv.txt

parse host descriptions in APSNET

Currently, host descriptions scraped from APSNET contain non-taxonomic elements, e.g. Diseases of Coffee (Coffea arabica L. - arabica coffee) (Coffea canephora Pierre ex Froehner - robusta coffee).

Expected is that only taxonomic names appear in the host column, e.g. Coffea arabica|Coffea canephora.
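
A minimal sketch of one way to pull the binomials out of such a description (a naive regex over the parenthesized segments; illustrative only and certainly not robust against all APSNET formatting):

object HostNameSketch {
  // Capture the leading "Genus species" of each parenthesized segment,
  // e.g. "(Coffea arabica L. - arabica coffee)" yields "Coffea arabica".
  private val binomial = """\(([A-Z][a-z]+ [a-z]+)""".r

  def extractHosts(description: String): String =
    binomial.findAllMatchIn(description).map(_.group(1)).mkString("|")
}

// HostNameSketch.extractHosts("Diseases of Coffee (Coffea arabica L. - arabica coffee) (Coffea canephora Pierre ex Froehner - robusta coffee)")
// yields "Coffea arabica|Coffea canephora"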

parse failure on ars-grin

as reported by @cmungall from https://build.berkeleybop.org/job/extract-grin-traits/3/console

[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1180774] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1180776] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1180776] downloaded.
Exception in thread "main" java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:854)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.IterableLike$class.head(IterableLike.scala:107)
    at net.ruippeixotog.scalascraper.model.LazyElementQuery.head(ElementQuery.scala:40)
    at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$4.apply(HtmlExtractor.scala:89)
    at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$4.apply(HtmlExtractor.scala:89)
    at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:63)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps$$anonfun$extract$1.apply(ScrapingOps.scala:16)
    at scalaz.Monad$$anonfun$map$1$$anonfun$apply$2.apply(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.point(Id.scala:20)
    at scalaz.Monad$$anonfun$map$1.apply(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.bind(Id.scala:22)
    at scalaz.Monad$class.map(Monad.scala:14)
    at scalaz.IdInstances$$anon$1.map(Id.scala:19)
    at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:10)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
    at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$greater$greater(ScrapingOps.scala:20)
    at org.planteome.samara.ParserGrin.parseTaxonInAccessionDetails(ParserGrin.scala:75)
    at org.planteome.samara.ScraperGrin$.getObservationsForAccession(ScraperGrin.scala:58)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply$mcVI$sp(ScraperGrin.scala:16)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply(ScraperGrin.scala:15)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1$$anonfun$apply$1.apply(ScraperGrin.scala:15)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:15)
    at org.planteome.samara.ScraperGrin$$anonfun$scrape$1.apply(ScraperGrin.scala:13)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.planteome.samara.ScraperGrin$.scrape(ScraperGrin.scala:13)
    at org.planteome.samara.Samara$.delayedEndpoint$org$planteome$samara$Samara$1(Samara.scala:42)
    at org.planteome.samara.Samara$delayedInit$body.apply(Samara.scala:11)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at org.planteome.samara.Samara$.main(Samara.scala:11)
    at org.planteome.samara.Samara.main(Samara.scala)
Build step 'Execute shell' marked build as failure
Archiving artifacts

ars-grin - some scraped records are missing GRINTaxon entries

When experimenting with https://github.com/samara-datasets/grin-to-trait-ontology, I found that some rows in a grin scrape do not contain GRINTaxon entries.

Here's an example related to rows containing GRINAccess:1322495. The last line does not appear to have a GRINTaxon associated with it. This concerns about 43k out of 6.7M records; the full list is attached.
no_grintaxa.zip

verbatim_taxon_id verbatim_taxon_name resolved_taxon_id descriptor_id descriptor_name descriptor_definition method_id method_name observed_value accession_id accession_number accession_name collected_from citations NONE descriptor_id descriptor_name
GRINTaxon:40598 Triticum monococcum L. subsp. aegilopoides (Link) Thell. NCBITaxon:52163 GRINDesc:65059 Core Subset A flag to indicate the accession is part of the core subset GRINMethod:280 WHEAT.CORE.95 Y - YES, THE ACCESSION IS PART OF THE CORE SUBSET GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE GRINDesc:65059 Core Subset
GRINTaxon:40598 Triticum monococcum L. subsp. aegilopoides (Link) Thell. NCBITaxon:52163 GRINDesc:65003 Days to Anthesis Days from January 1 (Julian) when 50% of the spikes are fully exserted from the boot. See also related descriptor Days to Flowering. GRINMethod:282 WHEAT.AGRON.MARICOPA.95 117 GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE GRINDesc:65003 Days to Anthesis
  GRINMethod:490583 WHEAT.LAB.MARICOPA.95 10 GRINAccess:1322495 PI 427448   Malatya Turkey M.N. Rouse and Y. Jin. 2011. Stem rust resistance in A-genome diploid relatives of wheat. Pl Dis 95:941-944. NONE 10 GRINAccess:1322495          

grin scraper retrieves taxon details page for each accession

Here's an example of the same taxon page being downloaded repeatedly. Introducing a caching mechanism of sorts might help make the scrape a little more efficient (a rough sketch follows the log below).

[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460686] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460686] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1460686] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionObservation.aspx?id=1460686] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/taxonomydetail.aspx?id=19333] downloaded.
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460687] downloading ...
[https://npgsweb.ars-grin.gov/gringlobal/AccessionDetail.aspx?id=1460687] downloaded.
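
A minimal caching sketch along those lines, assuming some download function performs the actual HTTP fetch (names are illustrative, not samara's actual API):

import scala.collection.mutable

object CachedDownloadSketch {
  private val cache = mutable.Map.empty[String, String]

  // Download a page at most once per run; later requests for the same URL
  // return the cached body. `download` stands in for the real HTTP fetch.
  def getCached(url: String)(download: String => String): String =
    cache.getOrElseUpdate(url, download(url))
}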

Header row has too many columns

I just noticed that when I split the resulting Grin.tsv file using tabs, I end up with 25 columns in the header row... as opposed to the 13 in the data rows. Might be a formatting thing. I looked into it, and I cannot seem to figure out why it's happening.

Command I was using: cat newest_GRIN_scrape.tsv |awk '{FS="\t";print NF}'
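
One thing that might explain part of the discrepancy: with awk, assigning FS inside the main action only takes effect for the next record, so the first (header) line is still split on whitespace rather than tabs. Something like awk -F'\t' '{print NF}' newest_GRIN_scrape.tsv should report the tab-delimited field count for every row, including the header.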

provide GAF examples

as discussed in the July 5, 2015 meeting:

@austinmeier to provide some manually constructed GAF examples for a few scraped grin (and apsnet?) observations.

With these examples, we can write unit tests to automate and test the transformation from tabular (tsv) to GAF format.

parse list of genus/species pathogens into separate interactions

Some names mentioned in apsnet (e.g. Genus Allexivirus; Garlic viruses A-D (GVA, GVB, GVC, GVD), Garlic virus X (GVX), Garlic mite-borne mosaic virus (GMbMV), Shallot virus X (ShVX) from http://www.apsnet.org/publications/commonnames/Pages/OnionandGarlic.aspx) are actually lists of species of a specific genus.

Currently the list is interpreted as a single taxon.

Desired is to parse the list such that species (and their genus) are separated line by line.

So, Genus Allexivirus; Garlic viruses A-D (GVA, GVB, GVC, GVD), Garlic virus X (GVX), Garlic mite-borne mosaic virus (GMbMV), Shallot virus X (ShVX) would turn into:

Genus Species
Allexivirus Garlic viruses A-D (GVA, GVB, GVC, GVD)
Allexivirus Garlic virus X (GVX)
Allexivirus Garlic mite-borne mosaic virus (GMbMV)
Allexivirus Shallot virus X (ShVX)
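
A rough parsing sketch for this particular pattern: peel off the leading "Genus X;" and then split the remainder on commas that fall outside parentheses (illustrative only, tuned to the example above rather than to apsnet pages in general):

object GenusSpeciesSketch {
  // "Genus Allexivirus; A (a, b), C" yields List(("Allexivirus", "A (a, b)"), ("Allexivirus", "C"))
  private val GenusList = """Genus\s+(\S+);\s*(.*)""".r

  def parse(verbatim: String): List[(String, String)] = verbatim match {
    case GenusList(genus, rest) =>
      rest.split(""",(?![^(]*\))""").toList.map(s => (genus, s.trim)).filter(_._2.nonEmpty)
    case _ => List(("", verbatim))
  }
}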

@austinmeier please advise

strange (incorrect) mappings of pathogens in APS

We have found a few oddities in the name matching for the APS scrape. Some of the pathogens are being identified as "EST" entries from NCBI. An example will help illustrate this:

mapped pathogen: NCBITaxon:1585532 (Beta vulgaris/Cercospora beticola mixed EST library)

The verbatim name: Beet curly top virus (BCTV)

The correct pathogen: NCBITaxon:10840 (Beet curly top virus)

It appears to me that the algorithm is being greedy in some way, stopping after recognizing "Beet" and mapping to beet (Beta vulgaris). But I have no idea why it maps to the mixed EST library instead of mapping to just plain beet.

Here is the offending line from the scrape:

Curly top Beet curly top virus (BCTV) Beet curly top virus (BCTV) NCBITaxon:1585532 pathogen of http://purl.obolibrary.org/obo/RO_0002556 Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others) Citrullus spp., Cucumis spp., Cucurbita spp., and s NCBITaxon:3653 R. D. Martyn, M. E. Miller and B. D. Bruton, primary collators (last update 2/19/93). Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others). The American Phytopathological Society. Accessed on 2016-09-07 at http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx 2016-09-07

create scraper for article titles of APS Journal Plant Disease

I noticed that the publication "Plant Disease" of the American Phytopathological Society contains article titles that are structured pretty consistently. For example, see articles from http://apsjournals.apsnet.org/toc/pdis/99/3:

First Report of Bacterial Blight of Crucifers Caused by Pseudomonas cannabina pv. alisalensis in Minnesota on Arugula (Eruca vesicaria subsp. sativa) see http://apsjournals.apsnet.org/doi/abs/10.1094/PDIS-07-14-0672-PDN .

or

First Report of Late Blight Caused by Phytophthora infestans Clonal Lineage US-23 on Potato in Idaho see http://apsjournals.apsnet.org/doi/abs/10.1094/PDIS-02-14-0196-PDN .
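
If this lead were pursued, a first pass at a title parser could look something like the sketch below (a naive regex for the "... Caused by <pathogen> ..." shape of the two examples above; illustrative only):

object DiseaseTitleSketch {
  // Pulls (disease, pathogen) out of titles shaped like
  // "First Report of <disease> Caused by <pathogen> on/in <host or place> ...".
  private val Pattern = """First Report of (.+?) Caused by (.+?)(?: on | in ).*""".r

  def parse(title: String): Option[(String, String)] = title match {
    case Pattern(disease, pathogen) => Some((disease.trim, pathogen.trim))
    case _ => None
  }
}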

@austinmeier @cmungall @jaiswalp does this sound like an interesting lead? Are there any other publications that we can mine for abstracts/titles?

Trait name/descriptor label from GRIN scrape

We would love the output to include a column with the descriptor name, perhaps between 'descriptor id' and 'descriptor definition'.

Example: https://npgsweb.ars-grin.gov/gringlobal/descriptordetail.aspx?id=68104
Returns
descriptor ID = 68104
descriptor definition = In-vitro dry matter disappearance (ivdmd) expressed as a percent of the cultivar venal. Higher than 100% suggests low digestibility & higher by-pass protein.

but lacks the actual descriptor name: By pass protein (PROTBYPASS)

Again, if this is not clear, let me know.
Thanks!

Add NCBI taxon ID to APS scrape?

Is it possible to search the pathogens and hosts that have been scraped from APSnet against the NCBI taxonomy and return the NCBITaxonID in the scraped file?

Millennium Seed Bank not (yet) open

In reviewing potential usage of the Millennium Seed Bank, I noticed the following page: http://brahmsonline.kew.org/msbp/Legal . The page (see text below) seems to indicate that the database is not open and that special agreements need to be put in place to use the data.

retrieved from http://brahmsonline.kew.org/msbp/Legal on 6 Dec 2017


The data contained in the MSBP Data Warehouse is available to registered members of the MSBP Data Warehouse User Group. It is our intention to make seed collection information resources freely available for the seed conservation and research community.

You may make copies, including electronic copies, of the data held within this database provided that it is for your own personal use or for use within your organisation. If you use the data in published works then please use the attribution: "Data sourced via Millennium Seed Bank Partnership Data Warehouse http://brahmsonline.kew.org/msbp".

Following a search, summary details can be downloaded directly from this website for up to 1000 records. If you wish to download a larger dataset or make use of the data in other ways then please contact the MSBP Data Warehouse administrator.

While every effort has been taken to ensure that the information held in the database is reliable, RBG Kew is not responsible for any errors or omissions in the data or any damages arising from the use of the data.

include references in grin scrape

Currently, the grin scrape does not include references to scientific publications.

As discussed in #11, including the references in the scrape is desirable.
