
workshop-resources's People

Contributors

chreman, doubi, grahamsteel, jcmolloy, petermr, skasberger, tarrow

workshop-resources's Issues

diversify queries in getpapers

Don't always use the same query for all data sources. Show different kinds of queries, so people learn about different sources and query functionality at the same time.

describe data structures better

Describe better what the bib.json, results.json and scholarly.html files contain in our case, and add links to proper documentation for general interest.

explain results of getpapers

After the first call of getpapers that produces results, we should explain those results. Where are the fulltext files, and why are they missing? Is it okay that something is missing? What needs to be inside the CProject?

  • results.json
  • fulltext.html
  • fulltext.pdf
  • fulltext.xml
  • fulltext_html_urls.txt: URLs for which a fulltext was found
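
One way to approach "is it okay that something is missing?" is to count how many paper directories actually contain each file. A minimal sketch, assuming each paper sits in its own subdirectory of the project directory (as in the directory listings elsewhere in this tracker). The function name `count_files` and the default file list are illustrative and not part of getpapers.

```python
from collections import Counter
from pathlib import Path

# Per-paper file names taken from the list above; adjust to your project.
EXPECTED = ("fulltext.html", "fulltext.pdf", "fulltext.xml")


def count_files(project_dir, expected=EXPECTED):
    """Return (number of paper directories, Counter of expected files found)."""
    paper_dirs = [p for p in Path(project_dir).iterdir() if p.is_dir()]
    found = Counter()
    for paper in paper_dirs:
        for name in expected:
            if (paper / name).is_file():
                found[name] += 1
    return len(paper_dirs), found
```

Calling `count_files("my_project")` and printing `found[name]` against the total for each name gives a quick picture of which formats were retrieved for how many papers.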

Feedback from 2015-12-10-lifesciences workshop

See repo and pad

General software functionality

  • Lots of people wanted to use with existing PDF collections - this could be a specific module/tutorial. Need @petermr input to ensure people know how to try but are aware results could be patchy.
  • Just a proposal: all the tools can be integrated into a tool sharing platform like 'Docker'. That way users would not have to run a virtual machine or install individual tools.
  • Not sure if already noted: Powering off VM (e.g. after lunch) or opening editor ('Geany'), will reset keyboard input to non-US

QS/Getpapers

  • Lots of Qs about APIs/scrapers and links are not obvious without reading through all the software tutorial text. Could have a 'related links' at the top of the repo as well as the bottom or 'Are you looking for?' links.
  • The VMs did not have scraper definitions so need to include these or give a clear instruction to download them with the following command: git clone https://github.com/ContentMine/journal-scrapers.git
  • Trainer note: Need to be clear that while they're welcome to try their own material, they should follow all of the tutorials in order to use the papers for norma/AMI tutorials OR need bundles of files as input and output from each step in case it fails.
  • Should flag in software tutorials that if they're using the vm they don't need to install the software and can use npm rather than sudo npm
  • People wanted to work out how many papers they'd found before downloading them: getpapers --noexecute --query 'dinosaurs' etc
  • If you get too many papers, try reducing the date range to a manageable number (see getpapers and EPMC query syntax). Our experience is ca 300 papers/min.
  • To kill getpapers - Ctrl-C (maybe twice)

Qs

  • What is the .json file exactly (results)?
  • Can we restrict the number of items to download?
  • Can we use multiple search term Boolean queries to narrow results at getpapers stage?
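
On the first question, a quick way to see what a results .json file holds is to load it and list the keys its records use. A minimal sketch, assuming the file is a JSON array with one object per paper (worth checking against your own file). `summarise_results` is an illustrative name, not a ContentMine function.

```python
import json
from collections import Counter


def summarise_results(path):
    """Return (number of records, Counter of the keys used across records)."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    keys = Counter()
    for record in records:
        if isinstance(record, dict):
            keys.update(record.keys())
    return len(records), keys
```

The record count says how many papers the query matched, and `keys.most_common()` shows which metadata fields are present in most records.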

Norma

  • Some people found that when running getpapers WITHOUT quickscrape, norma works but running quickscrape between getpapers and norma caused issues.

Qs

  • Where can we find a list for all the --transform arguments?

AMI

RegEx tutorial: Could you please give example results? I can run up to the second-to-last step and get regex and results folders, but all the ones I opened manually are empty. How many results should there be for food?

Documentation

  • Some specific comments that formatting all commands as code would be useful (I don't know how much code is not formatted as such, it might only have been a few lines)
  • One participant commented 'Generally got a bit lost navigating between materials on GitHub pages'
  • Overall feedback (personal, subjective): I often got lost, because I do not have a conceptual map of which step takes what input and creates what output once the very simple steps (getpapers, scrape) were done. Vaguely getting there, but the pace was too fast. I was also often uncertain which directory I needed to be in and how to access output. Maybe provide a hard copy of the pipeline for the steps in the workshop, and a rough 'directory' map of where stuff ends up (if the user records steps/doesn't change locations).
  • One issue: If a tutorial was carried out on your own search (i.e. not on dinosaurs), you can no longer carry out later tutorials. E.g. I ran norma on a different topic, and now I cannot follow ami2-species as I have no species in my text. Could you provide intermediary output folders, so that if you have to skip forward in tutorials, you can still follow (in future courses)? I have scholarly.html, but that doesn't contain species info.
  • Overall, instructions are not very clear to me. Some command line commands are in 'picture-type' format (e.g. in a tree example), others in plain text. They are sequential, so you cannot follow later stages if you get stuck anywhere (not so great).

explain the data source visually/interactively

Use e.g. EUPMC and show how our software works compared to the on-site search. Communicate how big the dataset itself is. Explain exactly what they offer through the API, under which terms, and where to find more information about this. Use visual explanation, so that people understand the connection to EUPMC.

ctree link is not working

The ctree link in "This is also one of the starting points for a ctree, the main datastructure of the ContentMine pipeline." is not working.

describe getpapers better

What exactly is getpapers doing? Describe it better, especially what the results are (all the possibilities) and how it relates to quickscrape.

norma fails if additional directories are in the project dir without a fulltext.xml file

Repeatable error. If a set of XML files is collected using getpapers and further directories (or possibly files) are then added to that project directory, norma will fail as follows:

norma -q test_eupmc_neylon/ -i fulltext.xml -o scholarly.html --transform nlm2html
[...]
154  [main] ERROR org.xmlcml.norma.NormaTransformer  - no transforms given/parsed

Starting over with a fresh getpapers -x into a new directory, followed by norma works as expected:

>> rm -rf test_eupmc_neylon/
>> getpapers -q 'AUTH:"Cameron Neylon"' -o neylon-xmls -x
[...]
>> norma -q neylon-xmls -i fulltext.xml -o scholarly.html --transform nlm2html
[...]
>> ls neylon-xmls/PMC3720848/
fulltext.xml  scholarly.html
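
As a workaround until this is fixed, it may help to check the project directory for subdirectories that lack the input file before running norma. A minimal sketch: the function name `dirs_missing_input` is hypothetical, it only inspects the directory, and it does not change how norma behaves. It looks at subdirectories only, so stray files at the top level are not reported.

```python
from pathlib import Path


def dirs_missing_input(project_dir, input_name="fulltext.xml"):
    """List subdirectories of project_dir that do not contain input_name."""
    return sorted(
        p.name
        for p in Path(project_dir).iterdir()
        if p.is_dir() and not (p / input_name).is_file()
    )
```

An empty list from `dirs_missing_input("test_eupmc_neylon")` would suggest the directory contains only paper directories with the expected input. Anything it returns is a candidate to move elsewhere before retrying.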

explain output of getpapers better

What do we get from the API call? Which files can/must be found?

Explain the size of the data source better:

  • how many papers are available?
  • is it the same as on the web search?
  • is it the full dataset from a publisher?

Screen shots for Bubbles

Create a walk-through for demonstrators (or self-learners) of the complete screen history of Ursus maritimus in Bubbles.

Also add the existing bubbles.mov to the repo.

Contentmine pipeline

I have a database with a list of PMIDs. I want to mine the text of all open-access articles in this list of PMIDs and get the most frequently used terms/keywords/subjects.

I have tested getpapers and seen how powerful and efficient it is at getting papers. I then moved on to quickscrape and tried downloading PDFs based on the URL list in the eupmc_fulltext_html_urls.txt file that getpapers outputs.

Seeing that I can use the -p option in a getpapers query to download PDFs, my question is: why should I use quickscrape? Also, in this video, from the 1:29 minute mark, Peter-Murray is able to skim through PDFs quite easily. How does he do that? I am using an Ubuntu 14.04 LTS box; how can I skim through PDFs like that on Ubuntu? Still on the video, at the 2:23 minute mark, Peter-Murray writes what seems like Java code to filter the files for sequences and key terms. Which tool is he using to do that? Is it part of the ContentMine API? I am not sure if what I have written above qualifies as an issue, but I am really keen to understand ContentMine and how best I can use it for my project.

Thanks

AM
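
For the "most frequently used terms" part, a rough first pass can be done directly on the downloaded XML. This is not how the ContentMine tools (e.g. ami) do it. A minimal sketch, assuming each paper directory holds a fulltext.xml as in the getpapers -x listings elsewhere in this tracker; `term_counts` and its parameters are illustrative names, and the tokenisation is deliberately naive.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

WORD = re.compile(r"[a-z]{3,}")  # runs of three or more ASCII letters


def term_counts(project_dir, input_name="fulltext.xml", stopwords=()):
    """Count lower-cased words across every <paper dir>/<input_name> file."""
    counts = Counter()
    skip = set(stopwords)
    for xml_path in sorted(Path(project_dir).glob(f"*/{input_name}")):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            continue  # skip files the XML parser cannot read
        text = " ".join(root.itertext()).lower()
        counts.update(w for w in WORD.findall(text) if w not in skip)
    return counts
```

`term_counts("my_project", stopwords={"the", "and"}).most_common(20)` would then list the twenty most frequent remaining words. A proper stopword list and stemming would be needed for anything beyond a first look.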

norma installation java error

I get the following error message, when installing norma:
mvn clean install

[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.xml-cml:norma:jar:0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for com.mycila.maven-license-plugin:maven-license-plugin is missing. @ line 92, column 12
[WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:cobertura-maven-plugin is missing. @ line 53, column 12
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building norma 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ norma ---
[INFO] Deleting /Users/hildegaa/Documents/sra_upload/map_papers/norma/target
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ norma ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 46 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ norma ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 85 source files to /Users/hildegaa/Documents/sra_upload/map_papers/norma/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/pubstyle/getpapers/GetPapers.java:[14,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/html/HtmlCleaner.java:[8,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[22,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaTransformer.java:[28,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[269,42] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[315,17] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[347,17] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[354,57] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[375,17] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[445,16] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/util/TransformerWrapper.java:[25,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaTransformer.java:[82,17] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.NormaTransformer
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/util/TransformerWrapper.java:[171,38] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.util.TransformerWrapper
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[14,30] cannot find symbol
symbol: class CMDir
location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[36,17] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.input.pdf.PDF2XHTMLConverter
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[42,35] cannot find symbol
symbol: class CMDir
location: class org.xmlcml.norma.input.pdf.PDF2XHTMLConverter
[INFO] 16 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.497 s
[INFO] Finished at: 2016-04-08T11:46:06+01:00
[INFO] Final Memory: 23M/316M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project norma: Compilation failure: Compilation failure:
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/pubstyle/getpapers/GetPapers.java:[14,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/html/HtmlCleaner.java:[8,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[22,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaTransformer.java:[28,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[269,42] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[315,17] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[347,17] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[354,57] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[375,17] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaArgProcessor.java:[445,16] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaArgProcessor
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/util/TransformerWrapper.java:[25,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/NormaTransformer.java:[82,17] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.NormaTransformer
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/util/TransformerWrapper.java:[171,38] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.util.TransformerWrapper
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[14,30] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: package org.xmlcml.cmine.files
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[36,17] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.input.pdf.PDF2XHTMLConverter
[ERROR] /Users/hildegaa/Documents/sra_upload/map_papers/norma/src/main/java/org/xmlcml/norma/input/pdf/PDF2XHTMLConverter.java:[42,35] cannot find symbol
[ERROR] symbol: class CMDir
[ERROR] location: class org.xmlcml.norma.input.pdf.PDF2XHTMLConverter
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Can't open fulltext.pdf files.

I can't open any fulltext.pdf file. The files were created with
getpapers -q "The title" -o "my_folder" -p
The file is stored at this path, but opening it returns an 'error loading PDF file' message.
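
A quick first check is whether the file is a PDF at all. A failed download can leave an empty file or an HTML error page saved under a .pdf name (this is a guess at one possible cause, not a diagnosis of this report). A minimal sketch, with `looks_like_pdf` as an illustrative helper name:

```python
def looks_like_pdf(path):
    """Return True if the PDF signature '%PDF-' occurs in the first 1024 bytes."""
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(1024)
```

If this returns False for a fulltext.pdf, opening the file in a text editor should show what was actually downloaded.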

improve navigation/usage

from #52

  • Some specific comments that formatting all commands as code would be useful (I don't know how much code is not formatted as such, it might only have been a few lines)
  • One participant commented 'Generally got a bit lost navigating between materials on GitHub pages'
  • Overall feedback (personal, subjective): I often got lost, because I do not have a conceptual map of which step takes what input and creates what output once the very simple steps (getpapers, scrape) were done. Vaguely getting there, but the pace was too fast. I was also often uncertain which directory I needed to be in and how to access output. Maybe provide a hard copy of the pipeline for the steps in the workshop, and a rough 'directory' map of where stuff ends up (if the user records steps/doesn't change locations).
  • One issue: If a tutorial was carried out on your own search (i.e. not on dinosaurs), you can no longer carry out later tutorials. E.g. I ran norma on a different topic, and now I cannot follow ami2-species as I have no species in my text. Could you provide intermediary output folders, so that if you have to skip forward in tutorials, you can still follow (in future courses)? I have scholarly.html, but that doesn't contain species info.
  • Overall, instructions are not very clear to me. Some command line commands are in 'picture-type' format (e.g. in a tree example), others in plain text. They are sequential, so you cannot follow later stages if you get stuck anywhere (not so great).

Alpha documentation for Canary

Create a schematic, with links, of the currently working components of Canary for (a) tutorial presenters and (b) self-learners.
Also annotate any Canary training module with an indication that it is alpha.

Tutorials as slidedecks / PDFs

I'm going to demo CM tools in York later this month:
https://jonxhill.wordpress.com/2016/11/15/tools-and-methods-for-constructing-the-tree-of-life/

Are there any slidedecks specifically on practical usage of CM tools? (aside from legal issues, presenting the concept of CM et cetera)

The github resources are fantastic but they aren't formatted for slide-presentation. Before I go and make a slide deck myself, I was just wondering if there was any up-to-date slide-based material?

I will of course share back here any slides I make for York :)

set up shell tutorial as story

Show a typical use case in which you learn, step by step, the shell functions you need, e.g. the one we used for MozFest.

better description of quickscrape and getpapers

Describe better what quickscrape does in the introduction section.

We need a better entry point for getpapers and quickscrape.

  • quickscrape is for HTML-based publishing platforms like PLOS, eLife, etc.
  • getpapers is for API-based publication retrieval
  • a diagram in the README.md file should help people understand this better; it should also show the processes before and after the software packages are used
  • use use cases which are clearly separate from each other: EUPMC for getpapers, and some HTML which is not stored on EUPMC for quickscrape

Cproject Structure Query

Hi,

This document https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/cproject seems to claim to be definitive about Cproject structure, but seems to be at odds with this document about the output of ami https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/ami/README.md#ami2-species. In the CProject definition the extent of say a sequence results directory looks to be much simpler than the apparent results described in the tutorial.

CProject folder structure:

│   ├── results
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml

ami output tutorial

│   ├── results
│   │   ├── sequence
│   │   │   ├── rna
│   │   │   │   └── empty.xml
│   │   │   ├── dna
│   │   │   │   └── empty.xml
│   │   │   └── prot
│   │   │       └── empty.xml

I'm trying to write a parser for CProjects. Could you let me know whether the ami tools are going to produce lots of directories (e.g. ami2seq generating sequence/sequencetype folders) or, as the CProject document suggests, just the sequence/dnaprimer folder? Or is the info in one of these docs out of date?

Thanks for clarification.
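
Until the two documents agree, one defensive option for a parser is to discover whatever directories exist under results/ rather than hard-code their names. A minimal sketch, assuming the two-level layout shown in both trees above (results/<plugin>/<type>/). The labels "plugin" and "type" and the function name `results_layout` are hypothetical, not ContentMine terminology.

```python
from pathlib import Path


def results_layout(ctree_dir):
    """Map each directory under results/ to its sorted subdirectory names."""
    results = Path(ctree_dir) / "results"
    if not results.is_dir():
        return {}
    return {
        plugin.name: sorted(t.name for t in plugin.iterdir() if t.is_dir())
        for plugin in sorted(results.iterdir())
        if plugin.is_dir()
    }
```

With this approach the parser copes with either layout: it would report `{"sequence": ["dnaprimer"]}` for the first tree and `{"sequence": ["dna", "prot", "rna"]}` for the second.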
