nextflow-io / training
Nextflow training material
Home Page: https://training.nextflow.io/
License: Other
Add some narrative, a couple of lines, then a table of contents
Replace Nextflow overview image with an image from Seqera Labs slides.
Make compute environment images smaller, see Seqera Slides
Describe each line of Your First Script (as in example1 on the Nextflow website).
Image: Switch to the top-to-bottom DAG layout. Remove the Seqera logo stripes and replace them with the actual code for each task. Important to show that splitLetters is 1 process with 1 task and convertToUpper is 1 process with 2 tasks. Try in Figma.
For RNAseq, provide more background, i.e. there is a series of scripts, script1.nf to script7.nf, each building on the previous.
Explain parameters: Parameters are inputs and options that can be changed when the pipeline is run.
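To support that explanation, a minimal illustration could sit right next to the sentence (the path and script name below are hypothetical, not the actual training code):

```nextflow
// Default value, used when no --reads option is given on the command line
params.reads = "$projectDir/data/*_{1,2}.fq"

// Overridable at launch time:
//   nextflow run script1.nf --reads 'other/path/*_{1,2}.fq'
println "reads: ${params.reads}"
```

Showing the default assignment and the `--reads` override side by side makes the "inputs and options that can be changed" point concrete.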
TYPO "The second example" should say, "The second script"
TYPO In index process "process that creates a binary"
Change params.transcriptome to params.transcriptome_file, this way in the process there is no confusion.
TYPO "defines a $transcriptome variable"
TYPO "-with-docker to launch each task of the execution as a Docker container run command"
In all scripts and docs, replace pair_id with sample_id.
Not essential, but it would be best if we have all the programs loaded as a single image.
At the moment, we have all the code being run in the .gitpod.yml init and script sections.
I assume it will be faster and simpler to have this prebuilt in a container
We need a section early on in the tutorial to explain this.
Plus, some basic screenshots to show where to find the newer versions and updates.
Currently we only have working training account on chriswyatt1 (without private Seqera material).
We need to get this fully on seqeralabs and off chriswyatt1. We had some issues with permissions before.
The channel, processes and operators sections are pretty much copy and paste from the docs.
These sections could be way more interactive, building on the previous sections and be less dry than at present.
e.g. all channel types tested in the processes section, to see how they have been used.
Personally, I think I found this all tricky to grasp when I did the training. Having real world examples as exercises could help bring this section to life.
We are finding weird behaviour in the docs:
If a sub-doc does not have at least three heading levels in the page (=, ==, and ===), it will break and not print the lines correctly in the next doc.
In channel.adoc, if the text runs longer than 326 lines, just adding "The code below creates a channel containing 24 samples from a chromatin dynamics study (from SRA) and runs FASTQC on the resulting files." at line 326 breaks the next section (processes).
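For reference, the three heading levels mentioned above look like this in AsciiDoc (titles are placeholders):

```asciidoc
= Page title
== Section
=== Subsection
```

A minimal sub-doc with only one or two of these levels should reproduce the rendering bug.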
We need documentation for how we set this up. Hosting etc.
Also, need instructions in the docs for how users access the gitpod run of the training
See exercise in section 4.3.1
We need a new setup all.sh with correct nextflow version.
```
## NF version
export NXF_VER=20.04.1-edge
echo "export NXF_VER=20.04.1-edge" >> ~/.bashrc
mkdir -p ~/bin
```
The above was what we used, which meant dsl2 and running from git didn't work.
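A sketch of a corrected setup script, assuming we pin a later release that supports DSL2 and running from git (the exact version number here is a placeholder to be confirmed):

```shell
## NF version: pin a DSL2-capable release (placeholder version, to be confirmed)
export NXF_VER=22.10.6
echo "export NXF_VER=${NXF_VER}" >> "$HOME/.bashrc"
mkdir -p "$HOME/bin"
```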
We also need documentation on how to:
Update the web training contents page
Create users for AWS.
How to set up the environment.
RNAseq pipeline:
- Add val(pair_id) to the input, not pair_id isolated on its own (the default).
- $baseDir/results should be results.
- Use sampleId and not pairId, for clarity.
- baseDir is deprecated; it should be projectDir.
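To make the points above concrete, a sketch of what the quantification process could look like after the renames (the process body and salmon arguments are illustrative, not the actual training code):

```nextflow
process quantification {
    // relative path; no need for "$baseDir/results"
    publishDir 'results'

    input:
    tuple val(sampleId), path(reads)

    output:
    path sampleId

    script:
    """
    salmon quant --libType=U -i index -1 ${reads[0]} -2 ${reads[1]} -o $sampleId
    """
}
```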
Things to do:
If we switch to gitpod, singularity may not be possible to demonstrate.
I tried to find other repos with singularity. https://community.gitpod.io/t/singularity/5486/5
Looks like we would have to use a virtual machine to get around mount permissions.
See Snakemake example
Creating this issue here to keep track of the feedback
Abhinav:
Include pair programming to allow people to use their knowledge practically
Work on an advanced course
Apart from -c, -C, -ansi-log, -with-tower, -with-docker, -resume, etc., we should point people to the docs. But we should also mention other key CLI options for run in the examples, e.g.:
-bg
-with-report
-with-trace
...
We could probably add many more.
I think it didn't resume because of the usage of stdout
```nextflow
process convertToUpper {
    input:
    file y from letters.flatten()

    output:
    stdout into result

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}
```
I have run into it a couple of times earlier as well (with DSL2).
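If stdout output is indeed what breaks -resume, one possible workaround (an assumption, not a confirmed fix) is to write the result to a file instead:

```nextflow
process convertToUpper {
    input:
    file y from letters.flatten()

    output:
    file 'upper.txt' into result

    """
    cat $y | tr '[a-z]' '[A-Z]' > upper.txt
    """
}
```

File outputs give Nextflow a stable artifact to cache, which should make re-runs resumable.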
The Python version needs brackets added (print() requires parentheses in Python 3).
Where is this documented (getting individual process information printed to the screen), or is it a typo?
Need to go through all docs and check that the code in the asciidoc files is the same as in the repo files.
Resource requirements such as CPU and memory can change between workflow executions and platforms. Nextflow can use $task.cpus as a variable for the number of CPUs.
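A short snippet to accompany that explanation (the directive value and tool invocation are illustrative):

```nextflow
process index {
    cpus 2    // requested CPUs; $task.cpus resolves to this value at run time

    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
```

Using $task.cpus in the command keeps the script in sync with whatever the cpus directive (or a config override) requests.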
Setup: Improve instructions for Cloud9 or switch to Gitpod. Make it clear that the "local" setup is not used in the Cloud9 tutorial.
Talk
"embarassingly parrelization" sounds wrong. -> Change on the slide to "embarrassingly parallelizable".
What is idempotent? Maybe too technical
Mention that nf-core pipelines are active and community-owned.
Hello world
Mention what we are trying to achieve in this section of the tutorial (a toy example using text, but normally we use file inputs).
Where does task.cpus come from? We don't explain it (people always get confused and ask questions at this point). Either remove it or explain it.
RNA-Seq
May need to rename the index to salmon_index for that process[done]
Make sure directives doc has all the links to cpus memory etc.[done, added link to docs in directives section]
Day 3:
Value and queue channels should be under different headers.
Weird function within the compression example. remove, and make sure it works.[done]
We should mention that most times you point to a script in a separate folder
The terms "item" and "element" are used interchangeably, and people get confused.
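For the value/queue split, a two-line example could head each new section (the contents are placeholders):

```nextflow
ch_value = Channel.value('GRCh38')   // value channel: bound once, can be read any number of times
ch_queue = Channel.of(1, 2, 3)       // queue channel: each item is consumed exactly once
```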
Day 4:
Listen to the recording of the best DSL2 section run-through, and rewrite to fit.
Azure not in configuration docs
Random points
There are always questions about quote marks. How do we address this? Write a help section on it.
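Such a help section could open with a minimal demonstration of the interpolation rules (variable names are arbitrary):

```nextflow
x = 'world'
println "Hello $x"    // double quotes interpolate Groovy variables: Hello world
println 'Hello $x'    // single quotes do not: Hello $x

// Inside a script block, escape with \$ to reach Bash variables instead:
//   echo "Nextflow var: $x, Bash var: \$HOME"
```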
Gitpod:
Having the Nextflow install in the init section of Gitpod doesn't work. We should just package the lot into a container.
Ask users to log in to Gitpod the day before. It may avoid GitHub issues if they re-enter a Gitpod env. Need to test this.
Save the chat after each Zoom; it is useful for picking up questions that help build the training (and docs).
Grep not finding file
Check answers to exercises:
https://github.com/seqeralabs/nf-training-public/blob/master/asciidocs/processes.adoc#L350
https://github.com/seqeralabs/nf-training-public/blob/master/asciidocs/processes.adoc#L476
https://github.com/seqeralabs/nf-training-public/blob/master/asciidocs/processes.adoc#L535
Currently the tower section follows the docs (https://github.com/seqeralabs/nf-training-public/blob/master/asciidocs/tower.adoc), with help showing how to run tower for the first time and basic functions etc.
We need to review what we have already here.
Maybe though, we should think about rewriting this section to follow how it is taught in the tutorial. With examples
Still, many of the examples do not work, which is fine for hypothetical examples, but it would be cooler if they actually processed data too.
E.g. Blast run Section:6.4.
To complete the DSL2 part, we can create:
INCLUDE and aliases
For material see:
PDF attached
https://www.nextflow.io/docs/latest/dsl2.html
https://github.com/seqeralabs/nf-training-public/tree/master/nf-training/dsl2
Need to ensure this works for the tutorial outside of the AWS run.
Line 460- https://github.com/seqeralabs/nf-training-public/blob/Autumn_update/asciidocs/rnaseq_pipeline.adoc
There should be working local and gitpod setups.
Also, the AWS Cloud9 setup is different from what we ask for in the tutorial (Evan to update).
How to upload the tarball as a file to aws s3 bucket?
E.g. exercise in containers.adoc
What can we put for the answer for this exercise:
=== Exercise
Use the option -u $(id -u):$(id -g) to allow Docker to create files with the right permissions.
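A candidate answer sketch for the exercise dropdown (the image name and touched file are arbitrary; it is an assumption that this matches the exercise setup):

```console
$ docker run -u $(id -u):$(id -g) -v "$PWD":"$PWD" -w "$PWD" ubuntu:20.04 touch out.txt
$ ls -l out.txt    # owned by the invoking user rather than root
```

Without -u, files created inside the container are owned by root and cannot be cleaned up by the workshop user.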
Include:
Exercise: Write a custom function that given the compressor name as a parameter, returns the command string to be executed. Then use this function as the process script body.
See here: https://github.com/seqeralabs/nf-training-public/blob/master/asciidocs/processes.adoc#L196
Not sure how to do this.
FYI: The exercise changed from being tcoffee to just compressing files, so that it will work without the need to download software.
Once we have an answer here, we can put it in the exercise dropdown answer.
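A possible answer sketch for that exercise (the helper name, supported compressors, and params.compressor option are all assumptions, not the repo's actual solution):

```nextflow
// Hypothetical helper: given a compressor name, return the command string
def compressCmd(String name, infile) {
    switch (name) {
        case 'gzip':  return "gzip -c $infile > ${infile}.gz"
        case 'bzip2': return "bzip2 -c $infile > ${infile}.bz2"
        default:      throw new IllegalArgumentException("Unknown compressor: $name")
    }
}

process compress {
    input:
    path infile

    output:
    path "${infile}.*"

    script:
    compressCmd(params.compressor, infile)
}
```

Since the script block is just a Groovy expression returning a string, calling the function there is enough to use it as the process script body.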
I noticed that if you close the session, it won't let you reopen the environment. Try opening the following:
https://eu-west-3.console.aws.amazon.com/cloud9/ | IAM User | 195996028523 | user7-22 | Secret123!
Maybe some of the participants have the same issue and don't want to tell us
Use path instead of file in the training doc.
For run, give option examples, e.g. -bg -C myriad.config