
rnaseq's Introduction

nf-core/rnaseq


Get help on Slack · Follow on Twitter · Follow on Mastodon · Watch on YouTube

Introduction

nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.

nf-core/rnaseq metro map

  1. Merge re-sequenced FastQ files (cat)
  2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon)
  3. Read QC (FastQC)
  4. UMI extraction (UMI-tools)
  5. Adapter and quality trimming (Trim Galore!)
  6. Removal of genome contaminants (BBSplit)
  7. Removal of ribosomal RNA (SortMeRNA)
  8. Choice of multiple alignment and quantification routes:
    1. STAR -> Salmon
    2. STAR -> RSEM
    3. HISAT2 -> NO QUANTIFICATION
  9. Sort and index alignments (SAMtools)
  10. UMI-based deduplication (UMI-tools)
  11. Duplicate read marking (picard MarkDuplicates)
  12. Transcript assembly and quantification (StringTie)
  13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
  14. Extensive quality control:
    1. RSeQC
    2. Qualimap
    3. dupRadar
    4. Preseq
    5. DESeq2
  15. Pseudoalignment and quantification (Salmon or Kallisto; optional)
  16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)

Note: The SRA download functionality has been removed from the pipeline (>=3.2) and ported to an independent workflow called nf-core/fetchngs. You can provide --nf_core_pipeline rnaseq when running nf-core/fetchngs to download and auto-create a samplesheet containing publicly available samples that can be accepted directly as input by this pipeline.

Warning: Quantification isn't performed if using --aligner hisat2 due to the lack of an appropriate option to calculate accurate expression estimates from HISAT2-derived genomic alignments. However, you can use this route if you have a preference for the alignment, QC and other types of downstream analysis compatible with the output of HISAT2.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Rows with the same sample identifier are considered technical replicates and merged automatically. The strandedness refers to the library preparation and will be automatically inferred if set to auto.

Now, you can run the pipeline using:

nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir <OUTDIR> \
    --genome GRCh37 \
    -profile <docker/singularity/.../institute>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
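As an illustration, a custom config passed with -c might look like the sketch below (the process selector and resource values are examples only, not pipeline defaults):

// custom.config -- used as: nextflow run nf-core/rnaseq -c custom.config ...
process {
    executor = 'slurm'
    queue    = 'short'

    // Raise resources for a single step; the selector name here is an assumption.
    withName: '.*:STAR_ALIGN' {
        cpus   = 12
        memory = 72.GB
    }
}

// Pipeline parameters (--input, --genome, ...) do not belong in this file;
// pass them on the command line or in a -params-file instead.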

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

This pipeline quantifies RNA-sequenced reads relative to genes/transcripts in the genome and normalizes the resulting data. It does not compare the samples statistically in order to assign significance in the form of FDR or P-values. For downstream analyses, the output files from this pipeline can be analysed directly in statistical environments like R, Julia or via the nf-core/differentialabundance pipeline.

Online videos

A short talk about the history, current status and functionality on offer in this pipeline was given by Harshil Patel (@drpatelh) on 8th February 2022 as part of the nf-core/bytesize series.

You can find numerous talks on the nf-core events page covering various topics, including writing pipelines/modules in Nextflow DSL2, using nf-core tooling and running nf-core pipelines, as well as more generic content like contributing to GitHub. Please check them out!

Credits

These scripts were originally written for use at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, Sweden, by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn).

The pipeline was re-written in Nextflow DSL2 and is primarily maintained by Harshil Patel (@drpatelh) from Seqera Labs, Spain.

The pipeline workflow diagram was initially designed by Sarah Guinchard (@G-Sarah) and James Fellows Yates (@jfy133); further modifications were made by Harshil Patel (@drpatelh) and Maxime Garcia (@maxulysse).

Many thanks to the many others who have helped out along the way too, including (but not limited to) the contributors listed below.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #rnaseq channel (you can join with this invite).

Citations

If you use nf-core/rnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

rnaseq's People

Contributors

adamrtalbot, apeltzer, c-mertes, chris-cheshire, d4straub, drpatelh, edmundmiller, ewels, friederikehanssen, g-sarah, galithil, gavin-kelly-1, ggabernet, grst, hammarn, jfy133, joseespinosa, lpantano, mahesh-panchal, marchoeppner, matthiaszepper, maxulysse, nf-core-bot, olgabot, orionzhou, pinin4fjords, pranathivemuri, rfenouil, robsyme, silviamorins


rnaseq's Issues

com.upplication.s3fs.S3Path error when running with ignite

Hey,
I've tried to run the test profile without changing any of the parameters. The only thing I did change was the working directory to an aws S3 path. The command I ran was:

nextflow run nf-core/rnaseq -w s3://workflow-workdir/rnaseq2 -profile test -with-docker -process.executor ignite

This gave the following error com.upplication.s3fs.S3Path as shown below:

[screenshot of the com.upplication.s3fs.S3Path error]

Nextflow.log file can be found here: nextflow.log

However, I have noticed that I do not get the error with an older version of the repo: 2cf2f0f779d39463b09d874449cbb9fdb9f58d2f

So I think some change between that version of the repo and the present one has caused an error when running rnaseq from an S3 working directory.

Any help to fix this error would be much appreciated. Thanks in advance!

Running pipeline on local workstation & with profile docker

Hi,
This morning I was trying to run the pipeline on my local workstation, 16GB ram and Docker. From my understanding, and according to the docs, I have to provide the --profile docker parameter on the command line. In doing so, I get this error:

WARN: Access to undefined parameter genomes -- Initialise it to a default value eg. params.genomes = some_value
ERROR ~ Cannot get property 'GRCh37' on null object

-- Check script 'main.nf' at line: 113 or see '.nextflow.log' file for more details

After checking the config files I saw that, when using the docker profile, none of the genome references in igenomes.config are loaded, hence the error above.
Of course, I ran Nextflow with --genome GRCh37 as a command-line parameter.

Hope you can help
thx

FastQC missing fonts

Apparently, FastQC is still somewhat broken in the way we use it in 1.5dev right now. Maybe we need to bring the "workaround" back to get this running again...

command.err.txt

Too many input files for MultiQC

I ran the RNA-seq pipeline on 360 samples, and the Slurm submission of MultiQC failed with "Pathname of a file, directory or other parameter too long":

ERROR ~ Error executing process > 'multiqc'
Caused by:
 Failed to submit process to grid scheduler for execution
Command executed:
 sbatch .command.run
Command exit status:
 1
Command output:
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long

The files .command.stub and .command.sh look normal, but .command.run is 11 MB, with many ln commands etc. So it might be related to this bug: https://bugs.schedmd.com/show_bug.cgi?id=2198

Plot dupRadar boxplots

From @ewels on March 8, 2017 9:0

The current MultiQC plot for dupRadar is a little tricky to interpret.

Instead of plotting the dotted line from this plot as is done currently:
[dupRadar scatter plots]

Plot the median values from this boxplot:
[dupRadar boxplots]

These values need to be extracted in the R script. They can then be plotted with the custom_content module if the output syntax is done properly.

Copied from original issue: SciLifeLab#81

ERROR ~ Cannot get property 'Galgal4' on null object

Hi @ewels , @pditommaso , @marchoeppner I am trying to use the nf-core/rnaseq pipeline to analyze some RNAseq data. I am trying to run it on a docker using the following command,

nextflow run nf-core/rnaseq --reads '*_{1,2}.fastq.gz' --genome Galgal4 -profile docker

and am getting the following:

N E X T F L O W  ~  version 0.31.1
Launching `nf-core/rnaseq` [desperate_shirley] - revision: 44f1525d7a [master]
WARN: Access to undefined parameter `genomes` -- Initialise it to a default value eg. `params.genomes = some_value`
ERROR ~ Cannot get property 'Galgal4' on null object

 -- Check script 'main.nf' at line: 113 or see '.nextflow.log' file for more details

I have looked at the log but unfortunately I cannot tell what I am looking for.
How do I "Initialise it to a default value eg. params.genomes = some_value" as suggested in the message above?
Thanks

question about time

Here again with another "issue" while running the pipeline. Two days ago we started the pipeline and it is still running. Our workstation has 64 GB RAM and 16 CPUs. The read files are around 10 MB in size each, and the Java memory variable is set as NXF_OPTS='-Xms4g -Xmx16g'. It looks like the pipeline is generating files from time to time, but still, there must be something wrong in the settings because it's taking so long. If you need to see any specific file to figure out what could be wrong, please let me know. Thanks!

Option to skip certain QC steps

Not all QC steps are always useful for all users. It would be good to have an option to skip some if they’re not needed, to speed up execution.

Main offenders are dupRadar (with MarkDups) and RSeQC/ gene body coverage.

Suggested by @vkaimal

Other gene names to merged_gene_counts

Hello NF-core team,

I am wondering if it would be possible to add an ENSEMBL_GENE_NAME column to the merged_gene_counts.txt file (output of featureCounts), in addition to the already existing ENSEMBL_ID.

It could be quite useful for downstream analysis, especially for bacterial genomes for which no biomaRt translation package is available, so one has to parse the specified GTF file again to get the gene names.

Thank you.

Conflict with user Python libraries

From @pareng on June 26, 2018 10:47

When I run the RNA-seq pipeline v1.4 (on the UPPMAX cluster rackham), the MultiQC step fails due to a version conflict with my local python libraries. The problem seems to be that my local libraries are used instead of the ones in the Singularity image.

As a temporary fix, I've moved the directory containing my local libraries (~/.local), so that Python does not find it. @marcelm, who helped me understand the problem, suggested that a better fix would be to set the environment variable PYTHONNOUSERSITE in the Singluarity image. If it is set to a true value, local libraries should be ignored.

I tried setting it in my shell (export PYTHONNOUSERSITE=x) prior to running nextflow, but that did not help. I guess my environment is overridden by the Singularity image?

Error message from MultiQC (file work/.../.command.err):

[INFO   ]         multiqc : This is MultiQC v1.5
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
[INFO   ]         multiqc : Only using modules custom_content, picard, preseq, rseqc, featureCounts, hisat2, star, cutadapt, fastqc
Traceback (most recent call last):
  File "/opt/conda/envs/ngi-rnaseq/bin/multiqc", line 767, in <module>
    multiqc()
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/ngi-rnaseq/bin/multiqc", line 413, in multiqc
    template_mod = config.avail_templates[config.template].load()
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2323, in load
    self.require(*args, **kwargs)
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2346, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/opt/conda/envs/ngi-rnaseq/lib/python2.7/site-packages/pkg_resources/__init__.py", line 783, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (Jinja2 2.7 (/home/pareng/.local/lib/python2.7/site-packages), Requirement.parse('jinja2>=2.9'))

Copied from original issue: SciLifeLab#237
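One possible way to apply the suggested PYTHONNOUSERSITE fix without rebuilding the image is Nextflow's env scope, which exports variables into each task's environment; whether the value reaches Python inside the Singularity container depends on how the container is launched, so treat this as a sketch:

// in a custom config passed with -c
env {
    PYTHONNOUSERSITE = '1'   // ask Python inside the task to ignore ~/.local packages
}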

dupRadar for large files

After rerunning several large RNA-seq runs (15 GB mark-duplicated BAM files) with dupRadar, I found that these probably simply take longer than the default of 2 hours specified in base.config. After automated rescheduling it is still running after 4 hours and thus ultimately crashes. Manually giving the pipeline more than 4 hours for each step solves the issue and the results look good.

I don't see this as a severe bug, but would like to pinpoint the issue and also point to the enhancement tracked in Nextflow here: nextflow-io/nextflow#731, which ultimately aims to make it possible to check how big a file is and adjust resource requests based on file size. That should make such "extreme" cases easier to handle.
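Until such file-size-aware logic exists, one workaround is to let the time limit grow with each retry through a custom config; the process name in this sketch is an assumption for this pipeline version:

process {
    withName: 'dupradar' {
        errorStrategy = 'retry'
        maxRetries    = 2
        // 6 h on the first attempt, 12 h on the second, 18 h on the third
        time          = { 6.h * task.attempt }
    }
}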

Show good / bad examples in documentation

From @Hammarn on March 22, 2017 8:19

It would be useful to supply examples of good/expected output for all of our supported programs. Or we should at least specify whether the supplied results are from a good or a bad experiment. featureCounts currently shows a library with rather low annotation amounts.

Copied from original issue: SciLifeLab#93

STAR only runs for one sample - channel.collect() not working?

Hello,

I wonder how replicates are handled, since the alignment process (STAR in my case) only takes one file from the trimmed_reads channel into account. Below is a trace of the workflow. In my data directory I have 2 paired-end datasets that I want to treat as replicates. Everything is fine with FastQC and Trim Galore, but the STAR process runs only once.

N E X T F L O W  ~  version 0.29.1
Launching `main.nf` [romantic_poincare] - revision: 02d1b5b022
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
Run Name       : romantic_poincare
Reads          : data/*_{1,2}.fastq
Data Type      : Paired-End
Genome         : BDGP6
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : /home/aubin/Data/projects/Nextflow/nf_core/iGenomes//Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/
GTF Annotation : /home/aubin/Data/projects/Nextflow/nf_core/iGenomes//Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf
BED Annotation : /home/aubin/Data/projects/Nextflow/nf_core/iGenomes//Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 12 GB
Max CPUs       : 4
Max Time       : 10d
Output dir     : ./results
Working dir    : /media/aubin/d8a2a24e-1474-4c5d-9cea-2e11f711df34/disk2/projects/Nextflow/nf_core/rnaseq/work
Container      : [:]
Current home   : /home/aubin
Current user   : aubin
Current path   : /home/aubin/Data/projects/Nextflow/nf_core/rnaseq
R libraries    : false
Script dir     : /media/aubin/d8a2a24e-1474-4c5d-9cea-2e11f711df34/disk2/projects/Nextflow/nf_core/rnaseq
=========================================
[warm up] executor > local
[bc/18b49b] Cached process > fastqc (SRR4305653)
[28/76fbd2] Cached process > workflow_summary_mqc
[8d/81917e] Cached process > fastqc (SRR4305654)
[88/e6409e] Cached process > trim_galore (SRR4305654)
[c1/6dd656] Cached process > get_software_versions
[02/2baaa5] Cached process > trim_galore (SRR4305653)
[b1/064d4d] Cached process > star (SRR4305654_1)
          Passed alignment > star (SRR4305654_1)   >> 92.46% <<
[94/66f6ff] Cached process > preseq (SRR4305654_1AlignedByCoord.out)
[37/e47ffa] Cached process > stringtieFPKM (SRR4305654_1AlignedByCoord.out)
[52/fbe046] Cached process > genebody_coverage (SRR4305654_1AlignedByCoord.out)
[ac/1710d4] Cached process > rseqc (SRR4305654_1AlignedByCoord.out)
[c7/579f83] Submitted process > markDuplicates (SRR4305654_1AlignedByCoord.out)
[9e/d3c67d] Submitted process > featureCounts (SRR4305654_1AlignedByCoord.out)
[16/89ad29] Submitted process > dupradar (SRR4305654_1Aligned.sortedByCoord.out.markDups)
[3b/84cfc6] Submitted process > merge_featureCounts (SRR4305654_1AlignedByCoord.out_gene.featureCounts)
[8c/736173] Submitted process > multiqc (SRR4305654_1)
[1f/1f82bc] Submitted process > output_documentation (SRR4305654_1)
[nfcore/rnaseq] Pipeline Complete

I wonder if it's because of the way the channel is used for the index (there is only one index for several calls).
Thanks in advance

aubin

Locale error in step Picard MarkDuplicates

Hi there!

I’m running version 1.1 of the pipeline using the Singularity image provided here. When running Picard MarkDuplicates I’m getting the following error:

/opt/conda/envs/nf-core-rnaseq-1.1/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory

Is anybody else having the same issue? It is something that should be fixed in the image, right? I tried running something like apt-get install locales, but the image is read-only (I don't have much experience with Singularity; should I be able to bypass the read-only limitation?).

Thank you very much in advance for any help.

Cheers,
Santiago

c3se problems

Hello,

I noticed that the 1.0 release is out, but I cannot download the Singularity container for this version.

I tried this command:

singularity pull --name nfcore-rnaseq-1.0.img docker://nfcore/rnaseq:1.0

But then I got this:

WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull. Use Singularity Registry
WARNING: (shub://) to pull exactly equivalent images.
Docker image path: index.docker.io/nfcore/rnaseq:1.0
ERROR MANIFEST_UNKNOWN: manifest unknown
Cleaning up...
ERROR: pulling container failed!

Running the pipeline without pulling the container first doesn't work on C3SE either:

Caused by:
  Failed to pull singularity image
  command: singularity pull --name nfcore-rnaseq-1.0.img docker://nfcore/rnaseq:1.0 > /dev/null
  status : 1
  message:
    WARNING: pull for Docker Hub is not guaranteed to produce the
   WARNING: same image on repeated pull. Use Singularity Registry
    WARNING: (shub://) to pull exactly equivalent images.
 ERROR MANIFEST_UNKNOWN: manifest unknown
    ERROR: pulling container failed!

I ran the pipeline like this:
sbatch rna_seq_script.sh
rna_seq_script.sh

Best,
John

Picard cpu usage

Picard/Java may gobble up more CPUs than allotted. The Picard FAQ ("Why does Picard use so many threads?", http://broadinstitute.github.io/picard/faq.html) says that Java garbage collection is to blame, and furthermore that Picard accepts -XX:ParallelGCThreads=<num of threads> to limit the number of threads. This parameter can be passed to Picard in the Nextflow script sections as -XX:ParallelGCThreads=${task.cpus}.

This may actually not be the definitive solution, see e.g. broadinstitute/picard#488

My case was an NF process under LSF, allotted 4 CPUs. It was automatically suspended because it used 11 CPUs. This process used markDuplicates. A few Picard modes accept a NUM_PROCESSORS option, but markDuplicates does not.
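A sketch of how the suggestion could look in the markDuplicates process script (the process definition is abbreviated and illustrative; it assumes the bioconda picard wrapper forwards leading -X options to the JVM):

process markDuplicates {
    cpus 4
    memory 8.GB

    input:
    file bam from bam_markduplicates   // illustrative DSL1-style input

    script:
    """
    picard -Xmx${task.memory.toGiga()}g -XX:ParallelGCThreads=${task.cpus} \\
        MarkDuplicates \\
        INPUT=$bam \\
        OUTPUT=${bam.baseName}.markDups.bam \\
        METRICS_FILE=${bam.baseName}.markDups_metrics.txt
    """
}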

STAR files are re-downloaded if the step is not successful

Hi

I have been playing with your pipeline, and it seems that if STAR fails (not enough RAM), then when the step is retried it re-downloads the STAR reference files, slowly filling up the disk because the tmp dirs from the failed tries are not deleted.

I assume that the tmp dir is kept so one can reproduce/check the error, but surely it would be a better implementation if the core reference files were only downloaded once into a central location. This would also make it possible to reuse them when analysing additional samples and save some bandwidth.

Regards

Kim

MultiQC edgeR section broken

The relatively recent fix to make MultiQC not hang if the edgeR process doesn't run, using toList() instead of collect(), is now broken. From main.nf, line 1039 (commit be0ba76):

file ('sample_correlation_results/*') from sample_correlation_results.toList() // toList() as process optional

Instead of linking in the files, this creates a file called input.xxx containing a text string with the file paths.

Need to confirm that any fix doesn't make the pipeline hang when this process does not run (eg with 2 or fewer samples).
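A sketch of a possible fix: the pattern used for optional MultiQC inputs elsewhere in nf-core pipelines combines collect() with ifEmpty(), which stages the files properly while still tolerating an empty channel:

file ('sample_correlation_results/*') from sample_correlation_results.collect().ifEmpty([]) // stays optional, files are staged correctly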

Save BAM index file

From @ewels on June 23, 2017 9:4

We create a BAM index file during the RSeQC step, but don't save it to results. It would probably be better to move this index generation step to an earlier process (or a new process) and save it to results.

Copied from original issue: SciLifeLab#134

Make BigWig files and use RSeQC geneBody_coverage2.py

Dear Phil,

The gene body coverage module requires a lot of memory for samples with larger BAM files (>100 GB); it is nearly impossible to get it done if the server memory is <128 GB.
Would it be possible to generate bigWig files and run geneBody_coverage2.py (http://rseqc.sourceforge.net/) on those instead of using the BAM as input, which is less memory-hungry? Or can you suggest any other way to overcome this issue within the pipeline?

Unless the "genebody_coverage" step completes, the pipeline does not generate the MultiQC report.

Thanks
Justin

Made env module MultiQC load only if not already found in PATH

From @ewels on July 14, 2016 13:42

We have MultiQC installed in our conda environment with specific config options, plus the MultiQC_NGI plugin and other special stuff.

Running module load MultiQC at the top of the process will bring the system version of MultiQC into the PATH and take precedence over our conda version.

This change checks for multiqc on the PATH within the process and tries to load it with environment modules if it's not found.

Copied from original issue: SciLifeLab/pull/23

Proper way of running with own profile

Hi there!

I was wondering if you could tell me / add in the documentation what would be the appropriate way of running the pipeline using a custom profile like the ones in the conf folder (e.g. hebbe, ccga, etc.).

I've tried creating a configuration file (let's call it centre.config) looking like this:

singularity {
  enabled = true
  autoMounts = true
}

process {
  executor = 'sge'
  queue = 'short'
  penv = 'shmem'
  clusterOptions = { "-P $params.project ${params.clusterOptions ?: ''}" }

  // Process-specific resource requirements
  $trim_galore.cpus = { check_max (5, 'cpus') }
}

params {
  igenomes_base = '/path/to/my/local/igenomes'
}

And then I ran the pipeline like this:

nextflow run nf-core/rnaseq -r 1.1 \
  -c centre.config \
  -with-singularity /path/to/image/nfcore-rnaseq-1.1.img \
  [...regular options...]
  1. Running it this way, it is failing to properly overwrite the igenomes_base variable, because the standard profile is being loaded first, which means base.config and igenomes.config are being loaded before redefining igenomes_base. The workaround I found for this was to specify igenomes_base directly as a parameter when launching the pipeline:
nextflow run nf-core/rnaseq -r 1.1 \
  -c centre.config \
  --igenomes_base /path/to/my/local/igenomes \      # <--- HERE
  -with-singularity /path/to/image/nfcore-rnaseq-1.1.img \
  [...regular options...]
  2. Because the file centre.config is being loaded after the standard config, the function check_max can't be found, so it has to be re-defined in the new config file.

  3. I've alternatively tried modifying the original nextflow.config file by adding my own profile in the following way:

  centre {
    includeConfig 'conf/base.config'
    includeConfig 'conf/centre.config'
    includeConfig 'conf/igenomes.config'
  }

and then copying the centre.config profile file to the conf folder so I could now launch the pipeline as nextflow run nf-core/rnaseq -r 1.1 -profile centre .... However, because I tweaked the original code, it is now (correctly) complaining that version 1.1 of the pipeline cannot be run because the code has changed.

So, how should I properly set up my profile, then? Any tips or recommendations?

Thank you very much in advance.

Cheers,
Santiago

Issue with markDuplicates: java.io.IOException: No space left on device

Hi all!

I am running nf-core/rnaseq on the Hebbe cluster, pulling the container with Singularity. The .simg was created with a recipe that mounts all the relevant directories.

I have set NXF_OPTS='-Xms1g -Xmx6g' in my bash profile and I have also changed the input parameters for markDuplicates to $markDuplicates.memory = 6.GB. This is because only 2 out of 20 cores are allocated by the pipeline and RAM is proportional, and the default 3 GB was insufficient; the pipeline was crashing.

Even though I have set my $TMPDIR directory to have a lot of available space, I still get the error from markDuplicates. Whatever this dir is, it always says "no space left on device" from the java.io.IOException.

Any idea how I can get this pipeline to finish even without markDuplicates? I can provide all logs and .command files from the workdir.
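Two config-level settings that may help, sketched against the DSL1-era process name (the name and the path are assumptions; note that Picard's temporary files follow the Java temp directory, which these settings do not change by themselves):

process {
    withName: 'markDuplicates' {
        memory  = 16.GB
        scratch = '/path/to/large/scratch'   // run the task in a directory with plenty of free space
    }
}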

RSeQC terminology

Issue by @rsuchecki, moved from https://github.com/ewels/nf-core-RNAseq/issues/19


Great work, I might be extending or at least re-purposing some of it soon. Just a comment about the RSeQC results in output.md: inner distance should not be confused with insert size. See here for disambiguation by @tseemann, partly quoted below. One might of course argue it is a local variable which can be named as one pleases ;-)

Mind the gap (by @tseemann)

There is a lot of confusion about the gap of unknown bases. You will encounter terms like "insert size", "fragment size", "library size" and variations thereof. The term "insert" comes from a time before NGS existed, when cloning DNA in E.coli vectors was standard business.

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................

The main confusion is with "insert size". The name itself suggests it is the unknown gap because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors which enable amplification and sequencing of that piece of DNA. So the "insert" actually encompasses R1 and R2 as well as the unknown gap between them. The name for the gap itself is better named "inner mate distance" because it is self-descriptive and can vary depending on what read lengths you sequenced a DNA library with.

Look into adding RSEM

From @ewels on June 19, 2017 12:5

A few people have asked about using RSEM to generate gene counts instead of (in addition to?) StringTie. It would be interesting to look into adding this either as an addition or as an alternative to StringTie.

From e-mail thread:

I was wondering whether you plan to also include RSEM for transcript-level quantification in near future (Irizarry's group has shown that RSEM slightly outperforms other methods (Teng et al., Genome Biology, 2016)). StringTie was not included in the comparison though.

Copied from original issue: SciLifeLab#132

Option to discard reads with adapter (remains)?

Cutadapt can remove reads that contain adapter remnants. Could we add an option for the pipeline to discard reads in which Cutadapt identified (partial) adapter sequences?

Happy to do it - just thought I'd ask before.

-profile standard needed for base config

Running the latest version crashes in the get_software_versions function. There are also a couple of warnings.

nf-core/rnaseq : RNA-Seq Best Practice v1.5dev
=======================================================
WARN: Access to undefined parameter `max_time` -- Initialise it to a default value eg. `params.max_time = some_value`
WARN: Access to undefined parameter `maxMultiqcEmailFileSize` -- Initialise it to a default value eg. `params.maxMultiqcEmailFileSize = some_value`
Run Name       : ecstatic_tuckerman
Reads          : data/*{1,2}.fastq.gz
Data Type      : Paired-End
Genome         : false
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : /home/houtan/genome/hg38/
GTF Annotation : /home/houtan/genome/hg38/gencode.v28.primary_assembly.annotation.gtf
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 30.GB
Max CPUs       : 1
Max Time       : null
Output dir     : data_results_hg38_gencodev28/
Working dir    : /home/houtan/ESCA/RNA-seq/PPARG/work
Container      : [:]
Current home   : /home/houtan
Current user   : root
Current path   : /home/houtan/ESCA/RNA-seq/PPARG
Script dir     : /home/houtan/my-pipelines/rnaseq
Config Profile : docker
E-mail Address : [email protected]
MultiQC maxsize: null
=========================================
[warm up] executor > local
[88/047082] Submitted process > fastqc (33-siPPARG-SA09470_S74_L008_R)
[39/296ecf] Submitted process > makeBED12 (gencode.v28.primary_assembly.annotation.gtf)
[be/69df32] Submitted process > fastqc (33-Negative-SA09464_S68_L008_R)
[25/a0191b] Submitted process > trim_galore (33-Negative-SA09464_S68_L008_R)
[f7/0072a3] Submitted process > trim_galore (33-siPPARG-SA09470_S74_L008_R)
[bc/bbf9df] Submitted process > get_software_versions
[4c/82642e] Cached process > workflow_summary_mqc
ERROR ~ Error executing process > 'get_software_versions'

Caused by:
  Process `get_software_versions` terminated with an error exit status (127)

Command executed:

  echo 1.5dev &> v_ngi_rnaseq.txt
  echo 0.30.1 &> v_nextflow.txt
  fastqc --version &> v_fastqc.txt
  cutadapt --version &> v_cutadapt.txt
  trim_galore --version &> v_trim_galore.txt
  STAR --version &> v_star.txt
  hisat2 --version &> v_hisat2.txt
  stringtie --version &> v_stringtie.txt
  preseq &> v_preseq.txt
  read_duplication.py --version &> v_rseqc.txt
  featureCounts -v &> v_featurecounts.txt
  picard MarkDuplicates --version &> v_markduplicates.txt  || true
  samtools --version &> v_samtools.txt
  multiqc --version &> v_multiqc.txt
  scrape_software_versions.py &> software_versions_mqc.yaml

Command exit status:
  127

Command output:
  (empty)

Option for generic extra CL options for commands

From @Hammarn on January 12, 2017 9:0

Just a thought I had:
It should be possible to send extra options to certain processes in the pipeline without having to change the code, i.e. supplying them on the command line. E.g. for STAR, supplying --clip3pNbases could sometimes be useful (for instance for the Clontech SMARTer PICO prep).
I guess we would have to set up a params entry for each process for supplying extra command-line options.

Copied from original issue: SciLifeLab#74
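In current DSL2 releases of the pipeline, this is typically achieved with a per-module ext.args override in a custom config rather than a dedicated parameter. A sketch (the selector is an assumption about the module name, and note that overriding ext.args replaces the arguments the pipeline sets for that module by default):

process {
    withName: '.*:STAR_ALIGN' {
        ext.args = '--clip3pNbases 3'   // extra STAR options injected into the module's command line
    }
}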

Add warning to the top of MultiQC reports if samples are skipped due to low alignment

From @ewels on June 29, 2017 10:4

Although a warning is given in the Nextflow log when a sample is skipped, it's not so obvious after the run is finished and forgotten.

It would be good to add a section to the top of the MultiQC report highlighting any skipped samples. We should be able to do this using a custom content file and kind of trick MultiQC.

For example, call it alerts_mqc.yaml and have something like this (untested):

id: ngi-rnaseq-alerts
section_name: NGI-RNAseq Warnings
description: |
  <div class="alert alert-danger">
    <p><strong>Warning!</strong> 14 samples were halted after alignment due to very low alignment rates.</p>
    <ul>
      <li>Sample 1 (2.3% aligned)</li>
      <li>Sample 2 (0.4% aligned)</li>
    </ul>
  </div>

I'm not quite sure what happens if you don't supply any plot data to Custom Content, so if this doesn't work right away then I'll take a look into the inner-workings of MultiQC.

Copied from original issue: SciLifeLab#139

problem with singularity

Hello,

I'm trying this workflow on a local machine using the Singularity image.

Step 1

I first pull the image

singularity pull --name nfcore-rnaseq.img docker://nfcore/rnaseq

Minor issue
I notice the documentation says:

singularity pull --name nfcore-rnaseq-1.4.img docker://nfcore/rnaseq:1.4

But the tag 1.4 does not exist in the repository.

Step 2

I run the workflow

 nextflow run nf-core/rnaseq --with-singularity nfcore-rnaseq.img --genome GRCm38 --reads 'data/*_R{1,2}*'

Nextflow v0.29.1
nfcore/rnaseq v1.5dev

Major issue

Command error:
  .command.sh: line 2: trim_galore: command not found

Seems like trim_galore is not found... is there an issue with the container?

Document igenomes_base

The reference genome docs don't mention igenomes_base and could do with a bit of a refresh.

Option to disable QC steps

The QC steps can take a substantial amount of time and be problematic with very large runs (eg. single cell projects).

  1. We can probably remove Preseq. It doesn't really add anything beyond DupRadar anyway
  2. Add a new flag --no-qc to disable all quality control steps.

Consider other cluster definitions for task resource allocation

Hi there!

In the centre I'm working at, we use "SGE" as a job scheduler. The way slots are reserved for the jobs is core based. Given the following scenario:

Node
Total memory : 32 Gb
Total cores : 8
Average memory per core : 4 Gb

if I have a job that uses 1 core and 16 Gb of RAM, then I have to ask for 4 cores to be able to run it properly.

My question then is: would it be possible to update the code somehow so that the memory/CPU validations are automatically adjusted based on this? That way, we wouldn't have to re-define the process requirements (at least for memory/CPUs).

I was thinking maybe on adding a memory_per_core param and tweaking the check_max function to consider this if defined?

Let me know your thoughts or if you have any other idea to sort this out.

Thank you very much in advance!

Cheers,
Santiago
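A sketch of the proposed tweak, assuming params.memory_per_core is a Nextflow MemoryUnit (e.g. 4.GB); the helper and parameter names follow the suggestion above and are not existing pipeline code:

// Round the CPU request up so that cpus * memory_per_core covers the requested memory.
def adjust_cpus(requested_cpus, requested_memory) {
    if (!params.memory_per_core) {
        return requested_cpus
    }
    def needed = Math.ceil(requested_memory.toBytes() / params.memory_per_core.toBytes()) as int
    return Math.max(requested_cpus as int, needed)
}

// e.g. adjust_cpus(1, 16.GB) with memory_per_core = 4.GB would request 4 cores from SGE.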

featureCounts: Add samtools idxstats process

Hi,

as already discussed in MultiQC/MultiQC#842, it would be nice to integrate a new plot into the featureCounts section of the MultiQC report, which would look similar to the already existing one.

I attached an exemplary stacked bar plot, where the x-axis shows the different samples, the y-axis represents the counts and the different colors depict the existing chromosomes.

In order for this to work, a custom Python script would need to be added to the pipeline, which does the counting and prints it to a file that MultiQC can understand.

Best,
Marie
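A minimal sketch of what such a step could look like as a Nextflow process (illustrative only; the process and channel layout are not taken from the pipeline):

process SAMTOOLS_IDXSTATS {
    input:
    tuple val(sample), path(bam), path(bai)

    output:
    path "${sample}.idxstats"

    script:
    """
    samtools idxstats $bam > ${sample}.idxstats
    """
}

// A small script can then reshape the per-chromosome counts into a MultiQC
// custom-content file to produce the proposed stacked bar plot.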

citation?

Dear all,

a colleague here at the medical school has used the pipeline successfully and now needs a citation. Should we cite this GitHub repo URL, or have there been efforts to put it on Zenodo, where (I believe) you can get a citable DOI for software/data?

Thanks,
Colin

Lane Merging option?

We quite often have cases where there is more than one FastQ file per sample, e.g. if samples have been sequenced on more than one lane:

blabla_L001_R1.fastq.gz
blabla_L002_R1.fastq.gz

in this case single-end.
I thought about a possibility to treat these as a single sample based on the file name, and having an option in the RNA-seq pipeline to specify a lane pattern, for example. Would this be of general interest?

Cheers,
Alex

Improve genome not found error message ?

Hi,

messing around with the new version, I had to change very little config. Things are improving fast, cheers.

One comment I have is on a better error message when incorrectly entering the input genome string. Instead of "Cannot get property 'star' on null object" it might be more informative to output "genome string not found at (somepath)".
Here I put in hg19 instead of GRCh37.

Cheers,
Colin

$ nextflow run /mnt/ngsnfs/tools/nf-core/rnaseq-master/ --singleEnd --profile docker --genome 'hg19' --reads '*.fastq.gz'   --igenomes_base '/mnt/ngsnfs/igenomes_2017'
..
N E X T F L O W  ~  version 0.29.1
Launching `/mnt/ngsnfs/tools/nf-core/rnaseq-master/main.nf` [nostalgic_aryabhata] - revision: 6549eba824
ERROR ~ Cannot get property 'star' on null object

 -- Check script 'main.nf' at line: 86 or see '.nextflow.log' file for more details

Second attempt:

$ nextflow run /mnt/ngsnfs/tools/nf-core/rnaseq-master/ --singleEnd --profile docker --genome 'GRCh37' --reads '*.fastq.gz'   --igenomes_base '/mnt/ngsnfs/igenomes_2017'
Picked up _JAVA_OPTIONS: -Dhttp.proxyHost=172.24.2.50 -Dhttp.proxyPort=8080 -Dhttps.proxyHost=172.24.2.50 -Dhttps.proxyPort=8080
N E X T F L O W  ~  version 0.29.1

.. -> works fine.

DupRadar broken

Tested v1.4 and it seems that DupRadar is broken:

~/D/t/80791ee1db8a958074d47827688b86> ls -l
total 1508544
-rw-r--r-- 1 alex users 788007786 Apr  9 16:00 AS-225023-LR-34148Aligned.sortedByCoord.out.markDups.bam
-rwxr-xr-x 1 alex users      5285 Apr  9 16:07 dupRadar.r*
-rw-r----- 1 alex users 756721066 Apr  9 16:00 Mus_musculus.GRCm38.90.gtf
alex@aragorn ~/D/t/80791ee1db8a958074d47827688b86> 
singularity shell ../singularity/scilifelab-ngi-rnaseq-1.4.img 
Singularity: Invoking an interactive shell within container...

Singularity scilifelab-ngi-rnaseq-1.4.img:~/Downloads/testme_locally/80791ee1db8a958074d47827688b86> sh .command.sh
Input bam      (Arg 1): AS-225023-LR-34148Aligned.sortedByCoord.out.markDups.bam
Input gtf      (Arg 2): Mus_musculus.GRCm38.90.gtf
Strandness     (Arg 3): unstranded
paired/single  (Arg 4): single
Nb threads     (Arg 5): 2
R package loc. (Arg 6): 2
Output basename       : AS-225023-LR-34148Aligned.sortedByCoord.out.markDups
Loading required package: dupRadar
Loading required package: parallel

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: .C("R_readSummary_wrapper", as.integer(n), as.character(cmd),     PACKAGE = "Rsubread")
 2: Rsubread::featureCounts(files = bam, annot.ext = gtf, isGTFAnnotationFile = TRUE,     GTF.featureType = "exon", GTF.attrType = "gene_id", nthreads = threads,     isPairedEnd = paired, strandSpecific = stranded, ignoreDup = dup,     countMultiMappingReads = mh, ...)
 3: count(mh = TRUE, dup = FALSE)
 4: eval(expr, pf)
 5: eval(expr, pf)
 6: withVisible(eval(expr, pf))
 7: evalVis(expr)
 8: capture.output(counts <- list(mhdup = count(mh = TRUE, dup = FALSE),     mhnodup = count(mh = TRUE, dup = TRUE), nomhdup = count(mh = FALSE,         dup = FALSE), nomhnodup = count(mh = FALSE, dup = TRUE)))
 9: analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end,     threads)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
Singularity scilifelab-ngi-rnaseq-1.4.img:~/Downloads/testme_locally/80791ee1db8a958074d47827688b86>

If I change things in the .command.sh to use only 1 CPU core, things work and finish properly.
