Giter VIP home page Giter VIP logo

Comments (6)

WackerO avatar WackerO commented on July 3, 2024

Hey Fernando, you should definitely be able to run the pipeline on rnaseq output, but unless you share some information about your command and the input files you used as well as the error message you received, I'm afraid I won't be able to help...

from differentialabundance.

RaverJay avatar RaverJay commented on July 3, 2024

I am also interested in this.
nf-core/rnaseq running star+salmon already produces transcript abundances as output, and the table has the columns when run against ensemble genome+annotation:
tx gene_id <sample1> <sample2> etc

with tx having the ensembl transcript ids.

Now if you try to plug in this table + the lengths into differentialabundance it fails on validation:

-[nf-core/differentialabundance] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv)'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv)` terminated with an error exit status (1)

Command executed:

  validate_fom_components.R \
      --sample_metadata "samplesheet.csv" \
      --feature_metadata 'Homo_sapiens.anno.tsv' \
      --assay_files "salmon.merged.transcript_counts.tsv" \
      --contrasts_file "contrasts.csv" \
      --output_directory "study" \
      --sample_id_col 'sample' --feature_id_col 'gene_id'

<SNIP>

  [1] "Reading sample sheet at samplesheet.csv with ID col sample"
  [1] "Reading feature metadata at Homo_sapiens.anno.tsv with ID col gene_id"
  [1] "Reading assay matrix salmon.merged.transcript_counts.tsv and validating against samples and features (if supplied)"
  Error in `.rowNamesDF<-`(x, value = value) : 
    duplicate 'row.names' are not allowed
  Calls: validate_inputs ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
  In addition: Warning message:
  non-unique values when setting 'row.names': ‘ENSG00000000003’, ‘ENSG00000000005’, ‘ENSG00000000419’, ‘ENSG00000000457’, ‘ENSG00000000460’, ‘ENSG00000000938’, ‘ENSG00000000971’, ‘ENSG00000001036’, ‘ENSG00000001084’, ‘ENSG00000001167’, ‘ENSG00000001460’, ‘ENSG00000001461’, ‘ENSG00000001497’, ‘ENSG00000001617’, ‘ENSG00000001626’, ‘ENSG00000001629’, ‘ENSG00000001630’, ‘ENSG00000001631’, ‘ENSG00000002016’, ‘ENSG00000002330’, ‘ENSG00000002549’, ‘ENSG00000002586’, ‘ENSG00000002587’, ‘ENSG00000002726’, ‘ENSG00000002745’, ‘ENSG00000002746’, ‘ENSG00000002822’, ‘ENSG00000002834’, ‘ENSG00000002919’, ‘ENSG00000002933’, ‘ENSG00000003056’, ‘ENSG00000003096’, ‘ENSG00000003137’, ‘ENSG00000003147’, ‘ENSG00000003249’, ‘ENSG00000003393’, ‘ENSG00000003400’, ‘ENSG00000003402’, ‘ENSG00000003436’, ‘ENSG00000003509’, ‘ENSG00000003756’, ‘ENSG000000 [... truncated] 
  Execution halted
  INFO:    Cleaning up image...

because the feature_id_col should not be 'gene_id', but 'tx' for this input table

Is there a way to set this parameter?
Or is it set via the params in the 'Features options'?

from differentialabundance.

RaverJay avatar RaverJay commented on July 3, 2024

Okay, so setting

--features_id_col             'transcript_id'
--features_type               'transcript'

made the validation succeed because features_id_col is used:

validate_fom_components.R \
    --sample_metadata "samplesheet.csv" \
    --feature_metadata 'Homo_sapiens.anno.tsv' \
    --assay_files "salmon.merged.transcript_counts.tsv" \
    --contrasts_file "contrasts.csv" \
    --output_directory "study" \
    --sample_id_col 'sample' --feature_id_col 'transcript_id'

the following filter step seems to work, yielding such a table:

transcript_id	EU10_R4_Ctrl_wDPO	EU11_R4_Wt_oDPO	EU12_R4_Wt_wDPO	EU1_R2_Ctrl_oDPO	EU2_R2_Ctrl_wDPO	EU3_R2_Wt_oDPO	EU4_R2_Wt_wDPO	EU5_R3_Ctrl_oDPO	EU6_R3_Ctrl_wDPO	EU7_R3_Wt_oDPO	EU8_R3_Wt_wDPO	EU9_R4_Ctrl_oDPO
ENST00000456328	0	0	0	0	0	0	0	0	0	0	0	1.65
ENST00000488147	112.211	78.767	74.328	63.958	41.027	119.846	73.367	53.181	60.653	47.144	54.7959.419

but it still dies on the DESeq2 process, likely because that still looks for a 'gene_id' column in the table,
which is not there.
'gene_id_col' is not passed as a param to the DESeq2 process:

  ################################################
  ################################################
  ## PARSE PARAMETERS FROM NEXTFLOW             ##
  ################################################
  ################################################
  
  # I've defined these in a single array like this so that we could go back to an
  # optparse-driven method in future with module bin/ directories, rather than
  # the template
  
  # Set defaults and classes
  
  opt <- list(
      output_prefix = ifelse('all' == 'null', 'Ctrl_wDPO_vs_Ctrl_oDPO', 'all'),
      count_file = 'study.filtered.tsv',
      sample_file = 'samplesheet.sample_metadata.tsv',
      contrast_variable = 'condition',
      reference_level = 'Ctrl_oDPO',
      target_level = 'Ctrl_wDPO',
      blocking_variables = NULL,
      control_genes_file = '',
      transcript_lengths_file = 'salmon.merged.transcript_lengths.tsv',
      sizefactors_from_controls = FALSE,
      gene_id_col = "gene_id",
      sample_id_col = "experiment_accession",

<SNIP>

  )
  opt_types <- lapply(opt, class)
  
  # Apply parameter overrides
  
  args_opt <- parse_args('--sample_id_col "sample" --test Wald --fit_type parametric --sf_type ratio --min_replicates_for_replace 7 --use_t false --lfc_threshold 0 --alt_hypothesis greaterAbs --independent_filtering true --p_adjust_method BH --alpha 0.1 --minmu 0.5 --vs_method vst --vs_blind true --vst_nsub 1000 --shrink_lfc true --cores 1 --subset_to_contrast_samples false --blocking_variables')

Ideas?

from differentialabundance.

RaverJay avatar RaverJay commented on July 3, 2024

Got it to work by renaming the column names in the counts table, lengths table AND annotation, such that the column containing the transcript_id is (wrongly) called 'gene_id'

renaming:

sed '1 s/gene_id/gene_id_orig/;s/tx/gene_id/' results_rnaseq/star_salmon/salmon.merged.transcript_counts.tsv > renamed_salmon.merged.transcript_counts.tsv
sed '1 s/gene_id/gene_id_orig/;s/tx/gene_id/' results_rnaseq/star_salmon/salmon.merged.transcript_lengths.tsv > renamed_salmon.merged.transcript_lengths.tsv
sed '1 s/gene_id/gene_id_orig/;s/transcript_id/gene_id/' results_differentialabundance/tables/annotation/Homo_sapiens.anno.tsv > renamed_Homo_sapiens.anno.tsv

running:

nextflow run nf-core/differentialabundance \
    -resume \
    -profile rnaseq,singularity \
    --input samplesheet.csv \
    --contrasts contrasts.csv \
    --matrix renamed_salmon.merged.transcript_counts.tsv \
    --transcript_length_matrix renamed_salmon.merged.transcript_lengths.tsv \
    --outdir results_transcriptsdiff_renamed \
    --features renamed_Homo_sapiens.anno.tsv

Output obviously has 'Gene ID' column name even though they are transcript ids.

Ideally there would be a way to properly parameterize which columns to use (in validation, filtering, and DESeqing) AND how to generate the annotation from the gtf

Please consider implementing it - would be a cool feature to be able to directly use transcript abundances from nf-core/rnaseq

Best

from differentialabundance.

WackerO avatar WackerO commented on July 3, 2024

Hey there!

but it still dies on the DESeq2 process, likely because that still looks for a 'gene_id' column in the table, which is not there. 'gene_id_col' is not passed as a param to the DESeq2 process:

Yep...I noticed that as well some time ago, that's a silly little bug. It's already fixed on the dev branch, so if you run that code it should work. The necessary id_col parameters should already be in the pipeline :) (There's quite a few of them unfortunately, but all of them are described on https://nf-co.re/differentialabundance/1.4.0/parameters)

from differentialabundance.

pinin4fjords avatar pinin4fjords commented on July 3, 2024

@WackerO has indicated a fix present on the dev branch, and there has been now response by the OP since. I'm assuming this is resolved, @fpsanz please reopen if you discover that the issue persists.

from differentialabundance.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.