Hi all, I have a set of 10x scRNAseq (nuclei instead of cells, to be

Finally finished succesfully, thanks a lot for your help <a class="user-mention notran

Based on <a href="http://cf.10xgenomics.com/samples/cell-exp/1.3.0/hgmm_100/hgmm_100_f

3'UTR transgene identification from 10x scRNAseq reads,about bcgsc/rna-bloom

Comments (36)

auesro commented on August 17, 2024 1

I come back to report the assembly was quite succesful.
However, now I am trying to assemble reads from a different experiment (traditional RNAseq, single end reads, downloaded from SRA) and face a similar issue to here:

(RNA_Bloom) [lab_eh_2@NeuroServer denovo_transcriptome]$ rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output
WARNING: Output directory does not exist!
Created output directory at `output`

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25):     262.333.277
Unique k-mers (k=25,c>1): 129.744.533
K-mer counting completed in 31.585s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.57971644
k-mer counting:        2.2937248
paired k-mers (reads): 0.57971644
paired k-mers (frags): 0.57971644
screening:             0.57971644
====================================
Total:                 4.612591

> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
        min     Q1      M       Q3      max     file
        28      50      51      51      51      /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz
Weighted [Q1,M,Q3]: [50, 51, 51]
Absolute [min,max]: [28, 51]
Paired k-mers distance: 15
Max. tip length: 26
Parsing `/home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz`...
Parsed 25.694.006 sequences in 2m 14s
DBG Bloom filter FPR:                 0.998 %
Counting Bloom filter FPR:            1.01 %
Reads paired k-mers Bloom filter FPR: 0.335 %
> Stage 1 completed in 2m 23s

> Skipping Stage 2 for single-end reads.

> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
        at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4679)
        at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
        at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
        at rnabloom.RNABloom.main(RNABloom.java:7104)

How do I specify the reads when I only have 1 file?

from rna-bloom.

auesro commented on August 17, 2024 1

Don´t worry, fixed it!
Thanks!

from rna-bloom.

auesro commented on August 17, 2024 1

Finally finished succesfully, thanks a lot for your help @kmnip !

from rna-bloom.

kmnip commented on August 17, 2024

As mentioned previously, you need to trim the UMIs and barcodes entirely from the reads with umitools extract. After that, only right (_2.fastq) reads contain the sequences that can be assembled.

You have several options here. You can align the right reads (with bwa/minimap2) to either your CDS, reference transcripts, or the reference genome. For example, the gene for "Bgal" should be GLB1 (chr3:32,996,617-33,097,146 in GRCh38).

After alignment, you can extract the aligned reads with samtools fastq.

samtools fastq -F 0x4 alignment.bam > extracted_reads.fastq

You can assemble the extracted reads with RNA-Bloom:

java -jar RNA-Bloom.jar -sef extracted_reads.fastq -ntcard -t THREADS -outdir OUTDIR

from rna-bloom.

auesro commented on August 17, 2024

A couple of doubts:
-Read2 should not contain any UMI or Barcode right? I mean, my average insert size is 400 bp, read2 is 88 bases long...unlikely any of them included an insert small enough to get to the other side ,right?
-If I align my reads to the CDS (thats the only thing I have by now) I will keep only those reads that map to the CDS and will discard reads mapping outside...precisely what I am looking for: those reads that map to the 3'UTR....or am I missing something?

from rna-bloom.

kmnip commented on August 17, 2024

Based on the sample data from 10x, I don't think the right read (98-bp) contains any UMI/barcodes at all. But if you believe that yours do, please do remove them before alignment/assembly.

The insert sizes probably peak at 400 bp, but they should vary a bit.

Yes, you are correct about the CDS missing the UTR. That's why I suggested reference transcript and genome as alternatives.

from rna-bloom.

auesro commented on August 17, 2024

The problem is that being a transgene (a sequence inserted artificially in the mouse genome in order to be expressed by select cells so we can identify them), the sequence is not present in any reference....I only know the CDS. I need a way to select the reads that map to the CDS and use those to "build" a transcript by adding more reads (specially in the direction 3' of the CDS)...thats why I initially thought Kollector could do it.

from rna-bloom.

kmnip commented on August 17, 2024

You can just assemble the trimmed reads and then extract the transcripts that align to your CDS?

from rna-bloom.

auesro commented on August 17, 2024

Mmm I guess so. You mean:
-Discard read1
-Trim any UMI, cell barcode, etc from read2
-Perform de novo transcriptome assembly with ALL read2 using RNA-bloom
-BLAST the assembled transcriptome against my CDS
-Pick candidate transcripts and identify the 3'UTR from those long enough
?

from rna-bloom.

kmnip commented on August 17, 2024

Yes, that seems like your only option because your left read most likely contain only UMIs + cell barcodes.
Kollector would not work in this case because its recruitment step expects paired end reads where one paired read would anchor on a given sequence.

from rna-bloom.

auesro commented on August 17, 2024

Alright.
Hope RNA-bloom will not crash!
Then my next question is: I have 2 samples (=libraries, from different animals) should I supply RNA-bloom with the 2 read2 files, or better do it one by one?

from rna-bloom.

kmnip commented on August 17, 2024

Let me know if you run into any issues.
If they are different animals, then you should assemble them separately.

from rna-bloom.

auesro commented on August 17, 2024

Thanks a lot, Ka!
Will keep you posted

from rna-bloom.

auesro commented on August 17, 2024

Unfortunately, I ran into something...

(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$ rnabloom -fpr 0.01 -ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -ser, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz, /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `output/rnabloom_k25.hist`...
Unique k-mers (k=25):     1.990.801.340
Unique k-mers (k=25,c>1): 802.505.180
K-mer counting completed in 13m 57s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       4.399367
k-mer counting:        14.187311
paired k-mers (reads): 4.399367
paired k-mers (frags): 4.399367
screening:             4.399367
====================================
Total:                 31.784777

> Stage 1: Construct graph from reads (k=25)
Sampling read lengths (l>=25, n=1000) from each file...
        min     Q1      M       Q3      max     file
        61      87      88      88      88      /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz
        78      87      88      88      88      /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz
Weighted [Q1,M,Q3]: [87, 88, 88]
Absolute [min,max]: [61, 88]
Paired k-mers distance: 52
Max. tip length: 63
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz`...
Parsed 100.549 sequences in 1.42s
Parsing `/home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz`...
Parsed 345.166.566 sequences in 1h 6m 3s
Parsed 345.267.115 sequences from 2 files.
DBG Bloom filter FPR:                 0.999 %
Counting Bloom filter FPR:            1.01 %
Reads paired k-mers Bloom filter FPR: 0.25 %
> Stage 1 completed in 1h 6m 50s

> Skipping Stage 2 for single-end reads.

> Stage 3: Assemble transcripts for "rnabloom"
ERROR: null
java.lang.NullPointerException
        at rnabloom.RNABloom$SingleEndReadsIterator.<init>(RNABloom.java:4670)
        at rnabloom.RNABloom.assembleSingleEndReads(RNABloom.java:4805)
        at rnabloom.RNABloom.assembleTranscriptsSE(RNABloom.java:5379)
        at rnabloom.RNABloom.main(RNABloom.java:7104)
(RNA_Bloom) [lab_eh_2@NeuroServer RNA_Bloom]$
´´´
Any ideas whats happening here?

from rna-bloom.

kmnip commented on August 17, 2024

It looks like you have found a bug.
You can bypass it by using both -ser and -sef, i.e.

rnabloom -fpr 0.01 \
-ser /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_unpaired_norRNA_2.fq.gz \
-sef /home/LAB/lab_eh_2/AER_Data/snRNAseq/C1/trimmed/Clean_rRNA/A_S1_L001_R2_001_val_norRNA_2.fq.gz \
-ntcard -t 30 -outdir output

Since your reads are not strand-specific, it doesn't matter whether you specify your reads with -ser or -sef.

I will fix this in the next release. Sorry about that!

from rna-bloom.

auesro commented on August 17, 2024

Alright, good to know it was not something worse. I was already touching the JAVA options.

So you are saying the problem was I specified 2 input read files with only one argument, right? It should have been -ser X -sef Y, not -ser X Y. But then that means the bug is that you cannot, so far, provide more than 1 read file per "modality" -left, -right, -sef, -ser, correct?

Also, I was wondering if I could use the -mergepool option here if I only have single end reads: how should I specify the files from different samples in the readslist.txt file? Given I need to put them in the 4th column, do I just leave the columns 2 and 3 empty?

Sample1     sef_file ser_file
Sample2     sef_file ser_file

from rna-bloom.

kmnip commented on August 17, 2024

No, the bug was that -ser and -sef had to be used together.

In in version 1.4.3, the -pool option doesn't support single-end reads (yet):
https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.4.3

support for single-end reads in -pool will be implemented in a future release

It will be available in the next release though.

from rna-bloom.

auesro commented on August 17, 2024

Ok, thanks for the info!
Is it possible to use the -mergepool parameter if I provide several manually assembled transcriptomes?

I am running it now with the first sample, fingers crossed!

from rna-bloom.

auesro commented on August 17, 2024

Just wondering, is it normal that it will take more than 9 hours to assemble a transcriptome from 345.267.115 sequences using 30 threads? (Currently at "Reducing redundancy in assembled transcripts...").

from rna-bloom.

kmnip commented on August 17, 2024

At this stage, it is reducing the redundancy in the assembled sequences to generate *.transcripts.nr.fa, but you can already use the sequences in *.transcripts.fa.

345 million reads is not a small amount data and assembly of single-end reads is less efficient than paired-end assembly. So, that is kind of expected.

from rna-bloom.

auesro commented on August 17, 2024

Alright, thanks Ka!

Regarding output till now, I guess the transcripts.short.fa file contains the assembled transcripts smaller than certain threshold, right? Is it taking those into account too for the redundancy reduction?

from rna-bloom.

kmnip commented on August 17, 2024

Yes, the default length threshold is 200.
Those short sequences are not included in the redundancy reduction process.

from rna-bloom.

kmnip commented on August 17, 2024

Yes, that was the same bug as before. Sorry for the inconvenience.
You need to split your FASTQ file in half and use -ser and -sef options together.
For example, you can split the file like so:

zcat SRR7039666_3_trimmed.fq.gz | split -l 51400000 - partition_
mv partition_aa SRR7039666_3_trimmed.01.fq
mv partition_ab SRR7039666_3_trimmed.02.fq
gzip SRR7039666_3_trimmed.01.fq SRR7039666_3_trimmed.02.fq

rnabloom -fpr 0.01 -sef SRR7039666_3_trimmed.01.fq.gz -ser SRR7039666_3_trimmed.02.fq.gz -ntcard -t 30 -outdir output

from rna-bloom.

auesro commented on August 17, 2024

Ok, that was the smart move.
I just ran rnabloom with the same reads file twice like:
rnabloom -fpr 0.01 -sef /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ser /home/LAB/lab_eh_2/AER_Data/Chamessian/3/trimmed/SRR7039666_3_trimmed.fq.gz -ntcard -t 30 -outdir output
Is that crazy? The file is small so the assembly already finished, can I use those results or better not?

from rna-bloom.

kmnip commented on August 17, 2024

It is okay if noise is not an issue.
Since the read file is relatively small, I suggest you re-run the assembly with the split files.

from rna-bloom.

auesro commented on August 17, 2024

Alright!
Thanks!

from rna-bloom.

kmnip commented on August 17, 2024

I just realized that I messed up the number for the example split command previously.
25,694,006 reads * 4 lines/read = 102,776,024 lines
Split that in half would be 51,388,012 lines per partition.
So, something like -l 51400000 would yield exactly two files.

from rna-bloom.

auesro commented on August 17, 2024

Hi Ka,

Not sure if I should open a new issue or keep adding here, because this is a new error not discussed previously. If you prefer, I will close this and post as new issue. In the meantime:

I have been de novo assembling several public datasets. In this case from a RNAseq experiment (not single cell). The only difference I can see is that the size of the reads files is way bigger than previous (I have merged together 33 SRR single end files and then split in 2 as recommended above: each reads file is about 39GB). And I get this:

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ rnabloom -fpr 0.01 -sef *_00_trimmed.fq.gz -ser *_01_trimmed.fq.gz -ntcard -t 30 -outdir output
RNA-Bloom v1.4.3
args: [-fpr, 0.01, -sef, Chongtam_00_trimmed.fq.gz, -ser, Chongtam_01_trimmed.fq.gz, -ntcard, -t, 30, -outdir, output]

name:   rnabloom
outdir: output

K-mer counting with ntCard...
Running command: `ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt`...
ERROR: Error running ntCard!

Not a very informative error...any ideas??

from rna-bloom.

kmnip commented on August 17, 2024

Hi,
Can you please run this command and see whether you got an error from ntCard?
ntcard -t 30 -k 25 -c 65535 -p output/rnabloom @output/rnabloom.ntcard.readslist.txt
Thanks!

from rna-bloom.

auesro commented on August 17, 2024

Mmmm this is what I get...

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$  ntcard -t 30 -k 25 -c 65535 -p output/rnabloom@output/rnabloom.ntcard.readslist.txt

gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file                                                                                  
PID 3694 exited with status 1

from rna-bloom.

kmnip commented on August 17, 2024

That means your fastq file Chongtam_00_trimmed.fq.gz is truncated.

from rna-bloom.

auesro commented on August 17, 2024

Ok.

I made it by concatenating several single end fastqs, counting lines and dividing in half as previously advised. How could the file be truncated?

from rna-bloom.

kmnip commented on August 17, 2024

That error actually came from gzip. ntCard was simply using it to decompress your FASTQ file. So, your gzip file is corrupted according to gzip.
Try this and see if it is really truncated:

zcat Chongtam_00_trimmed.fq.gz | tail -n 4

You can try decompressing and then compressing the file again.

from rna-bloom.

auesro commented on August 17, 2024

Result:

(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$ zcat Chongtam_00_trimmed.fq.gz | tail -n 4        
gzip: Chongtam_00_trimmed.fq.gz: unexpected end of file
@SRR13563559.53035757 NB501946:387:HNHTGBGXC:1:23107:17045:17168 length=70                                                               CACATATTCATAATCAGGGTCTTGCAGCAATTGAGAAGTGTGGAGAAGCTGCTCTAAATGACCCCA
+
AA
(RNA_Bloom) [lab_eh_2@NeuroServer trimmed]$

How did this happen??

from rna-bloom.

kmnip commented on August 17, 2024

That definitely looks truncated.
Check your original FASTQ file that contains the read SRR13563559.53035757.
Otherwise, you may have had multiple commands running at the same time and they write to the same file.

from rna-bloom.

auesro commented on August 17, 2024

You were right, the original reads file:

(in202109) [lab_eh_2@NeuroServer Chongtam]$ zcat SRR13563559.fastq.gz | tail -n 4
@SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
TAGGATTGTTTGTACGGAATAGAGAAAAGTGTGGCAGAAAAGCGGAACGCGCAAAGAGAACGTATATGGT
+SRR13563559.68742091 NB501946:387:HNHTGBGXC:1:11311:18817:13247 length=70
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

So I am reconstructing again, will report back!

Thanks!

from rna-bloom.

3'UTR transgene identification from 10x scRNAseq reads about rna-bloom HOT 36 CLOSED

Comments (36)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent