Comments (11)
I can confirm that there is no annotation on KI270827.1 in the version I was using (v36).
from ultra.
This is strange, I will have a better look at this over the weekend. It would help if I can get at least one read(/sequence) that has aligned coordinates outside this contig, as well as your command line calling uLTRA. Would this be possible?
from ultra.
I decided to wait with tracing down this bug until I get a sequence that produces violating alignment coordinates, as this bug may be nontrivial.
from ultra.
I prepared a minimal example, by taking every 100th read from the affected contig:
samtools view CTL_S2_sorted.bam KI270827.1|perl -lane '$c+=1; if ($c%100==0){print "\@read$c\n$F[9]\n+\n$F[10]"}'>test.fastq
This resulted in 10 reads. As the aligned reads did not contain quality strings, there are no quality values in the fastq, but I think it does not matter.
You can get this file and the resulting alignment here:
test.fastq
The reference genome and annotation I used is from GENCODE.
I used your tool with the new parameter options as follows, to reprocess the selected reads:
# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# align the reads
uLTRA align GRCh38.genome.fa test.fastq . --prefix test --index uLTRA_index --isoseq
Strangely, the resulting alignment contained only 4 reads (even though all where aligned to KI270827.1 when origninally processed together with all reads). For each read, two alignments are found, one on chr11 and one on KI270827.1. Sometimes chr11 is reported as primary, sometimes KI270827.1. Except for the strand[1] the alignment to KI270827.1 is the same as in my original uLTRA alignment with all reads.
[1] apparently the sequence in the sam file is reversed if aligned to the - strand, which I did not consider when creating the fastq.
from ultra.
Hi Matthias,
Thanks for the effort in producing a minimal example, it really helps.
I just wanted to double-check with you if you are sure that you are using v36 on the gencode annotation? The reason that I ask is that I cannot find KI270827.1
in the gencode.v36.chr_patch_hapl_scaff.annotation.gtf
(downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.chr_patch_hapl_scaff.annotation.gtf.gz
-- which is the "ALL" entry at https://www.gencodegenes.org/human/release_36.html ). It is however present in the GRCh38.genome.fa
file. Could you check this with e.g. grep "KI270827.1" [your annotation].gtf
This may be precisely what is causing the bug, but just thought I would check if I have the correct GTF annotation file before I go on a bug hunt.
from ultra.
So I looked into this and here are my observations:
-
"Strangely, the resulting alignment contained only 4 reads"
I also observed this behavior when using the fastq file that you produced. I do not observe this when I convert the file to fasta (then all 10 reads are aligned). I then modified your fastq so that the quality strings were the same length as the sequences, then I also get all 10 sequences aligned. So this makes me conclude that the fastq parser (silently) bugs out when quality strings are not of the same lengths. I should implement a check for this and also understand why this happens. -
To the core issue, when I tried to reproduce your issue aligning the fasta version of your reads, I get that all these reads are aligned either to chr11 or contig
KI270831.1
. ContigKI270831.1
is of length 296895 (@SQ SN:KI270831.1 LN:296895
) and all alignments are within the boundary of this contig. I reran a couple of times but didn't observe any stochasticity in the results.
I'm therefore wondering whether in your case, uLTRA made use of an old database.db
file created from an earlier run where contig KI270827.1
was present in the GTF annotation? This could be checked by performing a complete rerun of uLTRA by specifying a new index folder path, alternatively removing all contents in the indexing folder.
If you do such a "complete rerun" (and also use a fasta file or properly formatted fastq file) and still observe this behaviour I would have to take more serious debugging action. Let me know.
from ultra.
I reran the complete pipeline, but the issue persists...
Again, I used the two stage comands directly, with the --disable_infer parameter for the index step. Might that cause the issue?
# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# prepare the isoseq reads
samtools fastq ${isoseq_reads}.bam > ${isoseq_reads}.fastq
# align the reads
uLTRA align GRCh38.genome.fa ${isoseq_reads}.fastq . --prefix ${isoseq_reads}_aligned --index uLTRA_index --isoseq --t 48
samtools view -Sb ${isoseq_reads}_aligned.sam | samtools sort -o ${isoseq_reads}_aligned.bam -
samtools index ${isoseq_reads}_aligned.bam
samtools view ${isoseq_reads}_aligned.bam KI270827.1|wc -l
1001
samtools view ${isoseq_reads}_aligned.bam KI270827.1|head -1
m64080_200120_120529/16976146/ccs 256 KI270827.1 134490 0 24=1X9=1X24=1X28=851N166=1X109=4748N98=865N161=220N133=6911N119=1246N108=1210N101=2653N102=232N121=2393N232= * 0 0 GGCTCATCCCGGAAGGACCGGTGTCTAGGTCACCCTGGAGCGCTCACCCCACCGGCACCCGTGCCCAAGCCCGCCCCTGCAAAGGCAGGCAAGGCCAGGCGGGTGCTGCCTGGGACCCAGTGACTCAGCACCCCTGCCCGGATCAACTGGACTTTTGCCCCCTGCTCCGCCAGCCTCCTGCTTGGATCTCTCCTGGGTCTCCCTGCTGCGCCTGTCCAGGATGCAGGGAGCTCGGGCTCCCAGGGACCAGGGCCGGTCCCCCGGCAGGATGAGCGCTCTAGGCCGGTCCTCGGTCATCTTGCTTACCTACGTGCTGGCCGCCACAGAACTTACCTGCCTCTTCATGCAGTTCTCCATCGTGCCATACCTGTCTCGGAAACTGGGCCTGGATTCCATTGCCTTCGGCTACCTGCAAACCACCTTCGGGGTGCTGCAGCTGCTGGGCGGGCCGGTGTTTGGCAGGTTCGCAGACCAGCGCGGGGCGCGGGCGGCGCTCACGCTCTCCTTCCTGGCTGCCTTGGCGCTCTACCTGCTCCTGGCGGCCGCCTCCAGCCCGGCCCTGCCCGGGGTCTACCTGCTCTTCGCCTCGCGCCTGCCCGGAGCGCTCATGCACACGCTGCCAGCCGCCCAGATGGTCATCACGGACCTGTCGGCACCCGAGGAGCGGCCCGCGGCCCTGGGCCGGCTGGGCCTCTGCTTCGGCGTCGGAGTCATCCTCGGCTCCCTGCTGGGCGGGACCCTGGTCTCCGCGTACGGGATTCAGTGCCCGGCCATCCTGGCTGCCCTGGCCACCCTCCTGGGAGCTGTCCTCAGCTTCACCTGCATCCCCGCCAGCACCAAAGGGGCCAAAACTGACGCCCAGGCTCCACTGCCAGGCGGCCCCCGGGCCAGTGTGTTCGACCTGAAGGCCATCGCCTCCCTGCTGCGGCTGCCAGACGTCCCGAGGATCTTCCTGGTGAAGGTGGCCTCCAACTGCCCCACAGGGCTCTTCATGGTCATGTTCTCCATCATCTCCATGGACTTCTTCCAGCTGGAGGCCGCCCAAGCTGGCTACCTCATGTCCTTCTTCGGGCTCCTCCAGATGGTGACCCAGGGCCTGGTCATCGGGCAGCTGAGCAGCCACTTCTCGGAGGAGGTGCTGCTCCGGGCCAGCGTGCTGGTCTTCATCGTGGTGGGCCTGGCCATGGCCTGGATGTCCAGCGTCTTCCACTTCTGCCTCCTGGTGCCCGGCCTGGTGTTCAGCCTCTGCACCCTCAACGTGGTCACCGACAGCATGCTGATCAAGGCTGTCTCCACCTCGGACACAGGGACCATGCTGGGCCTCTGCGCCTCTGTACAACCACTGCTCCGAACTCTGGGACCCACGGTCGGCGGCCTCCTGTACCGCAGCTTTGGCGTCCCCGTCTTCGGCCACGTGCAGGTTGCTATCAATACCCTTGTCCTCCTGGTCCTCTGGAGGAAACCTATGCCCCAGAGGAAGGACAAAGTCCGGTGACCGCTGCCCAGACACAGACTGGCAATAAACTCCTACTAAATCCC * XA:Z:ENST00000610526.4 XC:Z:FSM NM:i:4
grep ENST00000610526.4 gencode.v36.chr_patch_hapl_scaff.annotation.gtf | head -1
KI270831.1 HAVANA transcript 134406 157362 . + . gene_id "ENSG00000276130.4"; transcript_id "ENST00000610526.4"; gene_type "protein_coding"; gene_name "SLC22A18"; transcript_type "protein_coding"; transcript_name "SLC22A18-214"; level 2; protein_id "ENSP00000479357.1"; transcript_support_level "1"; hgnc_id "HGNC:10964"; tag "basic"; havana_gene "OTTHUMG00000190753.1"; havana_transcript "OTTHUMT00000485914.1";
As you found, the transcript is actually on KI270831.1, and the position seems to match. However, for me KI270827.1 is reported.
I'll rerun the indexing without the --disable-infer, to see if that solves the issue. How exactly did you process the example?
from ultra.
Thanks for your time with this! Really helps as the bug may be difficult. I will rerun it on another machine to try to reproduce this bug.
Are you sure that you didn't specify the full path to the gtf file
? I get an error ValueError: unknown url type:
from gffutils
when not specifying the full path, as described here
I ran with these commands specifically:
./uLTRA index GRCh38.p13.genome.fa /Users/kxs624/Downloads/uLTRA_issue/gencode.v36.chr_patch_hapl_scaff.annotation.gtf ~/tmp/ULTRA/human_test/ --disable_infer
./uLTRA align ~/Documents/data/genomes/human/HG_38/GRCh38.p13.genome.fa ~/Downloads/uLTRA_issue/test.fast[a/q] ~/tmp/ULTRA/human_test/ --isoseq
I also tried the alignment step using only one core, i.e., with --t 1
.
I ran on my Macbook. Pythons dictionary iteration order change across runs and I use a dictionary to store reference name to ref_ID (used internally in program) so will start by examining this property.
from ultra.
You are right I used absolute paths. I removed them as I thought it makes things more easy to read.
I noticed a mistake on my side, not sure whether this is relevant:
I used GRCh38.genome.fa (e.g. not containing the haplotypes) instead of GRCh38.p13.genome.fa - so the genome fasta did not contain all sequences mentioned in the gtf. Both genomes do contain the sequence for KI270827.1 though. And the tool did not complain about the missing sequences.
I reran the index and alignment steps with GRCh38.p13.genome.fa and got different results, however having the same issue. This time I got 45 alignments on KI270827.1, of which 42 are beyond the end position of the sequence. I also noticed other sequences affected, but I guess this one is very obvious, as it is relatively short. If you need more examples (e.g. the index I produced) let me know.
As you mention that dict implementation might be related to it, my exact python version and linux kernel:
python --version -VV
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0]
uname -srm
Linux 5.10.24.mx64.375 x86_64
from ultra.
I have not forgotten about this issue. Will take a look at it again after vacation.
from ultra.
I can finally close this issue as it, with the highest likelihood, is fixed due to a very similar bug report in #17 that I just fixed.
If you still have this instance easily set up, it would be good with a verification, but I'm not expecting anything as it was so long ago now.
from ultra.
Related Issues (20)
- ultra installation and run error HOT 7
- CIGAR string starts/ends with N HOT 7
- Mapping with uLTRA without GTF? HOT 3
- Cigar is None HOT 2
- BUG -4294967296 HOT 8
- Controlling (high) uLTRA RAM usage HOT 1
- a bug of `--alignment_threshold` HOT 1
- Out of bound reads HOT 13
- Genomes FASTA/GTF files needed to run the evaluation HOT 4
- Can not access local variable 'read_mems_tmp' when using --use_NAM_seeds HOT 2
- Error: invalid feature coordinates (end<start!) at line: HOT 1
- Bug with uLTRA align : TypeError: bad argument type for built-in operation HOT 3
- UnboundLocalError: local variable 'i' referenced before assignment HOT 4
- Can I use ultra to align est to references HOT 1
- error when aligning direct RNA data during revcomp script HOT 3
- KeyError when running test pipeline HOT 3
- How to control minimap 2 parameters during uLTRA alignment HOT 6
- uLTRA + SQANTI3 HOT 2
- Non-absolute paths don't resolve HOT 5
- Python bindings? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ultra.