I used your tool (version 0.3, installed with pip) to align isoseq data to the human r

So I looked into this and here are my observations: <p dir="a

Alignment outside of contig about ultra HOT 11 CLOSED

ksahlin commented on September 28, 2024

Alignment outside of contig

from ultra.

Comments (11)

MatthiasLienhard commented on September 28, 2024 1

I can confirm that there is no annotation on KI270827.1 in the version I was using (v36).

from ultra.

ksahlin commented on September 28, 2024

This is strange, I will have a better look at this over the weekend. It would help if I can get at least one read(/sequence) that has aligned coordinates outside this contig, as well as your command line calling uLTRA. Would this be possible?

from ultra.

ksahlin commented on September 28, 2024

I decided to wait with tracing down this bug until I get a sequence that produces violating alignment coordinates, as this bug may be nontrivial.

from ultra.

MatthiasLienhard commented on September 28, 2024

I prepared a minimal example, by taking every 100th read from the affected contig:

samtools view CTL_S2_sorted.bam KI270827.1|perl -lane '$c+=1; if ($c%100==0){print "\@read$c\n$F[9]\n+\n$F[10]"}'>test.fastq

This resulted in 10 reads. As the aligned reads did not contain quality strings, there are no quality values in the fastq, but I think it does not matter.
You can get this file and the resulting alignment here:
test.fastq

The reference genome and annotation I used is from GENCODE.
I used your tool with the new parameter options as follows, to reprocess the selected reads:

# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# align the reads
uLTRA align GRCh38.genome.fa test.fastq . --prefix test --index uLTRA_index --isoseq

Strangely, the resulting alignment contained only 4 reads (even though all where aligned to KI270827.1 when origninally processed together with all reads). For each read, two alignments are found, one on chr11 and one on KI270827.1. Sometimes chr11 is reported as primary, sometimes KI270827.1. Except for the strand[1] the alignment to KI270827.1 is the same as in my original uLTRA alignment with all reads.

[1] apparently the sequence in the sam file is reversed if aligned to the - strand, which I did not consider when creating the fastq.

from ultra.

ksahlin commented on September 28, 2024

Hi Matthias,

Thanks for the effort in producing a minimal example, it really helps.

I just wanted to double-check with you if you are sure that you are using v36 on the gencode annotation? The reason that I ask is that I cannot find KI270827.1 in the gencode.v36.chr_patch_hapl_scaff.annotation.gtf (downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.chr_patch_hapl_scaff.annotation.gtf.gz -- which is the "ALL" entry at https://www.gencodegenes.org/human/release_36.html ). It is however present in the GRCh38.genome.fa file. Could you check this with e.g. grep "KI270827.1" [your annotation].gtf

This may be precisely what is causing the bug, but just thought I would check if I have the correct GTF annotation file before I go on a bug hunt.

from ultra.

ksahlin commented on September 28, 2024

So I looked into this and here are my observations:

"Strangely, the resulting alignment contained only 4 reads"
I also observed this behavior when using the fastq file that you produced. I do not observe this when I convert the file to fasta (then all 10 reads are aligned). I then modified your fastq so that the quality strings were the same length as the sequences, then I also get all 10 sequences aligned. So this makes me conclude that the fastq parser (silently) bugs out when quality strings are not of the same lengths. I should implement a check for this and also understand why this happens.
To the core issue, when I tried to reproduce your issue aligning the fasta version of your reads, I get that all these reads are aligned either to chr11 or contig KI270831.1. Contig KI270831.1 is of length 296895 (@SQ SN:KI270831.1 LN:296895) and all alignments are within the boundary of this contig. I reran a couple of times but didn't observe any stochasticity in the results.

I'm therefore wondering whether in your case, uLTRA made use of an old database.db file created from an earlier run where contig KI270827.1 was present in the GTF annotation? This could be checked by performing a complete rerun of uLTRA by specifying a new index folder path, alternatively removing all contents in the indexing folder.

If you do such a "complete rerun" (and also use a fasta file or properly formatted fastq file) and still observe this behaviour I would have to take more serious debugging action. Let me know.

from ultra.

MatthiasLienhard commented on September 28, 2024

I reran the complete pipeline, but the issue persists...
Again, I used the two stage comands directly, with the --disable_infer parameter for the index step. Might that cause the issue?

# prepare the index
uLTRA index GRCh38.genome.fa gencode.v36.chr_patch_hapl_scaff.annotation.gtf uLTRA_index --disable_infer
# prepare the isoseq reads
samtools fastq ${isoseq_reads}.bam > ${isoseq_reads}.fastq

# align the reads
uLTRA align GRCh38.genome.fa ${isoseq_reads}.fastq . --prefix ${isoseq_reads}_aligned --index uLTRA_index --isoseq --t 48
samtools view -Sb ${isoseq_reads}_aligned.sam | samtools sort -o ${isoseq_reads}_aligned.bam -
samtools index ${isoseq_reads}_aligned.bam
samtools view ${isoseq_reads}_aligned.bam KI270827.1|wc -l
1001
samtools view ${isoseq_reads}_aligned.bam KI270827.1|head -1
m64080_200120_120529/16976146/ccs       256     KI270827.1      134490  0       24=1X9=1X24=1X28=851N166=1X109=4748N98=865N161=220N133=6911N119=1246N108=1210N101=2653N102=232N121=2393N232= *       0       0       GGCTCATCCCGGAAGGACCGGTGTCTAGGTCACCCTGGAGCGCTCACCCCACCGGCACCCGTGCCCAAGCCCGCCCCTGCAAAGGCAGGCAAGGCCAGGCGGGTGCTGCCTGGGACCCAGTGACTCAGCACCCCTGCCCGGATCAACTGGACTTTTGCCCCCTGCTCCGCCAGCCTCCTGCTTGGATCTCTCCTGGGTCTCCCTGCTGCGCCTGTCCAGGATGCAGGGAGCTCGGGCTCCCAGGGACCAGGGCCGGTCCCCCGGCAGGATGAGCGCTCTAGGCCGGTCCTCGGTCATCTTGCTTACCTACGTGCTGGCCGCCACAGAACTTACCTGCCTCTTCATGCAGTTCTCCATCGTGCCATACCTGTCTCGGAAACTGGGCCTGGATTCCATTGCCTTCGGCTACCTGCAAACCACCTTCGGGGTGCTGCAGCTGCTGGGCGGGCCGGTGTTTGGCAGGTTCGCAGACCAGCGCGGGGCGCGGGCGGCGCTCACGCTCTCCTTCCTGGCTGCCTTGGCGCTCTACCTGCTCCTGGCGGCCGCCTCCAGCCCGGCCCTGCCCGGGGTCTACCTGCTCTTCGCCTCGCGCCTGCCCGGAGCGCTCATGCACACGCTGCCAGCCGCCCAGATGGTCATCACGGACCTGTCGGCACCCGAGGAGCGGCCCGCGGCCCTGGGCCGGCTGGGCCTCTGCTTCGGCGTCGGAGTCATCCTCGGCTCCCTGCTGGGCGGGACCCTGGTCTCCGCGTACGGGATTCAGTGCCCGGCCATCCTGGCTGCCCTGGCCACCCTCCTGGGAGCTGTCCTCAGCTTCACCTGCATCCCCGCCAGCACCAAAGGGGCCAAAACTGACGCCCAGGCTCCACTGCCAGGCGGCCCCCGGGCCAGTGTGTTCGACCTGAAGGCCATCGCCTCCCTGCTGCGGCTGCCAGACGTCCCGAGGATCTTCCTGGTGAAGGTGGCCTCCAACTGCCCCACAGGGCTCTTCATGGTCATGTTCTCCATCATCTCCATGGACTTCTTCCAGCTGGAGGCCGCCCAAGCTGGCTACCTCATGTCCTTCTTCGGGCTCCTCCAGATGGTGACCCAGGGCCTGGTCATCGGGCAGCTGAGCAGCCACTTCTCGGAGGAGGTGCTGCTCCGGGCCAGCGTGCTGGTCTTCATCGTGGTGGGCCTGGCCATGGCCTGGATGTCCAGCGTCTTCCACTTCTGCCTCCTGGTGCCCGGCCTGGTGTTCAGCCTCTGCACCCTCAACGTGGTCACCGACAGCATGCTGATCAAGGCTGTCTCCACCTCGGACACAGGGACCATGCTGGGCCTCTGCGCCTCTGTACAACCACTGCTCCGAACTCTGGGACCCACGGTCGGCGGCCTCCTGTACCGCAGCTTTGGCGTCCCCGTCTTCGGCCACGTGCAGGTTGCTATCAATACCCTTGTCCTCCTGGTCCTCTGGAGGAAACCTATGCCCCAGAGGAAGGACAAAGTCCGGTGACCGCTGCCCAGACACAGACTGGCAATAAACTCCTACTAAATCCC        *       XA:Z:ENST00000610526.4  XC:Z:FSM        NM:i:4
grep ENST00000610526.4 gencode.v36.chr_patch_hapl_scaff.annotation.gtf | head -1
KI270831.1      HAVANA  transcript      134406  157362  .       +       .       gene_id "ENSG00000276130.4"; transcript_id "ENST00000610526.4"; gene_type "protein_coding"; gene_name "SLC22A18"; transcript_type "protein_coding"; transcript_name "SLC22A18-214"; level 2; protein_id "ENSP00000479357.1"; transcript_support_level "1"; hgnc_id "HGNC:10964"; tag "basic"; havana_gene "OTTHUMG00000190753.1"; havana_transcript "OTTHUMT00000485914.1";

As you found, the transcript is actually on KI270831.1, and the position seems to match. However, for me KI270827.1 is reported.
I'll rerun the indexing without the --disable-infer, to see if that solves the issue. How exactly did you process the example?

from ultra.

ksahlin commented on September 28, 2024

Thanks for your time with this! Really helps as the bug may be difficult. I will rerun it on another machine to try to reproduce this bug.

Are you sure that you didn't specify the full path to the gtf file? I get an error ValueError: unknown url type: from gffutilswhen not specifying the full path, as described here

I ran with these commands specifically:

./uLTRA index GRCh38.p13.genome.fa /Users/kxs624/Downloads/uLTRA_issue/gencode.v36.chr_patch_hapl_scaff.annotation.gtf ~/tmp/ULTRA/human_test/ --disable_infer

./uLTRA align ~/Documents/data/genomes/human/HG_38/GRCh38.p13.genome.fa ~/Downloads/uLTRA_issue/test.fast[a/q] ~/tmp/ULTRA/human_test/ --isoseq

I also tried the alignment step using only one core, i.e., with --t 1.

I ran on my Macbook. Pythons dictionary iteration order change across runs and I use a dictionary to store reference name to ref_ID (used internally in program) so will start by examining this property.

from ultra.

MatthiasLienhard commented on September 28, 2024

You are right I used absolute paths. I removed them as I thought it makes things more easy to read.

I noticed a mistake on my side, not sure whether this is relevant:
I used GRCh38.genome.fa (e.g. not containing the haplotypes) instead of GRCh38.p13.genome.fa - so the genome fasta did not contain all sequences mentioned in the gtf. Both genomes do contain the sequence for KI270827.1 though. And the tool did not complain about the missing sequences.

I reran the index and alignment steps with GRCh38.p13.genome.fa and got different results, however having the same issue. This time I got 45 alignments on KI270827.1, of which 42 are beyond the end position of the sequence. I also noticed other sequences affected, but I guess this one is very obvious, as it is relatively short. If you need more examples (e.g. the index I produced) let me know.

As you mention that dict implementation might be related to it, my exact python version and linux kernel:

python --version -VV
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
uname -srm
Linux 5.10.24.mx64.375 x86_64

from ultra.

ksahlin commented on September 28, 2024

I have not forgotten about this issue. Will take a look at it again after vacation.

from ultra.

ksahlin commented on September 28, 2024

I can finally close this issue as it, with the highest likelihood, is fixed due to a very similar bug report in #17 that I just fixed.

If you still have this instance easily set up, it would be good with a verification, but I'm not expecting anything as it was so long ago now.

from ultra.

Alignment outside of contig about ultra HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent