Comments (13)
The genome the genome and gtf are from Human genome GRCh38 (dowloaded from igenome).
@husensofteng Can you, please, provide the chunk13 fasta file which should contain the read m54224_190921_125133/25952682/ccs
?
from ultra.
Hi @ksahlin,
Thanks for looking into the issue. Please find the corresponding sequence below.
>m54224_190921_125133/25952682/ccs
GGGGAGGGTGAGGGAGGGGGTGGGATGGGTGGGGGGTAACGGGGGAAACTGGGGAAGTGGGGAACCGAGGGGCAACCAGGGGAAGATGGGGTGCTGGAGGAGAGCTTGTGGGAGCCAAGGAGCACCTTGGACATCTGGAGTCTGGCAGGAGTGATGACGGGTGGAGGGGCTAGCTCGAGGCAGGGCTGGTGGGGCCTGAGGCCAGTGAGGAGTGTGGAGTAGGTGCCCAGGCATCGTGCAGACAGGGCGACATCAGCTGGGGACGATGGGCCTGAGCTAGGGCTGGAAAGAAGGGGGAGCCAGGCATTCATCCCGGTCACTTTTGGTTACAGGACGTGGCAGCTGGTTGGACGAGGGGAGCTGGTGGGCAGGGTTTGATCCCAGGGCCTGGGCAACGGAGGTGTAGCTGGCAGCAGCGGGCAGGTGAGGACCCCATCTGCCGGGCAGGTGAGTCCCTTCCCTCCCCAGGCCTCGCTTCCCCAGCCTTCTGAAAGAAGGAGGTTTAGGGGATCGAGGGCTGGCGGGGAGAAGCAGACACCCTCCCAGCAGAGGGGCAGGATGGGGGCAGGAGAGTTAGCAAAGGTGACATCTTCTCGGGGGGAGCCGAGACTGCGCAAGGCTGGGGGGCTATGGGCCCGTTCCAGGCAGAAAGAGCAAGAGGGCAGGGAGGGAGCACAGGGGTGGCCAGCGTAGGGTCCAGCACGTGGGGTGGTACCCCAGGCCTGGGTCAGACAGGGACAAGGCAGGGGACACAGGACAGAGGGGTCCCCAGCTGCCACCTCACCCACCGCAATTCATTTAGTAGCAGGCACAGGGGCAGCTCCGGCACGGCTTTCTCAGGCCTATGCCGGAGCCTCGAGGGCTGGAGAGCGGGAAGACAGGCAGTGCTCGGGGAGTTGCAGCAGGACATCACCAGGAGGGCGAAGCGGCCACGGGAGGGGGGCCCCGGGACATTGCGCAGCAAGGAGGCTGCAGGGGCTCGGCCTGCGGGCGCCGGTCCCACGAGGCACTGCGGCCCAGGGTCTGGTGCGGAGAGGGCCCACAGTGGACTTGGTGACGCTGTATGCCCTCACCGCTCAGCCCCTGGGGCTGGCTTGGCAGACAGTACAGCATCCAGGGGAGTCAAGGGCATGGGGCGAGACCAGACTAGGCGAGGCGGGCGGGGCGGAGTGAATGAGCTCTCAGGAGGGAGGATGGTGCAGGCAGGGGTGAGGAGCGCAGCGGGCGGCGAGCGGGAGGCACTGGCCTCCAGAGCCCGTGGCCAAGGCGGGCCTCGCGGGCGGCGACGGAGCCGGGATCGGTGCCTCAGCGTTCGGGCTGGAGACGAGGCCAGGTCTCCAGCTGGGGTGGACGTGCCCACCAGCTGCCGAAGGCCAAGACGCCAGGTCCGGTGGACGTGACAAGCAGGACATGACATGGTCCGGTGTGACGGCGAGGACAGAGGAGGCGCGTCCGGCCTTCCTGAACACCTTAGGCTGGTGGGGCTGCGGCAAGAAGCGGGTCTGTTTCTTTACTTCCTCCACGGAGTCGGCACACTATGGCTGCCCTCTGGGCTCCCAGAACCCACAACATGAAAGAAATGGTGCTACCCAGCTCAAGCCTGGGCCTTTGAATCCGGACACAAAACCCTCTAGCTTGGAAATGAATATGCTGCACTTTACAACCACTGCACTACCTGACTCAGGAATCGGCTCTCGAAGGTGAAGCGAGAGGAACCAGACCTCATCAGCCCAACATCAAAGACACCATCGGAACAGCAGCGCCCGCAGCACCCACCCCGCACCGCGACTCCATCTTCACGGCCACCCCCTGCGGCGGACGGTTGACCACCAGCCACCACATCATCCCAGAGCTGAGCTCCTCCAGCGGGATGACGCCGTCCCCACCACCTCCCTCTTCTTCTTTTTCATCCTTCTGTCTCTTTGTTTCTGAGCTTTCCTGTCTTTCCTTTTTTCTGAGAGATTCAAAGCCTCCACGACTCTGTTTCCCCCGTCCCTTCTGAATTTAATTTGCACTAAGTCATTTGCACTGGTTGGAGTTGTGGAGACGGCCTTGAGTCTCGGTGCGAGTGTGCGTGAGTGTGAGCCACCTTGGCAAGTGCCTGCGCAGGGCCCGGCCGCCCTCCATCTGGGCCGGGTGACTGGGCGCCGGCTGTGTGCCCGAGGCCTCACCCTGCCCTCGCCTAGTCTGGAAGCTCCGACCGACATCACGGAGCAGCCTTCAAGCATTCCATTACGCCCCATCTCGCTCTGTGCCCCTCCCCACCAGGGCTTCAGCAGGAGCCCTGGACTCATCATCAATAAACACTGTTACAG
Also, these are the full paths to the reference files:
s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf
from ultra.
Yes, was just waiting for your input. Will do it asap and get back to you here when done.
from ultra.
I have now uploaded v0.0.4.2 to PyPI and created a release.
I will close this issue. you can always bump here when the conda package is updated and I'll update the readme on the conda version.
Thanks for reporting and documenting so well this nasty bug!
from ultra.
Hi @sguizard
Not that I can think of off the top of my head.
It would be great to get the offending read as well as the genome and gtf files so that I could reproduce it. Would it be possible?
Best,
K
from ultra.
I have downloaded the archive. Is the relevant gtf-annotation in the Genes.gencode
folder? (there are two other folders Genes
and Archives
(with subfolders), all with gtf files in them)
As for the genome, is it this one:
from ultra.
I have found the error. Indeed, the read is aligned to chr11_KI270721v1_random
up until the very end in uLTRA where a merging of sam-files is performed (in case of parallelization).
The line causing the error seems to be
alignment_outfile = pysam.AlignmentFile( os.path.join(args.outfolder, args.prefix+".sam"), "w", reference_names=list(refs_lengths.keys()), reference_lengths=list(refs_lengths.values()) )
On this line, uLTRA uses the reference sequences in the reference fasta to form a template for the final SAM-file. uLTRAs uses temporary SAM files to produce the alignment (in ase of multiprocessing).
Now, in the temporary SAM-file, reference accession chr11_KI270721v1_random
is found on line 33. When initialising the final SAM file with command above, sequence chr1_KI270713v1_random
occurs on line 33.
The read as mapped with uLTRA has already set all fileds for the read, e.g., read.reference_name = "chr11_KI270721v1_random"
. However, in the temporary SAM file, this "read record" gets updated (involuntarily) as read.reference_name = "chr1_KI270713v1_random"
.
This leads me to believe that the library Pysam
that uLTRA uses internally stores the position of the reference in the header. That is, read.reference_name
points to a position in a vector storing the reference sequences. This means that if I now create a new sam file using a different template (different reference sequence vector), reads can get assigned to the wrong reference.
I need to figure out why uLTRA uses only a subset of the references in it's header (creating the diff), and I will fix this asap. This bug will most likely fix issue #2 as well.
from ultra.
Hi @sguizard and @husensofteng ,
I have pushed a fix for this to master. Could you verify that the latest master solves the problem (and does not introduce something unexpected). The fix was pretty minimal so I think it should not introduce anything unexpected but good to check regardless.
Ideally if would be good with a diff or something looking at the number of lines affected in the SAM between the versions. There could be other reads for where this reference name switch has occurred that are silent due to not being out of bounds.
@sguizard, If everything works with the latest master, I kindly ask for your expertise in updating also the bioconda version whenever you have time :) I made sure to update the setup.py script to v0.0.4.2
from ultra.
Hi @ksahlin, Thanks for your investigation.
I'm having a look to this today.
No problems about bioconda, I'll take care of this ;)
from ultra.
I tested the patch with the complete set of reads.
Apart the corrected sequence for read m54224_190921_125133/25952682/ccs
, the existing differences (between **) are located in the XA tag, where ID of alternative mappings are switched.
Before: m54224_190921_125133/26149707/ccs 0 chr17 1742846 60 1=1X...9=4S * 0 0 GGG...AGA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:4
After: m54224_190921_125133/26149707/ccs 0 chr17 1742846 60 1=1X...9=4S * 0 0 GGG...AGA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:4
Before: m54224_190921_125133/26018748/ccs 0 chr17 1742852 60 2=1D...2=1S * 0 0 GGG...GCA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:3
After: m54224_190921_125133/26018748/ccs 0 chr17 1742852 60 2=1D...2=1S * 0 0 GGG...GCA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:3
Before: m54224_190921_125133/25625088/ccs 0 chr17 1742871 60 4S38=...8=1S * 0 0 GGG...GCA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:1
After: m54224_190921_125133/25625088/ccs 0 chr17 1742871 60 4S38=...8=1S * 0 0 GGG...GCA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:1
Before: m54224_190921_125133/25690218/ccs 0 chr17 1742871 60 4S38=...1144=77S * 0 0 GGG...AGA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:0
After: m54224_190921_125133/25690218/ccs 0 chr17 1742871 60 4S38=...1144=77S * 0 0 GGG...AGA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:0
Before: m54224_190921_125133/25822164/ccs 0 chr17 1742871 60 4S38=...428=1S * 0 0 GGG...GCA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:7
After: m54224_190921_125133/25822164/ccs 0 chr17 1742871 60 4S38=...428=1S * 0 0 GGG...GCA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:7
Before: m54224_190921_125133/26018081/ccs 0 chr17 1742871 60 4S38=...579=3S * 0 0 GGG...AGA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:2
After: m54224_190921_125133/26018081/ccs 0 chr17 1742871 60 4S38=...579=3S * 0 0 GGG...AGA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:2
Before: m54224_190921_125133/26149488/ccs 0 chr17 1742871 60 6S38=...786=3S * 0 0 GGG...AGA * **XA:Z:NM_000934.3,XM_005256698.2** XC:Z:FSM NM:i:1
After: m54224_190921_125133/26149488/ccs 0 chr17 1742871 60 6S38=...786=3S * 0 0 GGG...AGA * **XA:Z:XM_005256698.2,NM_000934.3** XC:Z:FSM NM:i:1
Before: m54224_190921_125133/25952682/ccs 16 **chr1_KI270713v1_random** 50359 60 1=1D...1X2= * 0 0 TCT...CCC * XA:Z:XM_006724897.1 XC:Z:FSM NM:i:14
After: m54224_190921_125133/25952682/ccs 16 **chr11_KI270721v1_random** 50359 60 1=1D...1X2= * 0 0 TCT...CCC * XA:Z:XM_006724897.1 XC:Z:FSM NM:i:14
So, it seems OK for me.
I update the bioconda recipe and the nf-core/isoseq pipeline
from ultra.
@ksahlin do you plan to create a 0.0.4.2 release?
from ultra.
@ksahlin the updated bioconda recipe is now live. I take care of the nf-core modules today.
from ultra.
super, thanks!
from ultra.
Related Issues (20)
- ultra installation and run error HOT 7
- CIGAR string starts/ends with N HOT 7
- Mapping with uLTRA without GTF? HOT 3
- Cigar is None HOT 2
- BUG -4294967296 HOT 8
- Controlling (high) uLTRA RAM usage HOT 1
- a bug of `--alignment_threshold` HOT 1
- Genomes FASTA/GTF files needed to run the evaluation HOT 4
- Can not access local variable 'read_mems_tmp' when using --use_NAM_seeds HOT 2
- Error: invalid feature coordinates (end<start!) at line: HOT 1
- Bug with uLTRA align : TypeError: bad argument type for built-in operation HOT 3
- UnboundLocalError: local variable 'i' referenced before assignment HOT 4
- Can I use ultra to align est to references HOT 1
- error when aligning direct RNA data during revcomp script HOT 3
- KeyError when running test pipeline HOT 3
- How to control minimap 2 parameters during uLTRA alignment HOT 6
- uLTRA + SQANTI3 HOT 2
- Non-absolute paths don't resolve HOT 5
- Python bindings? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ultra.