Comments (35)
I would just write an hts-nim program to do this. Here is some untested, hastily written code to get you started:
import hts
import os
import strutils
import tables
var ibam:Bam
# lookup from cb -> cluster
var clusterTbl = initTable[string,string]()
# lookup from cluster -> bam
var tbl = initTable[string, Bam]()
for x in paramStr(1).lines:
var toks = x.strip().split(",")
if toks[0] == "Barcode": continue
clusterTbl[toks[0]] = toks[1]
if not open(ibam, paramStr(2)):
quit "couldn't open bam"
for aln in ibam:
var cbt = tag[string](aln, "CB")
if cbt.isNone: continue
var cb = cbt.get
if cb notin clusterTbl: continue
var cluster = clusterTbl[cb]
if cluster notin tbl:
var obam: Bam
if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb"):
quit "couldn't open bam for writing"
obam.write_header(ibam.hdr)
tbl[cluster] = obam
tbl[cluster].write(aln)
for k, bam in tbl:
bam.close()
ibam.close()
from hts-nim-tools.
I updated the comment to include the lookup from cb -> cluster.
you run as $prog cluster.txt $bam
this also elides some error checking for brevity.
from hts-nim-tools.
Thanks @brentp ! I will test this and compare the speed with pysam.
from hts-nim-tools.
actually, I didn't see it needed to look at the cluster id. I just split by cb tag.
you'll have to add the part where it adds a lookup from cb -> cluster then writes to the bam of that cluster.
from hts-nim-tools.
okay! I am not familiar with the syntax of nim, so I did not notice you did not do that immediately. thanks for the heads-up
from hts-nim-tools.
awesome! saved me some time. thanks.
from hts-nim-tools.
for
if cb is not present in the cluster file, this will raise an error
I actually want to pass instead of giving errors.
in python dict, I use dict.get() and return None if CB not present in the cluster table. How to do it in nim?
I thought it would be faster for you to update the code than I googling around :)
thanks!
from hts-nim-tools.
I updated the code to handle that. Also, here is a binary, just unzip, chmod +x and run. If you have a relatively recent libc, it should work. Still depends on libhts.so.
tommy.gz
from hts-nim-tools.
great! I am curious to see how fast it is compared with pysam. will report back. thanks again!
from hts-nim-tools.
yeah. interested to hear. you can add threads=2
(or 3) to the open
calls to get a bit more speed on de/compressing the bam which will be the most CPU time.
from hts-nim-tools.
good to know the tip. I have not figured out how to multi-thread for pysam. so I will compare single thread for both at the moment.
from hts-nim-tools.
ok. and if compiling on your own. remember to build with -d:release
from hts-nim-tools.
thanks!
I was compiling myself and it said succeeded.
echo $LD_LIBRARY_PATH
/n/helmod/apps/centos7/Core/bzip2/1.0.6-fasrc01/lib:/n/helmod/apps/centos7/Core/xz/5.2.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib:/n/helmod/apps/centos7/Core/mpc/1.0.3-fasrc06/lib64:/n/helmod/apps/centos7/Core/mpfr/3.1.5-fasrc01/lib64:/n/helmod/apps/centos7/Core/gmp/6.1.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/libpng/1.6.25-fasrc01/lib:/n/helmod/apps/centos7/Core/hdf5/1.10.1-fasrc03/lib::/n/home02/mtang/apps/elastix/lib:/n/home02/mtang/apps/htslib-1.6/libhts.so
./split_scATAC_bam
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information
#same with the file you uploaded
./tommy
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information
but I have libhts in LD_LIBRARY_PATH
from hts-nim-tools.
LD_LIBRARY_PATH
is like PATH
it takes directories, not files.
from hts-nim-tools.
sorry my brain was not working...after add the folder to LD_LIBRARY_PATH
time ./tommy clusters.csv possorted_bam.bam
src/hts/simpleoption.nim(30) get
Error: unhandled exception: Can't obtain a value from a `none` [KeyError]
real 0m0.349s
user 0m0.008s
sys 0m0.011s
it has to do with the dictionary I think, sometimes CB tag maybe not present?
from hts-nim-tools.
updated the snippet to handle this. note the line with isNone
.
from hts-nim-tools.
Thanks! was googling around for how to test a string is none or not in nim.
tried cb.isNil... thanks for saving me. need to learn basics of nim. it is running now will report back.
from hts-nim-tools.
that's returning an option value from the tag method. see docs here: https://brentp.github.io/hts-nim/hts/bam.html#tag%2CRecord%2Cstring
then you can follow the link to the option:
https://brentp.github.io/hts-nim/hts/simpleoption.html#Option
I was previously calling get, but as you can see from the description, it throws and error if the value (tag) is absent.
from hts-nim-tools.
Thanks for the links!
just checked, it is writing to sam not bam.
changed if not open(obam, "out-cluster-" & cluster & ".bam", mode="w")
to if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb")
one more little thing, if I want to skip the header, how should I do it?
for x in paramStr(1).lines:
var toks = x.strip().split(",")
clusterTbl[toks[0]] = toks[1]
from hts-nim-tools.
I updated the snippet again (also to write bam). Basically:
if toks[0] == "Barcode": continue
or you can track the line number yourself and if line_no == 1: continue
from hts-nim-tools.
thanks! that works, but one has to know the column names beforehand. so tracking the line number is better.
in python3, I can do header = next(csv) to skip it.
the pysam
version seems to work, see https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?both#use-pysam
from hts-nim-tools.
reporting back. It is 1.5x faster using hts-nim
than pysam
. impressive.
from hts-nim-tools.
the bam file after splitting is a bit different:
# by pysam
samtools flagstat cluster10.bam
23691881 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 secondary
0 + 0 supplementary
13520847 + 0 duplicates
23499718 + 0 mapped (99.19% : N/A)
23691701 + 0 paired in sequencing
11846803 + 0 read1
11844898 + 0 read2
23310179 + 0 properly paired (98.39% : N/A)
23383493 + 0 with itself and mate mapped
116045 + 0 singletons (0.49% : N/A)
40060 + 0 with mate mapped to a different chr
20207 + 0 with mate mapped to a different chr (mapQ>=5)
# by hts-nim
samtools flagstat atac_v1_pbmc_5k/outs/out-cluster-10.bam
23174666 + 0 in total (QC-passed reads + QC-failed reads)
176 + 0 secondary
0 + 0 supplementary
13207244 + 0 duplicates
23034081 + 0 mapped (99.39% : N/A)
23174490 + 0 paired in sequencing
11587245 + 0 read1
11587245 + 0 read2
22873454 + 0 properly paired (98.70% : N/A)
22942804 + 0 with itself and mate mapped
91101 + 0 singletons (0.39% : N/A)
38116 + 0 with mate mapped to a different chr
19077 + 0 with mate mapped to a different chr (mapQ>=5)
pysam version
import pysam
import csv
cluster_dict = {}
with open('clusters.csv') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
#skip header
header = next(csv_reader)
for row in csv_reader:
cluster_dict[row[0]] = row[1]
clusters = set(x for x in cluster_dict.values())
fin = pysam.AlignmentFile("atac_v1_pbmc_5k_possorted_bam.bam", "rb")
# open the number of bam files as the same number of clusters, and map the out file handler to the cluster id, write to a bam with wb
fouts_dict = {}
for cluster in clusters:
fout = pysam.AlignmentFile("cluster" + cluster + ".bam", "wb", template = fin)
fouts_dict[cluster] = fout
for read in fin:
tags = read.tags
# the 8th item is the CB tag
CB_list = [ x for x in tags if x[0] == "CB"]
if CB_list:
cell_barcode = CB_list[0][1]
# the bam files may contain reads not in the final clustered barcodes
# will be None if the barcode is not in the clusters.csv file
cluster_id = cluster_dict.get(cell_barcode)
if cluster_id:
fouts_dict[cluster_id].write(read)
## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
fout.close()
There are a little more reads pulled out by pysam
.
download the bam file and the clusters.csv files from http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_analysis.tar.gz
from hts-nim-tools.
hold on...I just noticed I was using different bam files to begin with, one is the one downloaded and the other one is from fastq that I processed with cellranger. Let me test on the same bam...
from hts-nim-tools.
I won't have time to debug this. The code looks fine to me. If you have some difference, I would start by finding a single read that's in pysam output but not hts-nim (or vice-versa) and figuring out why.
from hts-nim-tools.
sure. totally understand.
from hts-nim-tools.
were you able to resolve this?
from hts-nim-tools.
was distracted...working on it now. will report back.
from hts-nim-tools.
update, not sure why pysam
pull out some read records without the CB tag.
## by pysam
samtools flagstat cluster12.bam
16771134 + 0 in total (QC-passed reads + QC-failed reads)
140 + 0 secondary
0 + 0 supplementary
9381485 + 0 duplicates
16637530 + 0 mapped (99.20% : N/A)
16770994 + 0 paired in sequencing
8385399 + 0 read1
8385595 + 0 read2
16508476 + 0 properly paired (98.43% : N/A)
16556331 + 0 with itself and mate mapped
81059 + 0 singletons (0.48% : N/A)
24024 + 0 with mate mapped to a different chr
12056 + 0 with mate mapped to a different chr (mapQ>=5)
# by hts-nim
samtools flagstat out-cluster-12.bam
16407030 + 0 in total (QC-passed reads + QC-failed reads)
136 + 0 secondary
0 + 0 supplementary
9162203 + 0 duplicates
16309886 + 0 mapped (99.41% : N/A)
16406894 + 0 paired in sequencing
8203447 + 0 read1
8203447 + 0 read2
16201216 + 0 properly paired (98.75% : N/A)
16246368 + 0 with itself and mate mapped
63382 + 0 singletons (0.39% : N/A)
22746 + 0 with mate mapped to a different chr
11304 + 0 with mate mapped to a different chr (mapQ>=5)
There are more reads pulled out by pysam
than hts-nim
samtools view cluster12.bam | cut -f1 | sort -u > cluster12_readIDs.txt
samtools view out-cluster-12.bam | cut -f1 | sort -u > out-cluster-12_readIDs.txt
comm -23 cluster12_readIDs.txt out-cluster-12_readIDs.txt > cluster12_specific_readIDs.txt
I then checked one of the reads that pysam specific.
There is no CB tag, how come it was pulled out by my pysam code, to begin with...?
samtools view cluster12.bam | grep -w A00519:57:HG3JCDMXX:1:1101:10375:20400
A00519:57:HG3JCDMXX:1:1101:10375:20400 1187 chr8 21769249 60 49M = 21769275 76 ACCCGGAAGCGAGAAGGAGAGGCTGATAACGGGGCTGGCCCAGGCTGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F NM:i:0 MD:Z:49 MC:Z:50M AS:i:49 XS:i:20 CR:Z:GGCAATTTGCTGGAAT CY:Z:,F,,,,,F,,F,F:,, BC:Z:CTATACGC QT:Z:FFFFFFFF GP:i:1414564938 MP:i:1414565014 MQ:i:60 RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:1
from hts-nim-tools.
that is odd. The pysam code looks sane to me.
from hts-nim-tools.
I can not figure out... this bothers me! It has to do with the paired-end reads?
samtools view atac_v1_pbmc_5k_possorted_bam.bam | grep -w "A00519:57:HG3JCDMXX:2:1318:26684:24220"
A00519:57:HG3JCDMXX:2:1318:26684:24220 163 chr1 30687631 60 49M = 30687748 167 TAGATGGGATACTTCATATCATCCCCCTGCCCTTCCATTGCATTAAGAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F NM:i:0 MD:Z:49 MC:Z:50M AS:i:49
XS:i:19 CR:Z:ATGGCTAGGTGTCGAG CY:Z::FFF::,FFFF,:F,, BC:Z:AGGGCGTT QT:Z::FFFFF:F GP:i:30687630 MP:i:30687797 MQ:i:60
RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
A00519:57:HG3JCDMXX:2:1318:26684:24220 83 chr1 30687748 60 50M = 30687631 -167 TCTTCCAATCCATCATCCTACTTTCAGGCTGGGAAGAGCGTGTTCTGATT FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:50 MC:Z:49M AS:i:50
XS:i:20 CR:Z:ATGGCTAGGTGTCGAG CY:Z::FFF::,FFFF,:F,, BC:Z:AGGGCGTT QT:Z::FFFFF:F GP:i:30687797 MP:i:30687630 MQ:i:60
RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
There are two reads with the same id from the original bam. only one of them appeared in the split bam by pysam.
from hts-nim-tools.
it's a bug in your pysam code. you have:
CB_list = [ x for x in tags if x[0] == "CB"]
if CB_list:
cell_barcode = CB_list[0][1]
so if CB_list is empty, then cell_barcode is left as the value of the last seen CB tag. you need an else that does continue
from hts-nim-tools.
thanks a billion! This saves me. will update the code.
from hts-nim-tools.
just confirm that now pysam and hts-nim produce identical results after updating the code. thanks!
from hts-nim-tools.
subset-bam from 10X seems to be fit for the job. Did not try yet.
from hts-nim-tools.
Related Issues (11)
- new tools: split vcf HOT 1
- could not import: sam_hdr_destroy HOT 9
- please tag new release - 0.2.0 is incompatible with nim 1.4
- Opening BAMs doesn't appear to return bool HOT 2
- example tools HOT 12
- bam_filter and count_reads broken by API update HOT 1
- trying `hts_nim_tools count-reads <bed> <bam>` only return 0 counts HOT 4
- Installation error HOT 2
- Could not import: bam_hdr_destroy HOT 6
- Question: Convert a seq[CigarElement] to Cigar HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from hts-nim-tools.