Giter VIP home page Giter VIP logo

Comments (35)

brentp avatar brentp commented on June 2, 2024 2

I would just write an hts-nim program to do this. Here is some untested, hastily written code to get you started:

import hts
import os
import strutils
import tables
var ibam:Bam

# lookup from cb -> cluster
var clusterTbl = initTable[string,string]()
# lookup from cluster -> bam
var tbl = initTable[string, Bam]()

for x in paramStr(1).lines:
  var toks = x.strip().split(",")
  if toks[0] == "Barcode": continue
  clusterTbl[toks[0]] = toks[1]
    
if not open(ibam, paramStr(2)):
   quit "couldn't open bam"
    
for aln in ibam:
  var cbt = tag[string](aln, "CB")
  if cbt.isNone: continue
  var cb = cbt.get

  if cb notin clusterTbl: continue
  var cluster = clusterTbl[cb]
  if cluster notin tbl:
    var obam: Bam
    if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb"):
      quit "couldn't open bam for writing"
    obam.write_header(ibam.hdr)
    tbl[cluster] = obam
  tbl[cluster].write(aln)

for k, bam in tbl:
  bam.close()
ibam.close()

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024 2

I updated the comment to include the lookup from cb -> cluster.
you run as $prog cluster.txt $bam

this also elides some error checking for brevity.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

Thanks @brentp ! I will test this and compare the speed with pysam.

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

actually, I didn't see it needed to look at the cluster id. I just split by cb tag.
you'll have to add the part where it adds a lookup from cb -> cluster then writes to the bam of that cluster.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

okay! I am not familiar with the syntax of nim, so I did not notice you did not do that immediately. thanks for the heads-up

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

awesome! saved me some time. thanks.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

for

if cb is not present in the cluster file, this will raise an error

I actually want to pass instead of giving errors.

in python dict, I use dict.get() and return None if CB not present in the cluster table. How to do it in nim?
I thought it would be faster for you to update the code than I googling around :)
thanks!

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

I updated the code to handle that. Also, here is a binary, just unzip, chmod +x and run. If you have a relatively recent libc, it should work. Still depends on libhts.so.
tommy.gz

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

great! I am curious to see how fast it is compared with pysam. will report back. thanks again!

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

yeah. interested to hear. you can add threads=2 (or 3) to the open calls to get a bit more speed on de/compressing the bam which will be the most CPU time.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

good to know the tip. I have not figured out how to multi-thread for pysam. so I will compare single thread for both at the moment.

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

ok. and if compiling on your own. remember to build with -d:release

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

thanks!
I was compiling myself and it said succeeded.

echo  $LD_LIBRARY_PATH
/n/helmod/apps/centos7/Core/bzip2/1.0.6-fasrc01/lib:/n/helmod/apps/centos7/Core/xz/5.2.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib:/n/helmod/apps/centos7/Core/mpc/1.0.3-fasrc06/lib64:/n/helmod/apps/centos7/Core/mpfr/3.1.5-fasrc01/lib64:/n/helmod/apps/centos7/Core/gmp/6.1.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/libpng/1.6.25-fasrc01/lib:/n/helmod/apps/centos7/Core/hdf5/1.10.1-fasrc03/lib::/n/home02/mtang/apps/elastix/lib:/n/home02/mtang/apps/htslib-1.6/libhts.so

./split_scATAC_bam
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

#same with the file you uploaded 
./tommy
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

but I have libhts in LD_LIBRARY_PATH

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

LD_LIBRARY_PATH is like PATH it takes directories, not files.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

sorry my brain was not working...after add the folder to LD_LIBRARY_PATH

time ./tommy clusters.csv possorted_bam.bam
src/hts/simpleoption.nim(30) get
Error: unhandled exception: Can't obtain a value from a `none` [KeyError]

real    0m0.349s
user    0m0.008s
sys     0m0.011s

it has to do with the dictionary I think, sometimes CB tag maybe not present?

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

updated the snippet to handle this. note the line with isNone.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

Thanks! was googling around for how to test a string is none or not in nim.
tried cb.isNil... thanks for saving me. need to learn basics of nim. it is running now will report back.

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

that's returning an option value from the tag method. see docs here: https://brentp.github.io/hts-nim/hts/bam.html#tag%2CRecord%2Cstring
then you can follow the link to the option:
https://brentp.github.io/hts-nim/hts/simpleoption.html#Option
I was previously calling get, but as you can see from the description, it throws and error if the value (tag) is absent.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

Thanks for the links!

just checked, it is writing to sam not bam.
changed if not open(obam, "out-cluster-" & cluster & ".bam", mode="w") to if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb")

one more little thing, if I want to skip the header, how should I do it?

for x in paramStr(1).lines:
  var toks = x.strip().split(",")
  clusterTbl[toks[0]] = toks[1]

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

I updated the snippet again (also to write bam). Basically:

  if toks[0] == "Barcode": continue

or you can track the line number yourself and if line_no == 1: continue

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

thanks! that works, but one has to know the column names beforehand. so tracking the line number is better.
in python3, I can do header = next(csv) to skip it.
the pysam version seems to work, see https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?both#use-pysam

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

reporting back. It is 1.5x faster using hts-nim than pysam. impressive.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

the bam file after splitting is a bit different:

# by pysam
samtools flagstat cluster10.bam
23691881 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 secondary
0 + 0 supplementary
13520847 + 0 duplicates
23499718 + 0 mapped (99.19% : N/A)
23691701 + 0 paired in sequencing
11846803 + 0 read1
11844898 + 0 read2
23310179 + 0 properly paired (98.39% : N/A)
23383493 + 0 with itself and mate mapped
116045 + 0 singletons (0.49% : N/A)
40060 + 0 with mate mapped to a different chr
20207 + 0 with mate mapped to a different chr (mapQ>=5)

# by hts-nim 
samtools flagstat atac_v1_pbmc_5k/outs/out-cluster-10.bam
23174666 + 0 in total (QC-passed reads + QC-failed reads)
176 + 0 secondary
0 + 0 supplementary
13207244 + 0 duplicates
23034081 + 0 mapped (99.39% : N/A)
23174490 + 0 paired in sequencing
11587245 + 0 read1
11587245 + 0 read2
22873454 + 0 properly paired (98.70% : N/A)
22942804 + 0 with itself and mate mapped
91101 + 0 singletons (0.39% : N/A)
38116 + 0 with mate mapped to a different chr
19077 + 0 with mate mapped to a different chr (mapQ>=5)

pysam version

import pysam
import csv

cluster_dict = {}
with open('clusters.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #skip header
    header = next(csv_reader)
    for row in csv_reader:
        cluster_dict[row[0]] = row[1]

clusters = set(x for x in cluster_dict.values())
fin = pysam.AlignmentFile("atac_v1_pbmc_5k_possorted_bam.bam", "rb")

# open the number of bam files as the same number of clusters, and map the out file handler to the cluster id, write to a bam with wb
fouts_dict = {}
for cluster in clusters:
    fout = pysam.AlignmentFile("cluster" + cluster + ".bam", "wb", template = fin)
    fouts_dict[cluster] = fout

for read in fin:
    tags = read.tags
    # the 8th item is the CB tag
    CB_list = [ x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]
    # the bam files may contain reads not in the final clustered barcodes
    # will be None if the barcode is not in the clusters.csv file
    cluster_id = cluster_dict.get(cell_barcode)
    if cluster_id:
        fouts_dict[cluster_id].write(read)

## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
    fout.close()

There are a little more reads pulled out by pysam.

download the bam file and the clusters.csv files from http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_analysis.tar.gz

http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_possorted_bam.bam

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

hold on...I just noticed I was using different bam files to begin with, one is the one downloaded and the other one is from fastq that I processed with cellranger. Let me test on the same bam...

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

I won't have time to debug this. The code looks fine to me. If you have some difference, I would start by finding a single read that's in pysam output but not hts-nim (or vice-versa) and figuring out why.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

sure. totally understand.

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

were you able to resolve this?

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

was distracted...working on it now. will report back.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

update, not sure why pysam pull out some read records without the CB tag.

## by pysam
samtools flagstat cluster12.bam
16771134 + 0 in total (QC-passed reads + QC-failed reads)
140 + 0 secondary
0 + 0 supplementary
9381485 + 0 duplicates
16637530 + 0 mapped (99.20% : N/A)
16770994 + 0 paired in sequencing
8385399 + 0 read1
8385595 + 0 read2
16508476 + 0 properly paired (98.43% : N/A)
16556331 + 0 with itself and mate mapped
81059 + 0 singletons (0.48% : N/A)
24024 + 0 with mate mapped to a different chr
12056 + 0 with mate mapped to a different chr (mapQ>=5)

# by hts-nim
samtools flagstat out-cluster-12.bam
16407030 + 0 in total (QC-passed reads + QC-failed reads)
136 + 0 secondary
0 + 0 supplementary
9162203 + 0 duplicates
16309886 + 0 mapped (99.41% : N/A)
16406894 + 0 paired in sequencing
8203447 + 0 read1
8203447 + 0 read2
16201216 + 0 properly paired (98.75% : N/A)
16246368 + 0 with itself and mate mapped
63382 + 0 singletons (0.39% : N/A)
22746 + 0 with mate mapped to a different chr
11304 + 0 with mate mapped to a different chr (mapQ>=5)

There are more reads pulled out by pysam than hts-nim

samtools view cluster12.bam | cut -f1 | sort -u > cluster12_readIDs.txt

samtools view out-cluster-12.bam | cut -f1 | sort -u > out-cluster-12_readIDs.txt

comm -23 cluster12_readIDs.txt out-cluster-12_readIDs.txt > cluster12_specific_readIDs.txt

I then checked one of the reads that pysam specific.
There is no CB tag, how come it was pulled out by my pysam code, to begin with...?

samtools view cluster12.bam | grep -w A00519:57:HG3JCDMXX:1:1101:10375:20400
A00519:57:HG3JCDMXX:1:1101:10375:20400  1187    chr8    21769249        60      49M     =       21769275        76      ACCCGGAAGCGAGAAGGAGAGGCTGATAACGGGGCTGGCCCAGGCTGGG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F       NM:i:0  MD:Z:49 MC:Z:50M        AS:i:49 XS:i:20 CR:Z:GGCAATTTGCTGGAAT   CY:Z:,F,,,,,F,,F,F:,,      BC:Z:CTATACGC   QT:Z:FFFFFFFF   GP:i:1414564938 MP:i:1414565014 MQ:i:60 RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:1

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

that is odd. The pysam code looks sane to me.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

I can not figure out... this bothers me! It has to do with the paired-end reads?

samtools view atac_v1_pbmc_5k_possorted_bam.bam | grep -w "A00519:57:HG3JCDMXX:2:1318:26684:24220"
A00519:57:HG3JCDMXX:2:1318:26684:24220  163     chr1    30687631        60      49M     =       30687748        167     TAGATGGGATACTTCATATCATCCCCCTGCCCTTCCATTGCATTAAGAT      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F       NM:i:0  MD:Z:49 MC:Z:50M        AS:i:49
        XS:i:19 CR:Z:ATGGCTAGGTGTCGAG   CY:Z::FFF::,FFFF,:F,,   BC:Z:AGGGCGTT   QT:Z::FFFFF:F   GP:i:30687630   MP:i:30687797   MQ:i:60
        RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
A00519:57:HG3JCDMXX:2:1318:26684:24220  83      chr1    30687748        60      50M     =       30687631        -167    TCTTCCAATCCATCATCCTACTTTCAGGCTGGGAAGAGCGTGTTCTGATT     FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF      NM:i:0  MD:Z:50 MC:Z:49M        AS:i:50
        XS:i:20 CR:Z:ATGGCTAGGTGTCGAG   CY:Z::FFF::,FFFF,:F,,   BC:Z:AGGGCGTT   QT:Z::FFFFF:F   GP:i:30687797   MP:i:30687630   MQ:i:60
        RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2

There are two reads with the same id from the original bam. only one of them appeared in the split bam by pysam.

from hts-nim-tools.

brentp avatar brentp commented on June 2, 2024

it's a bug in your pysam code. you have:

    CB_list = [ x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]

so if CB_list is empty, then cell_barcode is left as the value of the last seen CB tag. you need an else that does continue

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

thanks a billion! This saves me. will update the code.

from hts-nim-tools.

crazyhottommy avatar crazyhottommy commented on June 2, 2024

just confirm that now pysam and hts-nim produce identical results after updating the code. thanks!

from hts-nim-tools.

vertesy avatar vertesy commented on June 2, 2024

subset-bam from 10X seems to be fit for the job. Did not try yet.

from hts-nim-tools.

Related Issues (11)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.