sinto's Introduction

Sinto: single-cell analysis tools

Sinto is a toolkit for processing aligned single-cell data. Sinto includes functions to:

  • Subset reads from a BAM file by cell barcode
  • Create a scATAC-seq fragments file from a BAM file
  • Add read tags to a BAM file according to cell barcode information
  • Add read groups based on read tags
  • Copy or move read tags to another read tag
  • Copy cell barcodes to/from read names or read tags
  • Add cell barcodes to FASTQ read names

Read the documentation at https://timoast.github.io/sinto

sinto's People

Contributors

bgruening, dawe, timoast

sinto's Issues

Sinto fragments: negative positions in output

Hi @timoast ,

I have spotted a few cases where negative positions are listed in the fragments file, especially on chrM, I think due to the higher chance of finding reads overlapping the start/end of this chromosome. I tried to make a minimal example here:

Using the reads in this BAM file:

VH00445:3:AAAJTTYM5:1:1305:65210:2912   1123    chrM    1       33      32S21M  =       1       45      ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA   -CC;CCCCC-CCCCCCC-CC-C;;CC;;;C;CC;CCCCCCCCCC;CCC-CCCC NM:i:0  MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0;        MQ:i:60 MC:Z:8S45M      ms:i:1642       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198  99      chrM    1       33      32S21M  =       1       45      ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA   CCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCC;CCCC NM:i:0  MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0;        MQ:i:60 MC:Z:8S45M      ms:i:1794       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1305:65210:2912   1171    chrM    1       60      8S45M   =       1       -45     ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA   ;CCCCCCCCC-CCCCCCCCC-CCCC;CCC;C-CCCCCCCCCCCC-CCCCCCCC NM:i:0  MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M     ms:i:1560       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198  147     chrM    1       60      8S45M   =       1       -45     ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA   CCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0  MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M     ms:i:1786       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436  1187    chrM    11796   60      52M     =       11833   90      ATCCTAATTTCAATATCAAACCTAATTAAACACATCAACTTCCCACTGTACA    CCCCCCCCCCCCCCCCCC;-CCCCCCCC;CCC-CCCC-CCCCCCCCCCCCCC  NM:i:0  MD:Z:52 AS:i:52 XS:i:19 MQ:i:60 MC:Z:53M        ms:i:1684       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436  1107    chrM    11833   60      53M     =       11796   -90     ACTTCCCACTGTACACCACCACATCAATCAAATTCTCCTTCATTATTAGCCTC   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCC;CCC-CCCCCC;CCC-CCC NM:i:0  MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:52M        ms:i:1650       CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT     CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT

I run sinto fragments and get:

chrM    -28     40      ACTAGGCTTCGTATTGAGCCGAACAGTAGT  2

This read aligns to chrM position 1 and is soft-clipped by 32 bases (cigar: 32S21M). Looking at the code, I see there is correction for soft-clipping:

sinto/sinto/fragments.py

Lines 330 to 338 in b57d735

# correct for soft clipping and 9 bp Tn5 shift
if is_reverse:
    suffix_clip = sum([x[1] for x in itertools.takewhile(lambda x: x[0] == 4, reversed(cigar))])
    rend = rend + suffix_clip
    rend = rend - 5
else:
    prefix_clip = sum([x[1] for x in itertools.takewhile(lambda x: x[0] == 4, cigar)])
    rstart = rstart - prefix_clip
    rstart = rstart + 4

and from this it makes sense how we get to -28 (0 - 32 + 4), but I don't quite understand why this correction is applied. I thought that the position reported in the bam is the start of the mapped portion of the read, which already takes any soft-clipped portions into account. Looking at this read in IGV, and using bamtools bamtobed seems to confirm this (though without the Tn5 offset).

Looking at another soft-clipped read, this time in the middle of chr1:

bam:

VH00445:3:AAAJTTYM5:1:1609:77708:31726  99      chr1    3012667 60      8S44M   =       3012708 94      CCGTATTTCTGATCAGTTCTGAGACAAGTTTTCACTTTATCTATGAAGCCCA    CC-;CCC-;CC-CCCCCCCCC-CCCC;CC-CC;;C-CCCCCCCCCCC;-CCC  NM:i:0  MD:Z:44 AS:i:44 XS:i:19 MQ:i:60 MC:Z:53M        ms:i:1608       CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT     CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT
VH00445:3:AAAJTTYM5:1:1609:77708:31726  147     chr1    3012708 60      53M     =       3012667 -94     CCACTAGGGTGCAGTCCTGTGCTGAACAAGTAACAATGGCCTGAGTGTGACAA   CC-CCCCCCCCCCCCCC;CCCC-CC;CCCCCCCC--CCCC;CCCC-CCCCCCC NM:i:0  MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:8S44M      ms:i:1482       CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT     CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT

fragments:

chr1    3012662 3012755 ATTGAACCACATTCGGTCAGGTCACTCAAT  1

where it seems like the start position should be 3012666 + 4 = 3012670. Am I missing something about the soft-clipping correction?
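To make the arithmetic in this report easier to follow, here is a minimal sketch that re-applies the quoted correction to the reads shown above. It is not sinto's code: `parse_cigar` and `shifted_end` are hypothetical helpers, and the sketch assumes that `rstart` is the 0-based alignment start and `rend` is the 0-based end of the aligned portion on the reference.

```python
import itertools
import re

CIGAR_OPS = "MIDNSHP=X"           # position in this string = pysam operation code
SOFT_CLIP = CIGAR_OPS.index("S")  # 4, the value tested in the quoted snippet
REF_CONSUMING = {0, 2, 3, 7, 8}   # M, D, N, =, X advance along the reference


def parse_cigar(cigar_string):
    """Turn '32S21M' into [(4, 32), (0, 21)], the tuple form pysam reports."""
    return [(CIGAR_OPS.index(op), int(length))
            for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar_string)]


def shifted_end(pos_1based, cigar_string, is_reverse):
    """Apply the soft-clip and Tn5 correction to one read (hypothetical helper)."""
    cigar = parse_cigar(cigar_string)
    rstart = pos_1based - 1  # SAM POS is 1-based; assumed 0-based internally
    rend = rstart + sum(length for op, length in cigar if op in REF_CONSUMING)
    if is_reverse:
        suffix_clip = sum(x[1] for x in itertools.takewhile(
            lambda x: x[0] == SOFT_CLIP, reversed(cigar)))
        return rend + suffix_clip - 5
    prefix_clip = sum(x[1] for x in itertools.takewhile(
        lambda x: x[0] == SOFT_CLIP, cigar))
    return rstart - prefix_clip + 4


# chrM pair: forward read 32S21M at POS 1, reverse mate 8S45M at POS 1
print(shifted_end(1, "32S21M", False), shifted_end(1, "8S45M", True))
# chr1 pair: forward read 8S44M at POS 3012667, reverse mate 53M at POS 3012708
print(shifted_end(3012667, "8S44M", False), shifted_end(3012708, "53M", True))
```

Under these assumptions the sketch gives -28 and 40 for the chrM pair and 3012662 and 3012755 for the chr1 pair, which matches the reported fragments. The negative start therefore comes from subtracting the soft-clipped length from a read that already starts at the first base of the contig.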

Zero-length fragments generated from Cell Ranger BAM

Hi, thanks for making this tool!

I've come across this issue and I'm not sure if this is the expected behavior or not. I'm using Sinto 0.7.1 to create a fragments file from a Cell Ranger bam file. In the output, I get many fragments with the same start/end position (around 6000 in total). For example:

chr5    49658161        49658162        CGCACAGCACCTATTT-1      1
chr5    49658161        49658164        GATTGACCACGTTGTA-1      2
chr5    49658161        49658168        TGTGTCCGTATTGTCG-1      1
chr5    49658162        49658162        CTCTACGCAAAGGTCG-1      1 # <--- this fragment
chr5    49658162        49658168        CCGTACTCACACACAT-1      2
chr5    49658162        49658173        GTGGATTCAGCAACAG-1      1
chr5    49658166        49658432        CTGAATGAGGACTAGC-1      2
chr5    49658168        49658168        CACCTTGAGCCTGTAT-1      3

When comparing to the Cell Ranger fragments file from the same bam, I don't see any of these. From Cell Ranger, the minimum fragment size seems to be 10, so maybe it has been filtered. Should I filter the Sinto fragments as well?
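If filtering after the fact is the chosen route, a minimal sketch could look like the following. It assumes a tab-separated fragments file with the start and end in the second and third columns, and the threshold of 10 simply mirrors the minimum size observed in the Cell Ranger output. `filter_short_fragments` is a hypothetical helper, not part of sinto.

```python
def filter_short_fragments(lines, min_length=10):
    """Yield fragment lines whose (end - start) is at least min_length."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        if end - start >= min_length:
            yield line


example = [
    "chr5\t49658161\t49658162\tCGCACAGCACCTATTT-1\t1\n",
    "chr5\t49658162\t49658162\tCTCTACGCAAAGGTCG-1\t1\n",
    "chr5\t49658162\t49658173\tGTGGATTCAGCAACAG-1\t1\n",
    "chr5\t49658166\t49658432\tCTGAATGAGGACTAGC-1\t2\n",
]
kept = list(filter_short_fragments(example))
```

With the lines above, only the fragments spanning 11 bp and 266 bp survive; the zero-length and 1 bp entries are dropped.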

python

Hi all,

I'm trying to run sinto on our local cluster. We have sinto v0.7.3.1 available with python 3.8.6. GCC v10.2.0 and OpenMPI v4.0.5 are loaded in the background. I use the following code to generate my fragments file:

sinto fragments -p 8 \
      -b /dir/file.bam \
      -f /dir/file.bed \
      --barcode_regex "[^:]*" \
      --use_chrom "*"

This generates the following output (with errors):

Function run_fragments called with the following arguments:
 
bam   /dir/file.bam
fragments   /dir/file.bed
min_mapq    30
nproc 8
barcodetag  CB
cells None
barcode_regex     [^:]*
use_chrom   *
max_distance      5000
min_distance      10
chunksize   500000
func  <function run_fragments at 0x2b3f2e8843a0>
Traceback (most recent call last):
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/bin/sinto", line 8, in <module>
    sys.exit(main())
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/arguments.py", line 346, in main
    options.func(options)
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/utils.py", line 21, in wrapper
    func(args)
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/cli.py", line 45, in run_fragments
    fragments.fragments(
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/fragments.py", line 470, in fragments
    chrom = utils.get_chromosomes(bam, keep_contigs=chromosomes)
  File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/utils.py", line 134, in get_chromosomes
    pattern = re.compile(keep_contigs)
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 668, in _parse
    raise source.error("nothing to repeat",
re.error: nothing to repeat at position 0
srun: error: node245: task 0: Exited with exit code 1

We suspect this may be caused by the python version, but not sure. Could there be another reason why these errors are produced?
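The traceback ends in `re.compile(keep_contigs)`, so the value given to `--use_chrom` is compiled as a regular expression rather than treated as a shell-style wildcard. A short sketch of why `*` on its own fails while `.*` compiles, using only Python's standard `re` module (`is_valid_regex` is a hypothetical helper for illustration):

```python
import re


def is_valid_regex(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


for candidate in ["*", ".*", "[^:]*"]:
    print(repr(candidate), is_valid_regex(candidate))

# ".*" matches any contig name, which is what a bare "*" was probably meant to do
matches_any = re.compile(".*").match("chr1") is not None
```

In a regular expression `*` is a quantifier that needs something in front of it to repeat, which is what "nothing to repeat at position 0" is reporting. If that is the cause here, the Python version is unlikely to matter.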

details of filtering

Dear Tim,

I am trying to understand how sinto filters and selects unique fragments per cell. My input BAM file has the following reads assigned to my cell of interest:

A00261:518:HK73GDSX3:1:1515:27118:35211 147     chr1    9997    51      110S40M =       10010   -27     CCACAGCCGCGGCAAAGCCACATCACTTTCACCTCCACCAACACACAAAATCAAACAATCACTAACGCTAACTGTCTGACTCACTCTGCCTCACTATACCTAAACCTATACCGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :,,,,,,,,,,,,,,F,,,F,:,,,,:,,,,,,,,F,:,,,:,,,,,,,,:,F,,,,:,,,,F,,,,,,,,,,,,F,,,,F,,,,,,,,,F:,:,,F,,,,:,:,,,F,:,:,:F:,F:,:F,FFFFFFFFFFFFFFFFFFFFFFFFFFF    NM:i:0  MD:Z:40 AS:i:40 XS:i:37 XA:Z:chr6,-147869,113S37M,0;chr7,-10002,114S36M,0;chr1,-180752,114S36M,0;chr15,+101981123,36M114S,0;      CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:CTGAATATCCTGGTCT-1 BC:Z:TTATTGGT   QT:Z:FFFFFFFF     RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2125:4083:16673  147     chr1    10002   0       43S107M =       10010   -99     CCTCTTTCTCCTGCAGCGTCATATGTTTAGTATAGCCCTCCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    ,,,,:,,,,,,,,,,,F::,,,,,,,,:,,,,,,,,,,,:,,,,:::,:F,:F,,:,:F:FF,F::FF,F:FFF::,,FFF:F,FFF:F,FFF:FFFFFF:FFFF:F,FFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    NM:i:0  MD:Z:107        AS:i:107        XS:i:108        CR:Z:AGATTCAAGGTTGTAA     CY:Z:FFFFFFFF::FFFFFF   CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA   QT:Z:FFFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1216:26946:37012 147     chr1    10003   0       98S52M  =       10045   -10     CACCCCAACTCTAATGCCTCGGCGTCCACCTAGTCCTACTCATATTCATTGTGGTTACGGGTTTGTCTTCGGTATCGTAAGATGTGTATATTACACTTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    ,,,,,,,,,,,,,F,,,,,,,,,,,,,,,,,F,:,,::,,,,FF:,,,F:,:,,F,F,:,F:,,F:,:::,F:,:,,,,,,,FFF,F,:,F,,,,,F,,F,FFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    NM:i:0  MD:Z:52 AS:i:52 XS:i:53 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC   QT:Z:FFFF:FFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1515:27118:35211 99      chr1    10010   60      62M2I35M3D28M23S        =       9997    27      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCTAACCCTAACCCTAACCCTAACCCGGGGCGTTACGCTCCCTCTAACC    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FF:FF,FFF,,,FF,,F:F,,:F::,F:,::FFF,,:F,::::,:FF,:FFFF:F,FFF:,FFFF:F:,FFF,FFF:,::FF,,:FF,,,,,,,,,,,,,,,,,,,,,,,    NM:i:6  MD:Z:45T51^CCA28        AS:i:103        XS:i:91 XA:Z:chr7,+10001,14S48M2I35M1D28M23S,4;chr7,+10035,45M3D19M4D35M1D28M23S,10;chr1,+180749,53M2D9M2I35M1D20M31S,6;  CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:CTGAATATCCTGGTCT-1   BC:Z:TTATTGGT   QT:Z:FFFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2125:4083:16673  99      chr1    10010   0       101M49S =       10002   99      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAAACCAAAACCCACTCACTTATAAACATCTACGAACCAACCAGACAAAGG    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF:FFFFF:FFFFF,FF:FF:FF::F,FF,F:,FF,FF,FF,:,,:,,,,,:,F,F,FFF,,,,,,F,:F,F,,F,FFFF,,,    NM:i:0  MD:Z:101        AS:i:101        XS:i:100        CR:Z:AGATTCAAGGTTGTAA     CY:Z:FFFFFFFF::FFFFFF   CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA   QT:Z:FFFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1664:20157:30859 147     chr1    10027   0       74S76M  =       10033   -70     ACCACCGAGATCTACACATATTCATGGTTGTAACGCGTCTGTTGTAGGCAGCGTCATATGTGTATATTATACTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    ,,F,F,FF,FF:FF,F,:FF,FF,,,,F:FFF:F:,,::F::,,F,F,,F,:FF,,,FF,F,FFF,::F,,,,,F:::FFFF,,FFF,::F::,,FF:FFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    NM:i:0  MD:Z:76 AS:i:76 XS:i:77 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC   QT:Z:FFFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1431:25192:2347  99      chr1    10028   0       83M67S  =       10034   81      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCAAAACCAAACACTAACCCACAACCAGACGCTCCAACTAACCCTAAGCCTAAGCCTGCAAGTAAGCCTCG    FFFFFFFFFFFF:FFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFF:FFF:FFFFFFF:FFFFF:,,,F,F:,,:,,F:F,FFF::F::,,,F,,,F,,FF,F,,:,:,,:,:,,,:,:,,,,F,,,,,:,:,,F,:,,    NM:i:1  MD:Z:75T7       AS:i:78 XS:i:77 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:GGTCCAAG   QT:Z:FFFFF:FF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2414:19244:7044  1123    chr1    10028   0       83M67S  =       10028   81      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAGACGAAAAAAAACAACTAACACAACCCCACACAAAACACAATACCCTATCCCGAGCGCTGCGACTAA    FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFF:FF,FF,FF:FFF::FFF:FFFFF:FF:FF:F::FFFFFF:,,F,,::,,:,F,,:F:,,,,,,,,,F::,:F:,:,,,F,F,,,:,F:,,,:,,,,,::,,,,:,,,    NM:i:0  MD:Z:83 AS:i:83 XS:i:81 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA   QT:Z:F:FFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1318:11731:7592  99      chr1    10028   0       83M67S  =       10400   414     CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCATAAACCAAAACATCAACATAACCCTAACACTACCCCAATCCCTACCCCTAACGCTCAGCGTAG    F,FFFFFFFFFFFFFFFF,,F,FF:FFF:FFFF:F:F:FF:F,,,FFF::,,:F,::::F,F:FFF,,F,:F:FF:FFF,F,F,:F,,,,,F,,:F,,,,,F,,,,FF,,,:::F,F:F,,,,,,,,,,F,,F,,,,,,,,F,,,,,,,,    NM:i:0  MD:Z:83 AS:i:83 XS:i:83 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFF,F:::FFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:TTATTGGT   QT:Z::FFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2414:19244:7044  1171    chr1    10028   0       69S81M  =       10028   -81     AGAGAGCAACACTCATACTATGTTGTAACGGATCTGTATTAGTAAGAGTCAGATGTAGCTAAGACACATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    ,,::F,,,FFFF,,F,,,,F,::,,,F,,,:,:F,,F,,F,F:FF,,,F,FF:,,,,,,,,F,:,,,F,FFFFFFF,F:FFF:FFFFFFFFFFFF:FF,FF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    NM:i:0  MD:Z:81 AS:i:81 XS:i:81 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA   QT:Z:F:FFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1664:20157:30859 99      chr1    10033   0       78M72S  =       10027   70      ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAGACAATTAAAAACAACTCACAGCCCACGATACCCGAACTCATCGCGTATGGCGTGGGCTGCGGGTAACCGGG    FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFF:FF,FFF:F:FF,:FFFF:FF:FFFFF:FF,FF,,,,F,,:,F,F,F,:,F:,,,F,,,:F:,:,F,FF,FFF,F,,F,,,F,:,,F,,,,,,,,,,,,,,,,,,,,    NM:i:0  MD:Z:78 AS:i:78 XS:i:76 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC   QT:Z:FFFFFFFF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1431:25192:2347  147     chr1    10034   0       75S75M  =       10028   -81     TACCACTTAGATATACACTTATACTACGTTTTAGCGTTTCTGTATTCGTAAGCGTAAGATTATAAATAAACATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC    :F,F,,,:F:,:,F:,,,F,::,,FF,:,,F,:,:,,::,F,F:,,:,,,F,,,F:F,:,,,:,F,F,,,,,,F,FF:FFF::FFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF    NM:i:0  MD:Z:75 AS:i:75 XS:i:75 CR:Z:AGATTCAAGGTTGTAA   CY:Z:FFFFFFFFFFFFFFFF     CB:Z:CTGAATATCCTGGTCT-1 BC:Z:GGTCCAAG   QT:Z:FFFFF:FF   RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1`

However, the sinto output has just one line corresponding to this cell:
chr1 10013 10031 CTGAATATCCTGGTCT-1 1

I understand that many of these reads will get removed due to mapping quality. Still, I don't really understand what leads to the positions 10013 and 10031. Is this due to +4/-5 shifting? Even so, I don't see how these numbers are arrived at. Could you please help me understand this?
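The two coordinates can be reproduced with a small sketch, offered as a worked example rather than a description of sinto's internals. It assumes a per-read MAPQ cutoff of 30 (the default shown in the argument dump of the "python" issue), 0-based starts, and the +4/-5 soft-clip-aware shift quoted in the negative-positions issue. Only four of the records are copied in, and duplicate handling is not modelled.

```python
import re

MIN_MAPQ = 30  # assumed default cutoff

# (read name suffix, FLAG, 1-based POS, MAPQ, CIGAR) copied from the records above
READS = [
    ("1515:27118:35211", 147, 9997, 51, "110S40M"),
    ("1515:27118:35211", 99, 10010, 60, "62M2I35M3D28M23S"),
    ("2125:4083:16673", 147, 10002, 0, "43S107M"),
    ("2125:4083:16673", 99, 10010, 0, "101M49S"),
]


def reference_length(cigar):
    """Bases of reference covered by the alignment (M, D, N, =, X)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")


def leading_soft_clip(cigar):
    match = re.match(r"(\d+)S", cigar)
    return int(match.group(1)) if match else 0


def trailing_soft_clip(cigar):
    match = re.search(r"(\d+)S$", cigar)
    return int(match.group(1)) if match else 0


def shifted_end(pos, cigar, is_reverse):
    start = pos - 1  # convert 1-based POS to a 0-based start
    if is_reverse:
        return start + reference_length(cigar) + trailing_soft_clip(cigar) - 5
    return start - leading_soft_clip(cigar) + 4


passing = [read for read in READS if read[3] >= MIN_MAPQ]
coords = sorted(shifted_end(pos, cigar, bool(flag & 0x10))
                for _, flag, pos, _, cigar in passing)
```

Under these assumptions only the pair ending in 1515:27118:35211 passes the MAPQ cutoff. Its forward mate gives 10009 + 4 = 10013 and its reverse mate gives 9996 + 40 - 5 = 10031, which matches the single output line.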

Thanks

sinto installation - problem with python

Hi,
I installed sinto successfully with python2 etc. But I seem to have an error with python:

File "/home/mfaxel/lib/python2.7/site-packages/sinto-0.7.2.2-py2.7.egg/sinto/tagtorg.py", line 9
return "\t".join(f"{k}:{v}" for k, v in line.items())
^
SyntaxError: invalid syntax

Is there a workaround or anyone already had this problem?
Thanks in advance!
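The line flagged in the error is an f-string, a syntax that exists only in Python 3.6 and later, so a Python 2.7 interpreter rejects the file before running it. The sketch below shows the quoted expression next to an equivalent `str.format` form purely to illustrate the syntax difference; the function names are hypothetical, and running sinto under Python 3 is probably the more practical route than editing the package.

```python
def join_tags_fstring(line):
    # Same expression as in the traceback; requires Python 3.6+
    return "\t".join(f"{k}:{v}" for k, v in line.items())


def join_tags_format(line):
    # Equivalent spelling that older interpreters can also parse
    return "\t".join("{}:{}".format(k, v) for k, v in line.items())


header = {"ID": "cellA", "SM": "sample1"}
same = join_tags_fstring(header) == join_tags_format(header)
```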

empty fragment file

Hi Tim,
I created a new column for the barcode information (in the SAM file), then converted the SAM file to BAM and indexed it. I still get no fragment file.
The head:

7001113:989:HTKVHBCX2:2:1105:15600:5528:77:15:82:15:CGGTATTTGG	0	chr1	3000049	1	23M	*	0	0	TCTTTGAAGGTCTGGTAGAACTC	DDDDDIIIIIIIIIIIIIIIIII	AS:i:0	CB:Z:77158215
7001113:991:HVWNKBCX2:1:2110:2683:18182:87:93:72:36:CAGGTATGGC	0	chr1	3000132	39	53M	*	0	0	GACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCADDDDCHIIHGFHIIIIIIIIIIHHHEHIIIIIIIDDGHHHDHHIIIIIIIIHI	AS:i:0	CB:Z:87937236
7001113:990:HTKL3BCX2:2:2111:14944:4116:87:93:72:36:CAGGTATGGC	0	chr1	3000134	38	52M	*	0	0	CTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCATGDDDDDIIIIIIIIIIIIIIIIIIIGIIIIIIIDFHHHHHIIIIIIIIIHIII	AS:i:0	CB:Z:87937236
7001113:991:HVWNKBCX2:2:2105:19568:57493:53:23:77:35:CACTATTTTG	0	chr1	3000159	0	52M	*	0	0	GGGGGGGCATGGGACTTTTAGTCCATGAATCTGATCCTGATTTAGCTTTGGTDDDDDIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIHEHIIHGIIIIIH	AS:i:-22	CB:Z:53237735
7001113:993:HVWMKBCX2:2:1211:1360:3780:53:23:77:35:CACTATTTTG	0	chr1	3000353	37	53M	*	0	0	GTTAATTATAGTACAGTCCCTATGCCCTCTAGTTAGTCTGGCTAAGGGTTTADDDDDIIHIIIIIIIIIIIHIIIIIIIIIIIIHIGIHIIHHIIIHHIIIHIII	AS:i:0	CB:Z:53237735
7001113:991:HVWNKBCX2:2:2211:8406:75832:21:06:91:12:TCATCTTTGT	16	chr1	3000464	1	52M	*	0	0	TCTTTTTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGCIHIHIIIIHHFF@EHIHGCHF<1IHFEEHEIHIIHIIHHHFIIGIHIDDDDD	AS:i:0	CB:Z:21069112
7001113:989:HTKVHBCX2:2:1211:2292:73091:90:80:26:12:CTGTACGGCT	0	chr1	3000559	42	53M	*	0	0	CTTCTAGATTTGCTGTCAGGCTGCTAGTGTATACTCTAGTTTCCTTTTGGAGDDCDDIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIHIIIIIIIIIIIIHIII	AS:i:0	CB:Z:90802612
7001113:991:HVWNKBCX2:1:2205:17457:98814:90:80:26:12:CTGTACGGCT	0	chr1	3000633	30	53M	*	0	0	CTCTTAGGACTGCCTCATTGTGCCCCATATGTTTGGCTATGTTGTGGATTTADDDDDIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIII	AS:i:0	CB:Z:90802612
7001113:989:HTKVHBCX2:1:2216:4512:6199:90:80:26:12:CTGTACGGCT	0	chr1	3000747	32	52M	*	0	0	ATTAAGTAGAGTATTGTTCAGTTTCCAGGTGAATGTTGGCTTTCTATTATTTDDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIH	AS:i:0	CB:Z:90802612
7001113:989:HTKVHBCX2:2:2209:10226:10613:58:87:87:11:TCGAATTTGT	0	chr1	3000919	32	51M	*	0	0	ATTTGGTACTGAGAAGAAGGTATATATCCTTTTGTCTTATGATAAAATGTT	DDDDDIIIIIIIHIIIIIHIIIHIIIIIIIIIIIIIIGIIIIIIIIIIIII	AS:i:0	CB:Z:58878711

the sinto code:
sinto fragments -b merge_CB.bam -f fragment
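One thing worth checking, offered as an assumption rather than a diagnosis: every record shown has FLAG 0 or 16 and mate fields of `* 0 0`, which in the SAM format describes unpaired reads. If fragment construction relies on pairing two mates, single-end input like this may yield nothing. The sketch decodes the relevant FLAG bits; `describe_flag` is a hypothetical helper.

```python
PAIRED = 0x1        # template has multiple segments
PROPER_PAIR = 0x2   # each segment properly aligned
REVERSE = 0x10      # read aligned to the reverse strand


def describe_flag(flag):
    """Decode the SAM FLAG bits that matter for pairing."""
    return {
        "paired": bool(flag & PAIRED),
        "proper_pair": bool(flag & PROPER_PAIR),
        "reverse": bool(flag & REVERSE),
    }


for flag in (0, 16, 99, 147):
    print(flag, describe_flag(flag))
```

Flags 0 and 16 come from the records above, while 99 and 147 are the values seen on the paired reads in the other reports on this page.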

issue running filterbarcodes

Thank you for making this tool. I was searching for something like this and couldn't get the other things I found to work. It was difficult to find yours. I'm trying to run sinto filterbarcodes to create pseudobulk data. I sorted and indexed my BAM files with samtools, tried to run sinto filterbarcodes, and got the following error:

Traceback (most recent call last):
File "/home/tasakis/anaconda2/envs/SingleCells/bin/sinto", line 216, in
options.func(options)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/utils.py", line 21, in wrapper
func(args)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/cli.py", line 14, in run_filterbarcodes
cellbarcode=options.barcodetag
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/filterbarcodes.py", line 91, in filterbarcodes
cb = utils.read_cell_barcode_file(cells)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/utils.py", line 198, in read_cell_barcode_file
groups = line[1].split(",")
IndexError: list index out of range

Could you suggest what would lead to this error and how I might fix it?
Thank you for your help!
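The failing statement, `groups = line[1].split(",")`, expects every line of the cells file to have a second field, so an `IndexError` suggests that at least one line has only one. A minimal sketch for locating such lines, assuming the file is meant to be tab-separated with the barcode first and the group second (the delimiter is an assumption, and `find_malformed_lines` is a hypothetical helper):

```python
def find_malformed_lines(lines, sep="\t"):
    """Return 1-based numbers of lines that lack a non-empty second field."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2 or not fields[1]:
            bad.append(number)
    return bad


example = [
    "AAACGAAAGAGCGAAA-1\tclusterA\n",  # well formed
    "AAACGAAAGAGTTTGA-1,clusterB\n",   # comma instead of tab
    "AAACGAAAGCGAGCTA-1\n",            # missing group
    "\n",                              # blank line
]
bad_lines = find_malformed_lines(example)
```

Blank trailing lines and files separated by commas or spaces would both show up in this check.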

Samtools is not a dependency

Hi @timoast,

I installed sinto with conda (conda install sinto) in a new env. It looks like samtools was not one of the dependencies and was not installed automatically, so I had to install it manually to make it work. Would you consider mentioning it in your installation guide or making samtools a dependency?

filterbarcodes - Samtools merge failed

Hi, I'm trying to run sinto filterbarcodes -b .bam -p 1 -c .csv and I get this error:

[E::hts_open_format] Failed to open file "8.tmp > 8.bam" : No such file or directory
samtools reheader: fail to open file '8.tmp > 8.bam': No such file or directory

File "/ihome/rlafyatis/rib35/.local/lib/python3.7/site-packages/sinto/filterbarcodes.py", line 58, in mergeAll
raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted

I'm not sure why I'm getting this error and would appreciate any help.
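The message `Failed to open file "8.tmp > 8.bam"` shows samtools receiving the redirection as part of a file name, which is what happens when a `>` meant for the shell is passed to a program as a literal argument. How sinto actually invokes samtools is not shown here, so the following is only a general sketch of capturing a command's output in a file from Python without relying on shell redirection (`run_to_file` is a hypothetical helper):

```python
import subprocess
import sys


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its stdout to out_path."""
    with open(out_path, "wb") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


# Stand-in command; a real call would pass the samtools arguments as a list
run_to_file([sys.executable, "-c", "print('ok')"], "example_output.txt")
```

If the installed sinto version builds the command this way internally, upgrading sinto or samtools may be enough; that is a guess and not something the error output confirms.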

Thank you!

Fragments file has size of zero

Hi - I'm new to sinto and I'm unable to produce a fragments file. My cell barcodes are in the header rows of my bam file, between the first and second underscore, and I have numeric values for my chromosomes, with no "chr" at the beginning. So, I need to know what the regex pattern would look like for both the --barcode_regex and --use_chrom options, and I need to know how to stop the --barcodetag using the default of "BC". Here are the first 10 rows of my bam file:

E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 163 10 3100324 0 87M = 3100324 87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   NM:i:0  MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 83 10 3100324 0 87M = 3100324 -87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA   NM:i:0  MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 99 10 3100767 0 62M = 3100767 62 ATGCCGGGGCCTAGCAAACACAGAAGTGGATGATCACAGTCAGCTATTGGATGGGTCACACG   AAFFFJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJ   NM:i:0  MD:Z:62 MC:Z:43S62M     AS:i:62 XS:i:62   RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 147 10 3100767 0 43S62M = 3100767 -62 AGTTTCTTCATCGTCGGCAGCGTCAGATGTGTATGAGATACAGATGCCGGGGCCTAGCAGACACAGAATTGGATGATCACAGTCAGCTATTGGATGGGTCACACG    <FFFJ<-JFFFJJJJJJF77F-JJJF<-J-JAJF-JF7AA-7F-JJJJJ7JF-FAA7-<-F<-<A-<F-J7JJJJJJJ7FJF-F7JJJJJJJF-JAAJAFF-JF- NM:i:2 MD:Z:16A8G36 MC:Z:62M AS:i:52 XS:i:57 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 163 10 3104258 60 117M = 3104509 401 AGTGTGTAGCTTATTAGTGGGGTGTTTGGCAGCATACATGAGGTTTTAGATTAAATCCCCCTGTTACAAAATAAGTAAAAGAGCATATCAGACACACCCCCCCATAGGAAAGAACAA        JJ7FJJFJJJJJJJJJJFF<FFJJJFJJJJJJJJJJJFFJJJJAJFJJJJJJJJJJFAAJF7AF-<AJJJJFAJJJJJJJJJJ-AF<FJJJJJFF<<-7777A-7FF7-AJJJJ-AA NM:i:1 MD:Z:95C21 MC:Z:150M AS:i:112 XS:i:93 RG:Z:BPA1 XA:Z:10,-7456812,20M2I92M3S,5;10,+22240801,94M2D23M,7;10,+22431027,94M3D23M,8;
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 163 10 3104273 0 15S68M33S = 3104509 386 AGTGTGTAGGTGATGAGTGGGGGGTTTGTCAGAATACATGAGGATTTAGATGAAATCACCCGGATACAAAAGAAGTAAAAGAGAATAAAAGACGGCACAGAGCATATAATAAAACA      AA<-FFJA7-7-FAA-----FJ---7-<--A<-77F<<--A-----7FA-<7-<77-<-A-----7-<7---7----7--A-A<------<----77--A--7--7-77-<-7-7< NM:i:9 MD:Z:7T5G3C10T7T5C3T1T7T11 MC:Z:150M AS:i:23 XS:i:23 RG:Z:BPA1        XA:Z:8,-16209092,29S21M66S,0;1,+129147041,65S20M31S,0;14,-14404651,24S19M73S,0;10,-113473301,33S19M64S,0;15,+60550338,70S19M27S,0;
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 83 10 3104509 60 150M = 3104258 -401 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGGCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC       FJJJFJFJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJAFFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA    NM:i:0  MD:Z:150        MC:Z:117M       AS:i:150        XS:i:102        RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 83 10 3104509 60 150M = 3104273 -386 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGTCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC        A7JJFJAF<-F7-F<JJJJJF7-A777--FJJJFJJAA7FFJFFFJJJJFA7-JFJJJFAA-7AAF-FJJJJFJFFJJJAJFJJJJF<FJJJJJJFJJFAJJAJJJJJJJJJJJJJFJJJ<JJJJJJJFFJJJJJA-FJJJJJJFFFFAA    NM:i:1  MD:Z:52G97      MC:Z:15S68M33S  AS:i:145        XS:i:97 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 99 10 3104650 60 150M = 3105087 553 GGAAAGAACATGTTCATGTTGACACAAGCACTGGCAACTGGACTCAATTGGATCCTAGATTGAAGAAGAGTATAGAAATAGGGAAGGAAGACAGGACTCGATCTTCCTTCTTAGAGAAGACTACAGAGGGTGACTGCAAGACCTGGCGTG        AAFFFJJJFJJJFJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJAJJJJFJ<FJFJFFJFJJJAFAFJ7JFFAJJ7F<7FAJJFJAAJJJJJJJJJJJJFFJFJAJJJ<AAJJFJAJJJJJJJJJFAAAAJ7FFA<AFJJF7FJ77<--<    NM:i:1  MD:Z:147T2      MC:Z:116M       AS:i:147        XS:i:95 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 147 10 3105087 60 116M = 3104650 -553 GTGCGGAAGAGGAGGCACACAACATGTAAGAACCAGAGGGGATTGAGGACACCAAGGATTTCTCCTCTTAAGTCAACACGATCCACACACATATGAACTCACAGGTACTGGAGTAG        7FF7JJJFJFAF7-F-7F-AFFJJJJAAFFJFJJJA<JFAFFJJJJJ<FJJJJJJJJJJFA-A-FFFJJJFAFFFJJJJJJJJJJFFJJJJJJJFF7JJJJJJJJJJAAFJJJJJJ NM:i:0 MD:Z:116 MC:Z:150M AS:i:116 XS:i:103 RG:Z:BPA1 XA:Z:10,+7455967,116M,3;10,+3205176,112M4S,4;
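As an illustration of the two patterns being asked about, here is a sketch using Python's `re` module on the read names above. Whether sinto applies `--barcode_regex` with search or match semantics is not stated on this page, so treat this as a demonstration of the regular expressions only; the constant and function names are hypothetical.

```python
import re

# Text between the first and second underscore of the read name
BARCODE_PATTERN = re.compile(r"(?<=_)[^_]+(?=_)")
# Contig names made up of digits only, e.g. "10" but not "chr10"
NUMERIC_CHROM = re.compile(r"^\d+$")


def extract_barcode(read_name):
    """Return the barcode embedded in the read name, or None if absent."""
    match = BARCODE_PATTERN.search(read_name)
    return match.group() if match else None


name = "E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT"
barcode = extract_barcode(name)
keep_10 = NUMERIC_CHROM.match("10") is not None
keep_chr10 = NUMERIC_CHROM.match("chr10") is not None
```

Here `barcode` is the 26-base sequence between the underscores, `"10"` is accepted as a chromosome name and `"chr10"` is not. If the tool anchors the pattern at the start of the read name, the lookbehind form would need adapting.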

Memory issue with Add cell barcodes to FASTQ read names function

Hi,

I am trying to use the sinto barcode command to add the cell barcodes stored in R2.fastq to the R1 and R3 scATAC-seq reads. Here is the command I am using. The command gives the right output, but it uses ~30G RAM per sample, which means I will have to assign a lot of memory when running samples in parallel. It would be helpful if you could suggest any potential solutions. Thanks!

$ sinto barcode --barcode_fastq R2.fastq --read1 R1.fastq --read2 R3.fastq -b 16

The size of certain sample files:
13G R1.barcoded.fastq
20G R3.barcoded.fastq
15G R1.fastq
7.4G R2.fastq
15G R3.fastq

Citing Sinto

To cite Sinto, would you like us to include the link to the GitHub page and the version used, as you prefer for Signac, or is there a specific way you would like this program cited?

Thanks for your time!

filterbarcodes bam outputs very small?

Hi, I'm new to using this tool and not sure if it worked correctly.

I have a sorted BAM file of ATAC-seq data that's ~100 Gb, and 12 clusters that I'm looking to subset the data into. The filterbarcodes function ran with no errors, but I am having trouble understanding the output. The 12 BAM files were all tiny in comparison to the original BAM file (each about 2 kilobytes), whereas I thought each would be roughly 1/12 the size of the original BAM file.

Also, each bam file (ex. 0.bam) seems to be accompanied by another binary file named something like 0_7U58WO (also around 2 kb), that I can't figure out how to read or what it is.

I'm not sure how to make sense of this because there were no errors indicated while running the program. Anything that could shed light on these unexpected results would be helpful.

Clarifications

Two questions:

  • What format does sinto use for the output (the documentation says text file but is it a BAM or SAM?)
  • Do you need barcodes in plaintext or with quotes?

Error running filterbarcodes

Good morning! I am using the filterbarcodes function to subset a BAM file and generate a pseudobulk BAM file for further processing. When using the filterbarcodes function I came across two things:

  1. Is the -o output part of the parameters required? The error:
    "sinto: error: unrecognized arguments: -o ./ " occurs.

  2. If the -o parameter is left out, the program runs, generates two subfiles with the correct A_xx titles (about 2-3 GB in size), and exits with the following error, leaving the two subfiles in place:

"File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/sinto/filterbarcodes.py", line 55, in mergeAll
raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted"

What typically leads to this error?

Thanks for your help!

Renaming read tags

I need to temporarily remove read groups from my BAM file in order to run BQSR in a read-group unaware mode. I thought I might just rename the RG read tag to something else while I run BQSR and then rename the read tag back to RG. I've looked around a little, and I can't find a tool to do it, so I'm writing a script for it. Would that be something you would accept a PR for?

Potential naming issue with import of importlib-metadata package

Hi Tim, when loading your package in python I ran into an error:

>>> import sinto
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/user/conda/envs/sinto/lib/python3.7/site-packages/sinto/__init__.py", line 1, in <module>
import importlib.metadata
ModuleNotFoundError: No module named 'importlib.metadata'

Then I noticed that in my installation the package is called envs/sinto/lib/python3.7/site-packages/importlib_metadata (with an underscore instead of a dot). Changing importlib.metadata to importlib_metadata in conda/pkgs/sinto-0.8.1-pyhfa5458b_0/site-packages/sinto/__init__.py resolves the error. Maybe this naming is version specific?
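For context, `importlib.metadata` joined the standard library in Python 3.8, and on Python 3.7 the same functionality comes from the separately installed `importlib_metadata` backport, which matches the environment listed in this report. A common compatibility pattern, shown as a sketch and not as the change that was made in sinto:

```python
try:
    # Standard library module, available from Python 3.8 onwards
    from importlib import metadata
except ImportError:
    # Backport package for older interpreters such as Python 3.7
    import importlib_metadata as metadata

# Both variants expose the same lookup functions, for example metadata.version
has_version_lookup = callable(metadata.version)
```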

Thanks!
Tilo

My conda environment:

  • _libgcc_mutex=0.1=conda_forge
  • _openmp_mutex=4.5=2_gnu
  • bzip2=1.0.8=h7f98852_4
  • c-ares=1.18.1=h7f98852_0
  • ca-certificates=2022.6.15=ha878542_0
  • curl=7.83.1=h2283fc2_0
  • keyutils=1.6.1=h166bdaf_0
  • krb5=1.19.3=h08a2579_0
  • ld_impl_linux-64=2.36.1=hea4e1c9_2
  • libblas=3.9.0=15_linux64_openblas
  • libcblas=3.9.0=15_linux64_openblas
  • libcurl=7.83.1=h2283fc2_0
  • libdeflate=1.6=h516909a_0
  • libedit=3.1.20191231=he28a2e2_2
  • libev=4.33=h516909a_1
  • libffi=3.4.2=h7f98852_5
  • libgcc-ng=12.1.0=h8d9b700_16
  • libgfortran-ng=12.1.0=h69a702a_16
  • libgfortran5=12.1.0=hdcd56e2_16
  • libgomp=12.1.0=h8d9b700_16
  • liblapack=3.9.0=15_linux64_openblas
  • libnghttp2=1.47.0=he49606f_0
  • libnsl=2.0.0=h7f98852_0
  • libopenblas=0.3.20=pthreads_h78a6416_1
  • libssh2=1.10.0=ha35d2d1_2
  • libstdcxx-ng=12.1.0=ha89aaad_16
  • libzlib=1.2.12=h166bdaf_2
  • ncurses=6.3=h27087fc_1
  • numpy=1.21.6=py37h976b520_0
  • openssl=3.0.5=h166bdaf_1
  • pip=22.2.2=pyhd8ed1ab_0
  • pysam=0.16.0=py37ha9a96c6_0
  • python=3.7.12=hf930737_100_cpython
  • python_abi=3.7=2_cp37m
  • readline=8.1.2=h0f457ee_0
  • scipy=1.7.3=py37hf2a6cf1_0
  • setuptools=63.4.1=py37h89c1867_0
  • sinto=0.8.1=pyhfa5458b_0
  • sqlite=3.39.2=h4ff8645_0
  • tk=8.6.12=h27826a3_0
  • wheel=0.37.1=pyhd8ed1ab_0
  • xz=5.2.5=h516909a_1
  • zlib=1.2.12=h166bdaf_2
  • pip:
    • importlib-metadata==4.12.0
    • typing-extensions==4.3.0
    • zipp==3.8.1
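
For context, importlib.metadata entered the Python standard library in version 3.8. On Python 3.7, as in the environment listed above, the same functionality comes from the importlib-metadata backport, which is imported as importlib_metadata. A minimal sketch of a version-tolerant import is shown below. The helper name get_version is hypothetical, and this is not taken from the sinto source.

```python
# Sketch of a version-tolerant import; not taken from the sinto source.
try:
    # Python 3.8+: importlib.metadata is part of the standard library
    from importlib import metadata as importlib_metadata
except ImportError:
    # Python 3.7 and earlier: requires the `importlib-metadata` backport
    import importlib_metadata


def get_version(package_name, default="unknown"):
    """Return the installed version of a package, or `default` if it is not installed."""
    try:
        return importlib_metadata.version(package_name)
    except importlib_metadata.PackageNotFoundError:
        return default


print(get_version("sinto"))
```

With this pattern the same module works on both interpreter generations, provided the backport is installed where the standard-library module is missing.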

Empty BAM files after merging in filterbarcodes

I am splitting a bam file with sinto filterbarcodes -b $BAM -c $CELLS -p 16 . This is a snippet of my input $BAM:

@HD     VN:1.6  SO:coordinate
@SQ     SN:chr_1        LN:356613585
GGGATTGGATCTATCT:NS500645:228:HGT2VAFX2:1:11311:15433:7613      1187    chr_1   9139    60      50M     =       9211    122     ATATACTCTATTAGCTCCTTTCTTTTTTCCTGGAAAGTAGGACATATTAT      AAAAAEEAAEAAE6EA6EEEAEEEEEEEEEEEEEEEEEEE/EEEEEEA6<      NM:i:0  MD:Z:50 AS:i:50 XS:i:22 MQ:i:60 MC:Z:50M        ms:i:1716       CB:Z:25m_PFA#GGGATTGGATCTATCT
GGGATTGGATCTATCT:NS500645:228:HGT2VAFX2:2:11306:22648:13604     1187    chr_1   9139    60      50M     =       9211    122     ATATACTCTATTAGCTCCTTTCTTTTATCCTGGAAAGTAGGACATATTAT      AAAAAEEEE<EAEEEEEEEEEAEEEE/E/EEE/EEEE6//EEEAEEEEAE      NM:i:1  MD:Z:26T23      AS:i:45 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1742       CB:Z:25m_PFA#GGGATTGGATCTATCT

The temporary files are created correctly, i.e. barcodes are read and reads are split as expected; however, after merging, all outputs are empty.

Is this a samtools merge or reheader issue that I can't figure out, or is it something sinto-related?

Thank you in advance,
Anamaria

filterbarcodes drops unmapped reads

Hi Tim,

Thank you so much for working on this tool! We have been using the filterbarcodes function to demultiplex BAM files (to rerun the read alignment step, starting from the output BAM from CellRanger). For our particular application we are interested in unmapped reads, but unfortunately filterbarcodes does not seem to carry over the unmapped reads after demultiplexing (i.e. the input BAM file has unmapped reads, but the resulting demultiplexed BAMs have 0 unmapped reads). Would it be possible to add an option to keep unmapped reads? We would greatly appreciate it!

Thanks again,
Joyce

collapsing reads issue follow-up

Hi, please look at the following comment from a closed issue. Opening a new issue here since I haven't heard back from anyone (presumably because commenting on a closed issue doesn't automatically reopen it).

Thanks.

" As a follow up, looking at the code it seems to me that you use 20 as the threshold for this. i.e. if one end is the same, we allow the other end to be up to 20 bases away for it to still be considered a duplicate. Is that correct?

However, even in that case, I'm confused because I see multiple cases where the end is the same, the start is <20 bases away, but these are still not counted separately (i.e., they are considered duplicates) by sinto. e.g. with the following 4 reads:

A00261:525:HK77VDSX3:1:1133:17969:2613 99 chrM 9947 60 150M = 10023 226 GGTTTGACTATTTCTGTATGTCTCCATCTATTGATGAGGGTCTTACTCTTTTAGTATAAATAGTACCGTTAACTTCCAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:34 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1
A00261:525:HK77VDSX3:1:1133:17969:2613 147 chrM 10023 60 150M = 9947 -226 CAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG :FFFFFFFFFFFFFFFF:FFFFFF:FFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1
A00261:525:HK77VDSX3:1:1370:20518:3302 99 chrM 10092 60 81M = 10092 81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAGTGATATCTCGTATGCCGTCTTCTGCTTGAAA TQ:Z:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
A00261:525:HK77VDSX3:1:1370:20518:3302 147 chrM 10092 60 81M = 10092 -81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTGACGCTGCCGACGACAGACGCGACCCTCCTGAGCCTGTGTGTAGATCTCG TQ:Z:::FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I would have expected the following two start,end pairs to be considered separate fragments:
9950 10167
10095 10167

but sinto actually only counts the second fragment here (i.e. 10095 10167), and ignores the first. What am I missing?

Thanks
"

Originally posted by @rtyags in #48 (comment)
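
The two expected coordinate pairs quoted above can be reproduced from the SAM records with a little arithmetic. The sketch below assumes a Tn5 shift of +4 on the left end and -5 on the right end, and assumes each CIGAR is a plain run of M so that the aligned length equals the read length (150M and 81M here). The function name fragment_coords is hypothetical. It only illustrates how the coordinates arise, not how sinto decides which fragments to collapse.

```python
def fragment_coords(left_pos_1based, right_pos_1based, right_aligned_len,
                    shift_plus=4, shift_minus=-5):
    """Compute a 0-based, half-open fragment interval from a read pair.

    left_pos_1based:   SAM POS of the forward-strand mate
    right_pos_1based:  SAM POS of the reverse-strand mate
    right_aligned_len: reference bases covered by the reverse-strand mate
    shift_plus/minus:  assumed Tn5 offsets (an assumption of this sketch)
    """
    start = (left_pos_1based - 1) + shift_plus
    end = (right_pos_1based - 1) + right_aligned_len + shift_minus
    return start, end


# Pair 1: forward mate at 9947, reverse mate at 10023 with 150M
print(fragment_coords(9947, 10023, 150))
# Pair 2: both mates at 10092 with 81M
print(fragment_coords(10092, 10092, 81))
```

Under these assumptions the two pairs share the same end coordinate (10167) and their starts differ by 145 bases (9950 versus 10095), which matches the figures in the quoted comment.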

Regarding utils::chunk_bam()

Hi developers,

I understand that the chunk_bam() function splits the genome into multiple intervals for multiprocessing.

Basically, each parallel task calls pysam.fetch() to retrieve all the reads that map to the supplied interval. My concern is this: if a read overlaps more than one interval (and is therefore fetched more than once by the parallel jobs), will it be double counted?

Please let me know if this is a valid concern or not based on your experience. Really appreciate it!
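
To make the concern concrete, here is a small self-contained sketch that mimics an overlap-based region query on toy data. It does not use pysam and does not reflect what sinto actually does internally. It shows one general way to avoid double counting: each interval only keeps reads whose start position falls inside it. All names and coordinates are hypothetical.

```python
def fetch(reads, start, end):
    """Mimic a region query: return reads overlapping the half-open interval [start, end)."""
    return [r for r in reads if r[1] < end and r[2] > start]


# Toy reads as (name, start, end); r2 spans the boundary at position 100
reads = [("r1", 10, 60), ("r2", 90, 140), ("r3", 150, 200)]
intervals = [(0, 100), (100, 200)]

# Naive count: every overlapping read is counted in every interval it touches
naive = sum(len(fetch(reads, s, e)) for s, e in intervals)

# Ownership rule: a read is counted only by the interval containing its start
owned = sum(
    1
    for s, e in intervals
    for r in fetch(reads, s, e)
    if s <= r[1] < e
)

print(naive, owned)
```

In this toy case the naive count is one higher than the number of reads because r2 is returned for both intervals, while the ownership rule counts each read exactly once.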

filter barcodes parallel issue

Hi Tim,

Thought I'd point out an issue that I noticed. I can't figure out exactly what is happening. Basically, when I run filterbarcodes with -p > 1, certain header entries are duplicated once for every process. Every @PG entry is duplicated, but with a unique string in the ID name.

@PG	ID:minimap2-1FF947E	PN:minimap2	VN:2.7-r654	CL:minimap2 -ax splice -t 10 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa sc_bams/HP_104_Normal_soup/tmp.fq
@PG	ID:minimap2-2E680064-4AC70EAD	PN:minimap2	VN:2.7-r654	CL:minimap2 -ax splice -t 10 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa sc_bams/HP_104_Normal_soup/tmp.fq

and a unique read group is produced for each process, with a unique string appended to the ID. This bam should only have one read group (the top one is correct), but now has 10 read groups.

@RG	ID:HP_104_Normal	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-401FEFD5	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-75ACD5C2	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-17EC0C41	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-58171F5E	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-2AA79EC2	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-738A0D0F	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-4AEA10E1	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-57A6356B	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1

@RG	ID:HP_104_Normal-5E8828B4	LB:1	PL:ILLUMINA	SM:HP_104_Normal	PU:1
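
The suffixed IDs above look like what samtools merge produces when it disambiguates colliding @RG and @PG IDs across input files (its -c and -p options control that behaviour). As a way to inspect an affected header, the sketch below groups header IDs by their base name after stripping trailing hex suffixes. The suffix pattern (groups of 7 or 8 uppercase hex digits) is an assumption inferred from the examples in this post, so it could misfire on IDs that legitimately end in such a string. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed suffix shape: one or more "-HEX" groups of 7-8 uppercase hex digits
SUFFIX = re.compile(r"(?:-[0-9A-F]{7,8})+$")


def group_header_ids(header_lines, record_type="@RG"):
    """Group header IDs of one record type by their ID with the suffix removed."""
    groups = defaultdict(list)
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != record_type:
            continue
        ids = [f[3:] for f in fields[1:] if f.startswith("ID:")]
        if ids:
            groups[SUFFIX.sub("", ids[0])].append(ids[0])
    return dict(groups)


header = [
    "@RG\tID:HP_104_Normal\tLB:1\tPL:ILLUMINA\tSM:HP_104_Normal\tPU:1",
    "@RG\tID:HP_104_Normal-401FEFD5\tLB:1\tPL:ILLUMINA\tSM:HP_104_Normal\tPU:1",
    "@PG\tID:minimap2-1FF947E\tPN:minimap2",
    "@PG\tID:minimap2-2E680064-4AC70EAD\tPN:minimap2",
]
print(group_header_ids(header, "@RG"))
print(group_header_ids(header, "@PG"))
```

Any base ID that maps to more than one entry is a candidate for the duplication described in this issue.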

Modified RG tag and duplicated entry after sinto

Hi,
I have a BAM file that contains this entry, for example:

$ samtools view possorted_bam.hornet.final.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C
A01040:79:H2F2YDRXY:2:2165:10782:19977  83      chr8    120623620       60      50M     =       120623305       -365    ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC      :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF      NM:i:1  MD:Z:13A36      AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF        CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C

Then I ran sinto filterbarcodes and got this in the output:

samtools view PBMC002.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"           
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFFCB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-3A2DA946
A01040:79:H2F2YDRXY:2:2165:10782:19977  163     chr8    120623305       60      50M     =       120623620       365     ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA      FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF:      NM:i:0  MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFFCB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B
A01040:79:H2F2YDRXY:2:2165:10782:19977  83      chr8    120623620       60      50M     =       120623305       -365    ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC      :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF      NM:i:1  MD:Z:13A36      AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG   CY:Z::FF:FFFFF:::FFFF        CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA   QT:Z::::F,FFF   RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B

There's one BAM line that's duplicated, but with a different RG tag. I'm wondering why this happens?
I'm worried that this will create bias when counting the BAM reads for downstream analysis.

Thanks!

Too many open files

Hi,

Thank you for developing this tool.

I am running this tool on real CellRanger output where the BAM file is 35 GB, and I am getting this error: Failed to open file "sam_files/T10/AGGTCATTCCTAAGTG-1_7G1BJ1" : Too many open files.

Do you have any fix for this?
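
The file name in the error suggests that one output file is opened per barcode, in which case the number of simultaneously open files can exceed the per-process limit set by the operating system. The sketch below, which works on Unix-like systems only, reads that limit with Python's resource module and checks whether a given number of outputs would fit. The function name and the reserve parameter (headroom for files the process already has open) are hypothetical.

```python
import resource


def fits_open_file_limit(n_outputs, reserve=16):
    """Return True if n_outputs files plus some headroom fit under the soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return n_outputs + reserve <= soft


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
print(fits_open_file_limit(5000))
```

If the check fails, the soft limit can usually be raised up to the hard limit for the current shell (for example with ulimit -n) before launching the job.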

Regards,
Nitin N.

error parsing barcode from middle of read name with nametotag

Hi @timoast,

Thanks for adding nametotag, this is very helpful for my dataset (linked reads sharing the same molecular barcode). I currently have mapped reads in a BAM file with the barcode embedded in the read name, in the same format as issue #32, where the barcode is not at the beginning of the read name.
example (barcode = CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC)
A00564:478:HG5NJDSX3:1:2556:3992:34867_CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC_GAACGACTACCACAG

You provided the regex to use:
--barcode_regex "(?<=)(.*)(?=)"
which works for me with sinto fragments, however I get an error with sinto nametotag:

Traceback (most recent call last):
File "/programs/sinto-0.8.0/bin/sinto", line 8, in <module>
sys.exit(main())
File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/arguments.py", line 457, in main
options.func(options)
File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/utils.py", line 23, in wrapper
func(args)
File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/cli.py", line 109, in run_nametotag
tagtoname.move(
File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/tagtoname.py", line 51, in move
cell_barcode = re_match.group()
AttributeError: 'NoneType' object has no attribute 'group'

I get a similar error for filterbarcodes using the same regex and input bam file.
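
The AttributeError in the traceback is what Python raises when a regex call returns None (no match) and .group() is then called on the result. The sketch below tests a pattern against the example read name and guards against the no-match case. The pattern used here, a run of A/C/G/T enclosed by underscores, is a hypothetical stand-in and not the regex quoted in this post. The sketch also shows that re.match, which only matches at the start of the string, fails where re.search succeeds. Whether that difference is what separates the two subcommands is not known from this post.

```python
import re

read_name = (
    "A00564:478:HG5NJDSX3:1:2556:3992:34867"
    "_CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC_GAACGACTACCACAG"
)

# Hypothetical pattern: a run of A/C/G/T with an underscore on each side
pattern = re.compile(r"(?<=_)[ACGT]+(?=_)")


def extract_barcode(name, regex):
    """Return the first regex match in name, or None when nothing matches."""
    match = regex.search(name)
    return match.group() if match is not None else None


print(extract_barcode(read_name, pattern))
# re.match anchors at position 0, so a lookbehind for "_" cannot succeed there
print(pattern.match(read_name))
```

Checking for None before calling .group() turns a crash into a condition the caller can report per read.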

I'm running sinto v0.8.0.

Thanks for your help!

filterbarcodes TypeError: unhashable type: 'list'

Hi Tim,

I have been successfully running sinto filterbarcodes on a batch of BAM files, but now it's throwing an error with another batch of BAM files. I had a look at the BAM files and they seem OK, and the barcode list is formatted the same way. It's also possible that these barcodes won't match anything in the BAM, but that doesn't throw any errors.
Here is the head of one BAM and the barcode list I'm trying to subset it to:


A00445:16:H7YL5DMXX:1:1439:23601:3098	16	2	46895278	255	1S38M487913N59M	*	GTCACTGCAACCTCCACCTTCCAGGTTCAAGCAATTCTCCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATTGCACCACTGCACTCCAGCCTGGGTGAC	FFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFF:F:FF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:92	nM:i:0	RE:A:I	BC:Z:CCTTTGTC	QT:Z:FFFFFFFF	CR:Z:TTCTCAAAGATGTGGC	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:TTCTCAAAGATGTGGC-1	UR:Z:TTGGGGTTGG	UY:Z::FFFFFFFFF	UB:Z:TTGGGGTTGG	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:1236:9136:10614	0	2	47109310	255	66M257308N32M	*	ACTTTGGGAGGCTGATGTGGGTAGATCACCTGAACTCAGGAGTTCAACACCAGCCTGGCCAACAAGAAACCCCATCTCTACTAAAAATACAAAAAATT	:,FFFFFFF,FFF:F,FFFFFFFFFFFFFFFF:F,FFFFFFF:FFFF,F,F:FFF,FFFFFFFF,:FFFFF,FFFFFFFFFFFFF:FFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:84	nM:i:5	RE:A:I	BC:Z:AGCACACT	QT:Z:,:FFFFF,	CR:Z:GTCAGGGAGGTGATAT	CY:Z:F::F,FFFFFFFFF,F	CB:Z:GTCACGGAGGTGATAT-1	UR:Z:ATCTGGGAAG	UY:Z:FFFFFF:F,F	UB:Z:ATCTGGGAAG	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:1121:8395:11741	16	2	47151916	255	7S41M257273N50M	*	CGCCGGCTTTGTTTTTTTTTTTTTTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCAGCCTGGTCTTGAACTCCTGACCTCAAGTGATC	:,F:FF,:,,,:FFFF,FFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFF,FFFFFFFFFFFFFFF,,FFFFFFFFFF:FFFFFFFFFF:FFF::FFF	NH:i:1	HI:i:1	AS:i:83	nM:i:2	RE:A:I	BC:Z:AGCACACT	QT:Z:FFFFFFFF	CR:Z:ACGGGTCAGCGTGTCC	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:ACGGGTCAGCGTGTCC-1	UR:Z:TGGCGTCTGA	UY:Z:FF:F,:FFFF	UB:Z:TGGCGTCTGA	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:1:1369:20292:30185	0	2	47151961	255	51M200703N47M	*	CAATGTGTTAGCCAGGATGGTCTAGATCTCCTGACCTTGTGATCCGCCCGCCCCTGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCGTGCCC	FF,FFFFFFFFFFFFFFF:F:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFFF	NH:i:1	HI:i:1	AS:i:88	nM:i:3	RE:A:I	BC:Z:AGCAAACT	QT:Z:FFFFFFFF	CR:Z:TGCACGCTCGTGGGAA	CY:Z:FF,FFF:FFFF,FFF:	CB:Z:TGGACGCTCGTGGGAA-1	UR:Z:TGTTCCATTT	UY:Z:FF,:FFFFFF	UB:Z:TGTTCCATTT	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:2451:8847:2973	16	2	47168859	255	62M191908N36M	*	CCTCTGCCTCCCAGGTTCAAGTGATTCTCCTGACTCAGCCTCTAGAGTCGCTGGGATTACAGGCACACGCCACCATGCCAGGCTAATTTTTATATTTT	::FFFFF:FFFFFF,FFFFF:FFFFFFF::FF,FFFF:FFFFFFFFFF,FFFF:FFFFFFFF:FFFFFFF,FFF,FFFF,FFFFFFFFFF:FF,:FFF	NH:i:1	HI:i:1	AS:i:84	nM:i:5	RE:A:I	BC:Z:TAGGATGA	QT:Z::F,F,FFF	CR:Z:TGGTTCCAGTCCCGGA	CY:Z:F:FFFF,:,:FFFFFF	CB:Z:TGGTTCCAGTACCGGA-1	UR:Z:ACGTAGAACC	UY:Z:F:FFFFF,::	UB:Z:ACGTAGAACC	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:2:2444:12192:33129	16	2	47179880	255	45M184241N53M	*	CACACCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCGTGCCACCACACCCAGCTAATTTTGTATTTTTAGTAGAGACGGGGTTTC	FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:90	nM:i:2	RE:A:I	BC:Z:GTACGCGG	QT:Z:FFFFFFFF	CR:Z:AGGCCGTAGCAACGGT	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:AGGCCGTAGCAACGGT-1	UR:Z:TCGCTTTACT	UY:Z:FFFFFFF,FF	UB:Z:TCGCTTTACT	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:1:2449:20229:6840	0	2	47212619	255	22M237731N76M	*	AAACAAAACAAAAAAAAAAAACACCGGGCGTGGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGCAGGCAGATCACAAGGTCAGGAGAT	F,FFFF:FFFFFFFFFF::FF:FFFFFFFF:FFF,FFFFFFF:F:FFFF,FFFFFFF,FFFFFF:FFFFFFFFFFFF,FF,,FFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:88	nM:i:3	RE:A:I	BC:Z:AGCACACT	QT:Z:FFFFFFFF	CR:Z:TGCCCTAAGCGTCAAG	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:TGCCCTAAGCGTCAAG-1	UR:Z:GTCACAAATT	UY:Z:FFFFFFFFFF	UB:Z:GTCACAAATT	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:2324:16776:36276	16	2	47251321	255	1S24M202853N73M	*	GATCCTCCCTCCTCAGCCTCCCAAAGTGGGCGGATCACGAGGTCAAGACATCAAGACCATCCTGACCAACATGGCGAAACCCCGTCTGTACTAAAAAT	FFFFFFFFFFFFFFFFFFFFFFF:FF,FF,FFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF	NH:i:1	HI:i:1	AS:i:77	nM:i:8	RE:A:I	BC:Z:CCTTTGTC	QT:Z:FFFFFF:F	CR:Z:TGCCCATTCAGAGACG	CY:Z:FFFFF,FFFFFFFFFF	CB:Z:TGCCCATTCAGAGACG-1	UR:Z:GAGTGTGCGA	UY:Z:FFFFFFFFFF	UB:Z:GAGTGTGCGA	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:2173:5032:7733	0	2	47253083	255	4S70M108784N24M	*	GGAATTCAAGACCAGCCTGGCCATCATGGTGTAACCCCATCTCTACTAAAAATACTAAAAATTAGCTAGGTGTGGTGGTTCATGCCTGTAATCCCAGC	FFF:FFFFFFFFFFFFFFFFFFF,FFFFFFF,FFF::FFFFFFFFFFFFFFFF:F:F,FFFF,FFFFFFFFFFF:FFFF,F:FFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:84	nM:i:3	RE:A:I	BC:Z:GTCCGCGG	QT:Z:FF,:FFFF	CR:Z:GCGGGTTTCTATCCCC	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:GCGGGTTTCTATCCCG-1	UR:Z:ACCTTAGGGG	UY:Z:,FFFFFFFFF	UB:Z:ACCTTAGGGG	RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:2344:4227:21245	0	2	47261720	255	42M278937N53M3S	*	GGCTCACACATGTAATCCCAGCACTTTGGGAAGCGAAGGCAGGCGGATTGCTTGAGGCCAGGAGTTTGGGACCAGCCTGGGTCACATAGCCAGACCCT	F,,FFFFFFFFFFFFFFFFFFFFFFF,F,FF:FFF:FFFFFFF,FFFFFFFF:FFFFFFFFFFFFFF:FFFFFF,FF,FFFF,FFF:FFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:72	nM:i:9	RE:A:I	BC:Z:AGCACACT	QT:Z::FF,FFFF	CR:Z:GTGAAGGGTTGTCTTT	CY:Z:FF:FFF


And the barcode tab-delimited file:
AAACCTGAGACTACAA-1	Myeloid
AAACCTGCACCTGGTG-1	Myeloid
AACCATGCATCACGAT-1	T.NK.cells
AACGTTGGTGTTGGGA-1	Myeloid
AACTGGTGTTACGGAG-1	T.NK.cells
AAGACCTTCCAGAGGA-1	T.NK.cells
AAGGAGCTCTGATACG-1	T.NK.cells
AAGGTTCAGGTTACCT-1	T.NK.cells
AAGTCTGGTATCAGTC-1	Myeloid
AAGTCTGTCTATGTGG-1	T.NK.cells

And here is the error:
Function run_filterbarcodes called with the following arguments:

 sinto filterbarcodes -b file.bam \
 -c celltypes.txt \
--barcodetag "CB"

bam	file.bam
cells	celltypes.txt
trim_suffix	False
nproc	1
barcode_regex	None
barcodetag	CB
func	<function run_filterbarcodes at 0x2b5b48a5b6a8>
Traceback (most recent call last):
  File "/broad/hptmp/bgiotti/signac/bin/sinto", line 263, in <module>
    options.func(options)
  File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/utils.py", line 21, in wrapper
    func(args)
  File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/cli.py", line 14, in run_filterbarcodes
    cellbarcode=options.barcodetag,
  File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/filterbarcodes.py", line 92, in filterbarcodes
    unique_classes = list(set(chain.from_iterable(cb.values())))
TypeError: unhashable type: 'list'
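
The failing expression in the traceback flattens the values of a dictionary and puts the items into a set, so every flattened item must be hashable. The sketch below reproduces the same TypeError with plain Python. The dictionary shapes are assumptions made for illustration and are not taken from the sinto source; the function name unique_classes is hypothetical.

```python
from itertools import chain


def unique_classes(cb):
    """Flatten the dict values one level and return the sorted unique items."""
    return sorted(set(chain.from_iterable(cb.values())))


# Each barcode maps to a flat list of group names: items are strings (hashable)
flat = {
    "AAACCTGAGACTACAA-1": ["Myeloid"],
    "AACCATGCATCACGAT-1": ["T.NK.cells"],
}
print(unique_classes(flat))

# One extra level of nesting: the flattened items are lists (unhashable)
nested = {"AAACCTGAGACTACAA-1": [["Myeloid"]]}
try:
    unique_classes(nested)
except TypeError as err:
    print(err)
```

So the error points at the values for at least one barcode being nested one level deeper than the code expects.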

Thanks a lot for your support!

scaling with bam size and parallel processing

Hi,

Could you help me with speeding up my sinto runs? How does it scale with data size? I have noticed that it runs for a very long time when we have a large bam file as input. Part of this could also be that at no stage does it seem to use multiple processors even though I have provided a high number with -p option. Is it possible that something was missed during installation so that parallelization is somehow not available to sinto on my system?

Thanks

Single-Read BAM compatibility

Hi @timoast,

I am wondering if there is a way to make the program, specifically the fragments function, compatible with single-read (SR) BAM files.

Thank you very much!

filter barcodes parallel issue

This is similar to, but different from, issue #15.
This is my original single-cell BAM file:

@RG     ID:CH4-LN_2_L001        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L002        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L003        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L004        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L005        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L006        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@RG     ID:CH4-LN_2_L007        SM:CH4-LN_2     LB:CH4-LN_2     PL:illumina
@PG     ID:STAR PN:STAR VN:2.7.4a       CL:STAR   --runThreadN 8   --genomeDir /data/wangzw/dropseqMetadata_b37/STAR   --readFilesIn unaligned_mc_tagged_polyA_filtered.fastq      --outFileNamePrefix star   --outReadsUnmapped Fastx   --twopassMode Basic
@PG     ID:0    PN:TagReadWithGeneFunction      CL:TagReadWithGeneFunction INPUT=merged.bam OUTPUT=star_gene_exon_tagged.bam ANNOTATIONS_FILE=/data/wangzw/dropseqMetadata_b37/b37.refFlat    GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false  VN:2.3.0(34e6572_1555443285)
ST-E00522:612:H2WHKCCX2:7:1220:14093:56985      16      1       10166   0       49M44N32M       *       0       0       CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAAGCCCTAACCCTAACCCTAACCCTAACCCTAACCC       JJJJJJJJFJJFJJJFAJJJJAJJJJJJ7JJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAJJJJJFJJJJFFFAA       XC:Z:CCTTGTCGACTC
       MD:Z:47C33      XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L003      NH:i:5  NM:i:1  XM:Z:TTGGCCTC   ZP:i:82 UQ:i:41 AS:i:79
ST-E00522:612:H2WHKCCX2:7:2216:25540:11839      16      1       11743   1       150M    *       0       0       TGACGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCT  <7<-A-JJA7FFAAA))7)7A-7AFF7JA<F7JJ<JJJJFFA-<AJJJFF
JJFFF--777JA<FA-7JF7AJJFJJJJJJJFJJFJJ<J7J<JJJJFJFJF<<FJJFJ<JJJJJJJJFJFFFJJJJJFA7-JJAFF<JJJFFJJJAAAAA  XC:Z:GAGACGAGGCCC       MD:Z:3T146      XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L003      NH:i:3  NM:i:1  XM:Z:ACTGGCCA   UQ:i:12 AS:i:146        gf:Z:CODING,INTERGENIC  gn:Z:DDX11L1,DDX11L1    gs:Z:+,+
ST-E00522:612:H2WHKCCX2:6:1124:17797:16041      16      1       11762   1       150M    *       0       0       CCGGTGTTGAAAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTT  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:CTTCTTCCGCTT       MD:Z:10G139     XF:Z:INTERGENIC PG:Z:STAR       RG:Z:CH4-LN_2_L007      NH:i:3  NM:i:1  XM:Z:ATGGAGGA   UQ:i:41 AS:i:146        gf:Z:CODING,INTERGENIC  gn:Z:DDX11L1,DDX11L1    gs:Z:+,+

which has 7 RG from CH4-LN_2_L001 to CH4-LN_2_L007

I want to extract reads tagged by 100 cell barcodes, and here is my code:
sinto filterbarcodes -b star_gene_exon_tagged.bam -c xcForTest.txt --barcodetag XC -p 16

xcForTest.txt contains 100 cell barcodes, like this:

ACCGTCAGCGAT	subset
GTTCAGAATAGC	subset
GCAACACGAGTG	subset
GCTTCACCCTTA	subset
TCGATCCACGAG	subset
CACGCCAATTAG	subset
CGACCGGGAAAA	subset
CAAGCATATGCA	subset
CTCATGTTGTAG	subset
TCCTCCGACCCA	subset
...

When I checked the output BAM (subset.bam) using this command:
samtools view -h ./subset.bam | grep "CH4-LN_2_L007-" | less
I found some unexpected records like:

ST-E00522:612:H2WHKCCX2:6:2112:30259:10521      16      1       197113101       255     25M     *       0       0       GCTCTTCTGCATTTCCTAGTAATAT       JJJJJJJJJJJJJJJJJJJJFFFAA       XC:Z:CTAACAAGTTCT       MD:Z:25 XF:Z:CODING     NH:i:1  NM:i:0  XM:Z:TGAGCGGG   ZP:i:26 UQ:i:0  AS:i:24 gf:Z:CODING     gn:Z:ASPM
       gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2103:22922:9027       16      1       197113183       255     48M2040N102M    *       0       0       TTCTTTAATTACTCTCCACTTAACAGAAATAACAATTTTCTCTTTAGGCTGCAACACGAAACAGCGCTGCGACACACTGAAGCCCAGGTCCGCGGCCGGGAAGTGGGAGATCTTCACTTCTGCCACCTCCTCGTTAGGGTTGTCTAGGGC  <FF7---7-7<A<FJFF<-JJJJJAJFFAFFAFJFJAF7JJFFA7-7-AFJJFJAJAJJJ<JFJJFFAJJFFA<FJJFFA7AA77--AAA7A-7AJ-JFJJA7-<AJJJJJJJAJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:CTAACAAGTTCT       MD:Z:6G1G1G0G1G4G131    XF:Z:CODING     NH:i:1  NM:i:6  XM:Z:TGAGCGGG   UQ:i:132        AS:i:137        gf:Z:CODING     gn:Z:ASPM       gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB
     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2208:13829:28611      16      1       197122849       255     150M    *       0       0       AAGTAAAACAAAGAACTAGTTCAATATACAGTACACTTCCTACTCTTCACAGAGAACTGAAATTTTCTATAAAGACATTTATACTTAGGAAACATCAGACAACCAAAGTATGTATAAAACTCACAAGATATTTTACACACAGTTCACAAT  AA--A<-F<F<<7<F7-77<<JFF<F7<-A<FFF77-FFF<-AFAAFJJJJJJFJJFJJJAAFJFJJFJFJJJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  XC:Z:GTGATGTCGGCT       MD:Z:6C143      XF:Z:UTR        NH:i:1  NM:i:1  XM:Z:GCGTGAGG   UQ:i:12 AS:i:146        gf:Z:UTR        gn:Z:ZBTB41     gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2105:16122:62505      16      1       197168944       255     150M    *       0       0       CTTTTTATAACAAAAATGTCTACTACAGAATTTGCACTGATGATTATTTGATAGTCTTCCAGTTAATTCATTTAGTGTTTCTTCTGGTGATGACTTTTCACTTAGCTCTGAATGAAAAGGGGCAACATTTTCGTTATTTAACAACTTCAC  FA<-<7JJJJJJJJJJJJFAJJJJFAJJJJJFJAAFJJJJJJJJFFJJJJJJJJJJFJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJAJFJJJJAF7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFAFAA  XC:Z:GAGTCGAGCGAA       MD:Z:100G49     XF:Z:CODING     NH:i:1  NM:i:1  XM:Z:AGGGGCGG   UQ:i:37 AS:i:146        gf:Z:CODING     gn:Z:ZBTB41     gs:Z:-  RG:Z:CH4-LN_2_L007-364D2CCB     PG:Z:STAR-7303B068

'-364D2CCB' was appended to the RG tag (and a similar suffix to the PG tag), which affects my downstream processing.
Is it a bug?

filterbarcodes empty BAM file

Hi Tim! Great idea, this tool. Really useful! I want to subset by barcode a BAM file from 10x 3' scRNA-seq that has already been subset with samtools for a specific gene locus. sinto filterbarcodes seems to run smoothly with no errors, but then my new BAM file is empty:

` sinto filterbarcodes -b BAM_ACE2/1B1_ACE2.bam
-c BC_groups/1B1_ACE2.txt -o BAM_ACE2_BC/1B1_ACE2_BC.bam
--barcodetag "CB"

Function run_filterbarcodes called with the following arguments:

bam BAM_ACE2/1B1_ACE2.bam
cells BC_groups/1B1_ACE2.txt
output BAM_ACE2_BC/1B1_ACE2_BC.bam
trim_suffix False
sam False
nproc 1
barcode_regex None
barcodetag CB
func <function run_filterbarcodes at 0x7fb2498cb8c8>

Function completed in 0.0 m 0.12 s

`
My barcodes file is a tab-delimited txt file with no quotes generated in R like this:
AGCATACTCAATCACG-1 Goblet
AGTCTTTTCATCGCTC-1 Basal
AGTGAGGTCCACGAAT-1 Basal
ATTCTACAGATTACCC-1 Secretory
CAAGGCCAGATCCCAT-1 Ciliated
CATCAGAAGGCTAGAC-1 Goblet
CATTATCCACCTCGGA-1 Secretory
CATTCGCAGCTAGCCC-1 Basal
CCACGGACACAGATTC-1 Secretory
CCTACACAGACGCACA-1 Basal
AAGGTTCCATACTACG-1 Basal
AAGTCTGCATACTCTT-1 Goblet
CGTCCATTCAAGGTAA-1 Ciliated
GATCAGTGTTCACGGC-1 Basal
GATGAAACATTCGACA-1 Ionocytes
GGACATTAGGTCATCT-1 Basal
GGGTTGCCACGACGAA-1 Secretory
ACGATACCACAGAGGT-1 Basal
TCATTACTCGCCGTGA-1 Goblet
TGAGCATTCTGCTGTC-1 Basal
ACGGCCACAGCGATCC-1 Goblet
TTGACTTCAGACTCGC-1 Basal
TTTGCGCCATGAACCT-1 Ciliated
ACTTTCACACTGTTAG-1 Goblet

I checked whether the barcodes are actually found in the BAM file, and that seems to be the case, at least for the couple I tested:

samtools view BAM_ACE2/1B1_ACE2.bam | grep -i "AGCATACTCAATCACG-1" A00198:47:H5L7HDMXX:2:1442:24578:30373 256 X 15601544 0 37M2087N53M * GGCGCGATCTCGGCTCACTGCAAGCTCTGCCTCCCGGGTTCACGCCATTCTCCTGCCTCGGCCTCCCGAGTAGCTGGGACTACAGGCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:7 HI:i:7 AS:i:85 nM:i:1 RE:A:I BC:Z:CCTGTGCG QT:Z:FFFFFFFF CR:Z:AGCATACTCAATCACG CY:Z:FFFFFFFFFFFFFFFF CB:Z:AGCATACTCAATCACG-1 UR:Z:CCACTTAGTT UY:Z:FFFFFFFFFF UB:Z:CCACTTAGTT RG:Z:1B1:MissingLibrary:1:H5L7HDMXX:2
Also, according to the documentation I should get a BAM file for each cell group, but my output is only one empty BAM file.

Much appreciated

Do you accept PRs?

Hi,
I have a use case where I need to append the CB tag to each read's read group ID (in addition to setting the read group's SM tag to the cell barcode). I have some working code for this and I could generate a PR against this repo. Are you interested in adding that functionality to this tool?

no fragments produced

Hello,

I tried running sinto on non-10x ATAC-seq data, and the fragment file is empty. I was wondering if you could help me with that?

here is the output for samtools view merged.bam | head

(signac_env) -bash-4.2$ samtools view AdultCTX_DNA_merge.bam | head
7001113:989:HTKVHBCX2:2:1105:15600:5528:77:15:82:15:CGGTATTTGG	0	chr1	3000049	1	23M	*	0	0	TCTTTGAAGGTCTGGTAGAACTC	DDDDDIIIIIIIIIIIIIIIIII	AS:i:0
7001113:991:HVWNKBCX2:1:2110:2683:18182:87:93:72:36:CAGGTATGGC	0	chr1	3000132	39	53M	*	0	0	GACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCAT	DDDDCHIIHGFHIIIIIIIIIIHHHEHIIIIIIIDDGHHHDHHIIIIIIIIHI	AS:i:0
7001113:990:HTKL3BCX2:2:2111:14944:4116:87:93:72:36:CAGGTATGGC	0	chr1	3000134	38	52M	*	0	0	CTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCATG	DDDDDIIIIIIIIIIIIIIIIIIIGIIIIIIIDFHHHHHIIIIIIIIIHIII	AS:i:0
7001113:991:HVWNKBCX2:2:2105:19568:57493:53:23:77:35:CACTATTTTG	0	chr1	3000159	0	52M	*	0	0	GGGGGGGCATGGGACTTTTAGTCCATGAATCTGATCCTGATTTAGCTTTGGT	DDDDDIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIHEHIIHGIIIIIH	AS:i:-22
7001113:993:HVWMKBCX2:2:1211:1360:3780:53:23:77:35:CACTATTTTG	0	chr1	3000353	37	53M	*	0	0	GTTAATTATAGTACAGTCCCTATGCCCTCTAGTTAGTCTGGCTAAGGGTTTAT	DDDDDIIHIIIIIIIIIIIHIIIIIIIIIIIIHIGIHIIHHIIIHHIIIHIII	AS:i:0
7001113:991:HVWNKBCX2:2:2211:8406:75832:21:06:91:12:TCATCTTTGT	16	chr1	3000464	1	52M	*	0	0	TCTTTTTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGC	IHIHIIIIHHFF@EHIHGCHF<1IHFEEHEIHIIHIIHHHFIIGIHIDDDDD	AS:i:0
7001113:989:HTKVHBCX2:2:1211:2292:73091:90:80:26:12:CTGTACGGCT	0	chr1	3000559	42	53M	*	0	0	CTTCTAGATTTGCTGTCAGGCTGCTAGTGTATACTCTAGTTTCCTTTTGGAGG	DDCDDIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIHIIIIIIIIIIIIHIII	AS:i:0
7001113:991:HVWNKBCX2:1:2205:17457:98814:90:80:26:12:CTGTACGGCT	0	chr1	3000633	30	53M	*	0	0	CTCTTAGGACTGCCTCATTGTGCCCCATATGTTTGGCTATGTTGTGGATTTAT	DDDDDIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIII	AS:i:0
7001113:989:HTKVHBCX2:1:2216:4512:6199:90:80:26:12:CTGTACGGCT	0	chr1	3000747	32	52M	*	0	0	ATTAAGTAGAGTATTGTTCAGTTTCCAGGTGAATGTTGGCTTTCTATTATTT	DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIH	AS:i:0
7001113:989:HTKVHBCX2:2:2209:10226:10613:58:87:87:11:TCGAATTTGT	0	chr1	3000919	32	51M	*	0	0	ATTTGGTACTGAGAAGAAGGTATATATCCTTTTGTCTTATGATAAAATGTT	DDDDDIIIIIIIHIIIIIHIIIHIIIIIIIIIIIIIIGIIIIIIIIIIIII	AS:i:0

and this is the code I ran:
sinto fragments -b merged.bam -f fragment

sinto fragments returning an empty fragment file

Hi, Dr. Tim,
Hope this email finds you well!
I am using the tool sinto to create a scATAC-seq fragments file from the BAM file.
However, I came across an issue: my output file is empty, with nothing in it. Below is the script I used to run sinto. Could you please tell me whether I made a mistake and how to solve it?

#!/bin/bash
#SBATCH --partition=Orion
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --mem=64GB
#SBATCH --ntasks-per-node=1

cd /scratch/qmei/wqq/mouse-TF/

export PATH=/users/qmei/anaconda3/bin/:$PATH

sinto fragments -b Cerebellum_62216.bam -f Cerebellum_62216fragments.bed

sinto filterbarcodes returning empty bams

Hello Tim,

I am trying to use sinto filterbarcodes for making bam files from my scATAC clusters, and despite running without any error I am getting empty bam files. I saw that this issue had been brought up previously, and updating the sinto version solved the problem. However, I am using the latest 0.7.2.2 version of sinto, so that is probably not the issue. In that post you had also asked the user to ensure that the "cells" file is indeed tab-delimited and I verified that is true for me.

I am running the following code:
sinto filterbarcodes -b fragments_10X_sorted.bam -c cells.csv --barcodetag "CB"

This is how the head of my bam file looks:
GACCTTCGTTATGCAC-2 0 chr1 10158 255 151M * 0 0 * *
TCAAGGTAGTGAACCG-3 0 chr1 10229 255 98M * 0 0 * *
ATTGTCTTCGAAGCCC-4 0 chr1 10335 255 219M * 0 0 * *
GTAGTACCAAGAAACT-4 0 chr1 10793 255 142M * 0 0 * *

And this is the head of my cells file:

AAACGAAAGAACGACC-4 11
AAACGAAAGACCTATC-2 0
AAACGAAAGAGGAATG-3 2
AAACGAAAGCCTATAC-3 11
AAACGAAAGCGTCAAG-4 2
AAACGAAAGCTAGCAG-1 2

Please let me know if I am missing something/what is the potential cause of the issue.

Thanks
Debbie
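
As displayed above, the BAM records carry the barcode in the read-name column and show no optional tag fields after the quality column, so one thing that may be worth ruling out is that the reads have no CB tag for `--barcodetag "CB"` to match. The following is a minimal sketch, not part of sinto, that inspects SAM text lines (for example the output of `samtools view`) for a tag. The function name `get_tag` is hypothetical.

```python
def get_tag(sam_line, tag="CB"):
    """Return the value of an optional SAM tag, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == tag:
            return parts[2]
    return None


# Two illustrative records: one with a CB tag, one with the barcode only in the read name.
tagged = "r1\t0\tchr1\t10158\t255\t151M\t*\t0\t0\t*\t*\tCB:Z:GACCTTCGTTATGCAC-2"
untagged = "GACCTTCGTTATGCAC-2\t0\tchr1\t10158\t255\t151M\t*\t0\t0\t*\t*"

print(get_tag(tagged))
print(get_tag(untagged))
```

If the second case applies, the barcode would first have to be copied from the read name into a tag, which the sinto introduction lists as a supported operation.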

Normalizing the signal - creating a bigwig

Hi,

This is a very useful tool.

I am wondering what would be the best way to normalize the data for visualization on the UCSC genome browser. I saw that CoveragePlot in Signac considers the total number of reads per cluster and the total number of cells per cluster. I would like to know the best way to get a scaling factor that can then be used to normalize the bedGraph file.

genomeCoverageBed -ibam cluster_bam.bam -bg | awk -v OFS="\t" -v sf="$scaling_factor" '{ $4=$4*sf; print }' > cluster_bam.bg

Then use 'bedGraphToBigWig' to make a bigWig.
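
One possible way to derive and apply a scaling factor is sketched below. It assumes a reads-per-million style normalization based on the total read count of the cluster; this is an assumption for illustration and is not claimed to match what Signac's CoveragePlot computes. The function names are hypothetical.

```python
def cpm_scaling_factor(total_reads):
    """Reads-per-million style factor (an assumed normalization choice)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1_000_000 / total_reads


def scale_bedgraph_line(line, factor):
    """Multiply the fourth (coverage) column of one bedGraph line by factor."""
    chrom, start, end, value = line.rstrip("\n").split("\t")[:4]
    return "\t".join([chrom, start, end, f"{float(value) * factor:.4f}"])


# Example: a cluster with 2,000,000 reads gives a factor of 0.5.
factor = cpm_scaling_factor(2_000_000)
print(scale_bedgraph_line("chr1\t100\t200\t8", factor))
```

Dividing additionally by the number of cells in the cluster would be another option if clusters of very different sizes are being compared.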

Error with addtags on a transcriptome-aligned BAM file

Hi! I love your tool and I've been using it quite extensively. I have found an error with sinto addtags on a transcriptome-aligned BAM file. I believe the nature of the error is that it treats every transcript as a chromosome, and there are many thousands compared to a 'normal' amount of chromosomes. Is there a better way to handle breaking chromosomes without recursion?

sinto addtags -b Enrichment_PCR_A1-1A_t.bam -m readname -o Enrichment_PCR_A1-1A_t.tagged.bam -p 88 -f sinto_tagfile.txt
Function run_addtags called with the following arguments:

bam     Enrichment_PCR_A1-1A_t.bam
tagfile sinto_tagfile.txt
output  Enrichment_PCR_A1-1A_t.tagged.bam
trim_suffix     False
sam     False
nproc   88
mode    readname
func    <function run_addtags at 0x7f4837808d30>
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
    return find_chromosome_break(position, chromosomes, current_chrom + 1)
  File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
    return find_chromosome_break(position, chromosomes, current_chrom + 1)
  File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
    return find_chromosome_break(position, chromosomes, current_chrom + 1)
  [Previous line repeated 996 more times]
  File ".../lib/python3.9/site-packages/sinto/utils.py", line 71, in find_chromosome_break
    if position <= chromosomes[current_chrom]:
RecursionError: maximum recursion depth exceeded in comparison

Attempting to run with 1 processor resulted in the same error.
bam file:

d3ce747b-2355-4385-9d2e-5d1f308a5e8b	256	MSTRG.1.1	1	0	64S85M1I10M1D46M3D78M1I83M2D4M3D30M3D12M613N22M1I144M11D5M3I20M1I29M3D62M1D4M2D3M1D111M1D13M1I23M63S	*	0	0	*	*	NM:i:65	ms:i:513	AS:i:489	nn:i:0	ts:A:+	tp:A:S	cm:i:124	s1:i:589	de:f:0.0537	MD:Z:41C37C1G13^T18A1G0A24^CCA0C41A0C9G24G3G2G75^CG4^GAG30^CAC113G4T33T25^ACCCCCCAACA7G19A26^GGG2A13G12C8A23^A4^CG3^C16G52T19A21^C6A29	rl:i:0
4909bb1f-6734-4e67-86e2-2f6dce0feff9	0	MSTRG.1.1	1	0	65S100M1D44M1D63M1D9M5D11M1D45M2I42M1I27M1I4M2D12M1D11M1D30M2D11M5D25M1D1M3I11M2D53M64S	*	0	0	TATTGTACTTCGTTCAGTTACGTATTGCTGGGGAAGCAGTGGTACAACGCAGGTACATGGGGCCTCCGGAAGTGCGGATCCCAGCGGCAGTCGTGTAGCTGAGCAGGCCTGGGGCTTGGTTCTATGTCCCTGTGGCTATGTTTCCAGTGTCCTCTGGGTGTTTCCAGAGCAACAAGAAACGAATAAATCTCTGCCCCGCAGCGCCTCCACCCAGAGACCCGGACCAAATTCACACAGGACAATCTGTGCCGTGCCCAGCGCAAGCGCCTGGATCGGCCAACGGACCTTGTGGTTCCATCCTTCGCGACACCTCCGAGGACCTGGGACTCCAGTGTGAAGCGCCGTGAACCTGGCCTTCGGGCGCCGCTGTGAGGAACTGGAGGGACGCGCGGCACAAGCTGCAGCACCACCCTGCAAGGTGGGGCACCTGAACCCCGAGACGGCTCCGGCGCCTGCCCACCACTCTGCCCTCTCATCACCTGGGCCCTCAACCCTTTTCTCCTTATATTCCCACTTGTCGAGGGACCCCAGAAGCAAGTGTCACCTCTCCATAAACCTATGTAAAACCAGGCAAAAAAAAAAAAAAAAAAAAACATTTGTAGATCTTCACGGAGCGGAAGAGATCGGAAAGAGCA	$#%'07:8==<?CA@=?AB?7;<===%2432))/.%%%;=;31-*+*../.)775=@?B@@A==@::?=:;6??46:=<4454.11'$$%4*150,,&--0.$%%,9:;9;=;74578:<<;:3128?>578880-'-%(4><+<469:>:4<::76.4/::<823'9;;<57871A=;11142:79:878859865,//,.+/...3*+-*+.333955.460/0+(24499955;?=>?>@:<;8=1379@@@>;:/.00287854221/&)')10.&#$('%$$%'*'''-((1>=9:?D;67<;B>?A<;>=99;9:;@ABE@:BB9;7/(+)&*45999-,6:5;=::?>>>896:77=>)%%7&>@8>DCEC388/29</3=@@C@==CFC5><522-*+3379:44-,+'056&-212277697400/.&',2448<=>969609861-.,)-046;:61/0132)6332.-+(')).2-+/+-%*&+6<>;<:0'(&&'&)56==8<7'/145.1>?<<>DFAE728<9>?@=<?>>?2><>AAIF?;>DB?B?ABEB;=<<B=?BG@B@D@A@C??<=7978;/59898,445:66<=96356<?8:>G;7/58>949;<1.0,'&	NM:i:46	ms:i:293	AS:i:293	nn:i:0	ts:A:+	tp:A:P	cm:i:68	s1:i:350	s2:i:350	de:f:0.0621	MD:Z:23G17C58^A44^C18G22A0C9G10^C9^CTGTG2T0G2G4^A0C1G115^CA12^C11^A30^CC11^CCTGG14A1C8^C1G8A1^CC32G20	rl:i:25
35e30db5-064c-4fb7-b116-81d590761199	256	MSTRG.1.1	1	0	76S26M1I22M4I2M2D2M1I6M1I35M1D22M2D6M3D16M2D9M7D6M6I87M1D26M1D91M1D8M3I21M1D38M1D31M1D40M63S	*	0	0	*	*	NM:i:74	ms:i:172	AS:i:172	nn:i:0	ts:A:+	tp:A:S	cm:i:42	s1:i:255	de:f:0.1035	MD:Z:23G17C8^TT8G5G28^T15G2A3^AT0A5^CTG2C1G11^CC9^CGGACCA7A16A0C9G25G2G2G18G0A1A0C2^C26^C20G0C0C27A0C0A3T0G2G19G10^C15G13^A24C0T2G9^C31^C24G15	rl:i:0
27cb0203-4357-4c90-a809-0a29640741b6	256	MSTRG.1.1	1	0	67S132M2I68M2D79M3D80M250S	*	0	0	*	*	NM:i:15	ms:i:294	AS:i:294	nn:i:0	ts:A:+	tp:A:S	cm:i:86	s1:i:322	de:f:0.0304	MD:Z:41C91G53A0C9G1^GC25G2G50^CGC1G78	rl:i:0
260f47db-336a-49ab-9fb3-93310a1377a1	256	MSTRG.1.1	1	0	62S64M2I37M5D9M4I16M1I2M3D9M3D23M1I11M1D15M1D10M1I3M1I3M2I22M1D27M1D34M1D8M5D65M2D14M5D10M1D33M1D25M2D4M1D32M95S	*	0	0	*	*	NM:i:74	ms:i:154	AS:i:154	nn:i:0	ts:A:+	tp:A:S	cm:i:58	s1:i:273	de:f:0.1024	MD:Z:41C22C1G34^AGAGC5G3C17^GCA7C1^CCC34^G4A0C4A0G0C0G1^G3G21G2G2G6^T0G10G7C2G0G3^T0G33^G0G7^TGTGA65^GA14^CCTGC6C0T2^G0C2C12G16^C25^GA4^C24G7	rl:i:22
462e1896-1f91-42b5-901c-db9f4fb92c16	256	MSTRG.1.1	1	0	97S90M1D51M1D28M1I2M2I78M3I7M1I17M2D83M613N100M5D62M1I16M3I8M1D7M2I5M1D28M2D48M2I2M1D6M3D8M1I5M3I43M1D70M82S	*	0	0	*	*	NM:i:59	ms:i:497	AS:i:473	nn:i:0	ts:A:+	tp:A:S	cm:i:166	s1:i:629	de:f:0.0536	MD:Z:90^G51^C16A0C1A89A5G16^GT27G152T0G1^AGCTG0C3C0A73G6^A4G0G6^A28^GA50^T3C2^GGA13G42^C6T0A10C0A1C0A47	rl:i:0

tagfile

d129905b-f46c-4287-be0b-78bcfbc33d41    CB      ATCTTGACCTGCAACG
0b337571-132f-4f0a-b4dd-b5df16d9654b    CB      ACGTTATTGGTCACTC
4fa9ab30-90ea-4041-a30a-18368e288a08    CB      CAACGTGGTGGAGTCT
b6105613-6ccb-466b-9896-12bba3f9b999    CB      AGCGACCAACGATATT
77ee2019-2acd-4cc9-8345-51f88617466d    CB      GTTACCTACAACTTGC

Thank you! In the meantime, I may hack together some bash script to operate on the sam file.
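
On the question of avoiding recursion: the traceback shows `find_chromosome_break` calling itself once per contig, so a reference with thousands of transcripts exceeds Python's default recursion limit. A loop-based version could look like the sketch below. Only the comparison and the recursive call are visible in the traceback, so the data layout (a sequence of contig lengths) and the subtraction step are assumptions, and the function name is hypothetical.

```python
def find_chromosome_break_iter(position, chromosomes, current_chrom=0):
    """Locate a genome-wide offset without recursion.

    Assumes `chromosomes` is an ordered sequence of contig lengths and
    `position` is an offset into their concatenation (assumed layout).
    Returns (contig_index, position_within_contig).
    """
    while current_chrom < len(chromosomes):
        if position <= chromosomes[current_chrom]:
            return current_chrom, position
        # Assumed step: move past this contig and try the next one.
        position -= chromosomes[current_chrom]
        current_chrom += 1
    raise ValueError("position lies beyond the last contig")


# 5000 short contigs, far more than the default recursion limit allows.
lengths = [100] * 5000
print(find_chromosome_break_iter(250_050, lengths))
```

The loop does the same walk as the recursive form but uses constant stack depth, so the number of contigs no longer matters.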

error filterbarcodes with 10x ATAC bam

Hi Tim, @timoast

Thanks for helping me get the --cells input figured out.
I've tested this with 10 barcodes and the function mentioned in #44 works well with this format.

Now, running the full filterbarcodes command, I get an error that I cannot understand.
I am using the ATAC data from 10X.

This is my script:

export sinto=$HOME/.local/bin/sinto
export BAM_FILE="$HOME/data/10x/pbmc_3k/pbmc_granulocyte_sorted_3k_atac_possorted_bam.bam"
export TEST_BCODES="$HOME/data/10x/pbmc_3k/test_atac_barcodes.tsv"

cd ~/data/10x/pbmc_3k/sinto_filterbarcodes

$sinto filterbarcodes \
    --bam $BAM_FILE \
    --cells $TEST_BCODES \
    --nproc 10 \
    --barcodetag "CB"

My barcodes .tsv file:

mramos@supermicro ~/data/10x/pbmc_3k/sinto_filterbarcodes $ cat $TEST_BCODES | head -3
"TTAGTCAGTCCTCCCA-1"    "A"
"CATATAGAGTCAAGAC-1"    "B"
"GATATGATCTAATCCG-1"    "C"

The output of the command:

Function run_filterbarcodes called with the following arguments:

bam     ./data/10x/pbmc_3k/pbmc_granulocyte_sorted_3k_atac_possorted_bam.bam
cells   ./data/10x/pbmc_3k/test_atac_barcodes.tsv
trim_suffix     False
nproc   10
barcode_regex   None
barcodetag      CB
func    <function run_filterbarcodes at 0x7fdd4fb03670>
[E::hts_open_format] Failed to open file "TTAGTCAGTCCTCCCA-1_L99RMI" : No such file or directory
samtools merge: fail to open "TTAGTCAGTCCTCCCA-1_L99RMI": No such file or directory
[E::hts_open_format] Failed to open file "TTAGTCAGTCCTCCCA-1.tmp" : No such file or directory
samtools reheader: fail to open file 'TTAGTCAGTCCTCCCA-1.tmp': No such file or directory
Traceback (most recent call last):
  File "/.local/bin/sinto", line 8, in <module>
    sys.exit(main())
  File "/.local/lib/python3.8/site-packages/sinto/arguments.py", line 457, in main
    options.func(options)
  File "/.local/lib/python3.8/site-packages/sinto/utils.py", line 23, in wrapper
    func(args)
  File "/.local/lib/python3.8/site-packages/sinto/cli.py", line 17, in run_filterbarcodes
    filterbarcodes.filterbarcodes(
  File "/.local/lib/python3.8/site-packages/sinto/filterbarcodes.py", line 119, in filterbarcodes
    mergeAll(idents=idents, classes=unique_classes, nproc=nproc, header = headerfile, remove=True)
  File "/.local/lib/python3.8/site-packages/sinto/filterbarcodes.py", line 58, in mergeAll
    raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted

Thank you for taking a look!

Best regards,
Marcel

Key Error 'RG' with filterbarcodes

Hello,

I have some scATAC data from which I am trying to generate pseudobulk files using text files of cell barcodes. The fastq files were aligned with bowtie2 and then converted into .bams and sorted using samtools.

I am encountering the following error:

sinto filterbarcodes -b f1.sorted.bam -c fibroblast_cells.txt --outdir f1_CFs.bam -p 1
Function run_filterbarcodes called with the following arguments:

bam	f1.sorted.bam
cells	fibroblast_cells.txt
trim_suffix	False
nproc	1
barcode_regex	None
barcodetag	CB
outdir	f1_CFs.bam
sam	False
func	<function run_filterbarcodes at 0x110ca9550>
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 25, in _iterate_reads
    newhead = dict((k, header[k]) for k in ("HD", "SQ", "RG"))
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 25, in <genexpr>
    newhead = dict((k, header[k]) for k in ("HD", "SQ", "RG"))
KeyError: 'RG'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/alexwhitehead/miniconda3/bin/sinto", line 8, in <module>
    sys.exit(main())
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/arguments.py", line 472, in main
    options.func(options)
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/utils.py", line 23, in wrapper
    func(args)
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/cli.py", line 17, in run_filterbarcodes
    filterbarcodes.filterbarcodes(
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 111, in filterbarcodes
    idents = p.map_async(
  File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'RG'

I am unsure if this is caused by missing the read group portion of the header - when I ran

samtools view -H f1.bam

I got the following output:

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:248956422
@SQ	SN:chr2	LN:242193529
@SQ	SN:chr3	LN:198295559
@SQ	SN:chr4	LN:190214555
@SQ	SN:chr5	LN:181538259
@SQ	SN:chr6	LN:170805979
@SQ	SN:chr7	LN:159345973
@SQ	SN:chr8	LN:145138636
@SQ	SN:chr9	LN:138394717
@SQ	SN:chr10	LN:133797422
@SQ	SN:chr11	LN:135086622
@SQ	SN:chr12	LN:133275309
@SQ	SN:chr13	LN:114364328
@SQ	SN:chr14	LN:107043718
@SQ	SN:chr15	LN:101991189
@SQ	SN:chr16	LN:90338345
@SQ	SN:chr17	LN:83257441
@SQ	SN:chr18	LN:80373285
@SQ	SN:chr19	LN:58617616
@SQ	SN:chr20	LN:64444167
@SQ	SN:chr21	LN:46709983
@SQ	SN:chr22	LN:50818468
@SQ	SN:chrX	LN:156040895
@SQ	SN:chrY	LN:57227415
@SQ	SN:chrM	LN:16569
@SQ	SN:chr1_KI270706v1_random	LN:175055
@SQ	SN:chr1_KI270707v1_random	LN:32032
@SQ	SN:chr1_KI270708v1_random	LN:127682
@SQ	SN:chr1_KI270709v1_random	LN:66860
@SQ	SN:chr1_KI270710v1_random	LN:40176
@SQ	SN:chr1_KI270711v1_random	LN:42210
@SQ	SN:chr1_KI270712v1_random	LN:176043
@SQ	SN:chr1_KI270713v1_random	LN:40745
@SQ	SN:chr1_KI270714v1_random	LN:41717
@SQ	SN:chr2_KI270715v1_random	LN:161471
@SQ	SN:chr2_KI270716v1_random	LN:153799
@SQ	SN:chr3_GL000221v1_random	LN:155397
@SQ	SN:chr4_GL000008v2_random	LN:209709
@SQ	SN:chr5_GL000208v1_random	LN:92689
@SQ	SN:chr9_KI270717v1_random	LN:40062
@SQ	SN:chr9_KI270718v1_random	LN:38054
@SQ	SN:chr9_KI270719v1_random	LN:176845
@SQ	SN:chr9_KI270720v1_random	LN:39050
@SQ	SN:chr11_KI270721v1_random	LN:100316
@SQ	SN:chr14_GL000009v2_random	LN:201709
@SQ	SN:chr14_GL000225v1_random	LN:211173
@SQ	SN:chr14_KI270722v1_random	LN:194050
@SQ	SN:chr14_GL000194v1_random	LN:191469
@SQ	SN:chr14_KI270723v1_random	LN:38115
@SQ	SN:chr14_KI270724v1_random	LN:39555
@SQ	SN:chr14_KI270725v1_random	LN:172810
@SQ	SN:chr14_KI270726v1_random	LN:43739
@SQ	SN:chr15_KI270727v1_random	LN:448248
@SQ	SN:chr16_KI270728v1_random	LN:1872759
@SQ	SN:chr17_GL000205v2_random	LN:185591
@SQ	SN:chr17_KI270729v1_random	LN:280839
@SQ	SN:chr17_KI270730v1_random	LN:112551
@SQ	SN:chr22_KI270731v1_random	LN:150754
@SQ	SN:chr22_KI270732v1_random	LN:41543
@SQ	SN:chr22_KI270733v1_random	LN:179772
@SQ	SN:chr22_KI270734v1_random	LN:165050
@SQ	SN:chr22_KI270735v1_random	LN:42811
@SQ	SN:chr22_KI270736v1_random	LN:181920
@SQ	SN:chr22_KI270737v1_random	LN:103838
@SQ	SN:chr22_KI270738v1_random	LN:99375
@SQ	SN:chr22_KI270739v1_random	LN:73985
@SQ	SN:chrY_KI270740v1_random	LN:37240
@SQ	SN:chrUn_KI270302v1	LN:2274
@SQ	SN:chrUn_KI270304v1	LN:2165
@SQ	SN:chrUn_KI270303v1	LN:1942
@SQ	SN:chrUn_KI270305v1	LN:1472
@SQ	SN:chrUn_KI270322v1	LN:21476
@SQ	SN:chrUn_KI270320v1	LN:4416
@SQ	SN:chrUn_KI270310v1	LN:1201
@SQ	SN:chrUn_KI270316v1	LN:1444
@SQ	SN:chrUn_KI270315v1	LN:2276
@SQ	SN:chrUn_KI270312v1	LN:998
@SQ	SN:chrUn_KI270311v1	LN:12399
@SQ	SN:chrUn_KI270317v1	LN:37690
@SQ	SN:chrUn_KI270412v1	LN:1179
@SQ	SN:chrUn_KI270411v1	LN:2646
@SQ	SN:chrUn_KI270414v1	LN:2489
@SQ	SN:chrUn_KI270419v1	LN:1029
@SQ	SN:chrUn_KI270418v1	LN:2145
@SQ	SN:chrUn_KI270420v1	LN:2321
@SQ	SN:chrUn_KI270424v1	LN:2140
@SQ	SN:chrUn_KI270417v1	LN:2043
@SQ	SN:chrUn_KI270422v1	LN:1445
@SQ	SN:chrUn_KI270423v1	LN:981
@SQ	SN:chrUn_KI270425v1	LN:1884
@SQ	SN:chrUn_KI270429v1	LN:1361
@SQ	SN:chrUn_KI270442v1	LN:392061
@SQ	SN:chrUn_KI270466v1	LN:1233
@SQ	SN:chrUn_KI270465v1	LN:1774
@SQ	SN:chrUn_KI270467v1	LN:3920
@SQ	SN:chrUn_KI270435v1	LN:92983
@SQ	SN:chrUn_KI270438v1	LN:112505
@SQ	SN:chrUn_KI270468v1	LN:4055
@SQ	SN:chrUn_KI270510v1	LN:2415
@SQ	SN:chrUn_KI270509v1	LN:2318
@SQ	SN:chrUn_KI270518v1	LN:2186
@SQ	SN:chrUn_KI270508v1	LN:1951
@SQ	SN:chrUn_KI270516v1	LN:1300
@SQ	SN:chrUn_KI270512v1	LN:22689
@SQ	SN:chrUn_KI270519v1	LN:138126
@SQ	SN:chrUn_KI270522v1	LN:5674
@SQ	SN:chrUn_KI270511v1	LN:8127
@SQ	SN:chrUn_KI270515v1	LN:6361
@SQ	SN:chrUn_KI270507v1	LN:5353
@SQ	SN:chrUn_KI270517v1	LN:3253
@SQ	SN:chrUn_KI270529v1	LN:1899
@SQ	SN:chrUn_KI270528v1	LN:2983
@SQ	SN:chrUn_KI270530v1	LN:2168
@SQ	SN:chrUn_KI270539v1	LN:993
@SQ	SN:chrUn_KI270538v1	LN:91309
@SQ	SN:chrUn_KI270544v1	LN:1202
@SQ	SN:chrUn_KI270548v1	LN:1599
@SQ	SN:chrUn_KI270583v1	LN:1400
@SQ	SN:chrUn_KI270587v1	LN:2969
@SQ	SN:chrUn_KI270580v1	LN:1553
@SQ	SN:chrUn_KI270581v1	LN:7046
@SQ	SN:chrUn_KI270579v1	LN:31033
@SQ	SN:chrUn_KI270589v1	LN:44474
@SQ	SN:chrUn_KI270590v1	LN:4685
@SQ	SN:chrUn_KI270584v1	LN:4513
@SQ	SN:chrUn_KI270582v1	LN:6504
@SQ	SN:chrUn_KI270588v1	LN:6158
@SQ	SN:chrUn_KI270593v1	LN:3041
@SQ	SN:chrUn_KI270591v1	LN:5796
@SQ	SN:chrUn_KI270330v1	LN:1652
@SQ	SN:chrUn_KI270329v1	LN:1040
@SQ	SN:chrUn_KI270334v1	LN:1368
@SQ	SN:chrUn_KI270333v1	LN:2699
@SQ	SN:chrUn_KI270335v1	LN:1048
@SQ	SN:chrUn_KI270338v1	LN:1428
@SQ	SN:chrUn_KI270340v1	LN:1428
@SQ	SN:chrUn_KI270336v1	LN:1026
@SQ	SN:chrUn_KI270337v1	LN:1121
@SQ	SN:chrUn_KI270363v1	LN:1803
@SQ	SN:chrUn_KI270364v1	LN:2855
@SQ	SN:chrUn_KI270362v1	LN:3530
@SQ	SN:chrUn_KI270366v1	LN:8320
@SQ	SN:chrUn_KI270378v1	LN:1048
@SQ	SN:chrUn_KI270379v1	LN:1045
@SQ	SN:chrUn_KI270389v1	LN:1298
@SQ	SN:chrUn_KI270390v1	LN:2387
@SQ	SN:chrUn_KI270387v1	LN:1537
@SQ	SN:chrUn_KI270395v1	LN:1143
@SQ	SN:chrUn_KI270396v1	LN:1880
@SQ	SN:chrUn_KI270388v1	LN:1216
@SQ	SN:chrUn_KI270394v1	LN:970
@SQ	SN:chrUn_KI270386v1	LN:1788
@SQ	SN:chrUn_KI270391v1	LN:1484
@SQ	SN:chrUn_KI270383v1	LN:1750
@SQ	SN:chrUn_KI270393v1	LN:1308
@SQ	SN:chrUn_KI270384v1	LN:1658
@SQ	SN:chrUn_KI270392v1	LN:971
@SQ	SN:chrUn_KI270381v1	LN:1930
@SQ	SN:chrUn_KI270385v1	LN:990
@SQ	SN:chrUn_KI270382v1	LN:4215
@SQ	SN:chrUn_KI270376v1	LN:1136
@SQ	SN:chrUn_KI270374v1	LN:2656
@SQ	SN:chrUn_KI270372v1	LN:1650
@SQ	SN:chrUn_KI270373v1	LN:1451
@SQ	SN:chrUn_KI270375v1	LN:2378
@SQ	SN:chrUn_KI270371v1	LN:2805
@SQ	SN:chrUn_KI270448v1	LN:7992
@SQ	SN:chrUn_KI270521v1	LN:7642
@SQ	SN:chrUn_GL000195v1	LN:182896
@SQ	SN:chrUn_GL000219v1	LN:179198
@SQ	SN:chrUn_GL000220v1	LN:161802
@SQ	SN:chrUn_GL000224v1	LN:179693
@SQ	SN:chrUn_KI270741v1	LN:157432
@SQ	SN:chrUn_GL000226v1	LN:15008
@SQ	SN:chrUn_GL000213v1	LN:164239
@SQ	SN:chrUn_KI270743v1	LN:210658
@SQ	SN:chrUn_KI270744v1	LN:168472
@SQ	SN:chrUn_KI270745v1	LN:41891
@SQ	SN:chrUn_KI270746v1	LN:66486
@SQ	SN:chrUn_KI270747v1	LN:198735
@SQ	SN:chrUn_KI270748v1	LN:93321
@SQ	SN:chrUn_KI270749v1	LN:158759
@SQ	SN:chrUn_KI270750v1	LN:148850
@SQ	SN:chrUn_KI270751v1	LN:150742
@SQ	SN:chrUn_KI270752v1	LN:27745
@SQ	SN:chrUn_KI270753v1	LN:62944
@SQ	SN:chrUn_KI270754v1	LN:40191
@SQ	SN:chrUn_KI270755v1	LN:36723
@SQ	SN:chrUn_KI270756v1	LN:79590
@SQ	SN:chrUn_KI270757v1	LN:71251
@SQ	SN:chrUn_GL000214v1	LN:137718
@SQ	SN:chrUn_KI270742v1	LN:186739
@SQ	SN:chrUn_GL000216v2	LN:176608
@SQ	SN:chrUn_GL000218v1	LN:161147
@SQ	SN:chrEBV	LN:171823
@PG	ID:bowtie2	PN:bowtie2	VN:2.4.4	CL:"/Users/alexwhitehead/miniconda3/bin/bowtie2-align-s --wrapper basic-0 -p 8 -X2000 --local -x /Users/alexwhitehead/Applications/bowtie2_index/GRCh38_noalt_as -1 /Users/alexwhitehead/AW_ATAC/fetal_heart_sc_fastq/fastq/SRR11692126_S1_L001_R1_001.fastq.gz -2 /Users/alexwhitehead/AW_ATAC/fetal_heart_sc_fastq/fastq/SRR11692126_S1_L001_R2_001.fastq.gz"
@PG	ID:samtools	PN:samtools	PP:bowtie2	VN:1.15.1	CL:samtools view -H f1.sorted.bam

Do I need to edit the header somehow or is this due to another issue?

Thanks,

Alex
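
The header printed above contains @HD, @SQ and @PG lines but no @RG line, which is consistent with the `KeyError: 'RG'` raised at line 25 of filterbarcodes.py. A tolerant way to build the new header is sketched below. A plain dictionary stands in for the BAM header here, the function name `subset_header` is hypothetical, and this is not claimed to be the fix adopted in sinto.

```python
def subset_header(header, keys=("HD", "SQ", "RG")):
    """Copy only the header sections that are actually present."""
    return {k: header[k] for k in keys if k in header}


# A header without read groups, similar in shape to the bowtie2 output above.
header = {
    "HD": {"VN": "1.0", "SO": "coordinate"},
    "SQ": [{"SN": "chr1", "LN": 248956422}],
    "PG": [{"ID": "bowtie2", "PN": "bowtie2", "VN": "2.4.4"}],
}

print(subset_header(header))
```

Alternatively, adding a read group to the BAM before running filterbarcodes should avoid the missing key on the input side.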

Missing Fragments edge case

Hi Tim, thanks for the package!

Here's a simple toy bam file on which I was running sinto fragments. The MAPQs, fragment lengths etc seem fine, but there's no output when I run:
sinto fragments -b lol.bam -f lol.frag

lol.bam.zip
This file has 5 read pairs. I see outputs when I add more read pairs.

I was tracing through the code and it's most likely an edge case here:

sinto/sinto/fragments.py

Lines 218 to 224 in 9ac3e8c

# collapse and write the remaining fragments
complete = findCompleteFragments(
fragments=fragment_dict,
max_dist=max_distance,
current_position=i.reference_start,
max_collapse_dist=-max_distance, # make sure we get them all
)

The fragment_dict looks good per chromosome but somehow doesn't make it to complete. Not entirely sure of the logic but can look into it if you'd like. Thanks again.

collapsing reads

Hi Tim,

I am trying to figure out how to merge BED files resulting from multiple sinto runs, so that I can run it on a split BAM file and parallelize it. For this, I am trying to understand how sinto collapses reads. In the user guide, you say "Within a cell barcode, collapse fragments that share a start or end coordinate on the same chromosome." Could you explain why this is done? I would have thought that, within a cell barcode, one would want to collapse only those fragments that have the same start AND end positions. Why do you say "start or end"?
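
To make the difference between the two rules concrete, here is a small illustration. It is not sinto's implementation: it is a greedy sketch that assumes fragments are tuples of (chrom, start, end, barcode) and that the most frequently observed fragment is the one kept when several share a coordinate. Both function names are hypothetical.

```python
from collections import Counter


def collapse_exact(fragments):
    """Keep one fragment per identical (chrom, start, end, barcode)."""
    return sorted(set(fragments))


def collapse_start_or_end(fragments):
    """Within a barcode, drop any fragment that shares a start or an end
    coordinate with a fragment that has already been kept."""
    counts = Counter(fragments)
    # Most frequent first; coordinates break ties deterministically.
    ordered = sorted(counts, key=lambda f: (-counts[f], f))
    kept, seen_start, seen_end = [], set(), set()
    for chrom, start, end, barcode in ordered:
        start_key = (chrom, barcode, start)
        end_key = (chrom, barcode, end)
        if start_key in seen_start or end_key in seen_end:
            continue  # shares a coordinate with a fragment already kept
        seen_start.add(start_key)
        seen_end.add(end_key)
        kept.append((chrom, start, end, barcode))
    return sorted(kept)


frags = [
    ("chr1", 100, 300, "AAAC"),
    ("chr1", 100, 300, "AAAC"),  # exact duplicate
    ("chr1", 100, 310, "AAAC"),  # same start, different end
    ("chr1", 100, 310, "TTTG"),  # same coordinates, different barcode
]

print(collapse_exact(frags))
print(collapse_start_or_end(frags))
```

Under the exact rule three fragments survive; under the "start or end" rule the AAAC fragment ending at 310 is also treated as a duplicate, so two survive. This matters when merging per-chunk outputs, because fragments that share only one coordinate can end up in different chunks.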

Why collapsing reads across cell barcode?

Hi developers,

According to the documentation, when extracting fragments from BAM files sinto collapses fragments that share exactly the same chromosome, start and end coordinates across all cell barcodes. Can you please explain the reasoning? Is there a reason why different cells can't harbor exactly the same fragment?

error reading barcodes file with `read_cell_barcode_file`

Hi Tim, @timoast

Thank you for working on this useful tool.
I'm trying to make sure that my barcode file is formatted properly and is able to be read in.
Is it necessary to have a groups column in the cells file?
I am currently using a text file with only one barcode per line.

mramos@super ~/data/10x/pbmc_3k $ cat test_atac_barcodes.txt | head -3
TTAGTCAGTCCTCCCA-1
CATATAGAGTCAAGAC-1
GATATGATCTAATCCG-1

I get a "list index out of range" error.

sinto_dir = os.path.expanduser("~/gh/sinto/sinto")
os.chdir(sinto_dir)
utilspy = os.path.join(sinto_dir, "utils.py")
exec(open(utilspy).read())
read_cell_barcode_file(os.path.expanduser("~/data/10x/pbmc_3k/test_atac_barcodes.txt"))
>>> read_cell_barcode_file(os.path.expanduser("~/data/10x/pbmc_3k/test_atac_barcodes.txt"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 204, in read_cell_barcode_file
IndexError: list index out of range

The documentation reads:

File or comma-separated list of cell barcodes. Can be gzip compressed

Thank you for your help!

Best,
Marcel
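
The "list index out of range" error is what indexing a second field on a one-column line would produce, so the file above probably needs a group column. A reader that tolerates both layouts could look like the sketch below. It is a hypothetical stand-in, not sinto's `read_cell_barcode_file`; the default group name and the comma-separated group convention are assumptions.

```python
import gzip
import os
import tempfile


def read_barcodes(path, default_group="all"):
    """Map each barcode to a list of groups; one-column lines get default_group."""
    opener = gzip.open if path.endswith(".gz") else open
    groups = {}
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            barcode = fields[0]
            # A comma-separated second column is an assumed convention here.
            labels = fields[1].split(",") if len(fields) > 1 else [default_group]
            groups[barcode] = labels
    return groups


# Write a small mixed file: one line without a group, one line with a group.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("TTAGTCAGTCCTCCCA-1\nCATATAGAGTCAAGAC-1\tB\n")
result = read_barcodes(tmp.name)
os.remove(tmp.name)

print(result)
```

A simpler workaround on the input side is to add a second, tab-separated column with the same label on every line.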

How to deal with the same barcodes when merging the bam files from multiple samples?

Hi Tim,
Thanks for developing this tool! It's very handy. Just one question: if I want to merge the BAM files from multiple samples after running sinto filterbarcodes, how can I avoid collisions between identical cell barcodes that come from different BAM files? Can I add a prefix (e.g. sample ID) to the cell barcode when I run sinto filterbarcodes?
Many thanks!
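
Whether filterbarcodes itself can add a prefix is not covered here. As a workaround, the barcode tag can be rewritten per sample before merging. The sketch below operates on SAM text lines, so it assumes the BAM is streamed as text (for example with `samtools view -h`) and converted back afterwards. The function name and the `sample_barcode` naming scheme are hypothetical.

```python
def prefix_barcode(sam_line, sample_id, tag="CB"):
    """Prepend sample_id to the value of a string-typed tag in one SAM line."""
    if sam_line.startswith("@"):
        return sam_line  # header lines pass through unchanged
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    marker = f"{tag}:Z:"
    # Optional fields start at column 12 (index 11).
    for i in range(11, len(fields)):
        if fields[i].startswith(marker):
            fields[i] = f"{marker}{sample_id}_{fields[i][len(marker):]}"
    return "\t".join(fields) + newline


line = "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\tCB:Z:AAACGAAAGAACGACC-1"
print(prefix_barcode(line, "sampleA"))
```

The same prefix would also have to be applied to the barcodes in the cells file so that the two stay in sync.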

sinto filterbarcodes error - OSError: truncated file

Hey there,

I am running sinto filterbarcodes to split a BAM file.
Cmd: sinto filterbarcodes -b $.bam -c $cells -p 16
The error message shown as below:

[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File ".../lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File ".../lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File ".../lib/python3.9/site-packages/sinto/filterbarcodes.py", line 21, in _iterate_reads
for r in inputBam.fetch(i[0], i[1], i[2]):
File "pysam/libcalignmentfile.pyx", line 2086, in pysam.libcalignmentfile.IteratorRowRegion.next
OSError: truncated file
"""

Thank you very much for the help!
Bests,
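
An `OSError: truncated file` from pysam may indicate that the input BAM is incomplete, for example after an interrupted copy or write. One quick check, sketched below under that assumption, is whether the file ends with the 28-byte empty BGZF block that terminates a complete BAM. The function name is hypothetical, and a file can pass this check and still be damaged elsewhere.

```python
import os
import tempfile

# The 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


# Demonstration on two throwaway files, one with and one without the marker.
with tempfile.NamedTemporaryFile(delete=False) as good:
    good.write(b"payload" + BGZF_EOF)
with tempfile.NamedTemporaryFile(delete=False) as bad:
    bad.write(b"payload")

good_ok = has_bgzf_eof(good.name)
bad_ok = has_bgzf_eof(bad.name)
os.remove(good.name)
os.remove(bad.name)

print(good_ok, bad_ok)
```

If the marker is missing, regenerating or re-copying the BAM and rebuilding its index would be the first thing to try.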
