morispi / lrez

Standalone tool and library allowing to work with barcoded linked-reads

License: GNU Affero General Public License v3.0

Makefile 0.96% Shell 0.02% C++ 92.78% C 5.59% Python 0.65%

barcode barcodes linked-reads linked reads 10x 10xgenomics 10x-genomics index haplotagging

lrez's Introduction

LRez

LRez provides a standalone tool for working with barcoded linked-reads, such as 10X Genomics data, as well as a library that makes it easy to use in other projects.

Presently, it is directly compatible with the following linked-reads technologies, provided that the barcodes are reported using the BX:Z tag (if this is not the case, pre-processing scripts are provided in the utils/ directory):

10x Genomics
Haplotagging
stLFR
TELL-Seq
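
To make the BX:Z requirement concrete, here is a minimal Python sketch that pulls the barcode out of a single SAM alignment line. The helper name get_barcode and the example line are hypothetical and are not part of LRez; the sketch only assumes the standard SAM layout of 11 mandatory tab-separated columns followed by optional tags.

```python
from typing import Optional


def get_barcode(sam_line: str) -> Optional[str]:
    """Return the value of the BX:Z tag of a SAM alignment line, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional tags come after them.
    for field in fields[11:]:
        if field.startswith("BX:Z:"):
            return field[len("BX:Z:"):]
    return None


# Hypothetical alignment line, for illustration only.
line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:0\tBX:Z:AGAATGGTCTGCATCG-1"
print(get_barcode(line))  # AGAATGGTCTGCATCG-1
```

A line without a BX:Z tag yields None, which is a quick way to spot input that would need the pre-processing scripts first.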

LRez offers several functionalities: comparing pairs of regions or contigs' extremities to retrieve their common barcodes, extracting barcodes from given regions of a BAM file, and indexing and querying both BAM and FASTQ files to quickly retrieve reads or alignments sharing a given barcode or list of barcodes. It can thus be used in different applications, such as variant calling or scaffolding.

Requirements

A Unix-based operating system.
g++, minimum version 5.5.0.
CMake, minimum version 2.8.2.
zlib, minimum version 1.2.11.

Installation from source

Clone the LRez repository, along with its submodules, using:

git clone --recursive https://github.com/morispi/LRez

Then run the install.sh script:

./install.sh

The installation script builds the dependencies, the standalone binary (in the bin folder), and the library (in the lib folder), which allows LRez to be used in other projects.

Installation from conda

Alternatively, LRez is also distributed as a bioconda package, which can be installed with:

conda install -c bioconda lrez

Using the toolkit

Usage

LRez [SUBCOMMAND]

where [SUBCOMMAND] can be one of the following:

compare: Compute the number of common barcodes between pairs of regions, or between pairs of contigs' extremities
extract: Extract the barcodes from a given region of a BAM file
stats: Retrieve general stats from a BAM file
index bam: Index the offsets or occurrence positions of the barcodes contained in a BAM file
query bam: Query the barcodes index to retrieve alignments in a BAM file, given a barcode or list of barcodes
index fastq: Index the offsets of the barcodes contained in a fastq file
query fastq: Query the barcodes index to retrieve reads from a fastq file, given a barcode or list of barcodes

Subcommands

A description of each subcommand as well as its options is given below.

Compare

LRez compare computes the number of common barcodes between all possible pairs in a given list of regions, or between a given contig's extremities and all other contigs' extremities.

  --bam STRING, -b STRING:      BAM file containing the alignments
  --index STRING, -i STRING:    Barcodes offsets index built with the index bam subcommand
  --region STRING, -r STRING:   File containing regions of interest in format chromosome:startPosition-endPosition
  --contig STRING, -c STRING:   Contig of interest
  --contigs STRING, -c STRING:  File containing a list of contigs of interest
  --size INT, -s INT:           Size of contigs' extremities to consider (optional, default: 1000) 
  --output STRING, -o STRING:   File where to output the results (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
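
As an illustration of what the pairwise comparison computes, the Python sketch below counts shared barcodes for every pair of regions. It is not LRez's implementation: the function name common_barcode_counts and the example regions and barcodes are made up, and the barcode sets are assumed to have been collected beforehand.

```python
from itertools import combinations


def common_barcode_counts(barcodes_by_region):
    """Count the barcodes shared by every pair of regions.

    barcodes_by_region maps a region name to the set of barcodes seen in it.
    """
    counts = {}
    for first, second in combinations(sorted(barcodes_by_region), 2):
        shared = barcodes_by_region[first] & barcodes_by_region[second]
        counts[(first, second)] = len(shared)
    return counts


# Hypothetical regions and barcodes, for illustration only.
example = {
    "chr1:1-1000": {"AAAA", "CCCC", "GGGG"},
    "chr1:5000-6000": {"CCCC", "GGGG"},
    "chr2:1-1000": {"GGGG", "TTTT"},
}
for pair, count in common_barcode_counts(example).items():
    print(pair, count)
```

Using sets keeps each pairwise comparison to a single intersection, at the cost of holding every region's barcodes in memory.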

Extract

LRez extract extracts the list of barcodes found in a given region of a BAM file.

  --bam STRING, -b STRING:      BAM file to extract barcodes from
  --region STRING, -r STRING:   Region of interest in format chromosome:startPosition-endPosition
  --all, -a:                    Extract all barcodes
  --output STRING, -o STRING:   File where to output the extracted barcodes (optional, default: stdout)
  --duplicates, -d:             Include duplicate barcodes (optional, default: false)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
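
The region format chromosome:startPosition-endPosition can be validated before calling the tool. The Python sketch below is one way to do so; the Region type and parse_region helper are hypothetical names, and the choice to reject a start greater than the end is an assumption rather than documented LRez behaviour.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse a 'chromosome:startPosition-endPosition' string."""
    # Split on the last ':' so that chromosome names containing ':' still work.
    chromosome, _, span = text.strip().rpartition(":")
    start_text, _, end_text = span.partition("-")
    if not chromosome or not start_text or not end_text:
        raise ValueError(f"expected chromosome:start-end, got {text!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return Region(chromosome, start, end)


print(parse_region("2L:4831-5088"))
```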

Stats

LRez stats retrieves general stats from a BAM file.

  --bam STRING, -b STRING:      BAM file to extract barcodes from
  --regions INT, -r INT:        Number of regions to consider to define stats (optional, default: 1000)
  --size INT, -s INT:           Size of the regions to consider (optional, default: 1000)
  --output STRING, -o STRING:   File where to output the extracted barcodes (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)

Index BAM

LRez index bam indexes the offsets or occurrence positions of the barcodes contained in a BAM file.

  --bam STRING, -b STRING:      BAM file to index
  --output STRING, -o STRING:   File where to store the index
  --offsets, -f:                Index the offsets of the barcodes in the BAM file
  --positions, -p:              Index the (chromosome, begPosition) occurrence positions of the barcodes
  --primary, -r:                Only index barcodes that appear in a primary alignment (optional, default: false)
  --quality INT, -q INT:        Only index barcodes that appear in an alignment of quality higher than this number (optional, default: 0)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)

Query BAM

LRez query bam queries a barcodes index and a BAM file to retrieve the alignments containing the query barcodes.

  --bam STRING, -b STRING:      BAM file to search
  --index STRING, -i STRING:    Barcodes offsets index, built with the index bam subcommand, using the -f option.
  --query STRING, -q STRING:    Query barcode to search in the BAM / index
  --list STRING, -l STRING:     File containing a list of barcodes to search in the BAM / index
  --output STRING, -o STRING:   File where to output the extracted alignments (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)

Index fastq

LRez index fastq indexes the offsets of the barcodes contained in a fastq file.

  --fastq STRING, -f STRING:    Fastq file to index
  --output STRING, -o STRING:   File where to store the index
  --gzip, -g:                   Fastq file is gzipped (optional, default: false)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
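
To illustrate the idea of an offsets index, the Python sketch below maps each barcode to the byte offsets of the records that carry it. It is a toy model and not LRez's index format. It assumes uncompressed 4-line FASTQ records whose header contains a whitespace-separated BX:Z:<barcode> token; the function name index_fastq_barcodes and the sample records are hypothetical.

```python
import io
from collections import defaultdict


def index_fastq_barcodes(handle):
    """Map each barcode to the byte offsets of the records carrying it."""
    index = defaultdict(list)
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break
        # Skip the sequence, '+' and quality lines of the 4-line record.
        for _ in range(3):
            handle.readline()
        for token in header.split():
            if token.startswith(b"BX:Z:"):
                index[token[5:].decode()].append(offset)
    return dict(index)


# Hypothetical in-memory FASTQ content, for illustration only.
data = (
    b"@r1 BX:Z:AAAA\nACGT\n+\nFFFF\n"
    b"@r2 BX:Z:CCCC\nACGT\n+\nFFFF\n"
    b"@r3 BX:Z:AAAA\nACGT\n+\nFFFF\n"
)
handle = io.BytesIO(data)
index = index_fastq_barcodes(handle)
print(index)  # {'AAAA': [0, 52], 'CCCC': [26]}

# Querying then amounts to seeking straight to a stored offset.
handle.seek(index["AAAA"][1])
print(handle.readline())  # b'@r3 BX:Z:AAAA\n'
```

Storing offsets rather than the reads themselves keeps the index small and lets a query jump directly to the matching records instead of scanning the whole file.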

Query fastq

LRez query fastq queries a barcodes index and a fastq file to retrieve the reads containing the query barcodes.

  --fastq STRING, -f STRING:                Fastq file to search
  --index STRING, -i STRING:                Barcodes index, built with the index fastq subcommand
  --query STRING, -q STRING:                Query barcode to search in the fastq file and the index
  --list STRING, -l STRING:                 File containing a list of barcodes to search in the fastq file and the index
  --collectionOfLists STRING, -c STRING:    File of files (FOF), i.e. a file containing the names of files that each hold a list of barcodes to search in the fastq file and the index
  --output STRING, -o STRING:               File where to output the extracted reads (optional, default: stdout)
  --gzip, -g:                               Fastq file is gzipped (optional, default: false)
  --threads INT, -t INT:                    Number of threads to use (optional, default: 1)
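
A file of files can be checked before it is passed to the tool. The Python sketch below loads every barcode list named in such a file. The layout it assumes (one file name per line in the FOF, one barcode per line in each list, paths resolved relative to the current directory) is an assumption and not taken from the LRez documentation, and read_barcode_lists is a hypothetical helper.

```python
from pathlib import Path


def read_barcode_lists(fof_path):
    """Return {list file name: [barcodes]} for every file named in the FOF."""
    lists = {}
    for name in Path(fof_path).read_text().splitlines():
        name = name.strip()
        if not name:
            continue  # ignore empty lines in the FOF
        lines = Path(name).read_text().splitlines()
        lists[name] = [line.strip() for line in lines if line.strip()]
    return lists
```

A missing list file raises FileNotFoundError here, which surfaces a broken FOF before a long query is started.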

Using the API

Complete documentation of the different API functions is provided at https://morispi.github.io/LRez/files.html. Additional information and usage examples are provided on the Wiki page https://github.com/morispi/LRez/wiki.

Notes

LRez has been developed and tested on x86-64 GNU/Linux.
Support for any other platform has not been tested.

Authors

Pierre Morisse, Claire Lemaitre and Fabrice Legeai.

Reference

Pierre Morisse, Claire Lemaitre, Fabrice Legeai. LRez: C++ API and toolkit for analyzing and managing Linked-Reads data. Bioinformatics Advances, vbab022, https://doi.org/10.1093/bioadv/vbab022

Contact

You can report problems and bugs as issues on this repository: https://github.com/morispi/LRez/issues

lrez's Issues

"LRez index fastq": unable to index a FASTQ file not gzipped

I am trying to index a non-gzipped FASTQ file with the command LRez index fastq, but it returns an error.
I have tried with two different FASTQ files, and it returns an error for both.

FASTQ file 1:
LRez index fastq --fastq NA24385_phased_possorted.fastq --output NA24385_phased.shelve
Error returned:
"gzIndex: Untable to open gzip index for file NA24385_phased_possorted.fastqi for reading. Please make sure the gzip index file exists.: iostream error"

FASTQ file 2:
LRez index fastq --fastq stLFR_NA24385.sort.rmdup_barcodes_extracted.fastq.gz --output stLFR_NA24385.shelve
Error returned:
"gzIndex: could not open stLFR_NA24385.sort.rmdup_barcodes_extracted.fastq.gz for reading. Please make sure the file exists.: iostream error"

"stoi" error when indexing bam positions

This Leviathan issue is in fact an LRez issue.

Unrecognized sequencing technology

Hello,

I'm trying to use LEVIATHAN and require the barcode indices from LRez. I introduced a step in my workflow that uses samtools view to filter reads on mapping quality, and it seems that doing so has created issues, with LRez no longer recognizing the BX:Z: tags (where it did previously). These are haplotagging data, where the index is AXXCXXBXXDXX.

$ LRez index bam -p -b 2A_3_221221_15x.bam -o test.bci
determineSequencingTechnology: Unrecognized sequencing technology. Please make sure your barcodes originate from a compatible technology or are reported as nucleotides in the BX:Z tag.

Unless I'm mistaken, my bam files are formatted normally:

$ samtools view -h 2A_3_221221_15x.bam | head -18
@HD     VN:1.6  SO:coordinate
@SQ     SN:2L   LN:23513712
@SQ     SN:2R   LN:25286936
@SQ     SN:3L   LN:28110227
@SQ     SN:3R   LN:32079331
@SQ     SN:4    LN:1348131
@SQ     SN:X    LN:23542271
@SQ     SN:Y    LN:3667352
@RG     ID:2A_3_221221_15x      SM:2A_3_221221_15x
@PG     ID:bwa  PN:bwa  CL:bwa mem -C -t 6 -M -R @RG\tID:2A_3_221221_15x\tSM:2A_3_221221_15x Assembly/Assembly/dmel.trunc.fa Trimming/2A_3_221221_15x.R1.fq.gz Trimming/2A_3_221221_15x.R2.fq.gz  VN:0.7.17-r1188
@PG     ID:samtools     PN:samtools     CL:samtools view -h -F 4 -q 30 -t Assembly/Assembly/dmel.trunc.fa.fai -T Assembly/Assembly/dmel.trunc.fa -   PP:bwa  VN:1.17
@PG     ID:samtools.1   PN:samtools     CL:samtools sort -T Alignments/bwa/2A_3_221221_15x --reference Assembly/Assembly/dmel.trunc.fa -O bam -m 4G -o Alignments/bwa/2A_3_221221_15x.sort.bam -  PP:samtools     VN:1.17
@PG     ID:sambamba     CL:markdup -t 4 -l 4 Alignments/bwa/2A_3_221221_15x.sort.bam Alignments/bwa/2A_3_221221_15x.bam      PP:samtools.1   VN:1.0
@PG     ID:samtools.2   PN:samtools     PP:sambamba     VN:1.17 CL:samtools view -h 2A_3_221221_15x.bam
A00470:481:HNYFWDRX2:1:2177:6090:16266  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FF,FFFF::FFFFF:FFFF:FFF:FFFFFF:FFFFFFFFFFF:FFF:FFFFF,FFFF,F,FFFFFF:FFFFFFFFFFFFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2207:9426:24267  1123    2L      4831    40      80M     =       5112407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FF:F,F,FFF,FFFF,FFFFF::FFFFFFF,FFFF:FFFFFFF:FFFFF:FFF:F:FF,FFF,FF:F,FF,:FFFF,:,F NM:i:0  MD:Z:80      MC:Z:22S126M    AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2254:21902:6699  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FFF,F:FFF,FFFFF,FFFFFF:FFFFFFF:FFFFFFFFFF:FFFFFF:FFFFFFFF,FFFF,F::,FFFFF,:FF,FFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2273:24334:6433  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FFFFFF,F:FFFFFFF:FF:F:F,FFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFF:FF,FFFF:FFFFFFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT

Do you have insights to provide on this that may reveal a mistake on my end or a bug in LRez?
While the workflow is listed in the @PG tags, the steps are:

map with bwa mem
filter with samtools view
sort and convert to bam with samtools sort
mark duplicates with sambamba markdup
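
One check that can be run on such data is whether each BX:Z value strictly matches the AXXCXXBXXDXX haplotagging pattern. The Python sketch below does this, assuming each XX is two decimal digits as in the records above; is_haplotag is a hypothetical helper and this is not how LRez detects the technology. In the pasted records the barcode is followed by the text 1:N:0:TATCAGTA+TTACTACT. If that text is part of the tag value rather than a separate column, a strict match fails, which is one possible (unconfirmed) explanation for the error.

```python
import re

# A, C, B and D segments, each followed by two digits (assumed format).
HAPLOTAG = re.compile(r"A\d{2}C\d{2}B\d{2}D\d{2}")


def is_haplotag(value: str) -> bool:
    """Return True if the whole value is a single haplotagging barcode."""
    return HAPLOTAG.fullmatch(value) is not None


print(is_haplotag("A95C26B84D96"))  # True
print(is_haplotag("A95C26B84D96 1:N:0:TATCAGTA+TTACTACT"))  # False
```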

LRez does not handle Haplotagging and stLFR barcodes ending with "-1"

LRez assumes that a "-1" ending on the barcode in the BX tag of the BAM file is specific to 10X Genomics linked-reads data.
However, it seems that some mappers (such as EMA) add "-1" at the end of barcodes whatever the linked-read technology.
Be careful: this does not raise any error, but results in all barcodes being encoded in the same key value (so no indexing).

Could we handle "-1" endings better in LRez?

Claire

bam query slow

I have indexed a fairly large barcoded BAM file (~220 Gb) using LRez, and now I want to query it for barcodes. I have several lists of barcodes with about 2000 entries each. Unfortunately, it is very slow. If I read the paper correctly, queries of about 1000 barcodes took at most 10 min. For me, it has been running for almost 3 hours on lists of 2000 barcodes without finishing.

Commands

# Index
LRez index bam -b file.bam -o file.bam.bci -f -t 10

# Query
LRez query bam -b file.bam -i file.bam.bci -l list.bxu -o list.bam -t 10 -H

Below is the memory/CPU usage. I am running two query commands in parallel with 10 threads each.

I guess the initial sharp memory incline comes from loading the index (about 55 Gb on disk); this seems to take about 10 min or so. Then it is presumably doing index lookups for the list of barcodes, which is taking much longer than I would expect. Any idea why this is so slow?

As a side-note it seems that core utilisation is quite poor with only about 1 core per process being used.

query bam with option `-H` generates malformatted SAM

I am using LRez query bam with the option -H to include a header in the output. However, it adds a blank line in between the header and the alignments, causing the SAM to be malformed. See example below:

...
@PG	ID:samtools.3	PN:samtools	CL:samtools cat -o final.bam chunks/chrA.calling.bam unmapped.bam	PP:samtools.2	VN:1.12

A00621:130:HN5HWDSXX:4:1368:15555:22889	163	chrA	26885	60	137M	=	26959	137	CAAGTGATCTGCCCACCTCGGCCTCCCAAAGTGCTGGGATTACACGTGTGAACCACCATGCCTGGTCTCTAATTTTTCTGATTCTATAAAATTACATTCTATTTGCTGAAAGAGTACTTTAGAGTTGAAGAAAAAGA	FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	XF:i:0	PG:Z:MarkDuplicates	RG:Z:1	XG:f:1	NM:i:0	BX:Z:CTTGGTCATTCATACAGTCC-1	MI:i:198
...
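
Until this is fixed, a possible workaround is to drop empty lines from the output, since a valid SAM file contains none. The Python sketch below shows the filtering step on a few hypothetical lines; drop_blank_lines is not part of LRez, and in practice the same generator could be applied to the lines of the output file.

```python
def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if line.strip():
            yield line


# Hypothetical SAM content with a stray blank line after the header.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "\n",
    "read1\t0\tchrA\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n",
]
print("".join(drop_blank_lines(sam_lines)), end="")
```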

Include barcode integer suffix in index.

Relates to #6.

As noted in the longranger docs (below), the suffix number can be any integer, not just "-1", as it is meant to allow for merging of different 10X libraries into the same BAM.

The BX tag includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCATCG-1
This number denotes what we call a GEM group, and is used to virtualize barcodes in order to achieve a higher effective barcode diversity when combining samples generated from separate GEM chip channel runs. Normally, this number will be "1" across all barcodes when analyzing a sample generated from a single GEM chip channel. It can either be left in place and treated as part of a unique barcode identifier, or explicitly parsed out to leave only the barcode sequence itself.

I ran into this issue when trying to run LRez index bam on a BAM with multiple libraries, which resulted in the following error:

determineSequencingTechnology: Unrecognized sequencing technology. Please make sure your barcodes originate from a compatible technology or are reported as nucleotides in the BX:Z tag.

From what I can understand from the code, this suffix is currently not included in the index. For LRez to work with BAMs that contain multiple libraries, this would need to be fixed.
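
The two treatments described in the quoted documentation (keeping the suffix as part of the identifier, or parsing it out) can be sketched in Python as follows. This illustrates the parsing only and is not a patch for LRez; split_gem_group is a hypothetical helper, and the second barcode with suffix 2 is made up to show a multi-library case.

```python
def split_gem_group(barcode: str):
    """Split 'SEQUENCE-N' into (sequence, N); N is None when there is no suffix."""
    sequence, dash, suffix = barcode.rpartition("-")
    if dash and suffix.isdecimal():
        return sequence, int(suffix)
    return barcode, None


print(split_gem_group("AGAATGGTCTGCATCG-1"))  # ('AGAATGGTCTGCATCG', 1)
print(split_gem_group("AGAATGGTCTGCATCG-2"))  # ('AGAATGGTCTGCATCG', 2)
print(split_gem_group("A95C26B84D96"))        # ('A95C26B84D96', None)
```

Keeping the pair (sequence, GEM group) as the index key would distinguish identical sequences coming from different libraries, whereas keeping only the sequence would merge them.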