
pyfastx's Introduction

pyfastx


Citation: Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Briefings in Bioinformatics, 2021, 22(4):bbaa368.


Introduction

pyfastx is a lightweight Python C extension that enables users to randomly access sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming an excessive amount of memory, pyfastx builds an index and stores it in a SQLite3 database file, which is then used for random access. In addition, pyfastx can parse both standard FASTA (sequence wrapped over multiple lines of the same length) and nonstandard FASTA (sequence spread over one or more lines of differing lengths). The module uses kseq.h, written by @attractivechaos in the klib project, to parse plain FASTA/Q files, and zran.c, written by @pauldmccarthy in the indexed_gzip project, to index gzipped files for random access.

This project was heavily inspired by @mdshw5's project pyfaidx and @brentp's project pyfasta.

Features

  • Single file for the Python extension
  • Lightweight and memory-efficient parsing of FASTA/Q files
  • Fast random access to sequences in gzipped FASTA/Q files
  • Read sequences from FASTA files line by line
  • Calculate N50 and L50 of sequences in a FASTA file
  • Calculate GC content and nucleotide composition
  • Extract reverse, complement and antisense sequences
  • Excellent compatibility, including support for parsing nonstandard FASTA files
  • Support for FASTQ quality score conversion
  • Command line interface for splitting FASTA/Q files

Installation

Currently, pyfastx supports Python 3.6, 3.7, 3.8, 3.9, 3.10, 3.11. Make sure you have installed both pip and Python before starting.

You can install pyfastx via the Python Package Index (PyPI)

pip install pyfastx

Update pyfastx module

pip install -U pyfastx

FASTX

New in pyfastx 0.8.0.

pyfastx provides a simple and fast Python binding for kseq.h to iterate over the sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input format (FASTA or FASTQ) and returns the corresponding tuple.

FASTA sequences iteration

When iterating over a FASTX object built from a FASTA file, a (name, seq) tuple is returned.

If you want the sequence comment, set comment to True (new in pyfastx 0.9.0).

The comment is the content of the header line after the first whitespace or tab character.
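A minimal sketch (the file name test.fa is only a placeholder):

>>> import pyfastx
>>> for name, seq in pyfastx.Fastx('test.fa'):
...     print(name)
...     print(seq)

>>> # with comment=True, a (name, seq, comment) tuple is returned
>>> for name, seq, comment in pyfastx.Fastx('test.fa', comment=True):
...     print(comment)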

FASTQ reads iteration

When iterating over a FASTX object built from a FASTQ file, a (name, seq, qual) tuple is returned.

If you want the read comment, set comment to True (new in pyfastx 0.9.0).

The comment is the content of the header line after the first whitespace or tab character.
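A minimal sketch (test.fq.gz is a placeholder; the four-element tuple with comment=True is assumed by analogy with the FASTA case):

>>> import pyfastx
>>> for name, seq, qual in pyfastx.Fastx('test.fq.gz'):
...     print(name, qual)

>>> # with comment=True, a (name, seq, qual, comment) tuple is returned
>>> for name, seq, qual, comment in pyfastx.Fastx('test.fq.gz', comment=True):
...     print(comment)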

FASTA

Read FASTA file

Read a plain or gzipped FASTA file and build an index, which supports random access to the FASTA file.
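For example (test.fa.gz is a placeholder file name):

>>> import pyfastx
>>> # the first call builds an .fxi index file next to the input; later calls reuse it
>>> fa = pyfastx.Fasta('test.fa.gz')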

Note

Building the index may take some time, depending on the size of the FASTA file. Once the index is built, you can randomly access any sequence in the FASTA file. The index file is reused on subsequent reads to save time.

FASTA records iteration

The fastest way to iterate over a plain or gzipped FASTA file is without building an index; the iteration returns a tuple containing the name and sequence.
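A sketch of this usage (the file name is a placeholder):

>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test.fa.gz', build_index=False):
...     print(name)
...     print(seq)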

You can also iterate over sequence objects from a FASTA object like this:
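(a sketch; the seq.name and seq.seq attributes also appear in the issue reports later on this page)

>>> fa = pyfastx.Fasta('test.fa')
>>> for seq in fa:
...     print(seq.name)
...     print(seq.seq)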

Iteration with build_index=True (the default) returns sequence objects, which allow you to access the attributes of each sequence. New in pyfastx 0.6.3.

Get FASTA information
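For example (these property names follow my recollection of the pyfastx API and should be verified against your installed version):

>>> fa = pyfastx.Fasta('test.fa')
>>> len(fa)          # number of sequences
>>> fa.size          # total length of all sequences
>>> fa.gc_content    # GC content (%)
>>> fa.composition   # dict of nucleotide counts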

Get longest and shortest sequence

New in pyfastx 0.3.0
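A sketch (the longest and shortest property names are assumptions to be checked against the documentation):

>>> fa.longest    # sequence object of the longest sequence
>>> fa.shortest   # sequence object of the shortest sequence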

Calculate N50 and L50

New in pyfastx 0.3.0

Calculate the assembly N50 and L50; returns (N50, L50). Learn more about N50 and L50.
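A sketch (the nl() method name is my recollection of the API, not confirmed on this page):

>>> fa.nl(50)   # returns (N50, L50)
>>> fa.nl(90)   # returns (N90, L90)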

Get sequence mean and median length

New in pyfastx 0.3.0
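A sketch (property names assumed):

>>> fa.mean     # mean length of sequences
>>> fa.median   # median length of sequences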

Get sequence counts

New in pyfastx 0.3.0

Get the count of sequences whose length is greater than or equal to a specified length.
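A sketch (the count() method name is assumed):

>>> len(fa)          # total number of sequences
>>> fa.count(1000)   # number of sequences with length >= 1000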

Get subsequences

Subsequences can be retrieved from a FASTA file by using a list of [start, end] coordinates.
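A sketch based on the fetch() calls shown in the issue reports below; the coordinates here are assumed to be 1-based:

>>> # single interval
>>> fa.fetch('JZ822577.1', (1, 10))
>>> # multiple intervals are joined into one subsequence
>>> fa.fetch('JZ822577.1', [(1, 10), (50, 60)])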

Key function

New in pyfastx 0.5.1

Sometimes your FASTA headers are long and contain multiple identifiers and a description, for example ">JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence". In this case, both "JZ822577.1" and "contig1" can be used as the identifier. You can specify a key function to select one as the identifier.
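A sketch (the key_func parameter name is an assumption from memory; check the documentation for your version):

>>> # use the second field of the header ("contig1") as the identifier
>>> fa = pyfastx.Fasta('test.fa', key_func=lambda x: x.split()[1])
>>> fa['contig1']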

Sequence

Get a sequence from FASTA
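For example, assuming the fa object built above:

>>> s = fa[0]              # get a sequence by index number
>>> s = fa['JZ822577.1']   # get a sequence by identifier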

Get sequence information
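A sketch of the commonly used attributes (the names mirror the read attributes documented below and may differ slightly between versions):

>>> s.name          # identifier
>>> s.description   # full header line
>>> s.seq           # sequence as a string
>>> len(s)          # sequence length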

Sequence slice

A sequence object can be sliced like a Python string.
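For example (based on the slicing calls shown in the issue reports below):

>>> s = fa['JZ822577.1']
>>> ss = s[10:50]   # 0-based start and end, like a Python string
>>> str(ss)         # the sliced sequence as a string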

Note

Slicing start and end coordinates are 0-based. Currently, pyfastx does not support the optional third step or stride argument (for example, ss[::-1]).

Reverse and complement sequence
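A sketch (the attribute names follow the feature list above; treat them as assumptions):

>>> s = fa[0]
>>> s.reverse      # reversed sequence
>>> s.complement   # complement sequence
>>> s.antisense    # reverse complement (antisense) sequence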

Read sequence line by line

New in pyfastx 0.3.0

A sequence object can be iterated line by line as the lines appear in the FASTA file.
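A minimal sketch:

>>> for line in fa[0]:
...     print(line)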

Note

Sliced sequence (e.g. fa[0][10:50]) cannot be read line by line

Search for subsequence

New in pyfastx 0.3.6

Search for a subsequence within a given sequence and get the one-based start position of the first occurrence.
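A sketch (the optional strand argument is an assumption from memory):

>>> s = fa['JZ822577.1']
>>> s.search('AGCT')        # 1-based start position of the first occurrence
>>> s.search('AGCT', '-')   # search on the minus strand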

FastaKeys

New in pyfastx 0.8.0. The Identifier object has been renamed to FastaKeys.

Get keys

Get all sequence names as a list-like object.
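A sketch:

>>> ids = fa.keys()
>>> len(ids)              # number of keys
>>> ids[0]                # first sequence name
>>> 'JZ822577.1' in ids   # membership test
>>> list(ids)             # materialize all names as a list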

Sort keys

Sort keys by sequence id, name, or length for iteration

New in pyfastx 0.5.0
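A sketch (the exact sort() argument names are assumptions; check the documentation for your version):

>>> ids = fa.keys()
>>> for name in ids.sort('length', reverse=True):
...     print(name)
>>> ids.sort('name')   # or sort by 'id'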

Filter keys

Filter keys by sequence length and name

New in pyfastx 0.5.10
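A sketch based on the filter() call shown in one of the issue reports below; filtering by name pattern is also mentioned above but is not shown here:

>>> ids = fa.keys()
>>> ids.filter(ids >= 1000)   # keep only sequences with length >= 1000
>>> for name in ids:
...     print(name)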

FASTQ

New in pyfastx 0.4.0

Read FASTQ file

Read a plain or gzipped FASTQ file and build an index, which supports random access to reads.
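For example (test.fq.gz is a placeholder file name):

>>> import pyfastx
>>> fq = pyfastx.Fastq('test.fq.gz')   # builds the index on first use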

FASTQ records iteration

The fastest way to parse a plain or gzipped FASTQ file is without building an index; the iteration returns a tuple containing the read name, sequence, and quality.
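A sketch of this usage:

>>> for name, seq, qual in pyfastx.Fastq('test.fq.gz', build_index=False):
...     print(name)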

You can also iterate over read objects from a FASTQ object like this:
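(a sketch; the read attributes are documented below)

>>> fq = pyfastx.Fastq('test.fq.gz')
>>> for read in fq:
...     print(read.name)
...     print(read.seq)
...     print(read.qual)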

Iteration with build_index=True (the default) returns read objects, which allow you to access the attributes of each read. New in pyfastx 0.6.3.

Get FASTQ information
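A sketch; maxlen, minlen, maxqual, minqual and encoding_type appear in an issue report below, while the other property names are assumptions:

>>> len(fq)            # number of reads
>>> fq.size            # total number of bases
>>> fq.gc_content      # GC content (%)
>>> fq.avglen          # average read length
>>> fq.maxlen          # length of the longest read
>>> fq.minlen          # length of the shortest read
>>> fq.maxqual         # maximum quality score
>>> fq.minqual         # minimum quality score
>>> fq.encoding_type   # guessed quality score encoding(s)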

Read

Get read from FASTQ

Get read information

>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150

>>> # get read order number in FASTQ file
>>> r.id
791

>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'

>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'

>>> # get read length
>>> len(r)
150

>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'

>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:

>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'

>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]

>>> # get read length
>>> len(r)
150

FastqKeys

New in pyfastx 0.8.0.

Get fastq keys

Get all read names as a list-like object.
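A sketch, assuming the same interface as FastaKeys:

>>> ids = fq.keys()
>>> len(ids)   # number of reads
>>> ids[0]     # first read name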

Command line interface

New in pyfastx 0.5.0

Build index

New in pyfastx 0.6.10

Show statistics information

Split FASTA/Q file

Convert FASTQ to FASTA file

Get subsequence with region

Sample sequences

Extract sequences

New in pyfastx 0.6.10
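A hedged sketch of typical invocations, as shown below. Only pyfastx info and pyfastx split appear verbatim in the issue reports later on this page; the available subcommands and options vary between versions, so check pyfastx --help:

$ pyfastx --help                      # list available subcommands
$ pyfastx info input.fa               # show statistics information
$ pyfastx split -n 10 input.fq.gz     # split a FASTA/Q file into 10 files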

Drawbacks

If you intensively check whether sequence names exist in a FASTA file using the in operator on the FASTA object, like this:
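(a sketch; names_to_check is a hypothetical list of query names)

>>> fa = pyfastx.Fasta('test.fa.gz')
>>> for name in names_to_check:   # names_to_check: hypothetical list of names
...     if name in fa:            # each check queries the index database
...         pass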

This will take a long time to finish. Because pyfastx does not load the index into memory, each in check corresponds to an SQL existence query against the index database. A faster alternative is:
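(a sketch of the suggested pattern, again with the hypothetical names_to_check list)

>>> ids = fa.keys()
>>> for name in names_to_check:
...     if name in ids:   # membership test against the keys object
...         pass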

Testing

The pyfaidx module is used to test pyfastx. First, make sure you have pyfastx installed:

pip install pyfastx

To run the tests, you should also install pyfaidx 0.5.8:

pip install pyfaidx==0.5.8

Then, to run the tests:

$ python setup.py test

Acknowledgements

kseq.h and zlib were used to parse the FASTA format. SQLite3 was used to store the built indexes. pyfastx's random access to sequences in gzipped FASTA files is mainly attributable to indexed_gzip.

pyfastx's People

Contributors

cmorganl, corneliusroemer, emollier, khillion, lmdu, telatin, tsibley


pyfastx's Issues

Error indexing FASTQ; are there header length or character limitations?

Hello @lmdu, thanks for your tool. I'm looking for fast random access to both fasta and fastq files, therefore I'm giving pyfastx a try. I'm running into the following issue when indexing a FASTQ file; I'm guessing it's either related to the length or perhaps the brackets in the header?

>>> import pyfastx
>>> fa = pyfastx.Fasta('test.fasta')
>>> fa[1]
<Sequence> MK674167.1 with length of 1328
>>> fa['MK674167.1']
<Sequence> MK674167.1 with length of 1328
>>> fa['MK674167.1'].seq
'TAATAAGTGTTTTATGGCACTTTTTTAAATCCATATCCACCTTGTGTGCAATGTCAGGGTTGGTTTCTCTCTTTTGAGAGATCAACCCCAAACATCAACTCTATCTTAACTCTTTGTCTGAAAAATATTATGAATAAAACAATTCAAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATATTGCGCTCTTTGGTATTCCGAAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAAAGCCTTTTTTCTTTTTTGAAGAAAGGCTTTGGACTTGAGCAATCCCAACACCAATCTTTTAAAGAGAGGGGGCGGGTTGCTTGAAATGCAGGTGCAGCTGGACATTCTCCTGAGCTAAAAGCATATTCATTTAGTCCCGTCAAACGGATTATTACTTTTGCTGCAGCTAACATAAAGGGAGTTTGACCATTTTGGCTGACTGATGCAGGATTTTCACAAGAGTCTTCAAAACCTCTTGTTAAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACAAGGATTCCCCTAGTAACGGCGAGTGAAGCGGGATGAGCTCAAATTTAAAATCTGGCTGGTTTACTAGTCCGAATTGTAGTTTATAGAAGCGTTTTCGGTGGCAGCCTGGGTATAAATCCTTTGGAATAGGGTATCATAGAGGGTGAGAATCCCGTCTTTGACTCAGTGCATGCTACTATGTGATACGTTTTCAAAGAGTCGAGTTGTTTGGGAATGCAGCTCAAAATGGGTGGTAAATTTCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAACAAGTACCGTGAGGGAAAGATGAAAAGAACTTTGAAAAGAGAGTTAAACAGTGCGTGAAATTGTTGAAAGGGAAACGCTTGACACCAGTCATGCGAGTGGAAAATCAGTCTTTTGCAATGGGGAGTTGTGGGCGTTCAGACCGCAAGGCTGGCGTTTGCTTCATCTTTGTTGTAAGTGATGCACTTTTTCATTTGCAGGTCAACATCAGTTTGCTTTGCTGGACAAAACCCCAAGGGAAGGTGGCAGCTTAGGCTGTGTTATAGCCCTGGGGCGATACAGTGGAGTGGACTGAGGTTTTCGCAGTGTGTGCTCTCTGGGCAAGGCTGACTGGGTGCTATGGGATCGTTCGGCGTACAATGCATGCATTTTGCGTCGTGTCTTTTTCATACTCGCTCAACTCGGCTCTTCCACACTT'
>>> fq = pyfastx.Fastq('test.fastq')
>>> fq[1]
<Read> 2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus with length of 1429
>>> fq['2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus does not exist in fastq file'
>>> 

pyfastx sometimes gives an erroneously shifted subsequence from hg19 reference

Hi!

I have tried for a few days now to understand why I sometimes get different subsequences than when using 'bedtools getfasta' on the hg19 reference. I was wondering if you could elaborate on the indexing used by pyfastx, in case I am making some error. I demonstrate this by fetching these two subsequences from chromosome 20:

seq1: start=60757997, end=60758117
seq2: start=60757999, end=60758119

Both sequences are 120bp long and seq2 is shifted downstream by 2bp from seq1, meaning they should overlap by 118bp.

Using bedtools getfasta, I get the expected result:

$ cat test.bed 
chr20   60757997        60758117
chr20   60757999        60758119
$ bedtools getfasta -fi /home/work/genomes/hg19/hg19.fa -bed test.bed -fo bedtools.fasta
$ cat bedtools.fasta 
>chr20:60757997-60758117
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20:60757999-60758119
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

However, when trying to fetch the same sequences with pyfastx, I get a seq2 that is shifted by 3bp instead of 2bp. The commands issued in ipython:

In [11]: genome = pyfastx.Fasta('/home/work/genomes/hg19/hg19.fa')

In [12]: str(genome["chr20"][60757997:60758118])
Out[12]: 'GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA'

In [13]: str(genome["chr20"][60757999:60758120])
Out[13]: 'GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG'

So seq1 is identical to what bedtools returns, but for some reason seq2 starts from a different position even though the start position is defined to be the same as in the bedtools example.

I also replicated the test with a third tool, seqkit, and it agrees with bedtools:

$ seqkit grep -p 'chr20' /home/work/genomes/hg19/hg19.fa | seqkit subseq --bed test.bed
[INFO] read BED file ...
[INFO] 2 BED features loaded
>chr20_60757998-60758117:. 
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGC
CCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20_60758000-60758119:. 
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCC
CGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

I have also deleted both the .fai and .fxi index files and rebuilt them, but the issue persists. Can you explain why I get a different sequence than with bedtools? And why sometimes I get the same sequence as with other tools?

Let me know if I was unclear or some information is missing from my report.

Best regards,
Tuomo

Add support to release aarch64 wheels

Problem

On aarch64, pip install pyfastx builds the wheels from source code and then installs them. This requires the user to have a development environment installed on their system. Also, it takes more time to build the wheels than to download and extract prebuilt wheels from PyPI.

Resolution

On aarch64, pip install pyfastx should download the wheels from pypi

@lmdu, please let me know if you are interested in releasing aarch64 wheels. I can help with this.

streaming multiple fastq files at the same time?

Hi, is it possible to open (and process) multiple fastq files at the same time using pyfastx.Fastq? With file handles, I could do something like this:

with gzip.open(R1_fastq_file, 'r') as R1: 
    with gzip.open(R2_fastq_file,'r') as R2:
        with open(output_path, 'w') as out:
            DO SOMETHING HERE...

Is it possible to achieve the same thing using pyfastx? The situation here is that I need information from both the read1 and read2 fastq files, and I need to scan through a really large paired-end fastq dataset.

Thank you!

Can an index be "shared" by multiple processes?

Thanks again for pyfastx, it is easy and fast to use. I'm wondering if it is possible to generate an index from a FASTQ file and whether that index can be used by multiple processes at the same time, i.e. accessing different sequences from the index? I've tried this and haven't gotten any exceptions/errors; however, the data coming back is garbled (a mixture of sequence and quality scores from a call to idx.seq). I think I read somewhere in the docs that the index is an SQLite3 database?

Difference in content when parsing fastq files.

Hi, tyvm for pyfastx, I really like it.

I came across an issue when I was writing tests for my script, and hoped you might be able to help me.

I am processing fastq files, both regular and gzipped using:

for oisRecord in pyfastx.Fastq(input_file):

And get the headers and sequences like this:

oisRecord.name
oisRecord.seq

The issue is that I got different read counts for the same fastq file.
When I write the headers and sequences to a file within the for loop, and then
count the lines using wc -l, they differ.
For comparison, I also converted the fastq to fasta, and gzipped the fasta.
Both the regular fasta and the gzipped fasta give me the same result as the gzipped fastq.
The outlier is the regular fastq file.

Any idea what could be the cause of this?

System:
Ubuntu 20.04
Python 3.8
Pyfastx 0.6.12

test.fastq.gz (quality in this fastq is fake)

SystemError with fa.fetch

I am getting the following error with fa.fetch:

SystemError: /tmp/build/80754af9/python_1578510683607/work/Objects/longobject.c:415: bad argument to internal function

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ikaplow/RegulatoryElementEvolutionProject/src/RegElEvoCode/prepareDataForPrank.py", line 160, in
prepareDataForPrank(options)
File "/home/ikaplow/RegulatoryElementEvolutionProject/src/RegElEvoCode/prepareDataForPrank.py", line 149, in prepareDataForPrank
currentSequence = fa.fetch(bedDict[peak][0], (bedDict[peak][1], bedDict[peak][2]))

I have checked that the chromosome is in the fasta file and that the coordinates are within the chromosome boundaries. I am using pyfastx version 0.8.4 with python 3.7.6.

It would be great if someone could help me figure out what to do about this. Thanks!

[Feature Request] Read from stdin option

It would be awesome if pyfastx could read from stdin for streaming purposes between programs.

import sys, pyfastx

def main():
    for id, seq in pyfastx.Fasta(sys.stdin, build_index=False):
        print(id)
if __name__ == "__main__":
    main()

Here's the error:

Traceback (most recent call last):
  File "/Users/jespinoz/test.py", line 7, in <module>
    main()
  File "/Users/jespinoz/test.py", line 4, in main
    for id, seq in pyfastx.Fasta(sys.stdin, build_index=False):
TypeError: a bytes-like object is required, not '_io.TextIOWrapper

Python 3.10 compatibility

Can future versions of this incorporate Python 3.10 compatibility? The Bioconda install, at least, requires Python < 3.10.

Return full sequence name without building index

Hi Lianming,

Thank you for writing such a great tool! I'm using it in several projects I'm involved in now.

I was wondering if it is possible to return the full sequence name instead of the string split on whitespace when build_index is False? It would be great if it could be controlled using the key function as it is when an index is built. This is critical to the work that I'm doing where sequence names do not reliably have unique strings for the first part of the header.

I hope I haven't just missed something in the documentation!

Thanks!
Connor

Support xz-compressed files

Hello @lmdu,

I am planning to use pyfastx within Nextstrain's Augur to support a new data curation command and it would be really helpful to be able to support xz-compressed files. Would you be open to extending pyfastx to support xz-compressed files?

Groups working with large files are using xz to save space because xz has a better compression ratio than gzip. For example, Nextstrain hosts a file of all GenBank SARS-CoV-2 genomes that is xz-compressed.

With the condition that the file was originally compressed in multiple short blocks, it is possible to randomly access xz-compressed files. python-xz is an example of this in pure Python and xz-random-access is an example of this in C.

Thank you!

Read comment

Hey, nice tool - quite useful for extending pyfaidx-like access to FASTQ. Is there any chance you could implement reading the comment from a read header?

Currently the only accessible attribute is read.name when iterating over pyfastx.Fastq

Failure to properly parse ENCODE's GRCh38 reference

Hello,

I wanted to use pyfastx for some ENCODE analysis and tried to use the Fasta class on the GRCh38 reference found here (more info here). When I iterate through the fa object I get unexpected results (i.e. the "description" from chr1 is the header for chr2). Interestingly enough, opening the file as a Fastx object seems to work as expected. Unfortunately, I really want to use the fetch method, which isn't implemented for Fastx.seq objects.

To troubleshoot, I went ahead and tested the hg38.fa.gz fasta from UCSC found here. Unfortunately, I'm getting odd results with this file too. For example, many of the description fields are short nucleotide sequences (e.g. for chr10 I get "CGGGC").

The description field errors may not have been an issue, except that when you hit the end of the FASTA file you get a UnicodeDecodeError when the description field is accessed.

I am using version 0.8.4.

Here is a Google Colab notebook replicating the issue.

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position

python version: 3.8.12
pyfastx version: 0.8.4

import pyfastx 
fq2 = pyfastx.Fastq("./tmp/tmp2_2.fq.gz")
for read in fq2:
    print(read.seq)

here is the header of my fq file:

@A01225:536:HFNG7DSX5:3:1101:1217:1000 2:N:0:TAAGGCGA+GCGATCTA
CAAGCGTTGGCTTCTCGCATCTGTACGGTGTCAACGCTTAATCCACGTGCTTGAGAGGCCAGAGCATTCGACACAGAATTTTTTTTTTTTTTCGGTTAGTACGAGATGGCTTCACACCTCGCGGTGCATCCAACTCTCGCCTATCGAAGT
+
,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FF:FFFF:,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFF:F,FF::FFF:FF,:FF,:F,FFFFF,FFF:FFFF,FFFF,F,FFF:,F:FFF,FF:F
@A01225:536:HFNG7DSX5:3:1101:1253:1000 2:N:0:TAAGGCGA+GCGATCTA
CAAGCGTTGGCTTCTCGCATCTTCGTCCATCATACTAGTAATCAACGTGCTTGAGAGGCCAGAGCATTCGAAGATCTTTTTTTTATTTTTTTTTGTTTTTGTTTTAAAAAATTTATGAAAAAATTGATACGCCAATTATTACTAAAACTC
+
:F:,F,FFFFFF,FFFFFFFFFF:,FF:FFFF,F,FFFFFFFFF,FFFFFFFFFFFF,F,FFF,FFFFF,F,F,,,:FFFFFFF,F:F:FFFF:,F:,FF,,::F:,FFFF,FF,,,,F,FF,F:::,:,,F,:F,:F,F:,:F:,F,F,

Multiprocessing with apply_async does not work

Hi @lmdu ,

Sorry again, found one more issue while trying to parallelize and share the index with multiple processes.

Here is the code:

from multiprocessing import Manager, Pool
from pyfastx import Fasta

def print_seq_names(fasta_obj, lock):
    for i in range(5):
        with lock:
            print(fasta_obj[i].name)

def error_call(err):
    print(err)

fasta_obj = Fasta("uniprot_sprot.fasta.gz")
pool = Pool(5)
lock = Manager().Lock()

for i in range(4):
    pool.apply_async(print_seq_names, args=(fasta_obj, lock), error_callback=error_call)

pool.close()
pool.join()

Error:
<multiprocessing.pool.ApplyResult object at 0x7f7605085090>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f201d0>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects

Is it not possible to share the Fasta object or the index or the identifier objects with multiple processes anymore?

Also, if I make the fasta object and the identifier object in my code (the above is sample dummy code) global variables, I see only 1 process/core running at a time (out of 64 cores in the real code) and the rest of them are in the sleep state. Do you know why this is the behaviour?

Any help here would be great as well,

Thank you!

P.S:
OS: CentOS 7
Python : 3.7.7
pyfastx: 0.8.3

pyfastx.Fasta(build_index=False) using all available threads

What is pyfastx.Fasta using that defaults to using all threads (e.g., BLAS)? I'm simply running:

for name, seq in pyfastx.Fasta(fasta_file, build_index=False)

...and all 80 threads on my server are used. I obviously want to reduce this.

env

OS: Ubuntu 18.04.4
python: 3.6.10
blas: 1.1
pyfastx : 0.6.14

iteration on record?

Thanks for developing pyfastx, that's a really nice tool! However, I have a basic question. When you iterate over the fasta object or over pyfastx.Fasta(...) directly, does it always report names and sequences as strings? What if I want to iterate over sequence objects to rename them using the description field? Maybe I missed something. Thanks again!

ValueError in test suite with pyfaidx 0.7.1

Hi,

While working on packaging pyfastx for Debian, in order to fulfill the new dependency requirements of augur, I noticed that the way pyfaidx is called to produce the reference object at the negative index is somehow broken (at least with pyfaidx 0.7.1, python3.10 and python3.11 in Debian sid). It raises a -probably undue- test error:

ERROR: test_seq_by_index (tests.test_sequence.SequenceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/pyfastx-LEa3AU/pyfastx-0.8.4/tests/test_sequence.py", line 54, in test_seq_by_index
    expect = self.faidx[idx][:]
  File "/usr/lib/python3/dist-packages/pyfaidx/__init__.py", line 1099, in __getitem__
    rname = next(islice(self.records.keys(), rname, None))
ValueError: Indices for islice() must be None or an integer: 0 <= x <= sys.maxsize.

Manually implementing the negative index wrap around works around the problem and allows me to proceed to the actual negative index test on pyfastx, which in turn works as expected.

--- a/tests/test_sequence.py
+++ b/tests/test_sequence.py
@@ -51,7 +51,7 @@ class SequenceTest(unittest.TestCase):
 
 		#test negative index
 		idx = (self.get_random_index() + 1) * -1
-		expect = self.faidx[idx][:]
+		expect = self.faidx[self.count+idx][:]
 		result = self.fastx[idx]
 
 		self.assertEqual(expect.name, result.name)

I'm not entirely sure whether the problem lies in pyfastx, or if this is new, buggy behavior in pyfaidx. Thus, I haven't prepared a full-fledged merge request, but I thought you might be interested in being aware of the problem anyway.

Have a nice day, :)
Étienne.

Questions about pyfastx obtaining sequence

  1. When the last line of the fasta file is not a blank line, the program crashes when processing the last sequence (fa['test'].seq).
  2. When the input sequence name does not exist, a more human-friendly message should be given instead of a KeyError.

Segfault when using full_name=True for Fasta iteration

Hi Lianming,

I'm getting a segfault when the full_name flag is set to True and build_index is set to False for any fasta file. This behaviour occurs in 0.8.0 and 0.8.1, while 0.7.0 functions fine.

Here's an example line:

for name, seq in Fasta(fasta_input, build_index=False, full_name=True):

It segfaults on the next line.

Let me know if you need any more information.

Thanks!
Connor

Seems to have trouble with large fasta files

Wanted to benchmark this against pyfaidx for getting subsequences, but I can't get pyfastx to work with my hg19 genome. pyfastx indexes the file OK, and then says it has 28 sequences, but when I try to access subsequences, I get either empty strings or crashes.

>>> f = pyfastx.Fasta('hg19.fa')
>>> f.fetch('chr1', (100000,110000))
''
>>> f.fetch('chr1', (1000000,1100000))
SystemError: Objects/longobject.c:244: bad argument to internal function
>>> s = f[-1]
SystemError: Objects/longobject.c:244: bad argument to internal function
>>> f.keys()
<Identifier> contains 28 identifiers
[~]# pyfastx info hg19.fa
Sequence counts: 28
Total bases: 3095694378
GC content: 40.90%
A counts: 844868179
T counts: 846097379
C counts: 585018013
G counts: 585360526
N counts: 234350281
Mean length: 110560513.50
Median length: 111259709.00
Max length: 249250621
Min length: 131
N50, L50: 155270560, 8
length >= 1000: 25

This is Python 2.7 running on CentOS. This is version 0.5.6 of pyfastx.

segfaults

I seem to be getting memory allocation issues ... I would love to use this package but not sure how to fix this.

python(9050,0x10f3a9dc0) malloc: Incorrect checksum for freed object 0x7f917f498558: probably modified after being freed.
Corrupt value: 0x0
python(9050,0x10f3a9dc0) malloc: *** set a breakpoint in malloc_error_break to debug
[1] 9050 abort python test.py

It's always at the same read.

@FS10000899:17:BPC29511-2322:1:1101:15960:1860 2:N:0:1
GGAAGATCGAGTAGATCAAATACATACGATATGGAAGTGGGAACTACCATGCGAACGGAAACGTTCGAACTACATGGCCCACTTCCTAAGTCGTATGTAAAAGAAACAACAACAACAACCCACCATTTTGT
+
FF,,F:F,FFFF::F:F,:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFF

fq1 = pyfastx.Fastq("output/Mapping_Files/test_mate1_val_1.fq")
count = 0
total = 0
for r1 in fq1:
    total += 1
    print(r1.name)
    print(r1.seq)
    seq_sub = r1.seq
    found = 0
    for c in codes.values():
        if seq_sub.find(c) != -1:
            found = 1
            break
    count += 1
print(count, total)

[BUG] Unable to get sequences after encountering `KeyError`

I'm trying to access sequences with a try/except block like so:

import pyfastx

# sequences.fasta contains three sequences: 'seq_a', 'seq_b', 'seq_c'
sequences = pyfastx.Fasta('sequences.fasta')
for seq_id in ['seq_a', 'bad_sequence', 'seq_a', 'seq_b', 'seq_c']:
    try:
        sequence_record = sequences[seq_id]
    except KeyError:
        ...

Once I encounter a KeyError, all subsequent tries to get a sequence fail with a KeyError.
In my example, I am able to access seq_a the first time, but I get an error for any of the sequences after bad_sequence.

Python version 3.7
pyfastx version 0.8.4

Add full_name=True option to pyfastx.Fastq

Thanks again for a great library -- would it be possible to add full_name=True to the Fastq parser? ONT (nanopore) reads by default have spaces in the FASTQ header that contain useful information. Currently, there doesn't seem to be a way to keep that information, as the parser seems to truncate the header at the first space.

Thanks,
Jon

Writing fast

Hey @lmdu
Thank you for your package. Can you suggest an efficient way to write fasta files?

I would need to read a fasta file, rename the headers, and write them to one or multiple files.
Reading is very easy in pyfastx, but to write to files I found no better way than the following.

        fa = pyfastx.Fasta(input.fasta)

        representatives = fa.keys()

        with open(output.faa, 'w') as outf:
            for gene in fa:
                new_name = map_names[gene.name]
                lines = '\n'.join(str2multiline(gene.seq))
                outf.write(f">{new_name} {gene.description}\n{lines}\n")

If I don't need to rename I can use the seq.raw attribute but I don't know if there would be a faster way.

Segfault when loading index from multiple processes

I am dispatching several hundred jobs on an HPC that all begin by loading a previously-created pyfastx index file. Around a third of the jobs will immediately segfault:

(gdb) bt
#0  0x00002aaaaaf5e580 in fileno_unlocked () from /lib64/libc.so.6
#1  0x00002aaab35de26e in is_readonly (fd=fd@entry=0x0) at src/zran.c:2639
#2  zran_import_index (index=index@entry=0x5555565f4b20, fd=fd@entry=0x0) at src/zran.c:2639
#3  0x00002aaab35d76be in pyfastx_load_gzip_index (index_file=<optimized out>, gzip_index=0x5555565f4b20, index_db=0x5555566a0508) at src/util.c:513
#4  0x00002aaab35db42c in pyfastx_fastq_load_index (self=self@entry=0x2aaadfe79938) at src/fastq.c:223
#5  0x00002aaab35dc210 in pyfastx_fastq_new (type=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at src/fastq.c:322
#6  0x00005555556f78e5 in type_call () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/typeobject.c:895
#7  0x000055555566787b in _PyObject_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/abstract.c:2331
#8  0x00005555556f774e in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4861
#9  0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#10 0x00005555556c60ab in _PyFunction_FastCall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4919
#11 fast_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4954
#12 0x00005555556f76d5 in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4858
#13 0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#14 0x00005555556ca669 in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2,
    kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0, locals=0x2aaaaaba35e8, globals=0x2aaaaaba35e8, _co=0x2aaab2374ae0)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4166
#15 PyEval_EvalCodeEx () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4187
#16 0x00005555556cb3fc in PyEval_EvalCode (co=co@entry=0x2aaab2374ae0, globals=globals@entry=0x2aaaaaba35e8, locals=locals@entry=0x2aaaaaba35e8)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:731
#17 0x000055555576cc94 in run_mod () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:1025
#18 0x000055555576d091 in PyRun_FileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:978
#19 0x000055555576d293 in PyRun_SimpleFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:419
#20 0x000055555576d39d in PyRun_AnyFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:81
#21 0x0000555555770dc9 in run_file (p_cf=0x7fffffff3f9c, filename=0x5555558ab340 L"/nethome/schrammca/SONAR/annotate/find_umis.py", fp=0x5555559243f0)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:340
#22 Py_Main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:810
#23 0x000055555563844e in main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Programs/python.c:69
#24 0x00002aaaaaf0cb15 in __libc_start_main () from /lib64/libc.so.6
#25 0x0000555555720d7f in _start () at ../sysdeps/x86_64/elf/start.S:103

This happens even if I only start 20 worker jobs concurrently, although the frequency of segfaults is lower (~2/20 jobs vs ~70/200 jobs vs ~450/650 jobs).

Corrupt quality values

Hi

I've got an unaligned BAM file from a PacBio long-read run which I've converted to a fq using bedtools. I am then trying to use pyfastx to play around with the file and I've noticed something (that could be my own fault).

When I load the fq.gz file from within Python, pyfastx will open the file, index it, and then I can start calling the stats on it.
If, however, I use the command line interface, it will index fine, but when I try to run the stats it gives the following error:

Traceback (most recent call last):
  File ".../bin/pyfastx", line 8, in <module>
    sys.exit(main())
  File ".../lib/python3.9/site-packages/pyfastxcli.py", line 578, in main
    args.func(args)
  File ".../lib/python3.9/site-packages/pyfastxcli.py", line 99, in fastx_info
    fq.maxlen, fq.minlen, fq.maxqual, fq.minqual, ",".join(fq.encoding_type)]
TypeError: Quality values corrupt. found [34, 126] where [33, 104] was expected

Any ideas what might be going on?

Iterating over sequences without comments results in comments from previous records

Using pyfastx version 0.8.4, I tried iterating over records but the "comment" attribute is not reset.
Is there a safer way to iterate (I would like not to assume FASTA format) or did I just miss something obvious?

File (test.fa):

>1 comment
TTTTTTTT
>3
atatat

Code:

import pyfastx
fa = pyfastx.Fastx("test.fa")
for name,seq,comment in fa:
    print(comment)

Output:

comment
comment

Thanks for your help

get seq count and length error

I got an error in pyfastx 0.6.5 installed with conda on a Linux system


>>> import pyfastx
>>> fa=pyfastx.Fasta('all_bins/ERR2059947_metabat_58.fasta')
RuntimeError: get seq count and length error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <class 'Fasta'> returned a result with an error set

some sequences are missing in pyfastx.Fasta object

I loaded a fasta file containing 4542 sequences with average length of 2.5kb, however only 4539 sequences were in the pyfastx.Fasta object.

fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540'] # keyError

Besides, I could access a sequence, e.g. fa['contig_999'], the first time. But when I try to access it again I get a KeyError.

The version of pyfastx I used is 0.8.4, Python version 3.7

Possible memory leak in keys

Hi,

I have downloaded the trembl dataset from UniProtKB (48GiB) and created an index (23GiB).

from pyfastx import Fasta

# Create index
fobj = Fasta('uniprot_trembl.fasta.gz', build_index=True)

# Extract keys
sample_ids = fobj.keys()

# Apply length filter
sample_ids.filter(sample_ids >= 11)

# Iterate through the filtered sample ids (before the end of the iteration a total of 12.4 GiB of memory is seen to be consumed)
dummy_count = 0
for idx, key in enumerate(sample_ids, start=1):
    dummy_count += 1

The above iteration is shown here for reproducibility purposes and to demonstrate that the memory is consumed no matter what. Can you please take a look at this and offer a solution?

Thanks in advance!

P.S:
System info:
OS: CentOS 7
Python 3.7.7
pyfastx version: 0.8.3

How to safely close file handles with Fastx?

It is not clear how to safely close any file handles that are opened by Fastx (or any of the other classes provided by this library).

It would be nice to be able to use Fastx with contextlib for example.

Select index output folder

Nice package! Is there a way to add a parameter to select an output folder for the index file? Sometimes the user might not have write permission to the source folder.

Silent errors when reading other compressed formats

Currently pyfastx checks to see if a file is gzip compressed, and if it is not it seems to interpret it as text. An unfortunate side effect of this is that if another compressed format is used, the program will happily return garbage reads:

>>> reads_1 = [x for x in pyfastx.Fastq("sample.fq.gz")]
>>> reads_1[:5]
[<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1 with length of 16]
>>> reads_2 = [x for x in pyfastx.Fastq("sample.fq.bz2")]
>>> reads_2[:5]
[<Read> Zh91AY&SY��7a��}^ with length of 4,
 <Read>  with length of 2,
 <Read>  with length of 257,
 <Read> Q4��TH��*���m�R�@�6�+� with length of 0,
 <Read> E��hĶ
  with length of 206]

Some basic error checking to make sure that the input file isn't in binary format, or perhaps a sterner warning in the README that this can happen, would be helpful!

pyfastx.Fasta leaking memory?

I am opening a large fasta file via

import pyfastx
fasta_sequences = pyfastx.Fasta('data/uniparc_active.fasta', build_index=True)

and I am seeing memory utilization in top increase until 100%.

Is this intended behavior, or is there a memory leak in the library?

Thanks,
Martin

Slice feature for keys

Hi,

Thank you for this handy tool, really useful!

I am writing a tool that uses pyfastx, and it would be really nice to have a slice feature for keys instead of iterating over every key and saving them in a list object.

from pyfastx import Fasta
fasta_obj = Fasta("fasta_file.fasta", build_index=True)
fasta_keys = fasta_obj.keys()

fasta_keys[0:10]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence index must be integer, not 'slice'

Thanks in advance!

Poor performance when reading gzipped FASTQ sequentially

As part of publishing my pure Python FASTQ/FASTA package I ran some benchmarks/comparisons with pyfastx, which you can find in the package documentation: https://fqfa.readthedocs.io/en/latest/benchmark.html

pyfastx was unexpectedly slow when processing a gzipped FASTQ file sequentially, although it did fine when the file was uncompressed. I don't know what the root cause of this might be (maybe it's getting a new block for each read?), but I wanted to bring it to your attention.

pip install broken in version 0.8.0

schrammca@ai-hpcn151:~$ pip3 install pyfastx --user
Collecting pyfastx
  Using cached pyfastx-0.8.0.tar.gz (74 kB)
    ERROR: Command errored out with exit status 1:
     command: /sysapps/cluster/software/Anaconda3/5.3.0/envs/python_3.7_env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py'"'"'; __file__='"'"'/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-2jdcamx4
         cwd: /tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py", line 51, in <module>
        with open(os.path.join('src', 'version.h')) as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'src/version.h'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I was able to pip3 install pyfastx==0.7.0 --user successfully...

Does pyfastx split fasta files sequentially?

I want to split paired-end fastq files. If I run pyfastx split -n 30 on each, will the pair order be preserved, such that subset 1 of the forward reads and subset 1 of the reverse reads will be correctly paired?

Getting the full sequence name when processing FASTQ files

Because compressed FASTQ files are huge, building an index takes a long time and uses a lot of disk space. If I am only iterating over the FASTQ file sequentially, without random access, there is no need to build an index. However, without an index, fetching the sequence id from the returned tuple only gives the short name. Could the program return the full sequence name without building an index?

>>>head -1 test.fq
>>>@A00838:157:H2Y5TDSXY:4:1101:1398:1000 1:N:0:ATTCAGAA+CCTATCCT

for read in pyfastx.Fastq('test.fq',build_index=False):
   print(read[0])
>>>A00838:157:H2Y5TDSXY:4:1101:1398:1000

small caps

Hi -
Thanks for developing pyfastx, very useful and convenient. I had a question: how does pyfastx handle soft-masked genomes with lower-case characters? I notice that such files sometimes generate errors, or at least that everything is converted to upper case. Thanks!

Filter fastq file (performance)

I have a paired-end single-cell RNA-seq dataset. R1 consists of all the reads, and R2 of the barcodes necessary to identify which cell a read belongs to. If I now only trim R1 to keep high-quality reads, my R1 and R2 are out of sync.

It seems like pyfastx can solve this problem for me, by keeping only the reads in R2 that also have a mate in R1:

import gzip

with gzip.open(output, 'wt') as f:
    for read in reads:
        barcode = barcodes[read.id]
        f.write(barcode.raw)

However, from a really sloppy benchmark, it seems like just getting barcode.raw takes around 0.0015 seconds per read, while the lookup of the read is fast: 1e-6. This would mean I have to wait two days to filter my fastq. Is there an easier/better/faster way of doing this?

[Feature Request] Get barcodes from Fastx function

Hi, I would like to get barcode counts using the Fastx function.
If this already is an option please let me know.

Right now I am calling:
for currentRead in pyfastx.Fastx(fastqFile): ...
To iterate through rows to get some statistics, but I would also like barcode counts. With normal Python code I do it like so, but it's quite slow:

import gzip

barcodes = {}
with gzip.open(myFastq) as fastq:
        for line in fastq:
                if not line.startswith(b'@'): continue
                bc = line.decode("utf-8").split(':')[-1].strip()
                # print(bc)
                if bc not in barcodes:
                        barcodes[bc] = 1
                else:
                        barcodes[bc]+=1

Fastx has sped up some of my other data collection functions so I was hopeful it could do this too!
Thank you

Unexpected behavior when a Sequence object is returned from a function

Hello!

I made a trivial function that returns a sequence from a Fasta class expecting it to return an object of the Sequence class.

However, it returns meaningless and different characters every time the script is run.

On the contrary, if I don't use the function, but extract the gene in the main routine, I can handle the Sequence class object as expected.

Is this behavior to be expected or is it a bug?

For the test I used this file https://github.com/lmdu/pyfastx/blob/master/tests/data/test.fa

Here is the script:

import pyfastx
import sys

def extract_gene(file, gene):
    fasta = pyfastx.Fasta(file)
    return fasta[gene]

if __name__ == '__main__':
    print(f'Python version: {sys.version}')
    print(f'pyfastx version: {pyfastx.version()}\n')
    print('Extracting in __main__\n')
    fasta = pyfastx.Fasta('test.fa')
    gene = fasta['JZ822577.1']
    print(gene)
    print('\n\nExtracting in extract_gene function\n')
    gene = extract_gene('test.fa', 'JZ822577.1')
    print(gene)

output is as follows:

Python version: 3.9.1 | packaged by conda-forge | (default, Jan 26 2021, 01:34:10)
[GCC 9.3.0]
pyfastx version: 0.8.4

Extracting in __main__

CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT


Extracting in extract_gene function

@B5��@B5��� ��U� ��U@B5�
