
pyfastx's Introduction

pyfastx


Citation: Lianming Du, Qin Liu, Zhenxin Fan, Jie Tang, Xiuyue Zhang, Megan Price, Bisong Yue, Kelei Zhao. Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files. Briefings in Bioinformatics, 2021, 22(4):bbaa368.


Introduction

pyfastx is a lightweight Python C extension that enables users to randomly access sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming an excessive amount of memory, pyfastx builds an index and stores it in a SQLite3 database file, which is then used for random access. In addition, pyfastx can parse both standard FASTA (sequence wrapped over multiple lines of the same length) and nonstandard FASTA (sequence spread over one or more lines of differing lengths). The module uses kseq.h, written by @attractivechaos in the klib project, to parse plain FASTA/Q files, and zran.c, written by @pauldmccarthy in the indexed_gzip project, to index gzipped files for random access.

This project was heavily inspired by @mdshw5's project pyfaidx and @brentp's project pyfasta.

Features

  • Single file for the Python extension
  • Lightweight and memory-efficient parsing of FASTA/Q files
  • Fast random access to sequences in gzipped FASTA/Q files
  • Read sequences from FASTA files line by line
  • Calculate N50 and L50 of sequences in a FASTA file
  • Calculate GC content and nucleotide composition
  • Extract reverse, complement and antisense sequences
  • Excellent compatibility, including support for parsing nonstandard FASTA files
  • Support for FASTQ quality score conversion
  • Command line interface for splitting FASTA/Q files

Installation

Currently, pyfastx supports Python 3.6, 3.7, 3.8, 3.9, 3.10, 3.11. Make sure you have installed both pip and Python before starting.

You can install pyfastx via the Python Package Index (PyPI)

pip install pyfastx

Update pyfastx module

pip install -U pyfastx

FASTX

New in pyfastx 0.8.0.

pyfastx provides a simple and fast Python binding for kseq.h to iterate over the sequences or reads in a FASTA/Q file. The FASTX object automatically detects the input format (FASTA or FASTQ) and returns the corresponding tuple.

FASTA sequences iteration

When iterating over a FASTX object built from a FASTA file, a (name, seq) tuple is returned.

If you want the sequence comment, set comment to True (new in pyfastx 0.9.0).

The comment is the content of the header line after the first whitespace or tab character.
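A minimal sketch (the file name test.fa is only a placeholder):

>>> import pyfastx
>>> for name, seq in pyfastx.Fastx('test.fa'):
...     print(name)
...     print(seq)

>>> # with comment=True, a (name, seq, comment) tuple is returned
>>> for name, seq, comment in pyfastx.Fastx('test.fa', comment=True):
...     print(comment)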

FASTQ reads iteration

When iterating over a FASTX object built from a FASTQ file, a (name, seq, qual) tuple is returned.

If you want the read comment, set comment to True (new in pyfastx 0.9.0).

The comment is the content of the header line after the first whitespace or tab character.
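A minimal sketch (test.fq.gz is a placeholder; the four-element tuple with comment=True is assumed by analogy with the FASTA case):

>>> import pyfastx
>>> for name, seq, qual in pyfastx.Fastx('test.fq.gz'):
...     print(name, qual)

>>> # with comment=True, a (name, seq, qual, comment) tuple is returned
>>> for name, seq, qual, comment in pyfastx.Fastx('test.fq.gz', comment=True):
...     print(comment)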

FASTA

Read FASTA file

Read a plain or gzipped FASTA file and build an index, which supports random access to the FASTA file.
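For example (test.fa.gz is a placeholder file name):

>>> import pyfastx
>>> # the first call builds an .fxi index file next to the input; later calls reuse it
>>> fa = pyfastx.Fasta('test.fa.gz')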

Note

Building the index may take some time, depending on the size of the FASTA file. Once the index is built, you can randomly access any sequence in the FASTA file. The index file is reused on subsequent reads to save time.

FASTA records iteration

The fastest way to iterate over a plain or gzipped FASTA file is without building an index; the iteration returns a tuple containing the name and sequence.
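A sketch of this usage (the file name is a placeholder):

>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test.fa.gz', build_index=False):
...     print(name)
...     print(seq)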

You can also iterate over sequence objects from a FASTA object like this:
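(a sketch; the seq.name and seq.seq attributes also appear in the issue reports later on this page)

>>> fa = pyfastx.Fasta('test.fa')
>>> for seq in fa:
...     print(seq.name)
...     print(seq.seq)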

Iteration with build_index=True (the default) returns sequence objects, which allow you to access the attributes of each sequence. New in pyfastx 0.6.3.

Get FASTA information
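For example (these property names follow my recollection of the pyfastx API and should be verified against your installed version):

>>> fa = pyfastx.Fasta('test.fa')
>>> len(fa)          # number of sequences
>>> fa.size          # total length of all sequences
>>> fa.gc_content    # GC content (%)
>>> fa.composition   # dict of nucleotide counts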

Get longest and shortest sequence

New in pyfastx 0.3.0
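A sketch (the longest and shortest property names are assumptions to be checked against the documentation):

>>> fa.longest    # sequence object of the longest sequence
>>> fa.shortest   # sequence object of the shortest sequence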

Calculate N50 and L50

New in pyfastx 0.3.0

Calculate the assembly N50 and L50; returns (N50, L50). Learn more about N50 and L50.
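A sketch (the nl() method name is my recollection of the API, not confirmed on this page):

>>> fa.nl(50)   # returns (N50, L50)
>>> fa.nl(90)   # returns (N90, L90)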

Get sequence mean and median length

New in pyfastx 0.3.0
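A sketch (property names assumed):

>>> fa.mean     # mean length of sequences
>>> fa.median   # median length of sequences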

Get sequence counts

New in pyfastx 0.3.0

Get the count of sequences whose length is greater than or equal to a specified length.
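A sketch (the count() method name is assumed):

>>> len(fa)          # total number of sequences
>>> fa.count(1000)   # number of sequences with length >= 1000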

Get subsequences

Subsequences can be retrieved from a FASTA file by using a list of [start, end] coordinates.
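A sketch based on the fetch() calls shown in the issue reports below; the coordinates here are assumed to be 1-based:

>>> # single interval
>>> fa.fetch('JZ822577.1', (1, 10))
>>> # multiple intervals are joined into one subsequence
>>> fa.fetch('JZ822577.1', [(1, 10), (50, 60)])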

Key function

New in pyfastx 0.5.1

Sometimes your FASTA headers are long and contain multiple identifiers and a description, for example ">JZ822577.1 contig1 cDNA library of flower petals in tree peony by suppression subtractive hybridization Paeonia suffruticosa cDNA, mRNA sequence". In this case, both "JZ822577.1" and "contig1" can be used as the identifier. You can specify a key function to select one as the identifier.
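A sketch (the key_func parameter name is an assumption from memory; check the documentation for your version):

>>> # use the second field of the header ("contig1") as the identifier
>>> fa = pyfastx.Fasta('test.fa', key_func=lambda x: x.split()[1])
>>> fa['contig1']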

Sequence

Get a sequence from FASTA
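For example, assuming the fa object built above:

>>> s = fa[0]              # get a sequence by index number
>>> s = fa['JZ822577.1']   # get a sequence by identifier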

Get sequence information
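A sketch of the commonly used attributes (the names mirror the read attributes documented below and may differ slightly between versions):

>>> s.name          # identifier
>>> s.description   # full header line
>>> s.seq           # sequence as a string
>>> len(s)          # sequence length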

Sequence slice

A sequence object can be sliced like a Python string.
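For example (based on the slicing calls shown in the issue reports below):

>>> s = fa['JZ822577.1']
>>> ss = s[10:50]   # 0-based start and end, like a Python string
>>> str(ss)         # the sliced sequence as a string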

Note

Slicing start and end coordinates are 0-based. Currently, pyfastx does not support the optional third step or stride argument (for example, ss[::-1]).

Reverse and complement sequence
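A sketch (the attribute names follow the feature list above; treat them as assumptions):

>>> s = fa[0]
>>> s.reverse      # reversed sequence
>>> s.complement   # complement sequence
>>> s.antisense    # reverse complement (antisense) sequence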

Read sequence line by line

New in pyfastx 0.3.0

A sequence object can be iterated line by line as the lines appear in the FASTA file.
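A minimal sketch:

>>> for line in fa[0]:
...     print(line)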

Note

Sliced sequence (e.g. fa[0][10:50]) cannot be read line by line

Search for subsequence

New in pyfastx 0.3.6

Search for a subsequence within a given sequence and get the one-based start position of the first occurrence.
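A sketch (the optional strand argument is an assumption from memory):

>>> s = fa['JZ822577.1']
>>> s.search('AGCT')        # 1-based start position of the first occurrence
>>> s.search('AGCT', '-')   # search on the minus strand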

FastaKeys

New in pyfastx 0.8.0. The Identifier object has been renamed to FastaKeys.

Get keys

Get all sequence names as a list-like object.
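A sketch:

>>> ids = fa.keys()
>>> len(ids)              # number of keys
>>> ids[0]                # first sequence name
>>> 'JZ822577.1' in ids   # membership test
>>> list(ids)             # materialize all names as a list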

Sort keys

Sort keys by sequence id, name, or length for iteration

New in pyfastx 0.5.0
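A sketch (the exact sort() argument names are assumptions; check the documentation for your version):

>>> ids = fa.keys()
>>> for name in ids.sort('length', reverse=True):
...     print(name)
>>> ids.sort('name')   # or sort by 'id'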

Filter keys

Filter keys by sequence length and name

New in pyfastx 0.5.10
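A sketch based on the filter() call shown in one of the issue reports below; filtering by name pattern is also mentioned above but is not shown here:

>>> ids = fa.keys()
>>> ids.filter(ids >= 1000)   # keep only sequences with length >= 1000
>>> for name in ids:
...     print(name)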

FASTQ

New in pyfastx 0.4.0

Read FASTQ file

Read a plain or gzipped FASTQ file and build an index, which supports random access to reads.
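For example (test.fq.gz is a placeholder file name):

>>> import pyfastx
>>> fq = pyfastx.Fastq('test.fq.gz')   # builds the index on first use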

FASTQ records iteration

The fastest way to parse a plain or gzipped FASTQ file is without building an index; the iteration returns a tuple containing the read name, sequence, and quality.
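A sketch of this usage:

>>> for name, seq, qual in pyfastx.Fastq('test.fq.gz', build_index=False):
...     print(name)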

You can also iterate over read objects from a FASTQ object like this:
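(a sketch; the read attributes are documented below)

>>> fq = pyfastx.Fastq('test.fq.gz')
>>> for read in fq:
...     print(read.name)
...     print(read.seq)
...     print(read.qual)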

Iteration with build_index=True (the default) returns read objects, which allow you to access the attributes of each read. New in pyfastx 0.6.3.

Get FASTQ information
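A sketch; maxlen, minlen, maxqual, minqual and encoding_type appear in an issue report below, while the other property names are assumptions:

>>> len(fq)            # number of reads
>>> fq.size            # total number of bases
>>> fq.gc_content      # GC content (%)
>>> fq.avglen          # average read length
>>> fq.maxlen          # length of the longest read
>>> fq.minlen          # length of the shortest read
>>> fq.maxqual         # maximum quality score
>>> fq.minqual         # minimum quality score
>>> fq.encoding_type   # guessed quality score encoding(s)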

Read

Get read from FASTQ

Get read information

>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150

>>> # get read order number in FASTQ file
>>> r.id
791

>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'

>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'

>>> # get read length
>>> len(r)
150

>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'

>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:

>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'

>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]

>>> # get read length
>>> len(r)
150

FastqKeys

New in pyfastx 0.8.0.

Get fastq keys

Get all read names as a list-like object.
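A sketch, assuming the same interface as FastaKeys:

>>> ids = fq.keys()
>>> len(ids)   # number of reads
>>> ids[0]     # first read name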

Command line interface

New in pyfastx 0.5.0

Build index

New in pyfastx 0.6.10

Show statistics information

Split FASTA/Q file

Convert FASTQ to FASTA file

Get subsequence with region

Sample sequences

Extract sequences

New in pyfastx 0.6.10
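A hedged sketch of typical invocations, as shown below. Only pyfastx info and pyfastx split appear verbatim in the issue reports later on this page; the available subcommands and options vary between versions, so check pyfastx --help:

$ pyfastx --help                      # list available subcommands
$ pyfastx info input.fa               # show statistics information
$ pyfastx split -n 10 input.fq.gz     # split a FASTA/Q file into 10 files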

Drawbacks

If you intensively check whether sequence names exist in a FASTA file using the in operator on the FASTA object, like this:
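(a sketch; names_to_check is a hypothetical list of query names)

>>> fa = pyfastx.Fasta('test.fa.gz')
>>> for name in names_to_check:   # names_to_check: hypothetical list of names
...     if name in fa:            # each check queries the index database
...         pass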

This will take a long time to finish. Because pyfastx does not load the index into memory, each in check corresponds to an SQL existence query against the index database. A faster alternative is:
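(a sketch of the suggested pattern, again with the hypothetical names_to_check list)

>>> ids = fa.keys()
>>> for name in names_to_check:
...     if name in ids:   # membership test against the keys object
...         pass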

Testing

The pyfaidx module is used to test pyfastx. First, make sure you have pyfastx installed:

pip install pyfastx

To run the tests, you should also install pyfaidx 0.5.8:

pip install pyfaidx==0.5.8

Then, to run the tests:

$ python setup.py test

Acknowledgements

kseq.h and zlib were used to parse the FASTA format. SQLite3 was used to store the built indexes. pyfastx's random access to sequences in gzipped FASTA files is mainly attributable to indexed_gzip.

pyfastx's People

Contributors

cmorganl, corneliusroemer, emollier, khillion, lmdu, telatin, tsibley


pyfastx's Issues

Error indexing FASTQ; are there header length or character limitations?

Hello @lmdu, thanks for your tool. I'm looking for fast random access to both fasta and fastq files, therefore I'm giving pyfastx a try. I'm running into the following issue when indexing a FASTQ file; I'm guessing it's either related to the length or perhaps the brackets in the header?

>>> import pyfastx
>>> fa = pyfastx.Fasta('test.fasta')
>>> fa[1]
<Sequence> MK674167.1 with length of 1328
>>> fa['MK674167.1']
<Sequence> MK674167.1 with length of 1328
>>> fa['MK674167.1'].seq
'TAATAAGTGTTTTATGGCACTTTTTTAAATCCATATCCACCTTGTGTGCAATGTCAGGGTTGGTTTCTCTCTTTTGAGAGATCAACCCCAAACATCAACTCTATCTTAACTCTTTGTCTGAAAAATATTATGAATAAAACAATTCAAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATATTGCGCTCTTTGGTATTCCGAAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAAAGCCTTTTTTCTTTTTTGAAGAAAGGCTTTGGACTTGAGCAATCCCAACACCAATCTTTTAAAGAGAGGGGGCGGGTTGCTTGAAATGCAGGTGCAGCTGGACATTCTCCTGAGCTAAAAGCATATTCATTTAGTCCCGTCAAACGGATTATTACTTTTGCTGCAGCTAACATAAAGGGAGTTTGACCATTTTGGCTGACTGATGCAGGATTTTCACAAGAGTCTTCAAAACCTCTTGTTAAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACAAGGATTCCCCTAGTAACGGCGAGTGAAGCGGGATGAGCTCAAATTTAAAATCTGGCTGGTTTACTAGTCCGAATTGTAGTTTATAGAAGCGTTTTCGGTGGCAGCCTGGGTATAAATCCTTTGGAATAGGGTATCATAGAGGGTGAGAATCCCGTCTTTGACTCAGTGCATGCTACTATGTGATACGTTTTCAAAGAGTCGAGTTGTTTGGGAATGCAGCTCAAAATGGGTGGTAAATTTCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAACAAGTACCGTGAGGGAAAGATGAAAAGAACTTTGAAAAGAGAGTTAAACAGTGCGTGAAATTGTTGAAAGGGAAACGCTTGACACCAGTCATGCGAGTGGAAAATCAGTCTTTTGCAATGGGGAGTTGTGGGCGTTCAGACCGCAAGGCTGGCGTTTGCTTCATCTTTGTTGTAAGTGATGCACTTTTTCATTTGCAGGTCAACATCAGTTTGCTTTGCTGGACAAAACCCCAAGGGAAGGTGGCAGCTTAGGCTGTGTTATAGCCCTGGGGCGATACAGTGGAGTGGACTGAGGTTTTCGCAGTGTGTGCTCTCTGGGCAAGGCTGACTGGGTGCTATGGGATCGTTCGGCGTACAATGCATGCATTTTGCGTCGTGTCTTTTTCATACTCGCTCAACTCGGCTCTTCCACACTT'
>>> fq = pyfastx.Fastq('test.fastq')
>>> fq[1]
<Read> 2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus with length of 1429
>>> fq['2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '2;barcodelabel=SRR10121280;flow_cell_id=unknown;len=1429;avgqual=7.49;primers=[71,1319];orient=minus does not exist in fastq file'
>>> 

pyfastx sometimes gives an erroneously shifted subsequence from hg19 reference

Hi!

I have tried for a few days now to understand why I sometimes get different subsequences than when using 'bedtools getfasta' on the hg19 reference. I was wondering if you could elaborate on the indexing used by pyfastx, in case I am making some error. I demonstrate this by fetching these two subsequences from chromosome 20:

seq1: start=60757997, end=60758117
seq2: start=60757999, end=60758119

Both sequences are 120bp long and seq2 is shifted downstream by 2bp from seq1, meaning they should overlap by 118bp.

Using bedtools getfasta, I get the expected result:

$ cat test.bed 
chr20   60757997        60758117
chr20   60757999        60758119
$ bedtools getfasta -fi /home/work/genomes/hg19/hg19.fa -bed test.bed -fo bedtools.fasta
$ cat bedtools.fasta 
>chr20:60757997-60758117
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20:60757999-60758119
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

However, when trying to fetch the same sequences with pyfastx, I get a seq2 that is shifted by 3bp instead of 2bp. The commands issued in ipython:

In [11]: genome = pyfastx.Fasta('/home/work/genomes/hg19/hg19.fa')

In [12]: str(genome["chr20"][60757997:60758118])
Out[12]: 'GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA'

In [13]: str(genome["chr20"][60757999:60758120])
Out[13]: 'GGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGTG'

So seq1 is identical to what bedtools returns, but for some reason seq2 starts from a different position even though the start position is defined to be the same as in the bedtools example.

I also replicated the test with a third tool, seqkit, and it agrees with bedtools:

$ seqkit grep -p 'chr20' /home/work/genomes/hg19/hg19.fa | seqkit subseq --bed test.bed
[INFO] read BED file ...
[INFO] 2 BED features loaded
>chr20_60757998-60758117:. 
GTCGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGC
CCCGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGA
>chr20_60758000-60758119:. 
CGGCGTGTTTTCTGCACCTGCGCACTGGGCAAGCCTTGGGCTTCCGGAGCCAAAAGGCCC
CGGAAGTGGTCGGCCGCTTGCGACGCTCCGGAAGTGACGTGCTTTCCCGAGCCGGTGAGT

I have also deleted both the .fai and .fxi index files and rebuilt them, but the issue persists. Can you explain why I get a different sequence than with bedtools? And why sometimes I get the same sequence as with other tools?

Let me know if I was unclear or some information is missing from my report.

Best regards,
Tuomo

Add support to release aarch64 wheels

Problem

On aarch64, pip install pyfastx builds the wheels from source code and then installs them. This requires the user to have a development environment installed on their system. Also, it takes more time to build the wheels than to download and extract prebuilt wheels from PyPI.

Resolution

On aarch64, pip install pyfastx should download the wheels from pypi

@lmdu, please let me know if you are interested in releasing aarch64 wheels. I can help with this.

streaming multiple fastq files at the same time?

Hi, is it possible to open (and process) multiple fastq files at the same time using pyfastx.Fastq? With file handles, I could do something like this:

with gzip.open(R1_fastq_file, 'r') as R1: 
    with gzip.open(R2_fastq_file,'r') as R2:
        with open(output_path, 'w') as out:
            DO SOMETHING HERE...

Is it possible to achieve the same thing using pyfastx? The situation here is that I need information from both the read1 and read2 fastq files, and I need to scan through a really large paired-end fastq dataset.

Thank you!

Can an index be "shared" by multiple processes?

Thanks again for pyfastx, it is easy and fast to use. I'm wondering if it is possible to generate an index from a FASTQ file and whether that index can be used by multiple processes at the same time, i.e. accessing different sequences from the index? I've tried this and haven't gotten any exceptions/errors; however, the data coming back is garbled (a mixture of sequence and quality scores from a call to idx.seq). I think I read somewhere in the docs that the index is an SQLite3 database?

Difference in content when parsing fastq files.

Hi, tyvm for pyfastx, I really like it.

I came across an issue when I was writing tests for my script, and hoped you might be able to help me.

I am processing fastq files, both regular and gzipped using:

for oisRecord in pyfastx.Fastq(input_file):

And get the headers and sequences like this:

oisRecord.name
oisRecord.seq

The issue is that I got different read counts for the same fastq file.
When I write the headers and sequences to a file within the for loop, and then
count the lines using wc -l, they differ.
For comparison, I also converted the fastq to fasta, and gzipped the fasta.
Both the regular fasta and the gzipped fasta give me the same result as the gzipped fastq.
The outlier is the regular fastq file.

Any idea what could be the cause of this?

System:
Ubuntu 20.04
Python 3.8
Pyfastx 0.6.12

test.fastq.gz (quality in this fastq is fake)

SystemError with fa.fetch

I am getting the following error with fa.fetch:

SystemError: /tmp/build/80754af9/python_1578510683607/work/Objects/longobject.c:415: bad argument to internal function

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ikaplow/RegulatoryElementEvolutionProject/src/RegElEvoCode/prepareDataForPrank.py", line 160, in
prepareDataForPrank(options)
File "/home/ikaplow/RegulatoryElementEvolutionProject/src/RegElEvoCode/prepareDataForPrank.py", line 149, in prepareDataForPrank
currentSequence = fa.fetch(bedDict[peak][0], (bedDict[peak][1], bedDict[peak][2]))

I have checked that the chromosome is in the fasta file and that the coordinates are within the chromosome boundaries. I am using pyfastx version 0.8.4 with python 3.7.6.

It would be great if someone could help me figure out what to do about this. Thanks!

[Feature Request] Read from stdin option

It would be awesome if pyfastx could read from stdin for streaming purposes between programs.

import sys, pyfastx

def main():
    for id, seq in pyfastx.Fasta(sys.stdin, build_index=False):
        print(id)
if __name__ == "__main__":
    main()

Here's the error:

Traceback (most recent call last):
  File "/Users/jespinoz/test.py", line 7, in <module>
    main()
  File "/Users/jespinoz/test.py", line 4, in main
    for id, seq in pyfastx.Fasta(sys.stdin, build_index=False):
TypeError: a bytes-like object is required, not '_io.TextIOWrapper

Python 3.10 compatibility

Can future versions of this incorporate Python 3.10 compatibility? The Bioconda install, at least, requires Python < 3.10.

Return full sequence name without building index

Hi Lianming,

Thank you for writing such a great tool! I'm using it in several projects I'm involved in now.

I was wondering if it is possible to return the full sequence name instead of the string split on whitespace when build_index is False? It would be great if it could be controlled using the key function as it is when an index is built. This is critical to the work that I'm doing where sequence names do not reliably have unique strings for the first part of the header.

I hope I haven't just missed something in the documentation!

Thanks!
Connor

Support xz-compressed files

Hello @lmdu,

I am planning to use pyfastx within Nextstrain's Augur to support a new data curation command and it would be really helpful to be able to support xz-compressed files. Would you be open to extending pyfastx to support xz-compressed files?

Groups working with large files are using xz to save space because xz has a better compression ratio than gzip. For example, Nextstrain hosts a file of all GenBank SARS-CoV-2 genomes that is xz-compressed.

With the condition that the file was originally compressed in multiple short blocks, it is possible to randomly access xz-compressed files. python-xz is an example of this in pure Python and xz-random-access is an example of this in C.

Thank you!

Read comment

Hey, nice tool - quite useful for extending pyfaidx-like access to FASTQ. Is there any chance you could implement reading the comment from a read header?

Currently the only accessible attribute is read.name when iterating over pyfastx.Fastq

Failure to properly parse ENCODE's GRCh38 reference

Hello,

I wanted to use pyfastx for some ENCODE analysis and tried to use the Fasta class on the GRCh38 reference found here (more info here). When I iterate through the fa object I get unexpected results (i.e. the "description" from chr1 is the header for chr2). Interestingly enough, opening the file as a Fastx object seems to work as expected. Unfortunately, I really want to use the fetch method, which isn't implemented for Fastx.seq objects.

To troubleshoot, I went ahead and tested the hg38.fa.gz fasta from UCSC found here. Unfortunately, I'm getting odd results with this file too. For example, many of the description fields are short nucleotide sequences (e.g. for chr10 I get "CGGGC").

The description field errors may not have been an issue, except that when you hit the end of the FASTA file you get a UnicodeDecodeError when the description field is accessed.

I am using version 0.8.4.

Here is a Google Colab notebook replicating the issue.

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position

python version: 3.8.12
pyfastx version: 0.8.4

import pyfastx 
fq2 = pyfastx.Fastq("./tmp/tmp2_2.fq.gz")
for read in fq2:
    print(read.seq)

here is the header of my fq file:

@A01225:536:HFNG7DSX5:3:1101:1217:1000 2:N:0:TAAGGCGA+GCGATCTA
CAAGCGTTGGCTTCTCGCATCTGTACGGTGTCAACGCTTAATCCACGTGCTTGAGAGGCCAGAGCATTCGACACAGAATTTTTTTTTTTTTTCGGTTAGTACGAGATGGCTTCACACCTCGCGGTGCATCCAACTCTCGCCTATCGAAGT
+
,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FF:FFFF:,FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFF:F,FF::FFF:FF,:FF,:F,FFFFF,FFF:FFFF,FFFF,F,FFF:,F:FFF,FF:F
@A01225:536:HFNG7DSX5:3:1101:1253:1000 2:N:0:TAAGGCGA+GCGATCTA
CAAGCGTTGGCTTCTCGCATCTTCGTCCATCATACTAGTAATCAACGTGCTTGAGAGGCCAGAGCATTCGAAGATCTTTTTTTTATTTTTTTTTGTTTTTGTTTTAAAAAATTTATGAAAAAATTGATACGCCAATTATTACTAAAACTC
+
:F:,F,FFFFFF,FFFFFFFFFF:,FF:FFFF,F,FFFFFFFFF,FFFFFFFFFFFF,F,FFF,FFFFF,F,F,,,:FFFFFFF,F:F:FFFF:,F:,FF,,::F:,FFFF,FF,,,,F,FF,F:::,:,,F,:F,:F,F:,:F:,F,F,

Multiprocessing with apply_async does not work

Hi @lmdu ,

Sorry again, found one more issue while trying to parallelize and share the index with multiple processes.

Here is the code:

from multiprocessing import Manager, Pool
from pyfastx import Fasta

def print_seq_names(fasta_obj, lock):
    for i in range(5):
        with lock:
            print(fasta_obj[i].name)

def error_call(err):
    print(err)

fasta_obj = Fasta("uniprot_sprot.fasta.gz")
pool = Pool(5)
lock = Manager().Lock()

for i in range(4):
    pool.apply_async(print_seq_names, args=(fasta_obj, lock), error_callback=error_call)

pool.close()
pool.join()

Error:
<multiprocessing.pool.ApplyResult object at 0x7f7605085090>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f201d0>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects

Is it not possible to share the Fasta object or the index or the identifier objects with multiple processes anymore?

Also, if I make the fasta object and the identifier object in my code (the above is sample dummy code) global variables, I see only 1 process/core running at a time (out of 64 cores in the real code) and the rest of them are in the sleep state. Do you know why this is the behaviour?

Any help here would be great as well,

Thank you!

P.S:
OS: CentOS 7
Python : 3.7.7
pyfastx: 0.8.3

pyfastx.Fasta(build_index=False) using all available threads

What is pyfastx.Fasta using that defaults to using all threads (e.g., BLAS)? I'm simply running:

for name, seq in pyfastx.Fasta(fasta_file, build_index=False)

...and all 80 threads on my server are used. I obviously want to reduce this.

env

OS: Ubuntu 18.04.4
python: 3.6.10
blas: 1.1
pyfastx : 0.6.14

iteration on record?

Thanks for developing pyfastx, that's a really nice tool! However, I have a basic question. When you iterate over the fasta object or over pyfastx.Fasta(...) directly, does it always report names and sequences as strings? What if I want to iterate over sequence objects to rename them using the description field? Maybe I missed something. Thanks again!

ValueError in test suite with pyfaidx 0.7.1

Hi,

While working on packaging pyfastx for Debian, in order to fulfill the new dependency requirements of augur, I noticed that the way pyfaidx is called to produce the reference object at the negative index is somehow broken (at least with pyfaidx 0.7.1, python3.10 and python3.11 in Debian sid). It raises a -probably undue- test error:

ERROR: test_seq_by_index (tests.test_sequence.SequenceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/pyfastx-LEa3AU/pyfastx-0.8.4/tests/test_sequence.py", line 54, in test_seq_by_index
    expect = self.faidx[idx][:]
  File "/usr/lib/python3/dist-packages/pyfaidx/__init__.py", line 1099, in __getitem__
    rname = next(islice(self.records.keys(), rname, None))
ValueError: Indices for islice() must be None or an integer: 0 <= x <= sys.maxsize.

Manually implementing the negative index wrap around works around the problem and allows me to proceed to the actual negative index test on pyfastx, which in turn works as expected.

--- a/tests/test_sequence.py
+++ b/tests/test_sequence.py
@@ -51,7 +51,7 @@ class SequenceTest(unittest.TestCase):
 
 		#test negative index
 		idx = (self.get_random_index() + 1) * -1
-		expect = self.faidx[idx][:]
+		expect = self.faidx[self.count+idx][:]
 		result = self.fastx[idx]
 
 		self.assertEqual(expect.name, result.name)

I'm not entirely sure whether the problem lies in pyfastx, or if this is new, buggy behavior in pyfaidx. Thus, I haven't prepared a full-fledged merge request, but I thought you might be interested in being aware of the problem anyway.

Have a nice day, :)
Étienne.

Questions about pyfastx obtaining sequence

  1. When the last line of the fasta file is not a blank line, the program crashes when processing the last sequence (fa['test'].seq).
  2. When the input sequence name does not exist, a more human-friendly message should be given instead of a KeyError.

Segfault when using full_name=True for Fasta iteration

Hi Lianming,

I'm getting a segfault when the full_name flag is set to True and build_index is set to False for any fasta file. This behaviour occurs in 0.8.0 and 0.8.1, while 0.7.0 functions fine.

Here's an example line:

for name, seq in Fasta(fasta_input, build_index=False, full_name=True):

It segfaults on the next line.

Let me know if you need any more information.

Thanks!
Connor

Seems to have trouble with large fasta files

Wanted to benchmark this against pyfaidx for getting subsequences, but I can't get pyfastx to work with my hg19 genome. pyfastx indexes the file OK, and then says it has 28 sequences, but when I try to access subsequences, I get either empty strings or crashes.

>>> f = pyfastx.Fasta('hg19.fa')
>>> f.fetch('chr1', (100000,110000))
''
>>> f.fetch('chr1', (1000000,1100000))
SystemError: Objects/longobject.c:244: bad argument to internal function
>>> s = f[-1]
SystemError: Objects/longobject.c:244: bad argument to internal function
>>> f.keys()
<Identifier> contains 28 identifiers
[~]# pyfastx info hg19.fa
Sequence counts: 28
Total bases: 3095694378
GC content: 40.90%
A counts: 844868179
T counts: 846097379
C counts: 585018013
G counts: 585360526
N counts: 234350281
Mean length: 110560513.50
Median length: 111259709.00
Max length: 249250621
Min length: 131
N50, L50: 155270560, 8
length >= 1000: 25

This is Python 2.7 running on CentOS. This is version 0.5.6 of pyfastx.

segfaults

I seem to be getting memory allocation issues ... I would love to use this package but not sure how to fix this.

python(9050,0x10f3a9dc0) malloc: Incorrect checksum for freed object 0x7f917f498558: probably modified after being freed.
Corrupt value: 0x0
python(9050,0x10f3a9dc0) malloc: *** set a breakpoint in malloc_error_break to debug
[1] 9050 abort python test.py

It's always at the same read.

@FS10000899:17:BPC29511-2322:1:1101:15960:1860 2:N:0:1
GGAAGATCGAGTAGATCAAATACATACGATATGGAAGTGGGAACTACCATGCGAACGGAAACGTTCGAACTACATGGCCCACTTCCTAAGTCGTATGTAAAAGAAACAACAACAACAACCCACCATTTTGT
+
FF,,F:F,FFFF::F:F,:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFF

fq1 = pyfastx.Fastq("output/Mapping_Files/test_mate1_val_1.fq")
count = 0
total = 0
for r1 in fq1:
    total += 1
    print(r1.name)
    print(r1.seq)
    seq_sub = r1.seq
    found = 0
    for c in codes.values():
        if seq_sub.find(c) != -1:
            found = 1
            break
    count += 1
print(count, total)

[BUG] Unable to get sequences after encountering `KeyError`

I'm trying to access sequences with a try/except block like so:

import pyfastx

# sequences.fasta contains three sequences: 'seq_a', 'seq_b', 'seq_c'
sequences = pyfastx.Fasta('sequences.fasta')
for seq_id in ['seq_a', 'bad_sequence', 'seq_a', 'seq_b', 'seq_c']:
    try:
        sequence_record = sequences[seq_id]
    except KeyError:
        ...

Once I encounter a KeyError, all subsequent tries to get a sequence fail with a KeyError.
In my example, I am able to access seq_a the first time, but I get an error for any of the sequences after bad_sequence.

Python version 3.7
pyfastx version 0.8.4

Add full_name=True option to pyfastx.Fastq

Thanks again for a great library -- would it be possible to add full_name=True to the Fastq parser? ONT (nanopore) reads by default have spaces in the FASTQ header that contain useful information. Currently, there doesn't seem to be a way to keep that information, as the parser seems to truncate the header at the first space.

Thanks,
Jon

Writing fast

Hey @lmdu
Thank you for your package. Can you suggest an efficient way to write fasta files?

I would need to read a fasta file, rename the headers, and write them to one or multiple files.
Reading is very easy in pyfastx, but to write to files I found no better way than the following.

        fa = pyfastx.Fasta(input.fasta)

        representatives = fa.keys()

        with open(output.faa, 'w') as outf:
            for gene in fa:
                new_name = map_names[gene.name]
                lines = '\n'.join(str2multiline(gene.seq))
                outf.write(f">{new_name} {gene.description}\n{lines}\n")

If I don't need to rename I can use the seq.raw attribute but I don't know if there would be a faster way.

Segfault when loading index from multiple processes

I am dispatching several hundred jobs on an HPC that all begin by loading a previously-created pyfastx index file. Around a third of the jobs will immediately segfault:

(gdb) bt
#0  0x00002aaaaaf5e580 in fileno_unlocked () from /lib64/libc.so.6
#1  0x00002aaab35de26e in is_readonly (fd=fd@entry=0x0) at src/zran.c:2639
#2  zran_import_index (index=index@entry=0x5555565f4b20, fd=fd@entry=0x0) at src/zran.c:2639
#3  0x00002aaab35d76be in pyfastx_load_gzip_index (index_file=<optimized out>, gzip_index=0x5555565f4b20, index_db=0x5555566a0508) at src/util.c:513
#4  0x00002aaab35db42c in pyfastx_fastq_load_index (self=self@entry=0x2aaadfe79938) at src/fastq.c:223
#5  0x00002aaab35dc210 in pyfastx_fastq_new (type=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at src/fastq.c:322
#6  0x00005555556f78e5 in type_call () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/typeobject.c:895
#7  0x000055555566787b in _PyObject_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/abstract.c:2331
#8  0x00005555556f774e in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4861
#9  0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#10 0x00005555556c60ab in _PyFunction_FastCall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4919
#11 fast_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4954
#12 0x00005555556f76d5 in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4858
#13 0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#14 0x00005555556ca669 in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2,
    kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0, locals=0x2aaaaaba35e8, globals=0x2aaaaaba35e8, _co=0x2aaab2374ae0)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4166
#15 PyEval_EvalCodeEx () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4187
#16 0x00005555556cb3fc in PyEval_EvalCode (co=co@entry=0x2aaab2374ae0, globals=globals@entry=0x2aaaaaba35e8, locals=locals@entry=0x2aaaaaba35e8)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:731
#17 0x000055555576cc94 in run_mod () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:1025
#18 0x000055555576d091 in PyRun_FileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:978
#19 0x000055555576d293 in PyRun_SimpleFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:419
#20 0x000055555576d39d in PyRun_AnyFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:81
#21 0x0000555555770dc9 in run_file (p_cf=0x7fffffff3f9c, filename=0x5555558ab340 L"/nethome/schrammca/SONAR/annotate/find_umis.py", fp=0x5555559243f0)
    at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:340
#22 Py_Main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:810
#23 0x000055555563844e in main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Programs/python.c:69
#24 0x00002aaaaaf0cb15 in __libc_start_main () from /lib64/libc.so.6
#25 0x0000555555720d7f in _start () at ../sysdeps/x86_64/elf/start.S:103

This happens even if I only start 20 worker jobs concurrently, although the frequency of segfaults is lower (~2/20 jobs vs ~70/200 jobs vs ~450/650 jobs).

Corrupt quality values

Hi

I've got an unaligned BAM file from a PacBio long-read run which I've converted to a fq using bedtools. I am then trying to use pyfastx to play around with the file and I've noticed something (that could be my own fault).

When I load the fq.gz file from within Python, pyfastx will open the file, index it, and then I can start calling the stats on it.
If, however, I use the command line interface, it will index fine, but when I try to run the stats it gives the following error:

Traceback (most recent call last):
  File ".../bin/pyfastx", line 8, in <module>
    sys.exit(main())
  File ".../lib/python3.9/site-packages/pyfastxcli.py", line 578, in main
    args.func(args)
  File ".../lib/python3.9/site-packages/pyfastxcli.py", line 99, in fastx_info
    fq.maxlen, fq.minlen, fq.maxqual, fq.minqual, ",".join(fq.encoding_type)]
TypeError: Quality values corrupt. found [34, 126] where [33, 104] was expected

Any ideas what might be going on?

Iterating over sequences without comments results in comments from previous records

Using pyfastx version 0.8.4, I tried iterating over records but the "comment" attribute is not reset.
Is there a safer way to iterate (I would like not to assume FASTA format) or did I just miss something obvious?

File (test.fa):

>1 comment
TTTTTTTT
>3
atatat

Code:

import pyfastx
fa = pyfastx.Fastx("test.fa")
for name,seq,comment in fa:
    print(comment)

Output:

comment
comment

Thanks for your help

get seq count and length error

I got an error in pyfastx 0.6.5 installed with conda on a Linux system


>>> import pyfastx
>>> fa=pyfastx.Fasta('all_bins/ERR2059947_metabat_58.fasta')
RuntimeError: get seq count and length error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <class 'Fasta'> returned a result with an error set

some sequences are missing in pyfastx.Fasta object

I loaded a fasta file containing 4542 sequences with average length of 2.5kb, however only 4539 sequences were in the pyfastx.Fasta object.

fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540'] # keyError

Besides, I could access a sequence, e.g. fa['contig_999'], the first time. But when I try to access it again I get a KeyError.

The version of pyfastx I used is 0.8.4, Python version 3.7

Possible memory leak in keys

Hi,

I have downloaded the trembl dataset from UniProtKB (48GiB) and created an index (23GiB).

from pyfastx import Fasta

# Create index
fobj = Fasta('uniprot_trembl.fasta.gz', build_index=True)

# Extract keys
sample_ids = fobj.keys()

# Apply length filter
sample_ids.filter(sample_ids >= 11)

# Iterate through the filtered sample ids (before the end of the iteration a total of 12.4 GiB of memory is seen to be consumed)
dummy_count = 0
for idx, key in enumerate(sample_ids, start=1):
    dummy_count += 1

The above iteration is shown here for reproducibility purposes and to demonstrate that the memory is consumed no matter what. Can you please take a look at this and offer a solution?

Thanks in advance!

P.S:
System info:
OS: CentOS 7
Python 3.7.7
pyfastx version: 0.8.3

How to safely close file handles with Fastx?

It is not clear how to safely close any file handles that are opened by Fastx (or any of the other classes provided by this library).

It would be nice to be able to use Fastx with contextlib for example.

Select index output folder

Nice package! Is there a way to add a parameter to select an output folder for the index file? Sometimes the user might not have write permission to the source folder.

Silent errors when reading other compressed formats

Currently pyfastx checks to see if a file is gzip compressed, and if it is not it seems to interpret it as text. An unfortunate side effect of this is that if another compressed format is used, the program will happily return garbage reads:

>>> reads_1 = [x for x in pyfastx.Fastq("sample.fq.gz")]
>>> reads_1[:5]
[<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1 with length of 16,
 <Read> 140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1 with length of 16]
>>> reads_2 = [x for x in pyfastx.Fastq("sample.fq.bz2")]
>>> reads_2[:5]
[<Read> Zh91AY&SY��7a��}^ with length of 4,
 <Read>  with length of 2,
 <Read>  with length of 257,
 <Read> Q4��TH��*���m�R�@�6�+� with length of 0,
 <Read> E��hĶ
  with length of 206]

Some basic error checking to make sure that the input file isn't in binary format, or perhaps a sterner warning in the README that this can happen, would be helpful!

pyfastx.Fasta leaking memory?

I am opening a large fasta file via

import pyfastx
fasta_sequences = pyfastx.Fasta('data/uniparc_active.fasta', build_index=True)

and I am seeing memory utilization in top increase until 100%.

Is this intended behavior, or is there a memory leak in the library?

Thanks,
Martin

Slice feature for keys

Hi,

Thank you for this handy tool, really useful!

I am writing a tool that uses pyfastx, and it would be really nice to have a slice feature for keys instead of iterating over every key and saving them in a list object.

from pyfastx import Fasta
fasta_obj = Fasta("fasta_file.fasta", build_index=True)
fasta_keys = fasta_obj.keys()

fasta_keys[0:10]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence index must be integer, not 'slice'

Thanks in advance!

Poor performance when reading gzipped FASTQ sequentially

As part of publishing my pure Python FASTQ/FASTA package I ran some benchmarks/comparisons with pyfastx, which you can find in the package documentation: https://fqfa.readthedocs.io/en/latest/benchmark.html

pyfastx was unexpectedly slow when processing a gzipped FASTQ file sequentially, although it did fine when the file was uncompressed. I don't know what the root cause of this might be (maybe it's getting a new block for each read?), but I wanted to bring it to your attention.

pip install broken in version 0.8.0

schrammca@ai-hpcn151:~$ pip3 install pyfastx --user
Collecting pyfastx
  Using cached pyfastx-0.8.0.tar.gz (74 kB)
    ERROR: Command errored out with exit status 1:
     command: /sysapps/cluster/software/Anaconda3/5.3.0/envs/python_3.7_env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py'"'"'; __file__='"'"'/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-2jdcamx4
         cwd: /tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-9lau04qa/pyfastx_100ad86c0e9b4cd691b9ebe7ad33ee39/setup.py", line 51, in <module>
        with open(os.path.join('src', 'version.h')) as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'src/version.h'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I was able to pip3 install pyfastx==0.7.0 --user successfully...

Does pyfastx split fasta files sequentially?

I want to split paired-end fastq files. If I run pyfastx split -n 30 on each, will the pair order be preserved, such that subset 1 of the forward reads and subset 1 of the reverse reads will be correctly paired?

Getting the full sequence name when processing FASTQ files

Because compressed FASTQ files are huge, building an index takes a long time and uses a lot of disk space. If I am only iterating over the FASTQ file sequentially, without random access, there is no need to build an index. However, without an index, fetching the sequence id from the returned tuple only gives the short name. Could the program return the full sequence name without building an index?

>>>head -1 test.fq
>>>@A00838:157:H2Y5TDSXY:4:1101:1398:1000 1:N:0:ATTCAGAA+CCTATCCT

for read in pyfastx.Fastq('test.fq',build_index=False):
   print(read[0])
>>>A00838:157:H2Y5TDSXY:4:1101:1398:1000

small caps

Hi -
Thanks for developing pyfastx, very useful and convenient. I had a question: how does pyfastx handle soft-masked genomes with lower-case characters? I notice that such files sometimes generate errors, or at least that everything is converted to upper case. Thanks!

Filter fastq file (performance)

I have a paired-end single-cell RNA-seq dataset. R1 consists of all the reads, and R2 of the barcodes necessary to identify which cell a read belongs to. If I now only trim R1 to keep high-quality reads, my R1 and R2 are out of sync.

It seems like pyfastx can solve this problem for me, by keeping only the reads in R2 that also have a mate in R1:

import gzip

with gzip.open(output, 'wt') as f:
    for read in reads:
        barcode = barcodes[read.id]
        f.write(barcode.raw)

However, from a really sloppy benchmark, it seems like just getting barcode.raw takes around 0.0015 seconds per read, while the lookup of the read is fast: 1e-6. This would mean I have to wait two days to filter my fastq. Is there an easier/better/faster way of doing this?

[Feature Request] Get barcodes from Fastx function

Hi, I would like to get barcode counts using the Fastx function.
If this already is an option please let me know.

Right now I am calling:
for currentRead in pyfastx.Fastx(fastqFile): ...
To iterate through rows to get some statistics, but I would also like barcode counts. With normal Python code I do it like so, but it's quite slow:

import gzip

barcodes = {}
with gzip.open(myFastq) as fastq:
        for line in fastq:
                if not line.startswith(b'@'): continue
                bc = line.decode("utf-8").split(':')[-1].strip()
                # print(bc)
                if bc not in barcodes:
                        barcodes[bc] = 1
                else:
                        barcodes[bc]+=1

Fastx has sped up some of my other data collection functions so I was hopeful it could do this too!
Thank you

Unexpected behavior when a Sequence object is returned from a function

Hello!

I made a trivial function that returns a sequence from a Fasta class expecting it to return an object of the Sequence class.

However, it returns meaningless and different characters every time the script is run.

On the contrary, if I don't use the function, but extract the gene in the main routine, I can handle the Sequence class object as expected.

Is this behavior to be expected or is it a bug?

For the test I used this file https://github.com/lmdu/pyfastx/blob/master/tests/data/test.fa

Here is the script:

import pyfastx
import sys

def extract_gene(file, gene):
    fasta = pyfastx.Fasta(file)
    return fasta[gene]

if __name__ == '__main__':
    print(f'Python version: {sys.version}')
    print(f'pyfastx version: {pyfastx.version()}\n')
    print('Extracting in __main__\n')
    fasta = pyfastx.Fasta('test.fa')
    gene = fasta['JZ822577.1']
    print(gene)
    print('\n\nExtracting in extract_gene function\n')
    gene = extract_gene('test.fa', 'JZ822577.1')
    print(gene)

output is as follows:

Python version: 3.9.1 | packaged by conda-forge | (default, Jan 26 2021, 01:34:10)
[GCC 9.3.0]
pyfastx version: 0.8.4

Extracting in __main__

CTCTAGAGATTACTTCTTCACATTCCAGATCACTCAGGCTCTTTGTCATTTTAGTTTGACTAGGATATCGAGTATTCAAGCTCATCGCTTTTGGTAATCTTTGCGGTGCATGCCTTTGCATGCTGTATTGCTGCTTCATCATCCCCTTTGACTTGTGTGGCGGTGGCAAGACATCCGAAGAGTTAAGCGATGCTTGTCTAGTCAATTTCCCCATGTACAGAATCATTGTTGTCAATTGGTTGTTTCCTTGATGGTGAAGGGGCTTCAATACATGAGTTCCAAACTAACATTTCTTGACTAACACTTGAGGAAGAAGGACAAGGGTCCCCATGT


Extracting in extract_gene function

@B5��@B5��� ��U� ��U@B5�
