
biobear's Introduction

biobear

biobear is a Python library designed for reading and searching bioinformatic file formats, using Rust as its backend and producing Arrow batch readers and other downstream outputs (such as Polars DataFrames or DuckDB relations).

The Python package has minimal dependencies; Polars is only needed for the `to_polars` method. biobear can be used to read various bioinformatic file formats, including FASTA, FASTQ, VCF, BAM, and GFF, locally or from an object store like S3. It can also query some indexed file formats, such as VCF and BAM, locally.


Please see the documentation for information on how to get started using biobear.

Quickstart

To install biobear, run:

pip install biobear
pip install polars # needed for `to_polars` method

Create a file with some GFF data:

echo "chr1\t.\tgene\t1\t100\t.\t+\t.\tgene_id=1;gene_name=foo" > test.gff
echo "chr1\t.\tgene\t200\t300\t.\t+\t.\tgene_id=2;gene_name=bar" >> test.gff

Then you can use biobear to read a file:

import biobear as bb

session = bb.connect()
df = session.sql("""
    SELECT * FROM gff_scan('test.gff')
""").to_polars()

print(df)

This will print:

┌─────────┬────────┬──────┬───────┬───┬───────┬────────┬───────┬───────────────────────────────────┐
│ seqname ┆ source ┆ type ┆ start ┆ … ┆ score ┆ strand ┆ phase ┆ attributes                        │
│ ---     ┆ ---    ┆ ---  ┆ ---   ┆   ┆ ---   ┆ ---    ┆ ---   ┆ ---                               │
│ str     ┆ str    ┆ str  ┆ i64   ┆   ┆ f32   ┆ str    ┆ str   ┆ list[struct[2]]                   │
╞═════════╪════════╪══════╪═══════╪═══╪═══════╪════════╪═══════╪═══════════════════════════════════╡
│ chr1    ┆ .      ┆ gene ┆ 1     ┆ … ┆ null  ┆ +      ┆ null  ┆ [{"gene_id","1"}, {"gene_name","… │
│ chr1    ┆ .      ┆ gene ┆ 200   ┆ … ┆ null  ┆ +      ┆ null  ┆ [{"gene_id","2"}, {"gene_name","… │
└─────────┴────────┴──────┴───────┴───┴───────┴────────┴───────┴───────────────────────────────────┘

Using a Session w/ Exon

BioBear exposes a session object that can be used with exon to work with files directly in SQL and then convert the results to a DataFrame if needed.

See the BioBear Docs for more information, but in short, you can use the session like this:

import biobear as bb

session = bb.connect()

session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3 STORED AS GFF LOCATION 's3://BUCKET/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("""
    SELECT * FROM gene_annotations_s3 WHERE score > 50
""").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

Ecosystem

BioBear aims to make it easy to move data to and from other prominent Python data tools. Generally, if a tool can read Arrow or Polars, it can read BioBear's output. See the documentation for examples of using BioBear with other tools.
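For example, because query results are Arrow data, they can be handed to DuckDB directly. The sketch below is a minimal illustration and makes two assumptions: that `duckdb` is installed, and that the query result exposes an Arrow record batch reader via `to_arrow_record_batch_reader()` (the method name may differ across biobear versions).

import biobear as bb
import duckdb

session = bb.connect()

# Run the query through biobear and collect the Arrow batches into a
# pyarrow.Table; to_arrow_record_batch_reader() is an assumption here.
reader = session.sql("SELECT * FROM gff_scan('test.gff')").to_arrow_record_batch_reader()
gff_table = reader.read_all()

# DuckDB can query registered Arrow tables directly, no copy into another
# format is needed.
con = duckdb.connect()
con.register("gff", gff_table)
print(con.sql("SELECT type, count(*) AS n FROM gff GROUP BY type").fetchall())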

Performance

Please see exon's performance metrics for thorough benchmarks; in short, biobear is generally faster than other Python libraries for reading bioinformatic file formats.

For example, here are quick benchmarks for reading one FASTA file with 1 million records and for reading 5 FASTA files, each with 1 million records, from the local file system on an M1 MacBook Pro:

Library     1 file (s)           5 files (s)
BioBear     4.605 s ± 0.166 s     6.420 s ± 0.113 s
BioPython   6.654 s ± 0.003 s    34.254 s ± 0.053 s

The larger difference for multiple files is due to biobear's ability to read multiple files in parallel.

biobear's People

Contributors

abearab, dependabot[bot], ghuls, tshauck


biobear's Issues

Integer Encoding

Tasks

Bam file loading functionality

Loading and sending data from a BAM file to a Polars DataFrame can take up to an hour for larger files. Often, not all columns are necessary for downstream processing, so having the option to load only a select number of columns would reduce both computation time and memory consumption.
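A possible workaround today is to project only the needed columns in SQL before converting to Polars, so the unused columns never reach the DataFrame. A minimal sketch, assuming a local BAM path and that `bam_scan`'s schema exposes columns named `name`, `reference`, and `start` (the actual column names may differ):

import biobear as bb

session = bb.connect()

# Select only the columns needed downstream instead of `SELECT *`;
# the column names here are assumptions about bam_scan's schema.
df = session.sql("""
    SELECT name, reference, start
    FROM bam_scan('example.bam')
""").to_polars()
print(df.shape)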

Why is specifying the extension required when reading files?


When the wrong one is specified, an empty dataframe is returned.

In [41]: fasta = session.read_fasta_file("Ns.fa", options=bb.FASTAReadOptions()).to_polars()

In [42]: fasta
Out[42]: 
shape: (0, 3)
┌─────┬─────────────┬──────────┐
│ id  ┆ description ┆ sequence │
│ --- ┆ ---         ┆ ---      │
│ str ┆ str         ┆ str      │
╞═════╪═════════════╪══════════╡
└─────┴─────────────┴──────────┘

In [43]: fasta = session.read_fasta_file("Ns.fa", options=bb.FASTAReadOptions(file_extension="fa")).to_polars()

In [44]: fasta
Out[44]: 
shape: (1, 3)
┌──────┬─────────────┬─────────────────────────────────┐
│ id   ┆ description ┆ sequence                        │
│ ---  ┆ ---         ┆ ---                             │
│ str  ┆ str         ┆ str                             │
╞══════╪═════════════╪═════════════════════════════════╡
│ chr1 ┆ null        ┆ NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN… │
└──────┴─────────────┴─────────────────────────────────┘

In [47]: fasta = session.read_fasta_file("/home/luna.kuleuven.be/u0079808/Downloads/Ns.fa", options=bb.FASTAReadOptions(file_extension="fasta",)).to_polars()

In [48]: fasta
Out[48]: 
shape: (0, 3)
┌─────┬─────────────┬──────────┐
│ id  ┆ description ┆ sequence │
│ --- ┆ ---         ┆ ---      │
│ str ┆ str         ┆ str      │
╞═════╪═════════════╪══════════╡
└─────┴─────────────┴──────────┘

It would be nice if the extension and compression were automatically detected by checking the file extension(s) when not explicitly set.

Also, at the moment the fasta_sequence_data_type option can't be set, as FastaSequenceDataType is not exposed.

Fastq files are not fully read

I am trying to read a fastq.gz file and convert it into a Polars data frame, but no matter which FASTQ file I choose, only a small number of records is read, and that number more or less scales with the read length. For example, in files with shorter reads (28 bp) I can open up to ~600 entries; with longer ones (100 bp), only ~250 or so.

It seems to me as if there is a limit of bytes being read that caps the reader.

import biobear as bb
from biobear.compression import Compression

# an example with short reads; `files` is a list of paths to gzipped FASTQ files
for file in files:
    print(bb.FastqReader(file, compression=Compression.GZIP).to_polars().shape)
(591, 4)
(591, 4)
(591, 4)
(589, 4)
(592, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
(591, 4)
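For comparison, a sketch of the same count done through the session API rather than the deprecated FastqReader. Both `fastq_scan` as the table-function name and automatic handling of the .fastq.gz compression are assumptions here, so this is a cross-check to try, not a confirmed fix:

import biobear as bb

session = bb.connect()

# Count records per file via SQL; `files` is the same list of .fastq.gz
# paths as in the snippet above.
for file in files:
    n = session.sql(f"SELECT count(*) AS n FROM fastq_scan('{file}')").to_polars()
    print(file, n["n"][0])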

Reduce Dependencies

With the recent aligner addition, new dependencies (rust-bio) can make installing/compiling biobear more burdensome for users. Obviously, we'd like to avoid that if possible. See #106.

Loading in of large BAM files

Hello,

I have an open issue that seems to be related to BioBear's handling of large .bam files. The stack trace points to the use of polars-u64-idx as a potential solution.

Last line of the stack trace: pyo3_runtime.PanicException: polars' maximum length reached. Consider installing 'polars-u64-idx'.: TryFromIntError(())

Issue opened by a user on my github: TRISTAN-ORF/RiboTIE#8

Link to code line using biobear: https://github.com/TRISTAN-ORF/transcript_transformer/blob/235fde98f0719ed86f244a0853d3b9c8f4e9edce/transcript_transformer/data.py#L132

Furthermore, today I had transcript-transformer also hang on the loading step for a medium-sized (4 GB) .bam file. Could be related.

Deprecation points to non-existent docs

DeprecationWarning: FastaReader is being deprecated, please use a table function via the session.
See https://www.wheretrue.dev/docs/exon/exondb/api-reference/table-functions for more info.

That link leads to a "Page Not Found" error.

Generated from this code:

bb.FastaReader(seqfile).to_arrow()

(So not using exon)
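For reference, the session-based call the warning is nudging toward looks roughly like the sketch below. The table-function name `fasta_scan` and the `to_arrow()` method on the query result are assumptions and may differ from the actual API:

import biobear as bb

# Hypothetical path standing in for `seqfile` from the snippet above.
seqfile = "example.fasta"

session = bb.connect()

# Session-based equivalent of bb.FastaReader(seqfile).to_arrow();
# fasta_scan and to_arrow() are assumed names here.
arrow_table = session.sql(f"SELECT * FROM fasta_scan('{seqfile}')").to_arrow()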

How to run polars dataframe methods on large FASTQ files in a memory efficient way

As I mentioned in this comment, I'm interested in applying the "groupby" function on a Polars DataFrame. My goal is counting all unique sequences in a FASTQ file. I have a few questions:

  1. How can I take advantage of biobear to read large FASTQ files without loading the whole data into memory? (See the sketch at the end of this issue.)
  2. How can I use multiprocessing to accelerate the computation time? e.g. https://docs.pola.rs/user-guide/misc/multiprocessing/

Thanks for your time.


related #89 and ArcInstitute/ScreenPro2#28
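A minimal sketch for the first question: push the group-by into SQL so the full FASTQ never has to be materialized as a DataFrame. The table-function name `fastq_scan` and the `sequence` column name are assumptions, and whether the aggregation streams rather than buffering everything is a property of the engine, not guaranteed here:

import biobear as bb

session = bb.connect()

# Count occurrences of each unique sequence inside the query engine and
# only bring the (much smaller) aggregated result into Polars.
counts = session.sql("""
    SELECT sequence, count(*) AS n
    FROM fastq_scan('reads.fastq.gz')
    GROUP BY sequence
    ORDER BY n DESC
""").to_polars()
print(counts.head())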

`FastaReader` returns empty pandas dataframe

I'm trying to load this FASTA data using the FastaReader module, but it seems there is something wrong! Any thoughts?

(screenshot of the failing FastaReader call omitted)


Python implementation: CPython
Python version       : 3.9.18
IPython version      : 8.18.1

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 5.10.0-28-cloud-amd64
Machine     : x86_64
Processor   : 
CPU cores   : 16
Architecture: 64bit

matplotlib: 3.6.3
screenpro : 0.2.6
anndata   : 0.10.5.post1
biobear   : 0.14.6
sys       : 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0]
pandas    : 1.5.3
polars    : 0.20.10

Why different handling between GFF and mzml/genbank in polars.

All have map columns, but GFF is converted OK into a Polars df, while the latter two are not.

name column in BED file is limited to 255 bytes.

In [80]: %time  b = session.read_bed_file("name_256bytes.one.bed")
CPU times: user 87 µs, sys: 13 µs, total: 100 µs
Wall time: 103 µs

In [81]: %time c = b.to_polars()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File <timed exec>:1

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/table.pxi:4755, in pyarrow.lib.Table.from_batches()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/ipc.pxi:666, in pyarrow.lib.RecordBatchReader.__next__()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/ipc.pxi:700, in pyarrow.lib.RecordBatchReader.read_next_batch()

File ~/software/anaconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: C Data interface error: External error: Arrow error: External error: Io error: invalid record: invalid name

A BED file with a name column of more than 255 bytes gives this error:

$ cat name_256bytes.one.bed
chr1	1000160	1000660	PURK_peak_11,INH_SST_peak_18b,INH_SST_peak_18c,GC_peak_40a,GC_peak_40b,GC_peak_40c,OPC_peak_30b,OPC_peak_30c,MOL_peak_32c,INH_VIP_peak_18a,NFOL_peak_32e,NFOL_peak_32f,AST_CER_peak_48e,AST_CER_peak_48f,ENDO_peak_7,AST_peak_17c,GP_peak_20,ASTP_peak_19e,INH_V	500	.

How to load pairs of FASTQ files from paired-end reads

As you know, it's very common to have paired-end reads, mainly in Illumina NGS technologies. I was wondering what your recommendation is for processing paired-end FASTQ files in a single session. Looking forward to hearing your thoughts.
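One possible approach, sketched under the assumptions that a `fastq_scan` table function exists and that mates share a `name` column, is to scan both files in the same session and join them on the read name:

import biobear as bb

session = bb.connect()

# Join R1 and R2 on the read name so each row carries both mates; the
# table-function name and the `name`/`sequence` columns are assumptions.
pairs = session.sql("""
    SELECT r1.name,
           r1.sequence AS sequence_1,
           r2.sequence AS sequence_2
    FROM fastq_scan('sample_R1.fastq.gz') AS r1
    JOIN fastq_scan('sample_R2.fastq.gz') AS r2 ON r1.name = r2.name
""").to_polars()
print(pairs.head())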

Infer extension and compression from file path.

Following up on #152, this is to fill out the file types for which it's possible to infer the extension and compression from the file path. As of the creation of this ticket, fastq and fasta are supported.

  • fasta
  • fastq

Relatively important, given the more diverse extensions used in practice.

  • gff (#155)
  • bed (#156)
  • hmm dom tab
  • mzml

Other

  • bam
  • bcf
  • bigwig
  • cram
  • fcs
  • genbank
  • sam
  • vcf

Random subsampling of reads from BAM files

I'm trying something like this https://www.biostars.org/p/76791/

How can I do it using biobear?

I started from this:

session = bb.connect()

subsampling_seed=42
number_of_reads=50000

bam = session.sql(f"""
    SELECT *
    FROM bam_scan('{BAM}') bam
""")

Then I could also do this, which is not exactly random selection:

bam = session.sql(f"""
    SELECT *
    FROM bam_scan('{BAM}') bam
    LIMIT {number_of_reads}
""")

Finally, I'm trying this, which is not exactly what I should do:
(screenshot of the attempted query omitted)


@tshauck – what do you think? Thanks for your help!
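One SQL-level option, sketched on the assumption that the underlying engine exposes a random() function, is to order by a random value and take the first N rows. Note this shuffles the whole table (simple rather than cheap), and it does not use the seed variable from the snippet above:

import biobear as bb

session = bb.connect()
BAM = "example.bam"        # stand-in for the BAM path used above
number_of_reads = 50000

# Shuffle rows with random() (assumed to be available in the engine) and
# keep the first N, giving an approximately uniform, unseeded sample.
bam_sample = session.sql(f"""
    SELECT *
    FROM bam_scan('{BAM}') bam
    ORDER BY random()
    LIMIT {number_of_reads}
""").to_polars()
print(bam_sample.shape)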

Support reading BED files with fewer than 12 columns.

It would be nice if BED files with fewer than 12 columns could be read.

For example, in BEDReadOptions you could specify how many of the BED columns follow the spec.
Additional columns could then be read as String columns.

Similarly to UCSC bigBed: BED3 or -type=bedN[+[P]], where N is an integer between 3 and 12 and the optional +[P] parameter specifies the number of extra fields (not required, but preferred).
http://genome.ucsc.edu/goldenPath/help/bigBed.html

Have Missing Return Empty, not `ValueError`

For whatever reason, pl.DataFrame.from_arrow doesn't seem to recognize the schema on a pa.Table... at least when it is constructed inside of Rust :/

In [7]: df.to_polars()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-69a42c23f8e0> in <cell line: 0>()
----> 1 df.to_polars()

~/miniconda3/envs/biobear/lib/python3.11/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_batches()

ValueError: Must pass schema, or at least one RecordBatch
