pyranges / bamread


Bam to Pandas DataFrame, quickly

Python 36.42% Cython 63.58%

bamread's Issues

Strand information from flag

Hi,

bamread/bamread/read.py

Line 14 in 4b3d15b

strands = pd.Series(strands).replace({16: "+", 0: "-"}).astype("category")

Shouldn't this line be the following instead?

strands = pd.Series(strands).replace({0: "+", 16: "-"}).astype("category")

Flag 16 means that SEQ is reverse complemented, so the read is on the negative strand, right?

https://en.wikipedia.org/wiki/SAM_(file_format)
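
For reference, a minimal sketch of reading the strand straight from a SAM flag. It tests bit 0x10 (decimal 16), which the SAM format defines as "SEQ being reverse complemented". The helper name `strand_from_flag` is hypothetical and is not part of bamread.

```python
REVERSE_BIT = 0x10  # SAM flag bit 16: SEQ is reverse complemented


def strand_from_flag(flag: int) -> str:
    """Return "-" if the reverse-complement bit is set, otherwise "+"."""
    return "-" if flag & REVERSE_BIT else "+"


# 83 = 1 + 2 + 16 + 64 has the reverse bit set; 99 = 1 + 2 + 32 + 64 does not.
for flag in (0, 16, 83, 99):
    print(flag, strand_from_flag(flag))
```

Masking the bit, rather than comparing the whole flag to 0 or 16, keeps the result correct when other bits (paired, duplicate, and so on) are also set.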

Cannot install, no module named Cython

Hello,

I tried to install bamread via pip 20.3.3 and Python 3.7.4 but got the following error:

Collecting bamread==0.0.5
  Using cached bamread-0.0.5.tar.gz (109 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/paul/gitclones/chipseq-visualization/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py'"'"'; __file__='"'"'/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-pip-egg-info-0ilv3x53
         cwd: /private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py", line 7, in <module>
        from Cython.Build import cythonize
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

This looks like a possible fix: https://luminousmen.com/post/resolve-cython-and-numpy-dependencies . bamread's setup.py imports Cython at the top of the file, so it cannot even run until Cython is installed, yet Cython is itself one of the dependencies that setup.py is supposed to declare.

Best,

Paul
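
Two common ways to avoid this class of error: declare Cython as a build requirement in `pyproject.toml` (the `[build-system] requires` list from PEP 518), so that pip installs it before running setup.py, or import Cython lazily and fall back to pre-generated C sources. The sketch below shows only the decision logic of the second idea. The helper names are hypothetical, and it assumes the source distribution ships a `.c` file next to each `.pyx`, which may not be the case for bamread.

```python
import importlib.util


def cython_available() -> bool:
    """True if Cython can be imported in the current environment."""
    return importlib.util.find_spec("Cython") is not None


def extension_sources(stem: str, use_cython=None) -> list:
    """Pick the .pyx source when Cython is present, else the pre-generated .c file."""
    if use_cython is None:
        use_cython = cython_available()
    return [stem + (".pyx" if use_cython else ".c")]


# In setup.py, `from Cython.Build import cythonize` would then be executed
# only on the branch that actually uses the .pyx file.
print(extension_sources("bamread/src/bamread"))
```

In the meantime, installing Cython first and then installing bamread will likely sidestep the error, since the failing import happens in the environment that runs setup.py.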

suggested API

@endrebak asked for suggestions on a possible API to make this a more general purpose read-my-bam tool.

My thought is a read_bam function that returns a pandas.DataFrame (similar to what is already here) but that:

Reads all alignments and all fields by default (including unmapped reads)
Supports subselecting the fields (columns) being read for efficiency using a parameter, say, fields. For example fields=["Chromosome", "Start", "End", "Strand"] would only read in the specified columns and return a DataFrame with only those columns. Similar to usecols in pandas.read_csv.
Supports subselecting the alignments (rows) being read to specified regions (and uses the BAM index for doing this). E.g. regions=[("chr1", 100, 10000)] would subselect to chr1:100-10000.
Supports subselecting the alignments (rows) being read according to the BAM record flags. I think adding a dedicated parameter for each flag would be the most user friendly. E.g. only_mapped=True would be the equivalent of passing -F 4 to samtools. I think it is really helpful to use named parameters here rather than making the user do bit arithmetic with binary flag codes. Basically, implement this as named arguments.
Has a max_alignments argument so the user can read just the first 10 records by passing max_alignments=10

I think one function that implements this would handle the majority of my use cases for reading BAMs in Python, and provide a much simpler API to get started with and use than pysam
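
To make the named-parameter idea concrete, here is a minimal sketch of how such arguments could be translated into the two integer masks that samtools calls `-f` (required) and `-F` (excluded). The function name and every parameter other than `only_mapped` are hypothetical; only the bit values come from the SAM format.

```python
# SAM flag bits used below
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def flag_masks(only_mapped=False, only_primary=False, only_proper_pairs=False,
               exclude_qc_fail=False, exclude_duplicates=False):
    """Translate named options into (required_flag, filter_flag) integers."""
    required = 0
    excluded = 0
    if only_mapped:
        excluded |= UNMAPPED
    if only_primary:
        excluded |= SECONDARY | SUPPLEMENTARY
    if only_proper_pairs:
        required |= PROPER_PAIR
    if exclude_qc_fail:
        excluded |= QC_FAIL
    if exclude_duplicates:
        excluded |= DUPLICATE
    return required, excluded


# only_mapped=True alone corresponds to `samtools view -F 4`
print(flag_masks(only_mapped=True))
```

With all options left at False, both masks are 0, which matches the "read everything by default" behaviour proposed above.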

compatibility with polars

Given the size of the dataframes, polars is generally faster than pandas at data transformations. Are there any plans to add an option to return the data as either a polars or a pandas DataFrame?

Issue with filter_flag while using pyranges.read_bam

Hi,

Thanks for this package, it's been super useful for my needs. Also hoping everyone is safe during these hard times.

I want to use pyranges to read in a BAM file and do some downstream intersect calculations with a BED file. I wish to read in the entire BAM file; however, I note that the filter_flag default is 1540, which excludes unmapped reads. I wish to include the unmapped read count stats in my calculations, so I tried filter_flag=1536 (that is, 1540 without flag 4, which refers to unmapped reads, per the popular Broad Institute tool), but that doesn't work. Below is a code snippet which works when I exclude unmapped reads (filter_flag=1540, the default) but fails when I try to include them (filter_flag=1536).

>>> import pyranges
>>> import bamread
>>> bam_file = pyranges.read_bam(bamfile, filter_flag=1540)
>>> bam_file.df.shape
(19979065, 5)
>>> bam_file = pyranges.read_bam(bamfile, filter_flag=1536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sradhakrishnan/venv/lib/python3.6/site-packages/pyranges/readers.py", line 190, in read_bam
    df = bamread.read_bam(f, mapq, required_flag, filter_flag)
  File "/home/sradhakrishnan/venv/lib/python3.6/site-packages/bamread/read.py", line 9, in read_bam
    f, mapq, required_flag, filter_flag)
  File "bamread/src/bamread.pyx", line 17, in bamread.src.bamread._bamread
    cpdef _bamread(filename, uint32_t mapq=0, uint64_t required_flag=0, uint64_t filter_flag=1540):
  File "bamread/src/bamread.pyx", line 69, in bamread.src.bamread._bamread
    end = a.reference_end
TypeError: an integer is required

Looking at the .pyx file in the bamread repo, lines 21 and 22 suggest that the start and end variables need to be integers, which is likely violated for unmapped reads that don't have coordinates?

Are there any workarounds in my case where I can include unmapped reads in my calculations?

If this is not possible, my fallback approach is to read in the bam file directly using pysam.AlignmentFile (and skip bamread for the unmapped read inclusion, while still using it to calculate intersection stats because that's really nice) but that's a little messy. Any help is highly appreciated!

Stay safe and thanks,
Srihari
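
As a sanity check on the two values, a small sketch that decodes a flag mask into the SAM bits it contains. The helper name `decode_flag` is hypothetical; the bit meanings are those of the SAM format.

```python
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary",
    0x200: "failed QC",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value: int) -> list:
    """List the names of all SAM flag bits set in `value`, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if value & bit]


print(decode_flag(1540))  # ['unmapped', 'failed QC', 'duplicate']
print(decode_flag(1536))  # ['failed QC', 'duplicate']
```

So 1536 is the right mask for keeping unmapped reads, which suggests the TypeError comes from later code that expects integer coordinates, not from the filter value itself.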

loading bam with many contigs (~2^16) fails

Contig IDs are represented as int16, which fails for BAMs aligned to a large number of contigs (e.g. transcriptome alignments). This looks like an easy fix and I can submit a PR.

~/anaconda3/lib/python3.8/site-packages/bamread/read.py in read_bam_full(f, mapq, required_flag, filter_flag)
     26 def read_bam_full(f, mapq=0, required_flag=0, filter_flag=1540):
     27 
---> 28     chromosomes, starts, ends, strands, flags, chrmap, qstarts, qends, query_names, query_sequences, cigarstrings, query_qualities = _bamread_all(
     29         f, mapq, required_flag, filter_flag)
     30 

~/anaconda3/lib/python3.8/site-packages/bamread/src/bamread.pyx in bamread.src.bamread._bamread_all()

~/anaconda3/lib/python3.8/site-packages/bamread/src/bamread.pyx in bamread.src.bamread._bamread_all()

OverflowError: value too large to convert to int16_t
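
For context, a signed 16-bit integer holds values from -32768 to 32767, so an index-based contig ID would overflow once it reaches 32768 (2^15). The sketch below picks the smallest signed width that fits a given number of contigs. The function name is hypothetical, and it assumes contig IDs are zero-based indices, which is not confirmed by the traceback.

```python
def smallest_signed_bits(n_contigs: int) -> int:
    """Smallest signed integer width (16, 32 or 64) holding indices 0..n_contigs-1."""
    largest_index = max(n_contigs - 1, 0)
    for bits in (16, 32, 64):
        if largest_index <= 2 ** (bits - 1) - 1:
            return bits
    raise OverflowError("too many contigs for a 64-bit signed index")


print(smallest_signed_bits(25))       # a typical genome assembly
print(smallest_signed_bits(200_000))  # a large transcriptome
```

Under that assumption, switching the ID type to a 32-bit integer would cover any realistic contig count.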

feature request: include sequences

Thanks for a great library. It would be very helpful as a general purpose tool if bamread.read_bam_full had the ability to include the query sequences.
