bamread's Issues

Cannot install, no module named Cython

Hello,

I tried to install bamread via pip 20.3.3 and Python 3.7.4 but got the following error:

Collecting bamread==0.0.5
  Using cached bamread-0.0.5.tar.gz (109 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/paul/gitclones/chipseq-visualization/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py'"'"'; __file__='"'"'/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-pip-egg-info-0ilv3x53
         cwd: /private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/0b/xvv26_mx5fv8sxp90csjrcm00000gp/T/pip-install-u8klatml/bamread_37c49647db44430d9b00730baf40e458/setup.py", line 7, in <module>
        from Cython.Build import cythonize
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

The approach described at https://luminousmen.com/post/resolve-cython-and-numpy-dependencies might be a fix. It looks like bamread's setup.py imports Cython at the top level, so Cython has to be installed before setup.py can even run, even though Cython is itself a dependency of the package.
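In the meantime, a possible workaround is to install Cython into the environment before bamread, since the log above shows pip running setup.py directly with the venv's interpreter. A minimal sketch, assuming that non-isolated build path; the helper names `pip_install_command`, `install_in_order` and `INSTALL_ORDER` are made up for illustration:

```python
import subprocess
import sys


def pip_install_command(requirement):
    """Build a pip command that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", requirement]


def install_in_order(requirements):
    """Install one requirement at a time so earlier ones exist before later builds."""
    for requirement in requirements:
        subprocess.check_call(pip_install_command(requirement))


# Cython goes first so that bamread's setup.py can import it (assumed workaround).
INSTALL_ORDER = ["Cython", "bamread==0.0.5"]

# Uncomment to actually run the installs:
# install_in_order(INSTALL_ORDER)
```

This only works around the problem on the user's side; the packaging itself would still need to declare Cython as a build-time requirement.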

Best,

Paul

suggested API

@endrebak asked for suggestions on a possible API to make this a more general purpose read-my-bam tool.

My thought is a read_bam function that returns a pandas.DataFrame (similar to what is already here) but that

  1. Reads all alignments and all fields by default (including unmapped reads)
  2. Supports subselecting the fields (columns) being read for efficiency using a parameter, say, fields. For example fields=["Chromosome", "Start", "End", "Strand"] would only read in the specified columns and return a DataFrame with only those columns. Similar to usecols in pandas.read_csv.
  3. Supports subselecting the alignments (rows) being read to specified regions (and uses the BAM index for doing this). E.g. regions=[("chr1", 100, 10000)] would subselect to chr1:100-10000.
  4. Supports subselecting the alignments (rows) being read according to the BAM record flags. I think a dedicated named parameter for each flag would be the most user-friendly approach. E.g. only_mapped=True would be the equivalent of passing -F 4 to samtools. Named parameters are really helpful here because the user does not have to do bit arithmetic with binary flag codes.
  5. Has a max_alignments argument so the user can read just the first 10 records by passing max_alignments=10

I think one function that implements this would handle the majority of my use cases for reading BAMs in Python, and would provide a much simpler API to get started with and use than pysam.
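To make the proposal concrete, here is a rough sketch of the selection logic on plain Python dicts rather than a real BAM file. Everything in it is hypothetical: `select_alignments`, `build_filter_flag` and the keyword names are illustrations of the suggested API, not functions that bamread provides, and the region filter is a simple overlap test instead of a BAM index lookup. The flag bit values are the ones from the SAM specification.

```python
from itertools import islice

# SAM flag bits (SAM specification).
UNMAPPED = 0x4
QC_FAIL = 0x200
DUPLICATE = 0x400


def build_filter_flag(only_mapped=False, exclude_qc_fail=False, exclude_duplicates=False):
    """Turn named options into one exclusion mask (like samtools -F)."""
    mask = 0
    if only_mapped:
        mask |= UNMAPPED
    if exclude_qc_fail:
        mask |= QC_FAIL
    if exclude_duplicates:
        mask |= DUPLICATE
    return mask


def overlaps(record, regions):
    """True if the record overlaps any (chromosome, start, end) region."""
    return any(
        record["Chromosome"] == chrom and record["Start"] < end and record["End"] > start
        for chrom, start, end in regions
    )


def select_alignments(records, fields=None, regions=None, max_alignments=None, **flag_options):
    """Hypothetical core of read_bam: filter rows, cap the count, pick columns."""
    filter_flag = build_filter_flag(**flag_options)
    kept = (r for r in records if not (r["Flag"] & filter_flag))
    if regions is not None:
        kept = (r for r in kept if overlaps(r, regions))
    kept = islice(kept, max_alignments)  # islice(..., None) keeps everything
    if fields is None:
        return list(kept)
    return [{name: r[name] for name in fields} for r in kept]


# Made-up records, including one unmapped read without coordinates.
RECORDS = [
    {"Chromosome": "chr1", "Start": 150, "End": 200, "Strand": "+", "Flag": 0},
    {"Chromosome": "chr1", "Start": 20000, "End": 20050, "Strand": "-", "Flag": 16},
    {"Chromosome": None, "Start": None, "End": None, "Strand": None, "Flag": 4},
    {"Chromosome": "chr2", "Start": 150, "End": 200, "Strand": "+", "Flag": 1024},
]
```

With these records, `select_alignments(RECORDS)` keeps all four rows by default, while `select_alignments(RECORDS, only_mapped=True)` drops the unmapped one. A real implementation would return a pandas.DataFrame and push the region query down to the BAM index.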

compatibility with polars

Given the size of the dataframes, polars is generally faster than pandas at data transformations. Are there any plans to add an option to return the result as either a polars or a pandas DataFrame?
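One way such an option could look is a backend switch at the point where the columns are assembled into a frame. This is only a sketch: `to_dataframe` and its `backend` parameter are hypothetical and not part of bamread. It relies on the fact that both `pandas.DataFrame` and `polars.DataFrame` accept a dict of equal-length columns.

```python
def to_dataframe(columns, backend="pandas"):
    """Build a DataFrame from a dict of equal-length columns with the chosen backend."""
    if backend == "pandas":
        import pandas as pd  # imported lazily so only the chosen backend is required
        return pd.DataFrame(columns)
    if backend == "polars":
        import polars as pl  # optional dependency
        return pl.DataFrame(columns)
    raise ValueError(f"unknown backend: {backend!r}")
```

Until something like this exists, an existing pandas result can be converted after the fact with `polars.from_pandas`, at the cost of building the pandas frame first.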

Issue with filter_flag while using pyranges.read_bam

Hi,

Thanks for this package, it's been super useful for my needs. Also hoping everyone is safe during these hard times.

I want to use pyranges to read in a bam file and do some downstream intersect calculations with a bed file. I wish to read in the entire bam file, but I note that the filter_flag default is 1540, which excludes unmapped reads. I wish to include the unmapped read count stats in my calculations, so I tried filter_flag=1536 (that is, 1540 without the 4 bit, which refers to unmapped reads, per the popular Broad Institute tool), but that doesn't work. Below is a code snippet which works when I exclude unmapped reads (filter_flag=1540, the default) but fails when I try to include them (filter_flag=1536).

>>> import pyranges
>>> import bamread
>>> bam_file = pyranges.read_bam(bamfile, filter_flag=1540)
>>> bam_file.df.shape
(19979065, 5)
>>> bam_file = pyranges.read_bam(bamfile, filter_flag=1536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sradhakrishnan/venv/lib/python3.6/site-packages/pyranges/readers.py", line 190, in read_bam
    df = bamread.read_bam(f, mapq, required_flag, filter_flag)
  File "/home/sradhakrishnan/venv/lib/python3.6/site-packages/bamread/read.py", line 9, in read_bam
    f, mapq, required_flag, filter_flag)
  File "bamread/src/bamread.pyx", line 17, in bamread.src.bamread._bamread
    cpdef _bamread(filename, uint32_t mapq=0, uint64_t required_flag=0, uint64_t filter_flag=1540):
  File "bamread/src/bamread.pyx", line 69, in bamread.src.bamread._bamread
    end = a.reference_end
TypeError: an integer is required

Looking at the .pyx file in the bamread repo, it looks from lines 21 and 22 as though the start and end variables need to be integers, which is likely violated for unmapped reads, since they don't have coordinates?
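For reference, the two filter_flag values can be decoded with the bit names from the SAM specification. The helper below is an illustrative sketch, not part of bamread or pyranges. It shows that 1536 lets unmapped reads through, which fits the guess above: pysam documents reference_end as None when a read has no alignment, and None cannot be stored in an integer-typed variable.

```python
# SAM flag bits and short names (values from the SAM specification).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if value & bit]


default_filter = decode_flag(1540)  # unmapped, qc_fail, duplicate
relaxed_filter = decode_flag(1536)  # qc_fail, duplicate
```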

Are there any workarounds in my case where I can include unmapped reads in my calculations?

If this is not possible, my fallback approach is to read in the bam file directly using pysam.AlignmentFile (and skip bamread for the unmapped read inclusion, while still using it to calculate intersection stats because that's really nice) but that's a little messy. Any help is highly appreciated!
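The pysam fallback could look roughly like the sketch below. The function names are made up for illustration. The pysam calls used are `pysam.AlignmentFile(path, "rb")`, `fetch(until_eof=True)` (which iterates over every record, including unmapped ones, without needing an index) and the `is_unmapped` attribute of each read.

```python
from collections import Counter


def count_mapped_unmapped(reads):
    """Count reads by mapping status; each read needs an is_unmapped attribute."""
    counts = Counter(mapped=0, unmapped=0)
    for read in reads:
        counts["unmapped" if read.is_unmapped else "mapped"] += 1
    return counts


def count_bam(path):
    """Open a BAM file with pysam and count mapped and unmapped reads."""
    import pysam  # third-party; imported here so the counting helper stays dependency-free

    with pysam.AlignmentFile(path, "rb") as bam:
        return count_mapped_unmapped(bam.fetch(until_eof=True))
```

The unmapped count from `count_bam` could then be combined with the intersection statistics computed from `pyranges.read_bam` with the default filter_flag.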

Stay safe and thanks,
Srihari

loading bam with many contigs (~2^16) fails

Contig IDs are represented as int16, which fails for BAMs aligned to a large number of contigs (e.g. transcriptome alignments). This looks like an easy fix and I can submit a PR.

~/anaconda3/lib/python3.8/site-packages/bamread/read.py in read_bam_full(f, mapq, required_flag, filter_flag)
     26 def read_bam_full(f, mapq=0, required_flag=0, filter_flag=1540):
     27 
---> 28     chromosomes, starts, ends, strands, flags, chrmap, qstarts, qends, query_names, query_sequences, cigarstrings, query_qualities = _bamread_all(
     29         f, mapq, required_flag, filter_flag)
     30 

~/anaconda3/lib/python3.8/site-packages/bamread/src/bamread.pyx in bamread.src.bamread._bamread_all()

~/anaconda3/lib/python3.8/site-packages/bamread/src/bamread.pyx in bamread.src.bamread._bamread_all()

OverflowError: value too large to convert to int16_t
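The limit can be illustrated without bamread at all. Assuming the contig ID is stored as a signed 16-bit integer, as the int16_t in the error message suggests, the largest representable ID is 32767. The sketch below uses the standard library's struct module to show where the value stops fitting; the helper names are illustrative only.

```python
import struct

INT16_MAX = 2**15 - 1  # 32767, the largest signed 16-bit value


def fits_in_int16(value):
    """True if value can be packed as a signed 16-bit integer."""
    try:
        struct.pack("<h", value)
    except struct.error:
        return False
    return True


def smallest_signed_width(max_id):
    """Smallest signed integer width (in bits) that can hold max_id."""
    for bits in (8, 16, 32, 64):
        if max_id <= 2 ** (bits - 1) - 1:
            return bits
    raise OverflowError("value does not fit in 64 bits")
```

Under that assumption, moving the contig ID to a 32-bit integer would cover any realistic number of contigs.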

feature request: include sequences

Thanks for a great library. It would be very helpful as a general purpose tool if bamread.read_bam_full had the ability to include the query sequences.
