Comments (7)
Thanks as always for filing the issue. In the short term, please see the following example for loading paired-end reads. I also agree w/ your point about having some guidance around trimming, and filed a ticket to update the docs with an example.
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
... WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
... two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
...
... SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
... FROM one
... JOIN two
... ON one.trimmed_name = two.trimmed_name
...
... """).to_polars()
Outside of Python, the SQL looks like the following. The idea is to create two CTEs that standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality scores or descriptions if you want those.
WITH one AS (
SELECT REPLACE(name, '/1', '') trimmed_name, *
FROM fastq_scan('paired.1.fastq')
), two AS (
SELECT REPLACE(name, '/2', '') trimmed_name, *
FROM fastq_scan('paired.2.fastq')
)
SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
JOIN two
ON one.trimmed_name = two.trimmed_name
The results look like:
shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence ┆ two_name ┆ two_sequence │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5 ┆ read5 ┆ TTATTTGTCTCCAGC ┆ read5 ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1 ┆ read1/1 ┆ TTATTTGTCTCCAGC ┆ read1/2 ┆ GCTGGAGACAAATAA │
│ read4 ┆ read4/1 ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2 ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3 ┆ read3/1 ┆ CCAACTTGATATTAATAACA ┆ read3/2 ┆ TGTTATTAATATCAAGTTGG │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘
I'd also just add that the session will persist state within a given program or interpreter session, so you can create a table backed by a file, then query it multiple times if needed.
import biobear as bb
s = bb.connect()
s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")
s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name ┆ description ┆ sequence ┆ quality_scores │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text ┆ TTATTTGTCTCCAGC ┆ ##HHHHHHHHHHHHH │
│ read3/1 ┆ null ┆ CCAACTTGATATTAATAACA ┆ HHHHHHHHHHHHHHHHHHHH │
│ read4/1 ┆ null ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5 ┆ null ┆ TTATTTGTCTCCAGC ┆ #####HHHHHHHHHH │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘
from biobear.
Thanks for the link 👍, I'm gonna work through the trimming doc (#103) and how it relates to your link tonight/tomorrow morning, and I'll follow up then w/ more.
@abearab Nice, congrats on the release! And glad that worked well enough.
I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)
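For context, here is a plain-Python sketch of what a function like quality_scores_to_list presumably does (this is my reading, not the actual exon implementation): FASTQ quality characters encode Phred scores with an ASCII offset of 33.

```python
# Sketch of Phred+33 decoding, the presumed behavior of
# quality_scores_to_list -- not the actual exon implementation.
def quality_scores_to_list(quality_string: str) -> list[int]:
    # Each FASTQ quality character encodes Phred score = ord(char) - 33
    return [ord(c) - 33 for c in quality_string]

print(quality_scores_to_list('!!!'))   # [0, 0, 0]
print(quality_scores_to_list('IIII'))  # [40, 40, 40, 40]
```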
There are several common computational tasks with paired-end reads; here are some examples:
1. Trimming
2. Merging paired-end sequences
- https://github.com/dstreett/FLASH2
- https://cme.h-its.org/exelixis/web/software/pear/
- https://micca.readthedocs.io/en/1.6.2/pairedend.html#merge-paired-end-sequences
It might be a separate conversation, but it could be a good idea for your tool to offer guidance for these tasks. Based on what I have learned about your tool so far, I assume these tasks can be very efficient with the SQL core in your tool, and they can be combined with other tasks at the same time.
Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.
Actually, my main goal for loading both R1/R2 reads at the same time is to do the following:
- trimming each read from certain position
- counting unique sequences based on both reads
Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50
Do you have any suggestions? Thanks again for your insightful responses.
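For what it's worth, here is a plain-Python sketch of those two steps (trim each mate from a fixed position, then count unique (R1, R2) sequence pairs). In biobear SQL the same thing could presumably be expressed with SUBSTR on the joined CTEs plus GROUP BY/COUNT(*); the function name and parameters here are purely illustrative.

```python
# Sketch: fixed-position trimming + counting unique read pairs.
# count_pairs and its parameters are hypothetical, for illustration only.
from collections import Counter

def count_pairs(pairs, trim_start=0, trim_len=None):
    """Trim both mates at a fixed position, then count unique pairs."""
    end = None if trim_len is None else trim_start + trim_len
    counts = Counter()
    for r1, r2 in pairs:
        counts[(r1[trim_start:end], r2[trim_start:end])] += 1
    return counts

pairs = [("AACGT", "TTGCA"), ("AACGT", "TTGCA"), ("GGCGT", "CCGCA")]
# Trimming to positions 2..4 collapses all three pairs to one key:
print(count_pairs(pairs, trim_start=2, trim_len=3))  # Counter({('CGT', 'GCA'): 3})
```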
> Thanks for the link 👍, I'm gonna work through the trimming doc (#103) and how it relates to your link tonight/tomorrow morning, and I'll follow up then w/ more.
Awesome! Looking forward to that.
I'm also excited to hear your thoughts about this #102 (comment), but there is no rush; whenever you have time, please let me know what you think. I already released a new sub-version of my tool which is now using biobear – ArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully in the next few weeks).
I was able to follow your example here and it worked well. Thanks!!
ArcInstitute/ScreenPro2@fe9a27d