Comments (7)
Thanks as always for filing the issue. In the short term, please see the following example for loading paired-end reads. I also agree w/ your point about having some guidance around trimming, and filed a ticket to update the docs with an example.
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
... WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
... two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
...
... SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
... FROM one
... JOIN two
... ON one.trimmed_name = two.trimmed_name
...
... """).to_polars()
Outside of Python, the SQL looks like the following. The idea is to create two CTEs that standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality scores or descriptions if you want those.
WITH one AS (
SELECT REPLACE(name, '/1', '') trimmed_name, *
FROM fastq_scan('paired.1.fastq')
), two AS (
SELECT REPLACE(name, '/2', '') trimmed_name, *
FROM fastq_scan('paired.2.fastq')
)
SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
JOIN two
ON one.trimmed_name = two.trimmed_name
The results look like:
shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence ┆ two_name ┆ two_sequence │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5 ┆ read5 ┆ TTATTTGTCTCCAGC ┆ read5 ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1 ┆ read1/1 ┆ TTATTTGTCTCCAGC ┆ read1/2 ┆ GCTGGAGACAAATAA │
│ read4 ┆ read4/1 ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2 ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3 ┆ read3/1 ┆ CCAACTTGATATTAATAACA ┆ read3/2 ┆ TGTTATTAATATCAAGTTGG │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘
I'd also just add that the session will persist state within a given program or interpreter session, so you can create a table backed by a file, then query it multiple times if needed.
import biobear as bb
s = bb.connect()
s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")
s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name ┆ description ┆ sequence ┆ quality_scores │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text ┆ TTATTTGTCTCCAGC ┆ ##HHHHHHHHHHHHH │
│ read3/1 ┆ null ┆ CCAACTTGATATTAATAACA ┆ HHHHHHHHHHHHHHHHHHHH │
│ read4/1 ┆ null ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5 ┆ null ┆ TTATTTGTCTCCAGC ┆ #####HHHHHHHHHH │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘
from biobear.
Thanks for the link 👍, I'm gonna work through the trimming doc (#103) and how it relates to your link tonight/tomorrow morning, and I'll follow up then w/ more.
@abearab Nice, congrats on the release! And glad that worked well enough.
I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)
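For context, here is a plain-Python sketch of what a function like quality_scores_to_list presumably does (this is my reading, not the actual exon implementation): FASTQ quality characters encode Phred scores with an ASCII offset of 33.

```python
# Sketch of Phred+33 decoding, the presumed behavior of
# quality_scores_to_list -- not the actual exon implementation.
def quality_scores_to_list(quality_string: str) -> list[int]:
    # Each FASTQ quality character encodes Phred score = ord(char) - 33
    return [ord(c) - 33 for c in quality_string]

print(quality_scores_to_list('!!!'))   # [0, 0, 0]
print(quality_scores_to_list('IIII'))  # [40, 40, 40, 40]
```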
There are several common computational tasks with paired-end reads; here are some examples:
1. Trimming
2. Merging paired-end sequences
- https://github.com/dstreett/FLASH2
- https://cme.h-its.org/exelixis/web/software/pear/
- https://micca.readthedocs.io/en/1.6.2/pairedend.html#merge-paired-end-sequences
It might be a separate conversation, but it could be a good idea for your tool to offer guidance for these tasks. Based on what I have learned about your tool so far, I assume these tasks can be very efficient with the SQL core in your tool, and they can be combined with other tasks at the same time.
Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.
Actually, my main goal for loading both R1/R2 reads at the same time is to do the following:
- trimming each read from certain position
- counting unique sequences based on both reads
Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50
Do you have any suggestions? Thanks again for your insightful responses.
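For what it's worth, here is a plain-Python sketch of those two steps (trim each mate from a fixed position, then count unique (R1, R2) sequence pairs). In biobear SQL the same thing could presumably be expressed with SUBSTR on the joined CTEs plus GROUP BY/COUNT(*); the function name and parameters here are purely illustrative.

```python
# Sketch: fixed-position trimming + counting unique read pairs.
# count_pairs and its parameters are hypothetical, for illustration only.
from collections import Counter

def count_pairs(pairs, trim_start=0, trim_len=None):
    """Trim both mates at a fixed position, then count unique pairs."""
    end = None if trim_len is None else trim_start + trim_len
    counts = Counter()
    for r1, r2 in pairs:
        counts[(r1[trim_start:end], r2[trim_start:end])] += 1
    return counts

pairs = [("AACGT", "TTGCA"), ("AACGT", "TTGCA"), ("GGCGT", "CCGCA")]
# Trimming to positions 2..4 collapses all three pairs to one key:
print(count_pairs(pairs, trim_start=2, trim_len=3))  # Counter({('CGT', 'GCA'): 3})
```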
> Thanks for the link 👍, I'm gonna work through the trimming doc (#103) and how it relates to your link tonight/tomorrow morning, and I'll follow up then w/ more.
Awesome! Looking forward to that.
I'm also excited to hear your thoughts about this #102 (comment), but there is no rush; whenever you have time, please let me know what you think. I already released a new sub-version of my tool which is now using biobear – ArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully in the next few weeks).
I was able to follow your example here and it worked well. Thanks!!
ArcInstitute/ScreenPro2@fe9a27d