Giter VIP home page Giter VIP logo

Comments (7)

tshauck avatar tshauck commented on July 17, 2024 1

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

from biobear.

tshauck avatar tshauck commented on July 17, 2024 1

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

from biobear.

tshauck avatar tshauck commented on July 17, 2024 1

@abearab Nice, congrats on the release! And glad that worked well enough.

I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)

from biobear.

abearab avatar abearab commented on July 17, 2024

There are several common computational task with paired-end reads, here are some examples:

1. trimming task

2. merge-paired-end-sequences

It might be a separate conversation but it can be a good idea if your tool can have guidance for these tasks. Based on what I have learned about your tool so far, I assume these task can be very efficient with the SQL core in your tool and it can be combined with other tasks at the same time.

from biobear.

abearab avatar abearab commented on July 17, 2024

Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.

Actually, my main goal for loading both R1/R2 reads at the same time is doing this:

  1. trimming each read from certain position
  2. counting unique sequences based on both reads

Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50

Do you have any suggestion? Thanks agin for your insightful responses.

from biobear.

abearab avatar abearab commented on July 17, 2024

Thanks for the link 👍 , I'm gonna work through the trimming doc (#103) and how it refers to your link tonight/tomorrow morning and I'll follow-up then w/ more.

Awesome! Looking forward to that.

I'm also excited to hear your thoughts about this #102 (comment) but there is no rush; whenever you have time please let me know what you think. I already released a new sub-bersion for my tool which is now using biobearArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully next few weeks).

from biobear.

abearab avatar abearab commented on July 17, 2024

I could follow your example here and it worked well. Thanks!!

ArcInstitute/ScreenPro2@fe9a27d

Thanks as always for filing the issue. In the short term please see the following example for loading pair end reads. I also agree w/ you point about having some guidance around trimming, and filed in ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of python the SQL looks like the following. The idea is to create two CTEs which are used to standardize the read names (removing /1 or /2), then the body joins the two files together on that new name and returns a bit of info about both reads. You could add quality score or description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results of which looks like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that session will persist state within a given program or interpreter session. So you can create a table backed by a file, then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

from biobear.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.