Giter VIP home page Giter VIP logo

Comments (12)

tshauck avatar tshauck commented on July 17, 2024 2

@abearab I updated biobear to support locate_regex as shown above. You can see how it looks through biobear/polars here:

def test_read_fastq_with_qs_to_list():
"""Test quality scores to list."""
session = connect()
fastq_path = DATA / "test.fastq.gz"
df = session.sql(
f"""
SELECT quality_scores_to_list(quality_scores) quality_score_list, locate_regex(sequence, '[AC]AT') locate
FROM fastq_scan('{fastq_path}')
"""
).to_polars()
assert len(df) == 2
assert df.get_column("quality_score_list").to_list()[0][:5] == [0, 6, 6, 9, 7]
assert df.get_column("locate").to_list() == [
[
{"start": 29, "end": 32, "match": "AAT"},
{"start": 36, "end": 39, "match": "AAT"},
{"start": 40, "end": 43, "match": "CAT"},
],
[
{"start": 29, "end": 32, "match": "AAT"},
{"start": 36, "end": 39, "match": "AAT"},
{"start": 40, "end": 43, "match": "CAT"},
],
]
.

Please let me know if you have any feedback on how that works for your task -- thanks!

from biobear.

tshauck avatar tshauck commented on July 17, 2024 2

Cool, that all sounds good to me, no rush for me. Please just "at" me on this issue if/when you use this for your work if you have any thoughts on improvements and/or questions. Thanks!

from biobear.

tshauck avatar tshauck commented on July 17, 2024 1

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...

-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+

You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.

SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}

Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

from biobear.

abearab avatar abearab commented on July 17, 2024 1

Thanks @tshauck for your response. I think that's interesting addition (it's interesting that levenshtein distance is already implemented; I didn't know that! I'll need to ask you another question about that but I'll leave this question clear for now).

For the locate question here, I had some specific thought which is described here – https://github.com/abearab/PosSpacer. I just made this a public repo so you can see my algorithm design notes. However, I gave up on that until I could replace my codes with biobear! The preprocessed NGS reads needs to be counted and the location of spacers in sequencing reads would be nice to be reported.

from biobear.

tshauck avatar tshauck commented on July 17, 2024 1

@abearab, cool, that makes sense to me... I just updated biobear on pypi with the new quality score function as well as an alignment_score function that does basic local alignment between two sequences (e.g. e69df6b). I'll explore more how to return alignment position(s) within a string and follow up when I have some more info.

from biobear.

tshauck avatar tshauck commented on July 17, 2024 1

Ah, ok, glad you got it. Looks like I have a little work to do dependency-wise.

from biobear.

abearab avatar abearab commented on July 17, 2024 1

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

Hi @tshauck, Thanks for looking into this. There is no time pressure on my end (i.e. I might also not test new features in biobear as the top priority). I'm currently working on CLI development ArcInstitute/ScreenPro2#35 but I would like to use locate / locate_regex function for topics like ArcInstitute/ScreenPro2#39.

from biobear.

abearab avatar abearab commented on July 17, 2024

@tshauck I have issue installing it

(this might be a separate issue, I just created a new VM and fresh OS, so ...)

  × Building wheel for biobear (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [168 lines of output]
      Running `maturin pep517 build-wheel -i /home/abearab/miniforge3/envs/screenpro2/bin/python --compatibility off`
      🍹 Building a mixed python/rust project
      🔗 Found pyo3 bindings
      🐍 Found CPython 3.9 at /home/abearab/miniforge3/envs/screenpro2/bin/python
      📡 Using build options features from pyproject.toml
         Compiling libc v0.2.153
       ...
      The following warnings were emitted during compilation:
      
      warning: [email protected]: Failed to run: "cc" "--version"
      
      error: failed to run custom build command for `ring v0.17.8`
      
      Caused by:
        process didn't exit successfully: `/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/target/release/build/ring-646b006cd1874eb5/build-script-build` (exit status: 1)
        --- stdout
        cargo:rerun-if-env-changed=RING_PREGENERATE_ASM
        cargo:rustc-env=RING_CORE_PREFIX=ring_core_0_17_8_
        OPT_LEVEL = Some("3")
        TARGET = Some("x86_64-unknown-linux-gnu")
        HOST = Some("x86_64-unknown-linux-gnu")
        cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
        CC_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
        CC_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CC
        HOST_CC = None
        cargo:rerun-if-env-changed=CC
        CC = None
        cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
        cargo:warning=Failed to run: "cc" "--version"
        cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
        CRATE_CC_NO_DEFAULTS = None
        DEBUG = Some("false")
        CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
        cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
        CFLAGS_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
        CFLAGS_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CFLAGS
        HOST_CFLAGS = None
        cargo:rerun-if-env-changed=CFLAGS
        CFLAGS = None
      
        --- stderr
      
      
        error occurred: Failed to find tool. Is `cc` installed?
      
      
      warning: build failed, waiting for other jobs to finish...
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.9-64bit" PYO3_PYTHON="/home/abearab/miniforge3/envs/screenpro2/bin/python" PYTHON_SYS_EXECUTABLE="/home/abearab/miniforge3/envs/screenpro2/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/Cargo.toml" "--release" "--lib"`
      Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/abearab/miniforge3/envs/screenpro2/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for biobear
Failed to build biobear
ERROR: Could not build wheels for biobear, which is required to install pyproject.toml-based projects

from biobear.

abearab avatar abearab commented on July 17, 2024

#106 (comment)

I had to install this in system level:

sudo apt install GCC

from biobear.

abearab avatar abearab commented on July 17, 2024

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...

-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+

You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.

SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}

Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

Hi @tshauck – I'm getting back to this discussion and I need to use this functionality to process a dataset. I would be more than happy to test your tool in case you have any new features. What do you recommend to start with? I liked your idea to have that locate function as part of biobear! Let me know what you think, thanks.

from biobear.

tshauck avatar tshauck commented on July 17, 2024

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

from biobear.

tshauck avatar tshauck commented on July 17, 2024

I have something in this branch I hope to finish up tomorrow. It's slightly different than locate as it requires a regex right now, but is relatively close... e.g.

❯ SELECT locate_regex('agctggagctacc', '[a][atcg][c]') AS locate; -- match a, then one of atc or g, then c
+-----------------------------------------------------------------------------------------------------+
| locate                                                                                              |
+-----------------------------------------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}, {start: 11, end: 14, match: acc}] |
+-----------------------------------------------------------------------------------------------------+

❯ SELECT locate_regex('agctggagctacc', 'agc') AS locate; -- match only agc
+-------------------------------------------------------------------+
| locate                                                            |
+-------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}] |
+-------------------------------------------------------------------+

This is similar to the seqkit example: https://bioinf.shenwei.me/seqkit/usage/#locate

I also hope to add a non-regex based one similar to locate, but it's a little trickier... I'll follow up tomorrow when I know more about how hard/easy it is to add locate w/o the regex.

from biobear.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.