Comments (12)
@abearab I updated biobear to support locate_regex, as shown above. You can see how it looks through biobear/polars here:
biobear/python/tests/test_session.py
Lines 43 to 68 in 5d9a881
Please let me know if you have any feedback on how that works for your task -- thanks!
from biobear.
Cool, that all sounds good to me, no rush for me. Please just "at" me on this issue if/when you use this for your work if you have any thoughts on improvements and/or questions. Thanks!
It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.
To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments on trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.
For example, take the seqkit example sequence agctggagctacc... If you wanted to match against an adapter at the start and allow for two mismatches:
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2 |
+------+
You could then add that to a WHERE clause.
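To make the adapter check above concrete outside SQL, here is a minimal Python sketch of the same idea: take the first bases of each read, compute the Levenshtein distance to the adapter, and keep reads within two edits. The `levenshtein` and `reads_with_adapter` helpers are hypothetical names written for illustration; this is not how biobear implements its SQL function.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row-by-row DP."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def reads_with_adapter(reads: Iterable[str], adapter: str, max_dist: int = 2) -> List[str]:
    """Keep reads whose prefix (same length as the adapter) is within max_dist edits."""
    prefix_len = len(adapter)
    return [r for r in reads if levenshtein(r[:prefix_len], adapter) <= max_dist]


# Same comparison as the SQL query above: 'agctg' vs 'agcaa'
print(levenshtein("agctggagctacc"[:5], "agcaa"))  # 2
```

The filter plays the role of the WHERE clause: only reads whose prefix is close enough to the adapter survive.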
Perhaps I'll add a function that replicates locate, but returns a list of matches... e.g.
SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}
Then more or less have the returned table be aligned with seqkit's example? Is that a useful function?
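As a rough illustration of what such a function might return, here is a Python sketch that scans both strands for a pattern, tolerates a fixed number of mismatches, and emits one record per hit with the fields proposed above. Everything here is an assumption made for the sketch: the `locate` signature, the `max_mismatch` parameter, the 1-based inclusive coordinates, and the choice to report minus-strand matches as read on the minus strand. It is not biobear's or seqkit's implementation.

```python
from typing import Dict, List, Union

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (only a/c/g/t are complemented)."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def locate(seqid: str, sequence: str, pattern: str,
           max_mismatch: int = 0) -> List[Dict[str, Union[str, int]]]:
    """Hypothetical locate: list of hits on both strands, forward coordinates."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    width = len(pattern)
    hits: List[Dict[str, Union[str, int]]] = []
    # A minus-strand hit is a forward-strand window equal to the
    # reverse complement of the pattern.
    for strand, query in (("+", pattern), ("-", reverse_complement(pattern))):
        for offset in range(len(sequence) - width + 1):
            window = sequence[offset:offset + width]
            if mismatches(window, query) <= max_mismatch:
                matched = window if strand == "+" else reverse_complement(window)
                hits.append({
                    "seqid": seqid,
                    "pattern": pattern,
                    "strand": strand,
                    "start": offset + 1,    # 1-based (sketch convention)
                    "end": offset + width,  # inclusive (sketch convention)
                    "matched_pattern": matched,
                })
    return hits


for hit in locate("seq1", "agctggagctacc", "agc"):
    print(hit)
```

Note that a pattern equal to its own reverse complement would be reported once per strand in this sketch; a real implementation would need to decide how to handle that.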
Thanks @tshauck for your response. I think that's an interesting addition (it's interesting that levenshtein distance is already implemented; I didn't know that! I'll need to ask you another question about that, but I'll leave this question clear for now).
For the locate question here, I had some specific thoughts, which are described here – https://github.com/abearab/PosSpacer. I just made this a public repo so you can see my algorithm design notes. However, I gave up on that until I could replace my code with biobear! The preprocessed NGS reads need to be counted, and it would be nice to report the location of spacers in the sequencing reads.
@abearab, cool, that makes sense to me... I just updated biobear on pypi with the new quality score function as well as an alignment_score function that does basic local alignment between two sequences (e.g. e69df6b). I'll explore more how to return alignment position(s) within a string and follow up when I have some more info.
Ah, ok, glad you got it. Looks like I have a little work to do dependency-wise.
Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I got started on the writing (though I need some feedback from another developer).
For locate, after the recent fixes to dependencies to afford faster installs, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get you something you could use, but don't want to lead you on if it's not feasible given your requirements.
Hi @tshauck, Thanks for looking into this. There is no time pressure on my end (i.e. I might also not test new features in biobear as the top priority). I'm currently working on CLI development (ArcInstitute/ScreenPro2#35), but I would like to use the locate / locate_regex function for topics like ArcInstitute/ScreenPro2#39.
@tshauck I'm having an issue installing it
(this might be a separate issue, I just created a new VM and fresh OS, so ...)
× Building wheel for biobear (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [168 lines of output]
Running `maturin pep517 build-wheel -i /home/abearab/miniforge3/envs/screenpro2/bin/python --compatibility off`
🍹 Building a mixed python/rust project
🔗 Found pyo3 bindings
🐍 Found CPython 3.9 at /home/abearab/miniforge3/envs/screenpro2/bin/python
📡 Using build options features from pyproject.toml
Compiling libc v0.2.153
...
The following warnings were emitted during compilation:
warning: ring@0.17.8: Failed to run: "cc" "--version"
error: failed to run custom build command for `ring v0.17.8`
Caused by:
process didn't exit successfully: `/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/target/release/build/ring-646b006cd1874eb5/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-env-changed=RING_PREGENERATE_ASM
cargo:rustc-env=RING_CORE_PREFIX=ring_core_0_17_8_
OPT_LEVEL = Some("3")
TARGET = Some("x86_64-unknown-linux-gnu")
HOST = Some("x86_64-unknown-linux-gnu")
cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
CC_x86_64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
CC_x86_64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
cargo:warning=Failed to run: "cc" "--version"
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
CFLAGS_x86_64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
CFLAGS_x86_64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
--- stderr
error occurred: Failed to find tool. Is `cc` installed?
warning: build failed, waiting for other jobs to finish...
💥 maturin failed
Caused by: Failed to build a native library through cargo
Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.9-64bit" PYO3_PYTHON="/home/abearab/miniforge3/envs/screenpro2/bin/python" PYTHON_SYS_EXECUTABLE="/home/abearab/miniforge3/envs/screenpro2/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/Cargo.toml" "--release" "--lib"`
Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/abearab/miniforge3/envs/screenpro2/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for biobear
Failed to build biobear
ERROR: Could not build wheels for biobear, which is required to install pyproject.toml-based projects
I had to install this at the system level:
sudo apt install gcc
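The build log above fails because the `ring` build script could not run `cc`. As a small pre-flight check before attempting a source build, one could look for a C compiler on PATH from Python. This is only a sketch: the `find_c_compiler` helper and its list of candidate names are assumptions for illustration, not part of biobear, maturin, or pip.

```python
import shutil
from typing import Optional, Sequence


def find_c_compiler(candidates: Sequence[str] = ("cc", "gcc", "clang")) -> Optional[str]:
    """Return the path of the first candidate compiler found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler is None:
        print("No C compiler found on PATH; building from source will likely fail.")
    else:
        print(f"Found C compiler: {compiler}")
```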
Hi @tshauck – I'm getting back to this discussion and I need to use this functionality to process a dataset. I would be more than happy to test your tool in case you have any new features. What do you recommend to start with? I liked your idea to have that locate function as part of biobear! Let me know what you think, thanks.
I have something in this branch I hope to finish up tomorrow. It's slightly different than locate as it requires a regex right now, but is relatively close... e.g.
❯ SELECT locate_regex('agctggagctacc', '[a][atcg][c]') AS locate; -- match a, then one of atc or g, then c
+-----------------------------------------------------------------------------------------------------+
| locate |
+-----------------------------------------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}, {start: 11, end: 14, match: acc}] |
+-----------------------------------------------------------------------------------------------------+
❯ SELECT locate_regex('agctggagctacc', 'agc') AS locate; -- match only agc
+-------------------------------------------------------------------+
| locate |
+-------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}] |
+-------------------------------------------------------------------+
This is similar to the seqkit example: https://bioinf.shenwei.me/seqkit/usage/#locate
I also hope to add a non-regex based one similar to locate, but it's a little trickier... I'll follow up tomorrow when I know more about how hard/easy it is to add locate w/o the regex.
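For readers who want to reproduce the output shape shown in the queries above without biobear, here is a small Python sketch built on the standard `re` module. The coordinate convention (1-based start, end equal to start plus match length) is inferred from the printed results and is an assumption, as is the Python function itself; biobear's regex engine may differ from Python's on more complex patterns.

```python
import re
from typing import Dict, List, Union


def locate_regex(sequence: str, pattern: str) -> List[Dict[str, Union[int, str]]]:
    """Non-overlapping regex matches as {start, end, match} records."""
    return [
        {"start": m.start() + 1, "end": m.end() + 1, "match": m.group(0)}
        for m in re.finditer(pattern, sequence)
    ]


print(locate_regex("agctggagctacc", "[a][atcg][c]"))
# [{'start': 1, 'end': 4, 'match': 'agc'}, {'start': 7, 'end': 10, 'match': 'agc'},
#  {'start': 11, 'end': 14, 'match': 'acc'}]
print(locate_regex("agctggagctacc", "agc"))
# [{'start': 1, 'end': 4, 'match': 'agc'}, {'start': 7, 'end': 10, 'match': 'agc'}]
```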