typedfastx.jl's Introduction

TypedFASTX

Latest Release MIT license Documentation Status Coverage

TypedFASTX.jl is a Julia package for working with FASTA and FASTQ files using typed records. It is largely based on BioJulia's FASTX.jl package, whose records are untyped, i.e. agnostic to the kind of data they contain. Besides the sequence field, the TypedRecord type has a description field and an optional quality field. TypedFASTX.jl aims to improve readability and reduce potential errors when dealing with different types of biological sequences. It also lets you define separate methods for specific record types.
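
As a minimal sketch of per-type methods, the snippet below dispatches on the record type. It assumes that DNARecord and AARecord can be used as type annotations in method signatures (the issues further down describe DNARecord as a type alias); `seq_kind` is a hypothetical helper and not part of TypedFASTX.

```julia
using TypedFASTX

# `seq_kind` is a hypothetical helper, not part of TypedFASTX.
# Each method applies to one record type only.
seq_kind(::DNARecord) = "nucleotide"
seq_kind(::AARecord) = "amino acid"

dna = DNARecord("Mickey Smith", "GATTACA")
aa = AARecord("Ricky Smith", "SMITH")

println(seq_kind(dna))
println(seq_kind(aa))
```

With untyped records, the same distinction would have to be tracked by hand or checked at run time.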

Performance

TypedRecords generally take up less memory than FASTX.jl records, since BioSequences.jl's LongSequence type stores sequence information more efficiently. However, this approach might be slightly slower than, for instance, storing each field in its own vector, because of the overhead needed to keep it flexible and user-friendly. TypedFASTX.jl is somewhat slower than FASTX.jl at writing records to files, since the sequences must be encoded back to ASCII bytes (done through string interpolation) to be stored in FASTA/FASTQ format. In one benchmark, writing records took about twice as long as with FASTX.jl. Reading should be almost as fast as using plain FASTX.jl (including sequence type conversions).
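
The memory and encoding trade-off can be illustrated with BioSequences.jl alone. This is a rough sketch, not the package's benchmark: a plain String stands in for byte-per-symbol storage, and the sizes printed depend on the Julia and BioSequences versions.

```julia
using BioSequences

raw = "GATTACA"^1000          # plain String: one byte per nucleotide
packed = LongDNA{4}(raw)      # LongSequence with a 4-bit alphabet

# Compare in-memory sizes (exact numbers depend on the Julia version)
println("String:     ", Base.summarysize(raw), " bytes")
println("LongDNA{4}: ", Base.summarysize(packed), " bytes")

# Writing FASTA/FASTQ needs text again; interpolation decodes the packed sequence
text = "$(packed)"
println(text == raw)
```

The decoding step in the last lines is the extra work that a byte-based record does not need when it is written back to a file.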

Installation

You can install TypedFASTX from the Julia REPL. Type ] to enter the Pkg REPL mode and run:

(@v1.9) pkg> add TypedFASTX

Example usage

julia> using TypedFASTX

julia> mickey = DNARecord("Mickey Smith", "GATTACA", "quA1!Ty") # quality is optional
DNARecord (FASTQ):
 description: "Mickey Smith"
    sequence: "GATTACA"
     quality: "quA1!Ty"

julia> sequence(mickey)
7nt DNA Sequence:
GATTACA

julia> sequence(String, mickey)
"GATTACA"

julia> error_rate(mickey)
0.14653682578684113

julia> description(mickey)
"Mickey Smith"

julia> identifier(mickey)
"Mickey"

julia> ricky = AARecord("Ricky Smith", "SMITH")
AARecord (FASTA):
 description: "Ricky Smith"
    sequence: "SMITH"

julia> sequence(ricky)
5aa Amino Acid Sequence:
SMITH
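
The error_rate value in the session above is consistent with averaging per-base error probabilities under Phred+33 encoding, where a quality character c gives Q = Int(c) - 33 and an error probability of 10^(-Q/10). Whether TypedFASTX computes it exactly this way is an assumption; the helpers below are hypothetical and use only Base Julia.

```julia
# Hypothetical helpers, not part of TypedFASTX.
# Error probability of one Phred+33 quality character: 10^(-Q/10) with Q = code - 33
phred_error(c::Char) = 10.0^(-(Int(c) - 33) / 10)

# Mean error probability over a whole quality string
mean_error(quality::AbstractString) = sum(phred_error, quality) / length(quality)

println(mean_error("quA1!Ty"))
```

The `!` character encodes Q = 0, an error probability of 1, which is why a single low-quality base dominates the mean for this short read.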

Check out the documentation for more detailed information on how to use the package.

typedfastx.jl's People

Contributors

anton083

typedfastx.jl's Issues

records are ugly and i hate it

after implementing the proposed changes in #33, records just don't look as nice...
this is what you get when you call show() on a DNARecord:
TypedFASTA.Record{LongSequence{DNAAlphabet{4}}}(<description>, <sequence>)
and that's assuming you're using BioSequences, cause otherwise it looks like this:
TypedFASTA.Record{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}(<description>, <sequence>)
and that's not even the worst part... I removed the TypedFASTX. at the beginning from the summary method thing to make it smaller, which is a little hack-y, so it would actually look like this:
TypedFASTX.TypedFASTA.Record{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}(<description>, <sequence>)
as you can see, this is very hard to look at...

before #33, it'd look like this:
DNARecord{NoQuality}(<description>, <sequence>)
cause records had a quality type parameter, as opposed to now, where we distinguish between typed FASTA and FASTQ records, which also seems to mess with how aliases are detected. see, DNARecord in the last example isn't a type, it's an alias of TypedRecord{LongDNA{4}}, so it replaces that in the summary method. but that gets complicated when we put the subtypes in their own submodules, cause then Julia won't just replace the names with any aliases defined in the submodules.

proposed solution:

make the show method print the supertype, and make sure there are constructors that can build subtype instances from that representation, so that it still makes sense. idk

we might be able to make it look like this for FASTA records:
DNARecord(<description>, <sequence>)
and like this for FASTQ records:
DNARecord(<description>, <sequence>, <quality>)
which both happen to be valid constructors since DNARecord is an alias for TypedRecord{LongDNA{4}}, and we define constructors in both concrete types that are based on the TypedRecord thing

julia> supertype(typeof(TypedFASTA.Record{LongDNA{4}}("ricky", "ACGT")))
DNARecord (alias for TypedRecord{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}})
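
a rough sketch of how that could work, using stand-in types so it runs without any packages (String replaces LongDNA{4}, and FASTARecord/FASTQRecord replace the submodule record types). `display_name` is a hypothetical helper that returns the short name directly instead of relying on Julia's alias detection:

```julia
# Stand-in types for this sketch; String replaces LongDNA{4} so no packages are needed.
abstract type TypedRecord{T} end

struct FASTARecord{T} <: TypedRecord{T}
    description::String
    sequence::T
end

struct FASTQRecord{T} <: TypedRecord{T}
    description::String
    sequence::T
    quality::String
end

const DNARecord = TypedRecord{String}

# Constructors on the abstract type pick the concrete type from the argument count.
TypedRecord{T}(description, sequence) where {T} = FASTARecord{T}(description, sequence)
TypedRecord{T}(description, sequence, quality) where {T} = FASTQRecord{T}(description, sequence, quality)

# Hypothetical lookup of a short display name; falls back to the full supertype name.
display_name(::Type{<:TypedRecord{T}}) where {T} = string(TypedRecord{T})
display_name(::Type{<:DNARecord}) = "DNARecord"

# Print records in a form that is itself a valid constructor call.
Base.show(io::IO, r::FASTARecord) =
    print(io, display_name(typeof(r)), "(", repr(r.description), ", ", repr(r.sequence), ")")
Base.show(io::IO, r::FASTQRecord) =
    print(io, display_name(typeof(r)), "(", repr(r.description), ", ", repr(r.sequence), ", ", repr(r.quality), ")")

fasta = DNARecord("ricky", "ACGT")
fastq = DNARecord("mickey", "ACGT", "IIII")
println(fasta)
println(fastq)
```

since the printed form goes through the abstract-type constructors, pasting it back into the REPL rebuilds the same concrete record.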

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

TypedRecord constructors

The constructors are a complete mess at the moment (there are like 30 for TypedRecord), and a lot of it has to do with how the quality field of a TypedRecord can be of type either NoQuality or QualityScores. Consider making TypedRecord an abstract type and differentiating between TypedRecords with and without qualities. The names of these types shouldn't be too long though, and they should be easy to create. There should still be type aliases like DNARecord so that it's as easy as possible to create records.

Here's an idea of how it might look:

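using BioSequences

# AbstractQuality, NoQuality and QualityScores are assumed to be defined elsewhere in the package.

# Abstract parent type; Q records whether a record carries quality scores.
abstract type TypedRecord{T, Q <: AbstractQuality} end

# FASTA-style record: description and sequence only.
struct TypedFASTARecord{T} <: TypedRecord{T, NoQuality}
    description::String
    sequence::T
end

# FASTQ-style record: adds per-base quality scores.
struct TypedFASTQRecord{T} <: TypedRecord{T, QualityScores}
    description::String
    sequence::T
    quality::QualityScores
end

# Alias that covers both FASTA and FASTQ DNA records.
const DNARecord = TypedRecord{LongDNA{4}}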

The constructors might end up looking just as ugly though, and readers and writers would require an overhaul as well.

Should there be constructors where you don't need to specify name/description of records?
