Giter VIP home page Giter VIP logo

wgatools's Introduction

Anaconda-Server Badge GitHub Workflow Status GitHub repo size

Whole Genome Alignment Tools

A Rust library and tools for whole genome alignment files

Install

Conda

conda install wgatools -c bioconda

Build from source

git clone https://github.com/wjwei-handsome/wgatools.git
cd wgatools
cargo build --release

or just install from git:

cargo install --git https://github.com/wjwei-handsome/wgatools.git

Nix

A nix flake is also available. You can build from within the repo like this:

nix build .#wgatools

Or directly install from github:

nix profile install github:wjwei-handsome/wgatools

Docker and Singularity

Using nix, we can derive docker and singularity images:

nix build .#dockerImage

First, we load the docker image into the local daemon:

docker load < result

It's then possible to pack up a singularity image:

singularity build wgatools-$(git log -1 --format=%h --abbrev=8).sif docker-daemon://wgatools:latest

This can be useful when running on HPCs where it might be difficult to build wgatools.

TOOLS

Usage

> wgatools
wgatools -- a cross-platform and ultrafast toolkit for Whole Genome Alignment Files manipulation

Version: 0.1.0

Authors: Wenjie Wei <[email protected]>

Usage: wgatools [OPTIONS] <COMMAND>

Commands:
  maf2paf    Convert MAF format to PAF format [aliases: m2p]
  maf2chain  Convert MAF format to Chain format [aliases: m2c]
  paf2maf    Convert PAF format to MAF format [aliases: p2m]
  paf2chain  Convert PAF format to Chain format [aliases: p2c]
  chain2maf  Convert Chain format to MAF format [aliases: c2m]
  chain2paf  Convert Chain format to PAF format [aliases: c2p]
  maf-index  Build index for MAF file [aliases: mi]
  maf-ext    Extract specific region from MAF file with index [aliases: me]
  call       Call Variants from MAF file [aliases: c]
  tview      View MAF file in terminal [aliases: tv]
  stat       Statistics for Alignment file [aliases: st]
  dotplot    TEST: Plot dotplot for Alignment file [aliases: dp]
  filter     Filter records for Alignment file [aliases: fl]
  rename     Rename MAF records with prefix [aliases: rn]
  maf2sam    TEST: maf2sam [aliases: m2s]
  pafcov     TEST: pafcov [aliases: pc]
  pafpseudo  TEST: generate pesudo maf from paf [aliases: pp]
  chunk      Chunk MAF file by length [aliases: ch]
  help       Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help (see more with '--help')
  -V, --version  Print version

GLOBAL:
  -o, --outfile <OUTFILE>  Output file ("-" for stdout), file name ending in .gz/.bz2/.xz will be compressed automatically [default: -]
  -r, --rewrite            Bool, if rewrite output file [default: false]
  -t, --threads <THREADS>  Threads, default 1 [default: 1]
  -v, --verbose...         Logging level [-v: Info, -vv: Debug, -vvv: Trace, defalut: Warn]

Each subcommand could be used with -h or --help to get more information.

Auto-Completion for easy-use

wgatools gen-completion --shell fish > ~/.config/fish/completions/wgatools.fish

Ready to enjoy it!

Format Conversion

Three mainstream formats(PAF, MAF, CHAIN) can be converted to each other.

For example, to convert MAF to PAF:

wgatools maf2paf test.maf > test.paf

or to convert PAF to MAF:

wgatools paf2maf test.paf --target target.fa --query query.fa > test.maf

Tip

If you want to convert into MAF format, you should provide target and query genome sequence files in {.fa, .fa.gz}.

stdin and stdout are supported, so you can use pipes to chain commands together🪆:

cat test.maf | wgatools maf2paf | wgatools paf2maf -g target.fa -q query.fa > test.maf

wgatools paf2chain test.paf | wgatools chain2maf -g target.fa -q query.fa | wgatools maf2chain | wgatools chain2paf > funny.paf

Dotplot for MAF/PAF file

We provide two modes for plot, for example:

  • BaseLevel

base

This mode can catch the alignment details in each record, such as matches, insertions and deletions. This can help us to better observe the local alignment.

wgatools dotplot -f paf test/testdotplot.paf > out.html

By default, INDELs smaller than 50bp are merged with adjacent match. You can also use the parameter -l, --length to specify the threshold.

In Interactive html, you can click on the legend to view only the types of interest, for example:

base2

Warning

NOTE: For better interactivity, the zoom function is turned on. However, if there is too much data, the effect may be limited by your browser performance.

This simple example can be found in the test directory.

  • Overview

overview

Similar to common dotplot scripts, it will draw each align record and color it according to identity.

wgatools dotplot test.maf -m overview > overview.html

😎 For vega and DIY hackers, we also provide output in json(vega schema) and csv formats.

Extract regions from MAF file

The line of MAF file is so long that it's hard to read. You can use maf-ext to extract specific region from MAF file with index:

wgatools maf-index test.maf

wgatools maf-extract test.maf -r chr1:1-10,chr2:66-888,chr3:100-50,chr_no:1-10,x:y-z

Tip

  1. Support multi-interval input, separated by commas
  2. Support bed input to specify interval
  3. Mismatched interval are skipped and warned

View MAF file in terminal

View the MAF file in the terminal smoothly, and you can also specify the area to view:

wgatools tview test.maf

example

Press to slide left and right.

Press q to exit.

Press g to bring up the navigation window, where the left side is the optional sequence name, and the right side is the optional interval of the selected sequence, you can press Tab to switch the left and right selection windows, and you can press to select the sequence and interval

After input a legal interval, you can Press Enter to jump to the Destination. Or press Esc to exit the navigation window.

Call Variants from MAF file

The MAF format completely records the alignment of each base, so it can be used to identify variants.

Supported explicit varaint types:

  • SNP
  • INS
  • DEL
  • INV

The default parameter does not output SNP and short INS and DEL (<50). The example is as follows:

wgatools call test/test.maf -s -l0

Output vcf:

##fileformat=VCFv4.4
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=INV_NEST,Number=1,Type=String,Description="Varations nested within inversion">
##FORMAT=<ID=QI,Number=1,Type=String,Description="Query informations">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample
ref.chr8	181470034	.	TG	T	.	.	SVTYPE=DEL;SVLEN=1;END=181470035	GT:QI	1|1:query.chr8@181989530@181989530@P
ref.chr8	181470279	.	G	C	.	.	.	GT	1|1
ref.chr8	181470292	.	A	G	.	.	.	GT	1|1
ref.chr8	181470431	.	C	G	.	.	.	GT	1|1
ref.chr8	181470609	.	C	A	.	.	.	GT	1|1
ref.chr8	181470641	.	C	T	.	.	.	GT	1|1
ref.chr8	181470774	.	A	AAACCAAGA	.	.	SVTYPE=INS;SVLEN=8;END=181470774	GT:QI	1|1:query.chr8@181990269@181990277@P
ref.chr8	181470793	.	G	T	.	.	.	GT	1|1
ref.chr8	181470894	.	C	T	.	.	.	GT	1|1
ref.chr8	181470895	.	A	T	.	.	.	GT	1|1
ref.chr8	181470903	.	G	A	.	.	.	GT	1|1

Important

This function does not support the identification of chromosomal rearrangements such as DUP, as this requires the extraction of sequences for realignment.

Chunk MAF file by length

You can split a huge MAF record into multiple records by length:

wgatools chunk -l 100 test/test.maf -o chunked.maf

Statistics for MAF/PAF file

wgatools stat test.maf
wgatools stat -f paf test.paf
wgatools stat test.maf

Filter records for MAF/PAF file

You can filter some records by block length or query_size.

For example, to filter records that contig vs reference:

wgatools filter test.maf -q 1000000 > filt.maf

For all-to-all alignment paf file which produced by wfmash, you can filter some pairs by align-size:

wgatools filter all2all.paf -a 1000000 > filt.maf

Rename MAF file

In some practices, the chromosome name of ref and query are both called chr1, which is not easy to distinguish. You can rename the sequence name in MAF file with a prefix:

wgatools rename --prefixs REF.,QUERY. input.maf > rename.maf

[Experimental] PAF Coverage for all-to-all alignment

If you have alignment results for multiple genomes, you can use this command to calculate the alignment coverage on the genomes. It's optimized to use with wfmash output.

wgatools pafcov all.paf > all.cov.beds

[Experimental] Generate pseudo MAF from all-to-all PAF

wgatools pafpseudo -f all.fa.gz all.paf -o out_dir -t 10

pp

Library

Some simple reader and iterator for PAF, MAF and Chain files:

use wgatools::parser::paf::PafReader;
use wgatools::parser::maf::MAFReader;
use wgatools::parser::chain::ChainReader;
fn main() {
    let mut mafreader = MAFReader::from_path("test.maf").unwrap();
    for record in mafreader.records() {
        let record = record.unwrap();
        println!("{:?}", record);
    }
    /// ...
}

TODO for library

  • Error detection and handling
  • Test cases
  • Documentations

Features

  • use nom to parse CIGAR string
  • use rayon to accelerate the speed of conversions
  • use ratatui to visualize MAF file in terminal
  • ...

ROADMAP

  • SAM converter [really need?]
  • Local improvement of alignment by re-alignment
  • MAF -> GAF -> HAL
  • output gvcf for variants
  • call variants from PAF

Contributing

Feel free to dive in! Open an issue or submit PRs.

License

MIT License © WenjieWei

wgatools's People

Contributors

andreaguarracino avatar ekg avatar sharkloc avatar wjwei-handsome avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

wgatools's Issues

About installing

@wjwei-handsome

I am writing to express my keen interest in trying out wgatools. However, I am encountering difficulties installing it from GitHub. I am wondering if you happen to have a compiled version for CentOS 7 or a mature Docker image available?

Your assistance in this matter would be greatly appreciated.

Thank you for your attention to this inquiry.

Best regards,

wrong posistion with inversion when convert from MAF to PAF

Hi, Wenjie

Awesome tool! I noticed a weird bug that caused inversion position calculation. I use Ler-0 and Col-0 to align with AnchorWave. I think this is introduced by inconsistence strands in MAF and PAF. So this should influence all alignment with inversion.

Anchors synteny
image

Synteny after wgatools maf2paf

image

MAF record

a score=255
s       Col-CC#1#Chr4   5497632 1136879 +       25768865
s       Ler-0#1#Chr4    2069952 1174831 -       22700724

After wgatools maf2paf:

Ler-0#1#Chr4    22700724        19455941        20630772        -       Col-CC#1#Chr4   25768865        5497632 6634511 933076  1136879 255     NM:i:431577

wgatools call VCF:

Col-CC#1#Chr4   5497633 .       T       <INV>   .       .       SVTYPE=INV;END=6634511  GT:QI   1|1:Ler-0#1#Chr4@19455941@20630772@N

If I align Col-CC#1#Chr4:5497632-6634511 with wfmash, here is the record:

Chr4:5497632-6634511    1136880 1072368 1136880 -       Chr4    22700724        2069952 2135307 60761   65355   10      gi:f:0.984175   bi:f:0.889711   md:f:0.97985    wt:i:272        pt:i:214        aa:i:23849 >
Chr4:5497632-6634511    1136880 1040    89360   -       Chr4    22700724        3132937 3232036 83960   99099   8       gi:f:0.971714   bi:f:0.829833   md:f:0.971255   wt:i:548        pt:i:151        aa:i:55543 >
Chr4:5497632-6634511    1136880 112576  196288  -       Chr4    22700724        3057298 3128288 64432   83712   6       gi:f:0.980372   bi:f:0.722915   md:f:0.970761   wt:i:308        pt:i:393        aa:i:77987 >
Chr4:5497632-6634511    1136880 770000  950000  -       Chr4    22700724        2256420 2438872 166348  182452  9       gi:f:0.982505   bi:f:0.858779   md:f:0.983062   wt:i:427        pt:i:311        aa:i:237161>
Chr4:5497632-6634511    1136880 970144  1053728 -       Chr4    22700724        2163240 2244231 68148   83584   5       gi:f:0.985467   bi:f:0.712532   md:f:0.979043   wt:i:267        pt:i:712        aa:i:108241>
Chr4:5497632-6634511    1136880 522448  765136  -       Chr4    22700724        2461840 2724638 224405  262798  7       gi:f:0.984742   bi:f:0.806713   md:f:0.980756   wt:i:1140       pt:i:2097       aa:i:472985>
Chr4:5497632-6634511    1136880 220576  500000  -       Chr4    22700724        2745552 3048267 243544  302715  6       gi:f:0.986108   bi:f:0.725764   md:f:0.98171    wt:i:819        pt:i:2493       aa:i:589277>
t

So I think the PAF record should be

Ler-0#1#Chr4    22700724        2069952        3244783        -       Col-CC#1#Chr4   25768865        5497632 6634511 933076  1136879 255     NM:i:431577

PAF2SAM support

Hi, Wenjie

Does wgatools have any plan for supporting PAF2SAM? Although PAF already has enough information for alignment, it is still a pain for visualizing alignments. We strongly rely on IGV for ref-based variant calling, but it didn't read CIGAR in PAF since they cannot index them (igvteam/igv#478). Would be possible for wgatools to support this?

Ideas and thoughts

Hi @wjwei-handsome,

Thanks for making this tool! This is just a list of thoughts I had looking through the repo, that you might consider.

  • you are much better at rust than I am!
  • it looks like you might be reading the whole PAF file into memory, maybe there is a way to iterate record by record instead?
  • it would be awesome if the PAF reader could also parse the cs tag into a CIGAR string allowing the conversion to other formats
  • it would be great to update your io code to allow reading and writing of compressed files. If you want an example you can look here: https://github.com/fiberseq/fibertools-rs/blob/633277099e1fde8684733c05d6b192434ee01c10/bio-io/src/lib.rs#L1-L99
  • For visualization might I humbly suggest integration with rustybam and SafFire?

I think that is all my ideas at the moment, but I will add as I think of things!

Thanks again,
Mitchell

Lower case for nucleotides results in error

Hi
Appreciating def. the effort of writing these tools in rust :)
Small issue which I encountered today while testing paf2maf is the fact that lower nucleotide cases in query or target will actually lead to an error:

wgatools paf2maf --query FF_SIM_1_query.fa --target  FF_SIM_1_target.fa test.paf -o /tmp/test.maf -r 
2023-12-29T10:07:54.434994554+01:00 WARN file /tmp/test.maf exist, will rewrite it
2023-12-29T10:07:54.456604913+01:00 INFO start write file: `/tmp/test.maf`
2023-12-29T10:08:05.112005605+01:00 ERROR Invalid Base: `c`

Switching all my nucleotides to upper case fixed it. But I think that is something which you might encounter quite frequently and what should be fixed.

Option -t double booked

Hi
It seems that your option -t stands for threads and for target, which is --threads and --target.
This conflicts when using one of the short forms.
The long forms seem okay.
Cheers

Missing some fields in minimap2 PAF

Hi, Weijie

I found the wgatools will stop after raise an error with minimap2 PAF minimap2 -x asm20 --eqx ref.fa query.fa > query.paf.

RROR CSV deserialize error by: CSV error: record 193 (line: 194, byte: 1339969): found record with 23 fields, but the previous record has 24 fields

Then I checked the PAF, only 2276/4195 records's 24 column start from cg:Z.
According to https://github.com/lh3/miniasm/blob/master/PAF.md, only the first 12 columns is the requirement, so the aligner may choose random output of some tags.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.