Giter VIP home page Giter VIP logo

vcfstats's Introduction

vcfstats - powerful statistics for VCF files

Pypi Github PythonVers docs github action Codacy Codacy coverage

Documentation | CHANGELOG

Motivation

There are a couple of tools that can plot some statistics of VCF files, including bcftools and jvarkit. However, none of them could:

  1. plot specific metrics
  2. customize the plots
  3. focus on variants with certain filters

R package vcfR can do some of the above. However, it has to load entire VCF into memory, which is not friendly to large VCF files.

Installation

pip install -U vcfstats

Or run with docker:

docker run \
  -w /vcfstats/workdir \
  -v $(pwd):/vcfstats/workdir \
  --rm justold/vcfstats:latest \
  vcfstats \
  --vcf myfile.vcf \
  -o outputs \
  --formula 'COUNT(1) ~ CONTIG' \
  --title 'Number of variants on each chromosome'

Gallery

Number of variants on each chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome' \
    --config examples/config.toml

Number of variants on each chromosome

Changing labels and ticks

vcfstats uses plotnine for plotting, read more about it on how to specify --ggs to modify the plots.

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome (modified)' \
    --config examples/config.toml \
    --ggs 'scale_x_discrete(name ="Chromosome", \
        limits=["1","2","3","4","5","6","7","8","9","10","X"]); \
        ylab("# Variants")'

Number of variants on each chromosome (modified)

Number of variants on first 5 chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml
# or
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG[1-5]' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml
# or
# require vcf file to be tabix-indexed.
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1) ~ CONTIG' \
    --title 'Number of variants on each chromosome (first 5)' \
    --config examples/config.toml -r 1 2 3 4 5

Number of variants on each chromosome (first 5)

Number of substitutions of SNPs

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
    --title 'Number of substitutions of SNPs' \
    --config examples/config.toml

Number of substitutions of SNPs

Only with SNPs PASS all filters

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
    --title 'Number of substitutions of SNPs (passed)' \
    --config examples/config.toml \
    --passed

Number of substitutions of SNPs (passed)

Alternative allele frequency on each chromosome

# using a dark theme
vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG' \
    --title 'Allele frequency on each chromosome' \
    --config examples/config.toml --ggs 'theme_dark()'

Allele frequency on each chromosome

Using boxplot

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG' \
    --title 'Allele frequency on each chromosome (boxplot)' \
    --config examples/config.toml \
    --figtype boxplot

Allele frequency on each chromosome

Using density plot/histogram to investigate the distribution:

You can plot the distribution, using density plot or histogram

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ CONTIG[1,2]' \
    --title 'Allele frequency on chromosome 1,2' \
    --config examples/config.toml \
    --figtype density

Allele frequency on chromosome 1,2

Overall distribution of allele frequency

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF ~ 1' \
    --title 'Overall allele frequency distribution' \
    --config examples/config.toml

Overall allele frequency distribution

Excluding some low/high frequency variants

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'AAF[0.05, 0.95] ~ 1' \
    --title 'Overall allele frequency distribution (0.05-0.95)' \
    --config examples/config.toml

Overall allele frequency distribution

Counting types of variants on each chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=VARTYPE) ~ CHROM' \
    # or simply
    # --formula 'VARTYPE ~ CHROM' \
    --title 'Types of variants on each chromosome' \
    --config examples/config.toml

Types of variants on each chromosome

Using bar chart if there is only one chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'COUNT(1, group=VARTYPE) ~ CHROM[1]' \
    # or simply
    # --formula 'VARTYPE ~ CHROM[1]' \
    --title 'Types of variants on chromosome 1' \
    --config examples/config.toml \
    --figtype pie

Types of variants on chromosome 1

Counting variant types on whole genome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    # or simply
    # --formula 'VARTYPE ~ 1' \
    --formula 'COUNT(1, group=VARTYPE) ~ 1' \
    --title 'Types of variants on whole genome' \
    --config examples/config.toml

Types of variants on whole genome

Counting type of mutant genotypes (HET, HOM_ALT) for sample 1 on each chromosome

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    # or simply
    # --formula 'GTTYPEs[HET,HOM_ALT]{0} ~ CHROM' \
    --formula 'COUNT(1, group=GTTYPEs[HET,HOM_ALT]{0}) ~ CHROM' \
    --title 'Mutant genotypes on each chromosome (sample 1)' \
    --config examples/config.toml

Mutant genotypes on each chromosome

Exploration of mean(genotype quality) and mean(depth) on each chromosome for sample 1

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'MEAN(GQs{0}) ~ MEAN(DEPTHs{0}, group=CHROM)' \
    --title 'GQ vs depth (sample 1)' \
    --config examples/config.toml

GQ vs depth (sample 1)

Exploration of depths for sample 1,2

vcfstats --vcf examples/sample.vcf \
    --outdir examples/ \
    --formula 'DEPTHs{0} ~ DEPTHs{1}' \
    --title 'Depths between sample 1 and 2' \
    --config examples/config.toml

Depths between sample 1 and 2

See more examples:

#15 (comment)

vcfstats's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

vcfstats's Issues

Question: Multiple samples in a plot

Hi,

Thanks for this great tool :)

is it possible to plot multiple samples in the same plot (for example the histogramm for VAF) either by merging all samples before hand in one file or by loading a list of files?

it is not clear how the config toml should be used

Discussed in #26

Originally posted by ialbert September 9, 2022
It is not clear how to use the toml config or whether that is even working at all. Suppose one wants to set parameters in a toml file.

The software still insists that one passes a --title even though the toml file has a title in it.

title = 'HELLO'

if I pass a title

vcfstats --config asm/config.toml --vcf out/test.vcf --outdir out --title 'test'  --formula='COUNT(1) ~ CONTIG'

the output title will be just the letter 'h'

plot saved to: out/h.col.png   

as evidenced above. If one specifies a formula like so

formula = 'DEPTHs{0} ~ CHROM'
title = 'Depth distribution on each chromosome'

then it does not work at all:

 raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('ITEM', 'D') at line 1, column 1.
Expected one of: 
	* "1"
	* NAME
Previous tokens: [None]

The documentation is inadequate to explain how it should work:

https://github.com/pwwang/vcfstats/blob/master/docs/configfile.md

If one makes a toml file that resembles the example the software crashes:

[[instance]]
formula = 'DEPTHs{0} ~ CHROM'
title = 'Depth distribution on each chromosome'

with:

    def_devpars = default_devpars.copy()
AttributeError: 'Namespace' object has no attribute 'copy'

new error message

Tried 0.6.0 using pip install -U vcfstats , but got new error message as below.

$ vcfstats --vcf examples/sample.vcf \

--outdir examples/
--formula 'COUNT(1) ~ CONTIG'
--title 'Number of variants on each chromosome'
--config examples/config.toml

Traceback (most recent call last):
File "/Users/pengl7/opt/anaconda3/envs/genomics/bin/vcfstats", line 8, in
sys.exit(main())
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/vcfstats/init.py", line 167, in main
load_config(opts['config'], opts)
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/vcfstats/init.py", line 122, in load_config
configs._load(config)
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/simpleconf.py", line 269, in _load
cached[rname] = LOADERS[ext](conf, with_profile).conf
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/simpleconf.py", line 55, in init
self.conf = self.load(cfile)
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/simpleconf.py", line 190, in load
conf = toml.load(fconf)
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/toml/decoder.py", line 156, in load
return loads(f.read(), _dict, decoder)
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/toml/decoder.py", line 262, in loads
original, i)
toml.decoder.TomlDecodeError: Found invalid character in key name: '!'. Try quoting the key name. (line 6 column 2 char 6)

error running the docker implementation of vcfstats

Hi all,
After a successful docker instalation of the vcfstats image, the command line
'docker run --rm justold/vcfstats:latest vcfstats --vcf myfile.vcf -o outputs --formula 'COUNT(1) ~ CONTIG' --title 'Number of variants on each chromosome'
shows next error:
[01/16/23 09:11:23] INFO Combining regions, be reminded that regions should
NOT be overlapping ...
[01/16/23 09:11:23] INFO Getting vcf handler by given regions ...
[01/16/23 09:11:23] WARNING (cyvcf2) [E::hts_open_format] Failed to open file
"myfile.vcf" : No such file or directory
Traceback (most recent call last):
File "/usr/local/bin/vcfstats", line 6, in
sys.exit(main())
File "/vcfstats/vcfstats/cli.py", line 214, in main
vcf, samples = get_vcf_by_regions(
File "/vcfstats/vcfstats/cli.py", line 46, in get_vcf_by_regions
vcf = VCF(str(vcffile), gts012=True)
File "cyvcf2/cyvcf2.pyx", line 258, in cyvcf2.cyvcf2.VCF.init
File "cyvcf2/cyvcf2.pyx", line 190, in cyvcf2.cyvcf2.HTSFile._open_htsfile
OSError: Error opening myfile.vcf

Of course, 'myfile.vcf' is in the path. Please, some help to fix that issue will be welcome. Thanks in advance.

ImportError: cannot import name 'LOADERS' from 'simpleconf'

Hi, I am trying to run the following command from the README:
vcfstats --vcf test_datasets/test.vcf \ --outdir out/ \ --formula 'COUNT(1) ~ CONTIG' \ --title 'Number of variants on each chromosome' \ --config out/config.toml

and I get the following error, with traceback:

Traceback (most recent call last):
  File "/Users/xxx/opt/anaconda3/envs/dis/bin/vcfstats", line 5, in <module>
    from vcfstats.cli import main
  File "/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/vcfstats/__init__.py", line 4, in <module>
    from . import macros
  File "/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/vcfstats/macros.py", line 5, in <module>
    from .utils import MACROS
  File "/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/vcfstats/utils.py", line 9, in <module>
    from pyparam import defaults
  File "/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/pyparam/__init__.py", line 5, in <module>
    from .params import Params
  File "/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/pyparam/params.py", line 9, in <module>
    from simpleconf import LOADERS, Config
ImportError: cannot import name 'LOADERS' from 'simpleconf' (/Users/xxx/opt/anaconda3/envs/dis/lib/python3.7/site-packages/simpleconf/__init__.py)

I have installed vcfstats using pip.

Many thanks,
Miguel

cyvcf2 warnings while parsing the vcf file

Hi,

There are a number of warnings in the log file after vcfstats execution. Please find the examples below:

[02/02/22 22:21:42] WARNING (cyvcf2) [W::vcf_parse] Contig '1D' is not defined
in the header. (Quick workaround: index the file
with tabix.)
[02/02/22 22:21:42] WARNING (cyvcf2) [W::vcf_parse] INFO 'QualityScore' is not
defined in the header, assuming Type=String
My question is, can i just ignore all of that warnings, because these are not important?

Regards,
Denis

Matplotlib Contour Install Error

I have installed vcfstats using pip.

When I try to say vcfstats --help, it tells me

ModuleNotFoundError: No module named 'matplotlib._contour'

Any idea why?

I did install matplotlib through pip and python3-matplotlib is also installed via apt.

TypeError: getattr(): attribute name must be string

Hi @pwwang!
I'm trying to use vcfstats on a multiple sample vcf file in conda environment. But the command line return some errors that I could not solve.

I got TypeError: getattr(): attribute name must be string

vcfstats --vcf sample.vcf -o example/ --formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' --title 'test'

[06/26/22 09:20:50] INFO     Combining regions, remind that regions should not be overlapping ...                                            
[06/26/22 09:20:50] INFO     Getting vcf handler by given regions ...                                                                        
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::bcf_hdr_check_sanity] GL should be declared as Number=G                                            
[06/26/22 09:20:50] INFO     Getting instances ...                                                                                           
[06/26/22 09:20:50] INFO     INSTANCE: 'test'                                                                                                
[06/26/22 09:20:50] INFO     test: Parsing formulas ...                                                                                      
[06/26/22 09:20:50] INFO     test: plot type: col                                                                                            
[06/26/22 09:20:50] INFO     Start reading variants ...                                                                                      
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
[06/26/22 09:20:50] WARNING  (cyvcf2) [W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.) 
[06/26/22 09:20:50] INFO     105 variants read.                                                                                              
[06/26/22 09:20:50] INFO     test: Summarizing aggregations ...                                                                              
Traceback (most recent call last):
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/bin/vcfstats", line 8, in <module>
    sys.exit(main())
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/vcfstats/[cli.py](http://cli.py/)", line 209, in main
    instance.plot()
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/vcfstats/[instance.py](http://instance.py/)", line 285, in plot
    self.save_plot(plt, theme_elems)
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/vcfstats/[instance.py](http://instance.py/)", line 318, in save_plot
    Diot(height=1000, width=1000, res=100, format=self.figfmt)
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/diot/[diot.py](http://diot.py/)", line 312, in __or__
    ret.update(other)
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/diot/[diot.py](http://diot.py/)", line 294, in update
    dict_to_update = dict(*value, **kwargs)
  File "/gpfs/ycga/project/lifton/jc2545/conda_envs/vcfstats/lib/python3.8/site-packages/pyparam/[utils.py](http://utils.py/)", line 46, in __getitem__
    return getattr(self, name)
TypeError: getattr(): attribute name must be string

And i used it in conda environment (python 3.8)
python 3.9 also returns the same error.

Package Version


aiohttp 3.8.1
aiosignal 1.2.0
aiosqlite3 0.3.0
asttokens 2.0.5
async-timeout 4.0.2
attrs 21.4.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
certifi 2022.6.15
chardet 5.0.0
charset-normalizer 2.0.12
click 8.1.3
coloredlogs 15.0.1
commonmark 0.9.1
cycler 0.11.0
cyvcf2 0.30.15
datar 0.8.5
decorator 5.1.1
descartes 1.1.0
diot 0.1.6
executing 0.8.3
fonttools 4.33.3
frozenlist 1.3.0
humanfriendly 10.0
idna 3.3
importlib-metadata 4.12.0
infi.systray 0.1.12
inflection 0.5.1
intervaltree 3.1.0
ipython 8.4.0
jedi 0.18.1
kiwisolver 1.4.3
lark-parser 0.12.0
Markdown 3.3.7
matplotlib 3.5.2
matplotlib-inline 0.1.3
mizani 0.7.4
multidict 6.0.2
numpy 1.23.0
open-cravat 2.0.1
oyaml 1.0
packaging 21.3
palettable 3.3.0
pandas 1.4.3
parso 0.8.3
patsy 0.5.2
pdtypes 0.0.4
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.1.1
pip 22.1.2
pipda 0.6.0
plotnine 0.8.0
plotnine-prism 0.0.0
prompt-toolkit 3.0.29
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
Pygments 2.12.0
pyliftover 0.4
pyparam 0.5.3
pyparsing 3.0.9
python-dateutil 2.8.2
python-simpleconf 0.5.6
python-slugify 6.1.2
pytz 2022.1
PyYAML 6.0
requests 2.28.0
requests-toolbelt 0.9.1
rich 12.4.4
rtoml 0.8.0
scipy 1.8.1
setuptools 62.6.0
six 1.16.0
sortedcontainers 2.4.0
stack-data 0.3.0
statsmodels 0.13.2
text-unidecode 1.3
toml 0.10.2
traitlets 5.3.0
twobitreader 3.1.7
typing_extensions 4.2.0
urllib3 1.26.9
varname 0.8.3
vcfstats 0.4.0
wcwidth 0.2.5
websockets 10.3
wheel 0.37.1
XlsxWriter 3.0.3
yarl 1.7.2
zipp 3.8.0

AttributeError: module 'pyparam.params' has no attribute 'vcf'

This seems a really nice tool in exploring QC of vcf. I wanted to give a try. After I installed it in my mac using pip install vcfstats and run the example vcf file, I got the error as below. Do you have any idea or suggestion on wha the problem and solutions are. Thank you

$ vcfstats --vcf examples/sample.vcf \

--outdir examples/
--formula 'COUNT(1) ~ CONTIG'
--title 'Number of variants on each chromosome'
--config examples/config.toml

Traceback (most recent call last):
File "/Users/pengl7/opt/anaconda3/envs/genomics/bin/vcfstats", line 5, in
from vcfstats import main
File "/Users/pengl7/opt/anaconda3/envs/genomics/lib/python3.7/site-packages/vcfstats/init.py", line 18, in
params.vcf.required = True
AttributeError: module 'pyparam.params' has no attribute 'vcf'

numpy version

as I see vcfstats 0.4.2 required numpy <2.0, >=1.22. However when I try to create SNP dist plot via

vcfstats --vcf examples/sample.vcf
--outdir examples/
--formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]'
--title 'Number of substitutions of SNPs (passed)'
--config examples/config.toml
--passed

it gives an error as module 'numpy' has no attribute 'float'. This is because as of numpy v1.20, numpy.float was deprecated. Anyway by downgrading the numpy even if pip gives an error as incompatible numpy version it still creates the desired plot. So maybe you would like to fix this, thanks.

vcfstats error

Hello,
I am trying to run vcfstats.
the command:
vcfstats --vcf GEUVADIS.chr*.vcf.gz --outdir new --formula 'COUNT(1) ~ CONTIG' --title 'Number of variants on each chromosome (modified)' --config config.toml --ggs 'scale_x_discrete(name ="Chromosome",
limits=c("1","2","3","4","5","6","7","8","9","10","X")) +
ylab("# Variants")'
the error:
[2021-08-04 09:50:34,888 INFO] 1881217 variants read.
[2021-08-04 09:50:34,889 INFO] [T] Summarizing aggregations ...
[2021-08-04 09:50:34,929 INFO] [T] Composing R code ...
[2021-08-04 09:50:34,934 INFO] [T] Running R code to plot ...
[2021-08-04 09:50:34,934 INFO] [T] Data will be saved to: new/T.txt
[2021-08-04 09:50:34,934 INFO] [T] Plot will be saved to: new/T.col.png
[2021-08-04 09:50:35,631 ERROR] [T] Loading required package: ggplot2
[2021-08-04 09:50:35,632 ERROR] [T] Registered S3 methods overwritten by 'ggplot2':
[2021-08-04 09:50:35,632 ERROR] [T] method from
[2021-08-04 09:50:35,632 ERROR] [T] [.quosures rlang
[2021-08-04 09:50:35,632 ERROR] [T] c.quosures rlang
[2021-08-04 09:50:35,632 ERROR] [T] print.quosures rlang
[2021-08-04 09:50:35,632 ERROR] [T] Warning messages:
[2021-08-04 09:50:35,633 ERROR] [T] 1: In png(paste0("new/T", ".", figtype, ".png"), height = 2000, width = 2000, :
[2021-08-04 09:50:35,633 ERROR] [T] unable to load shared object '/home/kxj190026/anaconda3/envs/chip-seq/lib/R/library/grDevices/libs//cairo.so':
[2021-08-04 09:50:35,633 ERROR] [T] /home/kxj190026/anaconda3/envs/chip-seq/lib/R/library/grDevices/libs//../../../.././libharfbuzz.so.0: undefined symbol: FT_Done_MM_Var
[2021-08-04 09:50:35,633 ERROR] [T] 2: In png(paste0("new/T", ".", figtype, ".png"), height = 2000, width = 2000, :
[2021-08-04 09:50:35,633 ERROR] [T] failed to load cairo DLL
[2021-08-04 09:50:35,633 ERROR] [T] Error: unexpected input in " p = p + scale_x_discrete(name ="Chromosome", "
[2021-08-04 09:50:35,633 ERROR] [T] Execution halted

Can you please help me out with this.

Order

How can I keep chr order as I like while using the command below;

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' \
	--title 'Number of variants on each chromosome (first 5)' \
	--config examples/config.toml

Assume I set CONTIG as [1-22] I don't want it to plot chr10 after chr1.

I see the this command as well;

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG' \
	--title 'Number of variants on each chromosome (modified)' \
	--config examples/config.toml \
	--ggs 'scale_x_discrete(name ="Chromosome", \
		limits=["1","2","3","4","5","6","7","8","9","10","X"]); \
		ylab("# Variants")'

but while using this command if I don't want to see all chromosomes in the plot and kick some available ones out, the code evaluates the as na and throws and error if I am not doing something wrong.

Thanks for the nice tool.

png plot of vcfstats

Hello,
I am trying to run vcfstats.
the command:
vcfstats --vcf GEUVADIS.chr1.genotype.vcf.gz --outdir new --formula 'COUNT(1) ~ CONTIG' --title 'Number of variants on each chromosome' --config config.toml
the error:
[2021-08-04 09:37:30,895 INFO] 3003378 variants read.
[2021-08-04 09:37:30,896 INFO] [T] Summarizing aggregations ...
[2021-08-04 09:37:30,931 INFO] [T] Composing R code ...
[2021-08-04 09:37:30,956 INFO] [T] Running R code to plot ...
[2021-08-04 09:37:30,956 INFO] [T] Data will be saved to: new/T.txt
[2021-08-04 09:37:30,956 INFO] [T] Plot will be saved to: new/T.col.png
Its creating two new files T.txt and T.plot.R but T.col.png is not getting generated.
Can you please help me out with this.

Config file for the vcfstats

Hi,
Thank you for developing vcfstats software! Could you advise me please, where i may find the "config.toml" file to run vcfstats?

Regards,
Denis

Vcfstats doesn't show the MNPs?

Hi there,

I was trying to produce plots with vcfstats and it doesn't seem to represent MNPs (multiple nucleotide polymorphisms) as reported by bcftools stats ?
I also notice a discrepancy in SNPs/Indels between bcftools and vcfstats, has this ever happened to you ?
I'll take any information I can get, thanks!
Best,
Laurence

SVG format

Dear colleagues, is it possible to get a figure in SVG format directly after excusing vcfstats instead of PNG format?

Thanks in advance

ValueError

Hello! Thanks for this great tool. I used pip to install but when I ran
vcfstats --vcf examples/sample.vcf
--outdir examples/
--formula 'COUNT(1) ~ CONTIG'
--title 'Number of variants on each chromosome'
--config examples/config.toml
I was met with the following error:
Traceback (most recent call last):
File "/Users/mothership/miniconda3/bin/vcfstats", line 5, in
from vcfstats import main
File "/Users/mothership/miniconda3/lib/python3.7/site-packages/vcfstats/init.py", line 9, in
from cyvcf2 import VCF
File "/Users/mothership/miniconda3/lib/python3.7/site-packages/cyvcf2/init.py", line 1, in
from .cyvcf2 import (VCF, Variant, Writer, r_ as r_unphased, par_relatedness,
File "cyvcf2/cyvcf2.pyx", line 1, in init cyvcf2.cyvcf2
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Might you be able to advise me on what I have done wrong?

Many thanks!
Melody

Gentle suggestion about the new plots implementation

The vcfstats are able to produce a plenty of informative plots to explore and access genotypes in the vcf file. I'd like to list several currently not supported by vcfstats plotting possibilities which are in my opinion would help in understanding of the content of the vcf file.

  1. Barplot with just two bars - one for transitions and one for transversions respectively. Calculate the transitions/transversions ratio and draw it somewhere nearby the barplot.

  2. Plot the DP value from the INFO field of vcf. It may be a histogram showing the DP distribution across the whole genome or one plot with the multiple boxplots on it (one boxplot for each chromosome).

  3. Plot the per sample percent of the missing genotype calls (single histogram for all samples).

  4. Plot the per locus percent of the missing genotype calls ( single histogram for the all loci in genome or plot with multiple boxplots - one per chromosome).

  5. Barplot reflecting the percent of positions containing only the homozygous (only HOM REF + only HOM ALT + positions where both the HOM REF and HOM ALT are observed) calls to the total variants number on the chromosome (number of bars is equal to the chromosome number).

Regards,
Denis

Pass a list of VCF files

I'm really enjoying the simplicity of this package so far - thank you for creating it!

I'd like to modify the code to be able to accept a list of VCF files as input, and have the variant metrics aggregated across all VCFs. This would be useful, for example, when a user has VCF files separated by chromosome, but would like to plot the genome-wide metrics. However, I'm only an amateur python coder at this point, so I'm hoping you can point me toward the area of the code I would need to modify to make this happen. I tried adding nargs and action = "extend" to the args.toml file, such that it looks like this for the --vcf arg:

[[arguments]]
flags = ["--vcf", "-v"]
required = true
help = "The VCF file."
type = "path"
nargs = "+"
action = "extend"

but I get the following error when I try to pass multiple --vcf arguments to vcfstats:

OSError: Error opening [PosixPath('/mnt/chr1.vcf.gz'), PosixPath('/mnt/chr2.vcf.gz')]

Any suggestions on what I'm doing wrong here? Is there code elsewhere that I need to edit in order to make this work?

vcfstats command not recognized

Hello, I followed the installation instructions by using pip -U install and the command vcfstats is still not recognized. When I ran the installation command I got no errors. Do I have to clone the repository as well?

Thanks

Labelling

On het hom plot I would like to use labels on graph as HET & HOM not like HET & HOM_ALT, so how can I edit that part?

and on GQ vs depth graph I want the chrom labels in order as 1,2,3, etc but with the current code it continues as 1, 10, 2 etc, so how can I change it?

thanks..

No pyproject.toml file during poetry update

I tried to build the docker container using the Dockerfile provided, however, I receive an error that the pyproject.toml file cannot be found when running 'poetry update'.

docker build . --tag py_vcfstats

[+] Building 18.8s (9/9) FINISHED                                                                                                                                                                           
 => [internal] load build definition from Dockerfile                                                                                                                                                   0.0s
 => => transferring dockerfile: 320B                                                                                                                                                                   0.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => => transferring context: 2B                                                                                                                                                                        0.0s
 => [internal] load metadata for docker.io/library/python:3.9.12-slim-buster                                                                                                                           0.7s
 => [1/5] FROM docker.io/library/python:3.9.12-slim-buster@sha256:830e161433edfe047a23ebc99c12ee0eb1dc0a50e6b5f1c98e869ac271786632                                                                     0.0s
 => [internal] load build context                                                                                                                                                                      0.0s
 => => transferring context: 315B                                                                                                                                                                      0.0s
 => CACHED [2/5] RUN pip install -U cython poetry                                                                                                                                                      0.0s
 => CACHED [3/5] WORKDIR /vcfstats                                                                                                                                                                     0.0s
 => CACHED [4/5] COPY . /vcfstats/                                                                                                                                                                     0.0s
 => ERROR [5/5] RUN poetry config virtualenvs.create false &&     pip install -U pip &&     poetry update &&     poetry install                                                                       18.0s
------                                                                                                                                                                                                      
 > [5/5] RUN poetry config virtualenvs.create false &&     pip install -U pip &&     poetry update &&     poetry install:                                                                                   
#9 4.485 Requirement already satisfied: pip in /usr/local/lib/python3.9/site-packages (22.0.4)                                                                                                              
#9 5.499 Collecting pip                                                                                                                                                                                     
#9 5.994   Downloading pip-23.2-py3-none-any.whl (2.1 MB)                                                                                                                                                   
#9 6.263      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 8.8 MB/s eta 0:00:00                                                                                                                      
#9 8.230 Installing collected packages: pip
#9 8.232   Attempting uninstall: pip
#9 8.236     Found existing installation: pip 22.0.4
#9 8.959     Uninstalling pip-22.0.4:
#9 9.342       Successfully uninstalled pip-22.0.4
#9 15.82 Successfully installed pip-23.2
#9 15.83 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#9 17.91 
#9 17.91 Poetry could not find a pyproject.toml file in /vcfstats or its parents
------
executor failed running [/bin/sh -c poetry config virtualenvs.create false &&     pip install -U pip &&     poetry update &&     poetry install]: exit code: 1

option --title never works

--Hi,

this command:
vcfstats --vcf ERR5875663.snpeff.vcf --outdir examples --formula 'COUNT(1, group=VARTYPE) ~ CHROM' --title 'ERR5875663' --config config.toml

doesn't print the title on plot, just used to name the output file.

Why ?

VCF Mean Depth Plot Is In Reverse

Hi-

My VCF is like this

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=<ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=<ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=<ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=<ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample1
1	1669766	.	A	.	.	PASS	DP=1;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:1
1	1669767	.	A	.	.	PASS	DP=1;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:1
1	1669768	.	T	.	.	PASS	DP=5;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:5
1	1669769	.	G	.	.	PASS	DP=7;WT=0;HET=0;HOM=0;NC=1	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	./.:.:7
1	1669770	.	T	.	.	PASS	DP=8;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:14:8:8:8:0:0%:1E0:34:0:8:0:0:0
1	1669771	.	T	.	.	PASS	DP=9;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:18:9:9:9:0:0%:1E0:33:0:9:0:0:0
1	1669772	.	T	.	.	PASS	DP=9;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:18:9:9:9:0:0%:1E0:34:0:9:0:0:0
1	1669773	.	G	.	.	PASS	DP=16;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:29:16:16:16:0:0%:1E0:34:0:16:0:0:0
1	1669774	.	T	.	.	PASS	DP=16;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:29:16:16:16:0:0%:1E0:33:0:16:0:0:0
1	1669775	.	C	.	.	PASS	DP=21;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:40:21:21:21:0:0%:1E0:33:0:21:0:0:0
1	1669776	.	A	.	.	PASS	DP=22;WT=1;HET=0;HOM=0;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	0/0:40:22:22:22:0:0%:1E0:34:0:22:0:0

When I use this command

vcfstats --vcf /data/test.vcf --outdir /data/examples/ --formula 'MEAN(DEPTHs{0}) ~ CONTIG[1,2,3,4,5]' --title 'MeanDepth' --config examples/config.toml

My mean depth plots are in the negative scale to the bottom.

Can you please help?

AttributeError: module 'py' has no attribute 'io'

I am getting the following error when running vcfstats and I do not know how I can go about this error. Can someone help with this:

 vcfstats --vcf output_filtered.vcf --outdir examples/ --formula 'AAF ~ CONTIG' --title 'Allele frequency on each chromosome (boxplot)' --config examples/config.toml  --figtype boxplot
[05/22/23 12:24:43] INFO     Combining regions, be reminded that regions should NOT be overlapping ...                                                                    
[05/22/23 12:24:43] INFO     Getting vcf handler by given regions ...                                                                                                     
Traceback (most recent call last):
  File "/home/okendojo/.local/bin/vcfstats", line 8, in <module>
    sys.exit(main())
  File "/home/okendojo/.local/lib/python3.8/site-packages/vcfstats/cli.py", line 181, in main
    vcf, samples = get_vcf_by_regions(
  File "/home/okendojo/.local/lib/python3.8/site-packages/vcfstats/cli.py", line 39, in get_vcf_by_regions
    with capture_c_msg("cyvcf2"):
  File "/usr/local/Anaconda/envs/py3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/okendojo/.local/lib/python3.8/site-packages/vcfstats/utils.py", line 65, in capture_c_msg
    capture = py.io.StdCaptureFD(out=False, in_=False)
AttributeError: module 'py' has no attribute 'io'

Install error

There seems to be a problem with the call to MSVC cl.exe that gets triggered during the install of this software. On a fresh install of Python 3.9.5 and the MSVC 2019 tools, I receive this error in response to "python -m pip install vcfstats":

C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29910\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ihtslib -Icyvcf2 -IC:\Python3\lib\site-packages\numpy\core\include -IC:\Python3\include -IC:\Python3\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29910\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /Tccyvcf2/cyvcf2.c/Fobuild\temp.win-amd64-3.9\Release\cyvcf2/cyvcf2.obj -Wno-sign-compare -Wno-unused-function -Wno-strict-prototypes -Wno-unused-result -Wno-discarded-qualifiers

cl : Command line error D8021 : invalid numeric argument '/Wno-sign-compare'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29910\bin\HostX86\x64\cl.exe' failed with exit code 2

Passing VCF context to macros

Hello,

I'm trying to make a GLOBAL_POS macro, which calculates sum(all_previous_chrom_lengths) + variant.POS. However, getting all_previous_chrom_lengths requires reading the ##contig entries from the VCF header (with cyvcf2.VCF.seqlens and cyvcf2.VCF.seqnames). As far as I can tell, there is no way to access the VCF object in the context of a macro, is this correct? If not, could I please request that some mechanism to achieve the above be added? Perhaps supporting mymacro(variant, vcf): ... as an optional arg passed to macro functions that support it? (similar to how input functions in snakemake can use additional arguments beyond just "wildcards", so long as they are given in a specific order or with specific names)

Best,
K

Increase memory allocation

Is there an option for specifying the memory allocation to vcfstats? My job keeps failing due to out of memory issues, even though I'm only running one chromosome and I've requested a worker with 125 Gb of memory.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.