
cerebra's Introduction

cerebra


What is cerebra?

This tool allows you to quickly summarize and interpret VCF files.

If you're interested in learning the full complement of genomic variants present in your DNA/RNA samples, a typical workflow might involve sequencing, followed by variant calling with a tool such as GATK HaplotypeCaller, which generates variant calling format (VCF) files.

A VCF file looks like this:

##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 631391 . C T 72.28 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=25.36;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0

Note that only a single VCF record is displayed here. A sequencing run can generate on the order of 10^8 unique VCF records, and only a small portion of these are likely to be relevant to the researcher. Thus, drawing conclusions from VCF files remains a substantial challenge.

cerebra provides a fast and intuitive framework for summarizing VCF records across samples. It comprises three modules that do the following:

    1) remove germline variants from samples of interest        
    2) count the total number of variants in a given sample, on a per-gene basis           
    3) report protein variants for each sample                 

cerebra gets its name from the eponymous X-Men character, who had the ability to detect mutant individuals among the general public.

If you're working with tumor data, it might be a good idea to limit the variant search space to known, and potentially actionable, cancer variants. To that end, cerebra implements an optional method for restricting the search to variants also found in the COSMIC database.

This tool was developed for, but is certainly not limited to, single-cell RNA sequencing data. cerebra is free software available under the MIT license.

What makes cerebra different from traditional VCF parsers?

Python libraries exist (e.g. PyVCF and vcfpy) for extracting information from VCF files. Another option is GATK VariantsToTable, which produces a file that looks like this:

CHROM    POS ID      QUAL    AC
 1        10  .       50      1
 1        20  rs10    99      10

This table contains only genomic (i.e. DNA-level) coordinates. Often the next questions are: what are the genes associated with these variants, and what are the peptide-level effects of these variants? cerebra queries a reference genome (.fa) and annotation (.gtf) to match each DNA-level variant with its associated gene, and its probable protein variant. cerebra produces the following outfile:

$ python
>>> import json
>>> f = open('/path/to/cerebra/output.json')
>>> data = json.load(f)
>>> print(json.dumps(data, indent=4))

{
    "CCN1": {
        "A16_B000563": [],
        "A1_B001546": [],
        "A1_B002531": [
            "ENSP00000398736.2:p.(Glu189=)"
        ],
        "A1_B002570": [],
        "A2_B002558": [],
        "A3_B000561": [
            "ENSP00000398736.2:p.(Arg209Trp)",
            "ENSP00000398736.2:p.(Ile90Val)"
        ],
        "A3_B000568": [],
        "A3_B001544": [
            "ENSP00000398736.2:p.(Ala82Thr)"
        ],
        "A3_B002090": [],
        "A3_B002531": [
            "ENSP00000398736.2:p.(Pro217Ser)"
        ]
    }
}

Here CCN1 is a gene name while A16_B000563, A1_B001546, A1_B002531,... are RNA-seq sample IDs. cerebra reports variants for every gene in the genome, for every sample in a given experiment. The ENSP* numbers are Ensembl translation IDs -- unique identifiers that correspond to exactly one polypeptide in the Ensembl database. The strings enclosed in parentheses are the amino-acid-level variants reported in that particular sample. For example, the string Arg209Trp indicates that position 209 of this particular polypeptide should contain an Arg, but the experimental sample instead contains a Trp. cerebra adheres to HGVS sequence variant nomenclature in reporting amino-acid variants.
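
If you want to work with these HGVS strings programmatically, a regular expression is often enough. A minimal sketch follows; the pattern and helper name are illustrative, not part of cerebra's API:

import re

# Illustrative pattern for substitution strings like
# "ENSP00000398736.2:p.(Arg209Trp)" and synonymous variants
# like "ENSP00000398736.2:p.(Glu189=)"
HGVS_RE = re.compile(
    r"(?P<protein>ENSP\d+\.\d+):p\.\("
    r"(?P<ref>[A-Za-z]{3})(?P<pos>\d+)(?P<alt>[A-Za-z]{3}|=)\)"
)

def parse_protein_variant(s):
    """Split an HGVS protein-variant string into its parts, or return None."""
    m = HGVS_RE.match(s)
    return m.groupdict() if m else None

print(parse_protein_variant("ENSP00000398736.2:p.(Arg209Trp)"))
# {'protein': 'ENSP00000398736.2', 'ref': 'Arg', 'pos': '209', 'alt': 'Trp'}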

Features

germline-filter

Variant calling is often applied to the study of cancer. If the research project is centered around a “tumor vs. normal” question, then germline-filter is the proper starting point. This module removes germline variants that are common between tumor and normal samples, and thus excludes variants unlikely to be pathogenic for the cancer under study. The user provides a very simple metadata file (see USAGE.md) that indicates which tumor samples correspond to which normal samples. The output of germline-filter is a set of trimmed-down VCF files, which will be used for the next two steps. If you do not have access to “normal” samples then proceed directly to count-variants or find-peptide-variants.
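
The metadata file is a two-column CSV that maps each sample of interest to its matched germline sample (see USAGE.md for the authoritative description). A minimal example:

cell_id,patient_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2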

count-variants

This module reports the raw variant counts for every gene across every sample. The output is a CSV file that contains counts for each sample versus every gene in the genome. If working with cancer variants, the user has the option of limiting the search space to variants also found in the COSMIC database.
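
Illustratively, that counts matrix might look like the following (the counts here are invented, and the actual row/column orientation may differ):

gene,A1_B001546,A3_B000561,A3_B001544
CCN1,0,2,1
EGFR,3,0,0
KRAS,1,1,4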

find-peptide-variants

This module reports the protein variants associated with each genomic variant. VCF records are converted to peptide-level variants, and then to Ensembl protein IDs, in accordance with the HGVS sequence variant nomenclature. The output is a hierarchically ordered file (CSV or JSON) that reports the Ensembl protein ID and the gene associated with each variant, for each experimental sample. The user again has the option of limiting the search space to variants also found in the COSMIC database. This module also has an option to report the number of variant vs. wildtype reads found at each locus.

Dependencies

cerebra depends on some (fairly standard) packages and libraries. Before installing, it might be a good idea to make sure all of the requisite packages are installed on your system (note: if installing with Docker you can skip this step).

MacOS Dependencies:

sudo pip install setuptools
brew update
brew install openssl
brew install zlib

Linux Dependencies:

sudo apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev

At present, cerebra is not installable on Windows. cerebra depends on the pysam library (or rather, pysam is a dependency-of-a-dependency), and this library is currently only available on Unix-like systems. Windows solutions like WSL exist for overcoming precisely this challenge; however, cerebra has not been tested on WSL or any other Unix-like subsystem for Windows.

Installation (for users)

There are four different methods available to install cerebra. Choose one of the following:

With Docker (recommended)

docker pull lincolnharris/cerebra
docker run -it lincolnharris/cerebra

warning: this image will take up ~1 GB of space.

With the python standard library venv module

python3 -m venv cerebra
source cerebra/bin/activate
pip3 install cerebra 

With conda

conda create -n cerebra python=3.7
conda activate cerebra
pip3 install cerebra

From PyPI (system-wide installation, NOT RECOMMENDED)

For novice users, it's generally a better idea to install packages within virtual environments. However, cerebra can be installed system-wide if you so choose.

pip3 install cerebra

# OR, if you don't have root privileges
pip3 install --user cerebra

Installation (for developers)

Here's how to set up cerebra for local development. After installing the requisite dependencies:

  1. Fork the cerebra repo on GitHub: https://github.com/czbiohub/cerebra

  2. Clone your fork locally:

    $ git clone https://github.com/your-name/cerebra.git
    
  3. Install your local copy into a virtualenv. Using the standard library venv module:

    $ cd cerebra
    $ python3 -m venv cerebra-dev
    $ source cerebra-dev/bin/activate
    $ pip3 install -r requirements.txt -r test_requirements.txt -e .
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you're done making changes, check that your changes pass flake8 and pytest:

    $ make test
    $ make coverage
    $ make lint
    
  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website. See CONTRIBUTING.md for more.

(Quickstart) Usage

Running cerebra with no arguments should return usage information:

Usage: cerebra  <command>

  a tool for fast and accurate summarizing of variant calling format (VCF)
  files

Options:
  -h, --help  Show this message and exit.

Commands:
  germline-filter        filter out common SNPs/indels between tumor and normal samples
  count-variants         count total number of variants in each sample, and report on a per-gene basis
  find-peptide-variants  report peptide-level SNPs and indels in each sample, and associated coverage

Note that the -h flag will display usage information for each of the three commands.

An example workflow might look like this:

Step 1:

cerebra germline-filter --processes 2 --normal_path /path/to/normal/vcfs --tumor_path /path/to/tumor/vcfs --metadata /path/to/metadata/file --outdir /path/to/filtered/vcfs

Step 2:

cerebra count-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --refgenome /path/to/genome/annotation --outfile /path/to/output/file /path/to/filtered/vcfs/*

Step 3:

cerebra find-peptide-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --annotation /path/to/genome/annotation --genomefa /path/to/genome/fasta --report_coverage 1 --output /path/to/output/file /path/to/filtered/vcfs/*

For advanced usage information, see USAGE.md.

Authors

This work was produced by Lincoln Harris, Rohan Vanheusden, Olga Botvinnik and Spyros Darmanis of the Chan Zuckerberg Biohub. For questions please contact [email protected]

Contributing

We welcome any bug reports, feature requests, or other contributions. Please submit a well-documented report on our issue tracker. For substantial changes, please fork this repo and submit a pull request for review.

See CONTRIBUTING.md for additional details.

You can find official releases here.

cerebra's People

Contributors

lincoln-harris, olgabot, rvanheusden


cerebra's Issues

output json instead of csv

JSON might be better suited here because it's a better way to represent hierarchical data, such as
cell -> gene -> amino acid mutation relationships

it's also nice for very sparse matrices like ours

trouble adding to bioconda

linking bioconda/bioconda-recipes#23260

tried to add a recipe to bioconda.
for now it's failing due to the hgvs dependency....there is no bioconda or conda-forge recipe for hgvs

until this exists we won't be able to add to bioconda, unfortunately

all I could find in the way of conda builds was this https://anaconda.org/aroth85/hgvs

not sure it's possible to add independent channels to meta.yaml, and even if it was, it doesn't seem like a good idea to add in an outdated third-party recipe

germline-filter run instructions

Hi,

I am trying to run the germline-filter tool but am unclear as to what the contents of the metadata CSV file should be.

The message from the tool's help page is somewhat vague on the requirements for each argument:

Usage: cerebra germline-filter [OPTIONS]

  filter out common SNPs/indels between germline samples and samples of
  interest

Options:
  --processes INTEGER
  --germline TEXT      [required]
  --cells TEXT         [required]
  --metadata TEXT      [required]
  --outdir TEXT        [required]
  -h, --help           Show this message and exit.

Please could the run instructions be updated to describe how to run the germline-filter tool with all the required files.

An example dataset would also be useful to demonstrate the required contents of the metadata csv.

Also, just to clarify: am I right in assuming the --germline and --cells arguments should point to the VCF files for each?

Best wishes,
Roshan

not picking up intronic variants

in mutations_table.py, if a vcf record reports a hit to an intronic region (we think), it will NOT be included in the final table

  • this is because while the COSMIC db includes hits to introns, the gtf does not contain entries for intron bounds, so the COSMIC hit won't match anything in the gtf
  • the solution is probably just to skip matching back to the gtf when doing the COSMIC filter; i.e. just grab the gene name from the COSMIC database line and don't do the gene matching back to the gtf (see the sketch below)
  • we only really need the gtf for the un-filtered case
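
A rough sketch of that proposed fix, assuming COSMIC records are handled as dataframe rows ("Gene name" is the column header in COSMIC's mutant export; adjust to whatever mutations_table.py actually uses):

def gene_for_cosmic_hit(cosmic_row):
    # Take the gene name straight from the matched COSMIC record rather
    # than re-querying the GTF, so intronic hits (absent from the GTF's
    # exon/CDS bounds) are not silently dropped.
    return cosmic_row["Gene name"]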

make germline-filter module more robust

right now it only works if you have a metadata file with a very specific format

cell_id,patient_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2

would be nice if we could loosen up this requirement a bit. for example, what if you had germline vcfs for some experimental samples but not all?

Requirements and CI

It looks like conda_requirements.txt is a subset of requirements.txt and conda_requirements.txt is only used by the CI testing.

Why is this file included and why does the CI testing use conda at all? As a user who is looking to install the software, this is quite confusing.

requirements.txt also includes test-only requirements (coverage, flake8, etc.). These should be excluded from the requirements file and standard installation.

openjournals/joss-reviews#2432

hgvs throwing stdout

cryptic error message in cerebra find-peptide-variants:

<class 'hgvs.edit.AASub'> <class 'hgvs.edit.AARefAlt'>

would like to find out where this (error?) message is coming from

Python 3.8 support

The software is not tested against Python 3.8, despite this being the current major release.

I suggest the authors add '3.8' '3.8-dev' and '3.9-dev' to the CI testing, with allowed failures for the dev versions. It also wouldn't hurt to test against (and allow failures for) 'pypy3'. Here is a simple example of this from one of my own projects: https://github.com/CountESS-Project/fqfa/blob/master/.travis.yml
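
In Travis CI terms, that suggestion might look something like this (a sketch, not the project's actual configuration):

language: python
python:
  - "3.7"
  - "3.8"
  - "3.8-dev"
  - "3.9-dev"
  - "pypy3"
matrix:
  allow_failures:
    - python: "3.8-dev"
    - python: "3.9-dev"
    - python: "pypy3"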

Once 3.8 support is established, it should also be added to the setup.py classifiers section.

openjournals/joss-reviews#2432

Supporting different types of annotation GTFs

find_aa_mutations soft-requires GENCODE's gtf annotation, which for whatever reason has 5 comment lines at the top that pandas doesn't like.
It might be good to document somewhere that those lines have to be removed from the file, or to add handling so that comment lines are cut off (see the sketch below).
Right now there's a disparity: mutation_counts' annotation read_csv call is normal (hg38.gtf), while find_aa_mutations' has to use skiprows=5 (GENCODE .gtf).
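
One generic way to handle both cases would be to let pandas skip comment lines instead of hard-coding skiprows; a minimal sketch, with column naming and typing omitted:

import pandas as pd

# comment='#' drops GENCODE's leading '##' header lines and is a no-op
# for annotations that have no such lines (e.g. a plain hg38.gtf)
gtf = pd.read_csv("annotation.gtf", sep="\t", comment="#", header=None)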

JoSS Review: Clarify README.md

With regard to the ongoing review for your JoSS submission, I would like to suggest stating some sort of use cases and/or warnings in your README.md. In particular, one of the largest drawbacks is the case of large VCF files. You even state as much in your code:

def dataframe(filename, large=True):
    """ [...]
    Note: using large=False with large VCF files will be painfully slow.
    [...]
    """

The same cannot be said about your instructions/quickstart.

VCFs can be excruciatingly large sometimes (looking at you, 1KG/ENCODE). Many times, researchers want/need to summarize from this starting point. Therefore, suggesting something like filtering the VCF based on region/contig prior to running cerebra would be a thoughtful suggestion for gung-ho users, to preempt potentially long-running jobs.
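
For instance, such a pre-filtering step could use bcftools (a separate tool, not part of cerebra), assuming a bgzipped and tabix-indexed VCF; the region here is arbitrary:

# keep only records on chr7 before summarizing with cerebra
bcftools view -r chr7 input.vcf.gz -o input.chr7.vcf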

Installation process

Currently the cerebra installation process requires some system packages to be installed. These are all relatively standard requirements that many users will already have. It's good to see that users have the option of running this in Docker. Providing a Nix expression is another option.

However, the documentation should be improved with an explanation of why these system packages are needed. They seem to be requirements for cerebra's dependencies rather than the package itself. This transparency is especially important if the dependencies-of-dependencies change in the future.

There are no installation instructions for Windows users, although there is a large population of bioinformatics specialists who use Windows as their primary environment and might benefit from this software. I'm guessing this is because the package won't install on Windows due to the pysam dependency, but it should be explicitly stated.

The installation instructions for the package itself are not ideal. The primary method of installation (especially for research software with pinned requirements) should be a virtual environment. The primary installation method suggested is a system-level installation.

There are instructions for installing in a virtual environment via conda, but there is no reason I could see to require conda rather than using the standard library's venv. The virtual environment instructions also install the package in editable mode, which is ideal for development but not what most users would be expecting.

The installation section could be split into a "for users" section and a "for developers" section, where the latter describes an editable installation with test requirements and instructions for running unit tests locally.

openjournals/joss-reviews#2432

improve testing speed

the biggest thing to do here is going to be removing that massive hg38.fa chr7/chr12 file.

either reworking the tests to get around using the .fa entirely or else figuring out how to further compress it

JoSS Review: Quality of writing

With regard to the "Quality of writing" section for the ongoing review for your JoSS submission (openjournals/joss-reviews#2432), these are my (very minor/nitpicky) remarks:

  • "wrangling" is somewhat parlance and may be a difficult word to discern context from for international audiences
  • I may be wrong here, but surnames should be written for both in-text and bibliographic citation. For groups of 6 or more, it is acceptable to use "et al" after the first author. This simplifies the citation so as to not break up the thought of the sentence any more than it already has.
  • You are already use LaTex to write the paper, so it may be simpler to write 10^8 as instead
  • (Question) What happens if a *.fai is already provided prior to find-peptide-variants? Will it construct a new one?
  • It might be helpful to state in the caption of Figure 3 that the first 10+ minutes appear to be the construction of the genome interval tree and can be considered overhead for the process, which runs approximately linearly thereafter (Ha! Just saw that you state the same thing just below)

improve germline filtering: handle multiple ALTs better

I was just thinking—right now, the germline filter will only match two records if the CHROM, POS, REF, and ALT are all the same. But, since VCFs will encode all mutations that happen at a particular POS as multiple ALTs, we could instead just check if one (not all) of the ALTs match, and then remove that ALT specifically.
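
A sketch of that idea, written against generic record objects rather than cerebra's actual data structures:

def filter_germline_alts(tumor_rec, germline_rec):
    """Drop only the ALT alleles shared with the matched germline record,
    instead of requiring the full ALT lists to be identical."""
    same_site = (tumor_rec.chrom == germline_rec.chrom
                 and tumor_rec.pos == germline_rec.pos
                 and tumor_rec.ref == germline_rec.ref)
    if not same_site:
        return list(tumor_rec.alts)
    germline_alts = set(germline_rec.alts)
    # keep only the tumor ALTs that were not seen in the germline sample
    return [alt for alt in tumor_rec.alts if alt not in germline_alts]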

Versioning question

This repo has only a single release, 1.0.0. However, setup.py sets the version as 1.0.9 and the PyPI version is 1.1.0. __init__.py sets the version as 0.1.0.

Additionally, the seemingly-neglected HISTORY.md has only a single entry for 0.1.0 (despite this tag not existing in the repo).

This is important to reconcile as part of openjournals/joss-reviews#2432

mapping ambiguity between vcf and gtf

some vcf mutations correspond to multiple records in the gtf file; i.e. a stop codon in one gene could be an exon in another. how do we assign these ambiguous vcf hits?

  • a relevant question is how often this happens. if it's 0.01% of cases, we can afford to just throw out these ambiguous vcf records

clarification on AD field of vcf files

want to make sure find_aa_mutations --report_coverage is pulling out the right thing.
the vcf documentation is not very clear on what information the AD (allelic depth) field actually contains

GenomePosition.from_str shouldn't assume 1-index

I've found that BEDtools seems to convert 1-based genome position strings into 0-based ones. Perhaps GenomePosition.from_str should offer an argument allowing one to indicate if the string is 0-based or 1-based.
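
Something along these lines might work (a hypothetical signature; cerebra's actual GenomePosition class may differ):

class GenomePosition:
    def __init__(self, chrom, start, end):
        # stored internally as 0-based, half-open, per the BED convention
        self.chrom, self.start, self.end = chrom, start, end

    @classmethod
    def from_str(cls, s, one_based=True):
        """Parse 'chr1:100-200', treating the coordinates as either
        1-based (VCF/GTF style) or 0-based (BED style)."""
        chrom, span = s.split(":")
        start, end = (int(x) for x in span.split("-"))
        if one_based:
            start -= 1  # shift the start down to 0-based
        return cls(chrom, start, end)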

multiprocessing race condition

was getting a Biopython Seq error when running through some new vcf files:

TranslationError: Codon '>' is invalid

but it seems to have resolved itself somehow

vcf '20' field is sometimes malformed

  • seeing 0/0:46:46 instead of 1/1:0,2:2:6:84,6,0
  • relevant when trying to extract AD, in the check_coverage_whole_gene module
  • not sure what this edge case means...
  • ignoring it for now, with a try/except block
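
That defensive parsing might look something like the following (the FORMAT/sample column semantics are per the VCF spec; the helper itself is illustrative):

def extract_ad(format_field, sample_field):
    """Pull the AD (allelic depth) values out of a sample column,
    returning None when the record lacks a parseable AD entry."""
    keys = format_field.split(":")   # e.g. 'GT:AD:DP:GQ:PL'
    vals = sample_field.split(":")   # e.g. '1/1:0,2:2:6:84,6,0'
    try:
        ad = vals[keys.index("AD")]
        return [int(x) for x in ad.split(",")]
    except (ValueError, IndexError):
        # AD missing from FORMAT, column truncated, or value non-numeric
        return None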

find-aa-mutations not working, post-merge with Rohan's branch

exiting gracefully, but not actually finding variants. I tested on remade artificial vcfs as well as real vcfs; the same behavior is seen for both. it returns an empty matrix with only the cell names. this was after merging with rvanheusen/aa-mutations. I have also checked out raw versions of rvanheusen/aa-mutations and rvanheusen/mutation-counts-cleanup, and the same behavior is observed.

perhaps give non-tumor cells better names for germline filtering

Right now, my revision of germline filter differentiates tumor from non-tumor bulk-seq VCFs based on the non-tumor cells having three parts (X_Y_Z.vcf). It would probably be better to have a more explicit method for differentiating between the bulk-seq VCFs.

Review comments on CONTRIBUTING.md

The CONTRIBUTING.md document includes most of the required material, but there are a few issues.

The links in the document lead to the site home pages rather than the specified pages due to misplaced >. The document would benefit from including additional links, such as a link to the relevant GitHub documentation in the first "Get Started!" list item.

This document has very different installation instructions from the README. They should be harmonized, and it will probably be simpler to maintain if CONTRIBUTING refers the user to the README, with notes about changes required to create a dev environment. It's quite confusing to have one document require conda and another require virtualenvwrapper. Ideally both should just use the standard library's venv.

This document specifies tox but the main project seems to use pytest. The Tips section uses unittest. The project should pick one testing framework and stick to it throughout.

openjournals/joss-reviews#2432

how does utils.get_best_overlap deal with genome positions that have multiple hits

  • this seems to be most of them...the cosmic database has TONS of repeated entries
  • specifically the problem is with SNP calling in get_aa_mutations:
    - I want to be able to match CDSs against entries in the cosmic db in order to determine the AA substitution, but what happens if there are multiple cosmic entries, with multiple CDSs?
    - utils.get_best_overlap is probably just returning the first matching entry, so the CDSs do not necessarily match
