
cerebra's Introduction

cerebra


What is cerebra?

This tool allows you to quickly summarize and interpret VCF files.

If you're interested in learning the full complement of genomic variants present in your DNA/RNA samples, a typical workflow might involve sequencing, followed by variant calling with a tool such as GATK HaplotypeCaller, which generates variant calling format (VCF) files.

A VCF file looks like this:

##source=HaplotypeCaller
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 631391 . C T 72.28 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=25.36;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0

Note that only a single VCF record is displayed here. A sequencing run can generate on the order of 10^8 unique VCF records, and only a small portion of these are likely to be relevant to the researcher. Thus, drawing conclusions from VCF files remains a substantial challenge.

cerebra provides a fast and intuitive framework for summarizing VCF records across samples. It comprises three modules that do the following:

    1) remove germline variants from samples of interest        
    2) count the total number of variants in a given sample, on a per-gene basis           
    3) report protein variants for each sample                 

cerebra gets its name from the eponymous X-Men character, who had the ability to detect mutant individuals among the general public.

If you're working with tumor data, it might be a good idea to limit the variant search space to known, and potentially actionable, cancer variants. To that end, cerebra implements an optional method for restricting the search to variants also found in the COSMIC database.

This tool was developed for, but is certainly not limited to, single-cell RNA sequencing data. cerebra is free software available under the MIT license.

What makes cerebra different from traditional VCF parsers?

Python libraries exist (e.g. PyVCF and vcfpy) for extracting information from VCF files. Another option is GATK VariantsToTable, which produces a file that looks like this:

CHROM    POS ID      QUAL    AC
 1        10  .       50      1
 1        20  rs10    99      10

This table contains only genomic (i.e. DNA-level) coordinates. Often the next questions are: what are the genes associated with these variants, and what are the peptide-level effects of these variants? cerebra queries a reference genome (.fa) and annotation (.gtf) to match each DNA-level variant with its associated gene, and its probable protein variant. cerebra produces the following outfile:

$ python
>>> import json
>>> f = open('/path/to/cerebra/output.json')
>>> data = json.load(f)
>>> print(json.dumps(data, indent=4))

{
    "CCN1": {
        "A16_B000563": [],
        "A1_B001546": [],
        "A1_B002531": [
            "ENSP00000398736.2:p.(Glu189=)"
        ],
        "A1_B002570": [],
        "A2_B002558": [],
        "A3_B000561": [
            "ENSP00000398736.2:p.(Arg209Trp)",
            "ENSP00000398736.2:p.(Ile90Val)"
        ],
        "A3_B000568": [],
        "A3_B001544": [
            "ENSP00000398736.2:p.(Ala82Thr)"
        ],
        "A3_B002090": [],
        "A3_B002531": [
            "ENSP00000398736.2:p.(Pro217Ser)"
        ]
    }
}

Here CCN1 is a gene name while A16_B000563, A1_B001546, A1_B002531,... are RNA-seq sample IDs. cerebra reports variants for every gene in the genome, for every sample in a given experiment. The ENSP* numbers are Ensembl translation IDs -- unique identifiers that correspond to exactly one polypeptide in the Ensembl database. The strings enclosed in parentheses are the amino-acid-level variants reported in that particular sample. For example, the string Arg209Trp indicates that position 209 of this particular polypeptide should contain an Arg, but the experimental sample instead contains a Trp. cerebra adheres to HGVS sequence variant nomenclature in reporting amino-acid variants.
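
If you want to work with these HGVS strings programmatically, a regular expression is often enough. A minimal sketch follows; the pattern and helper name are illustrative, not part of cerebra's API:

import re

# Illustrative pattern for substitution strings like
# "ENSP00000398736.2:p.(Arg209Trp)" and synonymous variants
# like "ENSP00000398736.2:p.(Glu189=)"
HGVS_RE = re.compile(
    r"(?P<protein>ENSP\d+\.\d+):p\.\("
    r"(?P<ref>[A-Za-z]{3})(?P<pos>\d+)(?P<alt>[A-Za-z]{3}|=)\)"
)

def parse_protein_variant(s):
    """Split an HGVS protein-variant string into its parts, or return None."""
    m = HGVS_RE.match(s)
    return m.groupdict() if m else None

print(parse_protein_variant("ENSP00000398736.2:p.(Arg209Trp)"))
# {'protein': 'ENSP00000398736.2', 'ref': 'Arg', 'pos': '209', 'alt': 'Trp'}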

Features

germline-filter

Variant calling is often applied to the study of cancer. If the research project is centered around a “tumor vs. normal” question, then germline-filter is the proper starting point. This module removes germline variants that are common between tumor and normal samples, and thus excludes variants unlikely to be pathogenic for the cancer under study. The user provides a very simple metadata file (see USAGE.md) that indicates which tumor samples correspond to which normal samples. The output of germline-filter is a set of trimmed-down VCF files, which will be used for the next two steps. If you do not have access to “normal” samples then proceed directly to count-variants or find-peptide-variants.
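
The metadata file is a two-column CSV that maps each sample of interest to its matched germline sample (see USAGE.md for the authoritative description). A minimal example:

cell_id,patient_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2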

count-variants

This module reports the raw variant counts for every gene across every sample. The output is a CSV file that contains counts for each sample versus every gene in the genome. If working with cancer variants, the user has the option of limiting the search space to variants also found in the COSMIC database.
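
Illustratively, that counts matrix might look like the following (the counts here are invented, and the actual row/column orientation may differ):

gene,A1_B001546,A3_B000561,A3_B001544
CCN1,0,2,1
EGFR,3,0,0
KRAS,1,1,4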

find-peptide-variants

This module reports the protein variants associated with each genomic variant. VCF records are converted to peptide-level variants, and then to Ensembl protein IDs, in accordance with the HGVS sequence variant nomenclature. The output is a hierarchically ordered file (CSV or JSON) that reports the Ensembl protein ID and the gene associated with each variant, for each experimental sample. The user again has the option of limiting the search space to variants also found in the COSMIC database. This module also has an option to report the number of variant vs. wildtype reads found at each locus.

Dependencies

cerebra depends on some (fairly standard) packages and libraries. Before installing, it might be a good idea to make sure all of the requisite packages are installed on your system (note: if installing with Docker you can skip this step).

MacOS Dependencies:

sudo pip install setuptools
brew update
brew install openssl
brew install zlib

Linux Dependencies:

sudo apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev

At present, cerebra is not installable on Windows. cerebra depends on the pysam library (or rather, pysam is a dependency-of-a-dependency), and this library is currently only available on Unix-like systems. Windows solutions like WSL exist for overcoming precisely this challenge; however, cerebra has not been tested on WSL or any other Unix-like subsystem for Windows.

Installation (for users)

There are four different methods available to install cerebra. Choose one of the following:

With Docker (recommended)

docker pull lincolnharris/cerebra
docker run -it lincolnharris/cerebra

warning: this image will take up ~1 GB of space.

With the python standard library venv module

python3 -m venv cerebra
source cerebra/bin/activate
pip3 install cerebra 

With conda

conda create -n cerebra python=3.7
conda activate cerebra
pip3 install cerebra

From PyPI (system-wide installation, NOT RECOMMENDED)

For novice users, it's generally a better idea to install packages within virtual environments. However, cerebra can be installed system-wide if you so choose.

pip3 install cerebra

# OR, if you don't have root privileges
pip3 install --user cerebra

Installation (for developers)

Here's how to set up cerebra for local development. After installing the requisite dependencies:

  1. Fork the cerebra repo on GitHub: https://github.com/czbiohub/cerebra

  2. Clone your fork locally:

    $ git clone https://github.com/your-name/cerebra.git
    
  3. Install your local copy into a virtualenv. Using the standard library venv module:

    $ cd cerebra
    $ python3 -m venv cerebra-dev
    $ source cerebra-dev/bin/activate
    $ pip3 install -r requirements.txt -r test_requirements.txt -e .
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you're done making changes, check that your changes pass flake8 and pytest:

    $ make test
    $ make coverage
    $ make lint
    
  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website. See CONTRIBUTING.md for more.

(Quickstart) Usage

Running cerebra with no arguments should return usage information:

Usage: cerebra  <command>

  a tool for fast and accurate summarizing of variant calling format (VCF)
  files

Options:
  -h, --help  Show this message and exit.

Commands:
  germline-filter        filter out common SNPs/indels between tumor and normal samples
  count-variants         count total number of variants in each sample, and report on a per-gene basis
  find-peptide-variants  report peptide-level SNPs and indels in each sample, and associated coverage

Note that the -h flag will display usage information for each of the three commands.

An example workflow might look like this:

Step 1:

cerebra germline-filter --processes 2 --normal_path /path/to/normal/vcfs --tumor_path /path/to/tumor/vcfs --metadata /path/to/metadata/file --outdir /path/to/filtered/vcfs

Step 2:

cerebra count-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --refgenome /path/to/genome/annotation --outfile /path/to/output/file /path/to/filtered/vcfs/*

Step 3:

cerebra find-peptide-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --annotation /path/to/genome/annotation --genomefa /path/to/genome/fasta --report_coverage 1 --output /path/to/output/file /path/to/filtered/vcfs/*

For advanced usage information, see USAGE.md.

Authors

This work was produced by Lincoln Harris, Rohan Vanheusden, Olga Botvinnik and Spyros Darmanis of the Chan Zuckerberg Biohub. For questions please contact [email protected]

Contributing

We welcome any bug reports, feature requests, or other contributions. Please submit a well-documented report on our issue tracker. For substantial changes, please fork this repo and submit a pull request for review.

See CONTRIBUTING.md for additional details.

You can find official releases here.

cerebra's People

Contributors

lincoln-harris, olgabot, rvanheusden


cerebra's Issues

output json instead of csv

JSON might be better suited here because it's a better way to represent hierarchical data, such as
cell -> gene -> amino acid mutation relationships

it's also nice for very sparse matrices like ours

trouble adding to bioconda

linking bioconda/bioconda-recipes#23260

tried to add a recipe to bioconda.
for now it's failing due to the hgvs dependency....there is no bioconda or conda-forge recipe for hgvs

until this exists we won't be able to add to bioconda, unfortunately

all I could find in the way of conda builds was this https://anaconda.org/aroth85/hgvs

not sure it's possible to add independent channels to meta.yaml, and even if it was, it doesn't seem like a good idea to add in an outdated third-party recipe

germline-filter run instructions

Hi,

I am trying to run the germline-filter tool but am unclear as to what the contents of the metadata CSV file should be.

The message from the tool's help page is somewhat vague on the requirements for each argument:

Usage: cerebra germline-filter [OPTIONS]

  filter out common SNPs/indels between germline samples and samples of
  interest

Options:
  --processes INTEGER
  --germline TEXT      [required]
  --cells TEXT         [required]
  --metadata TEXT      [required]
  --outdir TEXT        [required]
  -h, --help           Show this message and exit.

Please could the run instructions be updated to describe how to run the germline-filter tool with all the required files.

An example dataset would also be useful to demonstrate the required contents of the metadata csv.

Also, just to clarify: am I right in assuming the --germline and --cells arguments should point to the VCF files for each?

Best wishes,
Roshan

not picking up intronic variants

in mutations_table.py, if a vcf record reports a hit to an intronic region (we think), it will NOT be included in the final table

  • this is because while the COSMIC db includes hits to introns, the gtf does not contain entries for intron bounds, so the COSMIC hit won't match anything in the gtf
  • the solution is probably just to skip matching back to the gtf when doing the COSMIC filter; i.e. just grab the gene name from the COSMIC database line and don't do the gene matching back to the gtf (see the sketch below)
  • we only really need the gtf for the un-filtered case
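
A rough sketch of that proposed fix, assuming COSMIC records are handled as dataframe rows ("Gene name" is the column header in COSMIC's mutant export; adjust to whatever mutations_table.py actually uses):

def gene_for_cosmic_hit(cosmic_row):
    # Take the gene name straight from the matched COSMIC record rather
    # than re-querying the GTF, so intronic hits (absent from the GTF's
    # exon/CDS bounds) are not silently dropped.
    return cosmic_row["Gene name"]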

make germline-filter module more robust

right now it only works if you have a metadata file with a very specific format

cell_id,patient_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2

would be nice if we could loosen up this requirement a bit. for example, what if you had germline vcfs for some experimental samples but not all?

Requirements and CI

It looks like conda_requirements.txt is a subset of requirements.txt and conda_requirements.txt is only used by the CI testing.

Why is this file included and why does the CI testing use conda at all? As a user who is looking to install the software, this is quite confusing.

requirements.txt also includes test-only requirements (coverage, flake8, etc.). These should be excluded from the requirements file and standard installation.

openjournals/joss-reviews#2432

hgvs throwing stdout

cryptic error message in cerebra find-peptide-variants:

<class 'hgvs.edit.AASub'> <class 'hgvs.edit.AARefAlt'>

would like to find out where this (error?) message is coming from

Python 3.8 support

The software is not tested against Python 3.8, despite this being the current major release.

I suggest the authors add '3.8' '3.8-dev' and '3.9-dev' to the CI testing, with allowed failures for the dev versions. It also wouldn't hurt to test against (and allow failures for) 'pypy3'. Here is a simple example of this from one of my own projects: https://github.com/CountESS-Project/fqfa/blob/master/.travis.yml
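
In Travis CI terms, that suggestion might look something like this (a sketch, not the project's actual configuration):

language: python
python:
  - "3.7"
  - "3.8"
  - "3.8-dev"
  - "3.9-dev"
  - "pypy3"
matrix:
  allow_failures:
    - python: "3.8-dev"
    - python: "3.9-dev"
    - python: "pypy3"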

Once 3.8 support is established, it should also be added to the setup.py classifiers section.

openjournals/joss-reviews#2432

Supporting different types of annotation GTFs

find_aa_mutations soft-requires GENCODE's gtf annotation, which for whatever reason has 5 comment lines at the top that pandas doesn't like.
It might be good to document somewhere that those lines have to be removed from the file, or to add handling so that comment lines are cut off (see the sketch below).
Right now there's a disparity: mutation_counts' annotation read_csv call is normal (hg38.gtf), while find_aa_mutations' has to use skiprows=5 (GENCODE .gtf).
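
One generic way to handle both cases would be to let pandas skip comment lines instead of hard-coding skiprows; a minimal sketch, with column naming and typing omitted:

import pandas as pd

# comment='#' drops GENCODE's leading '##' header lines and is a no-op
# for annotations that have no such lines (e.g. a plain hg38.gtf)
gtf = pd.read_csv("annotation.gtf", sep="\t", comment="#", header=None)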

JoSS Review: Clarify README.md

With regard to the ongoing review for your JoSS submission, I would like to suggest stating some sort of use cases and/or warnings in your README.md. In particular, one of the largest drawbacks is the case of large VCF files. You even state as much in your code:

def dataframe(filename, large=True):
    """ [...]
    Note: using large=False with large VCF files will be painfully slow.
    [...]
    """

The same cannot be said about your instructions/quickstart.

VCFs can be excruciatingly large sometimes (looking at you, 1KG/ENCODE). Many times, researchers want/need to summarize from this starting point. Therefore, suggesting something like filtering the VCF based on region/contig prior to running cerebra would be a thoughtful suggestion for gung-ho users, to preempt potentially long-running jobs.
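
For instance, such a pre-filtering step could use bcftools (a separate tool, not part of cerebra), assuming a bgzipped and tabix-indexed VCF; the region here is arbitrary:

# keep only records on chr7 before summarizing with cerebra
bcftools view -r chr7 input.vcf.gz -o input.chr7.vcf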

Installation process

Currently the cerebra installation process requires some system packages to be installed. These are all relatively standard requirements that many users will already have. It's good to see that users have the option of running this in Docker. Providing a Nix expression is another option.

However, the documentation should be improved with an explanation of why these system packages are needed. They seem to be requirements for cerebra's dependencies rather than the package itself. This transparency is especially important if the dependencies-of-dependencies change in the future.

There are no installation instructions for Windows users, although there is a large population of bioinformatics specialists who use Windows as their primary environment and might benefit from this software. I'm guessing this is because the package won't install on Windows due to the pysam dependency, but it should be explicitly stated.

The installation instructions for the package itself are not ideal. The primary method of installation (especially for research software with pinned requirements) should be a virtual environment. The primary installation method suggested is a system-level installation.

There are instructions for installing in a virtual environment via conda, but there is no reason I could see to require conda rather than using the standard library's venv. The virtual environment instructions also install the package in editable mode, which is ideal for development but not what most users would be expecting.

The installation section could be split into a "for users" section and a "for developers" section, where the latter describes an editable installation with test requirements and instructions for running unit tests locally.

openjournals/joss-reviews#2432

improve testing speed

the biggest thing to do here is going to be removing that massive hg38.fa chr7/chr12 file.

either reworking the tests to get around using the .fa entirely or else figuring out how to further compress it

JoSS Review: Quality of writing

With regard to the "Quality of writing" section for the ongoing review for your JoSS submission (openjournals/joss-reviews#2432), these are my (very minor/nitpicky) remarks:

  • "wrangling" is somewhat parlance and may be a difficult word to discern context from for international audiences
  • I may be wrong here, but surnames should be written for both in-text and bibliographic citation. For groups of 6 or more, it is acceptable to use "et al" after the first author. This simplifies the citation so as to not break up the thought of the sentence any more than it already has.
  • You are already use LaTex to write the paper, so it may be simpler to write 10^8 as instead
  • (Question) What happens if a *.fai is already provided prior to find-peptide-variants? Will it construct a new one?
  • It might be helpful to state in the caption of Figure 3 that the first 10+ minutes appear to be the construction of the genome interval tree and can be considered overhead for the process, which runs approximately linearly thereafter (Ha! Just saw that you state the same thing just below)

improve germline filtering: handle multiple ALTs better

I was just thinking—right now, the germline filter will only match two records if the CHROM, POS, REF, and ALT are all the same. But, since VCFs will encode all mutations that happen at a particular POS as multiple ALTs, we could instead just check if one (not all) of the ALTs match, and then remove that ALT specifically.
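
A sketch of that idea, written against generic record objects rather than cerebra's actual data structures:

def filter_germline_alts(tumor_rec, germline_rec):
    """Drop only the ALT alleles shared with the matched germline record,
    instead of requiring the full ALT lists to be identical."""
    same_site = (tumor_rec.chrom == germline_rec.chrom
                 and tumor_rec.pos == germline_rec.pos
                 and tumor_rec.ref == germline_rec.ref)
    if not same_site:
        return list(tumor_rec.alts)
    germline_alts = set(germline_rec.alts)
    # keep only the tumor ALTs that were not seen in the germline sample
    return [alt for alt in tumor_rec.alts if alt not in germline_alts]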

Versioning question

This repo has only a single release, 1.0.0. However, setup.py sets the version as 1.0.9 and the PyPI version is 1.1.0. __init__.py sets the version as 0.1.0.

Additionally, the seemingly-neglected HISTORY.md has only a single entry for 0.1.0 (despite this tag not existing in the repo).

This is important to reconcile as part of openjournals/joss-reviews#2432

mapping ambiguity between vcf and gtf

some vcf mutations correspond to multiple records in the gtf file; i.e. a stop codon in one gene could be an exon in another. how do we assign these ambiguous vcf hits?

  • a relevant question is how often this happens. if it's 0.01% of cases, we can afford to just throw out these ambiguous vcf records

clarification on AD field of vcf files

want to make sure find_aa_mutations --report_coverage is pulling out the right thing.
the vcf documentation is not very clear on what information the AD (allelic depth) field actually contains

GenomePosition.from_str shouldn't assume 1-index

I've found that BEDtools seems to convert 1-based genome position strings into 0-based ones. Perhaps GenomePosition.from_str should offer an argument allowing one to indicate if the string is 0-based or 1-based.
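
Something along these lines might work (a hypothetical signature; cerebra's actual GenomePosition class may differ):

class GenomePosition:
    def __init__(self, chrom, start, end):
        # stored internally as 0-based, half-open, per the BED convention
        self.chrom, self.start, self.end = chrom, start, end

    @classmethod
    def from_str(cls, s, one_based=True):
        """Parse 'chr1:100-200', treating the coordinates as either
        1-based (VCF/GTF style) or 0-based (BED style)."""
        chrom, span = s.split(":")
        start, end = (int(x) for x in span.split("-"))
        if one_based:
            start -= 1  # shift the start down to 0-based
        return cls(chrom, start, end)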

multiprocessing race condition

was getting a Biopython Seq error when running through some new vcf files:

TranslationError: Codon '>' is invalid

but it seems to have resolved itself somehow

vcf '20' field is sometimes malformed

  • seeing 0/0:46:46 instead of 1/1:0,2:2:6:84,6,0
  • relevant when trying to extract AD, in the check_coverage_whole_gene module
  • not sure what this edge case means...
  • ignoring it for now, with a try/except block
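
That defensive parsing might look something like the following (the FORMAT/sample column semantics are per the VCF spec; the helper itself is illustrative):

def extract_ad(format_field, sample_field):
    """Pull the AD (allelic depth) values out of a sample column,
    returning None when the record lacks a parseable AD entry."""
    keys = format_field.split(":")   # e.g. 'GT:AD:DP:GQ:PL'
    vals = sample_field.split(":")   # e.g. '1/1:0,2:2:6:84,6,0'
    try:
        ad = vals[keys.index("AD")]
        return [int(x) for x in ad.split(",")]
    except (ValueError, IndexError):
        # AD missing from FORMAT, column truncated, or value non-numeric
        return None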

find-aa-mutations not working, post-merge with Rohan's branch

exiting gracefully, but not actually finding variants. I tested on remade artificial vcfs as well as real vcfs; the same behavior is seen for both. it returns an empty matrix with only the cell names. this was after merging with rvanheusen/aa-mutations. I have also checked out raw versions of rvanheusen/aa-mutations and rvanheusen/mutation-counts-cleanup, and the same behavior is observed.

perhaps give non-tumor cells better names for germline filtering

Right now, my revision of germline filter differentiates tumor from non-tumor bulk-seq VCFs based on the non-tumor cells having three parts (X_Y_Z.vcf). It would probably be better to have a more explicit method for differentiating between the bulk-seq VCFs.

Review comments on CONTRIBUTING.md

The CONTRIBUTING.md document includes most of the required material, but there are a few issues.

The links in the document lead to the site home pages rather than the specified pages due to misplaced >. The document would benefit from including additional links, such as a link to the relevant GitHub documentation in the first "Get Started!" list item.

This document has very different installation instructions from the README. They should be harmonized, and it will probably be simpler to maintain if CONTRIBUTING refers the user to the README, with notes about changes required to create a dev environment. It's quite confusing to have one document require conda and another require virtualenvwrapper. Ideally both should just use the standard library's venv.

This document specifies tox but the main project seems to use pytest. The Tips section uses unittest. The project should pick one testing framework and stick to it throughout.

openjournals/joss-reviews#2432

how does utils.get_best_overlap deal with genome positions that have multiple hits

  • this seems to be most of them...the cosmic database has TONS of repeated entries
  • specifically the problem is with SNP calling in get_aa_mutations:
    - I want to be able to match CDSs against entries in the cosmic db in order to determine the AA substitution, but what happens if there are multiple cosmic entries, with multiple CDSs?
    - utils.get_best_overlap is probably just returning the first matching entry, so the CDSs do not necessarily match
