milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.

License: Other

Java 21.49% JavaScript 0.13% Shell 8.46% Kotlin 68.21% Python 1.72%

bioinformatics immunology antibody t-cell-receptor t-cell rep-seq 10x rna-seq sequencing single-cell

mixcr's Introduction

MiXCR is a universal software for fast and accurate analysis of raw T- or B- cell receptor repertoire sequencing data. It works with any kind of sequencing data:

Bulk repertoire sequencing data with or without UMIs
Single cell sequencing data including but not limited to 10x Genomics protocols
RNA-Seq or any other kind of fragmented/shotgun data which may contain just a tiny fraction of target sequences
and any other kind of sequencing data containing TCRs or BCRs

Powerful downstream analysis tools allow you to obtain vector plots and tabular results for multiple measures. Key features include:

Ability to group samples by metadata values and compare repertoire features between groups
Comprehensive repertoire normalization and filtering
Statistical significance tests with proper p-value adjustment
Repertoire overlap analysis
Vector plots output (.svg / .pdf)
Tabular outputs

Other key features:

Clonotype assembly by arbitrary gene feature, including full-length variable region
PCR / Sequencing error correction with or without aid of UMI or Cell barcodes
Robust and dedicated aligner algorithms for maximum extraction with zero false-positive rate
Supports any custom barcode sequences architecture (UMI / Cell)
Supported species: human, mouse, rat, spalax, alpaca, monkey
Supports IMGT reference
Barcodes error-correction
Adapter trimming
Optional CDR3 reconstruction by assembling overlapping fragmented sequencing reads into complete CDR3-containing contigs when the read position is floating (e.g. shotgun-sequencing, RNA-Seq etc.)
Optional contig assembly to build longest possible TCR/IG sequence from available data (with or without aid of UMI or Cell barcodes)
Comprehensive quality control reports provided at all the steps of the pipeline
Regions not covered by the data may be imputed from germline
Exhaustive output information for clonotypes and alignments:
- nucleotide and amino acid sequences of all immunologically relevant regions (FR1, CDR1, ..., CDR3, etc..)
- identified V, D, J, C genes
- comprehensive information on nucleotide and amino acid mutations
- positions of all immunologically relevant points in output sequences
- and many more informative columns
Ability to backtrack fate of each raw sequencing read through the whole pipeline

See full documentation at https://docs.milaboratories.com.

Who uses MiXCR

MiXCR is used by 8 out of 10 world leading pharmaceutical companies in the R&D for:

Vaccine development
Antibody discovery
Cancer immunotherapy research

Widely adopted by the academic community, with 1000+ citations in peer-reviewed scientific publications.

Installation / Download

Using Homebrew on Mac OS X or Linux (linuxbrew)

brew install milaboratory/all/mixcr

To upgrade an already installed MiXCR to the newest version:

brew update
brew upgrade mixcr

Conda

We maintain an Anaconda repository to simplify installation of MiXCR with the conda package manager. To install the latest stable MiXCR build with conda, run:

conda install -c milaboratories mixcr

To install a specific version, run:

conda install -c milaboratories mixcr=3.0.12

The mixcr package specifies openjdk as a dependency. If you already have Java installed on your system, you may want to prevent conda from installing another copy of the JDK; to do that, use the --no-deps flag:

conda install -c milaboratories mixcr --no-deps

Docker

The official MiXCR Docker repository is hosted on GitHub along with this repo.

Example:

docker run --rm \
    -e MI_LICENSE="...license-token..." \
    -v /path/to/raw/data:/raw:ro \
    -v /path/to/put/results:/work \
    ghcr.io/milaboratory/mixcr/mixcr:latest \
    align -s hs /raw/data_R1.fastq.gz /raw/data_R2.fastq.gz alignments.vdjca

Setting the license

There are several ways to pass the license to mixcr when it is executed inside a container:

Using environment variable:

docker run \
    -e MI_LICENSE="...license-token..." \
    ....

Using license file:

docker run \
    -v /path/to/mi.license:/opt/mixcr/mi.license:ro \
    ....

If it is hard to mount the mi.license file into the already populated folder /opt/mixcr/ (e.g. in Kubernetes or with other container orchestration tools), you can tell MiXCR where to look for it:
```
docker run \
    -v /path/to/folder_with_mi_license:/secrets:ro \
    -e MI_LICENSE_FILE="/secrets/milicense.txt" \
    ....
```

Migration from the previous docker images

New docker images define the mixcr startup script as the entrypoint of the image, whereas the previous docker repo used bash instead. So, what was previously executed this way:

docker run ... old/mixcr/image/address:with_tag mixcr align ...

now will be

docker run ... new/mixcr/image/address:with_tag align ...

For those who rely on other tools inside the image: beware that the new build relies on a different base image and has a slightly different layout.

The mixcr startup script is added to the PATH environment variable, so even if you specify a custom entrypoint, there is no need to use the full path to run mixcr.

License notice for IMGT images

Images with the IMGT reference library contain data imported from IMGT and are subject to the terms of use listed on the http://www.imgt.org site.

Data coming from IMGT server may be used for academic research only, provided that it is referred to IMGT®, and cited as "IMGT®, the international ImMunoGeneTics information system® http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France)."

References to cite: Lefranc, M.-P. et al., Nucleic Acids Res., 27, 209-212 (1999) Cover of NAR; Ruiz, M. et al., Nucleic Acids Res., 28, 219-221 (2000); Lefranc, M.-P., Nucleic Acids Res., 29, 207-209 (2001); Lefranc, M.-P., Nucleic Acids Res., 31, 307-310 (2003); Lefranc, M.-P. et al., In Silico Biol., 5, 0006 (2004) [Epub], 5, 45-60 (2005); Lefranc, M.-P. et al., Nucleic Acids Res., 33, D593-D597 (2005); Lefranc, M.-P. et al., Nucleic Acids Res., 37, D1006-D1012 (2009), doi:10.1093/nar/gkn838.

Manual install (any OS)

Download the latest stable MiXCR build from the release page
Unzip the archive
Add the resulting folder to your PATH variable
- or add a symbolic link for the mixcr script to your bin folder
- or use MiXCR directly by specifying the full path to the executable script

Requirements

Any OS with Java support (Linux, Windows, Mac OS X, etc..)
Java 1.8 or higher

Obtaining a license

To run MiXCR one needs a license file. MiXCR is free for academic users with no commercial funding. We are committed to supporting the academic community and provide our software free of charge for scientists doing non-profit research.

Academic users can quickly get a license at https://licensing.milaboratories.com.

A commercial trial license may be requested at https://licensing.milaboratories.com or by email to [email protected].

To activate the license do one of the following:

put mi.license at one of the following locations:
- ~/.mi.license
- ~/mi.license
- the directory with the mixcr.jar file
- the directory with the MiXCR executable
- any other location, with its path specified in the MI_LICENSE_FILE environment variable
put the mi.license content into the MI_LICENSE environment variable
run mixcr activate-license and paste the mi.license content at the command prompt
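
Before running MiXCR it can be handy to check which of these sources is actually configured. The following is a minimal sketch, not MiXCR's own lookup logic: the function name `find_mi_license` is hypothetical and the probing order is an assumption; only the environment variable names and home-directory locations come from the list above. The mixcr.jar and executable directories are left out for brevity.

```shell
# Hypothetical helper: report where a MiXCR license would be found.
# The probing order below is an assumption made for this sketch.
find_mi_license() {
    if [ -n "${MI_LICENSE:-}" ]; then
        echo "env:MI_LICENSE"
    elif [ -n "${MI_LICENSE_FILE:-}" ] && [ -f "$MI_LICENSE_FILE" ]; then
        echo "file:$MI_LICENSE_FILE"
    elif [ -f "$HOME/.mi.license" ]; then
        echo "file:$HOME/.mi.license"
    elif [ -f "$HOME/mi.license" ]; then
        echo "file:$HOME/mi.license"
    else
        echo "none"
        return 1
    fi
}
```

Calling `find_mi_license` prints the first matching source, or `none` with a non-zero exit status if nothing is configured.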

Usage & documentation

See usage examples and detailed documentation at https://docs.milaboratories.com

If you haven't found the answer to your question in the docs, or have any suggestions concerning new features, feel free to create an issue here, on GitHub, or write an email to [email protected] .

License

Before downloading or accessing the software, please read carefully the License Agreement available at: https://github.com/milaboratory/mixcr/blob/develop/LICENSE

By downloading or accessing the software, you accept and agree to be bound by the terms of the License Agreement. If you do not want to agree to the terms of the Licensing Agreement, you must not download or access the software.

Cite

Dmitriy A. Bolotin, Stanislav Poslavsky, Igor Mitrophanov, Mikhail Shugay, Ilgar Z. Mamedov, Ekaterina V. Putintseva, and Dmitriy M. Chudakov. "MiXCR: software for comprehensive adaptive immunity profiling." Nature methods 12, no. 5 (2015): 380-381.

(Files referenced in this paper can be found here.)
Dmitriy A. Bolotin, Stanislav Poslavsky, Alexey N. Davydov, Felix E. Frenkel, Lorenzo Fanchi, Olga I. Zolotareva, Saskia Hemmers, Ekaterina V. Putintseva, Anna S. Obraztsova, Mikhail Shugay, Ravshan I. Ataullakhanov, Alexander Y. Rudensky, Ton N. Schumacher & Dmitriy M. Chudakov. "Antigen receptor repertoire profiling from RNA-seq data." Nature Biotechnology 35, 908–911 (2017)

mixcr's Issues

Implement infrastructure for simple loci library installation and usage

This issue is connected to #42

User story 1 (installation of loci library)

If I have a custom loci library and want to use it without specifying the full path, I can put it into one of the following locations:
- PATH_TO_MIXCR_SCRIPT/reference/ for system-wide installation
- ~/.mixcr/reference/ for user-local installation
- working directory . or ./reference
Symlinks in any of the above locations should be correctly dereferenced by MiXCR.
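
The lookup described in this user story could be sketched as follows. This is an illustration of the proposal, not existing MiXCR behaviour: the function name `resolve_loci_library`, the `MIXCR_DIR` variable (standing in for PATH_TO_MIXCR_SCRIPT), the `.ll` suffix rule and the search order (most specific location first) are all assumptions.

```shell
# Hypothetical resolver for a loci library given by name or by path.
# MIXCR_DIR stands in for PATH_TO_MIXCR_SCRIPT; the search order is assumed.
resolve_loci_library() {
    name=$1
    # An explicit path to an existing file is used as-is.
    if [ -f "$name" ]; then
        echo "$name"
        return 0
    fi
    for dir in . ./reference "$HOME/.mixcr/reference" "${MIXCR_DIR:-.}/reference"; do
        # -f follows symlinks, so a symlinked library is accepted as well.
        if [ -f "$dir/$name.ll" ]; then
            echo "$dir/$name.ll"
            return 0
        fi
    done
    echo "loci library not found: $name" >&2
    return 1
}
```

With this sketch, `resolve_loci_library myLL` prints the first `myLL.ll` found, while `resolve_loci_library /path/to/myLL` returns the path unchanged if the file exists.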

User story 2 (usage of custom loci library)

If I have a custom loci library I can use it, whether or not I have installed it, in the following way:
- If it is installed as described above:
```
mixcr align --lociLibrary myLL ....
mixcr assemble ...
mixcr exportClones ...
```
  I don't have to specify the loci library a second time in assemble, as the *.vdjca file already contains this information.
- If I just have a file somewhere in the file system:
```
mixcr align --lociLibrary /path/to/myLL ....
mixcr assemble  --lociLibrary /path/to/myLL ...
mixcr exportClones  --lociLibrary /path/to/myLL ...
```

User story 3 (default loci library)

An installed library with a name other than default.ll will be used only if the user specifies it at the align step; if the --lociLibrary option is not given, the internal mi.ll is used.
A library installed under the name default.ll will be used by default instead.
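
The selection rule can be written down as a small decision function. This is only a sketch of the proposal: the function name `select_loci_library` and its two-argument interface are hypothetical, and a single reference directory is checked for simplicity.

```shell
# Hypothetical selection of the loci library for the align step.
# $1: value of --lociLibrary (empty string if the option was not given)
# $2: directory searched for installed libraries
select_loci_library() {
    requested=$1
    reference_dir=$2
    if [ -n "$requested" ]; then
        echo "$requested"                  # an explicit choice always wins
    elif [ -f "$reference_dir/default.ll" ]; then
        echo "$reference_dir/default.ll"   # installed default replaces the built-in
    else
        echo "mi.ll"                       # built-in library
    fi
}
```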

Shortcuts for frequently used sets of settings

rna = -OvParameters.geneFeatureToAlign=VTranscript
full-length = ...
short-full-length = full_length - FR4
dont-cluster = ....

Possible usage:

mixcr align -:rna -:dont-cluster ...

Here dont-cluster will be skipped, as it affects only the assembling stage. Permitting such things allows the same set of parameters to be passed to all stages, which in the end simplifies implementation of #14 into the following form:

mixcr analyse -r report.txt -:compress-intermediate -:rna -:dont-cluster my_name_R1.fastq.gz my_name_R2.fastq.gz

which will produce the following set of files:

my_name.vdjca.gz
my_name.clns.gz
my_name.txt
report.txt
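
The naming scheme can be illustrated with plain shell parameter expansion. This is a sketch under the assumption that inputs follow the `<name>_R1.fastq.gz` pattern; the function name `analyse_outputs` is hypothetical and nothing here calls MiXCR.

```shell
# Hypothetical helper: derive the expected output names from the R1 input name.
# Assumes the input is named <name>_R1.fastq.gz.
analyse_outputs() {
    r1=$1
    report=$2
    name=${r1%_R1.fastq.gz}   # strip the read suffix to get the sample name
    printf '%s\n' "$name.vdjca.gz" "$name.clns.gz" "$name.txt" "$report"
}

analyse_outputs my_name_R1.fastq.gz report.txt
```

For the example above this prints the four file names listed, one per line.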

Check for possible bias in TCR/IG extraction rates in RNA-Seq

By comparing number of extracted TCR/IG sequences with the number of alignments with corresponding C genes.

Rewrite algorithm of VDJCAlignerPVFirst to handle all possible paired-read structures

In a small percentage of cases in randomly sheared libraries, the alignment could gain additional total score from a J or C gene in the right part of a paired-end sequence.

Support p-segments

Estimate complexity of implementation of paired-chain analysis

Assembling by clonal sequences without CDR3

Add value marks to all command line parameters in documentation

Like this

--species {name}

instead of

--species

Support of clones with same CDR3 and different V, C or V+C genes

Requires

modification of Clustering from MiLib
modification of all major steps in Assembler

Correct status reporting when analysing *.vdjca.gz files

Include information on MiXCR version to binary vdjca/clns files

> mixcr info file.vdjca
Created on ...
MiXCR version ..
...

Problem with IGHV4-61

Unexpected output of mixcr program (v1.3). Fixed in v1.4?

I am testing the MixCR program (v1.3) and I have found an unusual situation when running 'exportAlignments'. The problem I have noticed is that the order in which sequences are provided in a FASTA or FASTQ file will affect the number of successful sequences that are aligned.

In the example(s) I provide below, I made a FASTA file containing 7 total sequences. There are only 4 unique NGS reads in the FASTA file; that is, I repeated one sequence 3x and a second sequence 2x. The remaining two sequences should not return strong hits.

If I run mixcr using the 7 test sequences (test1.fasta), then the Mixcr log file says that 2/7 sequences (rather than 5/7) returned results. This is problematic in that not all 5 are found, BUT even more problematic is that if I simply change the order of the sequences in the file (test2.fasta), then the Mixcr log file says 4/7 (rather than 5/7) returned results.

The fact that I do not see 5/7 sequences successfully returned seems to be a bug. Also, I would not expect the output of exportalignments to be sensitive to the order of the sequences in a file. Is this true? If so, is it a known problem?

If it's not a bug, then how can I run the settings so that I get all 5 successful sequences returned when using 'exportalignments'?

Add `cloneId` by default in `exportClones`

Filter out clones with empty clonalSequence

Two offset syntax for gene features

Like:

CDR3(-3,+3)

as a shortcut for

{CDR3Begin(-3):CDR3End(+3)}
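
The proposed shorthand is a purely textual rewrite, which a sed substitution can illustrate. This is a sketch of the idea only, not MiXCR's gene feature parser; the function name `expand_gene_feature` is hypothetical.

```shell
# Hypothetical expansion of NAME(a,b) into {NAMEBegin(a):NAMEEnd(b)}.
# Input that does not match the shorthand is printed unchanged.
expand_gene_feature() {
    printf '%s\n' "$1" |
        sed -E 's/^([A-Za-z0-9]+)\(([+-]?[0-9]+),([+-]?[0-9]+)\)$/{\1Begin(\2):\1End(\3)}/'
}

expand_gene_feature 'CDR3(-3,+3)'
```

The call above prints `{CDR3Begin(-3):CDR3End(+3)}`.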

UTR5BeginTrimmed reference point

Move FR4End bound in default Gene Features by 6 nucleotides to the left

To prevent possible artefacts on the right side of J gene alignment.

Parsing (~ Pandas/R ) friendly export column names

Make a special command line option for all export... actions to convert column names to names without spaces. E.g.:

N. Seq. CDR3 -> cdr3n
AA. seq. CDR3 -> cdr3aa
Min. qual. FR4 -> fr4MinQual
etc..
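
Since the issue gives the mapping only by example, the simplest illustration is a lookup table. In the sketch below the function name `export_column_alias` is hypothetical, and the fallback for unlisted columns (dropping spaces and dots) is an assumption, not a rule stated above.

```shell
# Hypothetical mapping from human-readable export headers to parser-friendly names.
export_column_alias() {
    case $1 in
        "N. Seq. CDR3")   echo "cdr3n" ;;
        "AA. seq. CDR3")  echo "cdr3aa" ;;
        "Min. qual. FR4") echo "fr4MinQual" ;;
        # Assumed fallback: remove spaces and dots from any other header.
        *) printf '%s\n' "$1" | tr -d ' .' ;;
    esac
}
```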

Cli wrapper to simplify analysis --- one command for all actions

Support custom alleles

Documentation for utility actions

-v option
versionInfo action
mergeAlignments action

Fix RST docs

Trim low quality letters on the end of reads ??

Otherwise such low-quality endings (with nearly random sequence) can lead to bad alignments with J rightFloatingBound = false?

Feature: extract aggregated reads for each clone

Amino acid mutations/alignments with reference genes

From this letter:

Oh well. One more question about the mutations: these are interpretable as SHM, right? Assuming there is not sequencing/pcr error, so NGS being completely error-free, then these mutations would be SHM, and, not as it is now the case a mixture of SHM and NGS-related errors, right? And a suggestion: it would be nice to have them also on the amino acid levels (similar to IMGT).

RNASeq optimized analysis mode

Update documentation for exportAlignmentsPretty

new output format
new option: --cdr3-contains
new option: --read-contains
new option: --verbose

Check for xmllint in importFromIMGT.sh

Make Hotfix!

Pipe MiXCR export

Check all export actions to output to stdout.

Optional assembly of all possible out-of-clonal-sequence sequences

E.g. if we assemble clones using CDR3, assemble consensus sequences for all other sequence parts covered by reads.

Individual coverage values for each letter in consensus sequences.

Handling of Java-specific parameters in mixcr script

mixcr -Xmx2g -Xms1g align ....
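
One way a wrapper script could separate the two kinds of arguments is sketched below. The function name `split_jvm_args` is hypothetical, the rule that every argument starting with `-X` belongs to the JVM is an assumption, and this simplified string-based version does not preserve arguments containing spaces.

```shell
# Hypothetical split of wrapper arguments: options starting with -X go to the
# JVM, everything else is passed on to MiXCR.
split_jvm_args() {
    jvm_args=""
    mixcr_args=""
    for arg in "$@"; do
        case $arg in
            -X*) jvm_args="${jvm_args:+$jvm_args }$arg" ;;
            *)   mixcr_args="${mixcr_args:+$mixcr_args }$arg" ;;
        esac
    done
    echo "jvm: $jvm_args"
    echo "mixcr: $mixcr_args"
}

split_jvm_args -Xmx2g -Xms1g align in.fastq out.vdjca
```

The example prints `jvm: -Xmx2g -Xms1g` followed by `mixcr: align in.fastq out.vdjca`.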

Aggregate quality in a right way

Convert docs to RST ? Move to readthedocs?

Add -v and --version to CLI

Add example for backward links in "Quick start"

Output reference points

Like this:

v1:::0:::56:93:102:::

where v1 is a version of reference points.
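
A consumer of such a column could split it as sketched below. The interpretation is an assumption: the first field is taken as the format version and every following field as one reference point, with empty fields meaning the point is not covered. Point names are not given in the issue, so the sketch only numbers them; the function name `parse_reference_points` is hypothetical.

```shell
# Hypothetical parser for a reference-point string such as v1:::0:::56:93:102:::
parse_reference_points() {
    printf '%s\n' "$1" | awk -F: '{
        print "version=" $1
        # Fields 2..NF are reference points; skip the empty (uncovered) ones.
        for (i = 2; i <= NF; i++)
            if ($i != "") print "point" (i - 1) "=" $i
    }'
}

parse_reference_points 'v1:::0:::56:93:102:::'
```

For the example string this prints `version=v1`, `point3=0`, `point6=56`, `point7=93` and `point8=102`, one per line.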

Add license information to documentation and release zip.

Merge vdjca files.

Limit possible set of D genes

Limit possible set of D genes only to loci of V and J genes.

To exclude combinations like:
TRBV -- IGHD -- TRBJ
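
The restriction amounts to a locus comparison, sketched here. The function names are hypothetical, and the sketch assumes that gene names start with a three-letter locus code (TRB, IGH, ...), which is how the example above is written; it is not MiXCR's actual implementation.

```shell
# Hypothetical check: accept a D gene only if its locus matches the V and J genes.
# Assumes gene names start with a three-letter locus code.
locus_of() {
    printf '%s\n' "$1" | cut -c1-3
}

d_gene_allowed() {
    v_locus=$(locus_of "$1")
    d_locus=$(locus_of "$2")
    j_locus=$(locus_of "$3")
    [ "$v_locus" = "$d_locus" ] && [ "$d_locus" = "$j_locus" ]
}
```

With this rule, `d_gene_allowed TRBV12-3 TRBD1 TRBJ2-1` succeeds, while a TRBV, IGHD, TRBJ combination is rejected.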

High priority for clones lower priority for alignments (vdjca files).

-vHitsWithoutScore export option

Add -OvjAlignmentOrder option to docs

Fix documentation

add options --filter-out-of-frames and --filter-stops to export
rename -presetFile with preset-file in export
rename -listFields with list-fields in export
add description for --save-reads in align
add description for --index in assemble
add info on possibility to add JVM args mixcr -Xmx2g align ...
add description for clones <-> reads mapping (export fields, new actions etc.)

compress vdjca files

RnaSeq parameters as default

My tests showed that rna-seq parameters perform slightly better on real, highly enriched datasets, although I expected the opposite effect. MiXCR with these parameters has a nearly zero false-positive rate, and sensitivity is also very high (it detects nearly all V(D)J events even in short 75+75 RNA-Seq datasets).

So, why don't we use these parameters as default?

Additional testing on broader spectrum of real enriched datasets required.

Backward links from clones to V(D)J alignments to initial reads

Implement actions for simple loci library creation

User story 1 (from IMGT-like reference):

I have a set of fasta files with reference sequences of V, D, J or C genes.
Each file is padded with . symbols or something similar to align anchor points. So each anchor point has the same position in all sequences. (exactly like IMGT gaps)
There is a file or command-line argument with the positions of all anchor points. Something like this:
```
V=108:117:125:148:157: etc...
```
I can create new loci library from this information or append it to already existing one:
```
mixcr addReferenceGenes --taxonId 9615 --speciesCommonName dog,canis --locus TRB --geneType V --anchorPoints 108:117:125:148:157 --geneNamePattern '...' input.fasta myLL.ll
```
this will create myLL.ll file or add locus information to it if it already exists.
- Loci library file can't store two records with the same combination of taxonId, locus and geneName.

User story 2 (library from genomic data, MiXCR way of LL creation):

I have a big fasta file with genomic sequence of chromosome or particular locus.
There is another file with tab-delimited list of reference genes. Example segments.txt:

GeneName	Locus	GeneType	AnchorPoints
TRBV12-3	TRB	V	123341:123356:123387:123456
...	...	...	...

I can create new loci library from this information or append it to already existing one:

mixcr addReferenceLocus --taxonId 9615 --speciesCommonName dog,canis input.fasta segments.txt myLL.ll

Filtered export

Export canonical clonotypes only: CXX...XX[WF] mask for CDR3
Export functional clonotypes only: no * or _ in CDR3, V segment is not a pseudogene
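
The two sequence-level checks can be expressed as shell patterns on the CDR3 amino acid sequence. This is a sketch with hypothetical function names; it reads the CXX...XX[WF] mask as "starts with C and ends with W or F", and it leaves out the pseudogene test, which needs gene annotation that is not available from the sequence alone.

```shell
# Hypothetical filters applied to a CDR3 amino acid sequence.

# Canonical: starts with C and ends with W or F.
is_canonical_cdr3() {
    case $1 in
        C*[WF]) return 0 ;;
        *)      return 1 ;;
    esac
}

# No stop codon (*) and no frameshift marker (_) in the sequence.
has_no_stop_or_frameshift() {
    case $1 in
        *[*_]*) return 1 ;;
        *)      return 0 ;;
    esac
}
```

Both functions return a zero exit status when the sequence passes, so they can be combined with `&&` when filtering exported rows.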

Add export column: identity of the V/D/J/C genes

Calculate percent of aligned letters and output it in the export tab-delimited files.

Extended export options

Export canonical clonotypes only: CXX...XX[WF] mask for CDR3
Export functional clonotypes only: no * or _ in CDR3, V segment is not a pseudogene