HLA-II ligand predictor.

License: Other

Makefile 2.44% C++ 97.56%

mixmhc2pred's Introduction

MixMHC2pred

MixMHC2pred is a pan-allele predictor of MHC class II ligands and epitopes. It is described in:
Racle, J., et al., Machine learning predictions of MHC-II specificities reveal alternative binding mode of class II epitopes. Immunity (2023) (available here).

and

Racle, J., et al. Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes. Nat. Biotechnol. 37, 1283–1286 (2019) (available here).

MixMHC2pred is also available as a web application: http://mixmhc2pred.gfellerlab.org.

Installation

Download the MixMHC2pred-2.0.zip file (https://github.com/GfellerLab/MixMHC2pred/releases) and move it to a directory of your choice where you have write permissions.
Unzip MixMHC2pred-2.0.zip in this directory.
The zip file above already contains the definition files for human alleles. If you want to make predictions for MHC-II alleles from other species, or if you cloned the git repository instead of downloading this zip file, you will need to download the definition files of the desired alleles from http://mixmhc2pred.gfellerlab.org/PWMdef. Unzip these additional PWM definition files in a folder of your choice, and use the option -f <folders> to indicate where they are (see below for details). Alternatively, you can copy or move all downloaded allele definition files into the folder PWMdef at the root of the directory where MixMHC2pred is installed, to avoid using the -f option.
To test your installation, make sure you are in the MixMHC2pred-2.0 directory and run the following command, depending on your operating system:
- Mac OS: ./MixMHC2pred -i test/testData.txt -o test/out.txt -a DRB1_15_01 DRB5_01_01 DPA1_02_01__DPB1_01_01 DQA1_01_02__DQB1_05_01 DQA1_01_02__DQB1_06_02
- Unix: ./MixMHC2pred_unix -i test/testData.txt -o test/out.txt -a DRB1_15_01 DRB5_01_01 DPA1_02_01__DPB1_01_01 DQA1_01_02__DQB1_05_01 DQA1_01_02__DQB1_06_02
- Windows: MixMHC2pred.exe -i test/testData.txt -o test/out.txt -a DRB1_15_01 DRB5_01_01 DPA1_02_01__DPB1_01_01 DQA1_01_02__DQB1_05_01 DQA1_01_02__DQB1_06_02
Your file test/out.txt should be the same as test/out_compare.txt. Running the software takes a few seconds, or longer when testing many peptides and alleles.

The testData.txt file corresponds to a subset of the HLA-II peptidomics data obtained from the cell line DOHH2 in Dheilly et al., Cancer Cell (2020), containing some peptides bound to the HLA in the reverse direction.
(Optional) To run MixMHC2pred from anywhere on your computer, create an alias for the MixMHC2pred executable or add it to your PATH.

If using a non-standard OS, it is possible to compile MixMHC2pred using the Makefile found in the bin folder.
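
The comparison of test/out.txt with test/out_compare.txt can also be scripted. The following is a minimal sketch, not part of MixMHC2pred: the helper names are hypothetical, and it assumes that header lines start with "#" (as in the example output shown in the issues further down this page). Skipping those lines is a design choice made here so that differences in the header, such as the input file path, do not count as a mismatch.

```python
from pathlib import Path


def data_lines(path):
    """Return the non-empty lines of an output file that are not '#' comments."""
    text = Path(path).read_text()
    return [
        line
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


def same_predictions(produced, reference):
    """True when both files contain the same prediction lines in the same order."""
    return data_lines(produced) == data_lines(reference)
```

For the installation test, calling same_predictions("test/out.txt", "test/out_compare.txt") from the MixMHC2pred-2.0 directory should return True.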

Running

Command

MixMHC2pred -i input_file -o output_file -a allele1 allele2 [additional options]

Depending on your operating system, use MixMHC2pred, MixMHC2pred_unix or MixMHC2pred.exe as indicated in the installation instructions.
Do not use spaces in your file or directory names.
Do not use other special characters (e.g., *, ?, %, &,...) in file or directory names.
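
When MixMHC2pred is called from a script, the command line above can be assembled as a list of arguments. The sketch below is an illustration only: build_command and run_mixmhc2pred are hypothetical helper names, and the only flags used are the -i, -o and -a options documented here. Additional options are appended after the alleles, following the order shown in the command above.

```python
import subprocess


def build_command(executable, input_file, output_file, alleles, extra_args=()):
    """Assemble the argument list for one MixMHC2pred run."""
    if not alleles:
        raise ValueError("at least one allele is required")
    # The documentation asks for file and directory names without spaces.
    for path in (input_file, output_file):
        if " " in str(path):
            raise ValueError(f"path contains a space: {path!r}")
    return [
        str(executable),
        "-i", str(input_file),
        "-o", str(output_file),
        "-a", *alleles,
        *extra_args,
    ]


def run_mixmhc2pred(executable, input_file, output_file, alleles, extra_args=()):
    """Run MixMHC2pred and raise CalledProcessError on a non-zero exit code."""
    cmd = build_command(executable, input_file, output_file, alleles, extra_args)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues; the executable name (MixMHC2pred, MixMHC2pred_unix or MixMHC2pred.exe) still has to be chosen for your operating system.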

Required arguments

-i <file> or --input <file> (input file name):
File listing all the peptides. It should contain two columns: the 1st column is the sequence of the peptide and the 2nd column is its context sequence (12 amino acids long: 3 residues upstream of the peptide, 3 N-terminal residues of the peptide, 3 C-terminal residues of the peptide and 3 residues downstream of the peptide). When the peptide lies near the beginning or end of a protein, the missing context AAs should be written as "-". For example, for the protein ACDEFG..., if the peptide is CDEFG..., the first 6 AAs encoding its context should be written as --ACDE (and these 6 AAs should be directly followed by the 6 AAs describing the context near the C-terminus of the peptide). If some AAs from the context of a peptide are not known, the unknown AAs should be written with the letter X. See test/testData.txt for an example input file. When using the --no_context option (see below), this input file should only contain the list of peptides, without the 2nd column for the context (an example input file for this case is available at test/testData_noContext.txt).
-o <file> or --output <file> (output file name):
The name of the output file (including the directory). Peptides are kept in the same order as in the input file.
-a <alleles> or --alleles <alleles>:
List of MHC-II alleles to test. If you want to make predictions with multiple alleles, list the different alleles separated by a space (e.g. -a DRB1_11_01 DRB3_02_02).
Use the nomenclature DRB1_03_01 for HLA-DRB1*03:01 and DPA1_01_03__DPB1_04_01 for HLA-DPA1*01:03-DPB1*04:01. The list of alleles available and corresponding nomenclature is given in http://mixmhc2pred.gfellerlab.org/PWMdef (also given in the files Alleles_list_xxx.txt in the corresponding PWMdef folders). Simply, the names used in MixMHC2pred are obtained by dropping the HLA- from human alleles, by replacing all "-", "*" and ":" by "_", and by placing "__" between the alpha and beta chains forming the heterodimer (the invariant DR-alpha chains are not indicated in the allele names as in standard practice).
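
The renaming rule for human alleles can be sketched in a few lines. This is an illustration under stated assumptions, not code from MixMHC2pred: the function name is hypothetical, only the human HLA case described above is handled, and the result should still be checked against the Alleles_list_xxx.txt files.

```python
def to_mixmhc2pred_name(allele):
    """Convert a standard human allele name to the MixMHC2pred nomenclature.

    'HLA-DRB1*03:01'            becomes 'DRB1_03_01'
    'HLA-DPA1*01:03-DPB1*04:01' becomes 'DPA1_01_03__DPB1_04_01'
    """
    name = allele.strip()
    if name.startswith("HLA-"):
        name = name[len("HLA-"):]
    # After dropping 'HLA-', a remaining '-' separates the alpha and beta chains.
    chains = name.split("-")
    if len(chains) > 2:
        raise ValueError(f"unexpected allele name: {allele!r}")
    cleaned = [chain.replace("*", "_").replace(":", "_") for chain in chains]
    return "__".join(cleaned)
```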

Optional arguments

--no_context:
By default, MixMHC2pred includes the peptide context in its predictions (i.e. a sequence of 12 AAs in total, including the AAs just before and just after the peptide, as explained above). With this option, no context information is considered at all. It is generally advised to include the context when searching for the best candidate epitopes. But when analyzing pre-cleaved peptide sequences a posteriori (e.g. in experiments testing specific peptides directly, which therefore did not need to be cleaved by the cell), it may be better not to consider the context encoding (often multiple overlapping epitopes are observed, so the peptide tested may not correspond to the best peptide based on context, but it could still be recognized by the same T cells when given directly). When using the --no_context option, the input file should only contain the list of peptides, without their context. An example input file is available at test/testData_noContext.txt; the results of running MixMHC2pred without the context on this file, with the same alleles as in the first example above, are given in test/out_noContext_compare.txt.
-f <folders> or --allelesFolder <folders> (folder(s) containing allele definitions):
If the folder containing the PWM definition files of the alleles is not at its default location ("path_to_exec/PWMdef"), you can use this option to indicate where these files are located. It is possible to list multiple folders, separated by a space (when an allele is found in multiple folders, the first definition found for this allele is used). This path can be given as a full path, a path relative to the current location, or with the special keyword exec: (e.g., -f exec:PWMdef) to give the path relative to the root folder of the MixMHC2pred executable. Another special keyword, default, can also be used: it represents the default path where these files are located (i.e., it is equivalent to using exec:PWMdef).
-e or --extra_out: By default, the score returned by MixMHC2pred is the final peptide presentation %Rank score, and this is the score recommended for all analyses and predictions. If you are nevertheless interested in other intermediate scores, you can pass the -e option. In this case, additional columns are appended to the output file (the first columns are the same as when running MixMHC2pred without this option). The new columns give:
- the Score_... corresponding to the raw score returned by the 2nd block of the neural network described in our paper, a value of 1 being the best score and 0 the worst. Note however that the %Rank values are obtained from this score by taking into account the expected peptide length distribution. Note also that the neural network is repeated multiple times, with the returned %Rank / Score being the average of the %Rank / Score from the repetitions, respectively, so it is not possible to directly transform this returned Score into the %Rank.
- ScorePWM_... corresponding to the binding scores based on the position weight matrices (eq. (2) from our manuscript). The worst score is 0 and bigger scores are better without having an upper limit. The corresponding %RankPWM_... are percentiles computed separately per peptide length.
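
The 12-residue context expected in the 2nd column of the input file (see the -i argument above) can be derived from the source protein. The following is a minimal sketch with a hypothetical function name. It assumes a 0-based start index and pads with "-" on the right when the peptide lies near the end of the protein, which mirrors the documented left padding near the beginning.

```python
def build_context(protein, start, length, flank=3):
    """Build the context string for the peptide protein[start:start + length].

    The context is: `flank` residues upstream, the first `flank` residues of
    the peptide, its last `flank` residues, and `flank` residues downstream.
    Positions falling outside the protein are written as '-'.
    """
    end = start + length
    if start < 0 or end > len(protein) or length < flank:
        raise ValueError("peptide does not fit in the protein")
    peptide = protein[start:end]
    upstream = protein[max(0, start - flank):start].rjust(flank, "-")
    downstream = protein[end:end + flank].ljust(flank, "-")
    return upstream + peptide[:flank] + peptide[-flank:] + downstream
```

For a protein starting with ACDEFG... and a peptide starting at its second residue, the first 6 characters returned are --ACDE, as in the example given for the -i argument. Unknown residues (letter X) are not handled by this sketch.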

Results returned and additional information

MixMHC2pred is meant for scoring different peptides and prioritising the most likely HLA-II ligands and epitopes. As it is trained on naturally presented peptides, it does not output a predicted affinity value, simply a score.
Input should consist of a list of peptides, not proteins. Currently, MixMHC2pred does not cut longer peptides/proteins into shorter fragments: it uses the peptides given in input as is.
The score is computed for each allele provided in input. Results are returned for each allele in separate columns and additional columns give the results from the best allele for each peptide (columns BestAllele and ..._best in the output file, determined by the allele that had the best score, i.e. the most likely allele by which the peptide would be presented).
The first two columns of the output file give the Peptide and Context sequence of the peptide, as given in the input file. When the option --no_context is used, the column Context is kept but it is empty.
The scores returned (columns %Rank) correspond to a percentile rank (best score is about 0, worst score is 100). It gives the percentage of random peptides expected to be better presented by this allele than the given peptide.
The CoreP1_... columns give the most likely binding core position of the given peptide for the given allele: the position of the first amino acid of the binding core (which has a size of 9 aa in the predictions), counted from 1. For example, if the binding core corresponds to the first 9 amino acids of the peptide, CoreP1 = 1.
For convenience, the binding core sequence is also indicated for the best allele per peptide (column Core_best; for the other alleles this can be obtained from the CoreP1 as indicated above).
The subSpec_... columns tell in which sub-specificity the given peptide is likely bound to the given allele. The value 1 corresponds to the main sub-specificity (the only one for most alleles). For the DRB1*08:01 allele, for example, a 2nd sub-specificity exists and is indicated by the value 2. For alleles accommodating reverse binding, a value of -1 indicates that the given peptide is bound in the reverse orientation.
Peptides shorter than 12 amino acids, longer than 21 amino acids or containing non-standard amino acids are kept but with a score of "NA".
The list of alleles available is provided in http://mixmhc2pred.gfellerlab.org/PWMdef (also given in the files Alleles_list_xxx.txt in the corresponding PWMdef folders). These files show the nomenclature to use when running MixMHC2pred and the standard nomenclature used for example in the IPD-IMGT/HLA database.
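
The output columns described above can be read back with a short script. This is a sketch under assumptions drawn from the example output shown in the issues on this page: header lines start with "#", the table is tab-separated, and missing values are written as NA. The function names are hypothetical.

```python
import csv

CORE_LENGTH = 9  # size of the binding core used in the predictions


def read_predictions(path):
    """Read a MixMHC2pred output file into a list of dicts, skipping '#' lines."""
    with open(path, newline="") as handle:
        rows = (line for line in handle if not line.startswith("#"))
        return list(csv.DictReader(rows, delimiter="\t"))


def rank_or_none(value):
    """Convert a %Rank cell to float, or None when the peptide was not scored."""
    return None if value == "NA" else float(value)


def binding_core(peptide, core_p1, core_length=CORE_LENGTH):
    """Return the core sequence from a 1-based CoreP1 value, or None for NA."""
    if core_p1 == "NA":
        return None
    start = int(core_p1) - 1
    return peptide[start:start + core_length]
```

For a given row, binding_core(row["Peptide"], row["CoreP1_DRB1_04_01"]) gives the core for that allele, in the same way that Core_best is reported for the best allele.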

Latest version

The latest version of MixMHC2pred is available at https://github.com/GfellerLab/MixMHC2pred/releases.

Check the NEWS file to see the main changes in a given version.

Web application

MixMHC2pred is also available as a web application at http://mixmhc2pred.gfellerlab.org.

License

MixMHC2pred can be used freely by academic groups for non-commercial purposes (see license). The product is provided free of charge, and, therefore, on an "as is" basis, without warranty of any kind.

FOR-PROFIT USERS: If you plan to use MixMHC2pred (version 2.0) or any data provided with the script in any for-profit application, you are required to obtain a separate license. To do so, please contact Nadette Bulgin ([email protected]) at the Ludwig Institute for Cancer Research Ltd.

Contact information

For scientific questions, please contact Julien Racle ([email protected]) or David Gfeller ([email protected]).

For license-related questions, please contact Nadette Bulgin ([email protected]).

How to cite

To cite MixMHC2pred, please refer to:

Racle, J., et al., Machine learning predictions of MHC-II specificities reveal alternative binding mode of class II epitopes. Immunity (2023) (available here).

and

Racle, J., et al. Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes. Nat. Biotechnol. 37, 1283–1286 (2019) (available here).

mixmhc2pred's Issues

MixMHC2pred fails to predict for some random natural peptides

Hi Julien

I am trying to run MixMHC2pred on some datasets consisting of both eluted ligands and random natural peptides from the human proteome (for a benchmark). For a very small proportion of the random natural peptides, MixMHC2pred does not output any predictions, only a row of NAs. As an example, when predicting for these random peptides

NVASYUGLTGQYI
LRKPNSDULGMEEKNI
VLNFGSCTUPSFMFKF

with the alleles DRB1_04_01 DRB1_07_01 DQA1_01_01__DQB1_03_01 DQA1_01_01__DQB1_03_03

I get the following output:

####################
# Output from MixMHC2pred (v2.0.2)
# Input file: xx
# Alleles: DRB1_04_01, DRB1_07_01, DQA1_01_01__DQB1_03_01, DQA1_01_01__DQB1_03_03
# Alleles folder: 'exec:/PWMdef/';
# Context type: none
#
# MixMHC2pred is freely available for academic users.
# Private companies should contact [email protected] at the Ludwig Institute
#  for Cancer Research Ltd for commercial licenses.
#
# To cite MixMHC2pred, please refer to:
# Racle, J., et al., Machine learning predictions of MHC-II specificities
#  reveal alternative binding mode of class II epitopes. bioRxiv (2022).
# and
# Racle, J., et al. Robust prediction of HLA class II epitopes by deep motif
#  deconvolution of immunopeptidomes. Nat. Biotechnol. 37, 1283-1286 (2019).
#
####################
Peptide	Context	BestAllele	%Rank_best	Core_best	CoreP1_best	SubSpec_best	%Rank_DRB1_04_01	CoreP1_DRB1_04_01	SubSpec_DRB1_04_01	%Rank_DRB1_07_01	CoreP1_DRB1_07_01	SubSpec_DRB1_07_01	%Rank_DQA1_01_01__DQB1_03_01	CoreP1_DQA1_01_01__DQB1_03_01	SubSpec_DQA1_01_01__DQB1_03_01	%Rank_DQA1_01_01__DQB1_03_03	CoreP1_DQA1_01_01__DQB1_03_03	SubSpec_DQA1_01_01__DQB1_03_03
NVASYUGLTGQYI		NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
LRKPNSDULGMEEKNI		NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
VLNFGSCTUPSFMFKF		NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

It's quite weird, because the method is able to predict the vast majority of the random peptides, but some of them fail like these ones. I have attached an overview of the peptide sets which failed to give a prediction, along with the alleles I inputted for each set of peptides.
failed_peptides.txt

Hope you can shed some light on this issue, and why these peptides are not yielding any predictions.

Best
Jonas

Interpretation of columns

Dear author(s),

I am wondering the following:
MixMHCpred for class-I has both a Score_bestAllele and a %Rank_bestAllele that both give very different outputs.
MixMHC2pred for class-II has just a %Rank_best column and no real score.

We want to compare epitope presentation patterns of different proteomes based on your prediction tools, for both class-I and class-II. We would want to use a single metric for comparison. Would it be wise to compare %Rank's metrics?

Or am I missing a binding score for MixMHC2pred? I guess not considering you wrote the following in the paper:
"This binding score is then transformed to a percentile rank per peptide length by
comparing it to the score of 10’000 random human peptides of the same length, and
then further transformed to a global percentile rank by making that the top 1% random
human peptides follow the same peptide length distribution than the global peptide
length distribution observed in our HLA-II peptidomics data. Finally, when the score
among multiple alleles is requested, the score from each peptide is taken as its best
percentile rank among all the alleles"

Position of mutation and length of peptides

Dear author(s),

When using this tool, I encountered a problem:
If the mutated amino acid is at the end of the peptide, how should the length of the input peptide be selected? Because the input length of the software is limited to 12-25 amino acids, if I select the 9-length peptide upstream of the mutated amino acid, the software will report an NA, but if I select a longer peptide upstream of the mutated amino acid (for example, a 12-length peptide), the first 9 amino acids will not contain the mutated amino acid.

Could you give some suggestions?
Best regards

How do we filter the large result?

Hi, are there any suggestions for filtering the final results?

How to distinguish strong and weak binder?

Thanks for your work in MixMHC2pred. I have tried this tool and found it was very easy to use. Currently, I want to compare it with netMHCpanII, so I am writing to ask you how to distinguish the strong and weak binder in the output file. As you know, the netMHCpan indicates strong and weak binder by the value of rank.
Thanks very much!

Best regards
Xiaoxiang

Programme does not accept txt files as -a parameter inputs

Hi!

I'm just running into an issue where I'm trying to predict binding for all available HLA alleles. I can't simply provide all of the allele names as parameters in command line because I run into the ~8000 character limit. I have therefore provided a path to a file containing the names of all of the HLA alleles separated by a space, as recommended by the documentation. However the programme refuses to run and returns the following error message:

The allele(s) 'C:\Me\alleles.txt' aren't defined in 'C:\Users\Me\MixMHC2pred-2.0/PWMdef/', cannot perform the computations.

There is no clear reason for this to occur, the txt is correctly formatted using the approved names from the human alleles list file, and the encoding is UTF-8. Are there any other ways to run binding predictions for all alleles in the programme's definitions?

Thank you.

Training data for MixMHCIIpred

hi,

Thank you for your work in MixMHC2pred. Now, I want to compare it with other tools. So I'm trying to find the training data set of this tool but can't find it. Can I have access to the training dataset?

Cutoff selection of %Rank score when predicting HLA-II neo-peptide using MixMHC2pred

Hi, we want to use MixMHC2pred to predict candidate HLA-II neo-peptide from NGS data. As you mentioned in README "The score returned (%Rank, column 3) corresponds to a percentile rank (best score is 0, worst score is 100)". Could you please give any advice on how to select the cutoff (like 50 or 95) to ensure both the specificity and sensitivity? Thank you!

Getting a std::out_of_range error

Hi,

I was testing your program with a complete proteome against just a randomly picked allele, and I'm getting this error. Any idea on why, or how to fix this? It works well with just a single protein.

Thanks in advance!

Make tool available as a python package?

Our team is interested in integrating MixMHC2pred into pVACtools. I'm not sure what language it was written in but we were wondering if there are any plans of making it available as a python package?

Question on the use of MixMHC2pred webserver

Dear developer,

Hello, I encountered some issues while using MixMHC2pred webserver. For the first block, which is used to predict binding specificity, after entering an existing allele in the text box and clicking the "Predict Allele Specificities" button, the page displays "Cannot POST /alleles." It seems that an error has occurred. Additionally, the "List alleles" and "Help/FAQ" modules on your web server also do not open. When I click the corresponding buttons, no content is displayed.

Looking forward to your reply.

FATAL: kernel too old

Dear developers,
MixMHC2pred is a powerful tool for peptide prediction with MHC-II alleles, but an error occurred when I tested the binary file, as follows:

My Linux system is CentOS 6.5, so I would like to know the environment in which the tool works. I look forward to your reply, thanks.

Include raw prediction score in output

Hi there,

Currently the program only outputs the percentile rank score for each peptide. Is it possible to include the raw prediction scores (i.e. the prediction between 0 and 1 which is used to calculate the rank score) in the output for benchmarking purposes?

Best regards,
Jonas

The method of obtaining random peptides

Hi,

The paper describes the method used to calculate %Rank: "This binding score is then transformed to a percentile rank per peptide length by comparing it to the score of 10,000 random human peptides of the same length". But I don't understand how these random peptides were obtained. Were they obtained by randomly translating human genome regions into peptides?

Thank you.
