deepomicslab / spechla Goto Github PK

SpecHLA reconstructs entire diploid sequences of HLA genes and infers LOH events. It supports HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes. Also, it supports both short- and long-read data.

License: MIT License

Shell 0.72% Python 4.47% Perl 0.72% Makefile 0.11% C 17.13% JavaScript 1.36% C++ 71.58% TeX 2.32% Roff 0.21% CMake 0.41% Jupyter Notebook 0.04% Cuda 0.94%

hla-typing haplotype loh hla-sequence nanopore pacbio-data phase cancer local-assembly reads-binning

spechla's Introduction

SpecHLA: full-resolution HLA typing from sequencing data

SpecHLA is a software package leveraging reads binning and local assembly to achieve accurate full-resolution HLA typing and loss-of-heterozygosity detection.

SpecHLA reconstructs diploid sequences of HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes.
SpecHLA accepts short-reads-only, long-reads-only, and short+long-reads data.
SpecHLA supports WES, WGS, and RNA-seq data.
SpecHLA detects HLA loss-of-heterozygosity events.

Install

First, create the env with conda, and activate the env.

git clone https://github.com/deepomicslab/SpecHLA.git --depth 1
cd SpecHLA/
conda env create --prefix=./spechla_env -f environment.yml
conda activate ./spechla_env

Second, make the softwares in bin/ executable.

chmod +x -R bin/*

Third, index the database and install the packages.

unset LD_LIBRARY_PATH && unset LIBRARY_PATH 
bash index.sh

Perform SpecHLA with

bash script/whole/SpecHLA.sh -h

With only long reads, run

python3 script/long_read_typing.py -h

Note:

SpecHLA requires a License of Novoalign in bin/. If not detected, it uses bowtie2 as a replacement automatically. The License file of Novoalign should be put in the bin/ folder before running bash index.sh.
If you want to run SpecHLA with only long-read data, there is no need for the Novoalign license.
SpecHLA now supports Linux and Windows WSL systems.
SpecHLA does not accept short single-end reads.
The conda environment should be located in the local folder.

Test

Please go to the example/ folder, run SpecHLA with given scripts, and find results in the output/.

Basic Usage

Main functions

Scripts	Description
script/ExtractHLAread.sh	Extract HLA reads from enrichment-free data.
script/whole/SpecHLA.sh	HLA typing with paired-end (PE), PE+long reads, PE+HiC, or PE+10X data.
script/long_read_typing.py	HLA typing with only long-read data.
script/typing_from_assembly.py	HLA Typing from diploid assemblies.
script/cal.hla.copy.pl	Detect HLA LOH events based on SpecHLA's typing results.

Extract HLA-related reads

First extract HLA reads with enrichment-free data. Otherwise, HLA typing would be slow. Map reads to hg19 or hg38, then use script/ExtractHLAread.sh to extract HLA-related reads. We use the script of Kourami with minor revision for this step. Extract HLA-related reads by

USAGE: bash script/ExtractHLAread.sh -s <sample_id> -b <bamfile> -r <refGenome> -o <outdir>

 -s          : desired sample name (ex: NA12878) [required]

 -b          : sorted and indexed bam or cram (ex: NA12878.bam) [required]

 -r          : hg38 or hg19

 -o          : folder to save extracted reads [required]

HLA Typing

Full-resolution and exon HLA typing using SpecHLA. With Exome data like WES or RNASeq, only support exon typing. For efficient HLA typing, we strongly recommend utilizing only HLA-related reads. Specifically, for enrichment-free data, we recommend first performing the aforementioned step.

Perform full-resolution HLA typing with paired-end reads by

bash script/whole/SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform exon HLA typing with paired-end reads by

bash script/whole/SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/ -u 1

Perform full-resolution HLA typing with paired-end reads and PacBio reads by

bash script/whole/SpecHLA.sh -n <sample> -t <sample.pacbio.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and Nanopore reads by

bash script/whole/SpecHLA.sh -n <sample> -e <sample.nanopore.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and Hi-C reads by

bash script/whole/SpecHLA.sh -n <sample> -c <sample.hic.fwd.fq.gz> -d <sample.hic.rev.fq.gz> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Perform full-resolution HLA typing with paired-end reads and 10X linked reads by (LongRanger should be installed in system env)

bash script/whole/SpecHLA.sh -n <sample> -x <sample.10x.read.folder> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Consider long Indels and use population information for annotation by

bash script/whole/SpecHLA.sh -n <sample> -v True -p <Asia> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o outdir/

Full arguments can be seen in

SpecHLA: Full-resolution HLA typing from sequencing data.

Note:
  1) Use HLA reads only, otherwise, it would be slow. Use ExtractHLAread.sh to extract HLA reads first.
  2) WGS, WES, and RNASeq data are supported.
  3) With Exome data like WES or RNASeq, must select exon typing  (-u 0).
  4) Short single-end read data are not supported.

Usage:
  bash SpecHLA.sh -n <sample> -1 <sample.fq.1.gz> -2 <sample.fq.2.gz> -o <outdir>

Options:
  -n        Sample ID. <required>
  -1        The first fastq file of paired-end data. <required>
  -2        The second fastq file of paired-end data. <required>
  -o        The output folder to store the typing results. Default is ./output
  -u        Choose full-length or exon typing [0|1]. 0 indicates full-length, 1 means exon,
            default is 0. With Exome or RNA data, must select 1 (i.e., exon typing).
  -p        The population of the sample [Asian, Black, Caucasian, Unknown, nonuse] for annotation.
            Default is Unknown, meaning use mean allele frequency in all populations. nonuse indicates
            only adopting mapping score and considering zero-frequency alleles.
  -j        Number of threads [5].
  -t        Pacbio fastq file.
  -e        Nanopore fastq file.
  -c        fwd hi-c fastq file.
  -d        rev hi-c fastq file.
  -x        Path of folder created by 10x demultiplexing. Prefix of the filenames of FASTQs
            should be the same as Sample ID. Please install Longranger in the system env.
  -w        The weight to use allele imbalance info for phasing [0-1]. Default is 0 that means
            not use. 1 means only use imbalance info; other values integrate reads and allele imbalance.
  -m        The maximum mismatch number tolerated in assigning gene-specific reads. Deault
            is 2. It should be set larger to infer novel alleles.
  -y        The minimum different mapping score between the best and second-best aligned genes.
            Discard the read if the score is lower than this value. Deault is 0.1.
  -v        True or False. Consider long InDels if True, else only consider small variants.
            Default is False.
  -q        Minimum variant quality. Default is 0.01. Set it larger in high quality samples.
  -s        Minimum variant depth. Default is 5.
  -a        Use this long InDel file if provided.
  -r        The minimum Minor Allele Frequency (MAF), default is 0.05 for full length and
            0.1 for exon typing.
  -k        The mean depth in a window lower than this value will be masked by N, default is 5.
            Set 0 to avoid masking.
  -z        Whether only mask exon region, True or False, default is False.
  -f        The trio infromation; child:parent_1:parent_2 [Example: NA12878:NA12891:NA12892]. If provided,
            use trio info to improve typing. Note: use it after performing SpecHLA once already.
  -b        Whether use database for unlinked block phasing [0|1], default is 1 (i.e., use).
  -i        Location of the IMGT/HLA database folder, default is db.
  -l        Whether remove all tmp files [0|1], default is 1.
  -h        Show this message.

HLA typing with long-read data alone

Perform HLA typing only with long reads by

usage: python3 long_read_typing.py -h

HLA Typing with only long-read data.

Required arguments:
  -r                  Long-read fastq file. PacBio or Nanopore. (default: None)
  -n                  Sample ID (default: None)
  -o                  The output folder to store the typing results. (default: ./output)

Optional arguments:
  -p                  The population of the sample [Asian, Black, Caucasian, Unknown, nonuse] for annotation.
                        Unknown means use mean allele frequency in all populations. nonuse indicates only adopting
                        mapping score and considering zero-frequency alleles. (default: Unknown)
  -j                  Number of threads. (default: 5)
  -d                  Minimum score difference to assign a read to a gene. (default: 0.001)
  -g                  Whether use G group resolution annotation [0|1]. (default: 0)
  -m                  1 represents typing, 0 means only read assignment (default: 1)
  -k                  The mean depth in a window lower than this value will be masked by N, set 0 to avoid masking
                        (default: 5)
  -a                  Prefix of filtered fastq file. (default: long_read)
  -y                  Read type, [nanopore|pacbio]. (default: pacbio)
  --minimap_index     Whether build Minimap2 index for the reference [0|1]. Using index can reduce memory usage.
                        (default: 0)
  --db                db dir. (default: /home/wangshuai/softwares/SpecHLA/script/../db/)
  --strand_bias_pvalue_cutoff
                        Remove a variant if the allele observations are biased toward one strand (forward or reverse).
                        Recommand setting 0 to high-depth data. (default: 0.01)
  --seed              seed to generate random numbers (default: 8)
  --max_depth         maximum depth for each HLA locus. Downsample if exceed this value. (default: 2000)
  -h, --help

HLA Typing from diploid assemblies

usage: python3 typing_from_assembly.py -h

HLA Typing from diploid assemblies.

Required arguments:
  -1        Assembly file of the first haplotype in fasta formate (default: None)
  -2        Assembly file of the second haplotype in fasta formate (default: None)
  -n        Sample ID (default: None)
  -o        The output folder to store the typing results. (default: ./output)

Optional arguments:
  -j        Number of threads. (default: 10)
  -h, --help

An example of running command is

python SpecHLA/script/typing_from_assembly.py -j 15 -o output -n v12_HG00096 -1 v12_HG00096_hgsvc_pbsq2-clr_1000-flye.h1-un.arrow-p1.fasta -2 v12_HG00096_hgsvc_pbsq2-clr_1000-flye.h2-un.arrow-p1.fasta

The result is stored in <sample>_extracted_HLA_alleles.fasta, please see result description in the below Interpret output section.

Pedigree phasing

After performing HLA typing, we can afford trio information to update the results of child by

bash script/SpecHLA.sh -n <child> -1 <child.fq.1.gz> -2 <child.fq.2.gz> -o outdir/ --trio child:parent1:parent2

The fresh results can be found in the trio/ folder inside the original result folder.

Detect HLA LOH in tumor samples

Based on the tumor purity and ploidy, we can infer HLA LOH event by

usage:  perl cal.hla.copy.pl [Options] -S <samplename> -C <5> -purity <Purity> -ploidy <Ploidy> -F <filelist> -T <hla.result.txt> -O <outdir>

        OPTIONS:
        -purity  [f]  the purity of the tumor sample. <required>
        -ploidy  [f]  the ploidy of the tumor sample. <required> Note: tumor purity and ploidy can be inferred by software like ABSOLUTE and ASCAT.
        -S       [s]  sample name <required>
        -C       [f]  the cutoff of heterogeneous snp number. Default is 5.
        -F       [s]  the HLA_*_freq.txt file list. <required>  Obtained by "ls typing_result_dir/*_freq.txt >freq.list"
        -T       [s]  the hla typing result file of Spechla <required>
        -O       [s]  the output dir. <required> Need exist before running.

The result file "merge.hla.copy.txt" can be found in the output dir.

Use ls typing_result_dir/*_freq.txt >freq.list to generate the file required by -F. The command example is

perl script/cal.hla.copy.pl -purity 0.5 -ploidy 2 -S test -F freq.list -T hla.result.txt -O ./

Interpret output

In the denoted outdir, the results of each sample are saved in a folder named as the sample ID.

In the directory of one specific sample, you will find the below files:

Output	Description
hla.result.txt	HLA-typing results for all alleles
hla.result.details.txt	All the alleles with the highest mapping score in annotation
hla.result.g.group.txt	G group resolution HLA type
hla.allele..HLA_.fasta	Sequence of each allele (the low-depth region is masked by N)
HLA_*_freq.txt	Haplotype frequencies of each gene
HLA_*.rephase.vcf.gz	Phased vcf file for each gene
hlala.like.results.txt	Report all potential types like HLA*LA, only generated by long-read-only mode

If you performed pedigree phasing, you will find below files.

Output	Description
trio/HLA_*.trio.vcf.gz	Phased vcf file after pedigree phasing
hla.allele..HLA_.fasta	Allele sequence after pedigree phasing

An example for hla.result.txt is as below:

Sample  HLA_A_1 HLA_A_2 HLA_B_1 HLA_B_2 HLA_C_1 HLA_C_2 HLA_DPA1_1      HLA_DPA1_2      HLA_DPB1_1      HLA_DPB1_2      HLA_DQA1_1    HLA_DQA1_2      HLA_DQB1_1      HLA_DQB1_2      HLA_DRB1_1      HLA_DRB1_2
HG00118 A*24:02:01:01   A*02:01:01:01   B*07:02:01:01   B*40:01:02:02   C*03:04:01:01   C*07:02:01:03   DPA1*01:03:01:02        DPA1*01:03:01:02    DPB1*04:01:01:01        DPB1*04:01:01:01        DQA1*01:02:01:01        DQA1*01:02:01:01        DQB1*06:02:01:01        DQB1*06:02:01:01    DRB1*15:01:01:01        DRB1*15:01:01:01

Note:

We handle each gene independently. Hence the result does not contain the linkage information between different genes.
The Symbol - means low confidence sequence which cannot map to any allele in the database.

An example for HLA_*_freq.txt is as below:

# Haplotype     Frequency
hla.allele.1.HLA_A.fasta 0.532
hla.allele.2.HLA_A.fasta 0.468
# The number of heterozygous variant is 30

After typing from diploid assmebly, HLA sequences are in <sample>_extracted_HLA_alleles.fasta, and HLA types can be got by cat <sample>_extracted_HLA_alleles.fasta| grep >, an example is as below:

>v12_HG00096.h1.HLA-A   cluster13_contig_98:2073847-2077365     A*29:02:01:01   # version: IPD-IMGT/HLA 3.51.0
>v12_HG00096.h1.HLA-B   cluster13_contig_98:657830-661911       B*44:03:01:01   # version: IPD-IMGT/HLA 3.51.0
...
>v12_HG00096.h2.HLA-A   cluster13_scaffold_119:2516943-2520446  A*01:01:01:01   # version: IPD-IMGT/HLA 3.51.0
...

Interpret each column in the annotation line as

Column	Description
first	gene and haplotype, h1 means the first haplotype
second	the region to extract the sequence from the assembly
third	the best matched IMGT allele, i.e., the typing result
four	IMGT version

An example for HLA LOH result file: merge.hla.copy.txt is as below:

Sample  HLA     Allele1                 Allele2              copyratio       KeptHLA                 LossHLA                 Freq1   Freq2   Purity  Het_num LOH
test    A       A*24:02:01:01           A*02:01:01:01           1:1          A*24:02:01:01           A*02:01:01:01           0.507   0.493   0.5     100     N
test    B       B*07:02:01:01           B*40:01:02:01           1:1          B*07:02:01:01           B*40:01:02:01           0.502   0.498   0.5     32      N
test    C       C*03:04:01:01           C*07:02:01:03           1:1          C*07:02:01:03           C*03:04:01:01           0.433   0.567   0.5     99      N
test    DPA1    DPA1*01:03:01:02        DPA1*01:03:01:02        2:0          DPA1*01:03:01:02        homogeneous             1       0       0.5     0       N
test    DPB1    DPB1*04:01:01:01        DPB1*04:01:01:01        2:0          DPB1*04:01:01:01        DPB1*04:01:01:01        0.909   0.091   0.5     1       N
test    DQA1    DQA1*01:02:01:01        DQA1*01:02:01:01        1:1          DQA1*01:02:01:01        DQA1*01:02:01:01        0.548   0.452   0.5     1       N
test    DQB1    DQB1*06:02:01:01        DQB1*06:02:01:01        2:0          DQB1*06:02:01:01        homogeneous             1       0       0.5     0       N
test    DRB1    DRB1*15:01:01:01        DRB1*15:01:01:01        2:0          DRB1*15:01:01:01        DRB1*15:01:01:01        0.65    0.35    0.5     2       N

Interpret each column as

Header	Description
Sample	Sample Name
HLA	HLA locus
Allele1	HLA type of the first allele
Allele2	HLA type of the second allele
copyratio	Copy numbers of two alleles
KeptHLA	The remaining allele
LossHLA	The possible lost allele
Freq1	Frequency of the first allele
Freq2	Frequency of the second allele
Purity	Tumor purity
Het_num	Number of heterozygous variant
LOH	Whether LOH exists (Y or N)

Update to latest IMGT/HLA database

To nomenclature the HLA alleles using the latest IMGT/HLA database, please execute the following command:

cd SpecHLA/bin
perl renew_HLA_annotation_db.pl

This script will download the latest IMGT/HLA database from Github, and preprocess it for SpecHLA to use.

Dependencies

Systematic requirement

SpecHLA requires conda 4.12.0+, cmake 3.16.3+, and GCC 9.4.0+ for environment construction and software installation.

Programming

python=3.8.12 or above
Python libraries: numpy=1.22.3, pulp=2.6.0, pysam=0.19.0, scipy=1.8.0, and biopython=1.79.
Perl 5 or above

Third party packages

SpecHLA enables automatic installation of these third party packages using conda. Also, we can install the softwares by ourselves, and add their locations to system path (Exception: put bcftools, fermikit, and novoalign to bin/ folder). After applying for the authority of Novoalign, please put the novoalign.lic (License file) in the bin/ folder.

Third party packages:
- SamTools-1.3.1
- bamutil Version: 1.0.14
- BWA-0.7.17-r1188
- blast 2.12.0
- BLAT v. 36x2
- bedtools-v2.26.0
- bcftools-Version: 1.9
- Freebayes-v1.2.0-2-g29c4002
- Novoalign V4.02.01
- ScanIndel v1.3
- Fermikit-0.13
- SpecHap v1.0.1 and ExtractHAIRs
- Minimap 2.17-r941
- pbmm2-1.4.0
- pbsv Version 2.6.2
- Bowtie 2 version 2.3.4.1
- longshot-0.4.5

Citation

Wang, Shuai, Mengyao Wang, Lingxi Chen, Guangze Pan, Yanfei Wang, and Shuai Cheng Li. "SpecHLA enables full-resolution HLA typing from sequencing data." Cell Reports Methods (2023).

Getting help

Should you have any queries, please feel free to contact us, we will reply as soon as possible ([email protected]).

spechla's People

Contributors

Stargazers

Watchers

Forkers

wangjingwan wangmengyao758 tzhang-nmdp kelly-st-hri sophiajinyalan mamba-yu liuyongming0612 zhouqian177 redmar-van-den-berg

spechla's Issues

multithreading does not work

I have followed exactly the steps mentioned for installation and using. Now, I am doing HLA typing on WES data with -u 1 and -j 12. but SpecHLA only uses one core. how can I do it in parallel mode ?

Feature request

Would it be possible to add an option to specify the HLA database folder to use, rather than using the hardcoded 'db' option?

Cwd.c: loadable library and perl binaries are mismatched

Hello!

I'm trying to use your tool but I am getting the following error:

Cwd.c: loadable library and perl binaries are mismatched (got handshake key 0xde00080, needed 0xed00080)

If I am not mistaken, this error means the database is not properly indexed or the packages are not correctly installed. Right?

I have run the following commands to download and install:

git clone https://github.com/deepomicslab/SpecHLA.git --depth 1
cd SpecHLA/
conda env create --prefix=./spechla_env -f environment.yml
conda activate ./spechla_env
chmod +x -R bin/*
unset LD_LIBRARY_PATH && unset LIBRARY_PATH 
bash index. sh

Do I have to set a specific path for both LD_LIBRARY_PATH and LIBRARY_PATH? If so, which paths should I use?
Maybe it is another problem?

Thanks for the help!

Diploid asm

Can the SpecHLA work with diploid assembly as HLA*ASM? Possibly, long-read option works for that? Thanks!

installation on guix

Hi,
it would be great if this package could be compatible with Guix, as this can help to make the work more reproducible.. are you planning this? because we use Guix, it would take too much effort to try to contain the package and its dependencies in the conda environment. Maybe I am missing something :(

ExtractHLAread.sh syntax errors.

IF/ELSE syntax is not compatible with Bash on CentOS 7.

bash --version
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

bash script/ExtractHLAread.sh -s NA12878 -b SRE3-230623_gencode.sorted.bam -r hg38 -o /home/user/outs
mkdir: missing operand
Try 'mkdir --help' for more information.
script/ExtractHLAread.sh: line 68: [: ==: unary operator expected
script/ExtractHLAread.sh: line 100: [: ==: unary operator expected
>>>>>>>>>>>>>> indexing extracted bam
[E::hts_open_format] fail to open file '/NA12878.tmp.extract.bam'
samtools index: failed to open "/NA12878.tmp.extract.bam": No such file or directory
>>>>>>>>>>>>>> bamUtil fastq extraction
$bamUtil fastq extraction Failed!

Resolves here: https://stackoverflow.com/questions/13617843/unary-operator-expected-error-in-bash-if-condition
https://stackoverflow.com/a/13618376

Null results for certain HLA loci

Hi Shuai,

If I have many WES samples with no results for some HLA-loci using the default parameters for WES.

For example in this tumour - normal matched sample:
tumour:

# version: IPD-IMGT/HLA 3.38.0
Sample  HLA_A_1 HLA_A_2 HLA_B_1 HLA_B_2 HLA_C_1 HLA_C_2 HLA_DPA1_1      HLA_DPA1_2      HLA_DPB1_1      HLA_DPB1_2      HLA_DQA1_1      HLA_DQA1_2      HLA_DQB1_1  HLA_DQB1_2      HLA_DRB1_1      HLA_DRB1_2
tumour        A*24:02:01:01   A*24:02:01:01   -       -       -       -       -       -       -       -       -       -       -       -       -       -

normal:

# version: IPD-IMGT/HLA 3.38.0
Sample  HLA_A_1 HLA_A_2 HLA_B_1 HLA_B_2 HLA_C_1 HLA_C_2 HLA_DPA1_1      HLA_DPA1_2      HLA_DPB1_1      HLA_DPB1_2      HLA_DQA1_1      HLA_DQA1_2      HLA_DQB1_1  HLA_DQB1_2      HLA_DRB1_1      HLA_DRB1_2
normal      A*24:02:01:01   A*24:02:01:01   -       -       -       -       -       -       -       -       -       -       -       -       -       -

Would you just interpret the lack of results for the MHC I genes to represent a lack of confidence in that locus? Or would you interpret this as a loss of MHC I? The logs don't seem to indicate anything particularly wrong.

DRB3 and DRB5

Hi, can SpecHLA type DRB3 and DRB5 as kourami? Thanks

Novel alleles

if SpecHLA finds novel HLA alleles, will the novel alleles be labeled by "novel" in hla.result.txt file? Thanks

For example, assuming that HLA_A_2 is a novel HLA allele here

version: IPD-IMGT/HLA 3.38.0

Sample HLA_A_1 HLA_A_2
test A*23:01:01:01 novel

MICA and MICB

Hello,
Can the current software type MICA and MICB genes?

what's the meanning of "interval_dict" in long_read_typing.py ?

Hello , professor,
For "A":"HLA_A:1000-4503" from the below code line , what's the meaning for these two number "1000" and "4503" ?
and how to get these coordinate number values ,like "C:1000-5304", "DPA1:1000-10775" ?
'''
interval_dict = {"A":"HLA_A:1000-4503", "B":"HLA_B:1000-5081","C": "HLA_C:1000-5304","DPA1":"HLA_DPA1:1000-10775",
"DPB1":"HLA_DPB1:1000-12468","DQA1":"HLA_DQA1:1000-7492","DQB1":"HLA_DQB1:1000-8480","DRB1":"HLA_DRB1:1000-12229" }
'''

Expecting your reply !

thank you .

Performace among Novoalign V4, V3 and Bowtie2 in read-binning step

Hi,
Since it takes time to request Novoalign V4 or money (I requested the license one week ago, but no reply), have you test Novoalign V3 which doesn't need license for the use of research. What's the performance among V4, V3 and Bowtie2? A naive question.

thank you!
Long

Question about minimap2 parameters in long_read_typing.py

Hi! I'm running long_read_typing.py for PacBio -reads for which I have extracted the HLA-regions (originally aligned data to chm13, then extracted hla with HLA*LA and ran bam2fastq). I was wondering about the alignment, as the same script is used also for typing Nanopore reads and there is only one set of minimap2 parameters:
minimap2 -t {parameter.threads} -p 0.1 -N 100000 -a $ref $fq

And wondered why those recommended for pacbio aren't used? E.g:
minimap2 -ax map-pb ref.fa pacbio-reads.fq
or
-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200

Should instead use the script SpecHLA.sh or am I correct in only running the long_read_typing.py at this point?

Many thanks!

full-length or exon typing

SpecHLA publication suggests that full-length typing outperforms exon typing. I am wondering if the reconstructed gene sequences/full-length (-u 0) are better than the reconstructed exon sequences (-u 1). Do the reads from noncoding regions (introns) improve phasing? Thanks

how to assign read to gene ?

hello,

for the below codes , what's the reason for doing this assignment ?

'''
elif gene_score[0][0] == 'DPB2' and gene_score[1][0] == "DPA1":
assigned_locus = ["DPA1"]
'''

Expecting your reply !

Best

Index command fails but reports success

On my system, bash index.sh reports success even when some steps in the script fail. This causes specHLA to crash later on when certain required tools are missing.

index.sh: line 49: cmake: command not found
make: *** No targets specified and no makefile found.  Stop.
mkdir: cannot create directory ‘SpecHLA/bin/extractHairs/build’: File exists
index.sh: line 55: cmake: command not found
make: *** No targets specified and no makefile found.  Stop.
 
 
 
The installation is finished! Please start use SpecHLA.

HLA typing with long-read data

Hello，
I looked at the documentation and I was wondering when performing full-resolution HLA typing with paired-end reads and Nanopore reads or with Nanopore data alone, do I still use the script/ExtractHLAread.sh to extract HLA-related reads from long-read data?
Thanks for SpecHLA and for your time.

ExtractHLAread.sh

Hi,
When I apply the ExtractHLAread.sh script, the unpaired reads produced are far more than the paired reads. After researching, I saw that some people say the bam file should be sorted by read name before converting to fastq, while in ExtractHLAread.sh, it is sorted by coordinates. However, when I switched to sorting by read name, although it reduced a lot of unpaired reads, running SpecHLA.sh afterwards generated a large number of “-” results. This is very confusing, do you know what might be the situation?

Errors when running SpecHLA.sh

Hi there, I got an error below (SpecHLA v1.0.2). Could you please give me some suggestions? Thanks

Attention: please ensure the platform can run gzip -l automatically, otherwise, it may not continue.
less: /home/src/SpecHLA/script/whole/../../spechla_env/lib/libtinfo.so.5: version NCURSES_TINFO_5.0.19991023' not found (required by less) BAM and VCF are ready. Mean depth {'HLA_A': 30, 'HLA_B': 29, 'HLA_C': 29, 'HLA_DPA1': 37, 'HLA_DPB1': 48, 'HLA_DQA1': 29, 'HLA_DQB1': 30, 'HLA_DRB1': 29} Minimum Minor Allele Frequency is 0.05. Traceback (most recent call last): File "/home/src/SpecHLA/script/whole/../phase_variants.py", line 1866, in <module> snp_list,beta_set,snp_index_dict = read_vcf(vcffile,outdir,snp_dp,bamfile,indel_len,gene,\ File "/home/src/SpecHLA/script/whole/../phase_variants.py", line 132, in read_vcf in_vcf = VariantFile(vcffile) # convert vcf to double-allele vcf File "pysam/libcbcf.pyx", line 4054, in pysam.libcbcf.VariantFile.__init__ File "pysam/libcbcf.pyx", line 4284, in pysam.libcbcf.VariantFile.open ValueError: invalid file b'./bam3_hla/test123/test123.realign.filter.vcf' (mode=b'r'`) - is it VCF/BCF format?

The test123.realign.filter.vcf does not have any header and the first line is

HLA_A 1146 . C T 224.05 PASS AB=0.846154;ABP=16.5402;AC=2;AF=0.666667;AN=3;AO=11;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=7.94546;EPPR=7.35324;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=1.89309;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=330;QR=60;RO=2;RPL=2;RPP=12.6832;RPPR=7.35324;RPR=9;RUN=1;SAF=8;SAP=7.94546;SAR=3;SRF=2;SRP=7.35324;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA 0/1/1:13:2,11:2:60:11:330

Identify somatic HLA mutation from SpecHLA

Hi,
Thank you for developing this tool which is quite easy to use!
I am wondering if SpecHLA could also be used to detect somatic HLA mutations since we have paired tumor and normal WES or RNAseq data. Also I found some HLA_XXX.vcf.gz/HLA_XXX.rephase.vcf.gz/HLA_XXX.specHap.phased.vcf in the output. I don't know if SpecHLA already did this things. And maybe I just need to compare the results from tumor with that from normal to get the distinct varaints?
Could you give some comments about this and guide us how to identify the somatic (or germline if possible) HLA mutations using SpecHLA if it works?
thank you very much!
Long

Publication for SpecHLA

HI, I come across your repository. Could you please provide your publication related to this HLA typing software?

Typing long reads

Hello
Help me with long read typing
I tried_
bash script/ long_read_typing.py -j 5 -n HLAs1 -r HLAs1. fastq -o output_

Question about ploidy for LOH

Hi SpecHLA developers,

Your tool has been incredibly easy to set up and use so far. I just wanted a clarification on your cal.hla.copy.pl function.

In your description you had --ploidy as "the ploidy of tumor sample in HLA gene region". What would you put if you detected focal CNV deletions or gains at different HLA regions here?

NGS Whole genome pipeline hang at run.assembly.realign.sh

Hi, I am trying out SpecHLA using example data.
I have installed and indexed the tool according with some modification, because the shared library is missing when using bowtie2 from bin directory => I installed bowtie2 by conda, and changed the path in index.sh

(the error when using direct bowtie2 SpecHLA/bin/bowtie2/bowtie2-align-s: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory)

When I used the script: bash SpecHLA.sh -n test_wgs -1 HG00118_1.filter.fastq.gz -2 HG00118_2.filter.fastq.gz -p Caucasian -o output
it hanged at run.assembly.realign.sh. I checked the log file (test_wgs.local_assem.log), the last 5 lines are:

SpecHLA/script/../bin/fermikit/fermi.kit/bfc -1s 1k -k 10 -t 5 output/test_wgs/prefix2.ec.fq.gz 2> output/test_wgs/prefix2.flt.fq.gz.log | gzip -1 > output/test_wgs/prefix2.flt.fq.gz
SpecHLA/script/../bin/fermikit/fermi.kit/ropebwt2 -dNCr output/test_wgs/prefix2.flt.fq.gz > output/test_wgs/prefix2.flt.fmd 2> output/test_wgs/prefix2.flt.fmd.log
SpecHLA/script/../bin/fermikit/fermi.kit/fermi2 assemble -l 40 -m 53 -t 5 output/test_wgs/prefix2.flt.fmd 2> output/test_wgs/prefix2.pre.gz.log | gzip -1 > output/test_wgs/prefix2.pre.gz
SpecHLA/script/../bin/fermikit/fermi.kit/fermi2 simplify -CSo 45 -m 53 -T 40 output/test_wgs/prefix2.pre.gz 2> output/test_wgs/prefix2.mag.gz.log | gzip -1 > output/test_wgs/prefix2.mag.gz
"output/test_wgs/prefix2.mag.gz" may be a binary file.  See it anyway?

the file output/test_wgs/prefix2.mag.gz is empty

Could you please help me debug this issue please?

BWA or bowtie2

Hi there, is it possible to use BWA-mem as an aligner? Thanks

Novoalign

Hi there, do you have a plan to replace commercial Novoalign with a non-commercial aligner? This publication (https://link.springer.com/article/10.1007/s00521-021-06188-z) compared 17 aligners where the SEGEMEHL seems to be a competitive one. Thanks.

Low confidence result for all HLA genes

Hello,

When using SpecHLA.sh on paired-end RNA data, I get no results for any HLA alleles. The hla.result.txt file is as follows:

# version: IPD-IMGT/HLA 3.38.0
Sample	HLA_A_1	HLA_A_2	HLA_B_1	HLA_B_2	HLA_C_1	HLA_C_2	HLA_DPA1_1	HLA_DPA1_2	HLA_DPB1_1	HLA_DPB1_2	HLA_DQA1_1	HLA_DQA1_2	HLA_DQB1_1	HLA_DQB1_2	HLA_DRB1_1	HLA_DRB1_2
sample1	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-

All hla.allele.*.HLA-*.fasta files contain almost exclusively N's, which explains why the sequence confidence is too low to map the HLA alleles. However, the input fastq.gz files do give a result in other HLA typing tools. What could be causing the empty results?

[help] simulate TGS

Hello：

In oder to test the performence of several softwares, how to simulate TGS data (Pacbio) for HLA typing test.

As I know, the pbsim may add some error in simulated reads, and it will affect the true result of hla typing.

So， how to simulate NGS or TGS data for HLA typing test for comparing softwares ?

Thank you !

The best!

no result for SpecHLA_single

I use the SpecHLA_single (Pacbio CCS reads) to do HLA type, but no result ouput . If the coverage of data was two slow ?

Couldn't run a test on example files of exon. I saw this error related to libtinfo.so.5: version `NCURSES_TINFO_5.0.19991023'

Hello,

I installed this tool following your instruction on GitHub and ran the following command for samples in the example folder.
bash /home/ibiolab/SpecHLA/script/whole/SpecHLA.sh -n NA06985FINAL -1 /home/ibiolab/SpecHLA/example/exon/NA06985_1.filter.fastq.gz -2 /home/ibiolab/SpecHLA/example/exon/NA06985_2.filter.fastq.gz -o /home/ibiolab/SpecHLA/example/exon/output -u 1

However, an error below has shown up, and no results for HLA typing.

An error

"Attention: please ensure the platform can run gzip -l automatically, otherwise, it may not continue.
less: /home/ibiolab/SpecHLA/script/whole/../../spechla_env/lib/libtinfo.so.5: version `NCURSES_TINFO_5.0.19991023' not found (required by less)
BAM and VCF are ready."

Results

version: IPD-IMGT/HLA 3.38.0

Sample HLA_A_1 HLA_A_2 HLA_B_1 HLA_B_2 HLA_C_1 HLA_C_2 HLA_DPA1_1 HLA_DPA1_2 HLA_DPB1_1 HLA_DPB1_2 HLA_DQA1_1 HLA_DQA1_2 HLA_DQB1_1 HLA_DQB1_2 HLA_DRB1_1 HLA_DRB1_2
NA06985NEW - - - - - - - - - - - - - - - -

/home/ibiolab/SpecHLA/spechla_env/lib

is my lib directorythat contains libtinfo.so.5 file.

What should I do to figure out this issue?

Best,
Jay

Possibility to Update and Display IMGT/HLA Database Version in Output

Hello!

Congratulations on the effort to produce this tool. Would it be possible to update the IMGT/HLA database to the most recent version, or if the software already does this, could you include the database version used for HLA typing in the output?

Thank you.

HLA typing only based on long-read RNA-seq data?

Hi, according to the paper, it says it support RNA-seq data and long-read sequencing data. I am wondering whether it can do HLA typing only based on long-read RNA-seq data (Oxford Nanopore data)?

how to get hla_gen.format.filter.extend.DRB.no26789.fasta ?

Hello,
could you tell us how to get this ref file , hla_gen.format.filter.extend.DRB.no26789.fasta ?
I checked this fasta , for example , it is only 2-field fasta sequence . so how to get 2-filed allele sequence from IMGT ?
I find IMGT is full resolution sequence database ( 4-field).
Expecting your repley !

thank you !

Install probelm

Hi there, I got an error during the installation as below

/home/src/SpecHLA/bin/SpecHap/hic_util.h:33:43: required from here /usr/include/c++/5/ext/new_allocator.h:120:4: error: no matching function for call to ‘std::pair<const unsigned int, HiCLinker>::pair(unsigned int&, Fragment&)’

Could you please give me some suggestions? Thanks

use deep-variants calling too to replace longshot ?

hey,
For nanopore sequencing data , I find deepvariant can also do snvs calling , so do you have a plan to use deepvariant tool to replace longshot ?
or longshot is better than deepvariant based your testing ?

Expecting your reply !

deepomicslab / spechla Goto Github PK

spechla's Introduction

SpecHLA: full-resolution HLA typing from sequencing data

Install

Test

Basic Usage

Main functions

Extract HLA-related reads

HLA Typing

HLA typing with long-read data alone

HLA Typing from diploid assemblies

Pedigree phasing

Detect HLA LOH in tumor samples

Interpret output

Update to latest IMGT/HLA database

Dependencies

Systematic requirement

Programming

Third party packages

Citation

Getting help

spechla's People

Contributors

Stargazers

Watchers

Forkers

spechla's Issues

version: IPD-IMGT/HLA 3.38.0

version: IPD-IMGT/HLA 3.38.0

Recommend Projects

Recommend Topics

Recommend Org