jradrion / teflon Goto Github PK

TEFLoN uses paired-end Illumina sequence data to discover and genotype transposable elements present in your samples.

Python 98.33% Shell 1.67%

teflon's Issues

Error when I try to run sample_pipeline.sh

Advice on parsing results and getting allele frequencies

Hi Jeffrey,

Thanks for making TEFLoN available. I've run the pipeline on a group of 12 individuals, and now I'm working on parsing and understanding the results. I'm trying to follow a set of principles that are relatively similar to what you and co-authors did in the 2017 GBE paper.

To call a "present" genotype for a TE insertion, I'm currently requiring:

- 3 or more "presence" reads (column 10) in that sample
- A ratio of "presence" reads to all reads (column 13) of 0.75 or greater in that sample

To call an "absent" genotype for a TE insertion, I'm requiring:

- 3 or more "absence" reads (column 11) in that sample
- 1 or fewer "presence" reads (column 10) in that sample
- A ratio of "presence" reads to all reads (column 13) of 0.25 or smaller in that sample

Does the above seem sensible?
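For concreteness, the rules above could be written down roughly as follows. This is only a sketch of the thresholds as described in this post, not anything taken from TEFLoN itself; the function names are made up, and it assumes the genotype file is tab-separated with integer read counts in columns 10 and 11 and the ratio (or -9) in column 13.

```python
# Hypothetical helpers: thresholds mirror the rules listed in this post.
def call_genotype(presence_reads, absence_reads, presence_ratio):
    """Return 'present', 'absent', or 'ambiguous' for one sample at one TE site."""
    # A negative ratio (-9 in the post) marks a site with no usable call.
    if presence_ratio < 0:
        return "ambiguous"
    if presence_reads >= 3 and presence_ratio >= 0.75:
        return "present"
    if absence_reads >= 3 and presence_reads <= 1 and presence_ratio <= 0.25:
        return "absent"
    return "ambiguous"


def call_from_line(line):
    """Parse one tab-separated line, reading 1-based columns 10, 11 and 13."""
    fields = line.rstrip("\n").split("\t")
    return call_genotype(int(fields[9]), int(fields[10]), float(fields[12]))
```

Anything that passes neither rule set falls through to "ambiguous", so intermediate ratios are never forced into a call.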

Also, ideally I would like to calculate the frequencies of particular TE insertion alleles across all 12 of my samples, but for most TE insertions, I tend to have an ambiguous genotype call (column 13 = -9) in one or more samples, which obviously makes calculating the frequency more complicated.

My current thought is to set a threshold for minimum # of samples with an unambiguous genotype call, and then estimate the allele frequency for each TE insertion using the samples that have a called genotype for that TE insertion.
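A minimal sketch of that idea, assuming each sample has already been reduced to a single "present", "absent", or "ambiguous" call, and that each sample counts as one observation (a simplification for diploid or pooled data). The function name and the default of 8 called samples are placeholders, not recommendations.

```python
def insertion_frequency(calls, min_called=8):
    """Estimate insertion frequency from per-sample calls.

    calls: list of 'present' / 'absent' / 'ambiguous' strings, one per sample.
    Returns None when fewer than min_called samples have an unambiguous call.
    """
    called = [c for c in calls if c in ("present", "absent")]
    if len(called) < min_called:
        return None
    # Frequency is computed over called samples only; ambiguous ones are dropped.
    return called.count("present") / len(called)
```

Returning None rather than a number keeps under-sampled insertions from silently entering downstream summaries.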

Does an approach like this seem reasonable to you?
Thanks for any advice!
Dave

New error

Hi, when I run teflon.v0.4.py, I get the following error:

I am new to TEFLoN. How can I solve it?

Interpreting the breakpoint coordinates

Hi @jradrion

I'm currently working on revamping and adding new TE detection methods to the McClintock pipeline and am interested in integrating TEFLoN. I just have a few questions about interpreting the output. Specifically, I want to make sure I am interpreting the breakpoint coordinates correctly.

- Are the coordinates 0-based or 1-based? I ask because TEFLoN can take a BED6 file as input, which uses 0-based coordinates, and I'm not sure whether this convention is maintained in the output genotype file.
- Do the breakpoint positions indicate the final position in the reference genome before you see evidence of an insertion, or the position where you see the transition from reference to insertion?
  - For example: does a 5' breakpoint at position 10 mean you see split-read mapping like 5 6 7 8 9 10 TE TE TE TE, or 5 6 7 8 9 TE TE TE TE? My assumption is the former, but I want to make sure.
- Am I correct in interpreting a prediction with a 5' breakpoint larger than the 3' breakpoint as a non-reference insertion with a TSD?
  - For example: does a prediction with a 5' breakpoint at 92 and a 3' breakpoint at 89 indicate a non-reference insertion with a TSD of length 4?

# genotype prediction
chrXVI  92  89  TY3     TY3    ....

# my interpretation
5' Breakpoint Evidence: 84 85 86 87 88 89 90 91 92 -- -- -- -- -- --
3' Breakpoint Evidence: -- -- -- -- -- 89 90 91 92 93 94 95 96 97 98

Ultimately, my goal is to convert the predictions from TEFLoN's genotype file to a BED format that contains the interval for non-reference insertion TSDs so I want to make sure I am interpreting the positions correctly so I'm not inadvertently making the predictions worse.
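To make the intended conversion concrete, here is a rough sketch under my interpretation above (the 5' breakpoint is the last reference base before the insertion, the 3' breakpoint is the first reference base after it, and any overlap is the TSD). Whether the genotype file is 0-based or 1-based is exactly the open question, so it is left as a parameter; the function name is made up.

```python
def tsd_to_bed(chrom, bp5, bp3, one_based=True):
    """Convert a breakpoint pair into a 0-based, half-open BED interval.

    Assumes both breakpoints are inclusive reference positions, so the TSD
    spans bp3..bp5 whenever bp5 >= bp3. Returns None when the breakpoints
    do not overlap (no TSD under this interpretation).
    """
    if bp5 < bp3:
        return None
    start, end = bp3, bp5  # inclusive span shared by both breakpoints
    if one_based:
        start -= 1  # 1-based inclusive -> 0-based start, end unchanged
    else:
        end += 1    # 0-based inclusive -> half-open end
    return (chrom, start, end)
```

For the chrXVI example, 1-based input would give chrXVI 88 92 and 0-based input would give chrXVI 89 93; both have length 4, so only the offset depends on the answer to the first question.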

Thanks in advance for any help you can provide.

Best,

Preston

Killed

Hi,

I have attempted to run TEFLoN both locally and on an HPC. However, in both settings, the job is killed before teflon_prep has completed. When I submit it as a job, even 150 GB of memory is not enough to complete it. Am I missing something?

Thank you!

IUPAC code in TE database

Hi,

I am using teflon_prep_custom.py to annotate the reference genome. In the TE database (-l) I am using, I found letters other than A, T, G, and C in the FASTA file. I assume they are IUPAC codes for ambiguous bases.
I am wondering whether TEFLoN works with IUPAC codes. So far, it has not raised any errors or warnings when I run teflon_prep_custom.py with IUPAC codes in the database.
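For anyone wanting to see how many such characters a TE database contains before deciding what to do about them, a small check like the one below may help. It says nothing about how TEFLoN or the tools it calls treat these bases; the function name is hypothetical and it simply counts every sequence character other than A, C, G, and T.

```python
from collections import Counter


def count_non_acgt(fasta_lines):
    """Count sequence characters other than A/C/G/T in an iterable of FASTA lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        for base in line.strip().upper():
            if base not in "ACGT":
                counts[base] += 1
    return counts
```

It can be run over an open file handle, for example `with open(path) as fh: print(count_non_acgt(fh))`, where `path` points to the TE database FASTA.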

Thanks,

Index BAM with BWA or Samtools?

Hi @jradrion

We are trying your pipeline for TE discovery in potato (tetraploid). However, we have run into a few issues. Here is one.

Should we index an alignment with bwa index or samtools index?

TEFLoN/teflon.v0.4.py

Lines 352 to 355 in f6a4dbc

 # index new reduced alignment 1
 cmd="%s index %s" %(exeBWA,bamFILE)
 print "cmd:", cmd
 os.system(cmd)

TEFLoN/teflon.v0.4.py

Lines 389 to 392 in f6a4dbc

 # index new reduced alignment 2
 cmd="%s index %s" %(exeBWA,bamFILE)
 print "cmd:", cmd
 os.system(cmd)

AFAIK it is samtools, but trying it with BWA does not return an error. However, the result does not make sense. :-)
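For reference, `bwa index` builds an index for a FASTA reference, whereas `samtools index` is the command that indexes a coordinate-sorted BAM. A sketch of what the indexing step might look like with samtools is below. This is not a patch from the maintainer: the function and argument names are hypothetical, and it is written in Python 3 style with subprocess rather than the Python 2 `os.system` call quoted above.

```python
import subprocess


def build_index_cmd(exe_samtools, bam_file):
    """Build the argument list for indexing a sorted BAM with samtools."""
    return [exe_samtools, "index", bam_file]


def index_bam(exe_samtools, bam_file):
    """Run 'samtools index' and raise CalledProcessError if it fails."""
    cmd = build_index_cmd(exe_samtools, bam_file)
    print("cmd:", " ".join(cmd))
    subprocess.check_call(cmd)
```

Using `check_call` instead of `os.system` means a failed indexing step stops the run rather than passing silently.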

The genotype folder is empty.

Hi @jradrion

I tried your pipeline on one human sample (40x coverage), but for some reason the genotype folder is empty, and all the files in countPos/ are empty too.

Here are my commands:

python teflon/TEFLoN/teflon_prep_annotation.py -wd output/teflon/ -a ann/hg38/TE_teflon.sorted.bed -t ann/hg38/TE_hierarchy.sorted.txt\
 -g /home/genomes/Homo_sapiens/hg38/hg38.fa\
 -p all_te 2> log.teflon

bwa index output/teflon/all_te.prep_MP/all_te.mappingRef.fa

bwa mem -t 36 -Y output/teflon/all_te.prep_MP/all_te.mappingRef.fa input/fastq/JSR_N_R1.fastq input/fastq/JSR_N_R2.fastq > /home/JSR_N.sam

python teflon/TEFLoN/teflon.v0.4.py -d output/teflon/all_te.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools\
 -l1 family -l2 family -q 10 -t 36 -sd 820

**-sd was calculated by teflon.v0.4.py**

python teflon/TEFLoN/teflon_collapse.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -es /home/tools/bin/samtools -n1 1 -n2 1 -q 20 -t 10

python teflon/TEFLoN/teflon_count.py  -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools\
 -l2 family -q 20 -t 12

python teflon/TEFLoN/teflon_genotype.py  -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -dt diploid

Input files (first lines only):

hierarchyfile
id family order
chr10100000951100001262AluSg ALU nltr
chr10100001399100001712AluJb ALU nltr
chr10100002786100003067L1M5 LINE1 nltr
chr10100003067100003374AluSc8 ALU nltr
chr10100003374100003422L1M5 LINE1 nltr
chr10100003580100003705AluJb ALU nltr
chr10100003746100003836AluJb ALU nltr
chr10100003949100004255AluJb ALU nltr
chr10100004299100004632L1M5 LINE1 nltr

TE_BED
chr1 11504 11675 chr11150411675L1MC5a 484 -
chr1 26790 27053 chr12679027053AluSp 2070 +
chr1 29901 30198 chr12990130198L1MB3 1323 +
chr1 31435 31733 chr13143531733AluJo 2059 +
chr1 33047 33456 chr13304733456L1MB5 2058 +
chr1 33465 33509 chr13346533509Alu 233 +
chr1 33528 34041 chr13352834041L1PA6 4051 -
chr1 34047 34108 chr13404734108L1P1 456 +
chr1 35366 35499 chr13536635499AluJr 1000 +
chr1 39623 39924 chr13962339924AluSx 2292 +

The uniqid is chr+start+end_subfamily
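One sanity check that might be worth running on inputs like these is whether every ID in column 4 of the BED file also appears in the first column of the hierarchy file. The sketch below assumes those two columns are meant to match, as the excerpts suggest; a mismatch is not known to be the cause of the empty output, and the function name is made up.

```python
def ids_missing_from_hierarchy(bed_lines, hierarchy_lines):
    """Return BED column-4 IDs that never appear in hierarchy column 1."""
    hierarchy_ids = set()
    for line in hierarchy_lines:
        fields = line.split()
        if fields:
            hierarchy_ids.add(fields[0])  # header label is harmless here
    missing = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3] not in hierarchy_ids:
            missing.append(fields[3])
    return missing
```

An empty list means the two files agree on IDs, which at least rules out one possible source of trouble.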

Do you have any idea how to solve this problem?
Thanks,
Rafael

No reference genome in prefix.prep_MP, but multiple in prep_RM

Hello,

When running the TEFLoN preparation script teflon_prep_custom.py multiple times, there is no reference genome in the MP folder, but there are multiple reference genome files in the RM folder: .fa.align, .fa.cat.gz, .fa.masked, .fa.tbl, and .fa.out.

Therefore, I was wondering if this was a problem with the tool itself, or whether one of these versions can actually be provided as a genome for the main TEFLoN script.

P.S. The hierarchy and BED files are there and look complete.

teflon.v0.4.py IndexError: list index out of range

error after collapsing samples

Hello!
I wanted to test TEFLoN on some of my sequencing data.
After the samples finish collapsing, I get the following error:

Traceback (most recent call last):
  File "/software/TEFLoN/TEFLoN_env/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
error concatenating positions

This is the command I'm using:

python /software/TEFLoN/teflon_collapse.py -wd /project/TEFLoN -d /project/TEFLoN/ref.prep_TF/ -s /project/TEFLoN/samples.txt -es /software/BIN/samtools -n1 5 -n2 5 -q 15 -t 19

I'm only a beginner at Python, so I'm not really sure what to do now.
It would be great if you could help me out!

Kind regards,
Lisa

teflon_genotype.py fails

malformed mega_clustered.bed

Hi @jradrion

I've run into an issue with one of my samples while running TEFLoN. Here are the commands I used to prep annotations, map reads, and run TEFLoN:

conda activate teflon
outdir=/scratch/pjb68507/test/teflon
cd $outdir
mkdir teflon_results
cores=32

python src/teflon/teflon_prep_annotation.py \
    -wd teflon_results/ \
    -a data/reference_te.bed \
    -t data/teflon_taxonomy.tsv \
    -f data/consensusTEs.fasta \
    -g data/dm6.fasta \
    -p teflon

bwa index teflon_results/teflon.prep_MP/teflon.mappingRef.fa

bwa mem \
    -t $cores \
    -Y teflon_results/teflon.prep_MP/teflon.mappingRef.fa \
    data/M_gor_S7_1.fq data/M_gor_S7_2.fq > mapped_reads.sam

samtools view -@ $cores -Sb mapped_reads.sam > mapped_reads.bam

samtools sort -@ $cores -o mapped_reads.sorted.bam mapped_reads.bam

samtools index mapped_reads.sorted.bam

echo -e "$outdir/mapped_reads.sorted.bam\tsample" > samples.tsv

python src/teflon/teflon.v0.4.py \
    -wd teflon_results/ \
    -d teflon_results/teflon.prep_TF/ \
    -s samples.tsv \
    -i sample \
    -l1 family \
    -l2 family \
    -t $cores \
    -q 20

I ran everything within a conda environment that can be recreated with this yaml:

name: teflon
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=2.7
  - samtools=1.3
  - bwa=0.7.17
  - repeatmasker=4.1.0
  - bedtools=2.17.0
  - gawk=5

While executing the teflon.v0.4.py step, I get an error from samtools:

cmd: samtools view -@ 32 -q 20 -L /scratch/pjb68507/test/teflon/teflon_results/sample.bed_files/mega_clustered.bed /scratch/pjb68507/test/teflon/mapped_reads.sorted.bam -b
Error running samtools: p.returncode = 1

This is caused by the mega_clustered.bed being malformed and unreadable by samtools.

$ awk -F"\t" 'NF!=3{print $0}' teflon_results/sample.bed_files/mega_clustered.bed
chrX    13128878        13117.6 287     2438
0645341
chr3L   6111493S18      1571    1718
42325   13542472

Most of the mega_clustered.bed file is fine, but it looks like some of the lines are being merged together or truncated, resulting in entries with more/less than three columns.
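A slightly stricter version of the awk check above, in case it helps anyone triage their own runs: it flags lines that do not have exactly three tab-separated fields, and also lines whose start or end is not a plain integer. The integer check goes beyond the awk one-liner, and the function name is hypothetical.

```python
def malformed_bed_lines(lines):
    """Return (line_number, text) for lines that are not valid 3-column BED."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        ok = (
            len(fields) == 3
            and fields[1].isdigit()
            and fields[2].isdigit()
            and int(fields[1]) <= int(fields[2])
        )
        if not ok:
            bad.append((number, text))
    return bad
```

Run over an open file handle, it reports the line numbers, which makes it easier to see whether the damaged entries cluster together in the file.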
I've also received a similar error report on a McClintock issue when @zhjpeng ran TEFLoN through McClintock here: bergmanlab/mcclintock#76 (comment)
Others in my lab and I have run TEFLoN through McClintock many times without issue but it seems that this particular sample (and @zhjpeng 's sample) is triggering the error, and I'm able to replicate it on repeated runs.
I am running this with unpublished reads, but I'd be happy to provide them, or any of the other input or log files, if necessary to work out this bug.

Thanks for the help,

Preston

Standard deviation warning

Hi, thanks for making TEFLoN, I'm hoping it'll be very useful for my project.

I have a set of libraries with an insert size standard deviation consistently over 100. These are 150 bp paired-end reads generated on a NovaSeq 6000. When running teflon.v0.4.py, I get the following warning:

Insert size standard deviation estimated as 140. Use the override option if you suspect this is incorrect!
!!! Warning: insert size standard deviation reported as 140 !!!
Please ensure this is correct and use the override option!

I can get it to run using the suggested -sd option, but I was wondering why TEFLoN warns about this scenario and whether I should expect any impact on the results. Does a larger insert size SD affect the ability to identify TE insertions/deletions?
