jradrion / teflon
TEFLoN uses paired-end illumina sequence data to discover and genotype transposable elements present in your samples.
Hi Jeffrey,
Thanks for making TEFLoN available. I've run the pipeline on a group of 12 individuals, and now I'm working on parsing and understanding the results. I'm trying to follow a set of principles that are relatively similar to what you and co-authors did in the 2017 GBE paper.
To call a "present" genotype for a TE insertion, I'm currently requiring:
To call an "absent" genotype for a TE insertion, I'm requiring:
- 3 or more "absence" reads (column 11) in that sample
Does the above seem sensible?
Also, ideally I would like to calculate the frequencies of particular TE insertion alleles across all 12 of my samples, but for most TE insertions I have an ambiguous genotype call (column 13 = -9) in one or more samples, which obviously makes calculating the frequency more complicated.
My current thought is to set a threshold for minimum # of samples with an unambiguous genotype call, and then estimate the allele frequency for each TE insertion using the samples that have a called genotype for that TE insertion.
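A minimal sketch of this thresholding idea follows. It assumes each sample's call has already been reduced to a number between 0 and 1 (0 = absent, 1 = present), with -9 marking an ambiguous call as in column 13. The function name `allele_frequency` and the `min_called` parameter are hypothetical, not part of TEFLoN.

```python
AMBIGUOUS = -9  # code used in column 13 for an ambiguous genotype call


def allele_frequency(calls, min_called=8):
    """Estimate the insertion allele frequency from per-sample calls.

    calls      -- one value per sample: 0 (absent), 1 (present), or -9
                  (ambiguous); this coding is an assumption of the sketch
    min_called -- hypothetical threshold: minimum number of samples with
                  an unambiguous call needed to report a frequency

    Returns (frequency, n_called); frequency is None when too few
    samples have an unambiguous call.
    """
    called = [c for c in calls if c != AMBIGUOUS]
    n_called = len(called)
    if n_called < min_called:
        return None, n_called
    return sum(called) / float(n_called), n_called


# Example: 12 samples, two of them ambiguous.
calls = [1, 1, 0, -9, 1, 0, 0, -9, 1, 1, 0, 1]
print(allele_frequency(calls, min_called=8))
```

Reporting the number of called samples alongside the frequency makes it possible to filter or weight insertions later by how much data supports each estimate.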
Does an approach like this seem reasonable to you?
Thanks for any advice!
Dave
Hi @jradrion
I'm currently working on revamping and adding new TE detection methods to the McClintock pipeline and am interested in integrating TEFLoN. I just have a few questions about interpreting the output. Specifically, I want to make sure I am interpreting the breakpoint coordinates correctly.
Does a breakpoint of 10 mean you see split-read mapping like:
5 6 7 8 9 10 TE TE TE TE
or
5 6 7 8 9 TE TE TE TE
My assumption is the former but I want to make sure.
Also, does a 5' breakpoint at 92 and a 3' breakpoint at 89 indicate there is a non-reference insertion with a TSD of length 4?
# genotype prediction
chrXVI 92 89 TY3 TY3 ....
# my interpretation
5' Breakpoint Evidence: 84 85 86 87 88 89 90 91 92 -- -- -- -- -- --
3' Breakpoint Evidence: -- -- -- -- -- 89 90 91 92 93 94 95 96 97 98
Ultimately, my goal is to convert the predictions from TEFLoN's genotype file to a BED format that contains the interval for non-reference insertion TSDs, so I want to make sure I am interpreting the positions correctly and not inadvertently making the predictions worse.
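The conversion described here could be sketched as below. It rests entirely on the interpretation in this post, which is unconfirmed: breakpoints are treated as 1-based inclusive positions, and the TSD is taken to be the overlap between the 3' and 5' breakpoints. The function name `tsd_to_bed` is hypothetical.

```python
def tsd_to_bed(chrom, bp5, bp3, name="."):
    """Convert a 5'/3' breakpoint pair into a BED line for the TSD.

    Assumptions (not confirmed by TEFLoN's documentation):
      * breakpoints are 1-based, inclusive positions
      * when bp5 >= bp3, the TSD spans bp3..bp5 inclusive
    Returns None when the breakpoints do not overlap (bp5 < bp3),
    since no TSD interval is implied under this interpretation.
    """
    if bp5 < bp3:
        return None
    start = bp3 - 1  # BED start is 0-based
    end = bp5        # BED end is exclusive, so the 1-based end is unchanged
    return "\t".join([chrom, str(start), str(end), name])


# The example prediction above: chrXVI 92 89 TY3
print(tsd_to_bed("chrXVI", 92, 89, "TY3"))
```

Under these assumptions the example yields an interval of length 4 (BED start 88, end 92), matching the TSD length in the question.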
Thanks in advance for any help you can provide.
Best,
Preston
Hi,
I have attempted to run TEFLoN both locally and on an HPC. However, in both settings, the job is killed before teflon_prep has completed. When I submit it as a job, 150 GB of memory is not enough to complete it. Am I missing something?
Thank you!
Hi,
I am using teflon_prep_custom.py to annotate the reference genome. In the TE database (-l) I am using, I found some letters other than ATGC in the FASTA file. I guess they are IUPAC codes for ambiguous bases.
I am wondering if TEFLoN works with IUPAC codes. So far, it hasn't raised any errors or warnings when I run teflon_prep_custom.py with IUPAC codes.
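To see which sequences in a TE library are affected, the non-ACGT characters can be tallied per sequence. This is a minimal sketch that works on any iterable of FASTA lines; the function name `non_acgt_counts` is hypothetical, and it only reports the characters. It says nothing about how TEFLoN or RepeatMasker handle them.

```python
from collections import Counter


def non_acgt_counts(fasta_lines):
    """Count characters other than A, C, G, T for each FASTA record.

    fasta_lines -- any iterable of lines (an open file works)
    Returns a dict mapping sequence name to a Counter of non-ACGT
    characters; records with only A/C/G/T are left out.
    """
    counts = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            continue
        if name is None:
            continue  # sequence text before any header: ignore
        for base in line.upper():
            if base not in "ACGT":
                counts.setdefault(name, Counter())[base] += 1
    return counts


# Example with an in-memory library; pass an open file for real data.
library = [">TE1 example", "ACGTRY", "nnAC", ">TE2", "ACGT"]
for seq_name, tally in non_acgt_counts(library).items():
    print(seq_name, dict(tally))
```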
Thanks,
Hi @jradrion
We are trying your pipeline for TE discovery in potato (tetraploid). However, we are facing a few issues. Here is one.
Should we index an alignment with bwa index or samtools index?
Lines 352 to 355 in f6a4dbc
Lines 389 to 392 in f6a4dbc
AFAIK it is samtools, but trying it with BWA does not return an error. However, the result does not make sense. :-)
Hi @jradrion
I tried your pipeline on one human sample (coverage 40x), but for some reason the genotype folder is empty and all the files in countPos/ are empty too.
Here are my commands:
python teflon/TEFLoN/teflon_prep_annotation.py -wd output/teflon/ -a ann/hg38/TE_teflon.sorted.bed -t ann/hg38/TE_hierarchy.sorted.txt \
  -g /home/genomes/Homo_sapiens/hg38/hg38.fa \
  -p all_te 2> log.teflon
bwa index output/teflon/all_te.prep_MP/all_te.mappingRef.fa
bwa mem -t 36 -Y output/teflon/all_te.prep_MP/all_te.mappingRef.fa input/fastq/JSR_N_R1.fastq input/fastq/JSR_N_R2.fastq > /home/JSR_N.sam
python teflon/TEFLoN/teflon.v0.4.py -d output/teflon/all_te.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools \
  -l1 family -l2 family -q 10 -t 36 -sd 820
**-sd was calculated by teflon.v0.4.py**
python teflon/TEFLoN/teflon_collapse.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -es /home/tools/bin/samtools -n1 1 -n2 1 -q 20 -t 10
python teflon/TEFLoN/teflon_count.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -i JSR_N -eb /home/tools/bin/bwa -es /home/tools/bin/samtools \
  -l2 family -q 20 -t 12
python teflon/TEFLoN/teflon_genotype.py -d output/teflon/teflon.prep_TF/ -s output/teflon/sample.txt -dt diploid
Input files:
Just the head of each:
hierarchyfile
id family order
chr10100000951100001262AluSg ALU nltr
chr10100001399100001712AluJb ALU nltr
chr10100002786100003067L1M5 LINE1 nltr
chr10100003067100003374AluSc8 ALU nltr
chr10100003374100003422L1M5 LINE1 nltr
chr10100003580100003705AluJb ALU nltr
chr10100003746100003836AluJb ALU nltr
chr10100003949100004255AluJb ALU nltr
chr10100004299100004632L1M5 LINE1 nltr
TE_BED
chr1 11504 11675 chr11150411675L1MC5a 484 -
chr1 26790 27053 chr12679027053AluSp 2070 +
chr1 29901 30198 chr12990130198L1MB3 1323 +
chr1 31435 31733 chr13143531733AluJo 2059 +
chr1 33047 33456 chr13304733456L1MB5 2058 +
chr1 33465 33509 chr13346533509Alu 233 +
chr1 33528 34041 chr13352834041L1PA6 4051 -
chr1 34047 34108 chr13404734108L1P1 456 +
chr1 35366 35499 chr13536635499AluJr 1000 +
chr1 39623 39924 chr13962339924AluSx 2292 +
The uniqid is chr+start+end_subfamily
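One thing that can be checked independently of TEFLoN is whether every ID in column 4 of the BED file also appears in the first column of the hierarchy file. The sketch below assumes that these two columns are meant to match and that the hierarchy file starts with a header line whose first field is "id", as in the excerpt above. The function name `missing_hierarchy_ids` is hypothetical.

```python
def missing_hierarchy_ids(bed_lines, hierarchy_lines):
    """Return BED IDs (column 4) that are absent from the hierarchy file.

    Both arguments are iterables of whitespace-delimited lines.
    Assumption: the hierarchy file's first column holds the same IDs as
    BED column 4, and its first line is a header starting with "id".
    """
    known = set()
    for index, line in enumerate(hierarchy_lines):
        fields = line.split()
        if not fields:
            continue
        if index == 0 and fields[0] == "id":
            continue  # skip the header row
        known.add(fields[0])

    missing = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3] not in known:
            missing.append(fields[3])
    return missing


hierarchy = [
    "id family order",
    "chr12679027053AluSp ALU nltr",
]
bed = [
    "chr1 11504 11675 chr11150411675L1MC5a 484 -",
    "chr1 26790 27053 chr12679027053AluSp 2070 +",
]
print(missing_hierarchy_ids(bed, hierarchy))
```

An empty result only rules out one possible cause. It does not show that the inputs are otherwise what the pipeline expects.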
Do you have any idea how to solve this problem?
Thanks,
Rafael
Hello,
When I run the TEFLoN preparation script teflon_prep_custom.py multiple times, there is no reference genome in the MP folder, but there are multiple reference genome versions in the RM folder: .fa.align, .fa.cat.gz, .fa.masked, .fa.tbl, .fa.out.
Therefore, I was wondering whether this is a problem with the tool itself, or whether one of these versions can actually be provided as the genome for the main TEFLoN script.
P.S. The hierarchy and BED files are there and look complete.
hello!
i wanted to test teflon on some sequencing data of mine!
after it finishes collapsing the samples i get the following error:
Traceback (most recent call last):
File "/software/TEFLoN/TEFLoN_env/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
error concatenating positions
this is the command i'm using
python /software/TEFLoN/teflon_collapse.py -wd /project/TEFLoN -d /project/TEFLoN/ref.prep_TF/ -s /project/TEFLoN/samples.txt -es /software/BIN/samtools -n1 5 -n2 5 -q 15 -t 19
i'm only a beginner at python, so i'm not really sure what to do now.
it would be great if you could help me out!
kind regards,
lisa
Hi @jradrion
I've run into an issue with one of my samples while running TEFLoN. Here are the commands I used to prep annotations, map reads, and run TEFLoN:
conda activate teflon
outdir=/scratch/pjb68507/test/teflon
cd $outdir
mkdir teflon_results
cores=32
python src/teflon/teflon_prep_annotation.py \
-wd teflon_results/ \
-a data/reference_te.bed \
-t data/teflon_taxonomy.tsv \
-f data/consensusTEs.fasta \
-g data/dm6.fasta \
-p teflon
bwa index teflon_results/teflon.prep_MP/teflon.mappingRef.fa
bwa mem \
-t $cores \
-Y teflon_results/teflon.prep_MP/teflon.mappingRef.fa \
data/M_gor_S7_1.fq data/M_gor_S7_2.fq > mapped_reads.sam
samtools view -@ $cores -Sb mapped_reads.sam > mapped_reads.bam
samtools sort -@ $cores -o mapped_reads.sorted.bam mapped_reads.bam
samtools index mapped_reads.sorted.bam
echo -e "$outdir/mapped_reads.sorted.bam\tsample" > samples.tsv
python src/teflon/teflon.v0.4.py \
-wd teflon_results/ \
-d teflon_results/teflon.prep_TF/ \
-s samples.tsv \
-i sample \
-l1 family \
-l2 family \
-t $cores \
-q 20
name: teflon
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- python=2.7
- samtools=1.3
- bwa=0.7.17
- repeatmasker=4.1.0
- bedtools=2.17.0
- gawk=5
At the teflon.v0.4.py step, I get an error from samtools:
cmd: samtools view -@ 32 -q 20 -L /scratch/pjb68507/test/teflon/teflon_results/sample.bed_files/mega_clustered.bed /scratch/pjb68507/test/teflon/mapped_reads.sorted.bam -b
Error running samtools: p.returncode = 1
This appears to be caused by mega_clustered.bed being malformed and unreadable by samtools.
$ awk -F"\t" 'NF!=3{print $0}' teflon_results/sample.bed_files/mega_clustered.bed
chrX 13128878 13117.6 287 2438
0645341
chr3L 6111493S18 1571 1718
42325 13542472
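A slightly stricter check than the awk one-liner can also catch lines that have three columns but non-integer coordinates. This is a minimal sketch, assuming a plain three-column, tab-delimited BED file like the one the awk command expects. The function name `malformed_bed_lines` is hypothetical.

```python
def malformed_bed_lines(lines):
    """Return (line_number, text) for lines that are not valid 3-column BED.

    A line passes only if it has exactly three tab-separated fields and
    the start and end are non-negative integers with start <= end.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        ok = (
            len(fields) == 3
            and fields[1].isdigit()
            and fields[2].isdigit()
            and int(fields[1]) <= int(fields[2])
        )
        if not ok:
            bad.append((number, text))
    return bad


example = [
    "chrX\t100\t200\n",
    "chrX\t13128878\t13117.6\t287\t2438\n",
    "0645341\n",
    "chr3L\t6111493S18\t1571\n",
]
for number, text in malformed_bed_lines(example):
    print(number, repr(text))
```

The line numbers make it easier to see whether the damaged entries cluster at regular offsets, which would be consistent with interleaved writes from several processes.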
Most of the mega_clustered.bed file is fine, but it looks like some of the lines are being merged together or truncated, resulting in entries with more or fewer than three columns.
Thanks for the help,
Preston
Hi, thanks for making TEFLoN, I'm hoping it'll be very useful for my project.
I have a set of libraries with an insert size standard deviation consistently over 100. These are 150 bp paired-end reads generated on a NovaSeq 6000. When running teflon.v0.4.py I get the following warning:
Insert size standard deviation estimated as 140. Use the override option if you suspect this is incorrect!
!!! Warning: insert size standard deviation reported as 140 !!!
Please ensure this is correct and use the override option!
I can get it to run using the suggested -sd option, but I was wondering why TEFLoN warns about this scenario and whether I should expect any impact on the results. Does a larger insert size SD affect the ability to identify TE insertions/deletions?
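Before overriding the estimate, the insert size distribution can be checked independently from the alignments. The sketch below is not TEFLoN's estimator. It reads SAM-formatted lines, keeps properly paired reads (flag bit 0x2) with a positive TLEN (column 9) so each pair is counted once, and reports the mean and sample standard deviation. The `max_insert` cutoff is a hypothetical parameter for discarding extreme outliers.

```python
import math


def insert_size_stats(sam_lines, max_insert=10000):
    """Return (mean, sd) of insert sizes from SAM lines, or None.

    Only properly paired reads (flag & 0x2) with 0 < TLEN <= max_insert
    are used; taking the positive TLEN counts each pair exactly once.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        if flag & 0x2 and 0 < tlen <= max_insert:
            sizes.append(tlen)
    if len(sizes) < 2:
        return None
    mean = sum(sizes) / float(len(sizes))
    variance = sum((s - mean) ** 2 for s in sizes) / (len(sizes) - 1)
    return mean, math.sqrt(variance)
```

One way to feed it real data is to pipe the output of `samtools view` for a BAM file into a script that passes `sys.stdin` to this function. A long right tail in the insert sizes will inflate the standard deviation, so comparing results at a few `max_insert` values can show whether outliers are driving the estimate.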