oushujun / edta

Extensive de-novo TE Annotator

Home Page: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y

License: GNU General Public License v3.0

Perl 77.21% Shell 5.55% Python 14.78% PureBasic 1.30% R 1.15%

benchmarking genome-annotation pipeline transposable-elements

edta's Issues

ERROR: Stage 1 library not found in chr1.fa.mod.EDTA.combine/chr1.fa.mod.LTR.TIR.Helitron.fa.stg1 at EDTA.pl line 384.

Hello,

I am trying to use EDTA to annotate an avian genome (as a test, I am running it on a single chromosome), but I keep running into an error.
I installed it with conda, following the script you provide on GitHub.
Here is a copy of what I execute to run your script:

PATH=$PATH:/home/tkastylevsky/EDTA
cd
cd /home/tkastylevsky/FASTA_files/EDTA/gallus_gallus/chr1/
EDTA.pl -genome chr1.fa -anno 1  -force 1

(I added -force 1 based on a solved issue in this GitHub repository, but it didn't help.)

This is what I get (my system locale is French, so some messages may appear in French; feel free to ask if you need a translation):

########################################################

Extensive de-novo TE Annotator (EDTA) v1.7.6

Shujun Ou ([email protected])

########################################################

Wednesday 29 January 2020, 18:08:10 (UTC+0100) Dependency checking:
All passed!
Wednesday 29 January 2020, 18:08:20 (UTC+0100) Obtain raw TE libraries using various structure-based programs:
At least 1 parameter mandatory:

Input fasta file: --genome

Obtain raw TE libraries using various structure-based programs
perl EDTA_raw.pl [options]
--genome [File] The genome FASTA
--species [rice|maize|others] Specify the species for identification of TIR candidates. Default: others
--type [ltr|tir|helitron|all] Specify which type of raw TE candidates you want to get. Default: all
--overwrite [0|1] If previous results are found, decide to overwrite (1, rerun) or not (0, default).
--threads|-t [int] Number of theads to run this script. Default: 4
--help|-h Display this help info

cat: chr1.fa.mod.LTR.intact.fa: No such file or directory
cat: chr1.fa.mod.TIR.intact.fa: No such file or directory
cat: chr1.fa.mod.Helitron.intact.fa: No such file or directory
cat: chr1.fa.mod.LTR.intact.fa.gff3: No such file or directory
cat: chr1.fa.mod.TIR.intact.fa.gff: No such file or directory
cat: chr1.fa.mod.Helitron.intact.fa.gff: No such file or directory

perl bed2gff.pl EDTA.TE.combo.bed

mv: cannot stat 'chr1.fa.mod.EDTA.intact.bed.gff': No such file or directory
cp: cannot stat 'chr1.fa.mod.EDTA.intact.gff': No such file or directory
Wednesday 29 January 2020, 18:08:20 (UTC+0100) Obtain raw TE libraries finished.
All intact TEs found by EDTA:
chr1.fa.mod.EDTA.intact.fa
chr1.fa.mod.EDTA.intact.gff

Wednesday 29 January 2020, 18:08:20 (UTC+0100) Perform EDTA advcance filtering for raw TE candidates and generate the stage 1 library:

Genome file chr1.fa.mod not exists!

Perform EDTA basic and advcanced filterings for raw TE candidates and generate the stage 1 library
perl EDTA_processF.pl [options]
-genome [File] The genome FASTA
-ltr [File] The raw LTR library FASTA
-tir [File] The raw TIR library FASTA
-helitron [File] The raw Helitron library FASTA
-mindiff_ltr [float] The minimum fold difference in richness between LTRs and contaminants (default: 1)
-mindiff_tir [float] The minimum fold difference in richness between TIRs and contaminants (default: 1)
-mindiff_hel [float] The minimum fold difference in richness between Helitrons and contaminants (default: 4)
-repeatmasker [path] The directory containing RepeatMasker (default: read from ENV)
-blast [path] The directory containing Blastn (default: read from ENV)
-protlib [File] Protein-coding aa sequences to be removed from TE candidates. (default lib: alluniRefprexp082813 (plant))
You may use uniprot_sprot database available from here:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/
-threads|-t [int] Number of theads to run this script
-help|-h Display this help info

ERROR: Stage 1 library not found in chr1.fa.mod.EDTA.combine/chr1.fa.mod.LTR.TIR.Helitron.fa.stg1 at /home/tkastylevsky/EDTA/EDTA.pl line 384.

I know from other annotation methods (RepeatModeler) that there are LTRs, TIRs, and Helitrons on this chromosome.

Thank you in advance,
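
Since the `cat` messages in the log name the intermediate files that were expected, one quick check is to see which of them are absent or empty before rerunning. The following is a minimal Python sketch of such a check, not part of EDTA; the helper name `missing_raw_files` and the idea of passing the directory explicitly are assumptions made for illustration, and the file suffixes are copied from the log above.

```python
import os

# Suffixes taken from the "cat: ... No such file or directory" lines in the log.
RAW_SUFFIXES = ("LTR.intact.fa", "TIR.intact.fa", "Helitron.intact.fa")


def missing_raw_files(genome, directory="."):
    """Return the expected file names that are absent or empty in `directory`.

    `genome` is the name passed to -genome (e.g. "chr1.fa"); the ".mod."
    infix follows the file names shown in the log. Hypothetical helper.
    """
    missing = []
    for suffix in RAW_SUFFIXES:
        name = f"{genome}.mod.{suffix}"
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

Calling `missing_raw_files("chr1.fa", some_directory)` returns a list of file names; an empty list means all three files exist and are non-empty in that directory.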

Combining existing TE library with EDTA results

I think I closed this a bit too early, I do have a question that isn't discussed in #8. If we plan on including homology-based TEs from RepBase or Dfam as well as the structure-based TEs from EDTA, do you suggest including the RepBase/Dfam libraries in the -curatedlib option of the EDTA run? Or should we run EDTA and then concatenate with RepBase/Dfam results?

Originally posted by @Neato-Nick in #18 (comment)

Identifying TIR uses only one CPU

Hi Shujun,
I am running EDTA.pl in a conda environment with --threads 30. The 'Identify LTR' step finished in less than a day, but the 'Identify TIR' step has been running for six days now. I've also noticed that this process is using only one CPU. Is this normal?

##### Extensive de-novo TE Annotator (EDTA) v1.7.9  ####
##### Shujun Ou ([email protected])             ####
########################################################



Mon Feb  3 19:37:57 -02 2020    Dependency checking:
                                All passed!

Mon Feb  3 19:38:41 -02 2020    Obtain raw TE libraries using various structure-based programs:
Mon Feb  3 19:38:41 -02 2020    EDTA_raw: Check dependencies, prepare working directories.

Mon Feb  3 19:38:53 -02 2020    Start to find LTR candidates.

Mon Feb  3 19:38:53 -02 2020    Identify LTR retrotransposon candidates from scratch.

Use of uninitialized value $chr_pre in hash element at /home/augustold/miniconda3/envs/EDTA/share/LTR_retriever/bin/call_seq_by_list.pl line 86.
Tue Feb  4 13:07:26 -02 2020    Finish finding LTR candidates.

Tue Feb  4 13:07:26 -02 2020    Start to find TIR candidates.

Tue Feb  4 13:07:26 -02 2020    Identify TIR candidates from scratch.

Species: others

Best wishes and thank you for providing this tool.

All transposons "unknown"

Hello,
In file.fa.EDTA.TElib.fa, virtually all transposons are labelled as "unknown":
16712 are unknown
48 are Gypsy
Assuming my study organism does not have completely unusual transposons, is this the kind of statistics you would expect?

Thank you
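
A tally like the one above can be reproduced by counting the classification labels in the library's FASTA headers. Here is a minimal Python sketch, assuming headers of the form `>name#Class/Superfamily`; that header layout and the helper name `count_classifications` are assumptions for illustration, not something stated in this thread.

```python
from collections import Counter


def count_classifications(fasta_lines):
    """Count the label that follows '#' in each FASTA header line.

    Assumes headers such as '>TE_00000001#LTR/Gypsy' (an assumed layout).
    Headers without a '#' are counted under 'unlabelled'.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        header = fields[0] if fields else ""
        label = header.split("#", 1)[1] if "#" in header else "unlabelled"
        counts[label] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle on the library can be passed directly.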

Call_seq.pl script giving an empty output.

When running the EDTA_raw.pl script, the raw FASTA output files for both TIR and Helitron are empty. I think the problem is in the call_seq.pl script, because TIR.ext30.list contains entries such as:

000000F:152380..154395 000000F:152350..154425
000000F:292101..295163 000000F:292071..295193
000000F:429115..433751 000000F:429085..433781
000000F:433252..438167 000000F:433222..438197

But then TIR.ext30.fa is empty.

I tried calling the script on its own:
perl $call_seq $seq.ext$extlen.list -C $genome
but it doesn't give any output either.

The same applies to HelitronScanner.raw.ext.list and HelitronScanner.raw.ext.fa.

The fasta header format is as follows:

000160F 000285294:B~~000285294:B~~000284495:B~000284495:B ctg_linear 11256 10841

but even with no spaces the problem persists:

000161F_000058666:E~~000058666:E~~000414072:B~000414072:B_ctg_linear_15599_15577

Any help or suggestion will be appreciated.
Thanks!
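
One way to narrow this down is to check, independently of call_seq.pl, whether the IDs in the list can be found among the FASTA sequence IDs. The Python sketch below does that under one assumption that may not match how the Perl script works: the sequence ID is taken to be the header text up to the first whitespace. The function names are hypothetical.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence}; the id is the header up to the first
    whitespace (an assumed convention, not confirmed for call_seq.pl)."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def parse_region(entry):
    """Split 'id:start..end' into (id, start, end), 1-based inclusive.

    rpartition splits on the last ':' so ids containing ':' still parse.
    """
    seq_id, _, coords = entry.rpartition(":")
    start, end = coords.split("..")
    return seq_id, int(start), int(end)


def extract(seqs, entry):
    """Return the requested subsequence, or None if the id is not in the FASTA."""
    seq_id, start, end = parse_region(entry)
    if seq_id not in seqs:
        return None
    return seqs[seq_id][start - 1:end]
```

If `extract` returns `None` for every list entry, the list IDs and the FASTA IDs do not agree under this convention, which would be worth checking before looking further into the script.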

TIRlearner crashes Out of Memory

Hi!
I ran into a memory issue when trying to run TIR-Learner. Have you encountered this before? What can I do to solve it?

Here are the commands and outputs that I get:
$ nohup perl ../EDTA/EDTA_raw.pl -genome F2.genome.fasta -species Maize -type tir -threads 20 > essai_tir.out 2> essai_tir.err &
$ cat essai_tir.err
nohup: ignoring input
Wed Jan 15 19:24:43 CET 2020 EDTA_raw: Check files and dependencies, prepare working directories.

Wed Jan 15 19:24:43 CET 2020 Start to find TIR candidates.

ln: failed to create symbolic link 'F2.genome.fasta': Input/output error
Wed Jan 15 19:24:43 CET 2020 Identify TIR candidates from scratch.

Species: Maize
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!
Out of memory!

ERROR: no LOC list!

Usage: perl call_seq_by_list.pl MSU_format_list -C genome.fasta -out file.fa [options]
	itself	Output sequence specified in the list (default).
	up_[int]	Output sequences [int] bp upstream of the region.
	down_[int]	Output sequences [int] bp downstream of the region.
	-C [fasta]	A fasta file you want to extract sequence from.
	-out	Output file name. Default: MSU_format_list.fa
	-header	[0|1]	Output sequence with (1, default) or without (0) sequence header.
	-rmvoid	[0|1]	Remove empty sequence (1, default) or retain empty sequence (0) in output.
	-ex		Exclude sequence specified by the list. Default: Output sequence specified by the list.
	-cov	[0-1]	Work with -ex. If excluding too much of the target (default 1), discard the entire sequence.
	-purge	[0|1]	Work with -ex. Switch on=1/off=0(default) to clean up aligned region and joint unaligned sequences.
Example: 
	Call sequence of upper 2000 bp region in the list and output to result.fa
		perl call_seq_by_list.pl array_list -C rice.fasta up_2000 -out result.fa

Out of memory!
Out of memory!
Out of memory!
Out of memory!

libstdc++.so.6: version `GLIBCXX_3.4.20' not found

Hi Shujun,

Unfortunately I'm still having trouble with this. Following on from my previous comment, I am now using my own installed version of RepeatMasker, and everything seems to work until it gets to TIR-Learner, where I now get the error below.


Fri Aug 16 19:10:44 CEST 2019	Dependency checking:
		All passed!
Fri Aug 16 19:10:56 CEST 2019	Obtain raw TE libraries using various structure-based programs:
/stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main)
/stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /stn4/djeffrie/EDTA/bin/TIR-Learner1.19/../GenericRepeatFinder/bin//grf-main)

Apparently I don't have the correct GLIBCXX version?

Any ideas?

Best

Dan

Originally posted by @DanJeffries in #11 (comment)
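
To see which `GLIBCXX` versions a given libstdc++ actually provides, the version strings embedded in the library can be listed, much like running `strings` on it and filtering for `GLIBCXX`. The following Python sketch does this on raw bytes; the function names are hypothetical, and the sample bytes in the demo are made up rather than read from a real library.

```python
import re


def glibcxx_versions(data: bytes):
    """Return sorted version tuples for every 'GLIBCXX_<digits>' string in `data`."""
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)*)", data))
    return sorted(tuple(int(part) for part in v.decode().split(".")) for v in found)


def provides(data: bytes, wanted: str) -> bool:
    """True if the version string `wanted` (e.g. '3.4.20') appears in `data`."""
    return tuple(int(part) for part in wanted.split(".")) in glibcxx_versions(data)


# Made-up bytes standing in for the contents of a shared library.
sample = b"\x00GLIBCXX_3.4.18\x00GLIBCXX_3.4.19\x00GLIBCXX_FORCE_NEW\x00"
print(glibcxx_versions(sample))
print(provides(sample, "3.4.20"))
```

To apply it to the library named in the error message, read that file in binary mode (for example `open("/lib64/libstdc++.so.6", "rb").read()`) and pass the bytes to `provides` with `"3.4.20"` and `"3.4.21"`.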

Fail on identification of TIRs (EDTA v 1.8) with the step-by-step installation

Hello,
I'm currently trying to run EDTA on my laboratory's cluster, and I have encountered an issue that looks similar to one already listed in the EDTA issues, except that I'm running version 1.8. I installed it through the step-by-step conda installation (for some reason, the one-line conda installation doesn't work on my devices).
I encounter this error:

Mon Feb 10 17:57:54 CET 2020	EDTA_raw: Check dependencies, prepare working directories.

Mon Feb 10 17:58:14 CET 2020	Start to find LTR candidates.

Mon Feb 10 17:58:14 CET 2020	Identify LTR retrotransposon candidates from scratch.

Mon Feb 10 18:39:20 CET 2020	Finish finding LTR candidates.

Mon Feb 10 18:39:20 CET 2020	Start to find TIR candidates.

Mon Feb 10 18:39:20 CET 2020	Identify TIR candidates from scratch.

Species: others
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/getDataset.py", line 11, in <module>
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/sklearn/preprocessing/__init__.py", line 8, in <module>
    from .data import Binarizer
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 18, in <module>
    from scipy import stats
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/__init__.py", line 348, in <module>
    from .stats import *
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/stats.py", line 177, in <module>
    from . import distributions
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/distributions.py", line 13, in <module>
    from . import _continuous_distns
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/stats/_continuous_distns.py", line 15, in <module>
    from scipy._lib._numpy_compat import broadcast_to
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/scipy/_lib/_numpy_compat.py", line 10, in <module>
    from numpy.testing.nosetester import import_nose
ModuleNotFoundError: No module named 'numpy.testing.nosetester'
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 75, in <module>
    f_m3=removeDupinSingle("%s.gff3"%(genome_Name+spliter+"Module3"))
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 57, in removeDupinSingle
    f=pd.read_csv(file,header=None,sep="\t") #shujun
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 32, in GetListFromFile
    f=open(file,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/beegfs/data/tkastylevsky/programs/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 63, in <module>
    pool.map(GetListFromFile,fileList) #shujun
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/beegfs/home/tkastylevsky/.conda/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn*.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn*.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /beegfs/data/tkastylevsky/programs/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list galgal6_chr1.fa.mod.TIR.ext30.list is empty.

Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
Warning: The TIR result file has 0 bp!

Mon Feb 10 21:29:38 CET 2020	Start to find Helitron candidates.

Mon Feb 10 21:29:38 CET 2020	Identify Helitron candidates from scratch.

Tue Feb 11 01:27:12 CET 2020	Finish finding Helitron candidates.

Tue Feb 11 01:27:12 CET 2020	Execution of EDTA_raw.pl is finished!

ERROR: Raw TIR results not found in galgal6_chr1.fa.mod.EDTA.raw/galgal6_chr1.fa.mod.TIR.raw.fa at /beegfs/data/tkastylevsky/programs/EDTA/EDTA.pl line 368.

Thanks in advance,

TEsorter issue

Dear Shujun,

Thanks for developing EDTA. It's really helpful.
I am now running this pipeline on my genome but have encountered an error:

2020-02-05 19:50:18,695 -INFO- generating gene anntations
Traceback (most recent call last):
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 976, in main
    pipeline(Args())
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 171, in pipeline
    for rc in Classifier(gff, db=args.hmm_database, fout=fc):
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 391, in classify
    for rc in self.parse():
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 380, in parse
    line = LTRgffLine(line)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 609, in __init__
    super(LTRgffLine, self).__init__(line)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 604, in __init__
    self.attributes = self.parse(self.attributes)
  File "/media/bulk_01/users/cai020/software/miniconda3/envs/EDTA/lib/python3.6/site-packages/TEsorter/app.py", line 606, in parse
    return dict(kv.split('=') for kv in attributes.split(';'))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
Warning...unknown stuff <

My command line (EDTA v1.7.9) is below:
EDTA.pl -genome $genome -species others -step all -overwrite 0 -cds $cds -sensitive 0 -anno 1 -evaluate 1 -threads $thread -repeatmasker $repeatMasker

I checked your code and guess this might be caused by the step that cleans up the CDS with TEsorter, but I am not sure. I have already generated the $genome.mod.MAKER.masked and $genome.mod.EDTA.TEanno.gff/sum results. The evaluation of the level of inconsistency is now running.
Could you please help me figure it out? Thank you very much in advance.

Best regards,
Chengcheng
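
The last frame of the traceback builds a dictionary with `dict(kv.split('=') for kv in attributes.split(';'))`, which raises this `ValueError` whenever a field splits into something other than exactly two parts, for example when a value itself contains `=`. The Python sketch below reproduces that behaviour and shows a more tolerant variant that splits on the first `=` only. The attribute string is a made-up example (the offending GFF line is not shown in this thread), and the tolerant function is an illustration rather than TEsorter's own fix.

```python
def parse_attributes_strict(attributes):
    # Same expression as in the last traceback frame: every field must split
    # into exactly two parts, otherwise dict() raises ValueError.
    return dict(kv.split('=') for kv in attributes.split(';'))


def parse_attributes_tolerant(attributes):
    """Split each field on the first '=' only and skip empty fields."""
    result = {}
    for kv in attributes.split(';'):
        if not kv.strip():
            continue  # e.g. the empty field left by a trailing ';'
        key, _, value = kv.partition('=')
        result[key] = value
    return result


# Hypothetical attribute column whose first value contains an extra '='.
example = "ID=repeat=1;Name=foo"

try:
    parse_attributes_strict(example)
except ValueError as err:
    print("strict parser failed:", err)

print(parse_attributes_tolerant(example))
```

If this is the cause, looking in the input GFF for attribute fields with more than one `=` should locate the offending line.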

Nomenclature discrepancy?

Hi,
Here are the counts from the TE library genome.FLYE.sixLongest.fa.EDTA.TElib.fa:

DNA/DTA	52
DNA/DTC	50
DNA/DTH	476
DNA/DTM	654
DNA/DTT	2722
DNA/Helitron	15
LTR/Gypsy	38
LTR/unknown	20
MITE/DTA	75
MITE/DTC	10
MITE/DTH	88
MITE/DTM	104
MITE/DTT	570

Then I ran RepeatMasker
RepeatMasker genome.FLYE.sixLongest.fa -no_is -pa 8 -lib genome.FLYE.sixLongest.fa.EDTA.TElib.fa

Here is the summary

==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements         1333       187637 bp    0.16 %
   SINEs:               20         1160 bp    0.00 %
   Penelope             63         3689 bp    0.00 %
   LINEs:              487        62803 bp    0.05 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex         12          561 bp    0.00 %
     R1/LOA/Jockey      23         2819 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B          50        23094 bp    0.02 %
     L1/CIN4           177        20812 bp    0.02 %
   LTR elements:       826       123674 bp    0.11 %
     BEL/Pao           105         7431 bp    0.01 %
     Ty1/Copia           2          131 bp    0.00 %
     Gypsy/DIRS1       256        55114 bp    0.05 %
       Retroviral      179        10844 bp    0.01 %

DNA transposons       2314       176348 bp    0.15 %
   hobo-Activator      689        43072 bp    0.04 %
   Tc1-IS630-Pogo      167        54954 bp    0.05 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac             18         2279 bp    0.00 %
   Tourist/Harbinger   249        12509 bp    0.01 %
   Other (Mirage,       24         1231 bp    0.00 %
    P-element, Transib)

Rolling-circles         77         8371 bp    0.01 %

Unclassified:           51         3907 bp    0.00 %

Total interspersed repeats:      367892 bp    0.32 %


Small RNA:             431       137483 bp    0.12 %

Satellites:            130         7935 bp    0.01 %
Simple repeats:      48930      1869437 bp    1.61 %
Low complexity:       9266       432567 bp    0.37 %
==================================================

The numbers for the DNA transposons do not seem to match.
For example, the non-redundant EDTA output reports more DNA elements than RepeatMasker does, but I would expect the opposite, since RepeatMasker should count every occurrence of each element. Or am I missing something?
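
To make the comparison concrete, the library counts listed above can be summed per top-level class. This Python sketch only does that arithmetic on the numbers quoted in this post; the function name `totals_by_class` is hypothetical, and grouping by the text before the `/` is an assumption about how the labels should be read.

```python
# Counts copied from the library tally earlier in this post.
library_counts = {
    "DNA/DTA": 52, "DNA/DTC": 50, "DNA/DTH": 476, "DNA/DTM": 654,
    "DNA/DTT": 2722, "DNA/Helitron": 15,
    "LTR/Gypsy": 38, "LTR/unknown": 20,
    "MITE/DTA": 75, "MITE/DTC": 10, "MITE/DTH": 88, "MITE/DTM": 104,
    "MITE/DTT": 570,
}


def totals_by_class(counts):
    """Sum counts by the part of the label before '/' (DNA, LTR, MITE)."""
    totals = {}
    for label, n in counts.items():
        cls = label.split("/", 1)[0]
        totals[cls] = totals.get(cls, 0) + n
    return totals


print(totals_by_class(library_counts))
# {'DNA': 3969, 'LTR': 58, 'MITE': 847}
```

That gives 3969 library entries labelled DNA, against the 2314 elements reported on the "DNA transposons" line of the RepeatMasker summary.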

TIR-Learner fails due to the lack of intact elements in some sequences

Hello (it's me again, sorry),

Following issue #14:
I have a machine where I thought EDTA was running fine, but it seems to work or fail depending on the genome FASTA provided. Here is what happens with a FASTA that seems to cause an error.
I have removed all scaffolds shorter than 5500 bp. The RepeatMasker and RepeatModeler used are not the ones from conda.

Mon Oct  7 20:13:39 CEST 2019	Dependency checking:
				All passed!
Mon Oct  7 20:14:01 CEST 2019	Obtain raw TE libraries using various structure-based programs: 
Mon Oct  7 20:14:01 CEST 2019	EDTA_raw: Check files and dependencies, prepare working directories.

Mon Oct  7 20:14:01 CEST 2019	Start to find LTR candidates.

Mon Oct  7 20:14:01 CEST 2019	Identify LTR retrotransposon candidates from scratch.

Mon Oct  7 20:21:12 CEST 2019	Finish finding LTR candidates.

Mon Oct  7 20:21:12 CEST 2019	Start to find TIR candidates.

Mon Oct  7 20:21:12 CEST 2019	Identify TIR candidates from scratch.

Species: others
rm: cannot remove './TIR-Learner/*': No such file or directory
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/lege/anaconda3/envs/EDTA/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/getDataset2.py", line 109, in Predict
    predicted_labels = model.predict(np.stack(prefeature))
  File "<__array_function__ internals>", line 6, in stack
  File "/home/lege/.local/lib/python3.6/site-packages/numpy/core/shape_base.py", line 421, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/getDataset2.py", line 130, in <module>
    d = pool.map(Predict,files)
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: need at least one array to stack
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/CombineAll.py", line 90, in <module>
    keep=removeIRFhomo("%s.gff3"%(genome_Name+spliter+dataset),remove,"%sClean.gff3"%(genome_Name+spliter+dataset+spliter))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3_New/CombineAll.py", line 76, in removeIRFhomo
    f=pd.read_csv(file,header=None,sep="\t")
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/lege/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3/GetAllSeq.py", line 32, in GetListFromFile
    f=open(file,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/sdc1/Alessandro/TR_2013/EDTA/bin/TIR-Learner1.23/Module3/GetAllSeq.py", line 63, in <module>
    pool.map(GetListFromFile,fileList) #shujun
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/lege/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /mnt/sdc1/Alessandro/TR_2013/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list long_scaffolds.fa.TIR.ext30.list is empty.
Warning: The TIR result file has 0 bp!

Mon Oct  7 20:57:12 CEST 2019	Start to find MITE candidates.

Mon Oct  7 20:57:12 CEST 2019	Identify MITE candidates from scratch.

Mon Oct  7 20:57:12 CEST 2019	Warning: Because MITE-Hunter is too slow and only contribute limited new TIR candidates, it is taken down temporary until a better solution is found.

As a temporary fix, the TIR-Learner is used to mock the MITE-Hunter result. Please run -type tir first.

Error: MITE results not found!

ERROR: Raw TIR results not found in long_scaffolds.fa.EDTA.raw/long_scaffolds.fa.TIR.raw.fa at ./EDTA/EDTA.pl line 177.

The FASTA file can be sent to you if you would like to investigate.
Thanks a lot.

RepeatModeler in conda

Hi, all

The EDTA pipeline relies on the RepeatModeler from conda, but that build has a known issue: it does not seem to produce consensi.fa.
Dfam-consortium/RepeatModeler#38

If you want to find TEs in your genome with RepeatModeler, please install the software yourself, point -repeatmodeler and -repeatmasker to the install paths, and then use consensi.fa.classified as your raw RepeatModeler FASTA.

RepeatMasker Classification by EDTA lib

Hi Shujun,

I used the genome.fa.EDTA.TElib.fa produced by EDTA.pl as the library to run RepeatMasker, but the resulting classification only has LTR elements and DNA elements, without specific classifications (such as LTR/Copia). How can I get a more detailed repeat classification from RepeatMasker? Do I need to run RepeatMasker in homology-based mode (set -species), then combine the two libraries as the final result?

Here is the command and result.

The first 10 lines of genome.fa.EDTA.TElib.fa

>TE_00000000#DNA/DTH
>TE_00000001#DNA/Helitron
>TE_00000002#DNA/DTC
>TE_00000003#DNA/Helitron
>TE_00000004#DNA/DTT
>TE_00000005#DNA/Helitron
>TE_00000006#DNA/Helitron
>TE_00000007#DNA/Helitron
>TE_00000008#DNA/DTT
>TE_00000009#DNA/Helitron

RepeatMasker

RepeatMasker -pa 24 -lib genome.fa.EDTA.TElib.fa -dir ./ -xsmall -gff -e ncbi -q -no_is -norna -nolow -div 40 -cutoff 225 genome.fa

==================================================
file name: genome.fa
sequences:           125
total length:  336324563 bp  (336315300 bp excl N/X-runs)
GC level:         33.22 %
bases masked:  189970773 bp ( 56.48 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:     97559     83216326 bp   24.74 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:    203839     83514886 bp   24.83 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:    123082     29657324 bp    8.82 %

Total interspersed repeats:196388536 bp   58.39 %


Small RNA:            0            0 bp    0.00 %

Satellites:           0            0 bp    0.00 %
Simple repeats:       0            0 bp    0.00 %
Low complexity:       0            0 bp    0.00 %

==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.6.0+
The query was compared to classified sequences in "genome.fa.EDTA.TElib.fa"

Cheers,
Zhigui
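
The library headers quoted above follow a `name#class/superfamily` pattern, so the labels stored in the library can be tallied directly from the FASTA headers, independently of the RepeatMasker summary table. A minimal sketch, assuming headers of exactly that shape; `tally_classifications` is a hypothetical helper and not part of EDTA or RepeatMasker:

```python
from collections import Counter


def tally_classifications(header_lines):
    """Count 'class/superfamily' labels in headers shaped like '>TE_00000000#DNA/DTH'."""
    counts = Counter()
    for line in header_lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        token = fields[0] if fields else ""
        if "#" not in token:
            continue  # header without a classification label
        label = token.split("#", 1)[1]
        counts[label] += 1
    return counts


# The ten headers quoted in the post.
headers = [
    ">TE_00000000#DNA/DTH",
    ">TE_00000001#DNA/Helitron",
    ">TE_00000002#DNA/DTC",
    ">TE_00000003#DNA/Helitron",
    ">TE_00000004#DNA/DTT",
    ">TE_00000005#DNA/Helitron",
    ">TE_00000006#DNA/Helitron",
    ">TE_00000007#DNA/Helitron",
    ">TE_00000008#DNA/DTT",
    ">TE_00000009#DNA/Helitron",
]

for label, n in sorted(tally_classifications(headers).items()):
    print(label, n)
```

Passing an open file handle for the full library instead of the `headers` list would give the label counts for every entry.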

Issue with identifying TIR candidates

Hello,
I am trying to run EDTA in a conda environment. The setup went fine, but at the TIR identification step I get the following error:

Tue Jan 28 12:37:40 CET 2020    Finish finding LTR candidates.

Tue Jan 28 12:37:40 CET 2020    Start to find TIR candidates.

Tue Jan 28 12:37:40 CET 2020    Identify TIR candidates from scratch.

Species: others


/mnt/vol2/conda/miniconda3/envs/EDTA/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

terminate called after throwing an instance of 'std::system_error'
  what():  terminate called after throwing an instance of 'terminate called after throwing an instance of 'Resource temporarily unavailablestd::system_errorstd::system_error
'
'
  what():  Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'

Any clues about the possible solution to this error?

IndexError: list index out of range

Dear Shujun,

When I run EDTA with EDTA/EDTA.pl -genome non-redundant.shortname.fa -species others -step all -t 28, I get the following error while identifying TIR candidates:

EDTA/bin/TIR-Learner2.4/Module2/RunGRF.py", line 79, in <module>
    if (len(str(records[0].seq))>int(length)+500):
IndexError: list index out of range

How can I fix this error? Thank you!
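
The `IndexError` is raised on `records[0]`, which suggests that the list of parsed sequences was empty for at least one input file. A minimal sketch for checking this before a run, assuming plain FASTA input; `count_fasta_records` and `check_fasta` are hypothetical helpers and not part of TIR-Learner:

```python
def count_fasta_records(lines):
    """Return the number of FASTA records (lines starting with '>') in an iterable of lines."""
    return sum(1 for line in lines if line.startswith(">"))


def check_fasta(path):
    """Raise a readable error, instead of a later IndexError, when a FASTA file has no records."""
    with open(path) as handle:
        n = count_fasta_records(handle)
    if n == 0:
        raise ValueError(f"{path} contains no FASTA records")
    return n


# In-memory demonstration with two records.
print(count_fasta_records([">chr1\n", "ACGT\n", ">chr2\n", "GGCC\n"]))
```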

EDTA run on a large genome

Hello,

I'd like to try out the EDTA pipeline to construct a repeat library for a large (20Gbp) genome assembly. Would you expect this to be scalable to a genome of this size? Would it be possible to partition the genome and run EDTA separately on each partition of the assembly?

Any tips or guidance would be much appreciated.
Thank you!
Lauren

dirname: missing operand

I can successfully run these two commands:
perl $EDTA_raw --genome $TAIR10_mod -species others -type ltr --overwrite 0 --threads 8
perl $EDTA_raw --genome $TAIR10_mod -species others -type helitron --overwrite 0 --threads 8

But this one:
perl $EDTA_raw --genome $TAIR10_mod -species others -type tir --overwrite 0 --threads 8

Gives the following error:


EDTA_raw: Check dependencies, prepare working directories.
Start to find LTR candidates.

Existing result file Arabidopsis_thaliana.TAIR10.dna.toplevel_14lines.fa.mod.LTR.raw.fa found! Will keep this file without rerunning this module.
 Please specify -overwrite 1 if you want to rerun this module.
Finish finding LTR candidates.
Start to find TIR candidates.
 Identify TIR candidates from scratch.

Species: others
dirname: missing operand
Try 'dirname --help' for more information.
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list Arabidopsis_thaliana.TAIR10.dna.toplevel_14lines.fa.mod.TIR.ext30.list is empty.
Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
Warning: The TIR result file has 0 bp!

Any suggestions?
Thank you!

Debug lines found in EDTA.pl

We tried to run the pipeline on our genome assembly FASTA file, xxx.fa. Unfortunately, the error message "xxx.fa.masked does not contain any sequences!" showed up.
What's going on?
Apparently, at line 48 of EDTA.pl, "if (0){" should be changed to "if (1){".

Is MITE-Hunter still there?

Hello,
The bioRxiv paper describes the use of MITE-Hunter, but the new figure on GitHub suggests it's not there any more. If I understand correctly, it has been disabled for the moment, right?

Identifying flanking LTR pairs

Hello,

I've used EDTA to successfully annotate and mask my plant genome (via RepeatMasker). However, I am also interested in the actual flanking LTR pairs for each LTR retrotransposon.

I know that LTR_FINDER and LTRharvest report these on their own. By running them individually on a segment of my genome, I'm only able to regenerate some of these pairs (maybe less than 10% of the total unique types found by EDTA). Furthermore, many of them do not match the positions reported by the full EDTA pipeline.

What would be the best way to find the corresponding LTR pairs for each LTR subfamily reported?

Much appreciated,

Bryan

ERROR: Raw LTR/TIR/Helitron results not found in *.EDTA.raw/

I ran the EDTA pipeline to identify TEs in about 100 fungal isolate genomes, all of them de novo assemblies. Around 70% of the isolates give good results with EDTA. The others, however, fail with the following error, even after many attempts:

Mon Dec 9 17:13:26 EST 2019 EDTA_raw: Check files and dependencies, prepare working directories.

Mon Dec 9 17:13:26 EST 2019 Start to find LTR candidates.

Mon Dec 9 17:13:26 EST 2019 Identify LTR retrotransposon candidates from scratch.

awk: fatal: cannot open file `L009.fa.pass.list' for reading (No such file or directory)
Warning: LOC list - is empty.

Error: Error while loading sequencecp: cannot stat ‘L009.fa.LTRlib.fa’: No such file or directory
cp: cannot stat ‘L009.fa.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in L009.fa.EDTA.raw/L009.fa.LTR.raw.fa at /home/AAFC-AAC/fuf/EDTA/EDTA.pl line 250.
Have you come across this error before?
Thanks,
Fuyou

Running EDTA efficiently on a cluster

Hi Shujun,

Just a quick question. I have completed some initial tests on a small fraction (~150Mb) of a ~5Gb genome and am ready to give the real thing a try! However, as I'm sure is the case for many users, I have to tactically dodge run-time limits whilst maximising the resources I can use on the various queues on my cluster. In my case for example I can run a job for 24 hours with a lot of resources, or a job for 10 days with limited resources. So one question I have is:

Can I independently and simultaneously run the TE library steps for tir, ltr and helitron (i.e. divide and conquer) into the same output folder and then use these for the final steps in a later job? Or is there something that would get confused if I did this?

Also if you have any other tips for maximising efficiency when constrained by cluster resources I'd be very happy to hear them. Specifically if you could give some guidance as to whether parallelism or memory are more important for each step that would already be very helpful!

Best wishes, and thanks again for an awesome tool and paper!

Dan
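
The three raw-library runs can be written down as separate commands, one per TE type, for example to submit each as its own cluster job. The sketch below only builds the command lines and does not run them. The `--genome`, `--species`, `--type`, `--overwrite` and `--threads` options are the ones that appear in EDTA_raw.pl usage text and commands elsewhere on this page; the genome name and thread count are placeholders. Whether the three runs can safely share one output folder is the open question in this post, and the sketch does not answer it.

```python
import shlex


def raw_commands(genome, threads, species="others"):
    """Build one EDTA_raw.pl command line per TE type (ltr, tir, helitron)."""
    commands = {}
    for te_type in ("ltr", "tir", "helitron"):
        argv = [
            "perl", "EDTA_raw.pl",
            "--genome", genome,
            "--species", species,
            "--type", te_type,
            "--overwrite", "0",
            "--threads", str(threads),
        ]
        # Quote each argument so the line can be pasted into a job script.
        commands[te_type] = " ".join(shlex.quote(arg) for arg in argv)
    return commands


for te_type, command in raw_commands("genome.fa", 8).items():
    print(te_type, "->", command)
```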

EDTA v1.5 fail on a small dataset

Shujun, I tested the v1.5 with a small data set. It showed errors as:

########################################################

Extensive de-novo TE Annotator (EDTA) v1.5

Shujun Ou ([email protected])

########################################################

Mon Aug 26 12:33:52 CDT 2019 Dependency checking:
All passed!
Mon Aug 26 12:33:57 CDT 2019 Obtain raw TE libraries using various structure-based programs:
Mon Aug 26 12:33:57 CDT 2019 EDTA_raw: Check files and dependencies, prepare working directories.

Mon Aug 26 12:33:57 CDT 2019 Start to find LTR candidates.

Mon Aug 26 12:33:57 CDT 2019 Identify LTR retrotransposon candidates from scratch.

    Usage: perl cleanup.pl -f sample.fa [options] > sample.cln.fa 
Options:
	-misschar	n	Define the letter representing unknown sequences; case insensitive; default: n
	-Nscreen	[0|1]	Enable (1) or disable (0) the -nc parameter; default: 1
	-nc		[int]	Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
	-nr		[0-1]	Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
	-minlen		[int]	Minimum sequence length filter after clean up; default: 100 (bp)
	-cleanN		[0|1]	Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
	-trf		[0|1]	Enable (1) or disable (0) tandem repeat finder (trf); default: 1
	-trf_path	path	Path to the trf program

cp: cannot stat ‘TF05-1v012.fasta.mod.retriever.scn.adj’: No such file or directory
cp: cannot stat ‘TF05-1v012.fasta.LTRlib.fa’: No such file or directory
cp: cannot stat ‘TF05-1v012.fasta.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in TF05-1v012.fasta.EDTA.raw/TF05-1v012.fasta.LTR.raw.fa at /homes/liu3zhen/.conda/envs/EDTA3/EDTA/EDTA.pl line 176.

Originally posted by @liu3zhenlab in #12 (comment)

GRF error: Input genome sequence size is short. Can't perform detection. Segmentation fault

Hello,
I have this error, however the pipeline is still running. Is it a benign warning?
I do have some very short sequences in my FASTA, but I also have scaffolds of several Mb. I don't know which scaffolds it failed on; I think that would be helpful to know.

And what is the minimal length of a sequence?

thank you
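
To see which sequences are short enough to be candidates for this message, the length of each record can be listed. A minimal sketch, assuming plain FASTA input; the 1000 bp cutoff is an arbitrary placeholder, since the minimum length GRF accepts is not stated in this thread, and both function names are hypothetical:

```python
def sequence_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def short_sequences(lines, min_len=1000):
    """Return the (name, length) pairs whose length is below min_len."""
    return [(n, l) for n, l in sequence_lengths(lines) if l < min_len]


example = [">scaffold_1 demo", "ACGT", "AC", ">scaffold_2", "A" * 1500]
print(short_sequences(example))
```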

How to speed up for big genome?

Hi dear Shujun,
I have set up the environment for running EDTA.
I work on amphibian genomes, which are larger than those of most other animals. I have run EDTA_raw for TIR, LTR, and Helitron, for example: EDTA_raw.pl -genome frog1_genome.chromosome.fa -type tir -thrads 16. It has been running for 48 hours and has not finished yet.
Is there any way to speed this up for big genomes?

Thank you for your attention and reply.

Zhangyi

Further classification of DNA transposon super-families

Hi Shujun,
Is there any method to further classify the DNA transposons that EDTA names DNA/DTT, DNA/DTA, etc. into specific superfamily names such as Harbinger, Mu, Ac/Ds and others?

Best regards,
Junpeng
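
The three-letter codes follow the unified TE classification scheme of Wicker et al. (2007), so they can be translated to superfamily names with a lookup table. A minimal sketch; `describe` is a hypothetical helper, and the table covers only the TIR codes that appear on this page:

```python
# Three-letter TIR superfamily codes and their common names.
WICKER_TIR_CODES = {
    "DTA": "hAT",
    "DTC": "CACTA",
    "DTH": "PIF-Harbinger",
    "DTM": "Mutator",
    "DTT": "Tc1-Mariner",
}


def describe(label):
    """Translate a label such as 'DNA/DTT' to 'DNA/Tc1-Mariner'; unknown codes pass through."""
    cls, _, code = label.partition("/")
    if not code:
        return cls
    return f"{cls}/{WICKER_TIR_CODES.get(code, code)}"


for label in ("DNA/DTA", "DNA/DTT", "MITE/DTM", "DNA/Helitron"):
    print(label, "->", describe(label))
```

Under that scheme, Harbinger corresponds to DTH and Mu to DTM, while Ac/Ds elements belong to the hAT superfamily (DTA).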

About repeat elements in the results of EDTA!

Hello,
I tried a small dataset and got the following results:
Confusion matrix of BL06.R11.pilon.fasta.EDTA.TE.fa.stat for the all category
             DNA/DTC  DNA/DTH  DNA/DTM  LTR/Copia  LTR/unknown  MITE/DTM  Misclas_rate
DNA/DTC            7        0        0          0            0         0        0.0000
DNA/DTH            0     1163        1          0            0         0        0.0009
DNA/DTM            0        0     7936          0            3         1        0.0005
LTR/Copia          0        0        0        259            0         0        0.0000
LTR/unknown        1        1        4          0        25193         1        0.0003
MITE/DTM           0        0        2          0            0       168        0.0118
So my question is: can EDTA analyze repeat elements such as AT-rich, GC-rich, and short repeat elements like (AT)n?
Thanks,
Fuyou
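
The `Misclas_rate` column is consistent with the off-diagonal count of each row divided by the row total. A minimal sketch that recomputes it from the matrix above; this formula is inferred from the numbers shown and is not taken from the EDTA source:

```python
labels = ["DNA/DTC", "DNA/DTH", "DNA/DTM", "LTR/Copia", "LTR/unknown", "MITE/DTM"]
matrix = [
    [7, 0, 0, 0, 0, 0],
    [0, 1163, 1, 0, 0, 0],
    [0, 0, 7936, 0, 3, 1],
    [0, 0, 0, 259, 0, 0],
    [1, 1, 4, 0, 25193, 1],
    [0, 0, 2, 0, 0, 168],
]


def misclassification_rates(matrix):
    """Per-row fraction of counts that fall outside the diagonal."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        rates.append((total - row[i]) / total if total else 0.0)
    return rates


for label, rate in zip(labels, misclassification_rates(matrix)):
    print(f"{label}\t{rate:.4f}")
```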

"The RMblast engine is not installed in RepeatMasker!" When specifying conda env installation --prefix

Hi Shujun

I have been trying to install EDTA on my server but I have an annoying situation of a storage quota on my home directory meaning that the default location for the conda env isn't big enough to complete the installation. I am trying to get around it using:

conda create --prefix /scratch/djeffrie/EDTAenv

The installation seems to work fine. However when I run the pipeline I get the error:

The RMblast engine is not installed in RepeatMasker!

I see some issues for LTR_retriever with the same error, but I can't figure out whether it's the same problem. I followed the suggestion here (oushujun/LTR_retriever#43) of running

RepeatMasker -e ncbi -q -pa 1 -no_is -norna -nolow dummy060817.fa.$rand -lib dummy060817.fa.$rand

but I didn't get the expected output relating to the taxonomy data file, I got the error

RepeatMasker::setspecies: Could not find user specified library dummy060817.fa..

Would you have any solutions for how to get round this? Perhaps it's a problem with using the --prefix argument? Or maybe just the server?

Best,

Dan

Cannot install repeatmodeler or repeatmasker

Sorry this question may be irrelevant to EDTA. I am having problems installing repeatmodeler or repeatmasker. Could you please help me with this? Thanks. When I run "conda install -y -c bioconda repeatmodeler", the error messages look like this:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Initial quick solve with frozen env failed. Unfreezing env and trying again.
Solving environment: failed with current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
Initial quick solve with frozen env failed. Unfreezing env and trying again.
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Package tk conflicts for:
python=3.6 -> tk[version='8.6.|>=8.6.7,<8.7.0a0|>=8.6.8,<8.7.0a0']
Package libstdcxx-ng conflicts for:
python=3.6 -> libstdcxx-ng[version='>=7.2.0|>=7.3.0']
Package repeatscout conflicts for:
repeatmodeler -> repeatscout
Package perl-threaded conflicts for:
repeatmodeler -> perl-threaded
Package readline conflicts for:
python=3.6 -> readline[version='7.|>=7.0,<8.0a0']
Package perl conflicts for:
repeatmodeler -> perl[version='5.22.0.|>=5.26.2,<5.27.0a0']
Package pip conflicts for:
python=3.6 -> pip
Package recon conflicts for:
repeatmodeler -> recon
Package libffi conflicts for:
python=3.6 -> libffi[version='3.2.|>=3.2.1,<4.0a0']
Package ncurses conflicts for:
python=3.6 -> ncurses[version='6.0.|>=6.0,<7.0a0|>=6.1,<7.0a0']
Package zlib conflicts for:
python=3.6 -> zlib[version='>=1.2.11,<1.3.0a0']
Package xz conflicts for:
python=3.6 -> xz[version='>=5.2.3,<6.0a0|>=5.2.4,<6.0a0']
Package libgcc-ng conflicts for:
python=3.6 -> libgcc-ng[version='>=7.2.0|>=7.3.0']
Package trf conflicts for:
repeatmodeler -> trf
Package openssl conflicts for:
python=3.6 -> openssl[version='1.0.|1.0.*,>=1.0.2l,<1.0.3a|>=1.0.2m,<1.0.3a|>=1.0.2n,<1.0.3a|>=1.0.2o,<1.0.3a|>=1.0.2p,<1.0.3a|>=1.1.1a,<1.1.2a|>=1.1.1c,<1.1.2a']
Package repeatmasker conflicts for:
repeatmodeler -> repeatmasker
Package perl-text-soundex conflicts for:
repeatmodeler -> perl-text-soundex
Package rmblast conflicts for:
repeatmodeler -> rmblast
Package sqlite conflicts for:
python=3.6 -> sqlite[version='>=3.20.1,<4.0a0|>=3.22.0,<4.0a0|>=3.23.1,<4.0a0|>=3.24.0,<4.0a0|>=3.25.2,<4.0a0|>=3.26.0,<4.0a0|>=3.29.0,<4.0a0']

number of used threads by LTR_FINDER

Hello, I am running the whole pipeline on a huge server. I specified 64 cores, but for the last 4 hours the program (LTR_FINDER) has been using only 6 threads at ~20% each, which amounts to roughly a single-core run. I wonder what might have gone wrong.

I installed EDTA using conda (following the instructions in the README) and ran it as follows:

perl EDTA/EDTA.pl -genome my_genome.fasta -species others -step all -anno 1 -threads 64

In htop the executed program looks like this:

.../LTR_FINDER_parallel -seq scaffolds.fasta -threads 64 -harvest_out -size 1000000 -time 300

Thanks for making EDTA, it was a good twist in a benchmarking paper :-). By the way, did you try to compare EDTA with PiRATE? It also seems like quite a comprehensive pipeline, but I couldn't find a comparison of the two.

RepeatModeler version used

Hi !
Thank you very much for this great tool! I was really pleased to discover it.
I have comments/questions related to RepeatModeler.
The version available within bioconda was wrong until recently (I fixed it before Christmas).

The RepeatModeler fix involved a small update of the RepeatMasker recipe. It also includes trf by default now.
So I guess you could update the installation procedure:
conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast-legacy java-jdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6.

RepeatModeler 2.0 now supports LTR structural search using a combination of LTR_harvest and LTR_retriever. How will this affect the results of EDTA? Do you have a benchmark? Should we avoid using RepeatModeler's LTR detection?

Short contig issue of TIR-Learner.sh

Hello,
I am rerunning the last push in the same folder and get errors, here is the STDOUT and STDERR
This is a follow-up of this issue:
#10

./EDTA/EDTA.pl -genome Avaga.Masurca.Graal.min5500.fa -species others -step all -t 48 2>&1 |tee EDTA.log

########################################################
##### Extensive de-novo TE Annotator (EDTA) v1.5    ####
##### Shujun Ou ([email protected])             ####
########################################################



Mo Aug 19 10:52:23 CEST 2019	Dependency checking:
				All passed!
Mo Aug 19 10:52:33 CEST 2019	Obtain raw TE libraries using various structure-based programs: 
Mo Aug 19 10:52:33 CEST 2019	EDTA_raw: Check files and dependencies, prepare working directories.

Mo Aug 19 10:52:33 CEST 2019	Start to find LTR candidates.

Mo Aug 19 10:52:33 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.LTRlib.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:33 CEST 2019	Finish finding LTR candidates.

Mo Aug 19 10:52:33 CEST 2019	Start to find TIR candidates.

Mo Aug 19 10:52:33 CEST 2019	Identify TIR candidates from scratch.

Species: others
/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/TIR-Learner.sh: 95: [: others: unexpected operator
/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/TIR-Learner.sh: 95: [: others: unexpected operator
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/getDataset2.py", line 107, in Predict
    model = load_model(path+"/Module3_New/"+'CNN0724.h5')
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/saving.py", line 249, in load_model
    optimizer_config, custom_objects=custom_objects)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 838, in deserialize
    printable_module_name='optimizer')
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 194, in deserialize_keras_object
    return cls.from_config(cls_config)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 159, in from_config
    return cls(**config)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 471, in __init__
    super(Adam, self).__init__(**kwargs)
  File "/home/urbe/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizers.py", line 68, in __init__
    'passed to optimizer: ' + str(k))
TypeError: Unexpected keyword argument passed to optimizer: name
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/getDataset2.py", line 131, in <module>
    d = pool.map(Predict,files)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
TypeError: Unexpected keyword argument passed to optimizer: name
cat: '*-+-DTA.fa': No such file or directory
cat: '*-+-DTC.fa': No such file or directory
cat: '*-+-DTH.fa': No such file or directory
cat: '*-+-DTM.fa': No such file or directory
cat: '*-+-DTT.fa': No such file or directory
cat: '*-+-NonTIR.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/CombineAll.py", line 90, in <module>
    keep=removeIRFhomo("%s.gff3"%(genome_Name+spliter+dataset),remove,"%sClean.gff3"%(genome_Name+spliter+dataset+spliter))
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3_New/CombineAll.py", line 76, in removeIRFhomo
    f=pd.read_csv(file,header=None,sep="\t")
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/urbe/anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Traceback (most recent call last):
  File "/media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/bin/TIR-Learner1.22/Module3/GetAllSeq.py", line 62, in <module>
    file=open(f,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /media/urbe/MyBDrive/Alessandro/TR_annotation/EDTA/util/rename_tirlearner.pl line 18.
Warning: LOC list Avaga.Masurca.Graal.min5500.fa.TIR.ext30.list is empty.
Warning: The TIR result file has 0 bp!

Mo Aug 19 10:52:56 CEST 2019	Start to find MITE candidates.

Mo Aug 19 10:52:56 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.MITE.raw.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:56 CEST 2019	Finish finding MITE candidates.

Mo Aug 19 10:52:56 CEST 2019	Start to find Helitron candidates.

Mo Aug 19 10:52:56 CEST 2019	Existing result file Avaga.Masurca.Graal.min5500.fa.Helitron.raw.fa found! Will keep this file without rerunning this module.
	Please specify -overwrite 1 if you want to rerun this module.

Mo Aug 19 10:52:56 CEST 2019	Finish finding Helitron candidates.
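
In the tracebacks above, the interpreter comes from the conda environment (`anaconda3/envs/EDTA`) while `tensorflow` is loaded from `~/.local/lib/python3.6/site-packages`, so a user-level install may be shadowing the version pinned in the environment. A minimal sketch for checking where a module would be imported from; the helper names are hypothetical, and whether this mismatch is the cause of the `TypeError` is an assumption:

```python
import importlib.util
import os
import site


def is_under(path, directory):
    """True if path lies inside directory (both given as absolute paths)."""
    path = os.path.abspath(path)
    directory = os.path.abspath(directory)
    return os.path.commonpath([path, directory]) == directory


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


def loaded_from_user_site(name):
    """True if the module resolves to a file inside the per-user site-packages directory."""
    origin = module_origin(name)
    if not origin or not os.path.isabs(origin):
        return False  # not installed, built-in, or a namespace package
    return is_under(origin, site.getusersitepackages())


for name in ("tensorflow", "numpy", "pandas"):
    print(name, module_origin(name), loaded_from_user_site(name))
```

Run inside the activated EDTA environment, any module reported as coming from the user site is one that bypasses the environment's pinned version. Setting the `PYTHONNOUSERSITE` environment variable is the standard Python way to keep the user site directory off `sys.path`.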

Happy EDTA users with successful cases

Hi all,

Just updating with my test results. It seems that the new TIR release resolves this issue.

Please create a new environment for the EDTA 20190802 release
and follow the steps Shujun provided:

EDTA_raw
EDTA_processF
EDTA -step final

Time and resource usage for my plant genome (336 Mb, 58% repeats as estimated by GenomeScope, 24-core machine):

Step        maxvmem    time(h)    raw_fa size
Helitron    7.914GB    2.352222   1.3Mb
MITE        1.529GB    1.815278   4.9kb
TIR         42.127GB   4.895556   20Mb
LTR         19.049GB   1.417222   2.5Mb
EDTA_Final  19.388GB   19.42389   19Mb
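
The table can be condensed into a total run time and a peak memory figure. A minimal sketch, assuming the steps ran one after another (the post does not say whether any of them ran in parallel):

```python
# (maxvmem in GB, wall time in hours) per step, copied from the table above.
steps = {
    "Helitron": (7.914, 2.352222),
    "MITE": (1.529, 1.815278),
    "TIR": (42.127, 4.895556),
    "LTR": (19.049, 1.417222),
    "EDTA_Final": (19.388, 19.42389),
}

total_hours = sum(hours for _, hours in steps.values())
peak_step = max(steps, key=lambda step: steps[step][0])
peak_gb = steps[peak_step][0]

print(f"total time: {total_hours:.2f} h")
print(f"peak memory: {peak_gb} GB ({peak_step})")
```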

Thanks for developing this tool.

Bests,
Zhigui

Originally posted by @baozg in #4 (comment)

Pipeline failed at LTR_FINDER_parallel step

Dear Shujun,

I ran the EDTA pipeline v1.7.1 on a genome with the following command. It failed at the step that identifies LTR retrotransposon candidates from scratch.

perl /LabShares/Tools/EDTA/EDTA/EDTA.pl -genome DR_OL_ens90.fa -step all -cds DR_OL_cds_ens90.fa -sensitive 1 -anno 1
The STDERR showed an error:
Unsuccessful stat on filename containing newline at /LabShares/Tools/EDTA/EDTA/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 156, line 10314.

In the folder of LTR, a list of intermediate files have been generated:

alluniRefprexp082813.197723
alluniRefprexp082813.197723.phr
alluniRefprexp082813.197723.pin
alluniRefprexp082813.197723.psq
DR_OL_ens90.fa.finder.combine.scn
DR_OL_ens90.fa.harvest.scn
DR_OL_ens90.fa.list
DR_OL_ens90.fa.LTR.intact.fa
DR_OL_ens90.fa.LTR.intact.fa.ori
DR_OL_ens90.fa.LTR.intact.fa.ori.dusted
DR_OL_ens90.fa.LTR.intact.fa.ori.dusted.cleanup
DR_OL_ens90.fa.rawLTR.scn
Tpases020812DNA.197723
Tpases020812DNA.197723.phr
Tpases020812DNA.197723.pin
Tpases020812DNA.197723.psq
Tpases020812LINE.197723
Tpases020812LINE.197723.phr
Tpases020812LINE.197723.pin
Tpases020812LINE.197723.psq

All DR_OL_ens90.fa.LTR.intact* files are empty. Could you give me some suggestions to fix this?

Here I have the full STDERR attached for your reference.

Thank you so much.

Best,
Yixuan
AN_EDTA_DR_OL_ens90.txt

Keeping all full-length transposons?

Hello,
I might have a suggestion:
I was wondering if it wouldn't be useful for the final user to be able to get a file with the coordinates of the transposon, for example if one is interested to look at their position in the genome.

Thanks for EDTA, it's a cool pipeline.

Specifying RepeatMasker query species

Hi,

I'm annotating a genome pretty distant from homo sapiens. Checking the RM_ output folder, it looks like the call to RepeatMasker just queries this as the default species ("The query species was assumed to be homo" in the RM_/.fasta.tbl output in the *.fasta.EDTA.final/ folder). Is there any way to change this in my call to EDTA so I can most effectively use a homology-based TE calling method? Alternatively, I could just run RepeatMasker myself at the end using *.fasta.EDTA.TElib.fa as a custom library
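
The fallback described in this post (running RepeatMasker separately with the EDTA library as a custom library) could be scripted roughly as below. This is only a sketch: the helper name `build_repeatmasker_cmd` and the file names are hypothetical placeholders, and the thread count is arbitrary. `-lib`, `-pa` and `-gff` are standard RepeatMasker options.

```python
import shlex

def build_repeatmasker_cmd(genome, te_library, threads=4):
    """Return an argv list that masks `genome` with a custom TE library.

    Hypothetical helper for illustration; it only builds the command.
    """
    return [
        "RepeatMasker",
        "-pa", str(threads),   # number of parallel jobs
        "-lib", te_library,    # custom library, so no species lookup is needed
        "-gff",                # also write a GFF annotation
        genome,
    ]

# Placeholder file names; substitute your own genome and EDTA library.
cmd = build_repeatmasker_cmd("genome.fasta", "genome.fasta.EDTA.TElib.fa")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.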

MITE-Hunter produces no results

Shujun,

Thank you for updating EDTA. I am using 1.3 on a maize genome and the MITE step took a long time (~11 days). The problem is that no raw MITE sequences were output after the TIR and MITE runs. The run is now at the Helitron step. I will update after it finishes.

-Sanzhen

Use of uninitialized value within @ARGV

Dear Shujun
I managed to generate the raw files for LTR, TIR, MITE (Copy of TIR) and Helitrons. I am getting the following error while running EDTA_processF.pl.

/usr/local_sbs/source_files/EDTA/EDTA_processF.pl -genome HaploidAssemblyPilonPolishedCleaned.fasta -ltr HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.LTR.raw.fa -tir HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.TIR.raw.fa -helitron HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.Helitron.raw.fa -mite HaploidAssemblyPilonPolishedCleaned.fasta.EDTA.raw/HaploidAssemblyPilonPolishedCleaned.fasta.MITE.raw.fa

Use of uninitialized value within @ARGV in pattern match (m//) at /usr/local/local_sbs/source_files/EDTA/util/cleanup_nested.pl line 41.
Use of uninitialized value $blastplus in string eq at /usr/local/local_sbs/source_files/EDTA/util/cleanup_nested.pl line 49.

Could you help me with this?

The picture shows the final files that were generated.

Fails on identification of TIRs

Shujun,

I've installed the Docker version on our HPC. EDTA progresses through the LTR finding, but then crashes when trying to identify the TIRs. I'm pasting the full Slurm log below. Any help would be greatly appreciated.
Thanks, Jeff

> ##### Shujun Ou ([email protected])             ####
> ########################################################
>
>
>
> Fri Jan 24 02:09:53 UTC 2020    Dependency checking:
>                                 All passed!
> Fri Jan 24 02:10:03 UTC 2020    Obtain raw TE libraries using various structure-based programs:
> Fri Jan 24 02:10:03 UTC 2020    EDTA_raw: Check files and dependencies, prepare working directories.
>
> Fri Jan 24 02:10:03 UTC 2020    Start to find LTR candidates.
>
> Fri Jan 24 02:10:03 UTC 2020    Identify LTR retrotransposon candidates from scratch.
>
> Warning: LOC list ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.mod.ltrTE.veryfalse is empty.
> Fri Jan 24 02:27:35 UTC 2020    Finish finding LTR candidates.
>
> Fri Jan 24 02:27:35 UTC 2020    Start to find TIR candidates.
>
> Fri Jan 24 02:27:35 UTC 2020    Identify TIR candidates from scratch.
>
> Species: others
> 2020-01-24 02:55:41.202424: F tensorflow/python/lib/core/bfloat16.cc:675] Check failed: PyBfloat16_Type.tp_base != nullptr
> Aborted (core dumped)
> cat: '*-+-DTA.fa': No such file or directory
> cat: '*-+-DTC.fa': No such file or directory
> cat: '*-+-DTH.fa': No such file or directory
> cat: '*-+-DTM.fa': No such file or directory
> cat: '*-+-DTT.fa': No such file or directory
> cat: '*-+-NonTIR.fa': No such file or directory
> cat: '*-+-*-+-*.gff3': No such file or directory
> rm: cannot remove '*-+-*-+-*.gff3': No such file or directory
> Traceback (most recent call last):
>   File "/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 75, in <module>
>     f_m3=removeDupinSingle("%s.gff3"%(genome_Name+spliter+"Module3"))
>   File "/EDTA/bin/TIR-Learner2.4/Module3_New/CombineAll.py", line 57, in removeDupinSingle
>     f=pd.read_csv(file,header=None,sep="\t") #shujun
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
>     return _read(filepath_or_buffer, kwds)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
>     parser = TextFileReader(fp_or_buf, **kwds)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
>     self._make_engine(self.engine)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
>     self._engine = CParserWrapper(self.f, **self.options)
>   File "/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
>     self._reader = parsers.TextReader(src, **kwds)
>   File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
> pandas.errors.EmptyDataError: No columns to parse from file
> multiprocessing.pool.RemoteTraceback:
> """
> Traceback (most recent call last):
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 119, in worker
>     result = (True, func(*args, **kwds))
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
>     return list(map(*args))
>   File "/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 32, in GetListFromFile
>     f=open(file,"r+")
> FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
> """
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "/EDTA/bin/TIR-Learner2.4/Module3/GetAllSeq.py", line 63, in <module>
>     pool.map(GetListFromFile,fileList) #shujun
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 266, in map
>     return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 644, in get
>     raise self._value
> FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
> mv: cannot stat 'TIR-Learner/*FinalAnn*.gff3': No such file or directory
> mv: cannot stat 'TIR-Learner/*FinalAnn*.fa': No such file or directory
> Can't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.fa: No such file or directory at /EDTA/util/rename_tirlearner.pl line 18.
> Warning: LOC list ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.TIR.ext30.list is empty.
>
> Error: Error while loading sequenceCan't open ./TIR-Learner-Result/TIR-Learner_FinalAnn.gff3: No such file or directory.
> Warning: The TIR result file has 0 bp!
>
> Fri Jan 24 02:55:51 UTC 2020    Start to find Helitron candidates.
>
> Fri Jan 24 02:55:51 UTC 2020    Identify Helitron candidates from scratch.
>
> Fri Jan 24 03:49:16 UTC 2020    Finish finding Helitron candidates.
>
> Fri Jan 24 03:49:16 UTC 2020    Execution of EDTA_raw.pl is finished!
>
> ERROR: Raw TIR results not found in ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.EDTA.raw/ordered_atriplex_hortensis_04Apr2019_hkF1T_namedcorrectly_clean_00.fasta.TIR.raw.fa at /EDTA/EDTA.pl line 285.
> cleaning up image
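
The `pandas.errors.EmptyDataError` in the log above is what pandas raises when `read_csv` is handed a file with no data. Here the GFF3 appears to be empty because the earlier TensorFlow abort stopped the step that should have written it. A defensive read along the following lines would turn a missing or empty file into an empty result instead of a crash. This is a sketch using only the standard library, not TIR-Learner's actual code, and `read_gff3_rows` is a hypothetical helper.

```python
import csv
import os

def read_gff3_rows(path):
    """Return the tab-separated rows of `path`, or [] if it is missing or empty."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        # QUOTE_NONE: GFF3 attribute columns may contain literal quote characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

This only hides the symptom; the root cause in this log is still the TensorFlow failure a few lines earlier.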

Wildcard in `TIR-Learner2.4.sh` expands to too many elements and causes `cp` to fail

Hello again :-), I have run into the following error:

243: EDTA/bin/TIR-Learner2.4/TIR-Learner2.4.sh: cp: Argument list too long

the line that dropped the error message in the script is

cp -r $genomeName/$genomeName* temp/

where variable genomeName is statically assigned to TIR-Learner at the very beginning of the script.

The temp/ directory is now empty; I am not sure whether that is a problem.
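
"Argument list too long" is reported when a wildcard expands to more arguments than the operating system allows for a single command. The usual workaround is to copy the matches one at a time rather than passing them all to one `cp` call. A minimal sketch of that idea follows; it is not the fix adopted in EDTA, and `copy_matching` is a hypothetical helper.

```python
import os
import shutil

def copy_matching(src_dir, prefix, dest_dir):
    """Copy every entry in src_dir whose name starts with prefix into dest_dir.

    Entries are handled one by one, so no long argument list is ever built.
    Returns the number of entries copied.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = 0
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            target = os.path.join(dest_dir, entry.name)
            if entry.is_dir():
                # Mirrors the recursive behaviour of `cp -r` for directories.
                shutil.copytree(entry.path, target, dirs_exist_ok=True)
            else:
                shutil.copy2(entry.path, target)
            copied += 1
    return copied

# Rough equivalent of `cp -r $genomeName/$genomeName* temp/` with genomeName=TIR-Learner:
# copy_matching("TIR-Learner", "TIR-Learner", "temp")
```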

Can't locate object method "end" via package "Thread::Queue"

Hello,
I used EDTA but got an error when running the following command:

perl /PARA/pp811/anaconda3/bin/EDTA/EDTA.pl -genome ref.fasta -cds CDS.fasta -anno 1 -evaluate 1

And then I got the following error output:

Can't locate object method "end" via package "Thread::Queue" at /PARA/pp811/anaconda3/bin/EDTA/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 115, <List> line 1.
cat: ref.fasta.finder.combine.scn: No such file or directory
Can't locate object method "end" via package "Thread::Queue" at /PARA/pp811/anaconda3/bin/EDTA/bin/LTR_retriever/bin/LTR.identifier.pl line 125.
cp: cannot stat `ref.fasta.mod.retriever.scn.adj': No such file or directory
awk: cmd. line:1: fatal: cannot open file `ref.fasta.pass.list' for reading (No such file or directory)
Warning: LOC list - is empty.
Error: Error while loading sequence
cp: cannot stat `ref.fasta.LTRlib.fa': No such file or directory
cp: cannot stat `VF36.GPM.fasta.LTRlib.fa': No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in ref.fasta.EDTA.raw/ref.fasta.LTR.raw.fa at /PARA/pp811/anaconda3/bin/EDTA/EDTA.pl line 284.

Of all the results I've gotten so far, there is only the LTR raw file; both the TIR and Helitron raw FASTA files are empty.
Any help or suggestion will be appreciated.
Thanks!

Which library should I use to analyze an animal genome?

Hi, dear Professor Shujun,
I want to use EDTA to analyze some animal genomes for de novo TE prediction. However, it looks like EDTA's library is for rice, and I can't find a parameter to specify an animal library.
Can EDTA use an animal library? And how well does EDTA work on animal genomes?

ZhangYi.

cannot stat 'ref.fa.LTR.intact.fa.gff3': No such file or directory

Hello,

I started the EDTA pipeline yesterday, and I am very excited. However, after 1 hour of run time I get an error saying that certain LTR files are not present. Do you know what is going on? I call the script as follows:
perl /data/modules/python/python-anaconda2-2019.10-EDTA/envs/EDTA/share/EDTA/EDTA.pl -genome ref.fa -species others -step all -curatedlib library7birds.fa -sensitive -repeatmasker /data/biosoftware/RepeatMasker/RepeatMasker/ 1 -anno 1 -evaluate 1 -t 15

The input library and genome are soft links (ln -s).

I then get the following error output:

perl rename_LTR.pl genome.fa target_sequence.fa LTR_retriever.defalse

cp: cannot stat 'ref.fa.LTR.intact.fa.gff3': No such file or directory
cp: cannot stat 'ref.fa.LTRlib.fa': No such file or directory
cp: cannot stat 'ref.fa.LTRlib.fa': No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in ref.fa.EDTA.raw/ref.fa.LTR.raw.fa at /data/modules/python/python-anaconda2-2019.10-EDTA/envs/EDTA/share/EDTA/EDTA.pl line 284.

Suggest adding `-n EDTA` to `conda install`

My conda is Miniconda2 4.4.10, whose base environment uses python2. Running conda install python=3.6 broke the base environment, and conda now fails with the following import error:

$ conda -h
Traceback (most recent call last):
  File "~/conda/bin/conda", line 7, in <module>
    from conda.cli import main
ImportError: No module named conda.cli

So it is better to not install python=3.6 in conda's base.

Crashing with TIR-Learner

Hi,

I copied and pasted the installation instructions from the README and am running the script in the active EDTA environment. It seems that the EDTA.pl script chokes when trying to use TIR-Learner. Looking at my output, all the correct folders are there. After the crash, the Helitron, MITE, and TIR folders are empty but the LTR folder is not. The only file in the parent output folder is genome.fasta.LTR.raw.fa.

Is there a way to run the Perl pipeline script but just not use TIR-Learner, or even just not call TIRs? I'm still interested in the other features, and even if I could just use EDTA for Helitrons, LTRs, MITEs, filtering, consensus calling, and repeat classifying I would be happy.

The lines before the crash start with what's seen in #2 (comment). Then there is a traceback starting from ~/bin/EDTA/bin/TIR-Learner1.12/Module1/Fullcov.py, line 52, in <module> ProcessHomology(genome_Name). After that, there are some cryptic errors, including
cat: '*DTA-+-select.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
There are a few more error traces after that, with each traceback followed by various errors from files not being found by rm, cp, mv, and cat. The tracebacks come from:

TIR-Learner1.12/Module1 (above)
TIR-Learner1.12/Module1/Lowcomp_M1.py
TIR-Learner1.12/Module2/Lowcomp_M2.py
TIR-Learner1.12/

Lastly, in the last few lines before the crash, I get these lines, which tell me that it certainly is a problem with TIR-Learner:
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
cp: cannot stat 'TIR-Learner-Result/TIR-Learner_FinalAnn.fa': No such file or directory
Error: TIR results not found!

ERROR: Raw TIR results not found in genome.fasta.EDTA.raw/genome.fasta.TIR.raw.fa at ~bin/EDTA/EDTA.pl line 145.

While bug testing I've just been using the first two scaffolds of my genome. That file is attached.

Thanks!

PR-102_JGI_twoscafs.fasta.zip

EDTA ignoring the parameter -threads

Hello,
I noticed that blastx and TIR-Learner ignore the -threads setting and take all the available threads.

EDIT: sorry, was my mistake, closing

python2 was called when running cleanup_TE.pl

Dear Shujun,

When the -cds option is added, it seems like EDTA switches to use python2 for TEsorter. See cleanup_TE.pl line 36.

python2 $TEsorter $cds -p $threads;

This causes many incompatibility issues between python2 and python3, for example with Biopython.

I installed a separate python2 conda env for TEsorter, but still got an error in the test:

(py2) [qiushi.li@itbioyeaman03 test]$ python ../TEsorter.py rice6.9.5.liban
2019-11-27 07:08:19,201 -WARNING- No module named drmaa
grid computing is not available
2019-11-27 07:08:19,203 -INFO- VARS: {'seq_type': 'nucl', 'min_coverage': 20, 'disable_pass2': False, 'tmp_dir': './tmp', 'processors': 4, 'sequence': 'rice6.9.5.liban', 'no_library': False, 'p2_identity': 80.0, 'no_cleanup': False, 'force_write_hmmscan': False, 'p2_length': 80.0, 'prefix': 'rice6.9.5.liban.rexdb', 'max_evalue': 0.001, 'p2_coverage': 80.0, 'pass2_rule': '80-80-80', 'hmm_database': 'rexdb', 'no_reverse': False}
2019-11-27 07:08:19,203 -INFO- checking dependencies:
2019-11-27 07:08:19,213 -INFO- hmmer 3.2.1 OK
Traceback (most recent call last):
  File "../TEsorter.py", line 974, in <module>
    pipeline(Args())
  File "../TEsorter.py", line 116, in pipeline
    Dependency().check_blast()
  File "../TEsorter.py", line 920, in check_blast
    version = self.check_blast_version(program)
  File "../TEsorter.py", line 939, in check_blast_version
    version = re.compile(r'blast\S* ([\d.+]+)').search(out).groups()[0]
AttributeError: 'NoneType'

Best,
Qiushi
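
The traceback in the post above ends on a line that chains `.search(out).groups()`. In Python, a pattern's `search` method returns `None` when nothing matches, so the `'NoneType'` error suggests that the BLAST version output did not match the pattern (for instance if the program printed nothing). A guarded version of that parse might look like the sketch below. It reuses the regular expression shown in the traceback; `parse_blast_version` is a hypothetical helper, not TEsorter's code.

```python
import re

# Same pattern as in the traceback above.
BLAST_VERSION = re.compile(r'blast\S* ([\d.+]+)')

def parse_blast_version(out):
    """Return the version string found in `out`, or None if it cannot be parsed."""
    match = BLAST_VERSION.search(out)
    if match is None:
        # Empty or unexpected output: report it instead of raising AttributeError.
        return None
    return match.group(1)

print(parse_blast_version("blastn: 2.9.0+"))  # prints 2.9.0+
print(parse_blast_version(""))                # prints None
```

Returning `None` lets the caller print a clear "could not determine BLAST version" message and check whether BLAST is on the `PATH` of the active environment.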

Can't find label error

Dear Shujun,

Please see this error:

Dependency checking: All passed!
Can't find label ALL at /data/home/qiushi_volumn1/programs/EDTA/EDTA.pl line 118.

perl version info:
This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-linux-thread-multi

Many thanks~
Your loyal fans

Job is always killed

Hi, Shujun,

I am testing edta on our school's computer. However, the job is always killed. Here is my script:
module load edta/20190108
module load ltrretriever/1.6

EDTA.pl -genome Zm-I-REFERENCE-FL-1.0.fa -species Maize -threads 2

Here is the error message:
########################################################

Extensive de-novo TE Annotator (EDTA) v1.3

Shujun Ou ([email protected])

########################################################

Tue Aug 6 13:57:01 EDT 2019 Dependency checking:
All passed!
Tue Aug 6 13:57:13 EDT 2019 Obtain raw TE libraries using various structure-based programs:
sh: line 1: 32154 Killed /apps/edta/20190108/edta/bin/genometools-1.5.10/bin/gt suffixerator -db Zm-I-REFERENCE-FL-1.0.fa -indexname Zm-I-REFERENCE-FL-1.0.fa -
Can't locate object method "end" via package "Thread::Queue" at /apps/edta/20190108/edta/bin/LTR_FINDER_parallel/LTR_FINDER_parallel line 115, line 10732.
cat: Zm-I-REFERENCE-FL-1.0.fa.finder.combine.scn: No such file or directory
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
grep: Zm-I-REFERENCE-FL-1.0.fa.retriever.scn: No such file or directory
Argument "" isn't numeric in numeric gt (>) at /apps/edta/20190108/edta/bin/LTR_retriever/LTR_retriever line 355.

ERROR: No candidate is found in the file(s) you specified.

cp: cannot stat ‘Zm-I-REFERENCE-FL-1.0.fa.LTRlib.fa’: No such file or directory
cp: cannot stat ‘Zm-I-REFERENCE-FL-1.0.fa.LTRlib.fa’: No such file or directory
Error: LTR results not found!

ERROR: Raw LTR results not found in Zm-I-REFERENCE-FL-1.0.fa.EDTA.raw/Zm-I-REFERENCE-FL-1.0.fa.LTR.raw.fa at /apps/edta/20190108/edta/EDTA.pl line 170.
slurmstepd: error: Detected 1 oom-kill event(s) in step 40042225.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I found that, in the LTR folder, Zm-I-REFERENCE-FL-1.0.fa.harvest.scn and Zm-I-REFERENCE-FL-1.0.fa.rawLTR.scn are empty.

Looking forward to your reply!

Best,

Ying

Double ID annotation?

Hello,
I'm not sure whether this comes from EDTA or RepeatMasker, but I ran the following on the EDTA output:

RepeatMasker -pa 4 -no_is -norna -nolow -div 40 -lib genome.sixLongest.fa.EDTA.TElib.fa -cutoff 225 -gff genome.FLYE.sixLongest.fa
buildSummary.pl genome.FLYE.sixLongest.fa.out > summary.tbl

and some sequences in the output repeat tables have a "doubled" ID like
TE_00001277_INT-int while the sequence ID is
>TE_00001277_INT#LTR/Gypsy

Any idea where the "int" after the "INT" comes from? I concede it seems absolutely benign, but I want to be sure it doesn't point to a bigger problem.

Thank you
