chenlianfu / geta

License: GNU General Public License v3.0

Makefile 0.13% Perl 98.38% Shell 0.50% R 0.15% Python 0.85%

geta's Issues

Can GETA be used without RNA-seq data?

Hi,
May I ask whether GETA can be used for annotation if RNA-seq data is not available?

Thank you
Huangchao

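TRF annotation of repeat sequences

Hello Dr. Chen!
I have a question about TRF annotation within GETA's repeat annotation. In the few mammalian annotations I have tried, a little under 10% of the sequence appears to be TRF (tandem repeat) sequence. Could GETA include this portion?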

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

TransDecoder-v5.3.0/sample_data/cufflinks_example/runMe.sh

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)
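
If the concern raised in that report applies, one defensive option is to request several hits per query and pick the best one downstream rather than relying on `-max_target_seqs 1`. The sketch below assumes BLAST tabular output (`-outfmt 6`) with the default twelve columns; the function name `best_hits` and the sample rows are hypothetical and not taken from this repository.

```python
import csv
import io

def best_hits(tabular_text):
    """Return {query_id: (subject_id, evalue, bitscore)} with the best hit per query.

    Assumes default -outfmt 6 columns: evalue is column 11, bitscore is column 12.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        current = best.get(query)
        # Prefer the higher bit score; break ties with the lower e-value.
        if current is None or (bitscore, -evalue) > (current[2], -current[1]):
            best[query] = (subject, evalue, bitscore)
    return best

# Hypothetical sample: two hits for q1, one hit for q2.
SAMPLE = "\n".join([
    "\t".join(["q1", "sA", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-20", "150"]),
    "\t".join(["q1", "sB", "99.0", "100", "1", "0", "1", "100", "1", "100", "1e-50", "200"]),
    "\t".join(["q2", "sC", "95.0", "80", "4", "0", "1", "80", "1", "80", "1e-30", "120"]),
])

if __name__ == "__main__":
    for query, hit in best_hits(SAMPLE).items():
        print(query, hit)
```

Selecting on bit score first keeps the choice independent of the order in which hits appear in the file.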

Use miniprot2 to map proteins accurately?

Could miniprot2 be used to map protein sequences accurately, as a substitute for genewise? It would offer more speed and accuracy.

Cannot wget RepBaseRepeatMaskerEdition-20181026

I cannot download the trf file using:
"wget --http-user=chenlianfu_china --http-password=u2o7rn https://www.girinst.org/server/RepBase/protected/repeatmaskerlibraries/RepBaseRepeatMaskerEdition-20181026.tar.gz -P ~/software/"

It seems that the author changed the password (u2o7rn).
Where should I find the trf?
Thanks.

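Problem at Step 6: CombineGeneModels #6.1

Hello Dr. Chen! While running GETA I hit a problem at cmdString3 of step 6.1 (first round of gene prediction integration: combining the three kinds of gene predictions, with the AUGUSTUS results as the primary source). Details follow.
Step 6: CombineGeneModels
~/biosoft/geta-2.6.1/bin/GFF3Clear --genome ~/workspace/annotation/output.tmp/genome.fasta --no_attr_add --coverage 0.8 combine.2.gff3 > geneModels.b.gff3 2> /dev/null
The problem is that this step has run for many hours without finishing and is stuck here, so the rest of the pipeline cannot proceed. Is this normal? If it still does not finish after a long time, can I comment out the code related to geneModels.b.gff3 so that the pipeline completes?

Best wishes!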

too many gene models by augustus

Hi Dr. Chen,
I have used the GETA pipeline several times and found that AUGUSTUS predicts too many gene models. Is there a good way to optimize this?

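RNA-seq mapping rate against CDS sequences is too low

Hello Dr. Chen!
When evaluating annotation quality, I aligned the RNA-seq reads used for annotation to the CDS sequences annotated by GETA (with bowtie2). The RNA-seq mapping rate was only 50%, whereas the mapping rate against the genome in step 2 (hisat2) was 95%. Is this normal?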

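out.homolog_prediction.gff3 is empty

Hello Dr. Chen! I annotated a genome with GETA, and in the final results the homology prediction file out.homolog_prediction.gff3 is empty. I checked the log file and found no errors. I also checked the logs in the out.tmp/4.homolog folder and found no error messages there either. The homolog_gene_region.tab file looks normal, but genewise.gff, genewise.gff3, genewise.start_stop_hints.gff and genewise.gene_id_with_stop_codon.txt are all empty. What went wrong, and how can I fix it? I look forward to your reply. Thank you!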

Runtime error at merge_repeatMasker_out.pl line 40

Hello, I would like some help. After starting the run I got the following errors. What is the cause, and how can I fix it?
No such file or directory at ~/geta-master/bin/merge_repeatMasker_out.pl line 40.
failed to execute: ~/geta-master/bin/merge_repeatMasker_out.pl ~/cevi.tmp/genome.fasta repeatMasker/RepeatMasker_out.out repeatModeler/RepeatModeler_out.out > genome.repeat.stats
CMD: ~/geta-master/bin/ParaFly -c ~/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/command.RepeatMasker.list -CPU 27 &> ~/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/ParaFly.log
CMD: ~/geta-master/bin/ParaFly -c ~/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/command.RepeatMasker.list -CPU 27 &> ~/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/ParaFly.log
Can not open file ~/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/seq_1/seq_1.fasta.out, No such file or directory at /geta-master/bin/para_RepeatMasker line 138.
Can not open file/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/seq_1/seq_1.fasta.out, No such file or directory at ~/geta-master/bin/para_RepeatMasker line 138.

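Do I need to install LTRfinder?

The error is: "failed to execute: RepeatModeler -pa 8 -database species -LTRStruct &> RepeatModeler.log".
I would also like to ask whether, after fixing the error, the GETA pipeline can resume from the previous results, and which command is needed for that.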

The AUGUSTUS model training step is very slow

With default parameters, this step has been running for 3 days and still has not finished.

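Glitch report for RepeatModeler (bug in the RepeatModeler version check)

While using the latest version of GETA I ran into a bug in how it calls RepeatModeler.
At line 327 of geta.pl,
the RepeatModeler version check contains a one-character error ($1 where $2 is intended), which causes RepeatModeler to fail.

if ( $RepeatModeler_version =~ m/(\d+)\.(\d+)\.(\d+)/ && ($1 < 2 or ( $1 == 2 && $1 == 0 && $3 <= 3)) )

It should be: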

if ( $RepeatModeler_version =~ m/(\d+)\.(\d+)\.(\d+)/ && ($1 < 2 or ( $1 == 2 && $2 == 0 && $3 <= 3)) )
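
The corrected condition is true exactly when the version is 2.0.3 or older. As a minimal sketch of the same test, the Python below compares the parsed version as a tuple, which avoids the per-component comparisons where the typo crept in. The function name `is_at_most_2_0_3` is hypothetical, and what GETA does in each branch is not shown here.

```python
import re

def is_at_most_2_0_3(version_text):
    """Return True when the first X.Y.Z found in version_text is <= 2.0.3."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        return False  # no recognizable version string
    version = tuple(int(part) for part in match.groups())
    # Tuple comparison is lexicographic: major first, then minor, then patch.
    return version <= (2, 0, 3)

if __name__ == "__main__":
    for text in ("1.0.11", "2.0.3", "2.0.4", "2.1.0"):
        print(text, is_at_most_2_0_3(text))
```

With the original typo, the inner clause required `$1 == 2` and `$1 == 0` at the same time, which can never hold, so versions 2.0.0 through 2.0.3 were not matched.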

geta

Dear Mr. Chen

The geta test data shows "Page Not Found"

thank you

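Two different genes annotated at the same position

Hello, when annotating with GETA I found positions where two different genes are annotated at the same location on the same strand, and there are quite a few such cases. How can this be resolved?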

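No transcriptome available: using GeMoMa results as input for AUGUSTUS

Hello Dr. Chen!
Thank you for this user-friendly annotation pipeline. It is very helpful!
I am currently annotating a mammalian genome. Because no transcriptome is available, I compared homology-based annotation tools and found that the GeMoMa annotation is very similar to that of closely related species. When the genewise results are used as the AUGUSTUS training set, however, the BUSCO score of the AUGUSTUS predictions is not very high, and the gene structure distribution also differs from that of closely related species.

So I am now trying to use the GeMoMa results as the input for AUGUSTUS. AUGUSTUS training takes two genewise input files, genewise.gff3 and genewise.start_stop_hints.gff. Could you provide a script that substitutes the GeMoMa results for those genewise results for training? I would be very grateful!

I look forward to your reply,
Liu Xiaogang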

Can not create file /Date/geta/VP_GENOME_final/annotation/VPXY/VPXX.tmp/6.combineGeneModels/paraAlternative_splicing_analysis.ge.tmp/seq340.base_depth.txt, Too many open files at /home/geta/soft/geta-master/bin/paraAlternative_splicing_analysis line 127, <IN> line 174922.
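
"Too many open files" generally means the process reached its limit on simultaneously open file descriptors. The sketch below inspects that limit and raises the soft limit up to the hard limit using Python's standard `resource` module (Unix only). The helper names are hypothetical and not part of GETA, and whether a higher limit is enough for this step is an assumption.

```python
import resource

def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit."""
    soft, hard = open_file_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return open_file_limits()[0]

if __name__ == "__main__":
    print("soft, hard:", open_file_limits())
```

A limit changed this way applies only to the Python process and its children. For a Perl script started from a shell, the analogous step is `ulimit -n` in the launching shell.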

Questions about the RepeatMasker results

There are lots of "N" and lowercase letters in the RepeatMasker results, even in regions that are not gaps. Could you check the repeat-masking part of your pipeline?

must be at least two fold cross validation

Hi, I got this error when I ran GETA on a 15 Mb fragment, but when I used the whole chromosome the error disappeared. Is there any size limitation in the software?
CMD: optimize_augustus.pl --species="arabidopsis" --rounds=5 --cpus=0 --kfold=0 --onlytrain=training.gb.onlytrain genes.gb.train.train > optimize.out
must be at least two fold cross validation at /public1/home/geta/software/Augustus-3.3.2/scripts/optimize_augustus.pl line 296.
Failed to execute: optimize_augustus.pl --species="arabidopsis" --rounds=5 --cpus=0 --kfold=0 --onlytrain=training.gb.onlytrain genes.g

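Problem found in the merged gff3 results

Hello Dr. Chen, after obtaining the final gff3 from a GETA annotation I found the following problem in some exons:

The exon and UTR lines marked with red boxes should be removed; I have confirmed that the CDS lines are correct.
Could you please check this? Thank you.

Best wishes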

Annotation tools for a contig-level assembly

Thanks to the developers of these scripts and bioinformatics software for their brilliant work.

I need to assess the function of contigs assembled from plant PacBio reads. Can anyone suggest a basic protocol for finding genes in contigs?

My worry is that I do not have a chromosome-level (scaffolded) assembly. Can these tools still work for my contig-level assembly?

Thanks

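Why not use StringTie results as the training set for AUGUSTUS?

Hello,
I have looked at many annotation pipelines. For transcriptome data they all take the long way round: de novo assembly, aligning the resulting ESTs back to the genome, and only then using them as the AUGUSTUS training set. Why not directly use the assemblies from tools such as StringTie as the AUGUSTUS training set? (GETA does not do this either; it finds transfrags itself.)
Best wishes!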

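Many amino acid sequences in the final protein.fasta contain multiple stop signals (*), and some gene positions in the gff3 annotation are negative

Hello Dr. Chen,
First of all, thank you for this excellent gene annotation software. After running the whole pipeline and checking the final results, I found that:
1. In the protein sequence file, many amino acid sequences contain two or more * characters produced by stop codons;
2. In the gff3 annotation file, some gene positions are annotated as negative values.
What might cause this, or is it a problem in my own analysis workflow? Thank you.

The workflow and parameters were as follows:
geta.pl --RM_species Embryophyta --pe1 355_1.fq.gz,292.1.clean_R1.fq.gz --pe2 355_2.fq.gz,292.1.clean_R2.fq.gz --protein homo.protein.fa --augustus_species 20240311 --out_prefix test --config conf_all_defaults.txt --cpu 40 --gene_prefix Ah01Ggene --HMM_db Pfam-AB.hmm reference_genome.fa

Protein sequence information: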

Ah01Ggene000854.t01 [Parent=Ah01Ggene000854] [Transcript_Ratio=100.00%] [Integrity=complete] [Source=homolog0776.t01]
MSKNRDKEPPLNFDPNIKKTVRRCQQQARAFRSAESLRDNSKEEAEVITMEPNNNNNQPKRTLDSYTAPNPTFYGSSIIVHPMNANNFELKPQLITLVQQDCQFYGLPRENPNLFISNFLQICDTVKTNRVYPDVYWLLLFSFTVRDQEKQWLDTQPQESLDTWDNVVSRFLNKFSPPQRVTNLTTDV* TFRQQEGAFLYETWERYKVMLKSVLPTCFQT* YSCRSFIMGLLRPPGPLWIILQEDPFIRKALKRL S* LR* LLTTTIYTLL* KSP* GKESWN* MLWIPLLLRIKPCLNR* MPLLNTWLDYKSQLLITKMLLMT* VVNFLKVRVMIMVNFPLNSLIT* AISPDLPIMIYFPRFIIRGGGITQILDGKINHRGNHTSTTTTTVLWVILIRIILTVTTDIFNPLNHIMYPPLLRNLLT* NL* LQNLPRILII* CRKPKYQLETWWFRWVN*

gff3 gene annotation information:
jcf7180006177890 GETA gene -2 1041 . - . ID=Ah01Ggene000129;Name=gene1203;Type=fair_gene_models_predicted_by_homolog;>
jcf7180006177890 GETA mRNA -2 1041 . - . ID=Ah01Ggene000129.t01;Parent=Ah01Ggene000129;Name=gene1203.mRNA;Type=fair_g>
jcf7180006177890 GETA five_prime_UTR 1030 1041 . - . ID=Ah01Ggene000129.t01.utr5p1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA exon 929 1041 . - . ID=Ah01Ggene000129.t01.exon1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS 929 1029 . - 0 ID=Ah01Ggene000129.t01.CDS1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA intron 855 928 0 - . ID=Ah01Ggene000129.t01.intron1;Parent=Ah01Ggene000129.t01;Supported_times=20>
jcf7180006177890 GETA exon 764 854 . - . ID=Ah01Ggene000129.t01.exon2;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS 764 854 . - 1 ID=Ah01Ggene000129.t01.CDS2;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA intron 643 763 0 - . ID=Ah01Ggene000129.t01.intron2;Parent=Ah01Ggene000129.t01;Supported_times=20>
jcf7180006177890 GETA exon -2 642 . - . ID=Ah01Ggene000129.t01.exon3;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS -2 642 . - 0 ID=Ah01Ggene000129.t01.CDS3;Parent=Ah01Ggene000129.t01;
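
Both symptoms reported above can be screened for mechanically. The sketch below assumes a standard tab-separated, nine-column GFF3 file (the listing above is shown with spaces as pasted) and flags features whose start is below 1 or whose end precedes the start. It also counts internal `*` characters in a protein sequence. The function names are hypothetical and this is not GETA's own validation code.

```python
def invalid_gff3_features(gff3_text):
    """Return (line_number, seqid, type, start, end) for out-of-range features."""
    problems = []
    for number, line in enumerate(gff3_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a nine-column feature line
        start, end = int(fields[3]), int(fields[4])
        # GFF3 coordinates are 1-based, so start must be >= 1 and <= end.
        if start < 1 or end < start:
            problems.append((number, fields[0], fields[2], start, end))
    return problems

def internal_stop_count(protein):
    """Count '*' characters other than a single terminal stop."""
    sequence = "".join(protein.split())  # drop whitespace inside the sequence
    if sequence.endswith("*"):
        sequence = sequence[:-1]
    return sequence.count("*")

if __name__ == "__main__":
    row = ["jcf7180006177890", "GETA", "exon", "-2", "642", ".", "-", ".", "ID=x"]
    print(invalid_gff3_features("\t".join(row)))
    print(internal_stop_count("MSK* TFR* YS*"))
```

Running these checks on the final `gff3` and protein FASTA gives a list of affected records to attach to a report like this one.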

Help: error when running step 3 (Transcript)

Step 3: Transcript
PWD: ~/3_geta/3/p/out.tmp/3.transcript
CMD(Skipped): ~/software/geta-geta-2.4.12//bin/split_sam_from_non_aligned_region ../2.hisat2/hisat2.sorted.sam splited_sam_out 10 > splited_sam_files.list
Thu Feb 23 15:12:44 2023: CMD: ParaFly -c command.sam2transfrag.list -CPU 30 &> /dev/null
CMD(Skipped): ParaFly -c command.sam2transfrag.list -CPU 30 &> /dev/null
Thu Feb 23 15:41:25 2023: CMD: ~/software/geta-geta-2.4.12//TransDecoder-v5.5.0/TransDecoder.LongOrfs -m 100 -G universal -t transfrag.strand.fasta -S &> /dev/null
failed to execute: ~/software/geta-geta-2.4.12//TransDecoder-v5.5.0/TransDecoder.LongOrfs -m 100 -G universal -t transfrag.strand.fasta -S &> /dev/null
Dr. Chen, what causes this, and how can it be fixed?

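diamond error: Computing alignments... Error: std::bad_alloc

Is this error a memory problem, and how can it be solved? Can it be avoided by adjusting the size of the genome chunks, and if so, how should that be set? Is it possible to go back to using blast instead of diamond?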

error:para_RepeatMasker

failed to execute: para_RepeatMasker --out_prefix RepeatModeler_out --lib RM_/.classified --cpu 4 --tmp_dir para_RepeatMasker.tmp /home/changchuanjun/gene_structure_annotation_20240224/Fuji_Ral-Fuji_Del_20240226/Fuji_Ral_Hap1/out.tmp/genome.fasta &> para_RepeatMasker.log

I want to ask how to solve this problem?

the accuracy of gene structure

Hi Dr. Chen,
I have some questions about the accuracy of the annotated gene structures.
I annotated a mammalian genome with the GETA pipeline and compared the gene structure distributions in "eukaryotic_gene_model_statistics.stats" with those of related species. Most of the items, such as the average lengths of genes, CDS and introns, do not match the relatives. What should I do to correct the results?

I hope to get your reply soon,
thanks!

Some issues related to links, files and commands

Hi there,

I am trying your code for annotation of a de novo genome assembly. And there are some issues that I wanted to report, so you can improve your code. I do not have the experience required to fix them in your code, but I am going around these issues manually (which is sometimes confusing being a beginner).

1- Links to files: read.1.fastq and reads.2.fastq (in step 1. trimmomatic) and augustus.gff3 (in step 5. augustus) aren't recognized by later commands.

2- Command: rm -rf path/to/augustus/config/species/file removes the configuration file (in step 5. augustus) and the program augustus can't find it after that.

3- File: augustus.1.gff3 contains a first line with the string "All commands completed successfully :-)", which causes the following commands to appear in the command.combineGeneModels.list (in step 6. combineGeneModels):

_/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
_plus.2.gff3
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_minus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
minus.2.gff3

And the commands in the file generate the following FailedCommands (in step 6. combineGeneModels):

__plus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus.2.gff3
_plus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus.2.gff3
_minus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
plus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)

4- The website: 122.205.95.116/geta/ in the README.md file does not work. If you open that link, the new page says: "Sorry, Page Not Found".

And I also wanted to ask you: How can we cite your code?

Thanks and have a nice day,
Dessiree Zerpa

BGM2AT

Hello,
When I run BGM2AT in step 5 (augustus), some of the scripts it calls, such as gff2gbSmallDNA.pl and randomSplit.pl, are missing. Where can I find them? Please help.

Questions for Geta

Dear Chen,
I am Won Yim at the University of Nevada, Reno.
Xingtan Zhang introduced your software to me, and I have finished annotating multiple genomes with GETA.
I forked it and added HPC GridRunner in place of ParaFly so that it runs on our HPC system: https://hpcgridrunner.github.io/ . This might help others make use of HPC clusters.
Sometimes geneModels2AugusutsTrainingInput generated an error due to empty training input, so I updated it a little.
geneModels2AugusutsTrainingInput
I think GETA might benefit from adding Braker2 and splan. If you don't mind, our team could help with that part.
Honestly, we are working on a similar project and want to integrate it through Snakemake.

Modification of non-creatable array value attempted, subscript -1 at /home/geta/soft/geta-master/bin/homolog_prediction.03HitToGenePrediction line 130, <IN> line 4

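Hello Dr. Chen, when I run GETA I get this error: "Modification of non-creatable array value attempted, subscript -1 at /home/geta/soft/geta-master/bin/homolog_prediction.03HitToGenePrediction line 130, line 4". What is going on? My first run had no problems, but the next two runs, on different genomes, both hit this error. One of them used the same sample.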
