geta's People

Contributors

chenlianfu

geta's Issues

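TRF annotation of repeat sequences

Hello Prof. Chen!
I have a question about TRF annotation in geta's repeat-sequence annotation. In the several mammalian genomes I have annotated, a little under 10% of the sequence appears to be tandem repeats identified by TRF. Could geta include this part?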

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)

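Step 6: CombineGeneModels #6.1 problem

Hello Prof. Chen! I ran into a problem with cmdString3 in step 6.1 of geta (first round of gene prediction integration: the three sets of gene predictions are integrated, with the AUGUSTUS results as the primary source). Details follow.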
Step 6: CombineGeneModels
~/biosoft/geta-2.6.1/bin/GFF3Clear --genome ~/workspace/annotation/output.tmp/genome.fasta --no_attr_add --coverage 0.8 combine.2.gff3 > geneModels.b.gff3 2> /dev/null
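This step has been running for many hours without finishing and is stuck here, so the rest of the pipeline cannot proceed. Is this normal? If it still does not finish after a long time, can I comment out the code related to geneModels.b.gff3 in order to complete the pipeline?

Best regards!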

too many gene models by augustus

Hi Dr. Chen,
I have used the geta pipeline several times, and I found that AUGUSTUS predicts too many gene models. Is there a good way to optimize this?

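Low RNA-seq alignment rate against CDS sequences

Hello Prof. Chen!
While evaluating annotation quality, I aligned the RNA-seq reads used for annotation to the CDS sequences annotated by geta (with bowtie2), and the RNA-seq alignment rate was only 50%. However, the alignment rate to the genome with hisat2 in step 2 was 95%. Is this normal?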

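out.homolog_prediction.gff3 is empty

Hello Prof. Chen! I annotated a genome with geta, and in the final results the homology prediction file out.homolog_prediction.gff3 is empty. I checked the log file and found no errors. I also checked the logs in the out.tmp/4.homolog folder and found no error messages there either. The homolog_gene_region.tab file looks normal, but genewise.gff, genewise.gff3, genewise.start_stop_hints.gff and genewise.gene_id_with_stop_codon.txt are all empty. What went wrong, and how can I fix it? I look forward to your reply. Thank you!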

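Runtime error at merge_repeatMasker_out.pl line 40

Hello, I would like to ask for help. After I started the run, the following errors appeared. What is the cause, and how can I fix it?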
No such file or directory at ~/geta-master/bin/merge_repeatMasker_out.pl line 40.
failed to execute: ~/geta-master/bin/merge_repeatMasker_out.pl ~/cevi.tmp/genome.fasta repeatMasker/RepeatMasker_out.out repeatModeler/RepeatModeler_out.out > genome.repeat.stats
CMD: /geta-master/bin/ParaFly -c/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/command.RepeatMasker.list -CPU 27 &> ~/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/ParaFly.log
CMD: ~/geta-master/bin/ParaFly -c ~/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/command.RepeatMasker.list -CPU 27 &> ~/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/ParaFly.log
Can not open file ~/cevi.tmp/0.RepeatMasker/repeatMasker/para_RepeatMasker.tmp/seq_1/seq_1.fasta.out, No such file or directory at /geta-master/bin/para_RepeatMasker line 138.
Can not open file
/cevi.tmp/0.RepeatMasker/repeatModeler/para_RepeatMasker.tmp/seq_1/seq_1.fasta.out, No such file or directory at ~/geta-master/bin/para_RepeatMasker line 138.

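Does LTRfinder need to be installed?

The error is, for example: "failed to execute: RepeatModeler -pa 8 -database species -LTRStruct &> RepeatModeler.log".
I would also like to ask whether, after fixing the error, the geta pipeline can resume from the earlier results, and which command is needed for that.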

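Bug report for RepeatModeler (bug in the RepeatModeler version check)

While using the latest version of Geta, I ran into a bug in the call to RepeatModeler.
At line 327 of geta.pl,
the RepeatModeler version check contains a one-character mistake that causes RepeatModeler to fail: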

if ( $RepeatModeler_version =~ m/(\d+)\.(\d+)\.(\d+)/ && ($1 < 2 or ( $1 == 2 && $1 == 0 && $3 <= 3)) )

It should be:

if ( $RepeatModeler_version =~ m/(\d+)\.(\d+)\.(\d+)/ && ($1 < 2 or ( $1 == 2 && $2 == 0 && $3 <= 3)) )
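
The one-character difference between the two conditions above can be illustrated outside Perl. The following is a minimal Python sketch, not the code in geta.pl. The function names are hypothetical, and it assumes the check is meant to detect RepeatModeler 2.0.3 or older.

```python
import re

VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_version(text):
    """Return (major, minor, patch) from the first x.y.z found, or None."""
    match = VERSION_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_at_most_2_0_3_buggy(text):
    """Mirror of the reported condition: the major number is tested twice."""
    version = parse_version(text)
    if version is None:
        return False
    major, _minor, patch = version
    return major < 2 or (major == 2 and major == 0 and patch <= 3)


def is_at_most_2_0_3_fixed(text):
    """Mirror of the proposed fix: the second test uses the minor number."""
    version = parse_version(text)
    if version is None:
        return False
    major, minor, patch = version
    return major < 2 or (major == 2 and minor == 0 and patch <= 3)


if __name__ == "__main__":
    # Print both results side by side for a few example version strings.
    for text in ("1.0.11", "2.0.1", "2.0.3", "2.0.4"):
        print(text, is_at_most_2_0_3_buggy(text), is_at_most_2_0_3_fixed(text))
```

In the buggy form the inner clause requires the major number to be both 2 and 0, so it can never be true for any 2.x.y release. The fixed form is equivalent to the tuple comparison `parse_version(text) <= (2, 0, 3)`.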

geta

Dear Mr. Chen

The geta test data page shows "Page Not Found".

Thank you.

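Two different genes annotated at the same position

[image]
Hello, when annotating with geta I found cases where two different genes are annotated at the same position on the same strand, and there are quite a few of them. How can this be resolved?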

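Without transcriptome data: using GeMoMa results as the input to AUGUSTUS

Hello Prof. Chen!
Thank you for this user-friendly annotation pipeline. It has been very helpful!
I am currently annotating a mammalian genome. Since no transcriptome data is available, I compared homology-based annotation tools and found that the GeMoMa annotation is very similar to that of closely related species. In contrast, when the genewise results are used as the AUGUSTUS training set, the BUSCO score of the AUGUSTUS predictions is not very high, and the gene structure distribution also differs from that of closely related species.

So I am now using the GeMoMa results as the input to AUGUSTUS. AUGUSTUS training takes two input files from genewise: genewise.gff3 and genewise.start_stop_hints.gff. Could you provide a script that substitutes the GeMoMa results for the genewise results in training? Much appreciated!

Looking forward to your reply,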
刘晓刚

Questions about the RepeatMasker results

There are lots of "N" and lowercase letters in the RepeatMasker results, even in regions that are not gaps. Could you check the RepeatMasker part of your pipeline?

must be at least two fold cross validation

Hi, I got this error when I ran geta on a 15 Mb fragment, but when I used the whole chromosome the error disappeared. Is there any limitation in the software?
CMD: optimize_augustus.pl --species="arabidopsis" --rounds=5 --cpus=0 --kfold=0 --onlytrain=training.gb.onlytrain genes.gb.train.train > optimize.out
must be at least two fold cross validation at /public1/home/geta/software/Augustus-3.3.2/scripts/optimize_augustus.pl line 296.
Failed to execute: optimize_augustus.pl --species="arabidopsis" --rounds=5 --cpus=0 --kfold=0 --onlytrain=training.gb.onlytrain genes.g

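Problem found in the merged gff3 results

Hello Prof. Chen, after obtaining the final gff3 from GETA annotation, I found that some exons show the following problem:

[image]

The exon and UTR lines marked with the red box should be deleted. I have confirmed that the CDS lines are correct.
Could you please confirm this? Thank you.

Best regards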

Annotation tools for contig-level assembly

Thanks for the brilliant work of these scripts and bioinformatic software developers.

I need to assess the function of contigs assembled from plant PacBio reads. Can anyone suggest a basic protocol for finding genes in contigs?

My concern is that I do not have a chromosome-level assembly built from scaffolds. Can these tools still work for my contig-level assembly?

Thanks

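Why not use StringTie results as the AUGUSTUS training set?

Hello,
I have looked at many annotation pipelines. For transcriptome data, they all prefer to take a long detour: de novo assembly, then aligning the ESTs back to the genome before using them as the AUGUSTUS training set. Why not directly use the assemblies from tools such as StringTie as the AUGUSTUS training set? (geta does not use them either, and instead finds transfrags itself.)
Best regards!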

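In the final protein.fasta file, many amino acid sequences contain multiple stop signals (*), and some gene positions in the gff3 annotation file are negative

Hello Prof. Chen,
First, thank you very much for your excellent gene annotation software. After running the whole pipeline and checking the final results, I found that:
1. In the protein sequence file, many amino acid sequences contain two or more * characters produced by stop codons;
2. In the gff3 annotation file, some gene positions are annotated as negative values.
What might be causing this? Or is it a problem in my own analysis workflow? Thank you.

The workflow and parameters are as follows: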
geta.pl --RM_species Embryophyta --pe1 355_1.fq.gz,292.1.clean_R1.fq.gz --pe2 355_2.fq.gz,292.1.clean_R2.fq.gz --protein homo.protein.fa --augustus_species 20240311 --out_prefix test --config conf_all_defaults.txt --cpu 40 --gene_prefix Ah01Ggene --HMM_db Pfam-AB.hmm reference_genome.fa

Protein sequence information:

Ah01Ggene000854.t01 [Parent=Ah01Ggene000854] [Transcript_Ratio=100.00%] [Integrity=complete] [Source=homolog0776.t01]
MSKNRDKEPPLNFDPNIKKTVRRCQQQARAFRSAESLRDNSKEEAEVITMEPNNNNNQPKRTLDSYTAPNPTFYGSSIIVHPMNANNFELKPQLITLVQQDCQFYGLPRENPNLFISNFLQICDTVKTNRVYPDVYWLLLFSFTVRDQEKQWLDTQPQESLDTWDNVVSRFLNKFSPPQRVTNLTTDV* TFRQQEGAFLYETWERYKVMLKSVLPTCFQT* YSCRSFIMGLLRPPGPLWIILQEDPFIRKALKRL S* LR* LLTTTIYTLL* KSP* GKESWN* MLWIPLLLRIKPCLNR* MPLLNTWLDYKSQLLITKMLLMT* VVNFLKVRVMIMVNFPLNSLIT* AISPDLPIMIYFPRFIIRGGGITQILDGKINHRGNHTSTTTTTVLWVILIRIILTVTTDIFNPLNHIMYPPLLRNLLT* NL* LQNLPRILII* CRKPKYQLETWWFRWVN*

GFF3 gene annotation information:
jcf7180006177890 GETA gene -2 1041 . - . ID=Ah01Ggene000129;Name=gene1203;Type=fair_gene_models_predicted_by_homolog;>
jcf7180006177890 GETA mRNA -2 1041 . - . ID=Ah01Ggene000129.t01;Parent=Ah01Ggene000129;Name=gene1203.mRNA;Type=fair_g>
jcf7180006177890 GETA five_prime_UTR 1030 1041 . - . ID=Ah01Ggene000129.t01.utr5p1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA exon 929 1041 . - . ID=Ah01Ggene000129.t01.exon1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS 929 1029 . - 0 ID=Ah01Ggene000129.t01.CDS1;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA intron 855 928 0 - . ID=Ah01Ggene000129.t01.intron1;Parent=Ah01Ggene000129.t01;Supported_times=20>
jcf7180006177890 GETA exon 764 854 . - . ID=Ah01Ggene000129.t01.exon2;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS 764 854 . - 1 ID=Ah01Ggene000129.t01.CDS2;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA intron 643 763 0 - . ID=Ah01Ggene000129.t01.intron2;Parent=Ah01Ggene000129.t01;Supported_times=20>
jcf7180006177890 GETA exon -2 642 . - . ID=Ah01Ggene000129.t01.exon3;Parent=Ah01Ggene000129.t01;
jcf7180006177890 GETA CDS -2 642 . - 0 ID=Ah01Ggene000129.t01.CDS3;Parent=Ah01Ggene000129.t01;
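
Both symptoms reported above can be screened for automatically. The following is a minimal Python sketch, not part of geta. The function names are hypothetical, and it assumes a standard tab-separated GFF3 file and protein sequences in which `*` marks a stop codon.

```python
def find_bad_coordinates(gff3_lines):
    """Return (line_number, seqid, type, start, end) for suspicious rows."""
    problems = []
    for line_number, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a feature row
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            continue  # non-numeric coordinates are ignored in this sketch
        # GFF3 coordinates are 1-based, so a valid row has 1 <= start <= end.
        if start < 1 or end < start:
            problems.append((line_number, fields[0], fields[2], start, end))
    return problems


def count_internal_stops(protein):
    """Count '*' characters, ignoring whitespace and one terminal stop."""
    sequence = "".join(protein.split())
    if sequence.endswith("*"):
        sequence = sequence[:-1]
    return sequence.count("*")


if __name__ == "__main__":
    rows = [
        "##gff-version 3\n",
        "jcf7180006177890\tGETA\tgene\t-2\t1041\t.\t-\t.\tID=Ah01Ggene000129\n",
        "jcf7180006177890\tGETA\tCDS\t929\t1029\t.\t-\t0\tID=Ah01Ggene000129.t01.CDS1\n",
    ]
    for problem in find_bad_coordinates(rows):
        print("bad coordinates:", problem)
    print("internal stops:", count_internal_stops("MSK* TFR* YS*"))
```

A start coordinate below 1 often means a feature was extended past the beginning of a short contig, but that is only a guess here. The sketch only locates such rows and does not attempt to repair them.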

Help: error when running step 3 (Transcript)

Step 3: Transcript
PWD: ~/3_geta/3/p/out.tmp/3.transcript
CMD(Skipped): ~/software/geta-geta-2.4.12//bin/split_sam_from_non_aligned_region ../2.hisat2/hisat2.sorted.sam splited_sam_out 10 > splited_sam_files.list
Thu Feb 23 15:12:44 2023: CMD: ParaFly -c command.sam2transfrag.list -CPU 30 &> /dev/null
CMD(Skipped): ParaFly -c command.sam2transfrag.list -CPU 30 &> /dev/null
Thu Feb 23 15:41:25 2023: CMD: ~/software/geta-geta-2.4.12//TransDecoder-v5.5.0/TransDecoder.LongOrfs -m 100 -G universal -t transfrag.strand.fasta -S &> /dev/null
failed to execute: ~/software/geta-geta-2.4.12//TransDecoder-v5.5.0/TransDecoder.LongOrfs -m 100 -G universal -t transfrag.strand.fasta -S &> /dev/null
Professor, what causes this, and how can it be resolved?

error:para_RepeatMasker

failed to execute: para_RepeatMasker --out_prefix RepeatModeler_out --lib RM_/.classified --cpu 4 --tmp_dir para_RepeatMasker.tmp /home/changchuanjun/gene_structure_annotation_20240224/Fuji_Ral-Fuji_Del_20240226/Fuji_Ral_Hap1/out.tmp/genome.fasta &> para_RepeatMasker.log
[image]
How can I solve this problem?

the accuracy of gene structure

Hi Dr. Chen,
I have some questions about the accuracy of the annotated gene structure.
I annotated a mammalian genome with the geta pipeline and compared the gene structure distributions in "eukaryotic_gene_model_statistics.stats" with those of a related species. Most of the items, such as the average lengths of genes, CDS and introns, do not match the related species. What should I do to correct the results?

I hope to get your reply soon,
thanks!

Some issues related to links, files and commands

Hi there,

I am trying your code for annotation of a de novo genome assembly. And there are some issues that I wanted to report, so you can improve your code. I do not have the experience required to fix them in your code, but I am going around these issues manually (which is sometimes confusing being a beginner).

1- Links to files: read.1.fastq and reads.2.fastq (in step 1. trimmomatic) and augustus.gff3 (in step 5. augustus) aren't recognized by later commands.

2- Command: rm -rf path/to/augustus/config/species/file removes the configuration file (in step 5. augustus) and the program augustus can't find it after that.

3- File: augustus.1.gff3 contains a first line with the string "All commands completed successfully :-)", which causes the following commands to appear in the command.combineGeneModels.list (in step 6. combineGeneModels):

_/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
_plus.2.gff3
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_minus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
minus.2.gff3

And the commands in the file generate the following FailedCommands (in step 6. combineGeneModels):

__plus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus_intron.gff > combineGeneModels_tmp/All commands completed successfully. :-)
_plus.2.gff3
_plus_genewise.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_augustus.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
/geta-master//bin/combineGeneModels --overlap 30 --min_augustus_transcriptSupport_percentage 10.0 --min_augustus_intronSupport_number 1 --min_augustus_intronSupport_ratio 0.01 combineGeneModels_tmp/All commands completed successfully. :-)
_minus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_plus_transfrag.gff3 combineGeneModels_tmp/All commands completed successfully. :-)
_minus.2.gff3
_minus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)
plus.1.gff3 2> combineGeneModels_tmp/All commands completed successfully. :-)

4- The website: 122.205.95.116/geta/ in the README.md file does not work. If you open that link, the new page says: "Sorry, Page Not Found".

And I also wanted to ask you: How can we cite your code?

Thanks and have a nice day,
Dessiree Zerpa
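
Regarding item 3 in the issue above, a stray status message in a GFF3 file can be filtered out before the file is used downstream. The following is a minimal Python sketch of one possible workaround, not a fix taken from geta. The function name is hypothetical, and it assumes that every valid feature row has exactly nine tab-separated columns.

```python
def keep_gff_lines(lines):
    """Keep comment/directive lines and rows with nine tab-separated columns."""
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped.startswith("#"):
            kept.append(line)  # comments and directives pass through
        elif len(stripped.split("\t")) == 9:
            kept.append(line)  # looks like a GFF3 feature row
        # anything else (blank lines, log messages) is dropped
    return kept


if __name__ == "__main__":
    example = [
        "All commands completed successfully. :-)\n",
        "##gff-version 3\n",
        "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    ]
    for line in keep_gff_lines(example):
        print(line, end="")
```

Filtering on the column count rather than on the exact message text also drops other non-GFF lines, which may or may not be desirable depending on what else ends up in the file.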

BGM2AT

Hello,
When I run BGM2AT in step 5 (augustus), some of the scripts it calls are missing, such as gff2gbSmallDNA.pl and randomSplit.pl. Where can I find these? Help…

Questions for Geta

Dear Chen,
I am Won Yim at the University of Nevada, Reno.
Xingtan Zhang introduced your software to me, and I have finished multiple genomes with Geta.
I forked it and added HPC GridRunner in place of ParaFly so that it can use our HPC system: https://hpcgridrunner.github.io/ . This might help others make use of HPC.
Sometimes geneModels2AugusutsTrainingInput fails with an error due to empty training input, so I updated it a little.
geneModels2AugusutsTrainingInput
I think Geta might benefit from adding Braker2 and splan. If you don't mind, our team could help with that part.
Honestly, we are working on a similar project and want to integrate it through Snakemake.

Modification of non-creatable array value attempted, subscript -1 at /home/geta/soft/geta-master/bin/homolog_prediction.03HitToGenePrediction line 130, <IN> line 4

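Hello Prof. Chen, when I ran geta I got this error: "Modification of non-creatable array value attempted, subscript -1 at /home/geta/soft/geta-master/bin/homolog_prediction.03HitToGenePrediction line 130, <IN> line 4". What is going on? The first time I ran it there was no problem, but the next two runs, on different genomes, both hit this error. One of them used the same sample.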

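Comparing annotation results with the gene structures of closely related species

Hello Prof. Chen!
I have a question about methods for validating annotation results.
Besides BUSCO, the transcriptome read mapping rate and the proportion of proteins with functional annotation, another commonly used method is to compare the gene structure distributions of the annotation with those of closely related species (for example, annotation results from BGI usually include a comparison plot covering the distributions of CDS, intron, exon and mRNA). So after annotating with geta, in addition to the checks above, I also plotted the distributions with the script from the Glean package. The results show considerable redundancy in the short gene range (<1 kb). I tried several ways to remove it, but in the end the structure still did not match that of closely related species well.

Do you have a good way to make the final geta annotation match the distributions of closely related species?
Thank you!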

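Third-generation full-length transcripts and too many annotated genes

Hello Prof. Chen!
As the cost of third-generation sequencing falls, full-length third-generation transcriptome data is being used by more and more people. Some genomes also contain rather long genes that need full-length data to improve annotation accuracy. Can the geta pipeline incorporate full-length third-generation information during annotation?
In addition, when annotating with geta I found that too many genes were annotated. Published annotations for the species I study generally contain about 40,000 genes, but annotating the redundancy-reduced genome gave me more than 80,000 genes, and short genes (protein length within 100 aa) number only about 8,000. What causes this, and how can it be resolved? Thank you!
Best regards!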
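
The share of short proteins mentioned above can be checked directly from the protein FASTA file. The following is a minimal Python sketch, not part of geta. The function names are hypothetical, and the 100 aa threshold is taken from the question and passed as a parameter.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_short_proteins(lines, max_length=100):
    """Return (short, total), where short counts proteins <= max_length aa."""
    short = total = 0
    for _header, sequence in read_fasta(lines):
        total += 1
        # A trailing '*' marks the stop codon and is not an amino acid.
        if len(sequence.rstrip("*")) <= max_length:
            short += 1
    return short, total


if __name__ == "__main__":
    example = [">gene1.t01", "M" * 50 + "*", ">gene2.t01", "M" * 150, "M" * 10]
    short, total = count_short_proteins(example)
    print(f"{short} of {total} proteins are at most 100 aa long")
```

To run it on a real file, pass an open file handle instead of the example list, for instance `count_short_proteins(open(path))`, where `path` points to the protein FASTA file.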

Error with a 4 Gb genome

Hello Prof. Chen!
A genome of about 4 Gb fails at this step with the following error:
Died at /w/00/g/g05/WangYubo/software/RepeatModeler-2.0.1/BuildDatabase line 333.
failed to execute: BuildDatabase -name species -engine ncbi /w/00/g/g05/user681/annotation/dym.hap1/out.tmp/genome.fasta
What is the cause?
