hamstr's People

Contributors

ebersber, holgerbgm, jurudo, trvinh

hamstr's Issues

oneSeq not working

Hello,

I followed the example provided in the README but I get the following error:

~/tools/HaMStR$ ./bin/oneSeq.pl -seqFile=data/infile.fa -seqid=P83876 -refspec=HUMAN@9606@1 -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global
Use of uninitialized value $force in numeric eq (==) at ./bin/oneSeq.pl line 1459.

MSYMLPHLHNGWQVDQAILSEEDRVVVIRFGHDWDPTCMKMDEVLYSIAEKVKNFAVIYLVDITEVPDFNKMYELYDPCTVMFFFRNKHIMIDLGTGNNNKINWAMEDKQEMVDIIETVYRGARKGRGLVVSPKDYSTKYRY
Your sequence was named: nlTazcr

No FAS filter for core-orthologs set.
Annotation tools found in /home/ubuntu/annotation_fas!

#########################
--> Running annoFAS.pl

--> acquiring sequence lengths
...done: sequence lengths are calculated for input file.
--> starting: cast
--> starting: decodeanhmm
decodeanhmm 1.1f
Copyright (C) 1998 by Anders Krogh
Tue Feb 11 15:42:51 2020

Model in file "/home/ubuntu/annotation_fas/TMHMM/TMHMM2.0.model" parsed successfully.
--> starting: COILS2
       1 sequences      142 aas        0 in coil
--> starting: signalp.pl
--> starting: seg
--> starting: pfam_scan.pl
Starting hmmscan now...
hmmscan finished.
Creating output file...
finished.
Deleting temporary output file...
--> starting: smart_scan_v4.pl
Starting hmmscan now...
hmmscan finished.
Creating output file...
finished.
Deleting temporary output file...
--> annotation finished.
tool start: 11/02/2020 15:42:51:0
tool end  : 11/02/2020 15:42:54:0
#####################

Building up the taxonomy tree. Start 1581435774

--------------------- WARNING ---------------------
MSG: Could not merge the lineage of 40276 with the rest of the tree
---------------------------------------------------
Can't call method "add_Descendent" on an undefined value at /home/ubuntu/anaconda3/envs/hamstr/lib/site_perl/5.26.2/Bio/Tree/TreeFunctionsI.pm line 529.

Best,
Cata

scoreThreshold

Currently, the scoreThreshold is constitutively on. Implement a switch to deactivate it.
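
Something along these lines could work (Python sketch with invented option names; the real oneSeq interface is Perl and may use different flags):

# Hypothetical sketch (option names invented for illustration; oneSeq itself
# is Perl): make the score-threshold filter switchable instead of always on.
import argparse

parser = argparse.ArgumentParser(description="ortholog search options (sketch)")
parser.add_argument("--scoreThreshold", dest="score_threshold",
                    action="store_true", default=True,
                    help="filter candidate orthologs by score (default: on)")
parser.add_argument("--noScoreThreshold", dest="score_threshold",
                    action="store_false",
                    help="switch the score-based filter off")
parser.add_argument("--scoreCutoff", type=float, default=10.0,
                    help="allowed deviation from the core score distribution (illustrative)")
args = parser.parse_args()

if args.score_threshold:
    print(f"score filter active, cutoff = {args.scoreCutoff}")
else:
    print("score filter disabled")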

Maybe a bug: Linked source for ....not found

Hello,
I have been using this command for a long time and it works well with HaMStR v.13.2.10. I installed HaMStR v.13.2.11 via 'conda install -c bionf hamstr' on a new system and then ran into this error. Is this a bug?
I am using a core-ortholog set that we trained ourselves. Since the input files are protein sequences, there are only *_prot.fa files in the blast_dir. It seems that this hamstr does not recognize the "-protein" option.
Thank you!

command:
hamstr -hmmset=arthNreg2 -protein -cleartmp -cpu 12 -central -strict -refspec=drome_4 -sequence_file=Paranesidea_sp_croco.fas -taxon=Paranesidea_sp -representative
error:
checking for reference fasta files: Linked source for /home/depengli/HaMStR/blast_dir/drome_4/drome_4.fa not found in /mnt/d/Croco_hamstr_workdir! at /home/depengli/HaMStR/bin/hamstr line 1172.

accelerated mode

Uni-directional search using core-specific cutoffs (bit score, FAS score, ...).
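
Roughly the idea (Python sketch; the hit fields and cutoff values are illustrative, not the actual HaMStR data structures):

# Sketch: accept hmmsearch hits directly when they pass cutoffs learned from
# the core-ortholog group, instead of confirming each hit with a reciprocal
# BLAST against the reference species. All names and thresholds are invented.
from dataclasses import dataclass

@dataclass
class Hit:
    seq_id: str
    bit_score: float
    fas_score: float

@dataclass
class CoreCutoffs:
    min_bit: float   # e.g. lowest bit score observed among the core orthologs
    min_fas: float   # e.g. lowest FAS score observed among the core orthologs

def accept_unidirectional(hits, cutoffs):
    """Keep hits that satisfy all core-specific cutoffs (no re-BLAST step)."""
    return [h for h in hits
            if h.bit_score >= cutoffs.min_bit and h.fas_score >= cutoffs.min_fas]

hits = [Hit("cand1", 250.0, 0.92), Hit("cand2", 80.0, 0.40)]
print(accept_unidirectional(hits, CoreCutoffs(min_bit=120.0, min_fas=0.75)))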

original sequence not found

I added my species to the reference databases using the addTaxon1s command, but I always get "original sequence not found" when I choose the new species as the reference species with the command "h1s --seqFile infile.fa --seqName test3 --refspec ARIFI@158543@1 --force". Can you help me solve this?

FAS: Input

Reading in large input files with FAS can take a moment. This is not critical for single FAS runs, but it can add up to noticeably longer runtimes over the course of a oneseq run.
Adding a new input format that can be read faster might be something to work on for a future release, maybe even binary files for the oneseq library species.
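
One possible direction (Python sketch, under the assumption that the annotations live in a plain-text, tab-separated file; the real FAS format may differ): parse once, cache the result as a binary pickle next to the input, and reuse the cache on later runs.

# Illustrative sketch only: keep a pickled copy of the parsed annotations so
# repeated FAS runs during a oneSeq job can load the binary cache instead of
# re-parsing the text file each time.
import os
import pickle

def parse_text_annotations(path):
    """Placeholder for the existing text parser (format is an assumption)."""
    features = {}
    with open(path) as fh:
        for line in fh:
            protein_id, feature = line.rstrip("\n").split("\t")[:2]
            features.setdefault(protein_id, []).append(feature)
    return features

def load_annotations(path):
    cache = path + ".pkl"
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache, "rb") as fh:
            return pickle.load(fh)              # fast binary load
    annotations = parse_text_annotations(path)  # slow text parse, done once
    with open(cache, "wb") as fh:
        pickle.dump(annotations, fh)
    return annotations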

hmmsearch ranking

Allow the ranking of hmmsearch hits by either global score (current implementation) or by the score of the best domain.
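
A sketch of the switch, assuming the hits were written with hmmsearch --tblout (in that table, column 6 holds the full-sequence score and column 9 the score of the best single domain):

# Rank hmmsearch hits either by full-sequence score or by best-domain score.
def rank_hits(tblout_path, by="full"):
    """Return (hit name, score) pairs sorted by the chosen score."""
    col = 5 if by == "full" else 8          # 0-based index into the row
    hits = []
    with open(tblout_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            hits.append((fields[0], float(fields[col])))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# rank_hits("hmmsearch.tblout", by="domain")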

For loop not working

Hello

I tried running a loop for different proteins using the following command:

for PROTEIN in *fa; do oneSeq -seqFile=$PROTEIN -seqname=$PROTEIN -refspec=HUMAN@9606@1 -minDist=genus -maxDist=phylum -coreOrth=5 -cleanup -global -cpu=20 -countercheck; done

It runs with the first protein until it creates the *.extended.fa file, then moves on to the second protein without producing any further output for the first one. If I run the proteins one by one, it works fine.

Best,
Cata

data redundancy and unused script(s)

  • Packages in the Bio folder also exist in the BioPerl/Bio folder.
  • The taxonomy downloaded from the server still contains alt data. Will these old data be needed anywhere? The tar file taxdump.tar.gz can also be deleted.
  • The script /bin/fas/Pfam/Pfam-hmms/count_clan_numbers.pl seems to be unused?

Problem during annotation

Dear Ingo,

During the annotation of one dataset I got the following error:

Problem occurred: 2!
problem 1!
problem 1!
[... the messages "problem 1!" and "Problem occurred: 2!" alternate and repeat dozens of times ...]
Problem occurred: 2!
finished.
Deleting temporary output file...
--> annotation finished.
tool start: 23/02/2020 13:13:19:0
tool end  : 24/02/2020 4:12:16:0
#####################

Any clues as to what this means?

Thanks!

Best,
Cata

Error running sample file

Hello

I installed HaMStR using conda. I ran setup_hamstr | tee log.txt and it appears that there was no error during the setup.

I tried running the sample file but I get the following error.

There was an error running HaMStR v.13.2.10

VERSION:	HaMStR v.13.2.10

HOSTNAME	lmu-thesis1


USER DEFINED PARAMTERS (inc. default values)

	 -append:	not set
	 -blastpath:	/home/ubuntu/tools/HaMStR/blast_dir/
	 -checkCoorthologsRef:	not set
	 -cleartmp:	not set
	 -concat:	not set
	 -est:	not set
	 -eval_blast:	0.0001
	 -eval_hmmer:	0.0001
	 -filter:	F
	 -hit_limit:	not set
	 -hmm:	not set
	 -hmmpath:	/home/ubuntu/tools/HaMStR/core_orthologs/
	 -hmmset:	ppkdOcJ
	 -intron:	k
	 -longhead:	not set
	 -nonoverlapping_cos:	not set
	 -outpath:	/home/ubuntu/tools/HaMStR/output/ppkdOcJ
	 -protein:	1
	 -rbh:	not set
	 -refspec:	HUMAN@9606@1
	 -relaxed:	not set
	 -representative:	not set
	 -reuse:	not set
	 -sequence_file:	DROME@[email protected]
	 -show_hmmsets:	not set
	 -sort_global_align:	not set
	 -strict:	not set
	 -taxon:	DROME@7227@1
	 -ublast:	0
INFILE PROCESSING

	A modified infile DROME@[email protected] already exists. Using this one
	Newlines from the infile have been removed

CHECKING FOR PROGRAMS

	check for blastp succeeded
	check for hmmsearch succeeded

CHECKING FOR HMMs

	running HaMStR with all hmms in /home/ubuntu/tools/HaMStR/core_orthologs//ppkdOcJ/hmm_dir
	check for /home/ubuntu/tools/HaMStR/core_orthologs//ppkdOcJ/ppkdOcJ.fa succeeded

CHECKING TAXON NAME

	using default taxon DROME@7227@1 for all sequences

CHECKING FOR REFERENCE TAXON

	 Reference species for the re-blast: HUMAN@9606@1

CHECKING FOR BLAST DATABASES

	check for /home/ubuntu/tools/HaMStR/blast_dir//HUMAN@9606@1/HUMAN@9606@1 succeeded

CHECKING FOR REFERENCE FASTA FILES

	Removing line breaks from /home/ubuntu/tools/HaMStR/genome_dir/HUMAN@9606@1/HUMAN@[email protected].
FATAL: Problems running the script nentferner.pl

PROGRAM OPTIONS

	hmmsearch will run with an e-value limit of 0.0001
	re-blast hit_limit: none applied
	Blast will run with an evalue limit of 0.0001

	check for low complexity filter setting succeeded. Chosen value is F
	HaMStR was called without the -representative option. More than one ortholog may be identified per core-ortholog group!Evaluation of predicted HaMStR orthologs.
Error: Could not find /home/ubuntu/tools/HaMStR/data/ppkdOcJ.extended.fa

Do you know what could be causing this error?

Thank you!

Regards,
Cata

check existing annotation before calculating FAS

I noticed that the FAS calculation is not immediately cancelled even if there is no annotation available for one of the proteins (either seed or query), so unnecessary time is spent on whatever processing follows. I have some (thousands of) cases where it took minutes to do something and gave me nothing :-)
I suggest checking for the availability of the annotations before doing any further processing and giving a proper error message that says, e.g., which protein has not been annotated.
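
A rough sketch of the pre-check (Python; the weight_dir layout and the per-species JSON file name are assumptions for illustration, not the actual annotation format):

# Verify that an annotation exists for both seed and query before starting the
# scoring, and name the missing protein in the error message.
import os
import sys

def require_annotation(weight_dir, species, protein_id):
    anno_file = os.path.join(weight_dir, species + ".json")   # assumed layout
    if not os.path.exists(anno_file):
        sys.exit(f"ERROR: no annotation file for species {species} ({anno_file})")
    # assumed: one entry per protein id inside the per-species annotation file
    with open(anno_file) as fh:
        if f'"{protein_id}"' not in fh.read():
            sys.exit(f"ERROR: protein {protein_id} of {species} has not been annotated")

# require_annotation("weight_dir", "HUMAN@9606@1", "P83876")
# ...only then start the FAS comparison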

CAST not working

Hello

I am manually creating a new data set. In step 6 (Create the annotation files for your taxon with the provided Perl script) I get the following error:

Can't exec "/home/ubuntu/annotation_fas/CAST/cast": No such file or directory at /home/ubuntu/anaconda3/envs/hamstr/lib/python3.8/site-packages/greedyFAS/annoFAS.pl line 909.
Use of uninitialized value $castOut in scalar chomp at /home/ubuntu/anaconda3/envs/hamstr/lib/python3.8/site-packages/greedyFAS/annoFAS.pl line 910.
Use of uninitialized value $castOut in concatenation (.) or string at /home/ubuntu/anaconda3/envs/hamstr/lib/python3.8/site-packages/greedyFAS/annoFAS.pl line 911.

I used the following command:

annoFAS --fasta=/home/ubuntu/tools/HaMStR/genome_dir/TEST@00001@1/TEST@[email protected] --path=/home/ubuntu/tools/HaMStR/weight_dir --name=TEST@00001@1

It seems that the CAST binary is not compiled for our system, and I couldn't find CAST v1.0 on their webpage.

Best,
Cata

potential of using outdated data/tools

Shipping taxonomy data together with the preprocessed genome data can lead to the use of invalid NCBI taxonomy IDs, which causes problems when visualizing the output with PhyloProfile. I suggest we should:

  • download the latest NCBI taxonomy when hamstr is installed for the first time
  • create a script that checks for outdated tax IDs in the preprocessed genome data and either removes that species or replaces the invalid ID with the new one (see the sketch at the end of this issue)

The same can be applied to the annotation tools (Pfam, SMART, TMHMM, ...), but this requires automated tests that check whether a new tool version is compatible with the current data (for example, whether any functions we use have been deprecated in the new version).

Only the preprocessed genome data (blast_dir, genome_dir, weight_dir) should be downloaded from our server.
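
A sketch of what such a check could look like (Python), assuming the SPECIES@taxid@version naming of the genome_dir folders and an extracted NCBI taxdump with nodes.dmp:

# Collect the taxonomy IDs encoded in the genome_dir species names and report
# those that no longer appear in NCBI's nodes.dmp.
import os

def valid_taxids(nodes_dmp):
    """Read all current tax IDs from an extracted NCBI taxdump nodes.dmp."""
    with open(nodes_dmp) as fh:
        return {line.split("\t|\t")[0] for line in fh}

def outdated_species(genome_dir, nodes_dmp):
    known = valid_taxids(nodes_dmp)
    stale = []
    for species in os.listdir(genome_dir):        # e.g. HUMAN@9606@1
        parts = species.split("@")
        if len(parts) == 3 and parts[1] not in known:
            stale.append(species)
    return stale

# print(outdated_species("genome_dir", "taxdump/nodes.dmp"))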

temporary FAS input files

When calculating the FAS score, oneseq extracts the feature architecture of the found ortholog from the species input, creating a lot of temporary files. With the seed_id/query_id options of FAS this is no longer necessary and should be changed.
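
Conceptually, the change would look like this (Python sketch; run_fas and extract_sequence are hypothetical stand-ins, and the seed_id/query_id names are taken from this issue rather than a confirmed FAS interface):

import tempfile

def extract_sequence(fasta_path, seq_id):
    """Hypothetical helper: return the single FASTA record for seq_id."""
    return f">{seq_id}\nSEQUENCE\n"

def run_fas(seed, query, seed_id=None, query_id=None):
    """Hypothetical stand-in for the actual FAS call."""
    print("FAS:", seed, query, seed_id, query_id)

def score_old(species_fasta, ortholog_id, seed_fasta):
    # current behaviour: extract the found ortholog into a temporary FASTA file
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        tmp.write(extract_sequence(species_fasta, ortholog_id))
        query_fasta = tmp.name
    return run_fas(seed=seed_fasta, query=query_fasta)

def score_new(species_fasta, ortholog_id, seed_fasta, seed_id):
    # proposed behaviour: pass the complete per-species file together with the
    # sequence ids, so no temporary extraction is needed
    return run_fas(seed=seed_fasta, query=species_fasta,
                   seed_id=seed_id, query_id=ortholog_id)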
