pacificbiosciences / angel


Robust Open Reading Frame prediction (ANGLE re-implementation)

License: Other

Python 79.48% Pep8 1.30% Cython 19.22%

angel's Issues

angel_train.py hangs with 12 cores

angel_train.py hangs with 12 cores, but when I reduced it to 5 it was able to process to completion

Seq "tostring()" method finally outdated

Just a heads up:
Dumb orf prediction no longer works as of the newest Biopython version 1.73 since ".tostring()" method is outdated.
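
For anyone patching this locally: `str(seq)` is the documented replacement for `Seq.tostring()`. The sketch below is only an illustration. The helper name `seq_to_str` and the `FakeSeq` stand-in class are hypothetical (neither is part of ANGEL or Biopython); the stand-in is there so the example runs without Biopython installed.

```python
def seq_to_str(seq):
    """Return the sequence as a plain string.

    Hypothetical helper, not part of ANGEL. str() works on Biopython Seq
    objects in old and new releases, so it can replace seq.tostring().
    """
    return str(seq)


class FakeSeq:
    """Stand-in for Bio.Seq.Seq so this sketch runs without Biopython."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data


print(seq_to_str(FakeSeq("ATGAAATAA")))
```

In the ANGEL sources this would amount to replacing each `x.tostring()` call with `str(x)`; which files need the change is not stated in this report.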

Some problems about ANGEL.cds

Hi:
I have two problems with ANGEL.cds.
This is my command line:

dumb_predict.py --use_rev_strand remainpolyT.fasta remainpolyT.dumb --cpus 10

angel_make_training_set.py remainpolyT.dumb.final remainpolyT.dumb.final.traning --cpus 10

angel_train.py remainpolyT.dumb.final.traning.cds remainpolyT.dumb.final.traning.utr remainpolyT.dumb.final.classifier.pickle

angel_predict.py remainpolyT.fasta remainpolyT.dumb.final.classifier.pickle remainpolyT --use_rev_strand --output_mode=best --cpus 10

Then I found results like this:

>LF210511/f1p0/4173|m.790 type:suspicious-NA len:257 strand:- pos:827-1597
ATGAGTAACCGCCATCTGCCGGCTGGGCAGAATATACAGGGAGGATCTGGCGTCCTAGGTGCCGACATGGTCGGTCCTGGAGGGCCTCGTCGGAGGCAGCCTCCTCCCTTTGTTCCCCAGTCCCAGTACCAGCAGCAACATCATCACCATCAAGCCGTAAATCACATGTATAACAACAACTACATGAACTATGGACAGCAGCAGTATTATGGATACCCGCCGCAGTATCAGACAGGTCACTACCAGAACGCTCAGTACCACAACGCGCAGTATCAAGGTGGACAATACCAAGCTGCACAGTTCCAGAATGGCCAGTACCAAAACGCACAGTACCACAACGCCGGAATGCCTTCACCCGGTGCTTATATGGGCTACCAGCAGCACTACGGACGATCGCCGCCCGTTCACCAGTTTGTCCCCATGTCTGGTGTGAGCGTACCCCCGAGCTTCCCAACCCGCCCAGCTCAGCAACAATCTCCTGCTCTGCCGACTCAGCCTCCTGCTCCAGCCTCACTTCCACCCCAGACTCCTACTTCAACCCACTCGTCGCAGATAATTCCTACTTCAACCCCCCCGGTCACGCAGGAGACTGAGCCAGCACCCCCCGCTCCTCCTGTTGCCCCCGCTGAGCCCCCACGACAACCTTCACCCGTTCCTGTCCCTGCTGCTGCTGCTGTTCCTGCCCCTGTTCATGTTCATGTTCATATTCCTGTTCCACAGGAACCATTCCGTGCACCTGTAAGTCTTGGCAGTTTCAATGTATTAAACTAA
>LF210511/f1p0/4173|m.791 type:suspicious-NA len:275 strand:- pos:2-826
ACTAACTTCGCTCAGCTGCCATGGTACTCTCACCCGGATGAAAAGTTCCCTGTTCGAACTAAAAGGCCGGGGCGATGGAGGAAGCGTCTCAATGCGGACAGTGCAAATGTTTCCCTGCCGGCTATTGACCAACATAACGCTGCTGCAGAGCAGGCCAGCGTTCCCGAAGCCAGCTCTACCGAACCTTCTGTCTCGGCCCTTACACCCGCGACATCAACGGCTCCATCGGAGGCTGCGGCAACTCCTCGCCAGTCTTCCGAGACTCCTGCTTCTGTTCAGCAACGATCACGCGCCAACACCGCCACCAGCGCTACCTCAACTTCGACGAACCGTCCTGCCACACGCTCCTCCGCTACTCCCGCCCCTGCTCTTCCATCGCTTCCTAAGGCAAACACTAAGGATGCTAAGCCTGCACGTGCTGAAAAGCCGGTAAACGGCGACGCAGCTACCGAAAGTGCCCCTGAGCAGGAGGTCACCGCTGAAGATTCCGAGAAGCCCGCGGAGTCTGAATCAACCGCTGCTGGGCCAGCTCCTGCTGTCAAAGCTCCACCTTCTAGTTGGGCGAAGCTTTTCTCGAAGCCCGCTTCTGCAGCTGCTGGAAAGACTGAGGAGTCTAATGGCGCCGCTCCCGTTGACACTGTTGCTAATGGCCGTGCCACCGAAAGCCCTGCTGGAACCCCTAATGGAGCTGCTCCCAGCTTCTCGAAAGTTAACGCCAACTCCGTTGCGGAGGCTATTCACACGTTCCATGTTGGTCTCGCGGATCAAGTTTCATTCCTCGAGCCCCGCGGTCTGATCAACACCGGGAACATGTGTTACATGAAC
>LF210511/f1p0/4173|m.792 type:suspicious-NA len:136 strand:- pos:2332-2739
ATGCCCAAGTACAAGTTGATTAGCGTGGTGTACCATCATGGTAAGAACGCTAGTGGTGGACATTACACTGTCGATGTGCGACGACAGGAAGGGCGCGAGTGGATTCGTATTGATGATACTTCCATCCGCCGAGTTCGAAGTGAAGATGTCGCTGAGGGCGGCGAAGAGGAGGAAGTAAAGAATACTCGTAAGGATGGCTCTTCATTGGGCAACCGGTTCGGTGCTGTTCTGGACGAAGACGCTGGAGATGATGACGGATGGAGCAAGGTCACTAGCCCTGCTGGAGGAGCAAAGAAATGGAGCAGCGTTGCCAACGGTACCAACGGCACTCCCAAGGCCGCCAAGCCGATCAAGGATAACATCAAGGACAACAAGGTTGCCTACCTGCTCTTCTACCAACGAGTATAA

First:
I set the option --output_mode=best. Why does LF210511/f1p0/4173 have three CDS?
Second:
Shouldn't all predicted CDS begin with ATG?
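
Both questions can be checked mechanically on the .cds file. The following is a minimal sketch, assuming the header layout shown above, where the ORF ID is the first whitespace-separated token and ends in `|m.<number>`. The function names are made up for this example and are not ANGEL functions.

```python
from collections import defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def summarize_cds(lines):
    """Group CDS records by transcript and flag start/stop codons.

    Returns {transcript_id: [(orf_id, starts_with_ATG, ends_with_stop), ...]}.
    """
    by_transcript = defaultdict(list)
    for header, seq in read_fasta(lines):
        orf_id = header.split()[0]
        # Assumption: ORF IDs look like "<transcript>|m.<n>".
        transcript = orf_id.rsplit("|m.", 1)[0]
        by_transcript[transcript].append(
            (orf_id, seq.startswith("ATG"), seq[-3:] in STOP_CODONS)
        )
    return dict(by_transcript)


demo = [
    ">t1|m.1 type:suspicious-NA len:3 strand:- pos:1-12",
    "ATGAAACCCTAA",
    ">t1|m.2 type:suspicious-NA len:3 strand:- pos:20-28",
    "ACTAACTTC",
]
for transcript, orfs in summarize_cds(demo).items():
    print(transcript, len(orfs), orfs)
```

On a real file, pass an open file handle instead of `demo`. Transcripts with more than one entry and entries without a leading ATG then show up directly in the result.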

Fatal error at dumb_predict.py

Hi, I am getting a fatal error. Is this an installation error or is something wrong with my input file? I used a fasta.gz of polished isoforms. Thanks.

Traceback (most recent call last):
File "/isg/shared/apps/angel/2.7/bin/dumb_predict.py", line 4, in <module>
__import__('pkg_resources').run_script('Angel==2.7', 'dumb_predict.py')
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
exec(code, namespace, namespace)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/dumb_predict.py", line 18, in <module>
transdecoder_main(args.fasta_filename, args.output_prefix, args.min_aa_length, args.use_rev_strand, args.use_firstORF, args.cpus)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/DumbORF.py", line 169, in transdecoder_main
result = predict_longest_ORFs(seq, min_aa_length, use_firstORF) # result is {best_frame: [(best_flag, best_s, best_e)]}
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFutils.py", line 17, in g
raise TypeError, "Input string must consist of only ATCG!"
TypeError: Input string must consist of only ATCG!
running CD-HIT to generate non-redundant set....

Fatal Error:
Failed to open the database file
Program halted !!
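
The TypeError says that some sequence contains characters other than A, T, C and G. One way to narrow this down is to scan the input before running dumb_predict.py. The sketch below is a hedged illustration, not ANGEL code: the function names are hypothetical, it treats lowercase bases as valid (an assumption), and it reads .gz files through Python's gzip module. Whether dumb_predict.py itself accepts gzipped input is not confirmed anywhere in this report.

```python
import gzip


def open_text(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_non_atcg(path):
    """Return {record_id: sorted offending characters} for bad records."""
    bad = {}
    record_id = None
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                parts = line[1:].split()
                record_id = parts[0] if parts else ""
            elif line and record_id is not None:
                # Assumption: case does not matter, so compare upper-cased bases.
                extra = set(line.upper()) - set("ATCG")
                if extra:
                    bad.setdefault(record_id, set()).update(extra)
    return {rid: sorted(chars) for rid, chars in bad.items()}
```

An empty result means every record is pure ATCG. If the scan is clean on the decompressed file, trying the uncompressed FASTA as input would be a reasonable next step.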

angel_train.py error

angel_train.py test.FG.dumb.final.traning.cds test.FG.dumb.final.traning.utr test.FG.dumb.final.classifier.pickle --cpus 10

Traceback (most recent call last):
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/bin/angel_train.py", line 4, in <module>
    __import__('pkg_resources').run_script('Angel==2.4', 'angel_train.py')
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/EGG-INFO/scripts/angel_train.py", line 17, in <module>
    ANGEL_training(args.cds_filename, args.utr_filename, args.output_pickle, num_workers=args.cpus)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 97, in ANGEL_training
    data_neg = get_data_parallel(o_all, utr, [0, 1, 2], num_workers)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 61, in get_data_parallel
    obj = queue.get(timeout=60)
  File "/disk/luping/tools/Python-2.7.14/lib/python2.7/multiprocessing/queues.py", line 132, in get
    raise Empty
Queue.Empty

I don't know how to fix it!

Feature request: --firstORF instead of longest ORF

In many cases the first ORF instead of the longest ORF is the one that is translated.

Add a --firstORF option to ANGEL.
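
To make the difference concrete, here is a small sketch of "first ORF" versus "longest ORF" selection on the forward strand. It is a simplified illustration written for this request, not ANGEL's implementation: the function names, the 0-based coordinates and the rule that an ORF runs from the first ATG after the previous in-frame stop to the next stop are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa_length=1):
    """Return complete forward-strand ORFs as (start, end, aa_length).

    Coordinates are 0-based and end-exclusive; the end includes the stop
    codon, while aa_length does not count it.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if aa_len >= min_aa_length:
                    orfs.append((start, i + 3, aa_len))
                start = None
    return sorted(orfs)


def pick_orf(seq, use_first_orf=False, min_aa_length=1):
    """Pick the most 5' ORF or the longest ORF, or None if there is none."""
    orfs = find_orfs(seq, min_aa_length)
    if not orfs:
        return None
    if use_first_orf:
        return orfs[0]                       # smallest start position
    return max(orfs, key=lambda orf: orf[2])  # largest aa_length


seq = "ATGAAATAACATGCCCGGGTTTTGA"
print(pick_orf(seq, use_first_orf=True))  # (0, 9, 2)
print(pick_orf(seq))                      # (10, 25, 4)
```

In this toy sequence the first ORF (2 aa) and the longest ORF (4 aa) sit in different frames, which is exactly the case where the two options disagree.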

A problem with running ANGEL

I've been learning ANGEL these days, and when I ran 'angel_train.py' I encountered the following error:

Traceback (most recent call last):
File "/wtmp/user181/software/Anaconda/ANGEL/angel_train.py", line 22, in <module>
ANGEL_training(args.cds_filename, args.utr_filename, args.output_pickle, num_workers=args.cpus)
File "/wtmp/user181/software/Anaconda/ANGEL/build/lib.linux-x86_64-2.7/Angel/SmartORF.py", line 96, in ANGEL_training
data_pos = get_data_parallel(o_all, coding, [0], num_workers)
File "/wtmp/user181/software/Anaconda/ANGEL/build/lib.linux-x86_64-2.7/Angel/SmartORF.py", line 61, in get_data_parallel
obj = queue.get(timeout=60)
File "/wtmp/user181/software/Anaconda/anaconda2/lib/python2.7/multiprocessing/queues.py", line 132, in get
raise Empty
Queue.Empty

The command I ran is below:

python /wtmp/user181/software/Anaconda/ANGEL/angel_train.py /wtmp/user181/data/cu_hq_pac/isoform.fl_nr.dumb.final.training.cds /wtmp/user181/data/cu_hq_pac/isoform.fl_nr.dumb.final.training.utr /wtmp/user181/data/cu_hq_pac/isoform.fl_nr.dumb.final.classifier.pickle --cpu 2

I don't know how to deal with this problem, so can you give me some suggestions to help me?
Thanks a lot!

Some questions about cds analysis

Hi,

These days I'm working on a small transcriptome project where I use the ANGEL software to predict the potential cds sequences.

I first ran the Pacbio-based reads through the TOFU software and obtained 5027 transcript sequences. Then I did LncRNA analysis and obtained 948 LncRNA sequences. I then followed the steps in the ANGEL manual and have already got some cds results.

However, there are some questions that seem a little bit confusing to me:

Question 1: The total number of cds sequences is 5354. I found that there are some transcripts that have more than 1 cds sequence. So I chose the longest cds in this type of transcript. The final number of cds is 4712.
However, a simple calculation shows that the sum of cds and LncRNA is 4712 + 948 = 5660, which is much larger than 5027. I'm wondering why the numbers come out like this with correct inputs.

Question 2: In your ANGEL manual, you listed the potential result types like this:
<seq_id> type:<tag>-<completeness> len:<ORF length (aa)> strand:<strand> pos:<CDS range>
'Where tag is confident, likely, or suspicious for ANGEL predictions, and dumb for dumb ORF predictions.
completeness is either complete, 5partial, 3partial, or internal based on the presence or absence of start and stop codons.'
However, I found that there is another type of completeness named NA. What does it mean?

Question 3: As I mentioned in Question 1, there are some transcripts that have more than 1 predicted cds sequence. What criterion should I use to choose among them? The first one? The longest one? Or should I keep all of them?

Question 4: Based on your research experience, could you please give me some suggestion about the proper order of the analysis that I have done, which means the order of LncRNA analysis, CDS prediction and novel transcript annotation?

Thank you very much for this good software. I would very much appreciate your answer.

Best
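
For readers with the same questions, the header format quoted in Question 2 can be parsed with a regular expression, and the "keep the longest CDS per transcript" choice from Question 1 can be reproduced in a few lines. This is a sketch only: the class and function names are made up here, the `|m.<n>` suffix on ORF IDs is taken from headers posted in other issues, and the pattern accepts any completeness string (including NA) without saying what NA means.

```python
import re
from typing import NamedTuple, Optional

HEADER_RE = re.compile(
    r"^>?(?P<seq_id>\S+)\s+type:(?P<tag>[^-\s]+)-(?P<completeness>\S+)"
    r"\s+len:(?P<aa_len>\d+)\s+strand:(?P<strand>[+-])"
    r"\s+pos:(?P<start>\d+)-(?P<end>\d+)"
)


class OrfHeader(NamedTuple):
    seq_id: str
    tag: str
    completeness: str
    aa_len: int
    strand: str
    start: int
    end: int


def parse_header(line: str) -> Optional[OrfHeader]:
    """Parse one ANGEL-style FASTA header; return None if it does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return OrfHeader(d["seq_id"], d["tag"], d["completeness"],
                     int(d["aa_len"]), d["strand"],
                     int(d["start"]), int(d["end"]))


def longest_per_transcript(headers):
    """Keep the longest ORF per transcript (ties keep the first one seen)."""
    best = {}
    for h in headers:
        transcript = h.seq_id.rsplit("|m.", 1)[0]  # assumes "<transcript>|m.<n>"
        if transcript not in best or h.aa_len > best[transcript].aa_len:
            best[transcript] = h
    return best


lines = [
    ">PB.1.1|m.1 type:likely-complete len:120 strand:+ pos:10-372",
    ">PB.1.1|m.2 type:suspicious-NA len:257 strand:- pos:827-1597",
]
parsed = [h for h in map(parse_header, lines) if h is not None]
print(longest_per_transcript(parsed)["PB.1.1"].seq_id)  # PB.1.1|m.2
```

Counting the keys of the returned dictionary gives the number of distinct transcripts with at least one prediction, which is the number to compare against the transcript total.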

A problem with running angel_train.py

Dear Magdoll,
Hello!
When I ran angel_train.py to train the ANGEL classifier, I ran into trouble. The command needs more than 400G of memory to run, but our computer's memory is limited to 124G. So, what can I do to reduce the memory used in this step?
There are 18000 sequences and 28262787 bases in total in the dumb.final.cds file.
Thanks a lot!

issues with build

using a virtual environment, followed the instructions to build angel and got the error message:
"
running build
running build_py
running build_ext
building 'Angel.c_ORFscores' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -g -I/usr/include/python2.7 -c src/c_ORFscores.cpp -o build/temp.linux-x86_64-2.7/src/c_ORFscores.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from /usr/local/include/assert.h:5:0,
from /usr/include/python2.7/Python.h:56,
from src/c_ORFscores.cpp:16:
/usr/local/include/except.h:15:32: error: conflicting declaration 'typedef struct Except_Frame_T* Except_Frame_T'
typedef struct Except_Frame_T *Except_Frame_T;
^
/usr/local/include/except.h:15:16: note: previous declaration as 'struct Except_Frame_T'
typedef struct Except_Frame_T *Except_Frame_T;
^
/usr/local/include/except.h:17:18: error: field 'prev' has incomplete type 'Except_Frame_T'
Except_Frame_T prev;
^
/usr/local/include/except.h:16:8: note: definition of 'struct Except_Frame_T' is not complete until the closing brace
struct Except_Frame_T {
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1"

doesn't really give any clues about how to fix, anyone have an idea?

angel_predict.py takes forever with --use_rev_strand option

Hi,
I'm working on a transcriptome that uses both PacBio and Illumina reads, but the --use_rev_strand option makes the process last for weeks. After a couple of weeks it is still running without much progress.
I'm using ANGEL version 2.7 and the command is:
angel_predict.py file.fasta filename.dumb.final.classifier.pickle filename_ANGEL --use_rev_strand --output_mode=best --min_angel_aa_length 30 --min_dumb_aa_length 30
Thanks in advance,
Lior

All transcripts strand are forward (+)

Hi @Magdoll,
I have successfully run ANGEL and got results. I'm confused about why all entries in the 'AD_Angel.ANGEL.pep' file are on the same strand (+). In the initial fasta file some are on the reverse strand, for example PB.3.2 (shown in the screenshot). By the way, I didn't use the "--use_rev_strand" parameter. Is that the reason? I'm not sure.
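
As background for this question: an ORF on the minus strand is only visible when the reverse complement of the transcript is scanned. The snippet below illustrates that idea in general terms. It is not taken from ANGEL, and whether "--use_rev_strand" works exactly this way internally is an assumption.

```python
# Translation table for complementing DNA; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "TTACAT"
# Read as given, this has no ATG. Its reverse complement is a start codon
# followed by a stop codon.
print(reverse_complement(plus_strand))  # ATGTAA
```

A predictor that only reads the sequence as given can therefore only report "+" ORFs, which would be consistent with the output described above.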

Can't extract all .cds predictions with a simple grep - only type:confident

Hi,

Here's my code for running angel_predict.py in output_mode=best and output_mode=all respectively:
a) angel_predict.py transcripts.fasta dumb.final.classifier.pickle best_predictions --use_rev_strand --output_mode=best
b) angel_predict.py transcripts.fasta dumb.final.classifier.pickle all_predictions --use_rev_strand --output_mode=all

my output best_predictions.cds and all_predictions.cds has a few type:suspicious predictions for the same contig and I can't extract their fasta headers with the standard grep "^>" command; I only get the headers that have the type:confident in them as output.

Therefore, my questions are:

Why does --output_mode=best give more than one prediction per contig?
Are types that are = Confident encoded differently, therefore making them hidden for grep detection?

Kind regards,
Emily

problem running dumb_predict.py -"Cannot find cd-hit! Abort!"

ANGEL classifier training: not UTR file

Hi @nlhepler @Magdoll
“angel_train.py takes a CDS FASTA file and a UTR FASTA file and outputs a trained classifier pickle file.”
If I don't have a UTR file, for example for the species 'saccharomyces_cerevisiae', can I still use ANGEL to do Open Reading Frame prediction?

scipy required

one more installation required prior to ANGEL working.

pip install scipy

needs to be run prior to running the angel classifier step (after dumb and creating the non-redundant training set)
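
A quick way to check for this before starting a long run is to test whether the modules can be imported. This is a sketch, not an official requirements list: the module names below are only the ones that appear in this report or in tracebacks posted in other issues (scipy, sklearn, Bio), plus numpy as an assumption.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed dependency list; adjust to whatever the install notes actually say.
print(missing_modules(["numpy", "scipy", "sklearn", "Bio"]))
```

An empty list means all of them are importable in the current environment.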

Questions about ANGEL, thanks!

Hi!

I have studied the ANGEL pipeline, but some steps confuse me. My questions are as follows:

1) The ANGLE paper uses error-free data to train the classifier and error-containing data to estimate the parameters of the Markov chains, but ANGEL has only one training step. Is this because ANGEL uses only one training data set?
2) ANGEL first produces dumb ORFs and then creates a non-redundant training set. Does this assume that all sequences are protein-coding? There are many non-coding sequences in a genome, so are these two steps reasonable?
3) If the answer to 2) is that they are not reasonable, can we use UTR and CDS sequences annotated by RefSeq as input to angel_train.py? And would that make dumb ORF prediction unnecessary?

Thanks for any suggestion!

Best wishes!

UnicodeDecodeError at angel_predict.py

Hi @Magdoll,
I'm running the ANGEL v3.0 with the example data and I got the following error:

$angel_predict.py test.fa MCF7_2015.dumb.final.training.pickle test_angel
Reading classifer pickle: MCF7_2015.dumb.final.training.pickle
/root/anaconda3/envs/py37/lib/python3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.ensemble.weight_boosting module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
  warnings.warn(message, FutureWarning)
/root/anaconda3/envs/py37/lib/python3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.tree.tree module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
  warnings.warn(message, FutureWarning)
Traceback (most recent call last):
  File "/root/ANGEL/angel_predict.py", line 22, in <module>
    distribute_ANGEL_predict(args.fasta_filename, args.output_prefix, args.classifier_pickle, args.cpus, args.min_angel_aa_length, args.min_dumb_aa_length, args.use_rev_strand, args.output_mode, args.max_angel_secondORF_distance)
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/Angel-3.0-py3.7-linux-x86_64.egg/Angel/SmartORF.py", line 254, in distribute_ANGEL_predict
    a = load(f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 1: ordinal not in range(128)

Any advice will be appreciated.
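
This kind of UnicodeDecodeError is typical when Python 3 unpickles a file that was written by Python 2. If that is the case here, the standard library offers the `encoding` argument of `pickle.load`. The sketch below shows that general technique with a hypothetical helper name. It is not a patch for SmartORF.py, and it does not guarantee that scikit-learn objects pickled with an older scikit-learn release will load.

```python
import pickle


def load_legacy_pickle(path):
    """Load a pickle that may have been written by Python 2.

    encoding="latin1" maps Python 2 byte strings (including NumPy array
    buffers) to Python 3 objects without raising UnicodeDecodeError.
    """
    with open(path, "rb") as handle:  # pickles must be opened in binary mode
        return pickle.load(handle, encoding="latin1")
```

If loading still fails after this, the likelier remedy is to retrain the classifier in the same environment that will run angel_predict.py.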

Some problem about CD-HIT when using angel_make_training_set.py

Hello,
After I finished the Dumb ORF prediction step, three files were generated: ANGEL.utr, ANGEL.pep and ANGEL.cds. The ANGEL.cds file contains the nucleotide sequences. However, when I then ran angel_make_training_set.py, it ran the CD-HIT program, and the following is the command:
"Program: CD-HIT, V4.7 (+OpenMP), Jun 27 2017, 17:58:25
Command: cd-hit -T 8 -M 0 -i ANGEL.cds -o ANGEL.nr90.cds -c 0.90 -n 5"
I would like to ask whether I can use the 'cd-hit-est' command in place of "cd-hit". Also, when the input file is ANGEL.cds, it takes a lot of time to run.
Thanks very much!

angel_train.py hanging with refseq sequences

Dear Magdoll,

I am writing to ask whether it is possible to train ANGEL on known refseq sequences. I have downloaded the refseq sequences from ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/. I then selected the known sequences, that is, those whose IDs start with "NM_", and removed sequences containing ambiguous bases such as "N".

I have been successful with dumb_predict.py, angel_make_training_set.py, but angel_train.py hangs after printing "Done with records".

Do you have any tips on using the refseq sequences for predicting the open reading frames in the isoseqs? Or do you recommend using the isoseqs themselves for both training and prediction? I thought we should keep the training set and the prediction set separate.

Best,
Jin

Bio.Alphabet has been removed from Biopython 1.78

Hi @Magdoll,
I'm using Biopython 1.78 to run ANGEL v3.0 and I got this error:

Traceback (most recent call last):
  File "angel_train.py", line 4, in <module>
    from Angel.SmartORF import ANGEL_training
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/Angel-3.0-py3.7-linux-x86_64.egg/Angel/SmartORF.py", line 13, in <module>
    from Angel import c_ORFscores, ORFscores, DumbORF
  File "c_ORFscores.pyx", line 5, in init c_ORFscores
  File "/root/anaconda3/envs/py37/lib/python3.7/site-packages/Bio/Alphabet/__init__.py", line 21, in <module>
    "Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Any advice will be appreciated.

Best wishes!

Complete cds or partial cds?

@nlhepler @Magdoll
Each ANGEL output sequence ID has a fixed format. Completeness is either complete, 5partial, 3partial, or internal based on the presence or absence of start and stop codons.
As I understand it, "complete" indicates that the sequence has a complete ORF. But I want to know whether it also means that there is a complete CDS. After all, an ORF is different from a CDS. In my ANGEL output, a full-length transcript (PB.422.1) has a complete ORF, so why does it have only a 3'UTR and no 5'UTR while other transcripts have both? I look forward to your reply.

posted wrong place, please delete this thread

wrong post. sorry

CD HIT

with ANGEL i can specify the number of cpus with the "--cpus" option, but when CD-HIT runs (the first step of dumb_prediction), it only uses 1 thread no matter how many cpus I specified on the command line. This slows down execution considerably.

dumb orf prediction

I have a question- why do not all transcripts get predicted with dumb orf prediction? Some proportion of transcript sequences do not end up in the ".final.pep" file.

feature: adding CDS track to output GFF

Post-ANGEL, given a GFF and ORF prediction result, annotate the GFF with CDS tracks.

This will probably not be included in ANGEL itself, but be added in the Cupcake repo to support ANGEL features.
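
The core of such a feature is mapping a transcript-relative CDS range onto the genomic exon coordinates. The sketch below shows one way to do that mapping. It is an illustration under stated assumptions, not code from ANGEL or Cupcake: the function name is made up, exons and CDS positions are taken to be 1-based inclusive, and the CDS range is assumed to be given along the transcript in 5' to 3' direction.

```python
def cds_to_genomic(exons, strand, cds_start, cds_end):
    """Map a transcript-relative CDS range onto genomic exon coordinates.

    exons: list of (start, end) genomic intervals, 1-based inclusive.
    cds_start, cds_end: 1-based inclusive positions along the transcript.
    Returns genomic CDS segments as (start, end), sorted by genomic start.
    """
    ordered = sorted(exons)
    if strand == "-":
        ordered.reverse()  # walk exons in transcript (5'->3') order
    segments = []
    offset = 0  # transcript bases consumed before the current exon
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        t_start, t_end = offset + 1, offset + ex_len  # exon in transcript coords
        lo, hi = max(cds_start, t_start), min(cds_end, t_end)
        if lo <= hi:  # the CDS overlaps this exon
            if strand == "+":
                segments.append((ex_start + (lo - t_start),
                                 ex_start + (hi - t_start)))
            else:
                segments.append((ex_end - (hi - t_start),
                                 ex_end - (lo - t_start)))
        offset += ex_len
    return sorted(segments)


exons = [(100, 109), (200, 209)]
print(cds_to_genomic(exons, "+", 6, 15))  # [(105, 109), (200, 204)]
print(cds_to_genomic(exons, "-", 6, 15))  # [(105, 109), (200, 204)]
```

Each returned segment would become one CDS line in the GFF. How the pos: field should be interpreted for strand:- predictions needs to be confirmed before relying on this.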

Q: ORFs of NA type

Hello Liz,

I have got a question. After running ANGEL, I realized that to roughly a third of the obtained ORFs "NA" prediction type was assigned. Of those cases, about half is classified as 'likely' and the other half as 'suspicious'. I am wondering whether something like this is normally observed or whether I should be concerned about those "NA" instances.

Could it be that those ORFs come from heavily degraded RNA since the algorithm was not able to even assign a "degradation type"?

Thank you very much.

Best,
Eva

Unable to get classifier.pickle result

Hi,
When I run angel_train.py, I cannot get the pickle file, and the program has been running for 10 days. The execution log is as follows:

The script is like below:

export PATH="/NJPROJ1/PB/personal_dir/liuchuanying/software/Miniconda2/bin:$PATH"
source activate /NJPROJ1/PB/personal_dir/liuchuanying/software/Miniconda2/envs/anaCogent
export PATH="/NJPROJ2/PB/pipeline/Pacbio_Isoseq_noref_V3.0/software/cd-hit-v4.6.8-2017-0621/:$PATH"

cd /NJPROJ1/PB/personal_dir/liuchuanying/Angel/Gallus_chicken/cd-hit_angel_train
/NJPROJ1/PB/personal_dir/liuchuanying/software/Miniconda2/envs/anaCogent/bin/angel_train.py /NJPROJ1/PB/personal_dir/liuchuanying/Angel/Gallus_chicken/cd-hit_angel_train/Gallus.dumb.final.training.cds /NJPROJ1/PB/personal_dir/liuchuanying/Angel/Gallus_chicken/cd-hit_angel_train/Gallus.dumb.final.training.utr Gallus_chicken.classifier.pickle --cpus 12

I have submitted the job multiple times and it always gets stuck at "Done with records". There is no shortage of memory.

So, I want to know the reason why I cannot get the pickle file.

Thanks a lot.

angel_predict.py: max_depth

Hi @Magdoll, I got KeyError: 'max_depth'
$angel_predict.py test.fa MCF7_2015.dumb.final.training.pickle test_angel
Reading classifer pickle: MCF7_2015.dumb.final.training.pickle
Traceback (most recent call last):
  File "/share/home/mpy/anaconda2/envs/anaCogent/bin/angel_predict.py", line 4, in <module>
    __import__('pkg_resources').run_script('Angel==2.4', 'angel_predict.py')
  File "/share/home/mpy/anaconda2/envs/anaCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/share/home/mpy/anaconda2/envs/anaCogent/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/share/home/mpy/anaconda2/envs/anaCogent/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/EGG-INFO/scripts/angel_predict.py", line 22, in <module>
    distribute_ANGEL_predict(args.fasta_filename, args.output_prefix, args.classifier_pickle, args.cpus, args.min_angel_aa_length, args.min_dumb_aa_length, args.use_rev_strand, args.output_mode, args.max_angel_secondORF_distance)
  File "/share/home/mpy/anaconda2/envs/anaCogent/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 238, in distribute_ANGEL_predict
    a = load(f)
  File "sklearn/tree/_tree.pyx", line 649, in sklearn.tree._tree.Tree.__setstate__
KeyError: 'max_depth'

error on angel_train.py, pickle not produced

Hi, please let me know what this error is about. It seems like some parts of the process for angel_train.py worked, but ultimately no pickle was produced. Thanks in advance.

/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Bio/Seq.py:2715: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
running get_data_parallel for coding, chunk 0
launching all workers
launching worker Process-1
launching worker Process-2
processing record PB.99.1|chr1:11796161-11810824(+)|i1a_c13849/f19p49/1138/RT|m.1, frame 0
launching worker Process-3
launching worker Process-4
Done with records
processing record PB.99.2|chr1:11796161-11810824(+)|i1a_c1909/f68p136/1135/RT|m.2, frame 0
Process Process-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 35, in add_data_worker
processing record PB.99.4|chr1:11796161-11810824(+)|i1a_c49914/f6p67/1234/RT|m.4, frame 0
Process Process-2:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process Process-3:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 35, in add_data_worker
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 35, in add_data_worker
stuff = ORFscores.make_data_smart(rec.seq, o_all, frame_shift=i)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 72, in make_data_smart
stuff = ORFscores.make_data_smart(rec.seq, o_all, frame_shift=i)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 72, in make_data_smart
arr = make_amino_scores(aa_freq)+make_diamino_scores(di_freq,o.diamino_range)+make_codon_scores(aa_freq,codon_freq)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 50, in make_codon_scores
arr.append(aa_freq['']*1./codon_freq[codon])
arr = make_amino_scores(aa_freq)+make_diamino_scores(di_freq,o.diamino_range)+make_codon_scores(aa_freq,codon_freq)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 50, in make_codon_scores
arr.append(aa_freq['']*1./codon_freq[codon])
ZeroDivisionError: float division by zero
stuff = ORFscores.make_data_smart(rec.seq, o_all, frame_shift=i)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 72, in make_data_smart
arr = make_amino_scores(aa_freq)+make_diamino_scores(di_freq,o.diamino_range)+make_codon_scores(aa_freq,codon_freq)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/ORFscores.py", line 50, in make_codon_scores
arr.append(aa_freq['']*1./codon_freq[codon])
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "/isg/shared/apps/angel/2.7/bin/angel_train.py", line 4, in <module>
__import__('pkg_resources').run_script('Angel==2.7', 'angel_train.py')
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
exec(code, namespace, namespace)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/EGG-INFO/scripts/angel_train.py", line 17, in <module>
ANGEL_training(args.cds_filename, args.utr_filename, args.output_pickle, num_workers=args.cpus)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 100, in ANGEL_training
data_pos = get_data_parallel(o_all, coding, [0], num_workers)
File "/isg/shared/apps/angel/2.7/lib/python2.7/site-packages/Angel-2.7-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 65, in get_data_parallel
obj = queue.get(timeout=60)
File "/usr/lib64/python2.7/multiprocessing/queues.py", line 132, in get
raise Empty
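
Reading the log above, the workers appear to die with ZeroDivisionError, after which the parent waits on queue.get(timeout=60) and finally raises Empty. A common pattern for making such failures visible is to have each worker catch its own exception and put an error record on the queue. The sketch below shows that pattern. It uses threads and a toy 1/x computation as stand-ins so that it runs anywhere; it is not ANGEL's worker code, and all names are hypothetical.

```python
import queue
import threading


def worker(task, out_q):
    """Compute 1/task and always put a result, even when it fails."""
    try:
        out_q.put(("ok", task, 1.0 / task))
    except Exception as exc:
        # Report the failure instead of dying silently, so the parent
        # does not have to wait for a timeout to notice.
        out_q.put(("error", task, repr(exc)))


def collect(tasks, timeout=5.0):
    """Run one worker per task and gather one record per task."""
    out_q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(t, out_q)) for t in tasks]
    for th in threads:
        th.start()
    results = []
    for _ in tasks:
        try:
            results.append(out_q.get(timeout=timeout))
        except queue.Empty:
            # This is the situation the traceback above ends in.
            results.append(("timeout", None, None))
    for th in threads:
        th.join()
    return results


for status, task, value in collect([2, 0]):
    print(status, task, value)
```

With this pattern the task that divides by zero comes back as an "error" record right away, so the real cause is reported next to the record that triggered it.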

angel_predict.py gets stuck and does nothing

@nlhepler @Magdoll, I have successfully run dumb_predict.py, angel_make_training_set.py and angel_train.py, but angel_predict.py suddenly hangs and does nothing. I've tried it several times, and the result is the same.
The server has 503G of memory available. The fasta file is 50.82MB (36314 sequences), and I used 8 CPUs.
I'm looking forward to your reply.

memory issue while running angel_train.py

Dear @Magdoll,

I'm facing a memory issue while running angel_train.py on a server with 200G of memory. After running for about 30 minutes, memory usage reached the maximum (200G) and it still hadn't finished the calculation. Does ANGEL really need this much memory for a single Iso-Seq data set? I'm not used to Python and am not sure whether I can limit the memory usage for this purpose.
Thanks in advance for any kind of help!

Best,
Dewi

Problems running angel_predict.py

Hello Magdoll,

I am trying to run ANGEL. All the steps up to angel_predict.py work fine, but then I get errors.
I have IsoSeq data from different rice cultivars and used the cDNA file of the reference Nipponbare (IRGSPv1.36) as the training data set.

When I run angel_predict.py (either for all or for the single split files) I always get the same error in the end:

Traceback (most recent call last):
File "/home/mpimp-golm.mpg.de/schaarschmidt/.conda/envs/anaCogent/bin/angel_predict.py", line 4, in <module>
__import__('pkg_resources').run_script('Angel==2.4', 'angel_predict.py')
File "/home/mpimp-golm.mpg.de/schaarschmidt/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 738, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/mpimp-golm.mpg.de/schaarschmidt/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
exec(code, namespace, namespace)
File "/home/mpimp-golm.mpg.de/schaarschmidt/.conda/envs/anaCogent/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/EGG-INFO/scripts/angel_predict.py", line 22, in <module>
distribute_ANGEL_predict(args.fasta_filename, args.output_prefix, args.classifier_pickle, args.cpus, args.min_angel_aa_length, args.min_dumb_aa_length, args.use_rev_strand, args.output_mode, args.max_angel_secondORF_distance)
File "/home/mpimp-golm.mpg.de/schaarschmidt/.conda/envs/anaCogent/lib/python2.7/site-packages/Angel-2.4-py2.7-linux-x86_64.egg/Angel/SmartORF.py", line 288, in distribute_ANGEL_predict
pool.map(ANGEL_predict_worker_helper, data)
File "/home/mpimp-golm.mpg.de/schaarschmidt/.conda/envs/anaCogent/lib/python2.7/multiprocessing/pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "/home/mpimp-golm.mpg.de/schaarschmidt/.conda/envs/anaCogent/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
KeyError: 'X'

Every step was performed as described in your tutorial. I also tried to run it on just one core, but it failed. Additionally, I got the same max_depth error (as reported in #24) when running the test data that you provided.

I hope you can help and thank you very much already
steffi

ANGEL intermediate files clean up

Hi Liz,

This is not really an issue, but I would like to double check one thing about my ANGEL run. I ran all steps successfully, and in the last step the script finished with:
"Closing Pool....
Joining Pool....
All workers completed.
Output written to data.ANGEL.cds, data.ANGEL.pep, data.ANGEL.utr
"
But in the angel folder I see a bunch of intermediate files that were not deleted: 176 data.split_*.fa and data.split_*fa.ANGEL, data.split_*fa.ANGEL.DONE.
There are also the files data.ANGEL.cds, data.ANGEL.pep, data.ANGEL.utr, which I take to be the final output. I wonder whether it is normal that the intermediate files are not deleted? Since the tool ran for a while, I do not feel like deleting them in case it means the run is incomplete.

Best
Ksenia
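
One way to gain confidence before deleting anything is to check that every split input has a matching DONE marker. The sketch below does only that check and deletes nothing. It rests on an assumption inferred from the file listing in this post, namely that a finished split `<prefix>.split_<n>.fa` is accompanied by `<prefix>.split_<n>.fa.ANGEL.DONE`. That naming, and the meaning of the marker, are not confirmed by the ANGEL documentation quoted here.

```python
import glob
import os


def unfinished_splits(prefix, directory="."):
    """Return split FASTA files that have no matching .ANGEL.DONE marker."""
    pattern = os.path.join(directory, prefix + ".split_*.fa")
    inputs = sorted(glob.glob(pattern))
    # Assumption: "<split>.fa.ANGEL.DONE" marks a completed split.
    return [path for path in inputs if not os.path.exists(path + ".ANGEL.DONE")]


print(unfinished_splits("data"))
```

An empty list, together with the presence of the three final output files, would suggest that the run completed and the split files are leftovers.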

Can not get the 'pickle' file when running 'angel_train.py'

When I run angel_train.py, I cannot get the pickle file, and the program keeps running as in the picture below.

And the script is like below:
angel_train.py cu.predict.final.training.cds cu.predict.final.training.utr cu.predict.final.classifier.pickle --cpus 4