
DeepMicrobes's Introduction

DeepMicrobes

DeepMicrobes: taxonomic classification for metagenomics with deep learning
Supplementary data for the paper is available at https://github.com/MicrobeLab/DeepMicrobes-data
IMPORTANT: The new DeepMicrobes (beta version) is available now. Please feel free to contact us if you have any suggestions or encounter any errors.

Usage

Contact

Any issues with the DeepMicrobes framework can be filed via the GitHub issue tracker. We are committed to maintaining this repository to assist users and tackle errors.

Email

Citation

Qiaoxing Liang, Paul W Bible, Yu Liu, Bin Zou, Lai Wei, DeepMicrobes: taxonomic classification for metagenomics with deep learning, NAR Genomics and Bioinformatics, Volume 2, Issue 1, March 2020, lqaa009, https://doi.org/10.1093/nargab/lqaa009

DeepMicrobes's People

Contributors

MicrobeLab


DeepMicrobes's Issues

Prediction using the seq2species model

First off, thank you for your very nice paper, really interesting results! I am trying to reproduce your results, but I am getting stuck at prediction with the Seq2Species model. I have installed DeepMicrobes and am trying to predict the species in a 100 bp fasta file (which contains 16S rRNA of E. coli). I do not have paired-end reads. These are the commands I currently run:
(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ seq2tfrec_onehot.py --input_seq=../test_fasta_100bp.fa --output_tfrec=../temp.onehot.tfrec --is_train=False --seq_type=fasta
Which seems to run fine and then:
(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ ./predict_seq2species.sh -i ../temp.onehot.tfrec -p 1 -m ../weights_seq2species/ -o test_output
But this gives the following error:

(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ ./predict_seq2species.sh -i ../temp.onehot.tfrec -p 1 -m ../weights_seq2species/ -o test_output
Prediction started ...
2020-05-28 12:32:42.030930: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
RUNNING MODE:  predict_paired_class
RUNNING MODE:  predict_paired_class
I0528 12:32:42.035207 139629123098432 tf_logging.py:115] Using default config.
I0528 12:32:42.035573 139629123098432 tf_logging.py:115] Using config: {'_model_dir': '../weights_seq2species/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efdcec60438>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
I0528 12:32:42.096365 139629123098432 tf_logging.py:115] Calling model_fn.
I0528 12:32:42.222223 139629123098432 tf_logging.py:115] Done calling model_fn.
I0528 12:32:42.312916 139629123098432 tf_logging.py:115] Graph was finalized.
I0528 12:32:42.314475 139629123098432 tf_logging.py:115] Restoring parameters from ../weights_seq2species/model.ckpt-0
I0528 12:32:42.752866 139629123098432 tf_logging.py:115] Running local_init_op.
I0528 12:32:42.756458 139629123098432 tf_logging.py:115] Done running local_init_op.
Traceback (most recent call last):
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 370, in <module>
    absl_app.run(main)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 278, in run
    _run_main(main, args)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 239, in _run_main
    sys.exit(main(argv))
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 353, in main
    flags.FLAGS.translate)
  File "/home/bart/DeepMicrobes/models/format_prediction.py", line 93, in paired_report
    batch_prob = average_paired_end(batch_prob, num_classes)
  File "/home/bart/DeepMicrobes/models/format_prediction.py", line 18, in average_paired_end
    prob_matrix = np.mean(np.reshape(prob_matrix, (-1, 4, num_classes)), axis=1)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 2505 into shape (4,2505)
paste: test_output.category_paired.txt: No such file or directory
rm: cannot remove 'test_output.category_paired.txt': No such file or directory
rm: cannot remove 'test_output.prob_paired.txt': No such file or directory
Result: test_output.result.txt
(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes/pipelines$ 

This reshape also does not make sense to me. I would just like the probability for each class. It seems the probabilities are held in prob_matrix[0] but I don't know which index corresponds to which class (species).
Any help would greatly be appreciated.
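
A note for other readers hitting this: predict_seq2species.sh runs in predict_paired_class mode (see the RUNNING MODE lines above), and the report step appears to average probability rows in groups of four per fragment, so a single single-end read gives a (1, 2505) matrix that cannot be regrouped, hence "cannot reshape array of size 2505 into shape (4,2505)". The raw probabilities can still be inspected directly. A minimal sketch, assuming a hypothetical tab-separated index-to-species mapping file for the model's 2505 classes (the released weights come with a label mapping; adjust the filename and format to whatever you have):

import numpy as np

# Hypothetical helper: print the top-k classes for each probability row.
# label2species.txt is an assumed "index<TAB>species" file, not a repo file.
def top_predictions(prob_matrix, label_map_path="label2species.txt", k=5):
    idx2name = {}
    with open(label_map_path) as handle:
        for line in handle:
            idx, name = line.rstrip("\n").split("\t", 1)
            idx2name[int(idx)] = name
    for row in np.atleast_2d(prob_matrix):
        top = np.argsort(row)[::-1][:k]
        print([(idx2name.get(int(i), str(i)), float(row[i])) for i in top])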

Question about training a DeepMicrobes model to predict the taxonomy from phylum to species

Hi! I have read the DeepMicrobes paper and it is great work! I'm planning to train the DeepMicrobes model using my data and I want it to predict the taxonomy of a given DNA sequence from phylum to species (six ranks in total).

My data looks like this:

TaxID_1.fasta
>sequence_0
ATCG...
>sequence_1
ATCG...
...
>sequence_n
ATCG...

...

TaxID_n.fasta
>sequence_0
ATCG...
>sequence_1
ATCG...
...
>sequence_n
ATCG...

Where each fasta file represents one species and the TaxID is the taxonomy ID in the NCBI database. Each fasta file may contain many DNA sequences. I have read the instructions on how to convert fasta sequences to TFRecord. However, I'm confused about 'The script parses category labels from sequence IDs starting with prefix|label (e.g., >this_is_prefix|0).' I wonder what the prefix is and why the number 0 can be the label. Also, how do they correspond to my data?
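
A note on the label format, judging from the converter code quoted in other issues in this tracker (it takes identifier.split('|')[1]): the prefix is an arbitrary sequence name, and the integer after the first '|' is the 0-based class index the model learns; recording which index maps to which TaxID is up to you. A hypothetical relabeling sketch for data laid out as one TaxID_*.fasta per species, as described above:

from pathlib import Path

# Give each TaxID_*.fasta one integer class label, rewrite headers to the
# "prefix|label" form the converter expects (e.g. ">TaxID_562_sequence_0|3"),
# and keep a label-to-TaxID table for interpreting predictions later.
fasta_files = sorted(Path(".").glob("TaxID_*.fasta"))
with open("train_labelled.fa", "w") as out, open("label2taxid.txt", "w") as tab:
    for label, fasta in enumerate(fasta_files):
        tab.write("%d\t%s\n" % (label, fasta.stem))
        with open(fasta) as handle:
            for line in handle:
                if line.startswith(">"):
                    out.write(">%s_%s|%d\n" % (fasta.stem, line[1:].strip(), label))
                else:
                    out.write(line)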

Another question: do I need to train six different models to predict the six taxonomic ranks? For example, one model for phylum, one for class, one for order, etc.

Thank you very much if you could give me some suggestions!

Error training a custom model

Hello, I am writing because I am trying to train a custom model for DeepMicrobes, and keep getting the same error whenever I try to train the model on the TFrecord I have created.

The stack trace I get is very long, but I believe the key issue is this:

Traceback (most recent call last):

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1378 in _do_call
    return fn(*args)

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1361 in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,

  File ~\anaconda3\lib\site-packages\tensorflow\python\client\session.py:1454 in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,

InvalidArgumentError: indices[12,53] = 526337 is not in [0, 526337)
	 [[{{node token_embedding/embedding_lookup}}]]

526337 happens to be exactly the size of the vocabulary file I am using, and it is somehow getting an index out of bounds error on its lookup. How could the embedding of a DNA read use a value that's not in the vocabulary?

I have tried this both using the properly installed version of DeepMicrobes with TensorFlow 1.9, and also by porting the code to TensorFlow 2 myself, but both versions get the same error, with the only thing changing with every run being the indices[xx, yy] location at which it goes out of bounds.

Are there any reasons for why this might be happening?
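
One plausible cause, offered as a guess: an index equal to the table size usually means some token (for example an unknown or N-containing k-mer) is assigned id = len(vocab), so the embedding table sized by the --vocab_size flag needs at least max-id + 1 rows. A small TF 1.x sketch to check what the TFRecord actually contains (the feature key 'read' is taken from the parse ops visible in another traceback in this tracker):

import tensorflow as tf

# Scan a TFRecord and report the largest token id, to see how many rows the
# embedding table (--vocab_size) actually needs. TF 1.x API.
max_id = 0
for rec in tf.python_io.tf_record_iterator("train.tfrec"):
    ex = tf.train.Example.FromString(rec)
    ids = ex.features.feature["read"].int64_list.value
    if ids:
        max_id = max(max_id, max(ids))
print("largest token id:", max_id, "-> embedding needs >=", max_id + 1, "rows")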

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I was facing this error on WSL

(directml) animeshs@DMED7596:~/ayu$ uname -a
Linux DMED7596 5.4.91-microsoft-standard-WSL2 #1 SMP Mon Jan 25 18:39:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
(directml) animeshs@DMED7596:~/ayu$ bash DeepMicrobes/pipelines/tfrec_predict_kmer.sh  -f fastq/s13._1.fastq -r fastq/s13._2.fastq  -o s13dm -v /home/animeshs/ayu/DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz
...
INFO:tensorflow:Processing test set
INFO:tensorflow:Parsing vocabulary
Traceback (most recent call last):
  File "/home/animeshs/ayu/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 243, in <module>
    main()
  File "/home/animeshs/ayu/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 233, in main
    test_set_convert2tfrecord(input_seq, output_tfrec, kmer, vocab, seq_type)
  File "/home/animeshs/ayu/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 144, in test_set_convert2tfrecord
    word_to_dic = vocab_dict(vocab)
  File "/home/animeshs/ayu/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 37, in vocab_dict
    for line in handle:
  File "/home/animeshs/miniconda3/envs/directml/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.

The workaround is to gunzip the .gz vocabulary file (e.g. DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz) supplied via -v; after that, the following worked 👍🏼:
(directml) animeshs@DMED7596:~/ayu$ bash DeepMicrobes/pipelines/tfrec_predict_kmer.sh -f fastq/s13._1.fastq -r fastq/s13._2.fastq -o s13dm -v /home/animeshs/ayu/DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt
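
Alternatively, the vocabulary reader in scripts/seq2tfrec_kmer.py could open the file transparently, so the .gz would work as-is; the 0x8b byte in the error above is the second byte of the gzip magic number. A minimal sketch (illustrative, not a tested patch):

import gzip

# Open the vocabulary in text mode whether or not it is gzip-compressed.
def open_vocab(path):
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)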

failed prediction only in some samples

I'm using a Docker container with TF 1.15 to run DeepMicrobes. I know it might not have been tested with TF 1.15, but that is the version compatible with my platform. It has no problem doing the prediction for many tfrec files (converted from FASTQ files) but fails on some.
Here is the error I get for example:

Traceback (most recent call last):
  File "/root/DeepMicrobes.py", line 382, in <module>
    absl_app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/root/DeepMicrobes.py", line 362, in main
    classes, probs = paired_report(predict_out,
  File "/root/models/format_prediction.py", line 88, in paired_report
    batch_prob = next(prediction_generator)['probabilities']
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 640, in predict
    preds_evaluated = mon_sess.run(predictions)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 750, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1413, in run
    outputs = _WrappedSession.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: 2 root error(s) found.
  (0) Data loss: corrupted record at 1626132269
         [[node IteratorGetNext (defined at usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Data loss: corrupted record at 1626132269
         [[node IteratorGetNext (defined at usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[token_lstm/bidirectional_rnn/bw/bw/Assert/Assert/data_0/_127]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
  File "root/DeepMicrobes.py", line 382, in <module>
    absl_app.run(main)
  File "usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "root/DeepMicrobes.py", line 362, in main
    classes, probs = paired_report(predict_out,
  File "root/models/format_prediction.py", line 88, in paired_report
    batch_prob = next(prediction_generator)['probabilities']
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 619, in predict
    features, input_hooks = self._get_features_from_input_fn(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 996, in _get_features_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1116, in _call_input_fn
    return input_fn(**kwargs)
  File "root/DeepMicrobes.py", line 306, in input_fn_predict
    input_fn = input_function_predict_kmer(
  File "root/models/input_pipeline.py", line 147, in input_function_predict_kmer
    batch_features = iterator.get_next()
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 425, in get_next
    flat_ret = gen_dataset_ops.iterator_get_next(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2516, in iterator_get_next
    _, _, _op = _op_def_lib._apply_op_helper(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 792, in _apply_op_helper
    op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3356, in create_op
    return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3418, in _create_op_internal
    ret = Operation(
  File "usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

another error is this one:

Traceback (most recent call last):
  File "/root/DeepMicrobes.py", line 373, in <module>
    absl_app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/root/DeepMicrobes.py", line 353, in main
    classes, probs = paired_report(predict_out,
  File "/root/models/format_prediction.py", line 89, in paired_report
    batch_prob = average_paired_end(batch_prob, num_classes)
  File "/root/models/format_prediction.py", line 14, in average_paired_end
    prob_matrix = np.mean(np.reshape(prob_matrix, (-1, 4, num_classes)), axis=1)
  File "<__array_function__ internals>", line 5, in reshape
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 301, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: cannot reshape array of size 20072565 into shape (4,2505)

Any idea how to solve them?
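
Two observations that may help. The DataLossError means the TFRecord itself is truncated or corrupted at the reported byte offset; regenerating that file from the FASTQ usually fixes it. The reshape error is consistent with the same kind of damage: 20072565 = 8013 × 2505, and 8013 probability rows is not a multiple of 4, so the paired-end averaging step cannot group them. A small TF 1.x sketch to locate where a file goes bad (illustrative only):

import tensorflow as tf

# Count how many records are readable before the file turns out corrupted.
def count_valid_records(path):
    n = 0
    try:
        for _ in tf.python_io.tf_record_iterator(path):
            n += 1
    except tf.errors.DataLossError as err:
        print("corrupted after", n, "records:", err)
    return n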

Model weights for training

Hi,
Could you please elaborate a bit on the parameter '--model_dir' in the training part of the tool?
Are there weights that come with the study, or do we have to generate them ourselves? If so, could you please guide me on how to generate the weights and add them to the model while training?
Thanks,
Gayathri

Error in tfrec_train_kmer.sh with SILVA 138.1 SSU database as training set

Hi,
I would like to train the network with SILVA 138.1 SSU database using

tfrec_train_kmer.sh -i SILVA_138.1_SSURef_NR99_tax_silva.fasta -v /vocabulary/tokens_merged_12mers.txt -o train.tfrec -s 20480000 -k 12

However, I am getting the following error:

parallel successfully detected...
seq-shuf successfully detected...
Starting converting SILVA_138.1_SSURef_NR99_tax_silva.fasta to TFRecord (mode=training), output will be saved in train.tfrec
Parameters: kmer=12, vocab_file=/vocabulary/tokens_merged_12mers.txt, split_size=20480000

1. Shuffling sequences for training...
(echo -n ">"; cat <&0) | sed "s/^>/\x0>/" 

2. Splitting input to 20480000 sequences per file...

3. Converting to TFRecord...
INFO:tensorflow:Processing training/eval set
INFO:tensorflow:Parsing vocabulary
Traceback (most recent call last):
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 243, in <module>
    main()
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 230, in main
    training_set_convert2tfrecord(input_seq, output_tfrec, kmer, vocab, seq_type)
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 120, in training_set_convert2tfrecord
    seq, label_id = training_set_read_parser(rec)
  File "/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 56, in training_set_read_parser
    label_id = int(identifier.split('|')[1])
IndexError: list index out of range
Finished.

Can you help me please?

Thank you.

Originally posted by @oschakoory in #17 (comment)
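
The traceback points at the header format: training_set_read_parser does identifier.split('|')[1], but SILVA headers look like '>accession taxonomy' and contain no '|', so there is no label to parse. The FASTA has to be relabeled first. A hypothetical conversion sketch that assigns one integer label per distinct SILVA taxonomy string (adapt the label granularity to what you actually want to classify):

# Rewrite ">accession taxonomy" headers as ">accession|label".
label_of = {}
with open("SILVA_138.1_SSURef_NR99_tax_silva.fasta") as src, \
        open("silva_labelled.fa", "w") as out:
    for line in src:
        if line.startswith(">"):
            acc, _, tax = line[1:].rstrip("\n").partition(" ")
            label = label_of.setdefault(tax, len(label_of))
            out.write(">%s|%d\n" % (acc, label))
        else:
            out.write(line)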

Error occurred during model training.

Hello, I didn't encounter any errors when training small amounts of data, but when I tried to train several gigabytes' worth of data, I received the following error. Could you please advise me on how to fix it?
Error:tensorflow.python.framework.errors_impl.InvalidArgumentError: len(seq_lens) != input.dims(0), (22 vs. 32)
[[node token_lstm/bidirectional_rnn/bw/ReverseSequence (defined at /anaconda3/envs/DeepMicrobes-master/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
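
A guess for other readers: "22 vs. 32" looks like a final batch of 22 reads reaching an op that assumes the full batch size of 32. If that is the cause, dropping the remainder batch in models/input_pipeline.py is a common workaround; a sketch, assuming the dataset yields one variable-length id vector per element (drop_remainder needs TF >= 1.10; the trace above shows a tensorflow_core build, i.e. 1.15):

# Force every batch to exactly batch_size elements so downstream ops that
# assume a fixed batch size (here the bidirectional LSTM) never see less.
dataset = dataset.padded_batch(
    batch_size,
    padded_shapes=tf.TensorShape([None]),
    drop_remainder=True)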

where is tfrec_predict_kmer.sh?

Can't seem to find it, all i see with tf* is

find . -iname "tf*sh"
./pipelines/tfrec_train_onehot.sh
./pipelines/tfrec_predict_onehot.sh
./pipelines/tfrec_train_kmer.sh
./pipelines/tfrec_predict_kmer.sh

Questions about some details in the DeepMicrobes paper

Hello, I am very interested in your research and hope to reproduce your results for my own work.
However, I ran into some problems during the reproduction, and I would really appreciate your help! Here are my questions:

  1. Regarding the sentence "The number of reads to simulate depended on how many training steps were required for models to converge": when you simulated the training set with ART, how many reads did you simulate in the end (what was the coverage parameter -f), and could you share the commands used to simulate the training data?
  2. What was the coverage parameter -f for the evaluation sets, and could you share the commands used to simulate them?
  3. When generating the training set, after assigning labels, did you concatenate all the .fa files together and then perform the subsequent simulation and trimming steps to obtain train.tfrec?
  4. Since both the training and evaluation sets are very large, what method or tool did you use to quickly remove paired-end reads in the evaluation set that duplicate the training set?

Thank you for taking the time to read my message; I look forward to your reply.

About train_epoch

Hello, I would like to ask whether the setting of train_epochs has an impact on the model. I see that you have set the default value to 1, but common values for epochs are 50, 100, etc. I don't quite understand why.
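
For context: with Estimator-style training the effective knob is the number of gradient updates, which is reads / batch_size per epoch, and metagenomic training sets are typically simulated to be so large that a single pass is already a very long run; that is presumably why train_epochs defaults to 1 here. Illustrative arithmetic only:

# Steps per epoch = number of reads / batch size (numbers are made up).
reads, batch_size = 200_000_000, 2048
print(reads // batch_size)  # ~97,656 gradient updates in a single epoch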

UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position #

Hi DeepMicrobes developers and users,
I'm excited to start using this tool on my dataset but I'm running into an error at the installation testing step.

Here is the command I run:
/scratch/gencore/software/DeepMicrobes/DeepMicrobes.py --helpfull

and here is the output I get:

2022-09-18 21:42:58.116424: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

       USAGE: /scratch/gencore/software/DeepMicrobes/DeepMicrobes.py [flags]
flags:
Traceback (most recent call last):
  File "/scratch/gencore/software/DeepMicrobes/DeepMicrobes.py", line 365, in <module>
    absl_app.run(main)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 273, in run
    args = _run_init(sys.argv if argv is None else argv)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 324, in _run_init
    args = _register_and_parse_flags_with_usage(argv=argv)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 204, in _register_and_parse_flags_with_usage
    args_to_main = parse_flags_with_usage(original_argv)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 159, in parse_flags_with_usage
    return FLAGS(args)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 623, in __call__
    unknown_flags, unparsed_args = self._parse_args(args, known_only)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/flags/_flagvalues.py", line 759, in _parse_args
    flag.parse(value)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 129, in parse
    usage(writeto_stdout=True)
  File "/scratch/gencore/conda3/envs/deepmicrobes/lib/python3.6/site-packages/absl/app.py", line 377, in usage
    stdfile.write(flag_str)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 4996: ordinal not in range(128)

I haven't worked with a deep learning-based computational framework before, so I'm not quite sure what I'm looking at here. Please let me know if you'd want more information from my side. Thanks!
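
A workaround worth trying, offered as a guess: the '\ufeff' is a byte-order mark embedded in some flag's help text, and the crash happens while absl prints the help to an ASCII-encoded stdout. Forcing UTF-8 output, either by exporting PYTHONIOENCODING=UTF-8 in the shell before running, or by re-wrapping stdout near the top of DeepMicrobes.py, should sidestep it:

import io, sys

# Hypothetical workaround: make stdout UTF-8-capable before absl prints help
# text containing the '\ufeff' byte-order mark.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")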

Error in tfrec_train_kmer.sh with a customized training set

Hi, I'm trying to use tfrec_train_kmer.sh on a training dataset I constructed, and I'm struggling with it. I got the errors below.

(DeepMicrobes) root@bbf7145cde62:/workspace/vamb-data/airways# tfrec_train_kmer.sh -i dmtrain.fa -v /workspace/czj/tokens_merged_12mers.txt -o dmtrain.tfrec -s 2048 -k 12
parallel successfully detected...
seq-shuf successfully detected...
Starting converting dmtrain.fa to TFRecord (mode=training), output will be saved in dmtrain.tfrec
Parameters: kmer=12, vocab_file=/workspace/czj/tokens_merged_12mers.txt, split_size=2048
======================================
1. Shuffling sequences for training...
(echo -n ">"; cat <&0) | sed "s/^>/\x0>/"
======================================
2. Splitting input to 2048 sequences per file...

======================================
3. Converting to TFRecord...
Can't use 'defined(@array)' (Maybe you should just omit the defined()?) at /workspace/czj/DeepMicrobes/DeepMicrobes/bin/parallel line 119.
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.

The first two records of dmtrain.fa look like this:

>S4C8|22
GTTATAATTTCCCGGCTGGATCTCCTTGAAATCATCAGACAAAATACCTCTTCTTAAAAGTTCTGCCGTGCCTGCAAAGCGAAAATCACGAAGCCCCGGATTGTCTTTTAACTGATAAATCCCATATTTATCAGTATCGCCATACAAAAGCTGTGCTTCCTTGTTTGCGCCGCTTTTTTCAAGTTCCTCCTGCATCGTCCGGAGACTTCTTTCATTTTCCCAATCGCCCTTTTCTATGCCAAAGATACCCTCATGCTCTGTGATCTGCCTCCTGTTCTTCACCGTTGTTTCCGAACCGTCCTCATGGAGCAGGTACACAGGCAGGTCACGGTCAAAAAGCTCCAATGCCCTCCCCTGTGTCAATGGCAGCATTTCATTCCATGTATAGCCGTATTCCTCCATTTCCGATAAGCCGATCATAGGGTCAGGGAGTGCGTCAATCTCTGCCTGTGCGTCAATGACCGCAAGGGCAGCCCCCTGACTGCCTTCTGTTTCCTCATAATAAATATGCTCCGCAAGCTCCCTTGTCTTTTCCATGTCGCCCAGCTTAAAGGCATAATTTACAATCAGGCTGCGCTCATCATCAGAAAAAGCGGTCCTGGCAAATTCCAGACGGTTAATAATGCCCTCCGCCTGCTTTACCGGGGAAAGGTGGCTCGTCATTACCACTTCTTTTTTATCCAGGTCAATCTCAAAATCATTAAATTGCGGCATACCGGAATTTCTGCCGCCGCTGACAATAATCGGGATATAATCAGCCCCGTAGCTTTCCAGACAGTCATGGATATTTTTTCTGTCCATACCCTCCAACTTTGTGAGCGCACCCATGATCTCCTGAACTTCCTCCGGTGTCTTATTGATGATACGTTCCACCTCAAAACCGCCTGAATGTATGATCTTCAGAATCACATCATCTGCCCCGACAGGCTCCTTTGCCCTTTCATTCCACTCCATCTCTGCCTTCAGGTCAATGATCTCATTCTTATCGGTGATATACCGGACATCAAGGTGGTATTCTCCAAATTCCCTTGTTTCCTCATTGGCAATCTCTATAGTCCATGCGCCTTTGCTTTCCAGATATGCGGCAACGCTCAGTCTGTCATCGTCATTCATGGCATAGAGTGCTTCTATGATCTCCGCCGCATTCATTCCCCTGACTTCTACAAGGCTGTACTCAGACAGATCGCTGTTCTGTATCAGCAAAAGGCTTTCTTTTTCCTGCCCCTGCATGATCTCCCGGTCACGCTGTATTTCTCTAAGCTGTTCCTCTATACCTGTGATAAGCTCCGATGCAGTCCTGCGGATCGTATCAAGGGAAGATTTCAGTTCTTTCATATCCTTTCCGCTGCTCCACCCGGCAATATACCCAAAGGAATAACCAGAAGTATCTATGCCGAAGTGCTTACAGACTGTAAACGCTATACTCTCTGCTTCAAATGTTAATATATCCATTTGAGGCCTTATACCCATGTTTTTATAATGTTGATTTTTTATATTATTTCCCTTAAATCCTCACTTTCTTGTATAAAAGATAAGCATTATCAGATACGCTATTCTCAGGTTTTCCCCTCGAATGGGAAGCCGGAAAGGAGCGCATTTGATATGCAAATAAACTATTTAGATGCTGTTTCATCAGTCCTCAATATGATGAAGCAGCCAGACAGCGCATGTAAAAATATAGACATGCACAGAACCTGTTACACCATGTTCTTCAAATACCTGATGGATAAGGGCATTCCTTTTTCAATGGATGCCGCGCTGGACTGGCTTGAGATTAAGAAACAGGAAATTTCCTATGAGACGTGTTCCCAATATAGAAATGCCCTGTTCCGACTTGAGCATTACCTGCTCTTTGGAGATATCGAAAGCCCTTTCTGCCGCTCAGAAGACAGTTTTTTCTGCCGGAGCGGGATGTCGGAATCTTTTTTCCGCCTGACATATGAGCTGGAGGAATACTATGCGGCCAGCCAGAACCCCAGCTATTACCATACGTATTCCGTTGCCACAAAAGAGTTTTTCAAACTTGCGACTTCCCTTGGAATTACAGAGCCGGAAGCAGTCACCATAGATACTCTTATCGAATACTGGAATACTTACTGCAAATCCTGCGGCTCTCCCGTCAGACGCCAGAACGCCGTATGCGCTATGACGGCTCTTATGAAATACCTTCACCTTCGGGGTGATGTGCCGGAGTGTTATCAGCTGGTTCTTTTTGGCTGGAACGCTGAAATACTGTCTGGCATGAGGCTTTCCAAAACAGGCGCCGCATTCCATCCCAGTGTATCTCTTGAACATAAAGCTGAAGGGTATCTTGACGCCTTGGACGATTGGAAATACATGGAATCATCAAAAGCTGTTTACCGCAATGATTTCACCTGGTACTTTATGTTTTTGGAACTTAACCGCCTGGAGCATTCGGCAGAAACTGTAACTCTATTTACAGACATACTTCCGGATTGTCCGAATCAGGCCAAAGGCAGCAATCCTGTATCGGCCCGCCGTTCACACACGATCAGAATGTTTGAAAAGTATCTCCAGGGCACAATGGAATCTAATATGGCGGCTGATCCAAAGCGTGCGTCCGATCATCTTCCGTCATGGAGCAAAAGCATCCTTGATGGTTTTATAGAGAGCCGCAGGCGGGATGGTATGACGAATAATACACTTACTATGTGCAGGGCTGCCGGATGCAGTTTCTTCAAATATCTTGAAGATAATGGAATAGATTATCCGGCATACATAACACCTGATGCAGTGAAAGCATTCCATAACCATGATGTCCACTCGACCCCGGAAAGCAAAAATGCATATGGGACAAAGCTCCGTCAGCTTCTGCGTTACATGGCTGACCAGGATCTGGTCCCGCCAACCCTTGTTTTTGCAGTATCTGCAAGCTGCGCTCCCCGTCGCAGCATCGTTGATGTCCTGAGCGATGATATGGTTGGGAAAATATATGAATACCGCGACAAAGCCTCCACTCCCATAGAACTCAGAGACACAGCTATGGTTATGCTCGGGCTTCGGATGGGTATCAGGGGAGCGGACATCCTGAAGCTTCAGGTAAATGATTTTGACTGGAAAAACAAAACGGTTTCCTTCATCCAGCAGAAAACAGGAAAAGCAATCACGCTTCCAGTCCCAACAGATGTAGGTAATTCTATATATAAATACATCATGAATGGACGTCCGGAATCGGCTGCCACAGGCAGCGGATATATATTTATCCGCCATCAGGCGCCATATATTCCGCTTAAAGTCACAACGGCGTGCCGTGGGGCTTTAAAAAGAATACTTGCTGAATATGGATTTGAACTATCCGCCGGCCAGGGCTTCCATATGACACGGAAGACATTTGCCACAAGAATGCTTCGGGCAGGCAGCAAACTTGATGATATTTCCATCGCCCTCGGGCATGCACGTCCGGAAACTGCCGAGGTATATCTTGAACGTGACGAAGATAAAATGAGGCTCTGCCCTCTGGAATTTGGAGGTGTTTTGTCATGACATACATTTTTGAGAGCGGCCTGGCACATCATATCGAAGGACTCATACAGC
AAAAACGGGCGGATGGATATGCCTATAATTGCGAAGAAAAGC
>S4C16|245
CGAGCAAACGAAGGCCGTACTAGAGATTCAGGCCAAGTGGAAGACTATAGGCTATGCTCGCAGAAGCGACAATGAGAAGATCTACGAGCGTTTCCGCGCAGCATGTGACGATTATTTCAATAAGAAAACAGCTTTCTTCAAAGGCAAACGTGAAGAGCTGACCGATAACTACAAGAAGAAGCTGGCCATGGTAGAAGAAGCGGAGAGCCTTCAGGAGAGTTCCGACTGGAAAGAAACCTCTACTCGCTTGGCCGAACTCCAAAAGAAATGGAAAACCATCGGAGCCGTTCCTCATCGGTATAGTGATGAGATATGGAAGCGTTTTACGACTGCATGCGATGCATTCTTCAAACGTAAAAAAGCCGAACAGGGAGATATGCGCTCCGAAGAATGCGAAAACCTGAAGAGCAAGAAAGCAATCATTGCAGAGCTTGAGACTTTGGATTCGGAAGAAGCAAGCGAGGGTATCATCGACAGGCTCAATGCTCTGGCCGGACGTTGGAATTCCATAGGCTTTGTACCGTTCAGAGAGAAGGATACTATCAACAAAGCTTACCGAAAATTGATCGATGGTCTGTACGACAAGCTGAATATCGAACGAAGCAACCGGCGCCTCGAAGGATACAATGCCTCCTTGGAACAACTGGAGGGTGGCGGCAAAGGACAGCTCTATGATGAACGTGATCGTATGACACGTATCCTCGACCGTATGCGCAACGAATTGCAGACCTATACGAACAATCTGGGTTTCCTCAATATATCCAGTAAAAGTGGGAATAGCCTGATGCGCGAAATAGAGCGCAAGAAGGAAAAGCTGGAAGAAGACATCCGTCTGATGATCGAAAAGATCAAGCTGATCGACAAGAAGGTGGAAGAGCTGAACTCTAAAGAGTAGGCTATCCCCCACTCCATCGGCAAAATAAAACCGAAGGAGAAAATAGCATTCAAGAATTGAGGTGAGCCACGAAAGTTTTATATCAGACTTTCGTGGCTCACTTCTTTTCTACTCGCTACTCATTGACAGAGTAAGAAACGCAAGGCCAAGAGATGAAAGACAGATACAAGGCTGTTTTTTATCTCGATAGCGCAACAACCAAAAGGGCTATGCTGTTTCATTTCTAAAAGGATATACCGATGAAGATAGTAATAGCGGACAGCTATGCAGCTCTACCCGGCGATTTGGACTGGAGCGGTATCGAAGAAATGGGCGAATGCGTGTTCTACGAATATACCCGTCCGGAGGATTTGACTCTGCGTGCTGTCGATGCTGAAATAGTGCTTACCAACAAGACTCCTGTGACTGCGGCCGACATGGAAAAGATGCCCCACCTACGTTACATCGGACTGATGATTACAGGCCTTAATCTTATAGATATGGATGCTGCTCGTCAGCGTGGTATCACCATAACGAACATCCCCCACTATAGCACAGAATCAGTAGCCCAAATGGCAATCTCGCATCTACTGCACATAACCATGCCGATCGGAGAACTTTCCCGGCAGGTGAAAGATGGTTGCTGGCAGAGCAATTACGAACAAATCTCTCGCAATACTTATCAGATAGAACTGAGCGGACTGACGATGGCTATCGTGGGACTTGGGGCAATAGGTACACGTGTAGCGGAAATGGCACGTGGATTCGGCATGAAGATTTTGGCACATACATCCAAATCTCCAATCGAGTTGCCTTCTTATATAGAAAAGTCCGATAGCCTGGAGAAGCTTTTCTCTCGGGCTGATGTGCTGAGTCTGCATTGCCCGCTCACAGCGCAAACCCAAAGGATGGTATCGGCTGATAGGCTGGCACTGATGAAACCGACAGCTATCCTGCTGAACATGTCCCGAGGAAGTCTGATCGATGAAAAAGCATTAGCCTCTGCCCTAAATGAAGGACGGCTCTATGCTGCAGGCTTGGACGTACTTGCGGAAGAACCTCCATGCATGGATCACCCTTTGCTTAAGGCGCGTAATTGTCACATCACGCCACATATGGGCTGGAATACGGATGCAGCGCGCTTGCGCCTTTCTCGGACGATCAAGGAGAATCTTCGGGCTTTCATTTCCGGTCACCCTGTCAATGTCGTTTAAGAACAGAATCCATCAAAACGATTATTTTCCGACCAATACCTTTCGAAGAATTTGACGGATTTATCCTCGATAAATCTACGTGTGTTCGA

Could you have a look and see if I've done anything wrong? Thanks!
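
The FASTA itself looks fine; the failure happens earlier. "Can't use 'defined(@array)'" comes from the copy of GNU Parallel bundled in DeepMicrobes/bin, which uses a construct removed in Perl 5.22, so on a modern Perl the converter dies before any subset*.tfrec is produced (the cat/rm messages are just fallout). Installing a current GNU Parallel and letting it shadow the bundled one on PATH is one fix; another is to bypass parallel and call the converter directly, assuming its flags mirror the parameters the wrapper echoes (hypothetical invocation, untested):

python DeepMicrobes/scripts/seq2tfrec_kmer.py --input_seq=dmtrain.fa --output_tfrec=dmtrain.tfrec --vocab=/workspace/czj/tokens_merged_12mers.txt --kmer=12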

Versioned release package for DeepMicrobes

I'm working to make DeepMicrobes available on Bioconda.

Please refer to https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release

I need your help creating a versioned release to use for the Bioconda recipe. Once DeepMicrobes is added to Bioconda, it'll also be made available as a Docker container from Biocontainers, and as a Singularity image from the Galaxy Project. The Bioconda bot will also recognize future releases and automatically update the recipe.

Please let me know

Thanks
Jay

ERROR: parallel not detected

run command:
(pip-tensorflow) (DeepMicrobes) [xxxx@compute40 DeepMicrobes]$ tfrec_predict_kmer.sh -f ../test/SRR5935743_clean_1.fastq -r ../test/SRR5935743_clean_2.fastq -t fastq -v ../vocab/tokens_merged_12mers.txt -o srr -s 4000000 -k 12

return :
ERROR : parallel not detected

Help me.

reloading model resets accuracy

Hi,
I'm trying to run DeepMicrobes on my own data but I'm running into some trouble. When I train seq2species through DeepMicrobes, everything seems to be working fine and training accuracy rises/loss drops, however when I reload the model and run it again on the training data, be it for new rounds of training or eval/prediction, the model performs no better than random. The output of the prediction file shows that it just predicts the same class continuously at P=1.0. No errors are thrown and the structure of the tfrecord file seems fine. I also manually checked the weights and they are loaded correctly. This even occurs in a 4-class problem of clearly divergent species.

As a sanity check I've run it with very simple sequences, just 4 species with only homopolymers (so that's poly-A, poly-T, poly-C and poly-G), and then the accuracy is retained after a model reload.

I'm guessing there's still something wrong with the data, but I can't imagine what anymore... I've attached the sequences in fasta, tfrecord and json (i.e., converted from tfrecord), and a yml of the conda env I'm running in. Could you have a look and see if I've done anything wrong?
simple_tst_refseq_100nt.zip

0 byte .tfrec

Looks like tfrec_predict_kmer.sh is unable to create a proper .tfrec

(base) animeshs@DMED7596:~/ayu$ ls -ltrh
-rwxrwxrwx 1 animeshs animeshs    0 Feb 23 17:37 s13dm.tfrec

any ideas how to proceed further?

Setup is WSL/ubuntu-18.04

(base) animeshs@DMED7596:~/ayu$ uname -a
Linux DMED7596 5.4.91-microsoft-standard-WSL2 #1 SMP Mon Jan 25 18:39:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Prereqs I had to install (hope they are right?)

git clone https://github.com/MicrobeLab/DeepMicrobes-data
sudo apt install parallel
sudo apt install seqtk

CLI

(base) animeshs@DMED7596:~/ayu$ bash DeepMicrobes/pipelines/tfrec_predict_kmer.sh  -f fastq/s13._1.fastq -r fastq/s13._2.fastq  -o s13dm -v ./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz
parallel successfully detected...
seqtk successfully detected...
Starting converting fastq/s13._1.fastq and fastq/s13._2.fastq to TFRecord (mode=prediction), output will be saved in s13dm.tfrec
Parameters: kmer=12, vocab_file=./DeepMicrobes-data/vocabulary/tokens_merged_12mers.txt.gz, split_size=4000000, sequence_type=fastq
======================================
1. Interleaving R1 and R2...
======================================
2. Splitting the merged file to 4000000 sequences per file...

======================================
3. Converting to TFRecord...
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
/usr/bin/env: ‘python\r’: No such file or directory
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.
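
The real failure is "/usr/bin/env: 'python\r'": the Python scripts were checked out with Windows (CRLF) line endings, so the shebang asks for an interpreter literally named "python\r"; every worker dies, no subset*.tfrec is written, and the final .tfrec comes out empty. Converting the scripts to Unix newlines (dos2unix, or the sketch below) should fix it:

from pathlib import Path

# Strip carriage returns so "#!/usr/bin/env python" resolves again.
for script in Path("DeepMicrobes/scripts").glob("*.py"):
    script.write_bytes(script.read_bytes().replace(b"\r\n", b"\n"))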

problem running the example tfrec conversion on Windows Subsystem for Linux WSL and likewise on docker

Hi,
I installed DeepMicrobes following the provided guide but did not have any success running tfrec_predict_kmer.sh on the provided example file. Is there any solution to this? I tried it both on WSL and in a Docker container that I built for DeepMicrobes; both got the same error. Have you made any Docker image for DeepMicrobes? Any suggestion is greatly appreciated.

(DeepMicrobes) XYZ@PCXYZ:/mnt/c/Users/XYZ/Desktop$ tfrec_predict_kmer.sh -f SRR5935743_clean_1.fastq -r SRR5935743_clean_2.fastq -t fastq -v tokens_merged_12mers.txt -o SRR5935743 -s 4000000 -k 12

parallel successfully detected...
seqtk successfully detected...
Starting converting SRR5935743_clean_1.fastq and SRR5935743_clean_2.fastq to TFRecord (mode=prediction), output will be saved in SRR5935743.tfrec
Parameters: kmer=12, vocab_file=tokens_merged_12mers.txt, split_size=4000000, sequence_type=fastq
======================================
1. Interleaving R1 and R2...
======================================
2. Splitting the merged file to 4000000 sequences per file...
======================================
3. Converting to TFRecord...
Can't use 'defined(@array)' (Maybe you should just omit the defined()?) at /mnt/c/Users/XYZ/Desktop/DeepMicrobes/bin/parallel line 119.
cat: 'subset*.tfrec': No such file or directory
rm: cannot remove 'subset*.tfrec': No such file or directory
Finished.

(DeepMicrobes) XYZ@PCXYZ:/mnt/c/Users/XYZ/Desktop$ conda info

active environment : DeepMicrobes
active env location : /home/XYZ/anaconda3/envs/DeepMicrobes
shell level : 3
user config file : /home/XYZ/.condarc
populated config files :
conda version : 4.10.1
conda-build version : 3.21.4
python version : 3.8.8.final.0
virtual packages : __linux=5.4.72=0 __glibc=2.31=0 __unix=0=0 __archspec=1=x86_64
base environment : /home/XYZ/anaconda3 (writable)
conda av data dir : /home/XYZ/anaconda3/etc/conda
conda av metadata url : https://repo.anaconda.com/pkgs/main
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 https://repo.anaconda.com/pkgs/main/noarch https://repo.anaconda.com/pkgs/r/linux-64 https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/XYZ/anaconda3/pkgs /home/XYZ/.conda/pkgs
envs directories : /home/XYZ/anaconda3/envs /home/XYZ/.conda/envs
platform : linux-64
user-agent : conda/4.10.1 requests/2.25.1 CPython/3.8.8 Linux/5.4.72-microsoft-standard-WSL2 ubuntu/20.04.2 glibc/2.31
UID:GID : 1000:1000
netrc file : None
offline mode : False

Future work?

Hello,
This isn't a real issue. Just wondering what kind of continued improvements or new releases to expect (beyond support), given that new commits to this project mostly stopped two years ago.
Is there a new tool you are working on that you would suggest as a replacement/successor?
Any other tool you are most excited about in lieu of an update by this team?

tfrec_train_kmer.sh tries to create directories in such a way that the script fails.

When using the helper script to convert the sequence fasta file that I'm interested in using for training I run the following command:

./tfrec_train_kmer.sh -i ~/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta 
-v ~/Documents/Projects/NLP-binning-approaches/data/mmetsp.vocab -o mmetsp.train.tfrec -k 12

Regardless of whether I split sequences or not, I get the following output:

  1. Shuffling sequences for training...
    mkdir: cannot create directory ‘tmp_tfrec_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta’: No such file or directory
    ./tfrec_train_kmer.sh: line 111: tmp_tfrec_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta/shuffled_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta: No such file or directory

./tfrec_train_kmer.sh: line 116: cd: tmp_tfrec_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta: No such file or directory
split: invalid number of lines: ‘0’: Numerical result out of range
rm: cannot remove 'shuffled_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta': No such file or directory

  3. Converting to TFRecord...
    ls: cannot access 'subset*': No such file or directory
    Can't use 'defined(@array)' (Maybe you should just omit the defined()?) at /home/ben/Documents/Projects/DeepMicrobes/bin/parallel line 119.
    cat: 'subset*.tfrec': No such file or directory
    rm: cannot remove 'subset*.tfrec': No such file or directory
    rmdir: failed to remove 'tmp_tfrec_/home/ben/Documents/Projects/NLP-binning-approaches/data/Combined_transcripts.fasta': No such file or directory
    Finished.

It looks like instead of adding the temporary directories to the end of the given paths its trying to prepend them, which is causing all sorts of issues. Is there something that I am missing?
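
You are not missing anything: the script builds its temporary directory name by prefixing tmp_tfrec_ to whatever string is passed via -i, so an absolute path produces the impossible 'tmp_tfrec_/home/...' paths above. Until that is patched, a workaround is to run the script from the directory containing the FASTA and pass bare filenames (paths below reuse this issue's layout, for illustration):

cd ~/Documents/Projects/NLP-binning-approaches/data
~/Documents/Projects/DeepMicrobes/pipelines/tfrec_train_kmer.sh -i Combined_transcripts.fasta -v mmetsp.vocab -o mmetsp.train.tfrec -k 12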

Trained weights not accessible

I was hoping to use your model on metagenomics data, but I didn't want to train it from scratch but rather use the exact same model you generated in your study. I noticed the SharePoint link associated with the weights of the model no longer works; is there any way I can get access to the model's weights? Thanks!

Optimal hyperparameters

Where can I find the optimal hyperparameters if I want to train the network from scratch?

Thanks

Different prediction results for the same fastq files

Hi Qiaoxing,

I used DeepMicrobes for species prediction on a system with a GPU and on a system without a GPU (only CPUs). Surprisingly, the predicted species are different for the same fastq samples. Indeed, the tfrec files generated on the GPU-based system are a little larger than those from the CPU-only system. I checked my scripts on both systems and they are the same. If the difference were related to the use of the GPU, there should be no difference in the tfrec files, as the file conversion does not require a GPU. In addition, the predictions on the GPU-based system had markedly higher confidence scores than on the system without GPUs. Is there anything to explain this?
Thank you in advance

tensorflow.python.framework.errors_impl.InvalidArgumentError

Training error: "tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot add tensor to the batch: number of elements does not match. Shapes are: [tensor]: [2516], [batch]: [2337]
[[{{node IteratorGetNext}}]]
During handling of the above exception, another exception occurred".

How to solve this problem?
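
The two sizes are k-mer vector lengths: a sequence of length L yields L - k + 1 token ids, so 2516 vs 2337 means two input sequences of different lengths landed in the same batch, and plain Dataset.batch cannot stack them. The training pipeline assumes reads of one fixed length, so one fix is to cut every training sequence into fixed-length fragments before conversion. A hypothetical preprocessing sketch (150 bp fragments, assuming one sequence line per record, preserving the '|label' suffix the converter needs):

# Cut each ">prefix|label" record into fixed-length pieces so all token
# vectors batch to the same shape (illustrative only).
def fragments(seq, size=150):
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]

with open("train.fa") as src, open("train_150bp.fa", "w") as out:
    prefix = label = None
    for line in src:
        if line.startswith(">"):
            prefix, _, label = line[1:].rstrip("\n").partition("|")
        else:
            for i, frag in enumerate(fragments(line.rstrip("\n"))):
                out.write(">%s_f%d|%s\n%s\n" % (prefix, i, label, frag))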

AttributeError: module 'tensorflow' has no attribute 'logging'. Did you mean: '_logging'? - issue while creating tfrec

I am trying to convert my reference fasta to TFrec file for training, following instructions on:
https://github.com/MicrobeLab/DeepMicrobes/blob/master/document/tfrecord.md

It is able to shuffle the sequences and split them, but when the conversion starts I get this error. Could you please help me in this regard?

  1. Shuffling sequences for training...
    (echo -n ">"; cat <&0) | sed "s/^>/\x0>/"
    ======================================
  2. Splitting input to 20480000 sequences per file...

======================================
3. Converting to TFRecord...
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 119.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 585.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 631.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 632.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 633.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 666.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 1541.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 1547.
(Maybe you should just omit the defined()?)
defined(@array) is deprecated at
/DeepMicrobes/DeepMicrobes/bin/parallel line 1553.
(Maybe you should just omit the defined()?)
2023-12-11 10:10:56.981597: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-11 10:10:57.033642: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-11 10:10:57.033709: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-11 10:10:57.035517: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-11 10:10:57.045672: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-11 10:10:57.045963: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-11 10:11:02.224734: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
File "/DeepMicrobes/DeepMicrobes/scripts/seq2tfrec_kmer.py", line 239, in
tf.logging.set_verbosity(tf.logging.INFO)
^^^^^^^^^^
AttributeError: module 'tensorflow' has no attribute 'logging'. Did you mean: '_logging'?
cat: subset*.tfrec: No such file or directory
rm: cannot remove ‘subset*.tfrec’: No such file or directory
Finished.
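
The scripts target TensorFlow 1.x, and tf.logging was removed in TF 2. Short of installing TF 1.x, the usual shim is to run the TF1 code path through the compatibility module, e.g. at the top of scripts/seq2tfrec_kmer.py (a sketch; other TF1-isms elsewhere in the repo may surface next):

import tensorflow.compat.v1 as tf  # instead of "import tensorflow as tf"
tf.disable_v2_behavior()           # restore TF1 semantics under TF2
tf.logging.set_verbosity(tf.logging.INFO)  # tf.logging exists again here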

model weights not saving?

Hello!

I'm a student using DeepMicrobes, and I'm trying to train the model on my own genome set.
It seems that when I train DeepMicrobes, it minimizes loss/increases accuracy well, but the model and weights are not saved at the end of the run. No files or weights are saved. Do you know why this is the case? How should DeepMicrobes.py save the weights after training?

The line I run is: ./DeepMicrobes/DeepMicrobes.py --input_tfrec /localscratch/MarRef.train.tfrec --model_name=attention --batch_size=4096 --model_dir=/localscratch/no_parameter_setting1k

I've tried making a folder no_parameter_setting1k beforehand and not making one, but that is not the problem. Am I correct to believe that --model_dir should receive brand-new weights when I run this line?

Thanks for your help,
Helen
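
A guess while waiting for the authors: tf.estimator.Estimator writes checkpoints to --model_dir only every _save_checkpoints_steps steps (training logs elsewhere in this tracker show 100000), plus once when classifier.train() returns, so a job killed mid-run (e.g. by a scheduler walltime on /localscratch) can leave nothing behind. If that is what happens, lowering the interval in DeepMicrobes.py's RunConfig forces frequent saves (hypothetical tweak):

# Checkpoint far more often than the 100000-step default seen in the logs.
run_config = tf.estimator.RunConfig(
    model_dir=flags.FLAGS.model_dir,
    save_checkpoints_steps=1000)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=run_config)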

Retraining seq2species model gives error

I have created a labelled fasta file based on the RefSeq (full-length) 16S rRNA database, like so:

>label|0|Abiotrophia defectiva
AGAGTTTGATCATGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGAACCGCGACTAGGTGCTTGCACTTGGTCAAGGTGAGTGGCGAACGGGTGAGTAACACGTGGGTAACCTACCTCATAGTGGGGGATAACAGTCGGAAACGACTGCTAATACCGTTAGCTAGTTGGTAGGGTAAGGNCCTACCAAGGCGATGATGCATAGCCGACCTGAGAGGGTGATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGAGCAACGCCGCGTGAGTGAAGAAGGTCTTCGGA....
>label|1|Absiella dolichum
CTGGCTCAGGATGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAAGTTTTTAGGAAAGCTTGCTTTCCAAAAAGACTTAGTGGCGAACGGGTGAGTAACACGTAGATAACCTGCCCATGTGCCCGGGATAACTGCTGGAAACGGTAGCTAAAACCGGATAGGTGGCTTCGAGGCATCTCGGAGACATTAAAATGGCTAAGGCCATGAACA...
>label|2|Absiella tortuosum
CAAATGGAGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGAAGTCAATTGAAAGCTTGCTTTTAAAAGACTTAGTGGCGAACGGGTGAGTAACNCGTAGGTAACCTACCCATGTAACTGGGATAACTGCTGGAAACGGTAGCTAAAACCGGATAGGTAAGATTGAGGCATCTTAATCTTATGAAAAAAGC...
>etc.

I then converted this file to a TFrecord using this command
seq2tfrec_onehot.py --input_seq=../combined_train_labelled.fa --output_tfrec=../combined_train.tfrec --is_train=True

Then when I try to train the seq2species model I get the following error:


(DeepMicrobes) bart@Bart-HP-PAV14:~/DeepMicrobes$ DeepMicrobes.py --input_tfrec=combined_train.tfrec --model_name=seq2species --model_dir=seq2species_new_weights --max_len=100
2020-05-30 17:36:23.820149: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
RUNNING MODE:  train
I0530 17:36:23.822940 140213195958080 tf_logging.py:115] Using config: {'_model_dir': 'seq2species_new_weights', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 100000, '_save_checkpoints_secs': None, '_session_config': None, '_keep_checkpoint_max': 1000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f85cc39f518>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
W0530 17:36:23.823674 140213195958080 tf_logging.py:120] 'cpuinfo' not imported. CPU info will not be logged.
W0530 17:36:23.823904 140213195958080 tf_logging.py:120] 'psutil' not imported. Memory info will not be logged.
I0530 17:36:23.823984 140213195958080 tf_logging.py:115] Benchmark run: {'model_name': 'model', 'dataset': {'name': 'dataset_name'}, 'machine_config': {'gpu_info': {'count': 0}}, 'run_date': '2020-05-30T15:36:23.823391Z', 'tensorflow_version': {'version': '1.9.0', 'git_hash': 'v1.9.0-0-g25c197e023'}, 'tensorflow_environment_variables': [], 'run_parameters': [{'name': 'batch_size', 'long_value': 32}, {'name': 'train_epochs', 'long_value': 1}]}
I0530 17:36:23.885492 140213195958080 tf_logging.py:115] Calling model_fn.
INPUTS BEFORE RESHAPE Tensor("IteratorGetNext:0", shape=(?, ?), dtype=int64, device=/device:CPU:0)
INPUTS Tensor("reshape_input:0", shape=(?, 1, 100, 4), dtype=int64)
filter_dim (5, 1)
SHAPE (?, 1, 100, 4)
FILTERS <tf.Variable 'convolution_1/weights:0' shape=(1, 5, 4, 1) dtype=float32_ref>
Traceback (most recent call last):
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 368, in <module>
    absl_app.run(main)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 278, in run
    _run_main(main, args)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/absl/app.py", line 239, in _run_main
    sys.exit(main(argv))
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 360, in main
    train(flags.FLAGS, model_fn, 'dataset_name')
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 214, in train
    classifier.train(input_fn=input_fn_train, hooks=train_hooks)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/bart/DeepMicrobes/DeepMicrobes.py", line 101, in model_fn
    logits = model(features)
  File "/home/bart/DeepMicrobes/models/seq2species.py", line 163, in __call__
    x = convolution(x, (spatial_conv_width[0], 1), pointwise_conv_depth[0], weight_init_scale)
  File "/home/bart/DeepMicrobes/models/seq2species.py", line 102, in convolution
    padding=padding)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 556, in separable_conv2d
    op=op)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 364, in with_space_to_batch
    return new_op(input, None)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
    return self.call(inp, filter)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 354, in <lambda>
    return lambda inp, _: op(inp, num_spatial_dims, padding)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 548, in op
    name="depthwise")
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 2111, in depthwise_conv2d_native
    name=name)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 609, in _apply_op_helper
    param_name=input_name)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
    ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'input' has DataType int64 not in list of allowed values: float16, bfloat16, float32, float64

If I add the following to remedy this error:

        x = tf.reshape(inputs, [-1, 1, self.max_len, 4], name='reshape_input')
        x = tf.cast(x, tf.float32) #added

in seq2species.py in the __call__ function, the model seems to compile but eventually crashes with the following error:

2020-05-30 17:33:38.053266: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: read.  Data types don't match. Expected type: int64, Actual type: float
Traceback (most recent call last):
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/bart/miniconda3/envs/DeepMicrobes/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Key: read.  Data types don't match. Expected type: int64, Actual type: float
         [[Node: ParseSingleExample/ParseSingleExample = ParseSingleExample[Tdense=[DT_INT64], dense_keys=["label"], dense_shapes=[[?]], num_sparse=1, sparse_keys=["read"], sparse_types=[DT_INT64], _device="/device:CPU:0"](arg0, ParseSingleExample/Const)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?,?]], output_types=[DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

Any help would be greatly appreciated!
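
The two errors together suggest a dtype mismatch between what seq2tfrec_onehot.py serialized into the 'read' feature and what the input pipeline parses (the parse op above expects int64 but found float). Before patching the model, it may help to check what the TFRecord actually stores (TF 1.x sketch):

import tensorflow as tf

# Inspect which dtype the 'read' feature was serialized with, so the parser
# in models/input_pipeline.py can be matched to it.
rec = next(tf.python_io.tf_record_iterator("combined_train.tfrec"))
feat = tf.train.Example.FromString(rec).features.feature["read"]
print("int64 values:", len(feat.int64_list.value),
      "float values:", len(feat.float_list.value))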

Request for Pretrained Model

Hello.
I noticed that the repository does not provide a pretrained model. I was wondering if it is possible to provide a pretrained model so that I can use DeepMicrobes more quickly.
Thank you!

Running DeepMicrobes with TensorFlow 2.x

Hello,
our GPU requires CUDA 11 and therefore I am stuck using TensorFlow 2.x. I was trying to lift the code to TF2, but it seems I can't find a replacement for tf.contrib.rnn.CoupledInputForgetGateLSTMCell in the embed_lstm_attention model. Is there by any chance a version of DeepMicrobes that works with TF2?
Alternatively, any pointers toward what could be a good replacement for CoupledInputForgetGateLSTMCell would help me a lot.
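
Not aware of an official TF2 port. Pragmatically, tf.keras.layers.LSTMCell is the closest built-in: CIFG only differs by tying the input gate to the forget gate (i = 1 - f), which tends to change results little. If exact CIFG behavior is wanted, it is small enough to write by hand; a minimal TF2 sketch (untested against DeepMicrobes, no peepholes or projection):

import tensorflow as tf

class CIFGCell(tf.keras.layers.AbstractRNNCell):
    # Coupled input-forget gate LSTM: the input gate is 1 - forget gate,
    # so only three gate blocks (forget, candidate, output) are learned.
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    @property
    def state_size(self):
        return [self.units, self.units]

    @property
    def output_size(self):
        return self.units

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.kernel = self.add_weight("kernel", shape=(dim + self.units, 3 * self.units))
        self.bias = self.add_weight("bias", shape=(3 * self.units,), initializer="zeros")

    def call(self, inputs, states):
        h, c = states
        z = tf.matmul(tf.concat([inputs, h], axis=-1), self.kernel) + self.bias
        f, g, o = tf.split(z, 3, axis=-1)
        f = tf.sigmoid(f)
        c_new = f * c + (1.0 - f) * tf.tanh(g)  # coupled input gate
        h_new = tf.sigmoid(o) * tf.tanh(c_new)
        return h_new, [h_new, c_new]

# Usage: rnn = tf.keras.layers.Bidirectional(
#     tf.keras.layers.RNN(CIFGCell(300), return_sequences=True))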

Error: seq-shuf not detected

Hello
I have run the command to convert fasta sequences to TFRecord as instructed. Here's my command
./tfrec_train_kmer.sh -i SRR1777715.fasta -v tokens_merged_12mers.txt -o SRR1777715.tfrec -s 20480000 -k 12

This is my output
parallel successfully detected...
ERROR : seq-shuf not detected
seq-shuf exists in bin folder. Not sure why this error is occurring. Can you please suggest a solution?

running DeepMicrobes in RHEL 8.4

Hi
I cannot run the example prediction task on a server with Red Hat 8.4 as the OS.
When I run it I get this error: E tensorflow/stream_executor/cuda/cuda_blas.cc:647] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
I googled it and it seems to be related to a memory issue; some people have suggested (e.g. here) adding the following lines to the Python code to avoid it.

import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)

I inserted the code in DeepMicrobes.py just after import tensorflow as tf, but it didn't solve it. I inserted the code in embed_lstm_attention.py and in input_pipeline.py, but there was no success either.
I also tried the following code to limit memory usage so that cuBLAS can run, but no success either.

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
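
One more thing worth checking before blaming the toolkit: DeepMicrobes drives TensorFlow through tf.estimator.Estimator, which creates its own sessions internally, so a manually constructed tf.Session is never the one doing the work. For allow_growth to take effect it has to travel through the Estimator's RunConfig in DeepMicrobes.py (sketch under that assumption, TF 1.x API):

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True  # grab GPU memory on demand
run_config = tf.estimator.RunConfig(session_config=session_config)
# ...then pass config=run_config where the Estimator is constructed.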

Another possibility is that the OS is incompatible with this version of TensorFlow (1.9.0) or its cudatoolkit, or that I should get the cudatoolkit from another conda channel that includes patches for this issue.

This is the conda list output:
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
_tflow_select             2.1.0                       gpu  
absl-py                   0.3.0            py36h28b3542_0  
astor                     0.8.1            py36h06a4308_0  
biopython                 1.70             py36h637b7d7_0  
blas                      1.0                         mkl  
c-ares                    1.17.1               h27cfd23_0  
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.5.30        py36h06a4308_0  
coverage                  5.5              py36h27cfd23_2  
cudatoolkit               9.0                  h13b8566_0  
cudnn                     7.1.2                 cuda9.0_0  
cupti                     9.0.176                       0  
cython                    0.29.24          py36h295c915_0  
dataclasses               0.8                pyh4f3eec9_6  
gast                      0.5.2              pyhd3eb1b0_0  
grpcio                    1.36.1           py36h2157cd5_1  
h5py                      2.7.1            py36h9b8c120_0  
hdf5                      1.8.18               h6792536_1  
importlib-metadata        4.8.1            py36h06a4308_0  
intel-openmp              2021.3.0          h06a4308_3350  
ld_impl_linux-64          2.35.1               h7274673_9  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   9.3.0               h5101ec6_17  
libprotobuf               3.17.2               h4ff587b_1  
libstdcxx-ng              9.3.0               hd4cf53a_17  
markdown                  3.3.4            py36h06a4308_0  
mkl                       2018.0.3                      1  
mkl_fft                   1.0.6            py36h7dd41cf_0  
mkl_random                1.0.1            py36h4414c95_1  
ncurses                   6.2                  he6710b0_1  
numpy                     1.13.3           py36hdbf6ddf_4  
openssl                   1.1.1l               h7f8727e_0  
pip                       21.2.2           py36h06a4308_0  
protobuf                  3.17.2           py36h295c915_0  
python                    3.6.13               h12debd9_1  
readline                  8.1                  h27cfd23_0  
seqtk                     1.3                  h5bf99c6_3    <unknown>
setuptools                58.0.4           py36h06a4308_0  
six                       1.16.0             pyhd3eb1b0_0  
sqlite                    3.36.0               hc218d9a_0  
tensorboard               1.9.0            py36hf484d3e_0  
tensorflow                1.9.0           gpu_py36h02c5d5e_1  
tensorflow-base           1.9.0           gpu_py36h6ecc378_0  
tensorflow-gpu            1.9.0                hf154084_0  
termcolor                 1.1.0            py36h06a4308_1  
tk                        8.6.11               h1ccaba5_0  
typing_extensions         3.10.0.2           pyh06a4308_0  
werkzeug                  2.0.1              pyhd3eb1b0_0  
wheel                     0.37.0             pyhd3eb1b0_1  
xz                        5.2.5                h7b6447c_0  
zipp                      3.6.0              pyhd3eb1b0_0  
zlib                      1.2.11               h7b6447c_3

Have you seen such a problem before, and do you have any idea how to solve it?
Many thanks for your help in advance
