

Nucleus

Nucleus is a library of Python and C++ code designed to make it easy to read, write, and analyze data in common genomics file formats like SAM and VCF. In addition, Nucleus enables painless integration with the TensorFlow machine learning framework: anywhere a genomics file is consumed or produced, a TensorFlow TFRecord file may be used instead.

Tutorial

Please check out our tutorial on using Nucleus and TensorFlow for DNA sequencing error correction. It's a Python notebook that demonstrates the power of Nucleus for integrating information from multiple file types (BAM, VCF, and FASTA) and turning it into a form usable by TensorFlow.


Installation

Nucleus currently only works on modern Linux systems using Python 3. It must be installed with a version of pip older than 21. To determine the version of pip installed on your system, run

pip --version
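The same check can be scripted; a minimal sketch (the `pip_major` helper is illustrative, not part of Nucleus or pip):

```python
def pip_major(version_output):
    """Extract the major version from `pip --version` style output,
    e.g. 'pip 20.3.4 from /usr/lib/... (python 3.8)' -> 20."""
    return int(version_output.split()[1].split(".")[0])

# Nucleus requires a pip older than 21:
print(pip_major("pip 20.3.4 from /usr/lib/python3/dist-packages/pip (python 3.8)") < 21)  # True
```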

To install Nucleus, run

pip install --user google-nucleus

Note that each version of Nucleus works with a specific TensorFlow version. Check the releases page for specifics.
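For reference, the pairings stated in the release notes below can be written as a simple lookup. This table is illustrative only; the releases page is authoritative, and releases not listed there have no guaranteed pairing.

```python
# Nucleus release -> TensorFlow version it was built against,
# per the release notes in this README.
NUCLEUS_TO_TF = {
    "0.6.0": "2.6.0",
    "0.5.9": "2.5.0",
    "0.5.7": "2.4.0",
    "0.5.6": "2.3.0",
    "0.5.5": "2.2.0",
    "0.5.4": "2.1.0",
    "0.5.3": "2.0.0",
}

def compatible_tf(nucleus_version):
    """Return the matching TensorFlow version, or None if unknown."""
    return NUCLEUS_TO_TF.get(nucleus_version)

print(compatible_tf("0.6.0"))  # 2.6.0
```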

You can ignore any "Failed building wheel for google-nucleus" error messages -- these are expected and won't prevent Nucleus from installing successfully.

If you are using Python 2, instead run

pip install --user google-nucleus==0.3.2

Documentation

Building from source

For Ubuntu 20, building from source is easy. Simply type

source install.sh

This will call build_clif.sh, which will build CLIF from scratch as well.

For all other systems, you will need to first install CLIF by following the instructions at https://github.com/google/clif#installation before running install.sh. You'll need to run this command with Python 3.8. If you don't want to build CLIF binaries on your own, you can consider using pre-built CLIF binaries (see an example here). Note that we don't plan to update these pre-built CLIF binaries, so we recommend building CLIF binaries from scratch.

Note that install.sh extensively depends on apt-get, so it is unlikely to run without extensive modifications on non-Debian-based systems.

Nucleus depends on TensorFlow. By default, install.sh will install a CPU-only version of a stable TensorFlow release (currently 2.6). If that isn't what you want, there are several other options that can be enabled with a simple edit to install.sh.

Running install.sh will build all of Nucleus's programs and libraries. You can find the generated binaries under bazel-bin/nucleus. If in addition to building Nucleus you would like to run its tests, execute

bazel test -c opt $BAZEL_FLAGS nucleus/...

Version

This is Nucleus 0.6.0. Nucleus follows semantic versioning.
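Semantic versions compare component-wise, not as strings; a small sketch of the precedence rule:

```python
def parse_semver(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints so that
    comparisons are numeric, not lexicographic."""
    return tuple(int(part) for part in version.split("."))

# String comparison happens to order "0.5.9" < "0.6.0" correctly,
# but fails for e.g. "0.5.10" vs "0.5.9"; integer tuples do not.
print(parse_semver("0.5.10") > parse_semver("0.5.9"))  # True
```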

New in 0.6.0:

  • Upgrade to support TensorFlow 2.6.0 specifically.
  • Upgrade to Python 3.8.

New in 0.5.9:

  • Upgrade to support TensorFlow 2.5.0 specifically.

New in 0.5.8:

  • Update util/vis.py to use updated channel names.
  • Support MED_DP (median DP) field for a VariantCall.

New in 0.5.7:

  • Add automatic pileup curation functionality in util/vis.py.
  • Upgrade protobuf settings to support TensorFlow 2.4.0 specifically.

New in 0.5.6:

  • Upgrade to protobuf 3.9.2 to support TensorFlow 2.3.0 specifically.

New in 0.5.5:

  • Upgrade protobuf settings to support TensorFlow 2.2.0 specifically.

New in 0.5.4:

  • Upgrade to protobuf 3.8.0 to support TensorFlow 2.1.0.
  • Add explicit .close() method to TFRecordWriter.

New in 0.5.3:

  • Fixes memory leaks in message_module.cc.
  • Updates setup.py to install .egg-info directory for pip 20.2+ compatibility.
  • Pins TensorFlow to 2.0.0 for protobuf version compatibility.
  • Pins setuptools to 49.6.0 to avoid breaking changes of setuptools 50.

New in 0.5.2:

  • Upgrades htslib dependency from 1.9 to 1.10.2.
  • More informative error message for failed SAM header parsing.
  • util/vis.py now supports saving images to Google Cloud Storage.

New in 0.5.1:

  • Added new utilities for working with DeepVariant pileup images and variant protos.

New in 0.5.0:

  • Fixed a bug that prevented Nucleus from working with TensorFlow 2.0.
  • Added util.vis routines for visualizing DeepVariant pileup examples.
  • FASTA reader now supports keep_true_case option for keeping the original casing.
  • VCF writer now supports writing headerless VCF files.
  • SAM reader now supports optional fields of type 'B'.
  • variant_utils now supports gVCF files.
  • Numerous minor bug fixes.

New in 0.4.1:

  • Pip package is slightly more robust.

New in 0.4.0:

  • The Nucleus pip package now works with Python 3.

New in 0.3.0:

  • Reading of VCF, SAM, and most other genomics files is now twice as fast.
  • Read range and end calculations are now done in C++ for speed.
  • VcfReader can now read "headerless" VCF files.
  • variant_utils.major_allele_frequency is now 5x faster.
  • Memory leaks fixed in TFRecordReader/Writer and gfile_cc.

New in 0.2.3:

  • Nucleus no longer depends on any specific version of TensorFlow's Python code. This should make it easier to use Nucleus with, for example, TensorFlow 2.0.
  • Added BCF support to VcfWriter.
  • Fixed memory leaks in VcfWriter::Write.
  • Added print_tfrecord example program.

New in 0.2.2:

  • Faster SAM file querying and read overlap calculations.
  • Writing protocol buffers to files uses less memory.
  • Smaller pip package.
  • nucleus/util:io_utils refactored into nucleus/io:tfrecord and nucleus/io:sharded_file_utils.
  • Alleles coming from VCF files are now always normalized as uppercase.

New in 0.2.1:

  • Upgrades htslib dependency from 1.6 to 1.9.
  • Minor VCF parsing fixes.
  • Added new example program, apply_genotyping_prior.
  • Slightly more robust pip package.

New in 0.2.0:

  • Support for reading and writing BedGraph files.
  • Support for reading and writing GFF files.
  • Support for reading and writing CRAM files.
  • Support for writing SAM/BAM files.
  • Support for reading unindexed FASTA files.
  • Iteration support for indexed FASTA files.
  • Ability to read VCF files from memory.
  • Python API documentation.
  • Python 3 compatibility.
  • Added universal file converter example program.

License

Nucleus is licensed under the terms of the Apache 2 license.

Support

The Genomics team in Google Brain actively supports Nucleus and is always interested in improving its quality. If you run into an issue, please report the problem on our issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and fix it. We encourage including links to snippets of BAM/VCF/etc. files that provoke the bug, if possible. Depending on the severity of the issue, we may patch Nucleus immediately with the fix or roll it into the next release.

Contributing

Interested in contributing? See CONTRIBUTING.

History

Nucleus grew out of the DeepVariant project.

Disclaimer

This is not an official Google product.

Contributors

cmclean, danielecook, gunjanbaid, marianattestad, pichuan, sgoe1, tedyun, thomascolthurst, xunjieli


Issues

Error running codelab: dna_sequencing_error_correction

Hi,

I simply ran the whole codelab tutorial and got this error with Nucleus. I think the tutorial was built with Nucleus 0.5.6 and TF 2.3. Is there any specific requirement for the Python version? There seems to be a post about using protobuf version 3.9.2; should I try installing that?


Install process is completely broken in the presence of an existing TensorFlow install, and breaks that install too

  • Start with a blank machine.
  • Install TensorFlow 2.

$pip3 install --user google-nucleus==0.5.0

Building wheels for collected packages: google-nucleus
  Building wheel for google-nucleus (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gd0b47d0/google-nucleus/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gd0b47d0/google-nucleus/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-fjnjhmg0 --python-tag cp38
       cwd: /tmp/pip-install-gd0b47d0/google-nucleus/
  Complete output (3 lines):
  /usr/lib/python3.8/site-packages/setuptools/version.py:1: UserWarning: Module google was already imported from /home/amit/.local/lib/python3.8/site-packages/google/__init__.py, but /tmp/pip-install-gd0b47d0/google-nucleus is being added to sys.path
    import pkg_resources
  This package does not support wheel creation.
  ----------------------------------------
  ERROR: Failed building wheel for google-nucleus
  Running setup.py clean for google-nucleus
Failed to build google-nucleus
Installing collected packages: google-nucleus
    Running setup.py install for google-nucleus ... done
  WARNING: Could not find .egg-info directory in install record for google-nucleus==0.5.0 from https://files.pythonhosted.org/packages/28/04/5da4ba708671d62100d7df3ce053718af58c4635080ef10d35dac7cc7c5e/google_nucleus-0.5.0.tar.gz#sha256=74b1a280a67f03c2ff751c5dcbc6a902ca5be590360b02086e8f8b55d963acfd
Successfully installed google-nucleus

There is a closed issue explaining that this is OK.

However,
import tensorflow as tf

File "~/.local/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 530, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors should not be created directly, but only retrieved from their parent.

from nucleus.io import vcf

File "~/.local/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 530, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors should not be created directly, but only retrieved from their parent.

Further,
$pip3 uninstall google-nucleus

WARNING: Skipping google-nucleus as it is not installed.

BGEN file format

I want to implement support for the BGEN file format and am looking for some guidance about what would be required beyond stamping out the proto messages. Is there a general outline for adding support for other file types?

Install nucleas on air-gapped computer

Hi, I thought that moving a tarball of the git repo to an air-gapped system and running source install.sh would install it, but as far as I can see it still requires apt-get to install packages from the internet.

Is there any way to make a tarball that has all the dependencies? Or any other suggestions on how to install it without internet access?

Really appreciate the help

pip installation check

I just tried this

pip install --user google-nucleus

I got

Collecting google-nucleus
  Could not find a version that satisfies the requirement google-nucleus (from versions: )
No matching distribution found for google-nucleus

I'd just like to check whether the package is properly set up on PyPI.

pip installation error

I simply created a python 3.6 venv and executed

pip install google-nucleus

resulting in

Collecting google-nucleus
  Using cached https://files.pythonhosted.org/packages/67/a3/2e0c0d660c5cf2806f5fffd4ba316d9609b251a75029a55449dddfe63818/google_nucleus-0.4.0.tar.gz
Requirement already satisfied: contextlib2 in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (0.5.5)
Requirement already satisfied: intervaltree in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (3.0.2)
Requirement already satisfied: absl-py in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (0.7.1)
Requirement already satisfied: mock in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (3.0.5)
Requirement already satisfied: numpy in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (1.16.4)
Requirement already satisfied: six in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (1.12.0)
Requirement already satisfied: protobuf in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from google-nucleus) (3.8.0)
Requirement already satisfied: sortedcontainers<3.0,>=2.0 in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from intervaltree->google-nucleus) (2.1.0)
Requirement already satisfied: setuptools in /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages (from protobuf->google-nucleus) (41.0.1)
Building wheels for collected packages: google-nucleus
  Building wheel for google-nucleus (setup.py) ... error
  ERROR: Complete output from command /home/james/.virtualenvs/genomic_embeddings/bin/python3 -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-ge6m9774/google-nucleus/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-xdlfri_o --python-tag cp36:
  ERROR: /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages/setuptools/version.py:1: UserWarning: Module google was already imported from /home/james/.virtualenvs/genomic_embeddings/lib/python3.6/site-packages/google/__init__.py, but /tmp/pip-install-ge6m9774/google-nucleus is being added to sys.path
    import pkg_resources
  This package does not support wheel creation.
  ----------------------------------------
  ERROR: Failed building wheel for google-nucleus
  Running setup.py clean for google-nucleus
Failed to build google-nucleus
Installing collected packages: google-nucleus
  Running setup.py install for google-nucleus ... done
  WARNING: Could not find .egg-info directory in install record for google-nucleus from https://files.pythonhosted.org/packages/67/a3/2e0c0d660c5cf2806f5fffd4ba316d9609b251a75029a55449dddfe63818/google_nucleus-0.4.0.tar.gz#sha256=288bac1e1b6dd0f934a09ce2f08f0a27d00040357e8cbf970c7d564ecb733c7d
Successfully installed google-nucleus

However the docker image built just fine.

Shipped protobuf package is not compatible with current tf 2.0 pip install

Problem:
If you install TF 2.0 using pip install tensorflow-gpu, the tensorflow package imports fine; but if you then run pip install google-nucleus, the old version of protobuf is removed, and importing tensorflow results in

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-b56a16b58840>", line 2, in <module>
    import tensorflow as tf
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py", line 98, in <module>
    from tensorflow_core import *
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 40, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 959, in _find_and_load_unlocked
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/opt/conda/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/graph_pb2.py", line 16, in <module>
    from tensorflow.core.framework import node_def_pb2 as tensorflow_dot_core_dot_framework_dot_node__def__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/node_def_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/resource_handle_pb2.py", line 45, in <module>
    serialized_options=None, file=DESCRIPTOR),
  File "/opt/conda/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 534, in __new__
    return _message.default_pool.FindFieldByName(full_name)
KeyError: "Couldn't find field tensorflow.ResourceHandleProto.DtypeAndShape.dtype"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2040, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/ultratb.py", line 1101, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/ultratb.py", line 319, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/IPython/core/ultratb.py", line 353, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/opt/conda/lib/python3.7/inspect.py", line 1502, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/opt/conda/lib/python3.7/inspect.py", line 1460, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/opt/conda/lib/python3.7/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/opt/conda/lib/python3.7/inspect.py", line 733, in getmodule
    if ismodule(module) and hasattr(module, '__file__'):
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py", line 50, in __getattr__
    module = self._load()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "/opt/conda/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/__init__.py", line 42, in <module>
    from . _api.v2 import audio
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/_api/v2/audio/__init__.py", line 10, in <module>
    from tensorflow.python.ops.gen_audio_ops import decode_wav
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_audio_ops.py", line 11, in <module>
    from tensorflow.python.eager import context as _context
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/eager/context.py", line 29, in <module>
    from tensorflow.core.protobuf import config_pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/protobuf/config_pb2.py", line 17, in <module>
    from tensorflow.core.framework import graph_pb2 as tensorflow_dot_core_dot_framework_dot_graph__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/graph_pb2.py", line 16, in <module>
    from tensorflow.core.framework import node_def_pb2 as tensorflow_dot_core_dot_framework_dot_node__def__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/node_def_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/core/framework/resource_handle_pb2.py", line 45, in <module>
    serialized_options=None, file=DESCRIPTOR),
  File "/opt/conda/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 534, in __new__
    return _message.default_pool.FindFieldByName(full_name)
KeyError: "Couldn't find field tensorflow.ResourceHandleProto.DtypeAndShape.dtype"

I think this could be resolved by updating the protobuf version shipped with pip install google-nucleus, but I'm not sure.

PythonNext() argument read is not valid: Dynamic cast failed

from nucleus.protos import reads_pb2
from nucleus.io import sam

read_requirements = reads_pb2.ReadRequirements()
sam_reader = sam.SamReader(
      input_path="NA12878_sliced.bam", read_requirements=read_requirements)

for r in sam_reader:
    print(r)
RuntimeError: PythonNext() argument read is not valid: Dynamic cast failed
2023-12-27 16:25:36.748166: W [.nucleus/util/proto_clif_converter.h:60)] Failed to cast type N6google8protobuf14DynamicMessageE
  • Versions:
Python 3.8.18
tensorflow                   2.13.1
protobuf                     3.20.3
google-nucleus               0.6.0

Any recommendations?
I'm trying to run this example: https://blog.tensorflow.org/2019/01/using-nucleus-and-tensorflow-for-dna.html

pip install error

With pip install --user google-nucleus it gives the error

Could not find a version that satisfies the requirement google-nucleus (from versions: )
No matching distribution found for google-nucleus

Is there a fix for this? The machine is running Ubuntu 18.04.

Can't uninstall?

I installed nucleus with:

pip install --user google-nucleus

It threw out this error as well as successfully installed message:

Building wheel for google-nucleus (setup.py) ... error
ERROR: Complete output from command /home/user/miniconda3/envs/py2/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-80mPYb/google-nucleus/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-2P701B --python-tag cp27:
ERROR: /home/user/miniconda3/envs/py2/lib/python2.7/site-packages/setuptools/version.py:1: UserWarning: Module google was already imported from None, but /tmp/pip-install-80mPYb/google-nucleus is being added to sys.path
import pkg_resources
This package does not support wheel creation.

ERROR: Failed building wheel for google-nucleus
Running setup.py clean for google-nucleus
Failed to build google-nucleus
Installing collected packages: google-nucleus
Running setup.py install for google-nucleus ... done
WARNING: Could not find .egg-info directory in install record for google-nucleus from https://files.pythonhosted.org/packages/4c/98/24e36281ccae0879a94a5de17a7769fe43937622a5b57afde6c18e6671bd/google_nucleus-0.4.1.tar.gz#sha256=eaedc19ae573d27d204192ff21014bbb2fe28c85fe4e27b328cd10b299fa6871
Successfully installed google-nucleus

I could find "google" and "nucleus" directories in /home/user/.local/lib/python2.7/site-packages,
but when I do "pip list" or "pip2 list", they are not there.

Now I want to uninstall them:

pip/pip2 uninstall google
Cannot uninstall requirement google, not installed

pip/pip2 uninstall nucleus
Cannot uninstall requirement nucleus, not installed

pip/pip2 uninstall google-nucleus
Cannot uninstall requirement google-nucleus, not installed

What should I do??

Apache Beam integration

Hi,

I have a question regarding Nucleus and Apache Beam integration. I've recently read that Nucleus will not replace PyVCF, which is currently used by Apache Beam, and so

Beam might have to drop support for VCF IO once Beam drops support of Python 2

I am curious whether there are any plans for Nucleus-Beam integration, or perhaps any suggestions on how to read VCF files with Nucleus and serve them to a Beam pipeline. Should one first read the VCFs with Nucleus, write them as TFRecords, and then use Beam's tfrecordio module to read those? This seems a bit clumsy and counter-intuitive. Any advice on how to approach this more elegantly would be highly appreciated.

Thanks!

install google-nucleus error

When I pip install Nucleus, it throws the error below. I have tried versions from 0.5.6 through the latest, with the same error.

python 3.9
tensorflow 2.9.1
ubuntu 20

pip install --user google-nucleus

import pkg_resources
This package does not support wheel creation.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for google-nucleus
Running setup.py clean for google-nucleus
Failed to build google-nucleus
ERROR: Could not build wheels for google-nucleus, which is required to install pyproject.toml-based projects

python2 import error

I have tried to install with both pip and pip2.

Importing nucleus in Python 3 works fine.
Importing nucleus in Python 2 fails with the following error:

Python 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 21:00:58)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import nucleus
from nucleus.io import vcf
Traceback (most recent call last):
File "", line 1, in <module>
File "/home/user/.local/lib/python2.7/site-packages/nucleus/io/vcf.py", line 64, in <module>
from nucleus.io import genomics_reader
File "/home/user/.local/lib/python2.7/site-packages/nucleus/io/genomics_reader.py", line 68, in <module>
from nucleus.io.python import tfrecord_reader
ImportError: /home/user/.local/lib/python2.7/site-packages/nucleus/io/python/tfrecord_reader.so: undefined symbol: _Py_FalseStruct

I found an answer to this error:
tensorflow/tensorflow#16720 (comment)

but I am not sure what I did wrong.
I have removed and recreated the environment many times.

Read strand

I would like to know how to get the strand information of a read from nucleus.io.sam.SamReader, since I am working with BAM files from a long-read transcriptome.

Is there any limitation for calling rare mutation?

Hi,
Thanks for the magic library you developed.
I wonder whether this can be used for rare-mutation calling with MAF around or below 1%. I remember there is a limitation in DeepVariant that the maximum read depth is 362 due to Inception v3. I wonder whether this limitation also applies to Nucleus.
Thanks!
Best,
Weiwei

User manual for Nucleus

Hello,

Do you have a user manual for Nucleus that explains each class, function, and argument?
For example, what are nucleus.protos and reads_pb2?

Thank you!

Nucleus protos are colliding with TensorFlow protos?

Hello,

I'm trying to use the nucleus.io's tfrecord functions.
When I import as shown below, I get a TypeError. It looks like the protos are colliding, and I'm not sure how to fix this, as I want to use the Example and Feature protos provided in nucleus.

I have two directories, dir1 and dir2. In dir1, I import tensorflow and it has test files. In dir2, I import nucleus.io. When I run pytest on dir1 and dir2, separately, they both work. However, when I change to the parent directory and run pytest, I get this error. Please help.

Thank you in advance,
Moonjo

from nucleus.io import tfrecord

/root/.local/lib/python3.7/site-packages/nucleus/io/tfrecord.py:35: in <module>
from nucleus.protos import example_pb2
/root/.local/lib/python3.7/site-packages/nucleus/protos/example_pb2.py:16: in <module>
from nucleus.protos import feature_pb2 as nucleus_dot_protos_dot_feature__pb2
/root/.local/lib/python3.7/site-packages/nucleus/protos/feature_pb2.py:23: in <module>
serialized_pb=_b('\n\x1cnucleus/protos/feature.proto\x12\ntensorflow"\x1a\n\tBytesList\x12\r\n\x05value\x18\x01 \x03(\x0c"\x1e\n\tFloatList\x12\x11\n\x05value\x18\x01 \x03(\x02\x42\x02\x10\x01"\x1e\n\tInt64List\x12\x11\n\x05value\x18\x01 \x03(\x03\x42\x02\x10\x01"\x98\x01\n\x07\x46\x65\x61ture\x12+\n\nbytes_list\x18\x01 \x01(\x0b\x32\x15.tensorflow.BytesListH\x00\x12+\n\nfloat_list\x18\x02 \x01(\x0b\x32\x15.tensorflow.FloatListH\x00\x12+\n\nint64_list\x18\x03 \x01(\x0b\x32\x15.tensorflow.Int64ListH\x00\x42\x06\n\x04kind"\x83\x01\n\x08\x46\x65\x61tures\x12\x32\n\x07\x66\x65\x61ture\x18\x01 \x03(\x0b\x32!.tensorflow.Features.FeatureEntry\x1a\x43\n\x0c\x46\x65\x61tureEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12"\n\x05value\x18\x02 \x01(\x0b\x32\x13.tensorflow.Feature:\x02\x38\x01"3\n\x0b\x46\x65\x61tureList\x12$\n\x07\x66\x65\x61ture\x18\x01 \x03(\x0b\x32\x13.tensorflow.Feature"\x9c\x01\n\x0c\x46\x65\x61tureLists\x12?\n\x0c\x66\x65\x61ture_list\x18\x01 \x03(\x0b\x32).tensorflow.FeatureLists.FeatureListEntry\x1aK\n\x10\x46\x65\x61tureListEntry\x12\x0b\n\x03key\x18\x01 \x01(\t\x12&\n\x05value\x18\x02 \x01(\x0b\x32\x17.tensorflow.FeatureList:\x02\x38\x01\x42,\n\x16org.tensorflow.exampleB\rFeatureProtosP\x01\xf8\x01\x01\x62\x06proto3')
/usr/local/lib/python3.7/dist-packages/google/protobuf/descriptor.py:965: in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
E TypeError: Couldn't build proto file into descriptor pool!
E Invalid proto descriptor for file "nucleus/protos/feature.proto":
E tensorflow.BytesList.value: "tensorflow.BytesList.value" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.BytesList: "tensorflow.BytesList" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FloatList.value: "tensorflow.FloatList.value" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FloatList: "tensorflow.FloatList" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Int64List.value: "tensorflow.Int64List.value" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Int64List: "tensorflow.Int64List" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature.kind: "tensorflow.Feature.kind" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature.bytes_list: "tensorflow.Feature.bytes_list" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature.float_list: "tensorflow.Feature.float_list" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature.int64_list: "tensorflow.Feature.int64_list" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature: "tensorflow.Feature" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Features.feature: "tensorflow.Features.feature" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Features.FeatureEntry.key: "tensorflow.Features.FeatureEntry.key" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Features.FeatureEntry.value: "tensorflow.Features.FeatureEntry.value" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Features.FeatureEntry: "tensorflow.Features.FeatureEntry" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Features: "tensorflow.Features" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureList.feature: "tensorflow.FeatureList.feature" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureList: "tensorflow.FeatureList" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureLists.feature_list: "tensorflow.FeatureLists.feature_list" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureLists.FeatureListEntry.key: "tensorflow.FeatureLists.FeatureListEntry.key" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureLists.FeatureListEntry.value: "tensorflow.FeatureLists.FeatureListEntry.value" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureLists.FeatureListEntry: "tensorflow.FeatureLists.FeatureListEntry" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.FeatureLists: "tensorflow.FeatureLists" is already defined in file "tensorflow/core/example/feature.proto".
E tensorflow.Feature.bytes_list: "tensorflow.BytesList" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.Feature.float_list: "tensorflow.FloatList" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.Feature.int64_list: "tensorflow.Int64List" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.Features.FeatureEntry.value: "tensorflow.Feature" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.Features.feature: "tensorflow.Features.FeatureEntry" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.FeatureList.feature: "tensorflow.Feature" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.FeatureLists.FeatureListEntry.value: "tensorflow.FeatureList" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.
E tensorflow.FeatureLists.feature_list: "tensorflow.FeatureLists.FeatureListEntry" seems to be defined in "tensorflow/core/example/feature.proto", which is not imported by "nucleus/protos/feature.proto". To use it here, please add the necessary import.

Nucleus not building for Python 3.7-3.9 via pip.

I am trying to build a Docker image that will have Nucleus installed.

Currently I have such a file:

FROM ubuntu:latest

WORKDIR /test/

RUN apt -y update && apt install -y \
    software-properties-common

RUN add-apt-repository ppa:deadsnakes/ppa

RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata

RUN apt -y update && apt install -y \
    python3.7 \
    python3.7-distutils \
    curl

RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3.7 get-pip.py

RUN python3.7 -m pip install google-nucleus

COPY example.vcf /test/

CMD /bin/bash
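One thing I have not ruled out (an assumption on my part, based on the README's note that Nucleus must be installed with a pip older than 21, while get-pip.py installs the newest pip): the pip version itself. A minimal guard for the version string, with the "less than 21" threshold taken from the README:

```python
def pip_version_ok(version: str) -> bool:
    """The Nucleus README says installation requires pip < 21."""
    major = int(version.split('.')[0])
    return major < 21

print(pip_version_ok('20.2.3'))  # True
print(pip_version_ok('22.0.4'))  # False
```

If the check fails, pinning pip before installing Nucleus (for example adding `RUN python3.7 -m pip install 'pip<21'` as a separate step) might be worth trying.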

I managed to install Nucleus for Python 3.10+; however, there is a problem with Python 3.10 having removed some collections aliases that your code still uses.

Currently, for this installation (I tried also python 3.9) I am receiving following log:

Building wheels for collected packages: google-nucleus, intervaltree, crcmod, dill, docopt
  Building wheel for google-nucleus (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      This package does not support wheel creation.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for google-nucleus
  Running setup.py clean for google-nucleus
  Building wheel for intervaltree (setup.py) ... done
  Created wheel for intervaltree: filename=intervaltree-3.1.0-py2.py3-none-any.whl size=26100 sha256=1b59ccfb155513ee60dad398c484e6495a1675d9db312c7c8b5c02ec1b660997
  Stored in directory: /root/.cache/pip/wheels/ab/fa/1b/75d9a713279796785711bd0bad8334aaace560c0bd28830c8c
  Building wheel for crcmod (setup.py) ... done
  Created wheel for crcmod: filename=crcmod-1.7-py3-none-any.whl size=18834 sha256=9b30119c33f15d631de7cf9e9951031770c5bb098860bc633a8a33732de511a0
  Stored in directory: /root/.cache/pip/wheels/4a/6c/a6/ffdd136310039bf226f2707a9a8e6857be7d70a3fc061f6b36
  Building wheel for dill (setup.py) ... done
  Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78544 sha256=2e36ffde415b6526c1e157d895b7bfa00ef526f1b10f2076ffdbf72c80e7fbf3
  Stored in directory: /root/.cache/pip/wheels/4f/0b/ce/75d96dd714b15e51cb66db631183ea3844e0c4a6d19741a149
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=a81345c43fee7ab5e75b827c4e34d6112f00de5b30be2b93465edfe968e85c91
  Stored in directory: /root/.cache/pip/wheels/70/4a/46/1309fc853b8d395e60bafaf1b6df7845bdd82c95fd59dd8d2b
Successfully built intervaltree crcmod dill docopt
Failed to build google-nucleus
Installing collected packages: wcwidth, sortedcontainers, pytz, pure-eval, ptyprocess, pickleshare, executing, docopt, crcmod, backcall, urllib3, typing-extensions, traitlets, python-dateutil, pymongo, pygments, pydot, protobuf, prompt-toolkit, Pillow, pexpect, parso, orjson, numpy, mock, intervaltree, idna, grpcio, fastavro, dill, decorator, contextlib2, cloudpickle, charset-normalizer, certifi, asttokens, absl-py, stack-data, requests, pyarrow, proto-plus, matplotlib-inline, jedi, ipython, hdfs, apache-beam, google-nucleus
  Running setup.py install for google-nucleus ... done
  DEPRECATION: google-nucleus was installed using the legacy 'setup.py install' method, because a wheel could not be built for it. A possible replacement is to fix the wheel build issue reported above. Discussion can be found at https://github.com/pypa/pip/issues/8368
Successfully installed Pillow-9.1.1 absl-py-1.1.0 apache-beam-2.40.0 asttokens-2.0.5 backcall-0.2.0 certifi-2022.6.15 charset-normalizer-2.1.0 cloudpickle-2.1.0 contextlib2-21.6.0 crcmod-1.7 decorator-5.1.1 dill-0.3.1.1 docopt-0.6.2 executing-0.8.3 fastavro-1.5.2 google-nucleus grpcio-1.47.0 hdfs-2.7.0 idna-3.3 intervaltree-3.1.0 ipython-8.4.0 jedi-0.18.1 matplotlib-inline-0.1.3 mock-4.0.3 numpy-1.22.4 orjson-3.7.5 parso-0.8.3 pexpect-4.8.0 pickleshare-0.7.5 prompt-toolkit-3.0.30 proto-plus-1.20.6 protobuf-3.20.1 ptyprocess-0.7.0 pure-eval-0.2.2 pyarrow-7.0.0 pydot-1.4.2 pygments-2.12.0 pymongo-3.12.3 python-dateutil-2.8.2 pytz-2022.1 requests-2.28.1 sortedcontainers-2.4.0 stack-data-0.3.0 traitlets-5.3.0 typing-extensions-4.2.0 urllib3-1.26.9 wcwidth-0.2.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

This is happening during the google-nucleus installation process.
Could someone help me with that?
I am out of ideas about what to do from here.

This is the output that I am receiving for the default Python (version 3.10.4):

[Screenshot: 2022-07-01 at 10 54 04]

ImportError: libbz2.so.1.0: cannot open shared object file: No such file or directory

I tried to run the example Nucleus program:

from nucleus.io import vcf

with vcf.VcfReader('./clinvar.vcf.gz') as reader:
  print('Sample names in VCF: ', ' '.join(reader.header.sample_names))
  with vcf.VcfWriter('/tmp/filtered.tfrecord', header=reader.header) as writer:
    for variant in reader:
      if variant.quality > 3.01:
        writer.write(variant)

but I keep getting this error:
ImportError: libbz2.so.1.0: cannot open shared object file: No such file or directory

I have tried two solutions I found online, but neither worked:
Solution1:

sudo yum install libbz2
sudo yum install bzip2-libs

Solution 2:

sudo ln -s 
sudo ldconfig

Seeking your help, thank you very much!
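For reference, here is a small check I can run, assuming (without being sure) that the failure mode is simply that the dynamic loader cannot find libbz2 at all:

```python
import ctypes
import ctypes.util

def libbz2_visible() -> bool:
    """Report whether the dynamic loader can locate and open libbz2."""
    name = ctypes.util.find_library('bz2')
    if name is None:
        return False
    try:
        ctypes.CDLL(name)  # actually dlopen it, as importing Nucleus would
        return True
    except OSError:
        return False

print(libbz2_visible())
```

If this prints False, installing the bzip2 runtime library (bzip2-libs on RHEL/CentOS, libbz2-1.0 on Debian/Ubuntu) and re-running sudo ldconfig to refresh the loader cache is the usual fix.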

Error importing

I get the following error while importing. I am using Mac OS X 10.14.6 Mojave with the latest version of Anaconda.

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-d1c328622acf> in <module>
----> 1 from nucleus.io import vcf

~/anaconda3/lib/python3.7/site-packages/nucleus/io/vcf.py in <module>
     62 from __future__ import print_function
     63 
---> 64 from nucleus.io import genomics_reader
     65 from nucleus.io import genomics_writer
     66 from nucleus.io.python import vcf_reader

~/anaconda3/lib/python3.7/site-packages/nucleus/io/genomics_reader.py in <module>
     66 import six
     67 
---> 68 from nucleus.io.python import tfrecord_reader
     69 
     70 

ImportError: dlopen(/Users/dquang/anaconda3/lib/python3.7/site-packages/nucleus/io/python/tfrecord_reader.so, 2): no suitable image found.  Did find:
	/Users/dquang/anaconda3/lib/python3.7/site-packages/nucleus/io/python/tfrecord_reader.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
	/Users/dquang/anaconda3/lib/python3.7/site-packages/google/protobuf/pyext/_message.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00

ImportError for `nucleus.io.gfile`

Hi,
I encountered an import error when trying to import nucleus.io.gfile. It seems someone reported this problem on a closed issue in Nov 2019 (#14 (comment)).

I'm using macOS 10.14.6, and I've attached the problem below.

## Start a python3.6 virtualenv to show reproduction of import error
$ python3 -m virtualenv nucleus-import
created virtual environment CPython3.6.11.final.0-64 in 1120ms
  creator CPython3Posix(dest=/Users/YoungChanPark/nucleus-import, clear=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/YoungChanPark/Library/Application Support/virtualenv)
    added seed packages: pip==20.2.2, setuptools==49.6.0, wheel==0.35.1
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

## Activate virtualenv
$ source nucleus-import/bin/activate

## Install Nucleus with Pip
(nucleus-import) $ pip install google-nucleus
Collecting google-nucleus
  Using cached google_nucleus-0.5.3.tar.gz (6.8 MB)
Collecting contextlib2
  Using cached contextlib2-0.6.0.post1-py2.py3-none-any.whl (9.8 kB)
Processing ./Library/Caches/pip/wheels/fc/e6/3f/1616b381f981006664dd5123f06b231bbbb2e7d604a417e2fd/intervaltree-3.1.0-py2.py3-none-any.whl
Collecting absl-py
  Using cached absl_py-0.10.0-py3-none-any.whl (127 kB)
Collecting mock
  Using cached mock-4.0.2-py3-none-any.whl (28 kB)
Collecting numpy
  Using cached numpy-1.19.2-cp36-cp36m-macosx_10_9_x86_64.whl (15.3 MB)
Collecting six
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting protobuf
  Using cached protobuf-3.13.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.3 MB)
Collecting Pillow
  Using cached Pillow-7.2.0-cp36-cp36m-macosx_10_10_x86_64.whl (2.2 MB)
Collecting ipython
  Using cached ipython-7.16.1-py3-none-any.whl (785 kB)
Collecting sortedcontainers<3.0,>=2.0
  Using cached sortedcontainers-2.2.2-py2.py3-none-any.whl (29 kB)
Requirement already satisfied: setuptools in ./nucleus-import/lib/python3.6/site-packages (from protobuf->google-nucleus) (49.6.0)
Collecting pexpect; sys_platform != "win32"
  Using cached pexpect-4.8.0-py2.py3-none-any.whl (59 kB)
Collecting decorator
  Using cached decorator-4.4.2-py2.py3-none-any.whl (9.2 kB)
Collecting jedi>=0.10
  Using cached jedi-0.17.2-py2.py3-none-any.whl (1.4 MB)
Collecting traitlets>=4.2
  Using cached traitlets-4.3.3-py2.py3-none-any.whl (75 kB)
Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0
  Using cached prompt_toolkit-3.0.7-py3-none-any.whl (355 kB)
Collecting pygments
  Using cached Pygments-2.7.1-py3-none-any.whl (944 kB)
Collecting pickleshare
  Using cached pickleshare-0.7.5-py2.py3-none-any.whl (6.9 kB)
Collecting backcall
  Using cached backcall-0.2.0-py2.py3-none-any.whl (11 kB)
Collecting appnope; sys_platform == "darwin"
  Using cached appnope-0.1.0-py2.py3-none-any.whl (4.0 kB)
Collecting ptyprocess>=0.5
  Using cached ptyprocess-0.6.0-py2.py3-none-any.whl (39 kB)
Collecting parso<0.8.0,>=0.7.0
  Using cached parso-0.7.1-py2.py3-none-any.whl (109 kB)
Collecting ipython-genutils
  Using cached ipython_genutils-0.2.0-py2.py3-none-any.whl (26 kB)
Collecting wcwidth
  Using cached wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
Building wheels for collected packages: google-nucleus
  Building wheel for google-nucleus (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/YoungChanPark/nucleus-import/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4m/q2rj8ww90ms0jwg7zv4ldwqh0000gn/T/pip-install-0hj3airk/google-nucleus/setup.py'"'"'; __file__='"'"'/private/var/folders/4m/q2rj8ww90ms0jwg7zv4ldwqh0000gn/T/pip-install-0hj3airk/google-nucleus/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/4m/q2rj8ww90ms0jwg7zv4ldwqh0000gn/T/pip-wheel-7e5y8qnk
       cwd: /private/var/folders/4m/q2rj8ww90ms0jwg7zv4ldwqh0000gn/T/pip-install-0hj3airk/google-nucleus/
  Complete output (1 lines):
  This package does not support wheel creation.
  ----------------------------------------
  ERROR: Failed building wheel for google-nucleus
  Running setup.py clean for google-nucleus
Failed to build google-nucleus
DEPRECATION: Could not build wheels for google-nucleus which do not use PEP 517. pip will fall back to legacy 'setup.py install' for these. pip 21.0 will remove support for this functionality. A possible replacement is to fix the wheel build issue reported above. You can find discussion regarding this at https://github.com/pypa/pip/issues/8368.
Installing collected packages: contextlib2, sortedcontainers, intervaltree, six, absl-py, mock, numpy, protobuf, Pillow, ptyprocess, pexpect, decorator, parso, jedi, ipython-genutils, traitlets, wcwidth, prompt-toolkit, pygments, pickleshare, backcall, appnope, ipython, google-nucleus
    Running setup.py install for google-nucleus ... done
Successfully installed Pillow-7.2.0 absl-py-0.10.0 appnope-0.1.0 backcall-0.2.0 contextlib2-0.6.0.post1 decorator-4.4.2 google-nucleus-0.5.3 intervaltree-3.1.0 ipython-7.16.1 ipython-genutils-0.2.0 jedi-0.17.2 mock-4.0.2 numpy-1.19.2 parso-0.7.1 pexpect-4.8.0 pickleshare-0.7.5 prompt-toolkit-3.0.7 protobuf-3.13.0 ptyprocess-0.6.0 pygments-2.7.1 six-1.15.0 sortedcontainers-2.2.2 traitlets-4.3.3 wcwidth-0.2.5
WARNING: You are using pip version 20.2.2; however, version 20.2.3 is available.
You should consider upgrading via the '/Users/YoungChanPark/nucleus-import/bin/python -m pip install --upgrade pip' command.

# Start Python interpreter
(nucleus-import) $ python3

I get an import error when trying to import gfile.

>>> from nucleus.io import gfile
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/nucleus/io/gfile.py", line 22, in <module>
    from nucleus.io.python import gfile
ImportError: dlopen(/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/nucleus/io/python/gfile.so, 2): no suitable image found.  Did find:
	/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/nucleus/io/python/gfile.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
	/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/google/protobuf/pyext/_message.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00

>>> from nucleus.io.python import gfile
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/nucleus/io/python/gfile.so, 2): no suitable image found.  Did find:
	/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/nucleus/io/python/gfile.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
	/Users/YoungChanPark/nucleus-import/lib/python3.6/site-packages/google/protobuf/pyext/_message.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00

cannot import name 'tfrecord_reader' (python3)

Problem:

When I try to import the vcf module I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/nucleus-0.3.0/nucleus/io/vcf.py", line 64, in <module>
    from nucleus.io import genomics_reader
  File "/usr/local/lib/python3.5/dist-packages/nucleus-0.3.0/nucleus/io/genomics_reader.py", line 68, in <module>
    from nucleus.io.python import tfrecord_reader
ImportError: cannot import name 'tfrecord_reader'

I installed the google-nucleus library from source, since it is not available through pip3, as described in issue #9.

Steps to reproduce:

Dockerfile:

FROM tensorflow/tensorflow:1.13.1-gpu-py3-jupyter

RUN apt-get update && apt-get install -y --no-install-recommends sudo

# Install nucleus from source since it is not available through pip3
RUN cd /usr/local/lib/python3.5/dist-packages/ && \
    curl -SL https://github.com/google/nucleus/archive/0.3.0.tar.gz | tar xz && \
    cd nucleus-0.3.0 && source install.sh

ENV PYTHONPATH="$PYTHONPATH:/usr/local/lib/python3.5/dist-packages/nucleus-0.3.0/"

# Try to import vcf to reproduce an error
CMD ["python", "-c", "from nucleus.io import vcf"]

within the directory containing the Dockerfile run:

docker build -t test .
docker run test

Any help would be highly appreciated!

RuntimeError: PythonNext() argument variant is not valid: Dynamic cast failed.

Hi,

As I am using macOS, I have set up a Docker environment to run Nucleus with TensorFlow in a Jupyter Notebook.

For the Docker image/container I am using the base-image "python:3.6-buster" as Nucleus doesn't work with Python 3.8. Here is the Dockerfile that I use:

# download base image
FROM python:3.6-buster

# dealing with cache issues
RUN apt-get clean
RUN apt-get update

# keep pip below v.21
RUN pip3 install --upgrade pip==20.2.3

# copy and paste the src directory into the image
WORKDIR src/
COPY . .

RUN pip3 install -r requirements.txt

For the dependencies (requirements.txt) I have tried the latest versions of TensorFlow and Nucleus, as well as Nucleus v0.4.1 and TensorFlow v1.13.1, as suggested here.

Inside the Jupyter Notebook (that is running in the Docker container), I am re-installing protobuf (i.e., !pip3 install --force-reinstall --upgrade protobuf).

But when trying to test-run this code:

from nucleus.io import vcf
r = vcf.VcfReader('<sample>.vcf')
for v in r:
  print(v.start)

I am getting this error: RuntimeError: PythonNext() argument variant is not valid: Dynamic cast failed.

I have tried different hacks and tricks to get around the error in the Docker container (changing pip versions; different Python versions, 3.6 and 3.7; different TensorFlow and Nucleus versions; etc.). Unfortunately, I forgot to log everything I have tried to get past this issue over the last couple of days.

Is there a way to get around this issue?
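One check I can at least run myself (assuming, without being certain, that the "Dynamic cast failed" error is about which protobuf backend is active, since Nucleus's CLIF bindings hand messages through the C++ protobuf layer):

```python
import os

def protobuf_backend() -> str:
    """Best-effort report of the active protobuf implementation."""
    # An explicit override, if one was set in the environment:
    override = os.environ.get('PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION')
    if override:
        return override
    try:
        from google.protobuf.internal import api_implementation
        return api_implementation.Type()  # e.g. 'cpp' or 'python'
    except ImportError:
        return 'protobuf not installed'

print(protobuf_backend())
```

If this reports the pure-Python backend, force-reinstalling a protobuf build with the C++ extension (matching the version Nucleus was built against) would be the next thing I try.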

Thanks,

Best wishes, Birgitte

[DOC Request] Performance metrics for *Reader



First of all, thanks for open-sourcing this software!

It would be great if a performance analysis of Nucleus on basic tasks were available, with comparisons to pysam and samtools/htslib.

In particular the costs of:

  • Iteration over all reads/records in a large BAM/VCF file.
  • Filtering variants based on a simple condition.
  • Iteration over reads in a range (like samtools view).
  • Random access to a single read.
  • How well it supports parallel access to a single VCF/BAM file.

Some metrics that could be interesting are:

  • Time spent in IO.
  • Memory footprint.
  • Total execution time.
  • Cost as a function of batch size (I imagine the cost of a fetch is an affine function of the number of reads to extract).

The rationale behind this: when reads are used to build tensors that feed a machine learning algorithm, as you did in DeepVariant, and the tensors are generated on the fly, reads are fetched once per epoch and the cost can add up.

I imagine that Nucleus is less efficient than samtools, since you have to pay the Python overhead; however, I am curious how pysam fares relative to Nucleus, as it seems to be the current standard for analyzing sequencing data in Python.
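To make the request concrete, the kind of harness I have in mind is tiny (a sketch with a stand-in iterable, since I am not assuming any particular Nucleus API):

```python
import time

def time_iteration(reader, limit=None):
    """Time one full pass over a reader; returns (records_seen, seconds)."""
    count = 0
    start = time.perf_counter()
    for _ in reader:
        count += 1
        if limit is not None and count >= limit:
            break
    return count, time.perf_counter() - start

# Stand-in for e.g. a SAM/VCF reader or a pysam.AlignmentFile iterator:
n, secs = time_iteration(iter(range(100_000)))
print(n, secs >= 0.0)  # 100000 True
```

The same function could then be pointed at a Nucleus reader, a pysam iterator, and a samtools subprocess in turn, with memory footprint tracked separately (e.g. via /usr/bin/time -v).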

Cheers,
Felix

Example Binaries

Are the example binaries compiled if I install nucleus via pip? If so, where can I find them?

If not, do I have to run the install.sh script to build the example binaries?

Compatibility with Tensorflow 2.11.0

Hello,

The current version of Nucleus is compatible only up to TensorFlow 2.6.0. Can we have a Nucleus version compatible with the latest TensorFlow 2.11.0?

Thanks,
Saurabh

Use FASTQ test suite from Cock et al?

This is a little self-promotion, but I'd encourage you to use the problematic FASTQ files (tarball) provided as supplementary material in our paper about the file format for your test suite:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer and P.M. Rice. Nucleic Acids Research, 2010, 38(6), 1767-1771. http://dx.doi.org/10.1093/nar/gkp1137

Note that Illumina moved over to the standard Sanger quality encoding as of v1.8 of their pipeline, see also https://en.wikipedia.org/wiki/FASTQ_format

MD tag

Hi
I might be missing it, but is there any way to access the information stored in the MD tag?
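In case it helps frame the question: once the raw MD string is in hand (I believe the Read proto keeps aux tags in its info map, though I have not verified the exact key), decoding it is simple. A sketch of an MD-tag tokenizer:

```python
import re

# MD strings interleave match lengths, mismatched reference bases, and
# ^-prefixed deleted reference runs, e.g. "10A5^AC6".
_MD_TOKEN = re.compile(r'(\d+)|(\^[A-Za-z]+)|([A-Za-z])')

def parse_md(md: str):
    """Tokenize an MD tag value into (op, value) pairs."""
    ops = []
    for num, deletion, mismatch in _MD_TOKEN.findall(md):
        if num:
            ops.append(('match', int(num)))
        elif deletion:
            ops.append(('del', deletion[1:]))
        else:
            ops.append(('mismatch', mismatch))
    return ops

print(parse_md('10A5^AC6'))
# [('match', 10), ('mismatch', 'A'), ('match', 5), ('del', 'AC'), ('match', 6)]
```

What I am really asking is whether Nucleus exposes the raw tag value so that something like this can consume it.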

RuntimeError: PythonNext() argument read is not valid: Dynamic cast failed

Hi All,
Thank you for the tutorial about "Using Nucleus and TensorFlow for DNA Sequencing Error Correction". I am a bioinformatician trying to apply deep learning in genomics.
As a beginner, I have been trying to run the provided tutorial codes from this notebook

However, the final step throws the following error:

hparams = BaseHparams()
run(hparams)

The error,

Generating data...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-d46fc6eb10e4> in <module>
      1 # This cell should take ~6 minutes to run with the default parameters.
      2 hparams = BaseHparams()
----> 3 run(hparams)

<ipython-input-11-eb89d5445b01> in run(hparams, use_existing_data, seed)
      9   if not use_existing_data:
     10     print('Generating data...')
---> 11     generate_tfrecord_datasets(hparams)
     12 
     13   train_dataset = get_dataset(

<ipython-input-5-25c33ad9027c> in generate_tfrecord_datasets(hparams)
     15        TFRecordWriter(os.path.join(hparams.out_dir, _TEST)) as test_out:
     16     all_examples = make_ngs_examples(hparams)
---> 17     for example in all_examples:
     18       r = random.random()
     19       if r < train_eval_test_split[0]:

<ipython-input-5-25c33ad9027c> in make_ngs_examples(hparams)
     44   used_pileup_ranges = set()
     45   with ref_reader, vcf_reader, sam_reader, sam_query_reader:
---> 46     for read in sam_reader:
     47 
     48       # Check that read has cigar string present and allowed alignment.

~/anaconda3/envs/tensorflow/lib/python3.7/site-packages/nucleus/io/clif_postproc.py in __next__(self)
     65   def __next__(self):
     66     try:
---> 67       record, not_done = self._raw_next()
     68     except AttributeError:
     69       if self._cc_iterable is None:

~/anaconda3/envs/tensorflow/lib/python3.7/site-packages/nucleus/io/clif_postproc.py in _raw_next(self)
    124   def _raw_next(self):
    125     record = reads_pb2.Read()
--> 126     not_done = self._cc_iterable.PythonNext(record)
    127     return record, not_done
    128 

RuntimeError: PythonNext() argument read is not valid: Dynamic cast failed

The versions of neccesary packages are,

google-nucleus         0.5.6
tensorflow             2.2.0
tensorflow-estimator   2.2.0
tensorflow-gpu         2.2.0
protobuf               3.14.0

It is not clear what is going wrong here. I am using the most recent versions of all packages.
I am running all these on Ubuntu 18.04.5 LTS, and in a separate virtual environment.
Any help or suggestion is greatly appreciated!
Thanks

libbz2.so.1.0 cannot open shared object file

I ran into an issue after installing Nucleus using the following command (the installation itself reported no errors):
pip install --user google-nucleus

However, when I try to run the following demo code, I get an error:

from nucleus.io import vcf
sf_vcf="test_nist.b37_chr20_100kbp_at_10mb.vcf.gz"
with vcf.VcfReader(sf_vcf) as reader:
    print('Sample names in VCF: ', ' '.join(reader.header.sample_names))

libbz2.so.1.0: cannot open shared object file: No such file or directory

Any suggestion for solving the issue? Thank you!

Nucleus Proto is colliding with Tensorflow protos

Sorry for the repeat, but I'm having the same issue and the suggested workaround is not working for me.

My setup is on ubuntu 20.04, 64-bit.

I have a Miniconda environment (Python 3.8) into which I installed both tensorflow 2.6.0 and google-nucleus 0.6.0 (and a bunch of other stuff; I'm trying to run DeepConsensus). My test-case script is trivial:

import tensorflow as tf
from nucleus.io import tfrecord

with errors being of the form

File "/home/flowers/test/miniconda3/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 983, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "nucleus/protos/feature.proto":
tensorflow.BytesList.value: "tensorflow.BytesList.value" is already defined in file "tensorflow/core/example/feature.proto".

If I reverse the order of the two statements, I get the same error with the names switched:

File "/home/flowers/test/miniconda3/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 983, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "tensorflow/core/example/feature.proto":
tensorflow.BytesList.value: "tensorflow.BytesList.value" is already defined in file "nucleus/protos/feature.proto".

Any suggestions as to how to get this to work?
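For what it's worth, here is a quick way I can confirm the collision is purely about import order (each package ships its own copy of feature.proto under the same fully-qualified message names, and protobuf's default descriptor pool refuses to register a name twice):

```python
import sys

# Both modules register messages named tensorflow.BytesList,
# tensorflow.Feature, etc.; whichever is imported first "wins".
CANDIDATES = (
    'nucleus.protos.feature_pb2',
    'tensorflow.core.example.feature_pb2',
)

def loaded_feature_protos():
    """List which copies of the feature proto module are already imported."""
    return [name for name in CANDIDATES if name in sys.modules]

print(loaded_feature_protos())  # [] in a fresh interpreter, before either import
```

Running this right before the failing import at least shows which copy got there first, which matches the error flipping when I reverse the two import statements.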

RuntimeError: Could not load PyProto API (Python 3)

Hi!

With release 0.4.0, a Python 3-compatible google-nucleus package is supposed to be available through pip.

Installing it with

pip3 install google-nucleus==0.4.0

results in:

Building wheels for collected packages: google-nucleus
  Building wheel for google-nucleus (setup.py) ... error
  ERROR: Failed building wheel for google-nucleus
  Running setup.py clean for google-nucleus
Failed to build google-nucleus
Installing collected packages: google-nucleus
  Running setup.py install for google-nucleus ... done
  WARNING: Could not find .egg-info directory in install record for google-nucleus==0.4.0 from https://files.pythonhosted.org/packages/67/a3/2e0c0d660c5cf2806f5fffd4ba316d9609b251a75029a55449dddfe63818/google_nucleus-0.4.0.tar.gz#sha256=288bac1e1b6dd0f934a09ce2f08f0a27d00040357e8cbf970c7d564ecb733c7d
Successfully installed google-nucleus

Even though an error is encountered during the installation I can still import it, but when running a simple example it throws the following RuntimeError:

/usr/local/lib/python3.6/dist-packages/nucleus/io/clif_postproc.py in _raw_next(self)
    132   def _raw_next(self):
    133     record = variants_pb2.Variant()
--> 134     not_done = self._cc_iterable.PythonNext(record)
    135     return record, not_done

RuntimeError: PythonNext() argument variant is not valid: Could not load PyProto API

Am I missing something?

I made Colab Notebook that shows steps that lead to the error.

Thanks for your help!

PythonNext() argument read is not valid

[Screenshot: 2023-07-03 at 22 15 18]

Hello, I'm working on fine-tuning the DeepVariant model on Arabidopsis thaliana. I developed a fine-tuned model trained on A. thaliana data. While running this customised model on unseen data using Singularity, I ran into an error:

PythonNext() argument read is not valid: Dynamic cast failed.

I have the tf and google-nucleus packages installed with versions,

  1. tensorFlow 2.12.0
  2. google-nucleus 0.6.0
