binarycif's Introduction

Version 0.3.0

BinaryCIF

BinaryCIF is a data format that stores text-based CIF files using a more efficient binary encoding. It enables both lossless and lossy compression of the original CIF file. BinaryCIF is currently used mainly by RCSB PDB and PDBe and is supported by the Mol* and LiteMol viewers.

Some aspects of the BinaryCIF format, namely the use of MessagePack as the container and of the fixed point, run length, delta, and integer packing encodings, were inspired by the MMTF data format.
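To make two of these encodings concrete, here is a minimal Python sketch of delta and run-length decoding. The function names are hypothetical and the layout (value/count pairs for run length, an `origin` added to the first delta) is an assumption based on how these encodings are commonly described, not a copy of any reference implementation.

```python
from typing import List


def run_length_decode(pairs: List[int]) -> List[int]:
    """Expand [value, count, value, count, ...] into the full sequence."""
    out: List[int] = []
    for i in range(0, len(pairs), 2):
        out.extend([pairs[i]] * pairs[i + 1])
    return out


def delta_decode(deltas: List[int], origin: int = 0) -> List[int]:
    """Rebuild the values by accumulating differences, starting from origin."""
    out: List[int] = []
    current = origin
    for d in deltas:
        current += d
        out.append(current)
    return out


# Sequential ids 1..5 become deltas [0, 1, 1, 1, 1] (origin 1),
# which run-length encode to [0, 1, 1, 4]. Decoding reverses the chain.
ids = delta_decode(run_length_decode([0, 1, 1, 4]), origin=1)
```

Stacking the two encodings is what makes long runs of consecutive identifiers compress to a handful of integers.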

Implementations

BinaryCIF is currently available as TypeScript (JavaScript), Java, and Python.

Principles

Use Cases

CoordinateServer

BinaryCIF is supported by the CoordinateServer, a web service for delivering subsets of 3D macromolecular data stored in the mmCIF format.

The server can return data in both the text and binary versions of the CIF format, with the binary representation being considerably more efficient (see the benchmark).

DensityServer

BinaryCIF is supported by the DensityServer, a web service for accessing subsets of volumetric density data. The server automatically downsamples the data depending on the volume of the requested region, which reduces bandwidth requirements and provides near-instant access to even the largest data sets.


Contributing

Just open an issue or make a pull request. All contributions are welcome.

Funding

Funding sources include but are not limited to:

binarycif's People

Contributors

arose, dsehnal, jonstargaryen, josemduarte, speleo3

binarycif's Issues

Default encodings for different columns

Is there some resource (code would also be OK for this purpose) that contains the default encodings that were used to create the bcif files provided by https://models.rcsb.org/?

I would like to implement a writer for bcif, so it would be helpful to know what reasonable encodings for each PDBx field are.

`ByteArray` encoding type 33

I'm working on my own parser, and I have it successfully working with importing example data .bcif from the py-mmcif as well as CellPack .bcif files from molstar.org/dev/me. It seems to be working well on parsing everything for the structures, but when extracting the symmetry operations from the CellPack files, I am coming across a ByteArray type that doesn't make sense.

[[{'kind': 'Delta', 'origin': 1, 'srcType': 3},
  {'kind': 'RunLength', 'srcType': 3, 'srcSize': 767},
  {'kind': 'IntegerPacking', 'byteCount': 1, 'isUnsigned': True, 'srcSize': 4},
  {'kind': 'ByteArray', 'type': 4}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}],
 [{'kind': 'ByteArray', 'type': 33}]]

Is 33 something special that isn't explicitly mentioned in the spec, or have I gotten something wrong earlier in my pipeline?
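For reference while debugging this, below is a minimal sketch of a ByteArray decoder in Python. The type-code table is an assumption based on my reading of the BinaryCIF encoding specification (where, as far as I can tell, 32 and 33 denote Float32 and Float64), and little-endian byte order is likewise assumed. The names `BYTE_ARRAY_TYPES` and `decode_byte_array` are hypothetical.

```python
import struct

# Assumed mapping from ByteArray type code to (struct format char, byte size).
# Check these codes against the specification before relying on them.
BYTE_ARRAY_TYPES = {
    1: ("b", 1),   # Int8
    2: ("h", 2),   # Int16
    3: ("i", 4),   # Int32
    4: ("B", 1),   # Uint8
    5: ("H", 2),   # Uint16
    6: ("I", 4),   # Uint32
    32: ("f", 4),  # Float32
    33: ("d", 8),  # Float64
}


def decode_byte_array(data: bytes, type_code: int) -> list:
    """Interpret raw bytes as a little-endian array of the given type."""
    fmt, size = BYTE_ARRAY_TYPES[type_code]
    count = len(data) // size
    return list(struct.unpack(f"<{count}{fmt}", data[: count * size]))


# Example: two doubles stored under type code 33.
raw = struct.pack("<2d", 1.0, -0.5)
values = decode_byte_array(raw, 33)
```

Under that assumption, columns holding symmetry operation matrices would simply be stored as uncompressed 64-bit floats, which is why no further encoding steps appear in their chains.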

Compression efficiency: some mmCIF files are smaller than their BinaryCIF counterparts when not gzipped

This is a question rather than an issue report.
I have downloaded the BinaryCIF file for structure 5z6y from RCSB and the corresponding mmCIF file.

format      size     gzipped
mmCIF       227 kB   58 kB
BinaryCIF   269 kB   32 kB
MMTF        24 kB    17 kB

What surprises me is that the BinaryCIF file takes more space than the mmCIF file, even if most of the information is contained in the atom_site table which should be amenable to efficient compression.
This seems to contradict the claims of the original BinaryCIF publication.

I am wondering if there is an issue with the current implementation of the format which would use less efficient compression techniques?

Unclarities in specification

While implementing a BinaryCIF file interface, I found some parts of the specification ambiguous:

  • What are the integer values a mask can hold, and how do they map to the CIF values (. and ?)?
  • Is the final offset in a StringArray the exclusive stop or a start index itself?
  • When encoding using IntervalQuantization, are the values assigned to the closest step or to the next lower/higher step?
  • Into which data type does Delta encode?
  • How are the data types mapped to integers?
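To illustrate the StringArray question, here is a minimal Python sketch under one possible reading: the offsets array has one more entry than there are unique strings, so the final offset acts as an exclusive stop. This interpretation is an assumption, and `decode_string_array` is a hypothetical helper rather than part of any library.

```python
from typing import List


def decode_string_array(string_data: str, offsets: List[int], indices: List[int]) -> List[str]:
    """Slice the concatenated string table, then look up each index."""
    # Assumption: string i spans offsets[i] (inclusive) to offsets[i + 1] (exclusive).
    strings = [
        string_data[offsets[i]:offsets[i + 1]]
        for i in range(len(offsets) - 1)
    ]
    return [strings[i] for i in indices]


# Two unique strings "a" and "AB" packed as "aAB".
column = decode_string_array("aAB", [0, 1, 3], [0, 1, 0])
```

If the final offset were instead a start index, the last string would have no defined end, which is why the exclusive-stop reading seems the more natural one.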

IntervalQuantization example seems to be wrong

The decoder for IntervalQuantization does not produce the results suggested in the encoding.md example in this repo:

It suggests that if you take the list [0, 0, 1, 2, 2, 1] and run it through the decoder with values min = 1, max = 2, numSteps = 3 you should get [0.5, 1, 1.5, 2, 3, 1.345]. However in fact it produces [1.0, 1.0, 1.5, 2.0, 2.0, 1.5]. I'm making the issue here as I assume it is the example that is wrong.

The description of IntervalQuantization is also quite hard to follow, and I'm not sure what the encoding actually does, so I cannot judge for myself which result is correct or what the function should be doing.
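The numbers above can be reproduced with a short Python sketch. The decoder maps each integer to min + step * value. The encoder shown here is only one plausible reading (clamp to [min, max], then round to the nearest step), so treat it as an assumption rather than the specified behavior. Both function names are hypothetical.

```python
import math
from typing import List


def encode_interval_quantization(values: List[float], min_value: float,
                                 max_value: float, num_steps: int) -> List[int]:
    """Assumed encoder: clamp to [min, max], then round to the nearest step."""
    step = (max_value - min_value) / (num_steps - 1)
    out = []
    for v in values:
        clamped = min(max(v, min_value), max_value)
        out.append(int(math.floor((clamped - min_value) / step + 0.5)))
    return out


def decode_interval_quantization(data: List[int], min_value: float,
                                 max_value: float, num_steps: int) -> List[float]:
    """Map each quantized integer back to its step value."""
    step = (max_value - min_value) / (num_steps - 1)
    return [min_value + step * v for v in data]


encoded = encode_interval_quantization([0.5, 1, 1.5, 2, 3, 1.345], 1, 2, 3)
decoded = decode_interval_quantization(encoded, 1, 2, 3)
```

Under this reading the encoding is lossy: [0.5, 1, 1.5, 2, 3, 1.345] encodes to [0, 0, 1, 2, 2, 1], and decoding returns [1.0, 1.0, 1.5, 2.0, 2.0, 1.5] rather than the original values.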
