
Comments (4)

lindstro avatar lindstro commented on July 20, 2024

That is indeed intentional. Strides are a property of the organization of the data in memory as the data is compressed and written, with the compressed output being organized differently. Consumers do not necessarily want to maintain those strides when later reading and decompressing the data. Requiring that could, for instance, blow up memory requirements for the consumer if the original data is not stored contiguously. As an example, the original layout could be in array-of-struct form, perhaps with dozens of different fields being written one at a time by a simulation code. If later data analysis is to be done on a single field, you don't want to have to recreate the original layout that wastes storage on all but one field.

If the same data layout is desired during decompression, then that can be accomplished by setting strides during decompression, though the strides would have to be maintained separately.
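The bookkeeping this implies can be sketched in NumPy (illustrative only: a plain copy stands in for the actual zfp compress/decompress round trip, and the stride dictionary is a hypothetical way the consumer might store the producer's element strides externally):

```python
import numpy as np

ny, nx, nc = 3, 4, 2
# Original producer layout: interleaved array-of-struct, field[ny][nx][nc].
interleaved = np.arange(ny * nx * nc, dtype=np.float64).reshape(ny, nx, nc)

# Element strides the producer would pass to zfp for component c = 0;
# the consumer must keep these externally, since zfp's header omits them.
strides = {"sy": nx * nc, "sx": nc, "offset": 0}

# zfp reconstructs the logical ny-by-nx field; a copy stands in for
# the compress + decompress round trip here.
component = interleaved[:, :, 0].copy()

# To recreate the original layout, scatter the contiguous result back
# into an interleaved buffer using the externally saved strides.
restored = np.zeros(ny * nx * nc)
flat_idx = (strides["offset"]
            + strides["sy"] * np.arange(ny)[:, None]
            + strides["sx"] * np.arange(nx)[None, :])
restored[flat_idx] = component

assert np.array_equal(restored.reshape(ny, nx, nc)[:, :, 0],
                      interleaved[:, :, 0])
```

Note that the interleaved buffer wastes space on the untouched component, which is exactly the memory blow-up mentioned above when only one field is of interest.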

from zfp.

S-o-T avatar S-o-T commented on July 20, 2024

Thanks for the elaborate answer. The scenario I am interested in is compressing a 2D vector field as two strided 2D scalar fields. I believe that in such a scenario I am bound either to manually (de)interleave the vector components into separate planes before compression and after decompression, or to rely on zfp's internal accounting for strides. As you suggested, storing the strides externally and providing them during decompression seems to be a solution, although I would argue that deriving all the metadata required to set up decompression directly from a header would be more convenient.
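The two alternatives can be illustrated with NumPy (a conceptual sketch, not zfp code; note that NumPy reports strides in bytes, whereas zfp's zfp_field_set_stride_2d takes strides in elements):

```python
import numpy as np

ny, nx = 2, 3
# 2D vector field stored interleaved: field[ny][nx][2] (array-of-struct).
field = np.arange(ny * nx * 2, dtype=np.float32).reshape(ny, nx, 2)

# Option 1: manually deinterleave into two contiguous planes before
# compressing each component as an ordinary 2D scalar field.
u_plane = np.ascontiguousarray(field[:, :, 0])
v_plane = np.ascontiguousarray(field[:, :, 1])

# Option 2: hand the compressor strided views directly (no copy).
u_view = field[:, :, 0]
assert not u_view.flags["C_CONTIGUOUS"]       # it really is strided
assert u_view.strides == (nx * 2 * 4, 2 * 4)  # byte strides: row, column

# Both options describe the same logical scalar field.
assert np.array_equal(u_plane, u_view)
```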

A bit tangential: what is the rationale for not directly storing all fields of zfp_field, except the data pointer, in the header? It seems that the cost of such a header is negligible relative to the compressed stream itself.


lindstro avatar lindstro commented on July 20, 2024

As mentioned above, one rationale is that the consumer may not want to organize the data the same way the producer does. In fact, I cannot think of a case where the consumer, which processes the data, does not know how it wants the data to be organized. Can you think of a scenario where it would be beneficial to have the producer dictate the data layout for the consumer? In the case of a code processing 2D vector fields, the consumer needs to know if the data layout is float field[ny][nx][2] or float field[2][ny][nx] (or some other permutation) so it can index the multidimensional field properly. If the code is written with one of these conventions, it will fail if the data producer mandates the other convention. While one can write such a code using strides (e.g. field[stride_x * x + stride_y * y + stride_c * c] to access vector component c at (x, y)), oftentimes you want to use some container class like a NumPy array whose strides are given by the container, not the data producer.
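The two layout conventions and the generic strided indexing can be sketched in NumPy (illustrative names; the helper function mimics the field[stride_x * x + stride_y * y + stride_c * c] access pattern, not any zfp API):

```python
import numpy as np

ny, nx = 2, 3
soa = np.arange(2 * ny * nx).reshape(2, ny, nx)  # float field[2][ny][nx]
aos = np.moveaxis(soa, 0, -1).copy()             # float field[ny][nx][2]

# Generic strided access works for either convention, but only if the
# strides match the layout the producer actually used.
def at(flat, sx, sy, sc, x, y, c):
    return flat[sx * x + sy * y + sc * c]

# struct-of-arrays strides (elements): sx = 1, sy = nx,     sc = ny * nx
# array-of-structs strides (elements): sx = 2, sy = nx * 2, sc = 1
x, y, c = 1, 1, 1
assert at(soa.ravel(), 1, nx, ny * nx, x, y, c) == soa[c, y, x]
assert at(aos.ravel(), 2, nx * 2, 1, x, y, c) == aos[y, x, c]
assert soa[c, y, x] == aos[y, x, c]  # same logical value, different layout
```

A consumer written against one set of strides (or against a container whose strides are fixed, like a NumPy array) simply cannot honor strides dictated by the producer without copying.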

Another rationale is that we have gone to great lengths to make the storage of metadata and compression parameters as compact as possible; in most cases, we encode array dimensions, scalar type, and compression mode and parameters in only 64 bits. This compact encoding is motivated by zfp's unique approach to representing large arrays as a collection of very small blocks (consisting of 4^d values in d dimensions) that can be (de)compressed independently. We early on anticipated the potential to vary compression parameters spatially, perhaps even from one block to the next, and in that case the overhead of storing compression parameters becomes large. Similarly, in certain applications (like AMR), one may form a larger grid as a collection of smaller ones, with each subgrid composed of a small collection of zfp blocks. In this case, it is again important to keep array metadata per subgrid small. One may even vary precision spatially (e.g., float vs. double), where again you need an efficient way of encoding scalar type. Whereas individual array dimensions are often small (say, 16 bits or less), strides are not only signed but may span the product of all dimensions or even more (when multiple fields are interleaved), making them far costlier to encode. In practice, you often need more than 32 bits per stride, or more than 96 bits for the 2D vector field example above.

Now, I can envision a case where the consumer (perhaps an I/O module) is tasked only with reconstructing the original data bit for bit. Using the current zfp API, it would be possible to add a new ZFP_HEADER tag for strides to also store this information. The consumer could then override the strides set in zfp_read_header() before calling zfp_decompress(). The main challenge would be to do this in a backwards compatible manner as one would presumably have to redefine ZFP_HEADER_FULL to also include strides, and that would break existing code. But it may be reasonable to consider such a feature for future versions of the zfp codec. There are other changes to the compressed format we would want to incorporate, but a change to the codec will not happen anytime soon.


S-o-T avatar S-o-T commented on July 20, 2024

We early on anticipated the potential to vary compression parameters spatially, perhaps even from one block to the next, and in that case the overhead of storing compression parameters becomes large.

The need to design for such use-case answers my question, thanks.

Can you think of a scenario where it would be beneficial to have the producer dictate the data layout for the consumer?
Now, I can envision a case where the consumer (perhaps an I/O module) is tasked only with reconstructing the original data bit for bit.

This is pretty much the case for my usage scenario.

