observingclouds / xbitinfo

Python wrapper of BitInformation.jl to easily compress xarray datasets based on their information content

Home Page: https://xbitinfo.readthedocs.io

License: MIT License

Python 98.47% Julia 1.53%

compression pangeo xarray python gsoc-2024

xbitinfo's Introduction

xbitinfo: Retrieve information content and compress accordingly

This is an xarray wrapper around BitInformation.jl to retrieve the bitwise information content and apply bitrounding from within Python. The package aims to provide an easy pipeline to compress (climate) datasets based on their real information content.

How the science works

Paper

Klöwer, M., Razinger, M., Dominguez, J. J., Düben, P. D., & Palmer, T. N. (2021). Compressing atmospheric data into its real information content. Nature Computational Science, 1(11), 713–724. doi: 10/gnm4jj

Video

Julia Repository

BitInformation.jl

How to install

Preferred installation

conda install -c conda-forge xbitinfo

Alternative installation

pip install xbitinfo # make sure Julia is installed manually

How to use

import xarray as xr
import xbitinfo as xb
example_dataset = 'eraint_uvz'
ds = xr.tutorial.load_dataset(example_dataset)
bitinfo = xb.get_bitinformation(ds, dim="longitude")  # calling bitinformation.jl.bitinformation
keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)  # get number of mantissa bits to keep for 99% real information
ds_bitrounded = xb.xr_bitround(ds, keepbits)  # bitrounding keeping only keepbits mantissa bits
outpath = "ds_bitrounded_compressed.nc"  # example output path (placeholder)
ds_bitrounded.to_compressed_netcdf(outpath)  # save to netcdf with compression

Credits


xbitinfo's Issues

Low number of `keepbits` using `get_keepbits` in `quick-start.ipynb`

xbitinfo version: 0.0.3
Python version: 3.10.4
Operating System: Linux (Pop!_OS 22.04 LTS)

Description

I am trying to verify if the library identifies a reasonable number of bits to keep. I am working with the quick_start.ipynb notebook.

I plotted ds and ds_rounded side-by-side for variable u (I fixed month and level to their first value). I also checked variable v and get similar issues. I expected that by looking at the plots, I would not be able to tell which dataset had been rounded. Instead, I can clearly see which dataset has been rounded.

What I Did

Plotting both original and bitrounded u in the notebook quick_start.ipynb (as taken on current main branch)

import matplotlib.pyplot as plt
var = "u"
fig, (ax1,ax2) = plt.subplots(1,2,dpi=200)
ax1.set_title("uncompressed")
ax2.set_title(f"99% information (keepbits = {keepbits[var][0]})")
ax1.imshow(ds[var].isel(month=0,level=0))
ax2.imshow(ds_bitrounded[var].isel(month=0,level=0))

More info

I'm pretty sure this is related to the functions get_keepbits and _cdf_from_info_per_bit. Before keeping only 99% of the bit information, there is a preliminary cleaning step that keeps only the bits satisfying bit info > 1.5*max(bit info of the last 4 bits). This process is shown in the following figure:

The comment explaining the cleaning step in the _cdf_from_info_per_bit function is # set below rounding error from last digit to zero. Does this mean that the large bit information values for the last bits are due to this rounding error? Is the idea behind the procedure that one should not trust bit information of comparable size to that of these last bits?
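For reference, here is how I understand the current procedure, as a minimal pure-Python sketch. It assumes float32 (1 sign bit, 8 exponent bits, 23 mantissa bits) and that keepbits is the mantissa position at which the cumulative information first exceeds inflevel. The function name and the example values are made up for illustration and are not taken from the xbitinfo source.

```python
from itertools import accumulate

N_NON_MANTISSA = 9  # float32: 1 sign bit + 8 exponent bits (assumption)


def keepbits_from_info(info_per_bit, inflevel=0.99):
    """Hypothetical re-implementation of the cleaning and CDF steps."""
    # cleaning step: discard bits whose information does not exceed
    # 1.5 * max(information of the last 4 bits)
    threshold = 1.5 * max(info_per_bit[-4:])
    cleaned = [v if v > threshold else 0.0 for v in info_per_bit]

    # cumulative distribution of the remaining information
    # (assumes at least one bit survives the cleaning, so total > 0)
    total = sum(cleaned)
    cdf = [c / total for c in accumulate(cleaned)]

    # first bit at which the requested information level is exceeded,
    # expressed as a number of mantissa bits
    first = next(i for i, c in enumerate(cdf) if c > inflevel)
    return first + 1 - N_NON_MANTISSA


# made-up information content: sign bit, 8 exponent bits, 23 mantissa bits
info = (
    [0.0]
    + [0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.8, 0.9]
    + [0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05]
    + [0.001] * 16
)
keepbits_example = keepbits_from_info(info)
```

With this reading, any mantissa bit whose information is at or below the threshold contributes nothing to the cumulative sum, which is why large values in the last bits can push keepbits down.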

I could agree with that, but then, how would we interpret the first comparison figure I show? The original plot seems more physical than the rounded one. If we were to trust get_keepbits, should we conclude we should not trust the data too much (beyond the level of precision of keepbits)?

Less strict condition

Say that we can still trust mantissa bits with small information content even though the last few bits have suspiciously large information. Could the criteria to trust or not bit information be given with respect to bit information variation rather than its amplitude? We expect that bit information should generally decrease as we look further down the chain (higher mantissa bit number), correct? With this in mind, we could simply identify the bit with minimum information (or the bit with 4th minimal information, to mimic your condition) and drop higher mantissa bits (because the information on the right of this bit will increase, which is the behaviour we consider as problematic). Something more refined, e.g. cutting mantissa bits once a sequence of 3 increasing bit info is observed, could also be considered.
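A rough sketch of the two variants suggested above, in plain Python. The function names and the example values are hypothetical; the input is the list of information contents of the mantissa bits only, ordered from the first to the last mantissa bit.

```python
def keep_until_minimum(mantissa_info):
    """Number of mantissa bits kept when cutting right after the minimum-information bit."""
    return mantissa_info.index(min(mantissa_info)) + 1


def keep_until_increasing_run(mantissa_info, run=3):
    """Number of mantissa bits kept when cutting before `run` consecutive increases."""
    increases = 0
    for i in range(1, len(mantissa_info)):
        # count consecutive increases, reset on any non-increase
        increases = increases + 1 if mantissa_info[i] > mantissa_info[i - 1] else 0
        if increases == run:
            # keep everything before the first bit of the increasing run
            return i - run + 1
    return len(mantissa_info)


# made-up example: information decreases, then rises again in the last bits
example = [0.9, 0.5, 0.2, 0.05, 0.01, 0.02, 0.04, 0.08, 0.1]
kept_min = keep_until_minimum(example)
kept_run = keep_until_increasing_run(example, run=3)
```

Both variants only look at how the information changes from bit to bit, so they do not depend on the absolute size of the information in the last bits.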

`get_bitinformation(label="test")` when used second time

Calling get_bitinformation a second time with the same label raises an error while loading the cached JSON:

bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")
bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")
      2 bitinfo

File ~/bitinformation_pipeline/bitinformation_pipeline/bitinformation_pipeline.py:80, in get_bitinformation(ds, label, overwrite, **kwargs)
     35 """Wrap BitInformation.bitinformation().
     36 
     37 Inputs
   (...)
     77 
     78 """
     79 if label is not None and overwrite is False:
---> 80     info_per_bit = load_bitinformation(label)
     81     if info_per_bit is None:
     82         overwrite = True

File ~/bitinformation_pipeline/bitinformation_pipeline/bitinformation_pipeline.py:117, in load_bitinformation(label)
    115 if os.path.exists(label_file):
    116     with open(label_file) as f:
--> 117         info_per_bit = json.open(f)
    118     print(info_per_bit)
    119     return info_per_bit

AttributeError: module 'json' has no attribute 'open'
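The traceback points at json.open, which does not exist; the standard library function for reading JSON from a file object is json.load. A minimal sketch of the loader with that fix, assuming the cache file is named label + ".json" (the file naming is an assumption, not taken from the source):

```python
import json
import os
import tempfile


def load_bitinformation(label):
    """Load cached bit information from '<label>.json', or return None if absent."""
    label_file = label + ".json"  # file naming is an assumption
    if os.path.exists(label_file):
        with open(label_file) as f:
            return json.load(f)  # json.load, not json.open
    return None


# round trip with a temporary file to show both code paths
with tempfile.TemporaryDirectory() as tmp:
    label = os.path.join(tmp, ".test")
    with open(label + ".json", "w") as f:
        json.dump({"u": [0.0, 0.5, 0.1]}, f)
    cached = load_bitinformation(label)
    missing = load_bitinformation(os.path.join(tmp, "other"))
```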

hosting docs on `readthedocs`

We can use https://github.com/pangeo-data/climpred/blob/main/docs/source/conf.py as a template.

`info_per_bit` as `xr.Dataset`

As info_per_bit is saved as a dict, which is the underlying structure of an xr.Dataset, we could easily convert this dict to an xr.Dataset to gain its functionality.

So I propose to still serialize as JSON but return an xr.Dataset to the user.

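import json

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# label is the label previously passed to get_bitinformation
with open(label + ".json") as f:
    info_per_bit = json.load(f)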

dsb = xr.Dataset()
for v in info_per_bit.keys():
    dsb[v] = xr.DataArray(info_per_bit[v], dims=["bit"], coords={"bit":np.arange(len(info_per_bit[v]))}, name=v)

# give bits a name, here for float32
dsb = dsb.assign_coords(bit_name=("bit",
                  ["±"] + 
                  [f"e{int(i)}" for i in dsb.sel(bit=slice(1,8)).bit] + 
                  [f"m{int(i-8)}" for i in dsb.sel(bit=slice(9,None)).bit]
                 ))

# TODO: add the last-four-bits threshold

# normalize
dsbn=(dsb.cumsum("bit")/dsb.cumsum("bit").isel(bit=-1,drop=True)).assign_coords(dsb.coords)

threshold = [0.99, 0.99999]
keepbits = (dsbn > xr.DataArray(threshold, dims="threshold")).argmax("bit")

# replace 0 with 31
keepbits = keepbits.where(keepbits!=0, other=dsbn.isel(bit=-1).bit.values)

###
# plotting
keepbits.to_array().plot(hue='threshold',y="variable")

import cmcrameri.cm as cmc
fig,ax=plt.subplots(1,1,figsize=(7,10))
plt.pcolormesh(dsb.to_array(),vmin=0,vmax=1, cmap=cmc.turku_r)
plt.gca().set_xticks(np.arange(.5,.5+dsb.bit_name.size))
plt.gca().set_xticklabels(dsb.bit_name.values)
plt.colorbar()
plt.gca().set_yticks(np.arange(.5,.5+len(dsb.data_vars)))
plt.gca().set_yticklabels(list(dsb.data_vars))
plt.step(keepbits.to_array()+1, np.arange(len(keepbits)))

make Julia optional dependency

Use the Python implementation if Julia isn't installed, but throw a warning
Use xr_bitround instead of jl_bitround
Helps #95 and Julia installation issues
Continues #126
In the pangeo showcase, @rsignell-usgs also copied some code to avoid importing xbitinfo because of Julia
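One way the fallback could look, as a sketch only. It checks whether the pyjulia package (importable as julia) can be found and otherwise warns and selects the Python implementation. The function name resolve_implementation is hypothetical.

```python
import importlib.util
import warnings


def resolve_implementation(requested="julia"):
    """Return the requested implementation, falling back to 'python' without pyjulia."""
    if requested == "julia" and importlib.util.find_spec("julia") is None:
        warnings.warn(
            "Julia bindings not found; falling back to the Python implementation.",
            stacklevel=2,
        )
        return "python"
    return requested


implementation = resolve_implementation("julia")
```

Doing the check lazily, at call time rather than at import time, would also keep import xbitinfo working on machines without Julia.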

Retrieval of maximum keepbits

The bitwise information content can depend on the dimension bitinformation is applied to. Changes along dimensions like lon and time are rather small compared to those along the lat dimension. This might depend on the dataset though and the user might want to run the analysis across each dimension individually to identify the dimension with most keepbits and use those.

I propose to allow get_bitinformation to handle a list of dims or a string like maxdim, and to return the keepbits of the dimension with the largest information content.
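The selection step itself is simple once keepbits are available per dimension. A sketch with plain dicts, where the structure {dim: {variable: keepbits}} and the numbers are made up for illustration:

```python
def max_keepbits(keepbits_per_dim):
    """For each variable, return (keepbits, dim) of the dimension with the most keepbits."""
    best = {}
    for dim, per_var in keepbits_per_dim.items():
        for var, bits in per_var.items():
            if var not in best or bits > best[var][0]:
                best[var] = (bits, dim)
    return best


# made-up keepbits from separate analyses along each dimension
keepbits_per_dim = {
    "longitude": {"u": 5, "v": 6},
    "latitude": {"u": 7, "v": 4},
}
result = max_keepbits(keepbits_per_dim)
```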

ICON examples

small data https://psyplot.github.io/examples/maps/example_ugrid.html for testing
Big data example for gist: r2b8 might be too large, try smaller

Beautify and populate readme

test `xr_bitround` on different input types

Yes, feel free to merge! Could you create an issue though that we make sure to test we support not only float32.

Originally posted by @observingClouds in #21 (comment)

Duplicate memory allocations

Currently, the input data to xbitinfo.get_bitinformation is duplicated when calling

xbitinfo/xbitinfo/xbitinfo.py

Lines 185 to 186 in c91a21f

    X = ds[var].values
    Main.X = X

Main.X = X is a deep copy operation. This poses in general an issue for large datasets, because a single copy of the dataset can already be too much to load into memory. Further, I observed that the memory is not freed when calling xbitinfo.get_bitinformation again.

This leads to several issues/tasks:

allocate memory only once, or even better
allocate memory lazily
free memory after Julia call

register on pypi

to allow pip install bitinformation_pipeline

`get_bitinformation`: want to skip a variable if `dim` missing or return `None` for its information content along missing `dim`

LGTM! I think it addresses only the first part of @rsignell-usgs comment of

I'd like to specify the "y" dimensions in a list, e.g. dim = ['eta_rho', 'eta_u', 'eta_v'] and then use whichever one is found for a specific variable.

though.
If a variable in a dataset with dims 'eta_rho', 'eta_u', 'eta_v' only depends on 'eta_v', the code might currently fail when e.g. dim='eta_rho' is provided. We could discuss whether we just want to skip that variable in this case or return None for its information content along dim=eta_rho.
Nevertheless, this PR is self-contained, so no objections to merging it directly.

Originally posted by @observingClouds in #106 (review)
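A sketch of the per-variable lookup discussed here, using plain tuples of dimension names instead of a real dataset. The helper name pick_dim and the variable and dimension names in the example are hypothetical.

```python
def pick_dim(var_dims, candidates):
    """Return the first candidate dimension the variable depends on, else None."""
    return next((d for d in candidates if d in var_dims), None)


# made-up variables with their dimensions
dataset_dims = {
    "zeta": ("ocean_time", "eta_rho", "xi_rho"),
    "u": ("ocean_time", "eta_u", "xi_u"),
    "v": ("ocean_time", "eta_v", "xi_v"),
    "time_bounds": ("ocean_time",),
}
candidates = ["eta_rho", "eta_u", "eta_v"]

# option 1: report None for variables without any candidate dimension
chosen = {var: pick_dim(dims, candidates) for var, dims in dataset_dims.items()}
# option 2: skip those variables entirely
analysed = {var: dim for var, dim in chosen.items() if dim is not None}
```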

`xb.get_bitinformation(implementation="julia/python")` yields different results

xbitinfo version: main
Python version: binder
Operating System: binder

Description

@observingClouds shouldn't these yield the same results?
#126 lacks a test in https://github.com/observingClouds/xbitinfo/blob/main/tests/test_get_bitinformation.py that compares the numerical values of both implementations and asserts that they are equal.
This makes https://xbitinfo--153.org.readthedocs.build/en/153/quick-start.html fail.

What I Did

ds = xr.tutorial.load_dataset("eraint_uvz")
xb.get_bitinformation(ds, dim="longitude", implementation="python")
xb.get_bitinformation(ds, dim="longitude", implementation="julia")

Memory potentially not deallocated correctly when calling get_bitinformation

Description

Running get_bitinformation on a "large" dataset with several variables on a Dask LocalCluster results in high memory usage that accumulates during the loop over the individual variables.

Warnings like the following occur until the workers run out of memory:

distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be
released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged 
memory: 11.13 GiB -- Worker memory limit: 15.72 GiB

Prep slides for ocean group meeting talk

slides: https://docs.google.com/presentation/d/1VeFqeCTqsFc68dN4KuKsqMXVpieFpQblu6kLioijf_U/edit?usp=sharing

bitround with xr.apply_ufunc

This would be better than the current implementation and would allow dask parallelisation. Alternatively, try map_blocks.

Evaluating bitrounded results with metrics?

To ease the decision on specific keepbits for a simulation, it would be great to include some metrics into this package that can quantify the differences between bit-rounded and original values. The simplest being:

mean
standard deviation

These could also be enhanced by plotting routines that loop over possible keepbits.
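A minimal sketch of such metrics with the standard library only. The function name, the extra max_abs_error metric, and the example values are my own additions for illustration; a real implementation would operate on xarray objects.

```python
from statistics import fmean, pstdev


def rounding_metrics(original, rounded):
    """Compare original and bit-rounded values with a few simple statistics."""
    return {
        "mean_difference": fmean(rounded) - fmean(original),
        "std_difference": pstdev(rounded) - pstdev(original),
        "max_abs_error": max(abs(r - o) for o, r in zip(original, rounded)),
    }


# made-up values: rounded to the nearest multiple of 0.25
original = [1.1, 1.2, 1.3, 1.4]
rounded = [1.0, 1.25, 1.25, 1.5]
metrics = rounding_metrics(original, rounded)
```

Looping this over a range of keepbits would give the table that a plotting routine could then visualise.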

Add BitInformation.jl version info to output

As a wrapper, we heavily rely on the underlying library. In this case the results depend on BitInformation.jl.
I propose to add the version of this package to the global attributes of the output to keep track of the underlying library.

sphinx readthedocs notebook won't execute

Description

For now, #94 sets nbsphinx_execute = "never" (instead of "auto" or "always") and jupyter_execute_notebooks = "never", so the notebooks are not executed. This should be fixed in the future.
See https://nbsphinx.readthedocs.io/en/pydata-theme/pre-executed.html
This came up in #84.

What happens

---------------------------------------------------------------------------
JuliaError                                Traceback (most recent call last)
Input In [1], in <cell line: 1>()
----> 1 import xbitinfo as xb
      3 import xarray as xr
      5 xr.set_options(display_style="text")

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/__init__.py:4, in <module>
      1 """Top-level package for xbitinfo."""
      3 from ._version import __version__
----> 4 from .bitround import jl_bitround, xr_bitround
      5 from .graphics import plot_bitinformation, plot_distribution
      6 from .save_compressed import get_compress_encoding_nc, get_compress_encoding_zarr

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/bitround.py:4, in <module>
      1 import xarray as xr
      2 from numcodecs.bitround import BitRound
----> 4 from .xbitinfo import _jl_bitround, get_keepbits
      7 def _np_bitround(data, keepbits):
      8     """Bitround for Arrays."""

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/xbitinfo.py:19, in <module>
     15 path_to_julia_functions = os.path.join(
     16     os.path.dirname(__file__), "bitinformation_wrapper.jl"
     17 )
     18 Main.path = path_to_julia_functions
---> 19 jl.using("BitInformation")
     20 jl.using("Pkg")
     21 jl.eval("include(Main.path)")

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:644, in Julia.using(self, module)
    642 def using(self, module):
    643     """Load module in Julia by calling the `using module` command"""
--> 644     self.eval("using %s" % module)

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:621, in Julia.eval(self, src)
    619 if src is None:
    620     return None
--> 621 ans = self._call(src)
    622 if not ans:
    623     return None

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:549, in Julia._call(self, src)
    547 # logger.debug("_call(%s)", src)
    548 ans = self.api.jl_eval_string(src.encode('utf-8'))
--> 549 self.check_exception(src)
    551 return ans

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:603, in Julia.check_exception(self, src)
    601 else:
    602     exception = sprint(showerror, self._as_pyobj(res))
--> 603 raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
    604                  .format(exception, src))

JuliaError: Exception 'LoadError: InterruptException:
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/multivariate/mvnormal.jl:415
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/multivariates.jl:112
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/Distributions.jl:1
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/BitInformation/VpaaY/src/BitInformation.jl:1' occurred while calling julia code:
using BitInformation

on RTD: https://readthedocs.org/projects/xbitinfo/builds/16814564/

but works locally!

I guess the Julia installer doesn't install into the right kernel, or I chose the wrong kernel.
I also cannot get nbsphinx_kernel_name = "bitinfo-docs" working in conf.py.

Port plotting functions to `python`

Overview:

Fig. 2 #19
Fig. 3 seems to work nice with an xr_bitround wrapper
Fig. SI 1: #59
another one needed?

The plotting function gives a great overview about the information content present in each bit of a variable.
The port of the function to Python should be straightforward and should be preferred over a further call to Julia.

Provide size of dataarray to get_keepbits

This is a note that get_keepbits needs to get the actual size of the unmasked data points. Currently this is fixed to an arbitrary number.

fix `xr_bitround(map_blocks=True)`

see

#47
https://github.com/zarr-developers/numcodecs/pull/299/files#r848419349
also could do dask="allowed"

wrong issue, see pydata/xarray#6482 (reply in thread)

Make maskinfo optional

For large datasets get_keepbits is tremendously slowed down by the calculation of the mask for each individual variable

 "maskinfo": int(ds[var].notnull().sum())

We could make the mask an argument or get it somehow from get_bitinformation. But I would argue we follow Milan's advice #17 (comment) and remove this part from the code.

keep attrs in get_bitinformation

now we delete incoming attrs
could be useful to keep
change existing units to input_units and long_name to input_long or change units to bitinfo_units and bitinfo_long_name

Documentation of zarr compression

Explain the different approaches to combining BitRounding and Zarr, and explain why we have not chosen the alternative approach below:

import numcodecs
encoding = {
    "var1": {
        "filters": [numcodecs.BitRound(keepbits=14)],
        "compressor": numcodecs.Blosc("zstd",shuffle=numcodecs.Blosc.BITSHUFFLE)
    }
}

ds.to_zarr("out.zarr", encoding=encoding)

But as you mentioned, the issue with this solution is that "filters" are saved in the zarr files (.zattrs and .zmetadata) which makes it necessary for users to have this filter installed on their machine when opening the dataset. However, this particular filter can be seen as a one way filter and does not need a decoder for reading.

Applying bitrounding directly to the Dataset circumvents the writing of the filters information, but still keeps the information about the number of keepbits in the _QuantizeBitRoundNumberOfSignificantDigits attribute.

Originally posted by @observingClouds in #71 (comment)

Naming conventions

Let’s try to keep our names as close to bitinformation.jl as possible.

Rename

https://github.com/observingClouds/bitinformation_pipeline/blob/master/bitinformation_pipeline/get_n_plot_bitinformation.jl to bitinformation.jl
information_content or inflevels #9
please add here

misleading progressbar in `xb.get_bitinformation(dim=None)`

xbitinfo version: 0.0.1

Description

I would hope that the default dim=None works out of the box. Maybe that's too ambitious. This example fails for the variable z (why is z second? It is first in ds.data_vars) along the dimension month (which has only two elements).

I observed that two-element dimensions often do not work (although it seems to have worked for v). Maybe we could raise more informative error messages in xbitinfo?

What I Did

ds = xr.tutorial.load_dataset("eraint_uvz")
info_per_bit = xb.get_bitinformation(ds, dim=None)
Processing v: 100%
3/3 [00:05<00:00, 1.33s/it]
Processing v: 100%
3/3 [00:01<00:00, 2.29it/s]
Processing v: 100%
3/3 [00:01<00:00, 2.67it/s]
Processing z: 0%
0/3 [00:01<?, ?it/s]
---------------------------------------------------------------------------
JuliaError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 info_per_bit = xb.get_bitinformation(ds)
      3 info_per_bit

File ~/xbitinfo/xbitinfo.py:153, in get_bitinformation(ds, dim, axis, label, overwrite, **kwargs)
     87 """Wrap `BitInformation.jl.bitinformation() <https://github.com/milankl/BitInformation.jl/blob/main/src/mutual_information.jl>`__.
     88 
     89 Parameters
   (...)
    149     BitInformation.jl_version:  ...
    150 """
    151 if dim is None and axis is None:
    152     # gather bitinformation on all axis
--> 153     return _get_bitinformation_along_all_dims(
    154         ds, label=label, overwrite=overwrite, **kwargs
    155     )
    156 else:
    157     # gather bitinformation along one axis
    158     if overwrite is False and label is not None:

File ~/xbitinfo/xbitinfo.py:229, in _get_bitinformation_along_all_dims(ds, label, overwrite, **kwargs)
    227     if label is not None:
    228         label = "_".join([label, d])
--> 229     info_per_bit_per_dim[d] = get_bitinformation(
    230         ds, dim=d, axis=None, label=label, overwrite=overwrite, **kwargs
    231     ).expand_dims("dim", axis=0)
    232 info_per_bit = xr.merge(info_per_bit_per_dim.values()).squeeze()
    233 return info_per_bit

File ~/xbitinfo/xbitinfo.py:205, in get_bitinformation(ds, dim, axis, label, overwrite, **kwargs)
    203 logging.debug(f"get_bitinformation(X, dim={dim}, {kwargs_str})")
    204 info_per_bit[var] = {}
--> 205 info_per_bit[var]["bitinfo"] = jl.eval(
    206     f"get_bitinformation(X, dim={axis_jl}, {kwargs_str})"
    207 )
    208 info_per_bit[var]["dim"] = dim
    209 info_per_bit[var]["axis"] = axis_jl - 1

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:621, in Julia.eval(self, src)
    619 if src is None:
    620     return None
--> 621 ans = self._call(src)
    622 if not ans:
    623     return None

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:549, in Julia._call(self, src)
    547 # logger.debug("_call(%s)", src)
    548 ans = self.api.jl_eval_string(src.encode('utf-8'))
--> 549 self.check_exception(src)
    551 return ans

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:603, in Julia.check_exception(self, src)
    601 else:
    602     exception = sprint(showerror, self._as_pyobj(res))
--> 603 raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
    604                  .format(exception, src))

JuliaError: Exception 'AssertionError: Mask has 347040 unmasked values, 0 entries are adjacent.' occurred while calling julia code:
get_bitinformation(X, dim=1, masked_value=convert(Float32,NaN), set_zero_insignificant=true)

Long-term perspective

get Milan on board
See how well this package is adopted
Transfer to pangeo-data or xarray-contrib

`get_bitinformation(dim/axis)` in julia or python notation?

bp.get_bitinformation has a dim keyword.

Currently it accepts a str, which is interpreted as a dimension name, and it also accepts an int, which is passed on to bitinformation.bitinformation in Julia. However, dimensions start at 1 in Julia, whereas axes start at 0 in Python.

I suggest adding 1 internally if dim is an int, to mimic Python behaviour. Alternatively, we could additionally implement axis, which would be a more correct name for what I propose.
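A sketch of the proposed translation, keeping dim for names and axis for 0-based integers. The helper name to_julia_dim is hypothetical, and negative axes are deliberately not supported here.

```python
def to_julia_dim(dims, dim=None, axis=None):
    """Translate a dimension name or a 0-based Python axis into a 1-based Julia dim."""
    if (dim is None) == (axis is None):
        raise ValueError("Provide exactly one of `dim` (name) or `axis` (0-based int).")
    if dim is not None:
        axis = list(dims).index(dim)  # raises ValueError for unknown names
    if not 0 <= axis < len(dims):
        raise ValueError(f"axis {axis} is out of bounds for {len(dims)} dimensions")
    return axis + 1  # Julia counts dimensions from 1


dims = ("month", "level", "latitude", "longitude")
julia_dim_from_name = to_julia_dim(dims, dim="longitude")
julia_dim_from_axis = to_julia_dim(dims, axis=0)
```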

bitrounding in python?

The bitrounding method calls Julia, but I just wanted to point out these routines in Python.

See appendix A4 here: https://gmd.copernicus.org/articles/14/377/2021/
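To illustrate what such a routine does, here is a small standard-library sketch for a single float32 value. It uses the integer add-and-mask approach (round to nearest, ties to even) that, as far as I recall, numcodecs' BitRound filter also uses; it is not a transcription of the appendix, and NaN or infinity are not treated specially.

```python
import struct

MANTISSA_BITS = 23  # float32


def bitround_float32(value, keepbits):
    """Round a float32 value to `keepbits` mantissa bits (nearest, ties to even)."""
    drop = MANTISSA_BITS - keepbits
    if drop <= 0:
        return value  # nothing to round away
    # reinterpret the float32 bit pattern as an unsigned 32-bit integer
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF >> drop) << drop
    half_quantum_minus_one = (1 << (drop - 1)) - 1
    # adding (half quantum - 1) plus the last kept bit implements ties-to-even
    bits += half_quantum_minus_one + ((bits >> drop) & 1)
    bits &= mask  # zero the dropped mantissa bits (also trims any overflow)
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits))
    return rounded


# with 2 mantissa bits, values in [1, 2) are rounded to multiples of 0.25
examples = [bitround_float32(x, 2) for x in (1.1, 1.2, -1.2)]
```

A vectorised version would apply the same three integer operations to a uint32 view of the array.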

Bitround in python or julia

is xr_bitround the bitround from xarray as discussed here? For performance I'd suggest to use round from BitInformation. I've never actually compared the speed, but recalculation of constants is avoided reaching 16GB/s at which point it's memory-bound (and therefore likely an upper bound). Also gives us more flexibility in case we want to change something on the rounding front.

Originally posted by @milankl in #21 (comment)

Orchestrate for multiple files

Run get_bitinformation and get_keepbits once and apply bitround to multiple files.

How to run in a script in parallel?
try prefect https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target pangeo_forge_recipes/executors/prefect.py

Deprecate `for_cdo` option

The upcoming release of CDO (2.1.1) will better support temporal chunks >1 with a significant speed-up. After testing this claim, we might want to remove the for_cdo option, document that it is only necessary for cdo < 2.1.1, or raise an info message during execution.

xbitinfo/xbitinfo/save_compressed.py

Lines 11 to 19 in ebf417e

    if for_cdo:  # take shape as chunksize and ensure time chunksize 1
        if time_dim in da.dims:
            time_axis_num = da.get_axis_num(time_dim)
            chunksize = da.data.chunksize if da.chunks is not None else da.shape
            # https://code.mpimet.mpg.de/boards/2/topics/12598
            chunksize = list(chunksize)
            chunksize[time_axis_num] = 1
            chunksize = tuple(chunksize)
            return chunksize

FEATURE: shift the data range

@milankl mentioned very important points in milankl/BitInformation.jl#38 (comment) which could/should be implemented in our xbitinfo pipeline:

Premises have to be checked / questions have to be answered before doing the bitinformation+round business:

1. Is the data rather linearly or logarithmically distributed?

2. Which binary encoding is most appropriate given 1. (integer/fixed-point/linear quantization vs floats)?

3. Analyse the bitinformation within that appropriate encoding.

4. Bitround in that encoding too (and you'll either have rigid absolute error bounds for linear or relative for log).

Problem is obviously that most people want to use floats regardless of what data they are handling. Sure that makes sense. So for linearly distributed data (where you want absolute errors to have rigid bounds) we have to come up with a workaround to better adhere to 1.-4. while using floats. Let's take your sea surface temperature example and say you have the high precision data in ˚C and definitely want to store the data as ˚C, so what can one do?

shift the data into a range where floats are linear, i.e. all data is within a power of 2. For ˚C you could convert to Kelvin (although I'd advise not to* unless you store in K) but you can also just add 256 (if no neg temperatures) or 512, a power of 2 is a good choice. Now your encoding matches your data distribution.

analyse the bitinformation now. While this gives you mantissa bits, this actually suggests an absolute error and not a relative one for your data as all exponent bits are identical anyway.

now subtract your initial offset again and you'll get more mantissa bits of precision for higher temperatures and fewer for lower temperatures so that your max absolute error is globally constant.
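The shift, round, and shift-back steps can be sketched as follows for a handful of temperatures. This is an illustration under assumptions: the bitround helper is my own standard-library version (integer add-and-mask, round to nearest), the offset of 256 follows the suggestion above for non-negative values below 256, and the temperatures are made up.

```python
import struct


def bitround_float32(value, keepbits):
    """Round a float32 value to `keepbits` mantissa bits (nearest, ties to even)."""
    drop = 23 - keepbits
    if drop <= 0:
        return value
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF >> drop) << drop
    bits += ((1 << (drop - 1)) - 1) + ((bits >> drop) & 1)
    bits &= mask
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits))
    return rounded


def shifted_bitround(values, keepbits, offset=256.0):
    """Shift into [offset, 2*offset), bitround there, then shift back."""
    return [bitround_float32(v + offset, keepbits) - offset for v in values]


temperatures = [0.5, 12.3, 25.7, 31.9]  # made-up values in degrees C, all in [0, 256)
keepbits = 10
rounded = shifted_bitround(temperatures, keepbits)

# within [256, 512) float32 values with `keepbits` mantissa bits are spaced
# 256 * 2**-keepbits apart, so the rounding error is at most half of that
error_bound = 256.0 * 2.0**-keepbits / 2
max_error = max(abs(r - t) for r, t in zip(rounded, temperatures))
```

Because all shifted values share one exponent, the same absolute bound applies to 0.5 and to 31.9, which is the point of the workaround.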

Move information content filtering from get_keepbits to get_bitinformation

The filtering of the bit information is currently done in get_keepbits, however, I think we should move this to get_bitinformation since this is the more appropriate place and we might even only return ICfilt in get_bitinformation because this is what people are most likely interested in.

This is the part I'm proposing to move to get_bitinformation:

    # filter insignificant information via binomial distribution as explained in methods
    # for some variables the last bits do not have a free entropy of 1 bit, in that case
    # adjust the threshold by the information in the very last bits (which should always be 0)

    ic = values(bitinfo)
    p = BitInformation.binom_confidence(maskinfo,0.99)  # get chance p for 1 (or 0) from binom distr
    M₀ = 1 - entropy([p,1-p],2)                            # free entropy of random 50/50 at trial size
    threshold = max(M₀,1.5*maximum(ic[end-3:end]))         # in case the information never drops to zero
                                                               # use something a bit bigger than maximum
                                                               # of the last 4 bits
    insigni = (ic .<= threshold) .& (collect(1:length(ic)) .> 9)
    ic[insigni] .= floatmin(Float64)
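If this moves to the Python side, the snippet could be ported roughly as below. This is a sketch under assumptions: the formula in binom_confidence is my recollection of BitInformation.jl's function of the same name (normal approximation, capped at 1), and the example values are made up. Note the index shift: Julia's 1-based condition "> 9" becomes "i >= 9" with 0-based indices.

```python
import math
import sys
from statistics import NormalDist


def binom_confidence(n, c):
    """Probability p of a biased coin that is still consistent with 50/50 at confidence c."""
    # assumption: mirrors BitInformation.binom_confidence as far as I recall
    p = 0.5 + NormalDist().inv_cdf(1 - (1 - c) / 2) / (2 * math.sqrt(n))
    return min(1.0, p)


def entropy2(probabilities):
    """Shannon entropy in bits, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def filter_insignificant(ic, n, confidence=0.99):
    """Set insignificant information beyond the first 9 bits to the smallest float."""
    p = binom_confidence(n, confidence)
    m0 = 1 - entropy2([p, 1 - p])  # free entropy of a random 50/50 at trial size n
    threshold = max(m0, 1.5 * max(ic[-4:]))
    return [
        sys.float_info.min if (v <= threshold and i >= 9) else v
        for i, v in enumerate(ic)
    ]


# made-up float32 example: 9 sign/exponent bits followed by 23 mantissa bits
ic = [0.0] * 5 + [0.2, 0.6, 0.9, 0.9] + [0.5, 0.1, 1e-3, 1e-6] + [1e-7] * 19
filtered = filter_insignificant(ic, n=10**6)
```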

Fail to install properly from PyPi

xbitinfo version: 0.0.2
Python version: 3.9
Operating System: macOS

Description

I was trying to install xbitinfo via pip but got the following error message when importing it via import xbitinfo:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/__init__.py", line 4, in <module>
    from .bitround import jl_bitround, xr_bitround
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/bitround.py", line 4, in <module>
    from .xbitinfo import _jl_bitround, get_keepbits
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/xbitinfo.py", line 19, in <module>
    jl.using("BitInformation")
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 644, in using
    self.eval("using %s" % module)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 621, in eval
    ans = self._call(src)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 549, in _call
    self.check_exception(src)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 603, in check_exception
    raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
julia.core.JuliaError: Exception 'ArgumentError: Package BitInformation not found in current path:
- Run `import Pkg; Pkg.add("BitInformation")` to install the BitInformation package.
' occurred while calling julia code:
using BitInformation

What I did

pip install xbitinfo
python
> import xbitinfo

returns the error above.

pip install git+https://github.com/observingClouds/xbitinfo.git
python
> import xbitinfo

works.

Reduce dependencies

pip allows extras when installing; see https://github.com/pangeo-data/climpred/blob/main/setup.py or https://github.com/pydata/xarray/blob/main/setup.cfg for examples of non-essential dependencies.

All visualization libraries could be grouped into one extra, and dask and prefect into a parallel extra.

`ReadOnlyMemoryError` for `julia`/`pycall` when using `asv` on `GHA`

What I want to do?

run julia with asv on github actions and levante/supercomputer but not in pytest

What happens?

   Main.X = X
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 215, in __setattr__
                   self._julia.eval(setter)(value)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 621, in eval
                   ans = self._call(src)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 549, in _call
                   self.check_exception(src)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 603, in check_exception
                   raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
               julia.core.JuliaError: Exception 'ReadOnlyMemoryError' occurred while calling julia code:
               
                           PyCall.pyfunctionret(
                               (x) -> Base.eval(Main, :(X = $x)),
                               Any,
                               PyCall.PyAny)
               asv: benchmark failed (exit status 1)

https://github.com/observingClouds/bitinformation_pipeline/runs/5970342909?check_suite_focus=true

different dtypes for `xb.get_bitinformation(ds)` fail

xbitinfo version: 0.0.1

Description

different dtypes for xb.get_bitinformation(ds) fail

I thought we'd already covered this in

xbitinfo/xbitinfo/xbitinfo.py

Lines 236 to 241 in c91a21f

def _get_bitinformation_kwargs_handler(da, kwargs):
    """Helper function to preprocess kwargs args of :py:func:`xbitinfo.xbitinfo.get_bitinformation`."""
    if "masked_value" not in kwargs:
        kwargs["masked_value"] = f"convert({str(da.dtype).capitalize()},NaN)"
    elif kwargs["masked_value"] is None:
        kwargs["masked_value"] = "nothing"

str(ds.z.dtype).capitalize() # Float32
str(ds.z64.dtype).capitalize() # Float64

What I Did

import xarray as xr
import xbitinfo as xb

ds = xr.tutorial.load_dataset("eraint_uvz")
ds["z64"] = ds.z.astype("float64")  # add float64 var to existing float32

info_per_bit = xb.get_bitinformation(ds, dim="longitude")  # fails
info_per_bit = xb.get_bitinformation(ds[["z"]], dim="longitude", masked_value="convert(Float32,NaN)")  # works
info_per_bit = xb.get_bitinformation(ds[["z64"]], dim="longitude", masked_value="convert(Float64,NaN)")  # works
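Until mixed dtypes are handled internally, a possible workaround is to call the function once per variable with a matching `masked_value`. The sketch below is an assumption about how that could look, not the package's implementation; `julia_nan_for` and `bitinformation_per_variable` are hypothetical helpers, and the real function is passed in as an argument so the snippet stays self-contained:

```python
def julia_nan_for(dtype):
    """Build the Julia NaN expression for a dtype, e.g. 'convert(Float32,NaN)'."""
    return f"convert({str(dtype).capitalize()},NaN)"


def bitinformation_per_variable(ds, get_bitinformation, **kwargs):
    """Call `get_bitinformation` separately for each variable of `ds`.

    Each call gets a masked_value that matches the dtype of that variable.
    Returns a dict mapping variable name to the individual result.
    """
    return {
        name: get_bitinformation(
            ds[[name]], masked_value=julia_nan_for(ds[name].dtype), **kwargs
        )
        for name in ds.data_vars
    }
```

It would be used as `bitinformation_per_variable(ds, xb.get_bitinformation, dim="longitude")`.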

Add "Launch in AWS Studio Lab" to readme.md?

Not sure if you want to add this, but I've been loving AWS Studio Lab, and launching it worked perfectly: it discovered the environment.yml, built the environment, and I ran the notebook without problems. Now the "bitinfo" env and repo are there for me to run again. On AWS Studio Lab, work is persisted, you can log in with your account (free, no AWS account required), and you have 12 hours of CPU time every time you fire it up.

Here's the proposed blurb for the readme:

Launch in SageMaker Studio Lab

If you have an AWS SageMaker Studio Lab account, you can open in Studio Lab using the button below, then when prompted, choose to download the whole repo and to build the conda environment. If you don't have an account, you can sign up for free (no AWS account required).

`get_bitinformation(label="some_label")` prints to screen

I find this too much output. Maybe just issue a short warning/message that the data is read from the label.json file.
Maybe show just one variable, or nothing.
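One way to get a single short message is to go through `logging` instead of printing the cached content. This is only a sketch: `read_bitinfo_from_label` is a hypothetical helper, and the assumption that `<label>.json` holds one entry per variable is made for illustration:

```python
import json
import logging

logger = logging.getLogger("bitinfo_label_sketch")  # hypothetical logger name


def read_bitinfo_from_label(label):
    """Load cached results from `<label>.json` and report it in one line."""
    path = f"{label}.json"
    with open(path) as f:
        cached = json.load(f)
    # One short message instead of dumping the whole file to the screen.
    logger.info("Read bitinformation for %d variable(s) from %s", len(cached), path)
    return cached
```

Users who want no output at all can then raise the logger level instead of needing a new keyword.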

CI failure with python 3.11

With the release of python 3.11, our CI tests fail. See e.g. #163, #161

The reason might be numcodecs zarr-developers/numcodecs#397

Saving to compressed zarr yields lower or negative compression

xbitinfo version:
xbitinfo 0.0.2 pypi_0 pypi
Python version:
python 3.10.5 hdaaf3db_0_cpython conda-forge
Operating System:
macOS Big Sur

Description

I ran the tutorial notebook for saving to .nc and .zarr (https://xbitinfo.readthedocs.io/en/latest/quick-start.html). I expected progressively smaller file sizes, with "original" > "compressed" > "bitrounded". For .nc, this is what happened. For .zarr, however:

the file size of the "original" is 30% smaller than for .nc (2.8 MB).
Calling ds.to_compressed_zarr increases the size by 60% (4.5 MB).
Saving the bitrounded data with to_compressed_zarr yields (as expected) a smaller file size (804 KB), which is still 64% larger than the equivalent .nc file.

I have uploaded the notebook with results at https://github.com/mattphysics/xbitinfo/blob/main/tests/nb_xbitinfo_export.ipynb
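When comparing the two formats, note that a .zarr store is a directory while a .nc file is a single file, so sizes have to be summed over the store. A small helper for this, given as a sketch (`disk_size` is a hypothetical name and may differ from how the notebook measures sizes):

```python
import os


def disk_size(path):
    """Return the size in bytes of a file, or of all files below a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling `disk_size("original.nc")` and `disk_size("original.zarr")` (file names are placeholders) then gives directly comparable numbers.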

Allow dim in get_bitinformation as list

`eraint_uvz` `test_full` fails

Description

https://github.com/observingClouds/xbitinfo/actions/runs/3168552683/jobs/5159830575

assert compressed_size < ori_size
assert 9640058 < 4186398

Did the defaults of to_netcdf change with the new xarray release?

keepbits or keepmantissabits

We don't document this very clearly.
There is a shift of 9: keepbits - 9 = keepmantissabits

keepmantissabits: shown in plot_bitinformation and used in xr_bitround
keepbits: output of get_keepbits, set as attrs in xr_bitround
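The shift of 9 matches the layout of float32, which has 1 sign bit and 8 exponent bits before the mantissa. A sketch of the conversion, where the helper names and the float16 and float64 entries are additions for illustration:

```python
# Sign bit plus exponent bits for IEEE 754 floating-point types.
NON_MANTISSA_BITS = {"float16": 1 + 5, "float32": 1 + 8, "float64": 1 + 11}


def keepbits_to_keepmantissabits(keepbits, dtype="float32"):
    """Convert a bit count that includes sign and exponent to mantissa bits only."""
    return keepbits - NON_MANTISSA_BITS[dtype]


def keepmantissabits_to_keepbits(keepmantissabits, dtype="float32"):
    """Inverse conversion: add the sign and exponent bits back."""
    return keepmantissabits + NON_MANTISSA_BITS[dtype]
```

For float32, `keepbits_to_keepmantissabits(16)` gives 7, in line with keepbits - 9 = keepmantissabits.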

hamocc_2d: https://gist.github.com/aaronspring/9a373397b830c2212c7c4bccaf500a48
mpiom_2d: https://gist.github.com/aaronspring/f3d1012bed2e0ac5b6f0974cb849e598
pwd: /home/m/m300524/bitinformation_pipeline/examples

Could the dimensions in the list be the inner loop rather than the outer loop?

@aaronspring I tried it and it worked for me:

%%time
info_per_bit_x = xb.get_bitinformation(ds_sample, dim=['xi_rho', 'xi_u', 'xi_v'])
Processing Uwave_rms: 100% 82/82 [00:22<00:00, 5.27it/s]
Processing Uwave_rms: 100% 82/82 [00:09<00:00, 11.74it/s]
Processing Uwave_rms: 100% 82/82 [00:09<00:00, 11.11it/s]

CPU times: user 41.6 s, sys: 0 ns, total: 41.6 s
Wall time: 41.2 s

This is great, but I think it could be confusing for the user why all the variables seem to be processed multiple times.

Could the dimensions in the list be the inner loop rather than the outer loop?
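The difference between the two loop orders can be sketched as follows. This only illustrates the iteration order; the function names are hypothetical and do not reflect how `get_bitinformation` is implemented:

```python
def dims_as_outer_loop(variables, dims):
    """Order suggested by the output above: every variable is visited once per dim."""
    for dim in dims:
        for var in variables:
            yield var, dim


def dims_as_inner_loop(variables, dims):
    """Proposed order: each variable is visited once, all dims handled together."""
    for var in variables:
        for dim in dims:
            yield var, dim
```

With the inner-loop order, a progress bar over variables would advance only once per variable.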

Originally posted by @rsignell-usgs in #106 (comment)

a bit related #97

How to apply bitrounding to `MPIESM` `grb`

Goal:

bitround MPIESM grb files, i.e. input grb -> output grb

Issue:

mpiesm grb files can be opened by xarray but varnames and attrs have weird names

Solution ideas:

~~ open grb file with cfgrib (ignoring weird names), apply xr_bitround and save back to grb with cfgrib.xarray_to_grib.to_grib https://github.com/ecmwf/cfgrib/#write-support ~~ doesn't work IMO
~~open grb file with cdo-python, apply xr_bitround and save as grb, i.e. involving no -f nc~~ doesn't work because cdopy always runs the operator as cdo -f nc
use cdo -f nc copy to create a netcdf file and then bitround that

Originally posted by @aaronspring in #44 (comment)

add `bitround_along_dim(keepbits=[])` to be used instead of `inflevels=[]`

It would probably be better to allow another keyword, keepbits=[], which could be used instead of inflevels=[]. Fig. 3 in Klöwer et al. uses keepbits, I guess. With inflevels, you may end up with the same bitrounding for different levels.

Originally posted by @aaronspring in #73 (comment)

allow inplace bitround

inplace operations are against the xr data model but they could reduce the resources needed. Could implement as a keyword inplace=False

if for_cdo:  # take shape as chunksize and ensure time chunksize 1
    if time_dim in da.dims:
        time_axis_num = da.get_axis_num(time_dim)
        chunksize = da.data.chunksize if da.chunks is not None else da.shape
        # https://code.mpimet.mpg.de/boards/2/topics/12598
        chunksize = list(chunksize)
        chunksize[time_axis_num] = 1
        chunksize = tuple(chunksize)
        return chunksize
