
xbitinfo's Introduction


xbitinfo: Retrieve information content and compress accordingly


This is an xarray wrapper around BitInformation.jl to retrieve and apply bitrounding from within Python. The package aims to provide an easy pipeline to compress (climate) datasets based on their real information content.

How the science works

Paper

Klöwer, M., Razinger, M., Dominguez, J. J., Düben, P. D., & Palmer, T. N. (2021). Compressing atmospheric data into its real information content. Nature Computational Science, 1(11), 713–724. doi: 10/gnm4jj

Video

Video

Julia Repository

BitInformation.jl

How to install

Preferred installation

conda install -c conda-forge xbitinfo

Alternative installation

pip install xbitinfo # ensure to install julia manually

How to use

import xarray as xr
import xbitinfo as xb
example_dataset = 'eraint_uvz'
ds = xr.tutorial.load_dataset(example_dataset)
bitinfo = xb.get_bitinformation(ds, dim="longitude")  # calling bitinformation.jl.bitinformation
keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)  # get number of mantissa bits to keep for 99% real information
ds_bitrounded = xb.xr_bitround(ds, keepbits)  # bitrounding keeping only keepbits mantissa bits
outpath = "eraint_uvz_bitrounded.nc"  # example output path; adjust as needed
ds_bitrounded.to_compressed_netcdf(outpath)  # save to netcdf with compression

Credits

xbitinfo's People

Contributors

aaronspring, ayoubft, dependabot[bot], github-actions[bot], ishaanj18, milankl, observingclouds, ocefpaf, pre-commit-ci[bot], raybellwaves, rsignell-usgs


xbitinfo's Issues

Low number of `keepbits` using `get_keepbits` in `quick-start.ipynb`

  • xbitinfo version: 0.0.3
  • Python version: 3.10.4
  • Operating System: Linux (Pop!_OS 22.04 LTS)

Description

I am trying to verify if the library identifies a reasonable number of bits to keep. I am working with the quick_start.ipynb notebook.

I plotted ds and ds_rounded side-by-side for variable u (I fixed month and level to their first value). I also checked variable v and get similar issues. I expected that by looking at the plots, I would not be able to tell which dataset had been rounded. Instead, I can clearly see which dataset has been rounded.

What I Did

Plotting both original and bitrounded u in the notebook quick_start.ipynb (as taken on current main branch)

import matplotlib.pyplot as plt
var = "u"
fig, (ax1,ax2) = plt.subplots(1,2,dpi=200)
ax1.set_title("uncompressed")
ax2.set_title(f"99% information (keepbits = {keepbits[var][0]})")
ax1.imshow(ds[var].isel(month=0,level=0))
ax2.imshow(ds_bitrounded[var].isel(month=0,level=0))

[Figure: u_compression]

More info

I'm pretty sure this is related to the functions get_keepbits and _cdf_from_info_per_bit. Before keeping only 99% of the bit information, there is a preliminary cleaning step in which only bits that satisfy bit info > 1.5*max(last 4 bit info) are kept. This process is shown in the following figure:
[Figure: u_bitinformation]

The comment in the _cdf_from_info_per_bit function explaining the cleaning step is # set below rounding error from last digit to zero. Does this mean that the large bit-information values for the last bits are due to this rounding error? Is the idea behind the procedure that one should not trust bit information of a size comparable to that of these last bits?

I could agree with that, but then how should we interpret the first comparison figure above? The original plot seems more physical than the rounded one. If we were to trust get_keepbits, should we conclude that we should not trust the data beyond the level of precision given by keepbits?

Less strict condition

Say that we can still trust mantissa bits with small information content even though the last few bits have suspiciously large information. Could the criterion for trusting bit information be based on its variation rather than its amplitude? We expect bit information to generally decrease as we look further down the chain (higher mantissa bit number), correct? With this in mind, we could simply identify the bit with minimum information (or the bit with the 4th-smallest information, to mimic your condition) and drop the higher mantissa bits, because the information to the right of this bit increases, which is the behaviour we consider problematic. Something more refined, e.g. cutting mantissa bits once a sequence of 3 increasing bit-information values is observed, could also be considered.
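A minimal sketch of the two variation-based criteria proposed above, in plain Python. The function names and the per-bit numbers are made up for illustration; neither function is part of xbitinfo.

```python
def cutoff_at_minimum(info):
    """Index of the mantissa bit with the least information.

    Bits to the right of this index would be dropped.
    """
    return min(range(len(info)), key=info.__getitem__)


def cutoff_at_increasing_run(info, run=3):
    """Index of the last bit before the first run of `run` consecutive increases.

    Returns the last index if the information never increases `run` times in a row.
    """
    increases = 0
    for i in range(1, len(info)):
        increases = increases + 1 if info[i] > info[i - 1] else 0
        if increases == run:
            return i - run
    return len(info) - 1


# Hypothetical information content of 10 mantissa bits (made-up numbers):
# it decreases at first and then rises again towards the last bits.
info = [0.9, 0.7, 0.4, 0.1, 0.02, 0.01, 0.03, 0.05, 0.08, 0.08]

last_kept_min = cutoff_at_minimum(info)
last_kept_run = cutoff_at_increasing_run(info)
print(last_kept_min, last_kept_run)
```

Both criteria only look at where the information starts rising again, so they do not depend on the absolute size of the last bits.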

`get_bitinformation(label="test")` when used second time

Calling get_bitinformation a second time with the same label raises a JSON-related error:

bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")
bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 bitinfo = bp.get_bitinformation(ds, dim="x", label=".test")
      2 bitinfo

File ~/bitinformation_pipeline/bitinformation_pipeline/bitinformation_pipeline.py:80, in get_bitinformation(ds, label, overwrite, **kwargs)
     35 """Wrap BitInformation.bitinformation().
     36 
     37 Inputs
   (...)
     77 
     78 """
     79 if label is not None and overwrite is False:
---> 80     info_per_bit = load_bitinformation(label)
     81     if info_per_bit is None:
     82         overwrite = True

File ~/bitinformation_pipeline/bitinformation_pipeline/bitinformation_pipeline.py:117, in load_bitinformation(label)
    115 if os.path.exists(label_file):
    116     with open(label_file) as f:
--> 117         info_per_bit = json.open(f)
    118     print(info_per_bit)
    119     return info_per_bit

AttributeError: module 'json' has no attribute 'open'
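The traceback points at json.open, which does not exist in the standard library; json.load is the function that reads from an open file. A minimal sketch of a loader along those lines follows. The ".json" suffix and the return value of None for a missing file are assumptions based on the snippets in this thread, not the package's actual code.

```python
import json
import os
import tempfile


def load_bitinformation(label):
    """Load cached information content from ``<label>.json``, or return None."""
    label_file = label + ".json"  # assumed file naming
    if os.path.exists(label_file):
        with open(label_file) as f:
            return json.load(f)  # json.load, not json.open
    return None


# Round trip with a temporary file and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    label = os.path.join(tmp, "test")
    with open(label + ".json", "w") as f:
        json.dump({"u": [0.0, 0.5, 1.0]}, f)
    cached = load_bitinformation(label)
    missing = load_bitinformation(os.path.join(tmp, "absent"))
```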

`info_per_bit` as `xr.Dataset`

As info_per_bit is saved as a dict, which is also the underlying structure of an xr.Dataset, we could easily convert this dict to an xr.Dataset to gain its functionality.

So I propose to still serialize as JSON but return an xr.Dataset to the user.

import json

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

with open(label + ".json") as f:  # label: the label passed to get_bitinformation
    info_per_bit = json.load(f)

dsb = xr.Dataset()
for v in info_per_bit.keys():
    dsb[v] = xr.DataArray(info_per_bit[v], dims=["bit"], coords={"bit":np.arange(len(info_per_bit[v]))}, name=v)

# give bits a name, here for float32
dsb = dsb.assign_coords(bit_name=("bit",
                  ["±"] + 
                  [f"e{int(i)}" for i in dsb.sel(bit=slice(1,8)).bit] + 
                  [f"m{int(i-8)}" for i in dsb.sel(bit=slice(9,None)).bit]
                 ))

# TODO: add threshold based on the last four bits

# normalize
dsbn=(dsb.cumsum("bit")/dsb.cumsum("bit").isel(bit=-1,drop=True)).assign_coords(dsb.coords)

threshold = [0.99, 0.99999]
keepbits = (dsbn > xr.DataArray(threshold, dims="threshold")).argmax("bit")

# replace 0 with 31
keepbits = keepbits.where(keepbits!=0, other=dsbn.isel(bit=-1).bit.values)

###
# plotting
keepbits.to_array().plot(hue='threshold',y="variable")

import cmcrameri.cm as cmc
fig,ax=plt.subplots(1,1,figsize=(7,10))
plt.pcolormesh(dsb.to_array(),vmin=0,vmax=1, cmap=cmc.turku_r)
plt.gca().set_xticks(np.arange(.5,.5+dsb.bit_name.size))
plt.gca().set_xticklabels(dsb.bit_name.values)
plt.colorbar()
plt.gca().set_yticks(np.arange(.5,.5+len(dsb.data_vars)))
plt.gca().set_yticklabels(list(dsb.data_vars))
plt.step(keepbits.to_array()+1, np.arange(len(keepbits)))

make Julia optional dependency

  • use the Python implementation if Julia isn't installed, but raise a warning

  • use xr_bitround instead of jl_bitround

  • helps #95 and Julia installation issues

  • Continues #126

  • In the Pangeo showcase, @rsignell-usgs also copied some code to avoid importing xbitinfo because of Julia

Retrieval of maximum keepbits

The bitwise information content can depend on the dimension bitinformation is applied to. Changes along dimensions like lon and time are rather small compared to those along the lat dimension. This might depend on the dataset though and the user might want to run the analysis across each dimension individually to identify the dimension with most keepbits and use those.

I propose to allow get_bitinformation to handle a list of dims or a string like maxdim, and to return the keepbits of the dimension with the largest information content.
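As a rough illustration of the selection step, the sketch below picks, for each variable, the dimension that yields the most keepbits. It assumes the keepbits have already been computed once per dimension; the nested dict layout, the function name and all numbers are hypothetical.

```python
def max_keepbits(keepbits_per_dim):
    """For each variable, return (dim, keepbits) of the dimension with most keepbits.

    Ties keep the dimension that was seen first.
    """
    best = {}
    for dim, per_variable in keepbits_per_dim.items():
        for var, bits in per_variable.items():
            if var not in best or bits > best[var][1]:
                best[var] = (dim, bits)
    return best


# Made-up keepbits from three separate analyses, one per dimension.
keepbits_per_dim = {
    "longitude": {"u": 5, "v": 6},
    "latitude": {"u": 7, "v": 6},
    "level": {"u": 4, "v": 3},
}

best = max_keepbits(keepbits_per_dim)
print(best)
```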

Duplicate memory allocations

Currently, the input data to xbitinfo.get_bitinformation is duplicated when calling:

X = ds[var].values
Main.X = X

Main.X = X is a deep copy operation. In general, this is an issue for large datasets, because a single copy of the dataset can already be too large to load into memory. Furthermore, I observed that the memory is not freed when calling xbitinfo.get_bitinformation again.

This results in several issues/tasks:

  • allocate memory only once, or even better
  • allocate memory lazily
  • free memory after Julia call

`get_bitinformation`: want to skip a variable if `dim` missing or return `None` for its information content along missing `dim`

LGTM! I think it addresses only the first part of @rsignell-usgs comment of

I'd like to specify the "y" dimensions in a list, e.g. dim = ['eta_rho', 'eta_u', 'eta_v'] and then use whichever one is found for a specific variable.

though.
If a variable in a dataset with dims 'eta_rho', 'eta_u', 'eta_v' only depends on 'eta_v', the code might currently fail when e.g. dim='eta_rho' is provided. We could discuss whether we just want to skip that variable in this case or return None for its information content along dim=eta_rho.
Nevertheless, this PR is self-contained, so no objections to merging this directly.

Originally posted by @observingClouds in #106 (review)
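One way the "use whichever dimension is found, otherwise skip" behaviour could look is sketched below. It works on a plain mapping from variable name to dimension names; the helper names and the example variables are hypothetical and not taken from xbitinfo.

```python
import warnings


def pick_dim(var_dims, candidates):
    """Return the first candidate dimension present in var_dims, or None."""
    for dim in candidates:
        if dim in var_dims:
            return dim
    return None


def dims_per_variable(dataset_dims, candidates):
    """Map each variable to its matching dimension; skip variables without one."""
    selected = {}
    for var, var_dims in dataset_dims.items():
        dim = pick_dim(var_dims, candidates)
        if dim is None:
            warnings.warn(f"Skipping {var!r}: none of {candidates} in {var_dims}")
            continue
        selected[var] = dim
    return selected


# Hypothetical variables on staggered grids, plus one without any "y" dimension.
dataset_dims = {
    "zeta": ("ocean_time", "eta_rho", "xi_rho"),
    "u": ("ocean_time", "eta_u", "xi_u"),
    "v": ("ocean_time", "eta_v", "xi_v"),
    "dt": ("ocean_time",),
}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the example output quiet
    selected = dims_per_variable(dataset_dims, ["eta_rho", "eta_u", "eta_v"])
print(selected)
```

Returning None for the skipped variable instead of omitting it would be a one-line change in the loop.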

`xb.get_bitinformation(implementation="julia/python")` yields different results

  • xbitinfo version: main
  • Python version: binder
  • Operating System: binder

Description

What I Did

ds = xr.tutorial.load_dataset("eraint_uvz")
xb.get_bitinformation(ds, dim="longitude", implementation="python")
xb.get_bitinformation(ds, dim="longitude", implementation="julia")

Memory potentially not deallocated correctly when calling get_bitinformation

Description

Running get_bitinformation on a "large" dataset with several variables on a Dask LocalCluster results in high memory usage that accumulates during the loop over the individual variables.

Warnings like the following occur until the workers run out of memory:

distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be
released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged 
memory: 11.13 GiB -- Worker memory limit: 15.72 GiB

Evaluating bitrounded results with metrics?

To ease the decision on specific keepbits for a simulation, it would be great to include some metrics into this package that can quantify the differences between bit-rounded and original values. The simplest being:

  • mean
  • standard deviation

These could also be enhanced by plotting routines that loop over possible keepbits.
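A minimal sketch of such a comparison, looping over a few candidate keepbits. It uses a pure-Python float32 bitround (round to nearest, ties to even) so that it runs without xbitinfo or Julia; the function names and the sample values are assumptions for illustration only.

```python
import statistics
import struct


def bitround(value, keepbits):
    """Round a float32 value to `keepbits` mantissa bits (nearest, ties to even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    drop = 23 - keepbits  # float32 has 23 explicit mantissa bits
    if drop > 0:
        half = 1 << (drop - 1)
        # Adding half - 1 plus the lowest kept bit implements ties-to-even.
        bits += (half - 1) + ((bits >> drop) & 1)
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def compare(values, keepbits):
    """Differences in mean and standard deviation, plus the largest absolute error."""
    rounded = [bitround(v, keepbits) for v in values]
    return {
        "mean_diff": statistics.fmean(rounded) - statistics.fmean(values),
        "std_diff": statistics.pstdev(rounded) - statistics.pstdev(values),
        "max_abs_error": max(abs(r - v) for r, v in zip(rounded, values)),
    }


# Made-up sample, first converted to float32 precision (keepbits=23 drops nothing).
values = [bitround(280.0 + 0.37 * i, 23) for i in range(50)]

for keepbits in (3, 7, 12):
    print(keepbits, compare(values, keepbits))
```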

Add BitInformation.jl version info to output

As a wrapper, we heavily rely on the underlying library. In this case the results depend on BitInformation.jl.
I propose to add the version of this package to the global attributes of the output to keep track of the underlying library.

sphinx readthedocs notebook won't execute

Description

What happens

---------------------------------------------------------------------------
JuliaError                                Traceback (most recent call last)
Input In [1], in <cell line: 1>()
----> 1 import xbitinfo as xb
      3 import xarray as xr
      5 xr.set_options(display_style="text")

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/__init__.py:4, in <module>
      1 """Top-level package for xbitinfo."""
      3 from ._version import __version__
----> 4 from .bitround import jl_bitround, xr_bitround
      5 from .graphics import plot_bitinformation, plot_distribution
      6 from .save_compressed import get_compress_encoding_nc, get_compress_encoding_zarr

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/bitround.py:4, in <module>
      1 import xarray as xr
      2 from numcodecs.bitround import BitRound
----> 4 from .xbitinfo import _jl_bitround, get_keepbits
      7 def _np_bitround(data, keepbits):
      8     """Bitround for Arrays."""

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/checkouts/94/xbitinfo/xbitinfo.py:19, in <module>
     15 path_to_julia_functions = os.path.join(
     16     os.path.dirname(__file__), "bitinformation_wrapper.jl"
     17 )
     18 Main.path = path_to_julia_functions
---> 19 jl.using("BitInformation")
     20 jl.using("Pkg")
     21 jl.eval("include(Main.path)")

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:644, in Julia.using(self, module)
    642 def using(self, module):
    643     """Load module in Julia by calling the `using module` command"""
--> 644     self.eval("using %s" % module)

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:621, in Julia.eval(self, src)
    619 if src is None:
    620     return None
--> 621 ans = self._call(src)
    622 if not ans:
    623     return None

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:549, in Julia._call(self, src)
    547 # logger.debug("_call(%s)", src)
    548 ans = self.api.jl_eval_string(src.encode('utf-8'))
--> 549 self.check_exception(src)
    551 return ans

File ~/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/lib/python3.9/site-packages/julia/core.py:603, in Julia.check_exception(self, src)
    601 else:
    602     exception = sprint(showerror, self._as_pyobj(res))
--> 603 raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
    604                  .format(exception, src))

JuliaError: Exception 'LoadError: InterruptException:
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/multivariate/mvnormal.jl:415
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/multivariates.jl:112
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/Distributions/0h6GE/src/Distributions.jl:1
in expression starting at /home/docs/checkouts/readthedocs.org/user_builds/xbitinfo/conda/94/share/julia/packages/BitInformation/VpaaY/src/BitInformation.jl:1' occurred while calling julia code:
using BitInformation

on RTD: https://readthedocs.org/projects/xbitinfo/builds/16814564/

but works locally!

I guess the Julia installer doesn't install into the right kernel, or I chose the wrong kernel.
I also cannot get nbsphinx_kernel_name = "bitinfo-docs" working in conf.py.

Port plotting functions to `python`

Overview:

  • Fig. 2 #19
  • Fig. 3 seems to work nice with an xr_bitround wrapper
  • Fig. SI 1: #59
  • another one needed?

The plotting function gives a great overview of the information content present in each bit of a variable.
Porting the function to Python should be straightforward and should be preferred over a further call to Julia.

Make maskinfo optional

For large datasets get_keepbits is tremendously slowed down by the calculation of the mask for each individual variable

 "maskinfo": int(ds[var].notnull().sum())

We could make the mask an argument or get it somehow from get_bitinformation. But I would argue we follow Milan's advice #17 (comment) and remove this part from the code.

keep attrs in get_bitinformation

  • now we delete incoming attrs
  • could be useful to keep
  • change existing units to input_units and long_name to input_long or change units to bitinfo_units and bitinfo_long_name
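The first renaming option in the list above could be sketched as follows, on a plain attribute dict. The helper name, the `input_` prefix applied to both keys and the example attributes are illustrative assumptions.

```python
def prefix_input_attrs(attrs, keys=("units", "long_name"), prefix="input_"):
    """Return a copy of attrs in which the selected keys carry a prefix."""
    return {
        (prefix + key if key in keys else key): value
        for key, value in attrs.items()
    }


# Made-up attributes of an input variable.
attrs = {"units": "m s**-1", "long_name": "U component of wind", "source": "example"}
kept = prefix_input_attrs(attrs)
print(kept)
```

This keeps the incoming metadata while freeing `units` and `long_name` for the bit-information output itself.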

Documentation of zarr compression

Explain the different approaches to using BitRounding with Zarr, and explain why we have not chosen the alternative approach below:

import numcodecs
encoding = {
    "var1": {
        "filters": [numcodecs.BitRound(keepbits=14)],
        "compressor": numcodecs.Blosc("zstd",shuffle=numcodecs.Blosc.BITSHUFFLE)
    }
}

ds.to_zarr("out.zarr", encoding=encoding)

But as you mentioned, the issue with this solution is that "filters" are saved in the zarr files (.zattrs and .zmetadata) which makes it necessary for users to have this filter installed on their machine when opening the dataset. However, this particular filter can be seen as a one way filter and does not need a decoder for reading.

Applying bitrounding directly to the Dataset circumvents the writing of the filters information, but still keeps the information about the number of keepbits in the _QuantizeBitRoundNumberOfSignificantDigits attribute.

Originally posted by @observingClouds in #71 (comment)

misleading progressbar in `xb.get_bitinformation(dim=None)`

  • xbitinfo version: 0.0.1

Description

I would hope that the default dim=None works out of the box. Maybe that's too ambitious. This example fails for the variable z (why is z processed second? It's the first in ds.data_vars) for dimension month (which has only two elements).

I observed that two-element dimensions often do not work (however, it seems to have worked for v). Maybe we could raise more informative error messages in xbitinfo?

What I Did

ds = xr.tutorial.load_dataset("eraint_uvz")
info_per_bit = xb.get_bitinformation(ds, dim=None)
Processing v: 100%
3/3 [00:05<00:00, 1.33s/it]
Processing v: 100%
3/3 [00:01<00:00, 2.29it/s]
Processing v: 100%
3/3 [00:01<00:00, 2.67it/s]
Processing z: 0%
0/3 [00:01<?, ?it/s]
---------------------------------------------------------------------------
JuliaError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 info_per_bit = xb.get_bitinformation(ds)
      3 info_per_bit

File ~/xbitinfo/xbitinfo.py:153, in get_bitinformation(ds, dim, axis, label, overwrite, **kwargs)
     87 """Wrap `BitInformation.jl.bitinformation() <https://github.com/milankl/BitInformation.jl/blob/main/src/mutual_information.jl>`__.
     88 
     89 Parameters
   (...)
    149     BitInformation.jl_version:  ...
    150 """
    151 if dim is None and axis is None:
    152     # gather bitinformation on all axis
--> 153     return _get_bitinformation_along_all_dims(
    154         ds, label=label, overwrite=overwrite, **kwargs
    155     )
    156 else:
    157     # gather bitinformation along one axis
    158     if overwrite is False and label is not None:

File ~/xbitinfo/xbitinfo.py:229, in _get_bitinformation_along_all_dims(ds, label, overwrite, **kwargs)
    227     if label is not None:
    228         label = "_".join([label, d])
--> 229     info_per_bit_per_dim[d] = get_bitinformation(
    230         ds, dim=d, axis=None, label=label, overwrite=overwrite, **kwargs
    231     ).expand_dims("dim", axis=0)
    232 info_per_bit = xr.merge(info_per_bit_per_dim.values()).squeeze()
    233 return info_per_bit

File ~/xbitinfo/xbitinfo.py:205, in get_bitinformation(ds, dim, axis, label, overwrite, **kwargs)
    203 logging.debug(f"get_bitinformation(X, dim={dim}, {kwargs_str})")
    204 info_per_bit[var] = {}
--> 205 info_per_bit[var]["bitinfo"] = jl.eval(
    206     f"get_bitinformation(X, dim={axis_jl}, {kwargs_str})"
    207 )
    208 info_per_bit[var]["dim"] = dim
    209 info_per_bit[var]["axis"] = axis_jl - 1

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:621, in Julia.eval(self, src)
    619 if src is None:
    620     return None
--> 621 ans = self._call(src)
    622 if not ans:
    623     return None

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:549, in Julia._call(self, src)
    547 # logger.debug("_call(%s)", src)
    548 ans = self.api.jl_eval_string(src.encode('utf-8'))
--> 549 self.check_exception(src)
    551 return ans

File /srv/conda/envs/notebook/lib/python3.10/site-packages/julia/core.py:603, in Julia.check_exception(self, src)
    601 else:
    602     exception = sprint(showerror, self._as_pyobj(res))
--> 603 raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
    604                  .format(exception, src))

JuliaError: Exception 'AssertionError: Mask has 347040 unmasked values, 0 entries are adjacent.' occurred while calling julia code:
get_bitinformation(X, dim=1, masked_value=convert(Float32,NaN), set_zero_insignificant=true)

Long-term perspective

  • get Milan on board
  • See how well this package is adopted
  • Transfer to pangeo-data or xarray-contrib

`get_bitinformation(dim/axis)` in julia or python notation?

bp.get_bitinformation has a dim keyword.

Currently it accepts str, which it interprets as a dimension name, and it also accepts int, which it passes on to bitinformation.bitinformation in Julia. However, dimensions start at 1 in Julia, whereas axes start at 0 in Python.

I suggest adding +1 internally if dim is an int, mimicking Python behaviour. Alternatively, we could additionally implement axis, which would be a more correct name for what I propose.
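A small sketch of the proposed translation: a dimension name or a 0-based Python axis goes in, a 1-based Julia dimension comes out. The function name and the example dimension tuple are assumptions made for illustration.

```python
def to_julia_dim(dims, dim=None, axis=None):
    """Translate a dimension name or a 0-based axis into a 1-based Julia dimension."""
    if (dim is None) == (axis is None):
        raise ValueError("Provide exactly one of `dim` or `axis`.")
    if dim is not None:
        axis = dims.index(dim)  # raises ValueError if the name is unknown
    elif axis < 0:
        axis += len(dims)  # allow negative axes as in NumPy
    if not 0 <= axis < len(dims):
        raise IndexError(f"axis out of range for {len(dims)} dimensions")
    return axis + 1


# Illustrative dimension order of a 4-D variable.
dims = ("month", "level", "latitude", "longitude")
print(to_julia_dim(dims, dim="longitude"), to_julia_dim(dims, axis=0))
```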

Bitround in python or julia

Is xr_bitround the bitround from xarray as discussed here? For performance I'd suggest using round from BitInformation. I've never actually compared the speed, but recalculation of constants is avoided, reaching 16GB/s, at which point it's memory-bound (and therefore likely an upper bound). It also gives us more flexibility in case we want to change something on the rounding front.

Originally posted by @milankl in #21 (comment)

Deprecate `for_cdo` option

The upcoming release of CDO (2.1.1) will better support temporal chunks >1 with a significant speed-up. After testing this claim, we might want to remove the for_cdo option, document that it is only necessary for cdo < 2.1.1, or raise an info message during execution.

if for_cdo:  # take shape as chunksize and ensure time chunksize 1
    if time_dim in da.dims:
        time_axis_num = da.get_axis_num(time_dim)
        chunksize = da.data.chunksize if da.chunks is not None else da.shape
        # https://code.mpimet.mpg.de/boards/2/topics/12598
        chunksize = list(chunksize)
        chunksize[time_axis_num] = 1
        chunksize = tuple(chunksize)
        return chunksize

FEATURE: shift the data range

@milankl mentioned very important points in milankl/BitInformation.jl#38 (comment) which could/should be implemented in our xbitinfo pipeline:

Premises have to be checked / questions have to be answered before doing the bitinformation+round business:

  1. is the data rather linear or logarithmically distributed?
  2. which binary encoding is most appropriate given 1. (integer/fixed-point/linear quantization vs floats)
  3. Analyse the bitinformation within that appropriate encoding
  4. Bitround in that encoding too (and you'll either have rigid absolute error bounds for linear or relative for log)

Problem is obviously that most people want to use floats regardless of what data they are handling. Sure that makes sense. So for linearly distributed data (where you want absolute errors to have rigid bounds) we have to come up with a workaround to better adhere to 1.-4. while using floats. Let's take your sea surface temperature example and say you have the high precision data in ˚C and definitely want to store the data as ˚C, so what can one do?

  1. shift the data into a range where floats are linear, i.e. all data is within a power of 2. For ˚C you could convert to Kelvin (although I'd advise not to* unless you store in K) but you can also just add 256 (if no neg temperatures) or 512, a power of 2 is a good choice. Now your encoding matches your data distribution.
  2. analyse the bitinformation now. While this gives you mantissa bits, this actually suggests an absolute error and not a relative one for your data as all exponent bits are identical anyway.
  3. now subtract your initial offset again and you'll get more mantissa bits of precision for higher temperatures and fewer for lower temperatures so that your max absolute error is globally constant.
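The three steps above can be sketched in plain Python for float32 values. The bitround helper below is a stand-in written for this example (round to nearest, ties to even), and the temperatures, the offset of 256 and the choice of 10 keepbits are made up. Within [256, 512) all values share one exponent, so rounding to k mantissa bits bounds the absolute error by 2^(7-k), apart from the tiny float32 conversion error.

```python
import struct


def bitround(value, keepbits):
    """Round a float32 value to `keepbits` mantissa bits (nearest, ties to even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    drop = 23 - keepbits
    if drop > 0:
        half = 1 << (drop - 1)
        bits += (half - 1) + ((bits >> drop) & 1)
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


OFFSET = 256.0  # power of 2; valid here because 0 <= t < 256
KEEPBITS = 10
celsius = [0.1 + 0.3 * i for i in range(100)]  # made-up temperatures in degrees C

# Direct rounding: the error bound is relative, so it grows with the value.
direct = [abs(bitround(t, KEEPBITS) - t) for t in celsius]

# Shift, round, shift back: one absolute error bound for all values.
shifted = [abs(bitround(t + OFFSET, KEEPBITS) - OFFSET - t) for t in celsius]
bound = 2.0 ** (7 - KEEPBITS)

direct_small = max(e for t, e in zip(celsius, direct) if t < 1.0)
direct_large = max(e for t, e in zip(celsius, direct) if t >= 16.0)
print(direct_small, direct_large, max(shifted), bound)
```

For the same number of keepbits the shifted variant is coarser for small values, which is the intended trade: the error no longer depends on the magnitude of the temperature.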

Move information content filtering from get_keepbits to get_bitinformation

The filtering of the bit information is currently done in get_keepbits, however, I think we should move this to get_bitinformation since this is the more appropriate place and we might even only return ICfilt in get_bitinformation because this is what people are most likely interested in.

This is the part I'm proposing to move to get_bitinformation:

    # filter insignificant information via binomial distribution as explained in methods
    # for some variables the last bits do not have a free entropy of 1 bit, in that case
    # adjust the threshold by the information in the very last bits (which should always be 0)

    ic = values(bitinfo)
    p = BitInformation.binom_confidence(maskinfo,0.99)  # get chance p for 1 (or 0) from binom distr
    M₀ = 1 - entropy([p,1-p],2)                            # free entropy of random 50/50 at trial size
    threshold = max(M₀,1.5*maximum(ic[end-3:end]))         # in case the information never drops to zero
                                                               # use something a bit bigger than maximum
                                                               # of the last 4 bits
    insigni = (ic .<= threshold) .& (collect(1:length(ic)) .> 9)
    ic[insigni] .= floatmin(Float64)
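For discussion, here is a rough Python port of the Julia snippet above, using only the standard library. The formula inside binom_confidence is an assumption about what BitInformation.binom_confidence computes (a normal approximation to the binomial), and the function names and sample numbers are made up; treat it as a sketch rather than a reference implementation.

```python
import math
import sys
from statistics import NormalDist


def binom_confidence(n, c):
    """Probability p of a 1 (or 0) that n fair coin flips reach at confidence c.

    Assumed normal approximation, capped at 1.
    """
    p = 0.5 + NormalDist().inv_cdf(1 - (1 - c) / 2) / (2 * math.sqrt(n))
    return min(1.0, p)


def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution [p, 1 - p]."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def filter_information(ic, n, confidence=0.99, first_mantissa_bit=9):
    """Set insignificant mantissa information to the smallest positive float."""
    free_entropy = 1 - binary_entropy(binom_confidence(n, confidence))
    threshold = max(free_entropy, 1.5 * max(ic[-4:]))
    # Julia's 1-based `index > 9` corresponds to 0-based `index >= 9` (float32).
    return [
        sys.float_info.min if i >= first_mantissa_bit and value <= threshold else value
        for i, value in enumerate(ic)
    ]


# Made-up float32 information: sign, 8 exponent bits, 23 mantissa bits.
ic = [0.0] + [0.5] * 8 + [0.8, 0.6, 0.3, 0.1, 0.02] + [0.001] * 14 + [0.01] * 4
filtered = filter_information(ic, n=10**6)
print(filtered)
```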

Fail to install properly from PyPi

  • xbitinfo version: 0.0.2
  • Python version: 3.9
  • Operating System: macOS

Description

I was trying to install xbitinfo via pip but got the following error message when importing it via import xbitinfo:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/__init__.py", line 4, in <module>
    from .bitround import jl_bitround, xr_bitround
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/bitround.py", line 4, in <module>
    from .xbitinfo import _jl_bitround, get_keepbits
  File "./mambaforge/envs/114/lib/python3.9/site-packages/xbitinfo/xbitinfo.py", line 19, in <module>
    jl.using("BitInformation")
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 644, in using
    self.eval("using %s" % module)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 621, in eval
    ans = self._call(src)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 549, in _call
    self.check_exception(src)
  File "./mambaforge/envs/114/lib/python3.9/site-packages/julia/core.py", line 603, in check_exception
    raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
julia.core.JuliaError: Exception 'ArgumentError: Package BitInformation not found in current path:
- Run `import Pkg; Pkg.add("BitInformation")` to install the BitInformation package.
' occurred while calling julia code:
using BitInformation

Might be related to https://stackoverflow.com/questions/15853058/run-custom-task-when-call-pip-install

What I did

pip install xbitinfo
python
> import xbitinfo

returns the error above.

pip install git+https://github.com/observingClouds/xbitinfo.git
python
> import xbitinfo

works.

`ReadOnlyMemoryError` for `julia`/`pycall` when using `asv` on `GHA`

What I want to do?

  • run julia with asv on github actions and levante/supercomputer but not in pytest

What happens?

   Main.X = X
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 215, in __setattr__
                   self._julia.eval(setter)(value)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 621, in eval
                   ans = self._call(src)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 549, in _call
                   self.check_exception(src)
                 File "/home/runner/work/bitinformation_pipeline/bitinformation_pipeline/asv_bench/.asv/env/90d8778db9ad901392615fc50cb0cf4e/lib/python3.8/site-packages/julia/core.py", line 603, in check_exception
                   raise JuliaError(u'Exception \'{}\' occurred while calling julia code:\n{}'
               julia.core.JuliaError: Exception 'ReadOnlyMemoryError' occurred while calling julia code:
               
                           PyCall.pyfunctionret(
                               (x) -> Base.eval(Main, :(X = $x)),
                               Any,
                               PyCall.PyAny)
               asv: benchmark failed (exit status 1)

https://github.com/observingClouds/bitinformation_pipeline/runs/5970342909?check_suite_focus=true

different dtypes for `xb.get_bitinformation(ds)` fail

  • xbitinfo version: 0.0.1

Description

  • different dtypes for xb.get_bitinformation(ds) fail
  • I thought we'd already covered this in
    def _get_bitinformation_kwargs_handler(da, kwargs):
        """Helper function to preprocess kwargs args of :py:func:`xbitinfo.xbitinfo.get_bitinformation`."""
        if "masked_value" not in kwargs:
            kwargs["masked_value"] = f"convert({str(da.dtype).capitalize()},NaN)"
        elif kwargs["masked_value"] is None:
            kwargs["masked_value"] = "nothing"
  • str(ds.z.dtype).capitalize() # Float32
  • str(ds.z64.dtype).capitalize() # Float64

What I Did

ds = xr.tutorial.load_dataset("eraint_uvz")
ds['z64']=ds.z.astype("float64") # add float64 var to existing float32

info_per_bit = xb.get_bitinformation(ds, dim="longitude") # fails
info_per_bit = xb.get_bitinformation(ds[["z"]], dim="longitude", masked_value="convert(Float32,NaN)") # works
info_per_bit = xb.get_bitinformation(ds[["z64"]], dim="longitude", masked_value="convert(Float64,NaN)") # works
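One possible cause, judging only from the helper quoted above, is that it writes the default into the shared kwargs dict, so the Float32 default chosen for the first variable would be reused for the float64 one. This is a guess and has not been checked against the package. The sketch below shows a variant that works on a per-variable copy; the function name is hypothetical and dtypes are passed as plain strings so that the example runs without xarray.

```python
def kwargs_for_variable(dtype_name, kwargs):
    """Return a per-variable copy of kwargs with a dtype-matched masked_value."""
    kwargs = dict(kwargs)  # copy, so one variable's default cannot leak into the next
    if "masked_value" not in kwargs:
        kwargs["masked_value"] = f"convert({str(dtype_name).capitalize()},NaN)"
    elif kwargs["masked_value"] is None:
        kwargs["masked_value"] = "nothing"
    return kwargs


shared_kwargs = {}  # what the user passed: no masked_value
dtypes = {"z": "float32", "z64": "float64"}
per_variable = {
    name: kwargs_for_variable(dtype, shared_kwargs) for name, dtype in dtypes.items()
}
print(per_variable)
```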

Add "Launch in AWS Studio Lab" to readme.md?

Not sure if you want to add this, but I've been loving AWS Studio Lab, and launching it worked perfectly: it discovered the environment.yml, built the environment, and I ran the notebook without problems. And now the "bitinfo" env and repo are there for me to run again. On AWS Studio Lab, work is persisted, you can log in with your account (free, no AWS account required), and you have 12 hours of CPU time every time you fire it up.

Here's the proposed blurb for the readme:

Launch in SageMaker Studio Lab

If you have an AWS SageMaker Studio Lab account, you can open in Studio Lab using the button below, then when prompted, choose to download the whole repo and to build the conda environment. If you don't have an account, you can sign up for free (no AWS account required).

Open In SageMaker Studio Lab

Saving to compressed zarr yields lower or negative compression

  • xbitinfo version:
    xbitinfo 0.0.2 pypi_0 pypi

  • Python version:
    python 3.10.5 hdaaf3db_0_cpython conda-forge

  • Operating System:
    macOS Big Sur

Description

I ran the tutorial notebook for saving to .nc and .zarr (https://xbitinfo.readthedocs.io/en/latest/quick-start.html). I expected progressively smaller file sizes, with "original" > "compressed" > "bitrounded". For .nc, this is what happened, while for .zarr

  1. the file size of the "original" is 30% smaller than for .nc (2.8 MB).
  2. Calling ds.to_compressed_zarr increases the size by 60% (4.5 MB).
  3. Saving the bitrounded data with to_compressed_zarr yields (as expected) a smaller file size (804 KB), which is still 64% larger than the equivalent .nc file.

I have uploaded the notebook with results at https://github.com/mattphysics/xbitinfo/blob/main/tests/nb_xbitinfo_export.ipynb

Could the dimensions in the list be the inner loop rather than the outer loop?

@aaronspring I tried it and it worked for me:

%%time
info_per_bit_x = xb.get_bitinformation(ds_sample, dim=['xi_rho', 'xi_u', 'xi_v'])
Processing Uwave_rms: 100%
82/82 [00:22<00:00, 5.27it/s]
Processing Uwave_rms: 100%
82/82 [00:09<00:00, 11.74it/s]
Processing Uwave_rms: 100%
82/82 [00:09<00:00, 11.11it/s]

CPU times: user 41.6 s, sys: 0 ns, total: 41.6 s
Wall time: 41.2 s

This is great, but I think it could be confusing for the user why all the variables seem to be processed multiple times.

Could the dimensions in the list be the inner loop rather than the outer loop?

Originally posted by @rsignell-usgs in #106 (comment)

a bit related #97

How to apply bitrounding to `MPIESM` `grb`

Goal:

  • bitround MPIESM grb files, i.e. input grb -> output grb

Issue:

  • mpiesm grb files can be opened by xarray but varnames and attrs have weird names

Solution ideas:

  • ~~ open grb file with cfgrib (ignoring weird names), apply xr_bitround and save back to grb with cfgrib.xarray_to_grib.to_grib https://github.com/ecmwf/cfgrib/#write-support ~~ doesn't work IMO
  • open grb file with cdo-python, apply xr_bitround and save as grb, i.e. without -f nc: doesn't work because cdopy always runs cdo -f nc operator
  • cdo -f nc copy to create netcdf and then bitround that

Originally posted by @aaronspring in #44 (comment)

Related:

allow inplace bitround

inplace operations are against the xr data model but they could reduce the resources needed. Could implement as a keyword inplace=False
