Giter VIP home page Giter VIP logo

fast-histogram's Introduction

CI Status asv PyPI release

About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile, and can handle for example non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins that don't compromise on performance. It doesn't do anything complicated - it just implements a simple histogram algorithm in C and keeps it simple. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function here that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module then provides two functions: histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(note that 10_000_000 is possible in Python 3.6 syntax, use 10000000 instead in previous versions)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above:

Comparison of performance between Numpy and fast-histogram

as well as results for the 1D case, also with 30 bins. The speedup for the 2D case is consistently between 20-25x, and for the 1D case goes from 15x for small arrays to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate decision related to performance. You can easily compute the edges yourself if needed though, using numpy.linspace.

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' and 'wrong' orientation - here we adopt the convention which gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull request to improve things.

fast-histogram's People

Contributors

3553x avatar astrofrog avatar atrettin avatar henryiii avatar janpipek avatar manodeep avatar mladenivkovic avatar pllim avatar pre-commit-ci[bot] avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

fast-histogram's Issues

histogram2d memory error when x and y are swapped.

I appreciate your work as well. So I hope this issue report will help make fast-histogram even better.

When I call histogram2d I get a memory error depending on the order of the columns. I have attached a script+data to reproduce the error (on windows 11 and ubuntu 18.04).

python v3.9.13
fast-histogram v0.11
np v1.24.3

reproduction_materials.zip

Thank you for your great work.

histogram result has strange spikes as compared to numpy histogram

Data file : https://ufile.io/cnj9l

Python Code:

import numpy as np
import matplotlib.pyplot as plt
from fast_histogram import histogram1d

data = np.load('x.npz')
h_np, _ = np.histogram(data['x'], bins=1100, range=[0, 1100] )
h_fast = histogram1d(data['x'], bins=1100, range=[0, 1100] )

plt.plot(h_np[:-1], 'r--', label='numpy')
plt.plot(h_fast[:-1], label='fast')
plt.legend()

The result is as :
image

Also in histogram2d, the spikes are there.

histogram1d returns incorrect bins due to rounding error

Hello,

the function histogram1d (and likely histogram2d) can increment the wrong bin when a floating-point rounding error happened. This happens at https://github.com/astrofrog/fast-histogram/blob/main/fast_histogram/_histogram_core.c#L173, when the value of ix is calculated just below a whole number due to rounding. The index lookup in the next line then truncates the value to a whole number resulting in an incorrectly incremented bin one below.

See here:

left, right = (0.9, 1.1)
bins = 10
norm = bins/(right-left)
value = 1.00
ix = (value-left)*norm;
print(ix, int(ix))  # 4.999999999999997 4

The incorrect behaviour of histogram1d can be reproduced with the following script:

#!/usr/bin/env python3

import numpy as np
from fast_histogram import histogram1d

data = np.array([0.999, 1.000, 1.001, 1.099, 1.100, 1.101])

bins, edges = np.histogram(data, bins=10, range=(0.9, 1.1))
fast_bins = histogram1d(data, bins=10, range=(0.9, 1.1))
print(data)
print(edges)
print("#### NumPy")
print(bins)  # [0 0 0 0 1 2 0 0 0 2]
print("#### Fast-Histogram")
print(fast_bins)  # [0. 0. 0. 0. 2. 1. 0. 0. 0. 1.]

Also this includes another error (miscounted right-most value), the incorrect result due to the rounding error is the in the middle of the bins ( (1,2) vs (2,1) ).

different result compared to numpy

Hello there,

I am trying to use this repo to replace numpy but get different result.

I put range as the minimum of the input and the maximum of the input. But I found out that the result is missing some maximum value.

For example,

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10]) freq, bins = np.histogram(test_case, range(np.min(test_case), np.max(test_case + 1))) result = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case)))

Incorrect work for 3d np.array

I'm going to use your package because it is exactly what I need (I just need to add normalization). I found out that histogram1d() works incorrectly with 3d np.arrays.
Here is an example of my test:

from future import division
from fast_histogram import histogram1d
import numpy as np
import matplotlib.pyplot as plt

def pdf_from_array_np(array, bins, range):
pdf, _ = np.histogram(array, bins=bins, range=range, normed=1)
return pdf

def pdf_from_array(array, bins, range):
pdf = histogram1d(array, bins=bins, range=range)
norm = np.sum(pdf)/bins
return pdf/norm

arr = np.random.uniform(-0.5, 0.5, 64**3)
arr = arr.reshape((64, 64, 64))
x = np.linspace(-0.5, 0.5, 100)
pdf_np = pdf_from_array_np(arr, 100, [-0.5, 0.5])
plt.plot(x, pdf_np, label='np')
pdf_fast = pdf_from_array(arr, 100, [-0.5, 0.5])
plt.plot(x, pdf_fast, label='fast')
plt.legend()
plt.show()
print(np.sum((pdf_fast-pdf_np)**2))

figure_1

choice of returned histogram datatype

Thanks for writing this fast histogram package. A task that we astronomers tend to do often and repeatedly :)

Coming back to the package itself, it seems to me that the return datatype of the histograms are doubles. Conceptually, the histogram is simply a count in that bin, and should be of an integral type. Is the return datatype chosen to be double to reflect the input array type? Or is that in keeping with the numpy convention?

Changing the histogram increment to be an integral type (i.e., the postfix ++ operator) might also lead to better performance. (There are some opportunity for performance improvements -- e.g., pointer operations, reduction of scope for variables, unrolled loops + separate counters)

Separately, any particular reason to perform a bitwise and (&) rather than the logical and (&&) here?

Improve how we generate test cases with hypothesis

At the moment, the test_1d_compare_with_numpy and test_2d_compare_with_numpy tests are slightly hacky in how they generate test cases. Here's the code for the 1-d case:

@given(values=arrays(dtype='<f8', shape=st.integers(0, 200),
elements=st.floats(-1000, 1000), unique=True),
nx=st.integers(1, 10),
xmin=st.floats(-1e10, 1e10),
xmax=st.floats(-1e10, 1e10),
weights=st.booleans(),
dtype=st.sampled_from(['>f4', '<f4', '>f8', '<f8']))
@settings(max_examples=500)
def test_1d_compare_with_numpy(values, nx, xmin, xmax, weights, dtype):
if xmax <= xmin:
return
values = values.astype(dtype)
size = len(values) // 2
if weights:
w = values[:size]
else:
w = None
x = values[size:size * 2]

What I'm trying to do is generate two arrays x and w which have the same dtype (either 32-bit or 64-bit floats, big or little endian), have the same 1-d size (sampled between 0 and 200), and have values in the range -1000, 1000. So at the moment I generate a 64-bit array, cast it inside the test, then split it into two. It would be cleaner to be able to directly generate the two arrays directly with the correct dtype, but I can't figure out how to do this.

@Zac-HD - do you have any suggestions how I might be able to achieve this in a cleaner way?

Move this repo upstream?

This seems to be generally useful. Do you think upstream org like NumPy or Scientific Python would be interested to house this repo?

Improve robustness to arbitrary input

I started trying to add some full hypothesis tests such as:


# First some tests with hypothesis value generation to make sure our
# implementation doesn't crash, and do basic checks of validity of the output.

@given(values=arrays(dtype='<f8', shape=st.integers(0, 50),
                     elements=st.floats(-1000, 1000, allow_subnormal=False),
                     unique=True),
       nx=st.integers(1, 10),
       xmin=st.floats(-1e10, 1e10),
       xmax=st.floats(-1e10, 1e10),
       weights=st.booleans(),
       dtype=st.sampled_from(['>f4', '<f4', '>f8', '<f8']))
@settings(max_examples=1000)
def test_1d_basic(values, nx, xmin, xmax, weights, dtype):
    size = len(values) // 2
    w = values[:size] if weights else None
    x = values[size:]
    fast = histogram1d(x, bins=nx, weights=w, range=(xmin, xmax))
    assert np.sum(fast) == np.sum((x < xmax) & (x >= xmin))

but this crashes with a segmentation fault currently, so need to investigate further.

Hypothesis failures

This issue is to collect random Hypothesis failures that need investigating

Bug in v0.7: Incorrect way to translate input number to histogram bin

I have found that the way you're converting the numbers into bins is susceptible to floating point precision errors.

If you try the following code:
fast_histogram.histogram1d([33], 126, (0,126))
you will have a 1 in the histogram in the bin for 32 rather than 33.

This is due to:
(33-0)*(1/(126-0))*126
being evaluated to 32.99999999999999 rather than 33.0.

I have found that changing the reciprocal to a division with the number of bins solves this:
(33-0)*126/(126-0) = 33.0

In effect you need to do the following:
Replace https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L147 with
normx = fnx / (xmax - xmin);
and https://github.com/astrofrog/fast-histogram/blob/master/fast_histogram/_histogram_core.c#L166 with
ix = (tx - xmin) * normx

[0.11] Segmentation fault on tests

Forwarding Debian#1026594:
Since a few weeks, the package produces a segmentation fault when running the tests:

autopkgtest [11:23:37]: test command1: cd $AUTOPKGTEST_TMP && python3 -m pytest --pyargs fast_histogram
autopkgtest [11:23:37]: test command1: [-----------------------
============================= test session starts ==============================
platform linux -- Python 3.11.1, pytest-7.2.0, pluggy-1.0.0+repack
rootdir: /tmp/autopkgtest-lxc.adqdcee1/downtmp/autopkgtest_tmp
plugins: hypothesis-6.61.0
collected 9 items

tests/test_histogram.py FFatal Python error: Segmentation fault

Current thread 0x00007f633c836040 (most recent call first):
  Garbage-collecting
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/data.py", line 850 in __init__
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 891 in new_conjecture_data
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 1031 in cached_test_function
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 803 in generate_mutations_from
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 733 in generate_new_examples
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 880 in _run
  File "/usr/lib/python3/dist-packages/hypothesis/internal/conjecture/engine.py", line 474 in run
  File "/usr/lib/python3/dist-packages/hypothesis/core.py", line 884 in run_engine
  File "/usr/lib/python3/dist-packages/hypothesis/core.py", line 1292 in wrapped_test
  File "/usr/lib/python3/dist-packages/fast_histogram/tests/test_histogram.py", line 78 in test_2d_compare_with_numpy
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 1789 in runtest
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 260 in <lambda>
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 339 in from_call
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 220 in call_and_report
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3/dist-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3/dist-packages/_pytest/main.py", line 324 in _main
  File "/usr/lib/python3/dist-packages/_pytest/main.py", line 270 in wrap_session
  File "/usr/lib/python3/dist-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3/dist-packages/_pytest/config/__init__.py", line 167 in main
  File "/usr/lib/python3/dist-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/usr/lib/python3/dist-packages/pytest/__main__.py", line 5 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, fast_histogram._histogram_core, numpy.linalg.lapack_lite (total: 15)
Segmentation fault

This happens with both Python3.10.9 and 3.11.1. Other versions:

  • hypothesis 6.60.0, 6.61.0 (6.36.0 works!)
  • numpy 1.23.5, 1.24.1
  • pytest 7.2.0,
  • pluggy 1.0.0

This happens both when running during the build (log) and when testing with python3 -m pytest --pyargs fast_histogram on the installed package (log).

Bug: Histogram2d result does not match that of the numpy function

Thank you for this package, it is indeed very fast and I am looking forward to using it.

There is a discrepancy between your histogram2d function and the numpy equivalent, which is demonstrated below:

import numpy as np
from fast_histogram import histogram2d as fast_hist_2d

n_bins = 5	

x = np.array([3,2,5,4,6,2,3,5,3,2,5,3,6,7,3])
y = np.array([1,6,3,2,5,3,1,6,4,6,4,2,4,5,3])

xmin, xmax = np.min(x), np.max(x)
ymin, ymax = np.min(y), np.max(y)

h_fast = fast_hist_2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])


h_np = np.histogram2d(x,y,
	range=[[xmin,xmax], [ymin,ymax]],
	bins = [n_bins,n_bins])[0]

print "h_fast =\n", h_fast
print "\nh_np =\n", h_np

On my machine this outputs

h_fast =
[[ 0. 0. 1. 0. 0.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 0.]
[ 0. 0. 0. 1. 1.]]

h_np =
[[ 0. 0. 1. 0. 2.]
[ 2. 1. 1. 1. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 1. 1.]
[ 0. 0. 0. 1. 2.]]

The fast_histogram result only includes 11 of the 15 points. The 4 missing points are all from the final column

Python 3.12

I don't see py312 in the publish workflow.

setup_requires without install_requires not supported by NumPy

In setup.py the build-time dependency on NumPy is defined via a setup_requires without an install_requires. This is poorly supported, and leads to the following problem on a clean python install

pip3 install fast-histogram
python3 -c 'import fast_histogram'
ImportError: No module named 'numpy.core._multiarray_umath'

While this might be an issue with Numpy (see numpy/numpy#6599), SciPy solved the problem via scipy/scipy#3566.

Long doubles make histogram2d fail subtly, maybe add type check?

I'm experimenting with various optimizations in the pulsar search code in HENDRICS (StingraySoftware/HENDRICS#65). One of my bottlenecks was a 2d histogram operation, and I gave a shot to yours. Tested with simulated data, everything worked and histogram2d was not a bottleneck anymore. Yay!
However...
When I used it on real data, Jupyter notebook was giving "dead kernel" messages with no information whatsoever. After spending one day banging my head, I tried to cast the longdouble arrays to double, and everything worked again.

It would help to add a type check, raising a ValueError or something similar if the number type is unsupported.

Allow to pass various types of arguments (like numpy)

Hi, I would like to try your package, but my code keeps throwing errors.
First, I do not use range parameter at all (I have no need). That is OK, I can add a few numbers.
However, when I try to run it again, I get The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() since I use bins=[bin_x,bin_y].

I can in general rewrite my code but my sets of bins are quite sophisticated and it would be a bit painful.
Is it possible to update your code somehow?

Sincerely,
V.

Numpy version in pyproject.toml should be pinned

Specifying build requirements in pyproject.toml leads to pip using a feature called build isolation, which means that it doesn't use any existing installed package but instead creates an isolated environment and pre-installs the build dependencies. In the case where someone does:

FROM ubuntu:16.04
RUN apt-get update
RUN apt-get install -y python3 python3-dev python3-pip python3-wheel
RUN pip3 install pip --upgrade
RUN pip install fast-histogram==0.6 numpy==1.12.1
RUN python3 -c 'import fast_histogram'

the issue is that fast-histogram gets built against the latest version of numpy since the 'numpy' is in the pyproject.toml file, but we then install an older version.

The way pyproject.toml is intended to be used, we should actually pin numpy to the oldest compatible version, e.g. numpy==1.9.0. The problem however is that this minimum version depends on Python version and we need to use something called environment markers:

    "numpy==1.8.2; python_version<='3.4'",
    "numpy==1.9.3; python_version=='3.5'",
    "numpy==1.12.1; python_version=='3.6'",
    "numpy==1.13.1; python_version>='3.7'",

But this is only supported with very recent versions of pip so we might want to hold back from doing that.

See a related discussion for scipy: scipy/scipy#8734

GIL release

Nominally in C code that takes some time one can release the GIL with:

Py_BEGIN_ALLOW_THREADS
<Non-Python calling code>
Py_END_ALLOW_THREADS

It looks like it would be pretty trivial to enclose your for loops with these macros, is this something you're interested in?

Tropical Rainfall Rebinning and Error Band Issues

  1. Histogram Rebinning with Incorrect Normalization
  • Problem: When rebinning histograms, the normalization appears incorrect, leading to inaccurate PDF distributions. Currently, rebinning is disabled in the command-line interface (CLI) to avoid this issue.
  • Action Required: Investigate the rebinning logic to determine the root cause of the incorrect normalization. It involves:
    • Checking the number of counts in the histogram before and after rebinning.
  1. Error Band for Daily Variability
  • Problem: The error band calculation for daily variability uses a fitting method, which may not be optimal. The current method could lead to an inaccurate representation of variability error band.
  • Action Required: Adjust the error band calculation by:
    • Removing the fitting step to avoid potential biases.
    • Using the 95th percentile for error band calculation instead of the 75th percentile. This change aims to capture a broader range of variability, providing a more robust representation.
  1. Also:
  • Add error bars!
  • Be sure that the plot is not empty while you are saving it.
  • Proposed deletion of the diagnostics/tropical_rainfall/cli/run_cli_tropical_rainfall.sh script.
  • Implemented a warning system to alert users of potentially lengthy computation times, providing an estimated duration for better planning and expectation management.

Add OpenMP

I think it should be possible to rewrite the loops to include OpenMP pragmas. The final update could be protected with #pragma omp atomic and you can use default(shared) private(tx ty ix iy) firstprivate(dataptr) or something similar. You'll need to redo the loops, however, you'll need for loops instead of the inner while (or maybe the outer do while), and the stride part might need to be precalculated in some way.

Suggestion: memory allocation enhancement

Hi, Thomas

First of all, thank you for the package, it's really handy.
However, I have a particular problem that might require an additional modification of the code.
I have to calculate a lot of histograms (starting from the trivial case with one bin only and up to probably 10^6 bins) using the same grid but passing different sets of data. That means constant allocations of memory whereas it is possible to do this once and just keep clearing the array.

Is it potentially possible to implement this?

V.

allow density/normed argument

Thank you for your cool package!

numpy.histogram and numpy.histogram2d support the normed argument, which divides counts by the number of samples and the bin width/area. It would be great if fast-histogram could support this as well.

upstreaming

Do you have any intention of upstreaming this? I feel like it would be really useful (together with having a bincount2d which is what I'm usually missing).

messing around with alternatives

Hey.
I was just playing around with this and was trying to see if there's a way to implement this efficiently with standard libs.

My usual way to do things like this is using the scipy.sparse.coo_matrix construct.

import scipy.sparse as sp

def bincount2d(x, y, bins):
	return sp.coo_matrix((np.ones(x.shape[0]), (x, y)), shape=(bins, bins), dtype=np.int)

If the data was scaled so that making it ints would put it in the right bins, this would work.

import numpy as np
x = np.random.random(10_000_000)
y = np.random.random(10_000_000)

from fast_histogram import histogram2d
%timeit _ = histogram2d(x, y, range=[[0, 1], [0, 1]], bins=30)

36.8 ms ± 4.14 ms per loop

xx = (x * 30)
yy = (y * 30)

%timeit bincount2d(xx, yy, bins=30)

153 ms ± 4.04 ms per loop

So your code would "only" be 5x faster, so it's about a 4x speedup over numpy.
Unfortunately I cheated and didn't include shifting ``xx` so that the data aligns with the bins. I don't think it's possible to make this work without copying the data at least once, which is why I'm giving up on this route.

Thought it might be of interest, though ...

Add citation

At least hook in a Zenodo DOI to releases and add that badge to README?

Weighted histogram differs from numpy unless the data is explicity copied.

First, thanks for the great library. I suspect I am hitting a memory issue. Unless I explictly copy the arrays before passing them to histogram2D, the results differ from numpy. So far, I only saw this happen when I am using the weighted version of the 2Dhistogram. Code and output follows

#! /usr/bin/env python
import numpy as np
import fast_histogram as ft

print(f"Numpy version {np.__version__}")
print(f"Fast-histogram version {ft.__version__}")

# Initial data
#######################################################
nbins=100
max_val=17.64
min_val=0.0
set_range=[[min_val, max_val],[min_val, max_val]]
with np.load("data.npz") as d:
    pot = d["pot"]
    displ_new = d["displ_new"]
    bond_ind = d["bond_ind"]


# Without copy
print("NOT EXPLICITLY COPYING THE DATA")
pot_array = pot[:, bond_ind[0,0], bond_ind[0,1]]
arr1 = displ_new[:,bond_ind[0,0]]
arr2 = displ_new[:,bond_ind[0,1]]

#Run the histograms
h_np = np.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)[0]
h_ft = ft.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)

print("===FAST-HIST===")
print(h_ft)
print("===NUMPY===")
print(h_np)


# With copy
print("EXPLICITLY COPYING THE DATA")
pot_array = pot[:, bond_ind[0,0], bond_ind[0,1]].copy()
arr1 = displ_new[:,bond_ind[0,0]].copy()
arr2 = displ_new[:,bond_ind[0,1]].copy()

#Run the histograms
h_np = np.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)[0]
h_ft = ft.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)

print("===FAST-HIST===")
print(h_ft)
print("===NUMPY===")
print(h_np)

with output

Numpy version 1.18.1
Fast-histogram version 0.8
NOT EXPLICITLY COPYING THE DATA
===FAST-HIST===
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.0425809  0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.01108671 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
===NUMPY===
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
 [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
EXPLICITLY COPYING THE DATA
===FAST-HIST===
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
 [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
===NUMPY===
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
 [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]

The data file necessary to run the code is attached. Due to github restrictions, I had to zip it first.
data.npz.zip

NumPy 2.0

I don't see any test against numpy 2.0.dev in the CI. Will this package be compatible with numpy 2.0?

return two results

numpy.histogram returns both counts and edges, while histogram1d only returns counts.

setup.py requires numpy

The current version of setup.py imports numpy to get its include path. I'm finding this to be a problem when you add fast-histogram as a dependency in your requirements.txt. Even if you also add numpy as a requirement, there is no way to enforce that numpy will have been installed by the time the setup.py in fast-histogram gets run.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.