lonelyenvoy / python-memoization
A powerful caching library for Python, with TTL support and multiple algorithm options.
License: MIT License
Are pandas dataframes supported as function arguments in a @cached decorated function?
I tried to simplify this example with a smaller dataframe, but @cached does seem to behave as one would expect for smaller dataframes.
However, when I tried the minimal code below with the attached data, I ran into a problem where two clearly different dataframes are interpreted as identical by the @cached decorated function. Thus, df2 doesn't make it through which_df but instead gets the value from the cache, since the cache assumes df2 equals df1 (and it does not!).
This is the test to replicate. Please use the attached data to get the unexpected behavior explained in this issue:
import pandas as pd
from memoization import cached
@cached()
def which_df(df):
    # print("got inside function")
    return df.name
df1 = pd.read_pickle('memoization_test.pkl')
df1.name = "This is DF No. 1"
df2 = df1.interpolate()
df2.name = "This is DF No. 2"
df1.equals(df2) # ==> False, since they are not identical
print(which_df(df1) + ', and it should be DF No. 1')
print(which_df(df2) + ', BUT it should be DF No. 2')
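For what it's worth, here is a possible workaround sketch (not an official fix, and keying on the dataframe's contents is my own assumption about what identity should mean here): pass a custom_key_maker that hashes the dataframe's values, so two different dataframes can never share a cache key.
import pandas as pd
from memoization import cached

def df_key_maker(df):
    # key on the dataframe's values and index so that distinct contents
    # always produce distinct cache keys
    return int(pd.util.hash_pandas_object(df, index=True).sum())

@cached(custom_key_maker=df_key_maker)
def which_df(df):
    return df.name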
I was expecting cache entries to be removed after the TTL expires. This is useful when we want to know how many entries are actually cached at the moment.
When determining how large the cache is via max_size, it may be useful to treat some items as larger than others to provide a better proxy for their memory footprint. For example, I have a function that caches 3D meshes. Setting max_size to a fixed number doesn't capture the fact that some 3D meshes are very large while others are very small.
Similar to the custom_key_maker, a developer could provide an item_size function that returns an integer, allowing them to calculate the size of cached items based on the cache entry. In the use case described above, I might return a value based on the number of vertices in my mesh.
Is anyone actively working on this feature?
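To make the request concrete, here is a sketch of how the proposed option might look. Note that item_size, mesh_size, and load_mesh are all hypothetical names used for illustration; item_size is not part of the current memoization API.
from memoization import cached

def mesh_size(mesh):
    # weight each cached mesh by its vertex count instead of counting it as 1
    return len(mesh.vertices)

# 'item_size' is the hypothetical parameter proposed in this issue
@cached(max_size=1_000_000, item_size=mesh_size)
def load_mesh(path):
    ...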
Using cached does not preserve the type signature.
from memoization import cached
import inspect
def foo(a: str) -> int:
    return int(a)

def bar(a: str) -> int:
    return int(a)

@cached
def baz(a: str) -> int:
    return int(a)

assert inspect.getfullargspec(foo) == inspect.getfullargspec(bar), "foo != bar"
assert inspect.getfullargspec(foo) == inspect.getfullargspec(baz), "foo != baz"
Expected: No output
Actual result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError: foo != baz
This prevents type validation tools, such as mypy, from being used to validate the usage of methods wrapped in cached.
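One possible mitigation, offered as a sketch rather than a memoization feature: cast the wrapped callable back to its original type so that static checkers such as mypy keep validating call sites. This does not change what inspect.getfullargspec reports at runtime, and the helper name _baz is illustrative.
from typing import Callable, cast
from memoization import cached

def _baz(a: str) -> int:
    return int(a)

# expose a name whose static type matches the original signature
baz = cast(Callable[[str], int], cached(_baz))
assert baz("42") == 42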
Right now, a syntax warning is generated with the following code:
@cached
def expensive_function():
    ...
However, the use of @cached in this way can still be valuable. For example, @cached does effectively the same thing, with significantly less code, as the following construct using a global variable:
HAS_RUN = False

def expensive_function():
    global HAS_RUN
    if HAS_RUN:
        return
    HAS_RUN = True
    # ... run the expensive work exactly once ...
Can this syntax warning be configurable and/or removed?
I am getting this deprecation warning: "DEPRECATION: distro-info 0.23ubuntu1 has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of distro-info or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at pypa/pip#12063"
Any ideas to fix it?
May I ask where the cache is stored on disk? Can I define a custom directory to store the cache?
Thanks!
I have a use case where pretty much all mutable unhashable objects are used immutably/read-only and where many different functions can have all kinds of such objects, mixed with hashable objects.
For this, I want to simply use the id instead of the hash (actually, combine the hashes and ids of all parameters into a string and hash that), so a single key maker with signature (*args, **kwargs) would be sufficient. However, the library explicitly checks and disallows that, which means I have to implement many key-maker functions with different signatures that all do exactly the same thing.
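For reference, here is a sketch of the kind of single, generic key maker described above (the names are illustrative). As noted, memoization currently rejects a key maker whose signature differs from the decorated function's, so this exact form cannot be used today.
def _token(obj):
    # use the hash when available, and fall back to the object's identity
    # for unhashable objects that are treated as read-only
    try:
        return str(hash(obj))
    except TypeError:
        return str(id(obj))

def generic_key_maker(*args, **kwargs):
    parts = [_token(a) for a in args]
    parts += [f'{k}={_token(v)}' for k, v in sorted(kwargs.items())]
    return '|'.join(parts)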
Example source:
from memoization import cached
import inspect
class A:
    @cached(ttl=1)
    def b(self, name: str) -> int:
        return len(name) * 2

    def a(self) -> bool:
        return self.b('hello') == 10
Using mypy for the above example gives the error:
/tmp/as-class.py:10: error: Too few arguments for "b" of "A"
/tmp/as-class.py:10: error: Argument 1 to "b" of "A" has incompatible type "str"; expected "A"
If you comment out the @cached line, mypy gives the response:
Success: no issues found in 1 source file
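A possible workaround sketch (my own suggestion, not a library feature): hide the decoration from the type checker so mypy keeps seeing the original method signature, while the cache still applies at runtime.
from typing import TYPE_CHECKING
from memoization import cached

class A:
    def b(self, name: str) -> int:
        return len(name) * 2

    if not TYPE_CHECKING:
        # applied only at runtime; mypy keeps the original signature of b
        b = cached(ttl=1)(b)

    def a(self) -> bool:
        return self.b('hello') == 10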
I am attempting to memoize a class that has an expensive calculation in its __init__. If the class has a __str__ method, decorating the class with @cached() does not throw an error, but also does not seem to function (i.e. no memoization is observed). If you use a custom key maker, you get a TypeError saying the key maker signature does not match the function arguments (I think that is a clue right there).
Memoizing classes is not that uncommon a feature, so I think this should be supported. Here is a simple version of the problem below for reproducing.
from memoization import cached
import time
import timeit
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        # some super intense computation here
        time.sleep(5)

# custom key-maker for the class
def key_maker(point):
    key = hash((point.x, point.y))
    return key

@cached(custom_key_maker=key_maker)
class CachedPoint(Point):
    def __init__(self, x, y):
        super().__init__(x, y)

if __name__ == "__main__":
    t0 = timeit.default_timer()
    p = CachedPoint(1, 2)
    t1 = timeit.default_timer()
    print(f'Making a point took {t1 - t0} seconds')

    t0 = timeit.default_timer()
    p = CachedPoint(1, 2)
    t1 = timeit.default_timer()
    print(f'Re-making a point took {t1 - t0} seconds')
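Until class memoization is supported, one possible workaround (a sketch; make_point is an illustrative name) is to memoize a factory function whose arguments are plain hashables and let it construct the instance.
from memoization import cached

@cached()
def make_point(x, y):
    # the expensive __init__ runs only once per distinct (x, y)
    return Point(x, y)

p = make_point(1, 2)   # slow the first time
q = make_point(1, 2)   # served from the cache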
We have a challenging use-case involving creating animated countdown timers.
Timer animations are generated by a (separate) serverless function.
The timer definition is contained in the URL string: color, expiration date/time, and so on.
Caching of repeat requests with the same single param (the URL string) fails after the TTL.
@cached(ttl=10)
def fetch_count(url):
    ...
    return image
fetch_count(url)
cache hits mount up...but if, 15 seconds later...
fetch_count(url)
fetch_count(url)
fetch_count(url)
according to cache_info(), there aren't any additional hits.
TTL clearing doesn't appear to expunge the key (url string in this case).
Workaround with clock value shouldn't be too difficult, just thought you'd want to know.
Rgds
Is there some way to switch the cache ON/OFF programmatically?
I have a scenario with the following logic:
train() - cycle
batch predict() - cycle
predict()
I want to be able to switch memoization ON for the batch predict() cycle, but have it OFF otherwise.
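A minimal sketch of one way to achieve this with the current API (the names below are assumptions; switching the cache on and off is not a built-in memoization feature): keep both the cached and the uncached callables and dispatch through a flag.
from memoization import cached

CACHE_ENABLED = False

def _predict(x):
    # expensive prediction logic goes here
    return x

_predict_cached = cached(max_size=1024)(_predict)

def predict(x):
    # route through the cache only while CACHE_ENABLED is True
    return _predict_cached(x) if CACHE_ENABLED else _predict(x)
During the batch predict() cycle you would set CACHE_ENABLED to True and set it back to False afterwards.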
Having the cache info is very useful, but I'm missing a way to access the cache itself so I can serialize it and reuse it later. Is there any way to do that already?
>>> f.cache_info()
CacheInfo(hits=8207, misses=1957, current_size=1957, max_size=None,
algorithm=<CachingAlgorithmFlag.LRU: 2>, ttl=None,
thread_safe=True, order_independent=False, use_custom_key=False)
Hello, I have a function which can either return results or time out; a simplified example is below:
@cached()
def longjob(args):
    try:
        data = getdata(args)
        return data
    except TimeoutError:
        return 'data not available, try again later'
When the function cannot return data, there is no reason to cache the result. Is there any way to remove a single entry from the cache? I am aware of cache_clear(), but that removes all cached values.
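One possible pattern, offered as a sketch: it assumes that, like functools.lru_cache, memoization does not store a result for a call that raises, so letting the timeout propagate keeps failures out of the cache. It reuses getdata from the example above; longjob_safe is an illustrative wrapper name.
from memoization import cached

@cached()
def longjob(args):
    # raises on timeout, so nothing gets cached for a failed call
    return getdata(args)

def longjob_safe(args):
    try:
        return longjob(args)
    except TimeoutError:
        return 'data not available, try again later'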
Hi folks,
I really love your module and we even use it at work quite a lot :D
I wanted to contribute and add support for async methods, but got stuck since I cannot really access the cache store behind the cached decorator...
Here's the basic approach:
import asyncio
from time import perf_counter
from functools import wraps

def log(msg: str):
    print(f"[{round(perf_counter(), 3)}]: {msg}")

class AsyncMemoize:
    def __init__(self) -> None:
        self.cache = {}  # very silly store for proof of concept

    def __call__(self, function):
        @wraps(function)
        async def wrapped(*args):
            cache_key = hash(args)
            if cache_key not in self.cache:
                self.cache[cache_key] = await function(*args)
            return self.cache[cache_key]
        return wrapped

@AsyncMemoize()
async def calculate_stuff(n: int):
    await asyncio.sleep(1)  # intensive calculation
    return n

async def main():
    log("Calculating...")
    await calculate_stuff(4)
    log("Calculating again...")
    await calculate_stuff(4)
    log("Calculating other...")
    await calculate_stuff(7)
    log("Calculating other again...")
    await calculate_stuff(7)
    log("All done!")

asyncio.run(main())
I could work with the cached decorator, however, this would mean making an async function call sync and therefore blocking the thread, which of course defeats the purpose.
If you can show me how to access the store directly, I'm going to fork this repo and implement async support (which would be really nice for things like server-side caching on Quart, FastAPI, etc.).
Best regards
Hi,
I am trying to use both max_size and ttl for my cache, but I see that once the cached element is evicted from the cache after the TTL, it is not cached anymore. Using @cached(ttl=5) works as expected: the element is evicted after 5 seconds, and on the next call it is cached again and retrieved from the cache for the following 5 seconds. But when I use @cached(max_size=5, ttl=5), after the element is evicted the subsequent calls are not cached, and all calls after that hit the function instead.
For example, refer to the below code snippet:
from memoization import cached
import time
# @cached(ttl=5) # works as expected
@cached(max_size=5, ttl=5) # does not cache after ttl
def testing_cache(x):
    print("not cached")
    return x

while True:
    print(testing_cache(5))
    print(testing_cache.cache_info())
    time.sleep(1)
Here is a test case, to make it easier:
import unittest
from memoization import cached, CachingAlgorithmFlag, _memoization
import random
from threading import Lock
import time
make_key = _memoization._make_key # bind make_key function
exec_times = {} # executed time of each tested function
lock = Lock() # for multi-threading tests
random.seed(100) # set seed to ensure that test results are reproducible
for i in range(1, 3):
    exec_times['f' + str(i)] = 0  # init to zero

@cached(max_size=5, algorithm=CachingAlgorithmFlag.FIFO, thread_safe=False, ttl=0.5)
def f1(x):
    exec_times['f1'] += 1
    return x

@cached(ttl=0.5)
def f2(x):
    exec_times['f2'] += 1
    return x

class TestMemoization(unittest.TestCase):
    # this test fails
    def test_maxsize_TTL(self):
        self._general_ttl_test(f1)

    # this test passes
    def test_ttl_only(self):
        self._general_ttl_test(f2)

    def _general_ttl_test(self, tested_function):
        # clear
        exec_times[tested_function.__name__] = 0
        tested_function.cache_clear()

        arg = 1
        key = make_key((arg,), None)

        tested_function(arg)
        time.sleep(0.25)  # wait for a short time

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 0)
        self.assertEqual(info.misses, 1)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)

        tested_function(arg)  # this WILL NOT call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 1)
        self.assertEqual(info.misses, 1)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 1)

        time.sleep(0.35)  # wait until the cache expires

        info = tested_function.cache_info()
        self.assertEqual(info.current_size, 1)

        tested_function(arg)  # this WILL call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 1)
        self.assertEqual(info.misses, 2)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 2)

        # The previous call should have been cached, so it must not call the function again
        info = tested_function.cache_info()
        self.assertEqual(info.current_size, 1)

        tested_function(arg)  # this SHOULD NOT call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 2)    # FAILS
        self.assertEqual(info.misses, 2)  # FAILS
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 2)

if __name__ == '__main__':
    unittest.main()
Looking at the profiler output, it seems like the cached decorator is not releasing memory:
32 2578.6 MiB 0.0 MiB 1 c = MemoizeClass()
33 2829.4 MiB 0.0 MiB 1001 for i in range(1000):
34 2829.4 MiB 250.8 MiB 1000 c.get_something(random.randint(0, 4000000000000))
35 2829.4 MiB 0.0 MiB 1 print(c.get_something.cache_info())
36 2829.5 MiB 0.1 MiB 1 print(len(list(c.get_something.cache_items())))
37 2829.5 MiB 0.0 MiB 1 print("############## flushing the cache ################")
38 2829.5 MiB 0.0 MiB 1 c.get_something.cache_clear()
39 2829.5 MiB 0.0 MiB 1 print(len(list(c.get_something.cache_items())))
40 2829.5 MiB 0.0 MiB 1 print(c.get_something.cache_info())
41 2829.5 MiB 0.0 MiB 1 print(f"found some Garbage:{len(gc.garbage)} items")
42 2829.5 MiB 0.0 MiB 1 print(f"collected: {gc.collect()}")
Environment:
python 3.9.5
memoization 0.4.0
Code to reproduce the results. Run with pytest -s for best results.
from memory_profiler import profile
import gc
from memoization import cached
import random
import hashlib
class MemoizeClass:
    def __init__(self):
        self.unique = random.randint(0, 4000000000000)

    @cached(max_size=1362)
    def get_something(self, param):
        return [param] * (2 * 10 ** 5)

@profile
def test_memoization_cache():
    print("\n")
    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())

    del c
    print(f"found some Garbage:{len(gc.garbage)} items")
    print(f"collected: {gc.collect()}")

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())
    print(len(list(c.get_something.cache_items())))

    print("############## flushing the cache ################")
    c.get_something.cache_clear()
    print(len(list(c.get_something.cache_items())))
    print(c.get_something.cache_info())
    print(f"found some Garbage:{len(gc.garbage)} items")
    print(f"collected: {gc.collect()}")

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())
Everything works fine when I use the cache wrapper as in the examples, but I hit a big problem when I want to cache an inner function while designing a lazy query tool. For example:
from memoization import cached
class LazyQuery:
    def __init__(self):
        self.pipeline = list()
        self.cache = cached(max_size=10, ttl=10)  # intended as a shared, configured decorator

    def query1(self, **kwargs):
        @cached
        def func(_input):
            # do something
            return _input
        self.pipeline.append(func)
        return self

    def query2(self, **kwargs):
        @cached
        def func(_input):
            # do something
            return _input
        self.pipeline.append(func)
        return self

    # other query functions with inner-function cache wrappers

    def run(self):
        _input, _output = None, None
        for step in self.pipeline:
            _output = step(_input)
            _input = _output
        return _output

if __name__ == "__main__":
    lazy_query = LazyQuery()
    for i in range(5):
        lazy_query.query1().query2().run()
        lazy_query.pipeline.clear()
In fact, each inner cached wrapper function in every query gets its own cache structure (see id(cache) in caching/lru_cache.py's get_caching_wrapper()). Therefore, it would be better if the cached wrapper could accept an extra positional parameter cache after custom_key_maker:
def get_caching_wrapper(user_function, max_size, ttl, algorithm, thread_safe, order_independent, custom_key_maker, cache):
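One workaround sketch with the current API (shared_step is an illustrative name): define the cached function once at module level and have every pipeline step close over it, so all queries share a single cache instead of creating a new one per inner function.
from memoization import cached

@cached(max_size=10, ttl=10)
def shared_step(tag, _input):
    # a single cache shared by every pipeline step, keyed by (tag, _input)
    return _input

# inside query1 / query2, append a thin closure instead of a freshly
# decorated inner function:
# self.pipeline.append(lambda _input: shared_step('query1', _input))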
Is there a way to bypass (ignore) some specific function arguments when caching?
An example of the desired functionality is provided by joblib.
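With the current API, one approximation (a sketch; compute and verbose are illustrative names) is a custom_key_maker that simply leaves the to-be-ignored argument out of the key, so calls differing only in that argument share a cache entry.
from memoization import cached

def key_ignoring_verbose(x, verbose=False):
    # 'verbose' does not influence the cache key
    return x

@cached(custom_key_maker=key_ignoring_verbose)
def compute(x, verbose=False):
    if verbose:
        print(f"computing {x}")
    return x * x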
I have a need to get/set values in an LFU cache directly, rather than through a function decorator. The need is as follows:
def slow_function(*args, **kwargs):
    cache = choose_cache_out_of_many(*args)
    found = cache.get(*args, **kwargs)
    if found:
        return found
    result = slow_code()
    cache.set(result, *args, **kwargs)
    return result
This pattern of having multiple caches, and only knowing which one to use inside the function that is to be cached, means I cannot use a decorator. How can I access memoization caches directly?
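As a possible approximation with the decorator-only API (a sketch; it assumes the set of caches is known up front, reuses the choose_cache_out_of_many and slow_code names from above, and the dispatch condition is illustrative): create one cached helper per cache and choose among them inside the outer function.
from memoization import cached, CachingAlgorithmFlag

@cached(max_size=256, algorithm=CachingAlgorithmFlag.LFU)
def _slow_a(*args):
    return slow_code()

@cached(max_size=256, algorithm=CachingAlgorithmFlag.LFU)
def _slow_b(*args):
    return slow_code()

def slow_function(*args, **kwargs):
    # pick the helper (and therefore the cache) based on the arguments
    helper = _slow_a if choose_cache_out_of_many(*args) == 'a' else _slow_b
    return helper(*args)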
You write:
:param max_items: The max items can be held in memoization cache
* NOT RECOMMENDED *
This argument, if given, can dramatically slow down the performance.
Would it be better to use lru-dict?
In the numpy universe, a stochastic function's seed can either be fixed by setting it to an int, or deterministic behavior can be switched off by setting the seed to None.
My workaround to ensure the correct behavior is:
def my_keymaker(<the whole signature>, random_seed=None):
    if random_seed is None:
        random_seed = np.random.normal()
    return <usual key for all parameters>, random_seed

@cached(custom_key_maker=my_keymaker)
def function_with_long_signature(<the whole signature>, random_seed=None):
    ...
I understand that with this approach numpy (or equivalent) receives None (and not the random float generated in the key maker), but at the same time we force a new cache key every time the random seed is None.
This approach seems to work nicely but doesn't look very elegant, especially if the function has a long signature...
Is there a more comfortable way to disable the cache when certain parameters take certain values?