lonelyenvoy / python-memoization
A powerful caching library for Python, with TTL support and multiple algorithm options.
License: MIT License
Are pandas dataframes supported as function arguments in a @cached decorated function?
I tried to simplify this example with a smaller dataframe, but @cached does seem to behave as one would expect for smaller dataframes.
However, when I tried the minimal code below with the attached data, I ran into a problem where two clearly different dataframes are interpreted as identical by the @cached decorated function. Thus, df2 doesn't make it through which_df but instead gets the value from the cache, since the cache assumes df2 equals df1 (and it does not!).
This is the test to replicate. Please use the attached data to get the unexpected behavior explained in this issue:
import pandas as pd
from memoization import cached
@cached()
def which_df(df):
    # print("got inside function")
    return df.name
df1 = pd.read_pickle('memoization_test.pkl')
df1.name = "This is DF No. 1"
df2 = df1.interpolate()
df2.name = "This is DF No. 2"
df1.equals(df2) # ==> False, since they are not identical
print(which_df(df1) + ', and it should be DF No. 1')
print(which_df(df2) + ', BUT it should be DF No. 2')
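For what it's worth, here is a possible workaround sketch (not an official fix, and keying on the dataframe's contents is my own assumption about what identity should mean here): pass a custom_key_maker that hashes the dataframe's values, so two different dataframes can never share a cache key.
import pandas as pd
from memoization import cached

def df_key_maker(df):
    # key on the dataframe's values and index so that distinct contents
    # always produce distinct cache keys
    return int(pd.util.hash_pandas_object(df, index=True).sum())

@cached(custom_key_maker=df_key_maker)
def which_df(df):
    return df.name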
I was expecting cache entries to be removed after the TTL expires. This is useful when we want to know how many entries are actually cached at the moment.
When determining how large the cache is via max_size, it may be useful to treat some items as larger than others to provide a better proxy for their memory footprint. For example, I have a function that caches 3D meshes. Setting max_size to a fixed number doesn't capture the fact that some 3D meshes are very large while others are very small.
Similar to the custom_key_maker, a developer could provide an item_size function that returns an integer, allowing them to calculate the size of cached items based on the cache entry. In the use case described above, I might return a value based on the number of vertices in my mesh.
Is anyone actively working on this feature?
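To make the request concrete, here is a sketch of how the proposed option might look. Note that item_size, mesh_size, and load_mesh are all hypothetical names used for illustration; item_size is not part of the current memoization API.
from memoization import cached

def mesh_size(mesh):
    # weight each cached mesh by its vertex count instead of counting it as 1
    return len(mesh.vertices)

# 'item_size' is the hypothetical parameter proposed in this issue
@cached(max_size=1_000_000, item_size=mesh_size)
def load_mesh(path):
    ...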
Using cached does not preserve the type signature.
from memoization import cached
import inspect
def foo(a: str) -> int:
    return int(a)

def bar(a: str) -> int:
    return int(a)

@cached
def baz(a: str) -> int:
    return int(a)

assert inspect.getfullargspec(foo) == inspect.getfullargspec(bar), "foo != bar"
assert inspect.getfullargspec(foo) == inspect.getfullargspec(baz), "foo != baz"
Expected: No output
Actual result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError: foo != baz
This prevents type validation tools, such as mypy, from being used to validate the usage of methods wrapped in cached.
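One possible mitigation, offered as a sketch rather than a memoization feature: cast the wrapped callable back to its original type so that static checkers such as mypy keep validating call sites. This does not change what inspect.getfullargspec reports at runtime, and the helper name _baz is illustrative.
from typing import Callable, cast
from memoization import cached

def _baz(a: str) -> int:
    return int(a)

# expose a name whose static type matches the original signature
baz = cast(Callable[[str], int], cached(_baz))
assert baz("42") == 42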
Right now, a syntax warning is generated with the following code:
@cached
def expensive_function():
    ...
However, the use of @cached in this way can still be valuable. For example, @cached does effectively the same thing, with significantly less code, as the following construct using a global variable:
HAS_RUN = False

def expensive_function():
    global HAS_RUN
    if HAS_RUN:
        return
    HAS_RUN = True
    # ... run the expensive work exactly once ...
Can this syntax warning be configurable and/or removed?
I am getting this deprecation warning: "DEPRECATION: distro-info 0.23ubuntu1 has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of distro-info or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at pypa/pip#12063"
Any ideas to fix it?
May I ask where the cache is stored on disk? Can I define a custom directory to store the cache?
Thanks!
I have a use case where pretty much all mutable unhashable objects are used immutably/read-only and where many different functions can have all kinds of such objects, mixed with hashable objects.
For this, I want to simply use the id instead of the hash (actually, combine the hashes and ids of all parameters into a string and hash that), so a single key maker with signature (*args, **kwargs) would be sufficient. However, the library explicitly checks and disallows that, which means I have to implement many key-maker functions with different signatures that all do exactly the same thing.
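For reference, here is a sketch of the kind of single, generic key maker described above (the names are illustrative). As noted, memoization currently rejects a key maker whose signature differs from the decorated function's, so this exact form cannot be used today.
def _token(obj):
    # use the hash when available, and fall back to the object's identity
    # for unhashable objects that are treated as read-only
    try:
        return str(hash(obj))
    except TypeError:
        return str(id(obj))

def generic_key_maker(*args, **kwargs):
    parts = [_token(a) for a in args]
    parts += [f'{k}={_token(v)}' for k, v in sorted(kwargs.items())]
    return '|'.join(parts)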
Example source:
from memoization import cached
import inspect
class A:
    @cached(ttl=1)
    def b(self, name: str) -> int:
        return len(name) * 2

    def a(self) -> bool:
        return self.b('hello') == 10
Using mypy for the above example gives the error:
/tmp/as-class.py:10: error: Too few arguments for "b" of "A"
/tmp/as-class.py:10: error: Argument 1 to "b" of "A" has incompatible type "str"; expected "A"
If you comment out the @cached line, mypy gives the response:
Success: no issues found in 1 source file
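A possible workaround sketch (my own suggestion, not a library feature): hide the decoration from the type checker so mypy keeps seeing the original method signature, while the cache still applies at runtime.
from typing import TYPE_CHECKING
from memoization import cached

class A:
    def b(self, name: str) -> int:
        return len(name) * 2

    if not TYPE_CHECKING:
        # applied only at runtime; mypy keeps the original signature of b
        b = cached(ttl=1)(b)

    def a(self) -> bool:
        return self.b('hello') == 10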
I am attempting to memoize a class that has an expensive calculation in its __init__. If the class has a __str__ method, decorating the class with @cached() does not throw an error, but also does not seem to function (i.e. no memoization is observed). If you use a custom key maker, you get a TypeError saying the key maker signature does not match the function arguments (I think that is a clue right there).
Memoizing classes is not that uncommon a feature, so I think this should be supported. Here is a simple version of the problem below for reproducing.
from memoization import cached
import time
import timeit
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        # some super intense computation here
        time.sleep(5)

# custom key-maker for the class
def key_maker(point):
    key = hash((point.x, point.y))
    return key

@cached(custom_key_maker=key_maker)
class CachedPoint(Point):
    def __init__(self, x, y):
        super().__init__(x, y)

if __name__ == "__main__":
    t0 = timeit.default_timer()
    p = CachedPoint(1, 2)
    t1 = timeit.default_timer()
    print(f'Making a point took {t1 - t0} seconds')

    t0 = timeit.default_timer()
    p = CachedPoint(1, 2)
    t1 = timeit.default_timer()
    print(f'Re-making a point took {t1 - t0} seconds')
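Until class memoization is supported, one possible workaround (a sketch; make_point is an illustrative name) is to memoize a factory function whose arguments are plain hashables and let it construct the instance.
from memoization import cached

@cached()
def make_point(x, y):
    # the expensive __init__ runs only once per distinct (x, y)
    return Point(x, y)

p = make_point(1, 2)   # slow the first time
q = make_point(1, 2)   # served from the cache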
We have a challenging use-case involving creating animated countdown timers.
Timer animations are generated by a (separate) serverless function.
The timer definition is contained in the URL string: color, expiration date/time, and so on.
Caching of repeat requests with the same single param (the URL string) fails after the TTL.
@cached(ttl=10)
def fetch_count(url):
    ...
    return image
fetch_count(url)
cache hits mount up...but if, 15 seconds later...
fetch_count(url)
fetch_count(url)
fetch_count(url)
according to cache_info(), there aren't any additional hits.
TTL clearing doesn't appear to expunge the key (url string in this case).
Workaround with clock value shouldn't be too difficult, just thought you'd want to know.
Rgds
Is there some way to switch the cache ON/OFF programmatically?
I have a scenario with the following logic:
train() - cycle
batch predict() - cycle
predict()
I want to be able to switch memoization ON for the batch predict() cycle, but have it OFF otherwise.
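A minimal sketch of one way to achieve this with the current API (the names below are assumptions; switching the cache on and off is not a built-in memoization feature): keep both the cached and the uncached callables and dispatch through a flag.
from memoization import cached

CACHE_ENABLED = False

def _predict(x):
    # expensive prediction logic goes here
    return x

_predict_cached = cached(max_size=1024)(_predict)

def predict(x):
    # route through the cache only while CACHE_ENABLED is True
    return _predict_cached(x) if CACHE_ENABLED else _predict(x)
During the batch predict() cycle you would set CACHE_ENABLED to True and set it back to False afterwards.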
Having the cache info is very useful, but I'm missing a way to access the cache itself so I can serialize it and reuse it later. Is there any way to do that already?
>>> f.cache_info()
CacheInfo(hits=8207, misses=1957, current_size=1957, max_size=None,
algorithm=<CachingAlgorithmFlag.LRU: 2>, ttl=None,
thread_safe=True, order_independent=False, use_custom_key=False)
Hello, I have a function which can either return results or time out; a simplified example is below:
@cached()
def longjob(args):
    try:
        data = getdata(args)
        return data
    except TimeoutError:
        return 'data not available, try again later'
When the function cannot return data, there is no reason to cache the result. Is there any way to remove a single entry from the cache? I am aware of cache_clear(), but that removes all cached values.
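One possible pattern, offered as a sketch: it assumes that, like functools.lru_cache, memoization does not store a result for a call that raises, so letting the timeout propagate keeps failures out of the cache. It reuses getdata from the example above; longjob_safe is an illustrative wrapper name.
from memoization import cached

@cached()
def longjob(args):
    # raises on timeout, so nothing gets cached for a failed call
    return getdata(args)

def longjob_safe(args):
    try:
        return longjob(args)
    except TimeoutError:
        return 'data not available, try again later'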
Hi folks,
I really love your module and we even use it at work quite a lot :D
I wanted to contribute and add support for async methods, but got stuck since I cannot really access the cache store behind the cached decorator...
Here's the basic approach:
import asyncio
from time import perf_counter
from functools import wraps

def log(msg: str):
    print(f"[{round(perf_counter(), 3)}]: {msg}")

class AsyncMemoize:
    def __init__(self) -> None:
        self.cache = {}  # very silly store for proof of concept

    def __call__(self, function):
        @wraps(function)
        async def wrapped(*args):
            cache_key = hash(args)
            if cache_key not in self.cache:
                self.cache[cache_key] = await function(*args)
            return self.cache[cache_key]
        return wrapped

@AsyncMemoize()
async def calculate_stuff(n: int):
    await asyncio.sleep(1)  # intensive calculation
    return n

async def main():
    log("Calculating...")
    await calculate_stuff(4)
    log("Calculating again...")
    await calculate_stuff(4)
    log("Calculating other...")
    await calculate_stuff(7)
    log("Calculating other again...")
    await calculate_stuff(7)
    log("All done!")

asyncio.run(main())
I could work with the cached decorator, however, this would mean making an async function call sync and therefore blocking the thread, which of course defeats the purpose.
If you can show me how to access the store directly, I'm going to fork this repo and implement async support (which would be really nice for things like server-side caching on Quart, FastAPI, etc.).
Best regards
Hi,
I am trying to use both max_size and ttl for my cache, but I see that once the cached element is evicted from the cache after the TTL, it is not cached anymore. Using @cached(ttl=5) works as expected: the element is evicted after 5 seconds, and on the next call it is cached again and retrieved from the cache for the following 5 seconds. But when I use @cached(max_size=5, ttl=5), after the element is evicted the subsequent calls are not cached, and all calls after that hit the function instead.
For example, refer to the below code snippet:
from memoization import cached
import time
# @cached(ttl=5) # works as expected
@cached(max_size=5, ttl=5) # does not cache after ttl
def testing_cache(x):
    print("not cached")
    return x

while True:
    print(testing_cache(5))
    print(testing_cache.cache_info())
    time.sleep(1)
Here is a test case, to make it easier:
import unittest
from memoization import cached, CachingAlgorithmFlag, _memoization
import random
from threading import Lock
import time
make_key = _memoization._make_key # bind make_key function
exec_times = {} # executed time of each tested function
lock = Lock() # for multi-threading tests
random.seed(100) # set seed to ensure that test results are reproducible
for i in range(1, 3):
    exec_times['f' + str(i)] = 0  # init to zero

@cached(max_size=5, algorithm=CachingAlgorithmFlag.FIFO, thread_safe=False, ttl=0.5)
def f1(x):
    exec_times['f1'] += 1
    return x

@cached(ttl=0.5)
def f2(x):
    exec_times['f2'] += 1
    return x

class TestMemoization(unittest.TestCase):
    # this test fails
    def test_maxsize_TTL(self):
        self._general_ttl_test(f1)

    # this test passes
    def test_ttl_only(self):
        self._general_ttl_test(f2)

    def _general_ttl_test(self, tested_function):
        # clear
        exec_times[tested_function.__name__] = 0
        tested_function.cache_clear()

        arg = 1
        key = make_key((arg,), None)

        tested_function(arg)
        time.sleep(0.25)  # wait for a short time

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 0)
        self.assertEqual(info.misses, 1)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)

        tested_function(arg)  # this WILL NOT call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 1)
        self.assertEqual(info.misses, 1)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 1)

        time.sleep(0.35)  # wait until the cache expires

        info = tested_function.cache_info()
        self.assertEqual(info.current_size, 1)

        tested_function(arg)  # this WILL call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 1)
        self.assertEqual(info.misses, 2)
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 2)

        # The previous call should have been cached, so it must not call the function again
        info = tested_function.cache_info()
        self.assertEqual(info.current_size, 1)

        tested_function(arg)  # this SHOULD NOT call the tested function

        info = tested_function.cache_info()
        self.assertEqual(info.hits, 2)    # FAILS
        self.assertEqual(info.misses, 2)  # FAILS
        self.assertEqual(info.current_size, 1)
        self.assertIn(key, tested_function._cache)
        self.assertEqual(exec_times[tested_function.__name__], 2)

if __name__ == '__main__':
    unittest.main()
Looking at the profiler output, it seems like the cached decorator is not releasing memory:
32 2578.6 MiB 0.0 MiB 1 c = MemoizeClass()
33 2829.4 MiB 0.0 MiB 1001 for i in range(1000):
34 2829.4 MiB 250.8 MiB 1000 c.get_something(random.randint(0, 4000000000000))
35 2829.4 MiB 0.0 MiB 1 print(c.get_something.cache_info())
36 2829.5 MiB 0.1 MiB 1 print(len(list(c.get_something.cache_items())))
37 2829.5 MiB 0.0 MiB 1 print("############## flushing the cache ################")
38 2829.5 MiB 0.0 MiB 1 c.get_something.cache_clear()
39 2829.5 MiB 0.0 MiB 1 print(len(list(c.get_something.cache_items())))
40 2829.5 MiB 0.0 MiB 1 print(c.get_something.cache_info())
41 2829.5 MiB 0.0 MiB 1 print(f"found some Garbage:{len(gc.garbage)} items")
42 2829.5 MiB 0.0 MiB 1 print(f"collected: {gc.collect()}")
Environment:
python 3.9.5
memoization 0.4.0
Code to reproduce the results. Run with pytest -s for best results.
from memory_profiler import profile
import gc
from memoization import cached
import random
import hashlib
class MemoizeClass:
    def __init__(self):
        self.unique = random.randint(0, 4000000000000)

    @cached(max_size=1362)
    def get_something(self, param):
        return [param] * (2 * 10 ** 5)

@profile
def test_memoization_cache():
    print("\n")
    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())

    del c
    print(f"found some Garbage:{len(gc.garbage)} items")
    print(f"collected: {gc.collect()}")

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())
    print(len(list(c.get_something.cache_items())))

    print("############## flushing the cache ################")
    c.get_something.cache_clear()
    print(len(list(c.get_something.cache_items())))
    print(c.get_something.cache_info())
    print(f"found some Garbage:{len(gc.garbage)} items")
    print(f"collected: {gc.collect()}")

    c = MemoizeClass()
    for i in range(1000):
        c.get_something(random.randint(0, 4000000000000))
    print(c.get_something.cache_info())
Everything works fine when I use the cache wrapper as in the examples, but I hit a big problem when I want to cache an inner function while designing a lazy query tool. For example:
from memoization import cached
class LazyQuery:
    def __init__(self):
        self.pipeline = list()
        self.cache = cached(max_size=10, ttl=10)  # intended as a shared, configured decorator

    def query1(self, **kwargs):
        @cached
        def func(_input):
            # do something
            return _input
        self.pipeline.append(func)
        return self

    def query2(self, **kwargs):
        @cached
        def func(_input):
            # do something
            return _input
        self.pipeline.append(func)
        return self

    # other query functions with inner-function cache wrappers

    def run(self):
        _input, _output = None, None
        for step in self.pipeline:
            _output = step(_input)
            _input = _output
        return _output

if __name__ == "__main__":
    lazy_query = LazyQuery()
    for i in range(5):
        lazy_query.query1().query2().run()
        lazy_query.pipeline.clear()
In fact, each inner cached wrapper function in every query gets its own cache structure (see id(cache) in caching/lru_cache.py's get_caching_wrapper()). Therefore, it would be better if the cached wrapper could accept an extra positional parameter cache after custom_key_maker:
def get_caching_wrapper(user_function, max_size, ttl, algorithm, thread_safe, order_independent, custom_key_maker, cache):
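One workaround sketch with the current API (shared_step is an illustrative name): define the cached function once at module level and have every pipeline step close over it, so all queries share a single cache instead of creating a new one per inner function.
from memoization import cached

@cached(max_size=10, ttl=10)
def shared_step(tag, _input):
    # a single cache shared by every pipeline step, keyed by (tag, _input)
    return _input

# inside query1 / query2, append a thin closure instead of a freshly
# decorated inner function:
# self.pipeline.append(lambda _input: shared_step('query1', _input))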
Is there a way to bypass (ignore) some specific function arguments when caching?
An example of the desired functionality is provided by joblib.
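With the current API, one approximation (a sketch; compute and verbose are illustrative names) is a custom_key_maker that simply leaves the to-be-ignored argument out of the key, so calls differing only in that argument share a cache entry.
from memoization import cached

def key_ignoring_verbose(x, verbose=False):
    # 'verbose' does not influence the cache key
    return x

@cached(custom_key_maker=key_ignoring_verbose)
def compute(x, verbose=False):
    if verbose:
        print(f"computing {x}")
    return x * x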
I have a need to get/set values in an LFU cache directly, rather than through a function decorator. The need is as follows:
def slow_function(*args, **kwargs):
    cache = choose_cache_out_of_many(*args)
    found = cache.get(*args, **kwargs)
    if found:
        return found
    result = slow_code()
    cache.set(result, *args, **kwargs)
    return result
This pattern of having multiple caches, and only knowing which one to use inside the function that is to be cached, means I cannot use a decorator. How can I access memoization caches directly?
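As a possible approximation with the decorator-only API (a sketch; it assumes the set of caches is known up front, reuses the choose_cache_out_of_many and slow_code names from above, and the dispatch condition is illustrative): create one cached helper per cache and choose among them inside the outer function.
from memoization import cached, CachingAlgorithmFlag

@cached(max_size=256, algorithm=CachingAlgorithmFlag.LFU)
def _slow_a(*args):
    return slow_code()

@cached(max_size=256, algorithm=CachingAlgorithmFlag.LFU)
def _slow_b(*args):
    return slow_code()

def slow_function(*args, **kwargs):
    # pick the helper (and therefore the cache) based on the arguments
    helper = _slow_a if choose_cache_out_of_many(*args) == 'a' else _slow_b
    return helper(*args)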
You write:
:param max_items: The max items can be held in memoization cache
* NOT RECOMMENDED *
This argument, if given, can dramatically slow down the performance.
Would it be better to use lru-dict?
In the numpy universe, a stochastic function's seed can either be fixed by setting it to an int, or deterministic behavior can be switched off by setting the seed to None.
My workaround to ensure the correct behavior is:
def my_keymaker(<the whole signature>, random_seed=None):
    if random_seed is None:
        random_seed = np.random.normal()
    return <usual key for all parameters>, random_seed

@cached(custom_key_maker=my_keymaker)
def function_with_long_signature(<the whole signature>, random_seed=None):
    ...
I understand that with this approach numpy (or equivalent) receives None (and not the random float generated in the key maker), but at the same time we force a new cache key every time the random seed is None.
This approach seems to work nicely but doesn't look very elegant, especially if the function has a long signature...
Is there a more comfortable way to disable the cache when certain parameters take certain values?