doist / bitmapist
Powerful analytics and cohort library using Redis bitmaps
Please add integration with https://travis-ci.org/
A co-worker of mine showed me this library. I think this is an excellent use of bitmaps! I think I see another interesting extension of this bitmapist library and I wanted to get your feedback.
I work for a company that needs a near-real-time recommendation engine, and I started thinking about applying bitmaps to collaborative filtering. Collaborative filtering needs a matrix of [users * products].
The idea would be to run SETBIT users:20160614:1 [product_id] 1 for each user, marking which products that user likes. You would also need SETBIT products:20160614:1 [user_id] 1, which gives each product an index back into its users.
Wait, I think I see a problem with this. How would we represent an empty space in the matrix? A co-occurrence matrix has 3 states (Like, Dislike, Unknown). I guess you could combine bitmaps in some way to cover this, or maybe my bit operations are rusty.
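One way the three states could be represented is with two bitmaps per user, one for likes and one for dislikes, where a product set in neither is unknown. The sketch below is plain Python using integers as bitsets; the class and method names are hypothetical and nothing here is part of bitmapist.

```python
class PreferenceRow:
    """One user's row of the matrix, kept as two bitsets over product ids."""

    def __init__(self):
        self.liked = 0      # bit p set -> user likes product p
        self.disliked = 0   # bit p set -> user dislikes product p

    def set_like(self, product_id):
        bit = 1 << product_id
        self.liked |= bit
        self.disliked &= ~bit  # a product cannot be in both sets

    def set_dislike(self, product_id):
        bit = 1 << product_id
        self.disliked |= bit
        self.liked &= ~bit

    def state(self, product_id):
        bit = 1 << product_id
        if self.liked & bit:
            return "like"
        if self.disliked & bit:
            return "dislike"
        return "unknown"  # set in neither bitmap


row = PreferenceRow()
row.set_like(3)
row.set_dislike(5)
print(row.state(3), row.state(5), row.state(7))
```

The cost is twice the storage of a single bitmap, in exchange for keeping every lookup a plain bit test.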
The main benefit would be the storage savings. A quick calculation based on the traffic a site like ours sees in a day gives 700 products * 300k users = 210 million bits, or about 26.25 MB.
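The storage estimate can be checked with a few lines of arithmetic. This assumes a dense matrix at one bit per cell and decimal megabytes; the function name is hypothetical.

```python
def bitmap_matrix_bytes(n_rows, n_cols):
    """Bytes needed for a dense n_rows x n_cols matrix at one bit per cell."""
    total_bits = n_rows * n_cols
    return (total_bits + 7) // 8  # round up to whole bytes


size = bitmap_matrix_bytes(700, 300_000)
print(size / 1_000_000, "MB")
```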
I understand that you are probably not interested in having this be part of this repo. I'd mainly like your feedback and advice. Thanks!
This happens when using bitmapist-server as the backend, and possibly with redis too.
def delete_all_events(system='default'):
    """
    Delete all events from the database.
    """
    cli = get_redis(system)
    keys = cli.keys('trackist_*')  # <- None
    if len(keys) > 0:
        cli.delete(*keys)

  File "../lib/python3.7/site-packages/bitmapist/__init__.py", line 272, in delete_all_events
    if len(keys) > 0:
TypeError: object of type 'NoneType' has no len()
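A possible guard, assuming the backend can return None from KEYS when nothing matches, is to treat None as an empty list. This is only a sketch: `delete_matching_keys` is a hypothetical helper, and `cli` stands for any client exposing `keys(pattern)` and `delete(*names)` as redis-py does.

```python
def delete_matching_keys(cli, pattern="trackist_*"):
    """Delete keys matching pattern; tolerate a KEYS reply of None."""
    keys = cli.keys(pattern) or []  # None or empty reply -> nothing to delete
    if keys:
        cli.delete(*keys)
    return len(keys)  # number of keys passed to DELETE
```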
Available at https://pypi.python.org/pypi/bitmapist#downloads, but isn't tagged in the GitHub repo.
@imankulov Just to confirm, tagging the current master with 3.100 should do, yes?
I'm trying to store user likes in Redis, using one bitmap per question_id to record which users liked it. But reading the unique events back is very slow.
In [1]: from bitmapist import mark_unique
In [2]: mark_unique("question_likes:1234", 567463)
In [3]: mark_unique("question_likes:1234", 5637363)
In [4]: mark_unique("question_likes:1234", 7363)
In [5]: mark_unique("question_likes:1234", 731263)
In [6]: mark_unique("question_likes:1234", 731263)
In [7]: from bitmapist import UniqueEvents
In [38]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:uids = UniqueEvents('question_likes:1234').get_uuids()
:t = time()
:for i in uids:
: print i
:print "elapsed time:", time() - t
:--
7363
567463
731263
5637363
elapsed time: 14.1893291473
I'm running the latest bitmapist on Redis server v=4.0.9 on linux mint.
Here is the DEBUG OBJECT output for the key:
127.0.0.1:6379> DEBUG OBJECT trackist_question_likes:1234_u
Value at:0x7f979b6a51e0 refcount:1 encoding:raw serializedlength:8036 lru:12900027 lru_seconds_idle:252
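The thread does not confirm the cause, but one plausible explanation is that the iteration tests every bit up to the largest id (5637363 here, so a bitmap of roughly 700 KB). Assuming that is the case, a sketch that fetches the raw value once and skips zero bytes would look like this. It is plain Python over a bytes value and is not bitmapist's own code; bit 0 is taken as the most significant bit of the first byte, which is how Redis SETBIT numbers offsets.

```python
def iter_set_bits(raw):
    """Yield the offsets of set bits in a raw bitmap value (bytes)."""
    for byte_index, byte in enumerate(raw):
        if not byte:
            continue  # skip empty bytes without testing their 8 bits
        base = byte_index * 8
        for bit in range(8):
            if byte & (0x80 >> bit):  # bit 0 is the most significant bit
                yield base + bit
```

With a redis-py client, `raw` would be the result of `GET` on the key, fetched in a single round trip.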
This is not really a code bug, but rather a potential misuse of the "BitOpNot" operator. "BitOpNot" does exactly what it is supposed to do, which is flip the bits. However, since we use bitmapist as a stats tool, we expect it to give us the negation of a population set, which "BitOpNot" does not provide.
Suppose you have a total of 100 users. You use bitmapist to mark active users with the event "active". Assume you have 5 active users and the bitmap data looks like this:
1011011
Now you want to count how many active users you have:
print len(MonthEvents('active', now.year, now.month))
# prints 5, the correct number.
Now suppose you want to know how many inactive users you have:
print len(BitOpNot(MonthEvents('active', now.year, now.month)))
# prints 2.
Here the negation of the "active user set" gives a size of 2 instead of 95.
The fundamental problem is that the variable-length bitmap only contains information about who is active; it contains no information about the population size.
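The numbers in this example can be reproduced with a small sketch. It uses a Python integer as a stand-in for the bitmap and assumes the population size is known from elsewhere and that every active id falls inside it; the function names are hypothetical.

```python
def naive_not_count(bitmap):
    """Count zero bits only up to the highest set bit, as a plain NOT would."""
    width = bitmap.bit_length()
    return width - bin(bitmap).count("1")


def count_inactive(bitmap, population_size):
    """Inactive users = known population size minus the active count."""
    return population_size - bin(bitmap).count("1")


active = 0b1011011
print(naive_not_count(active))       # 2
print(count_inactive(active, 100))   # 95
```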
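I use something like

def yield_values(self):
    cli = self.redis_client
    # Python 2: GET returns a str, or None if the key is missing
    s = cli.get(self.redis_key) or ''
    for byte_index, c in enumerate(s):
        # Zero-pad each byte to 8 binary digits, most significant bit first
        bits = bin(ord(c))[2:].zfill(8)
        for i, b in enumerate(bits):
            if b == '1':
                # Offset within the whole bitmap, not just within this byte
                yield byte_index * 8 + i

def get_values(self):
    return list(self.yield_values())

on yipit's class-based version.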
Is there currently a way to determine the total number of occurrences for a given event?
Would there be a way to achieve this within the scope of Weeks|Months|YearsEvents()?
Hello,
I noticed a little quirk with the NOT operator on empty bitmaps.
Say, for example, that a bitmap represents a byte of 0's (0000 0000).
When this byte is negated, it should give back (1111 1111), or a value of 255.
Instead, I am getting back another 0.
On non-empty bitmaps, the NOT operator seems to work, but only up to the highest flagged bit.
Currently, if some bitmaps are empty, I have to manually mark an event with a large dummy ID to make the NOT operator work.
Are there better ways of accomplishing this?
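One alternative to a dummy ID, sketched here under the assumption that a reference bitmap of all known users is maintained separately, is to compute "in the reference but not in the event" instead of a bare NOT. The code is plain Python over byte strings and the function names are hypothetical.

```python
def and_not(reference, events):
    """Bits set in reference but not in events.

    events is zero-padded to the reference length, so an empty event
    bitmap yields the whole reference set instead of nothing.
    """
    padded = events.ljust(len(reference), b"\x00")
    return bytes(r & ~e & 0xFF for r, e in zip(reference, padded))


def popcount(raw):
    """Number of set bits in a byte string."""
    return sum(bin(b).count("1") for b in raw)


all_users = b"\xff"                       # ids 0..7 are known users
print(popcount(and_not(all_users, b"")))  # empty event bitmap
```

The result length follows the reference bitmap, so it no longer depends on the highest flagged bit in the event bitmap.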
In your documentation it reads:
Using Redis bitmaps you can store events for millions of users in a very little amount of memory (megabytes). You should be careful about using huge ids (e.g. 2^32 or bigger) as this could require larger amounts of memory.
A potential solution is a hash table that keeps track of huge ids and maps them down to smaller indexes.
For example, a user with an id of 192329230202 could be mapped to a smaller index 1 in the bitmap. This would require an O(1) lookup before each SETBIT, so it shouldn't affect time performance, but it would require more space on disk.
Steps to implement:
1. GET "users_counter:20160614", which would respond with something like 2.
2. Add User(id:192329230202) to the user_index table and reassign it to User(internal_id:3). This would do something like SET "users_index:20160614:192329230202" 3
3. INCR users_counter:20160614
4. SETBIT "events:search:20160614" 3 1, which puts the bit at the User(3) index instead of at the end of the bitmap, which requires creating and storing empty bits.
This would most likely add complexity to how you query the data, and you would have to store a reference to look up each user. In my proposal I used a different lookup per day to reset the bitmap indexes every day, but this might be more trouble than just maintaining one large ongoing table.
Let me know your thoughts and if this something you are interested in promoting into an enhancement.
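The mapping step in the proposal can be sketched with an in-memory dictionary standing in for the lookup table. `IdIndex` is a hypothetical name, offsets start at 0 here where the proposal's counter starts higher, and a real version would keep the table in Redis so it survives restarts.

```python
class IdIndex:
    """Maps large external ids to small, dense bitmap offsets."""

    def __init__(self):
        self._index = {}  # external id -> offset

    def offset_for(self, external_id):
        # First sighting: hand out the next unused offset.
        if external_id not in self._index:
            self._index[external_id] = len(self._index)
        return self._index[external_id]


index = IdIndex()
print(index.offset_for(192329230202))  # first id gets offset 0
print(index.offset_for("user-abc"))    # any hashable id works
print(index.offset_for(192329230202))  # same id, same offset
```

The returned offset is what would be passed to SETBIT in place of the raw id.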
Is there a more efficient way to store extra data for scenarios like 'user x answered question y correctly|incorrectly in z seconds'?
I think implementations such as
mark_event('question:y:x:1/0', z)
would be neither efficient nor useful for queries.
Some examples on this would be very helpful.
How do I populate data into bitmapist so that I can play with your queries?
Do you have a utility to populate domain- or application-specific data, run a batch of the queries you listed, and gauge how fast they run?
That would be helpful.
Krishna
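No such utility is mentioned in this thread, but a minimal sketch of a synthetic data generator could look like the following. `generate_events` is a hypothetical helper that only produces (event name, user id) pairs; feeding them to bitmapist is left to the caller and needs a running Redis.

```python
import random


def generate_events(n_users, event_names, n_events, seed=0):
    """Return n_events random (event_name, user_id) pairs, reproducible by seed."""
    rng = random.Random(seed)
    return [
        (rng.choice(event_names), rng.randrange(1, n_users + 1))
        for _ in range(n_events)
    ]


events = generate_events(1000, ["active", "song:add", "song:play"], 5000)
print(len(events), events[0])
```

Each pair could then be passed to `mark_event(name, user_id)`, and the queries timed with `time.time()` around them.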
Now that Redis has HyperLogLog
http://antirez.com/news/75
http://redis.io/commands/pfadd
which seems particularly well suited to counting uniques, it would be awesome to have a switch in this library that used it. In my use case, I have thousands of bitmapist keys, and the memory usage ends up climbing quickly.
Running
from bitmapist import mark_event
from bitmapist import cohort as bitmapist_cohort
mark_event('active', 123)
mark_event('song:add', 123)
mark_event('song:play', 123)
html_form = bitmapist_cohort.render_html_form(
    action_url='/_Cohort',
    selections1=[('Are Active', 'active')],
    selections2=[('Played song', 'song:play')],
    time_group='days',
    select1='active',
    select2='song:play'
)
dates_data = bitmapist_cohort.get_dates_data('active', 'song:play', 'days', 'default')
html_data = bitmapist_cohort.render_html_data(dates_data, 'days')
I get:
  File "/usr/local/lib/python2.7/dist-packages/bitmapist/cohort/__init__.py", line 191, in get_dates_data
    for i in range(0, date_range):
UnboundLocalError: local variable 'date_range' referenced before assignment
ImportError: No module named mako.lookup
Your cohort package requires mako, yet it's not listed in install_requires.
>>> myid
10204510554222024
>>> mark_event('active', myid)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/bitmapist/__init__.py", line 173, in mark_event
    client.execute()
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2578, in execute
    return execute(conn, stack, raise_on_error)
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2492, in _execute_transaction
    self.raise_first_error(commands, response)
  File "/home/sardor/.virtualenvs/tarjimonlar/local/lib/python2.7/site-packages/redis/client.py", line 2526, in raise_first_error
    raise r
ResponseError: Command # 1 (SETBIT trackist_active_2015-3 10204510554222024 1) of pipeline caused error: bit offset is not an integer or out of range
>>>
Apparently the user ID is used as the bit offset, so marking an event sets the bit at that offset to 1. Does this mean user IDs can never be strings?
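For context, the Redis SETBIT documentation requires the offset to be a non-negative integer smaller than 2^32, which the id above exceeds. A small pre-check along these lines could catch that before the pipeline runs; the helper names are hypothetical and not part of bitmapist.

```python
MAX_SETBIT_OFFSET = 2**32 - 1  # SETBIT offsets must be below 2^32 (512 MB strings)


def check_offset(user_id):
    """Return user_id if it is usable as a SETBIT offset, else raise."""
    if isinstance(user_id, bool) or not isinstance(user_id, int):
        raise TypeError("bit offsets must be integers, got %r" % (user_id,))
    if not 0 <= user_id <= MAX_SETBIT_OFFSET:
        raise ValueError("offset %d is outside 0..2^32-1" % user_id)
    return user_id


def bytes_needed(offset):
    """Size in bytes of a bitmap whose highest set bit is at offset."""
    return offset // 8 + 1


print(bytes_needed(MAX_SETBIT_OFFSET))  # size of the largest possible bitmap
```

String ids, or integer ids above the limit, would have to be mapped to small integers first.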