
recursivehierarchicalclustering's People

Contributors

akhildevelops · xychang


recursivehierarchicalclustering's Issues

Inconsistency

Hi,

I read your paper and found it very interesting, so I am trying to implement it. I noticed an inconsistency between the paper and the code: in the paper, the event stream is processed as A(g1)D(g2)A(g2)C(g3), where A, C, D are events and g1, g2, g3 are the gaps between consecutive events, whereas in the code it seems (from the README and the way the sparse matrix is constructed) that the events are processed as A(na)D(nd)C(nc), where A, C, D are events and na, nc, nd are their respective numbers of occurrences in the user's history.

By analogy with NLP, the version in the paper sounds more like a sequence model, whereas the second sounds more like a bag-of-words model, and I would expect the two to give different clustering results.
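To make the difference concrete, here is a small sketch of how I read the two representations; the event names, gap buckets, and feature layout are only my own illustration, not taken from your code:

# Hypothetical example: one user's raw stream with inter-event gaps (in seconds).
stream = [("A", 0), ("D", 30), ("A", 5), ("C", 600)]  # (event, gap since previous event)

# Paper-style "sequence" view: keep the order and encode each gap as a bucket token.
def gap_bucket(gap):
    return "[short]" if gap < 60 else "[long]"

sequence_tokens = []
for i, (event, gap) in enumerate(stream):
    if i > 0:
        sequence_tokens.append(gap_bucket(gap))
    sequence_tokens.append(event)
print(sequence_tokens)   # ['A', '[short]', 'D', '[short]', 'A', '[long]', 'C']

# Code-style "bag" view: drop the order and keep per-event counts,
# i.e. A(2) C(1) D(1), which is what a count-based sparse matrix encodes.
from collections import Counter
bag = Counter(event for event, _ in stream)
print(bag)               # Counter({'A': 2, 'D': 1, 'C': 1})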

Could you please clarify which processing I should follow?

Thanks!

cluster with users, but no action pattern(s)

Hi @xychang,

thank you very much for the paper you wrote and the great software you provided. I had no problems getting the code running under Windows 7 and Python 3.6.9. When I looked at the results I got for the clickstream data from my web service, I noticed a cluster (cluster ID 43, see the screenshot provided) that contains 2 users but no action pattern(s). I double-checked my input data, but I was not able to find any (obvious) anomalies. Can you enlighten me on this issue? I have attached the data (clickstreams_Heilmittel_Verloren.txt) to this issue.

Kind regards
Michael

clickstreams_Heilmittel_Verloren.txt

setting threadNum and minPerSlice

Hello, thank you again for your wonderful project. I'm trying to run the clustering across 2 AWS EC2 servers. I modified the code, and it successfully uses both servers with a small input file. However, it does not work with a large input file (~700,000 users), so I wonder how I should set threadNum and minPerSlice to handle 2 servers and 700,000 users.

Is there any further explanation of how to set threadNum and minPerSlice so that all of the servers listed in the 'servers' variable in servers.json are used?
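For context, here is how I am currently reasoning about the two parameters; this is purely a guess from the parameter names, not from reading the code:

import math

# Assumption (from the names only): the distance work is split into slices of at
# least minPerSlice users, and each server runs up to threadNum slices in parallel.
num_users = 700_000
num_servers = 2
threadNum = 5          # hypothetical value
minPerSlice = 1_000    # hypothetical value

approx_slices = math.ceil(num_users / minPerSlice)   # 700 slices
min_slices_needed = num_servers * threadNum          # 10 slices to keep both servers busy
print(approx_slices, min_slices_needed, approx_slices >= min_slices_needed)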

Thank you for your amazing project again!

How can I change from analyzing the frequency of a pattern to analyzing it as in the paper?

Hi! I have read your article, and it's just great! As you noted earlier, the code here is intended for a more general, frequency-based analysis of patterns. You also wrote in the README that you can select patterns with timestamps up front and upload them to input.txt. I would like to modify the code to work as described in the paper, but since I'm a beginner I don't understand exactly where to make the edits. Could you please point me in the right direction?
Thanks!
Sincerely,
Egor

Time gap events


How do I format time-gap events in the input?
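For what it's worth, this is how I would pre-process raw timestamped logs into action-gap-action tokens before writing input.txt; the gap thresholds and the token spelling are my own guesses based on the paper's description, not the format the code actually expects:

# Hypothetical preprocessing sketch: turn a timestamped click log into
# action-gap-action tokens. Bucket boundaries and token spelling are assumptions.
def gap_to_bucket(seconds):
    if seconds < 60:
        return "gap1"      # under a minute
    elif seconds < 3600:
        return "gap2"      # under an hour
    return "gap3"          # an hour or more

def to_tokens(events):
    """events: list of (action, unix_timestamp), sorted by timestamp."""
    tokens = []
    for (a1, t1), (a2, t2) in zip(events, events[1:]):
        tokens.append("%s|%s|%s" % (a1, gap_to_bucket(t2 - t1), a2))
    return tokens

print(to_tokens([("A", 0), ("D", 30), ("C", 4000)]))
# ['A|gap1|D', 'D|gap2|C']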

This is an amazing project.
Thank you!

local host in windows system

Thank you for getting back to me on my previous issue about the Python version. I switched to Python 2 and was at least able to run it. The new issue I encountered was setting up server.config. I'm running it on my local computer, which is a Windows 10 system. I tried several things and was able to access http://localhost:8000/. However, after changing the setting to "server":["http://localhost:8000/"], I got the following error: IOError: [Errno 22] invalid mode ('w') or filename: u'out/tmp_1571936168_rootsid_http://localhost:8000/.pkl'.
How can I fix it? Looking forward to your feedback.
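In case it helps, I suspect the crash happens because the server URL is embedded verbatim in the temporary .pkl filename, and ':' and '/' are not legal in Windows filenames. A possible local workaround (just my own sketch, not an official fix) would be to sanitize the server string wherever the temporary file path is built, for example:

import re

def safe_server_name(server_url):
    # Replace characters that are illegal in Windows filenames (e.g. ':' and '/')
    # so the temporary .pkl path stays valid.
    return re.sub(r'[\\/:*?"<>|]', '_', server_url)

# e.g. 'http://localhost:8000/' -> 'http___localhost_8000_'
tmp_path = 'out/tmp_1571936168_rootsid_%s.pkl' % safe_server_name('http://localhost:8000/')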

vis.json not generating

The vis.json file is not populated with data when running python visulization.py output/result.json input.txt vis/vis.json. This is because the sweetspot value is less than 0.01. What is the meaning of this filter?

Failed to reproduce the results with sample data 'input.txt' provided

Hi xychang,

Thank you for sharing your great work! I would greatly appreciate if you could help me resolve the below issue.

I first tried the CLI interface and was able to generate 'results.json' and 'vis.json'. However, I couldn't get http://localhost:8000/multi_color.html?json=vis.json to load, so I decided to give the Python interface a try.

I am using the code and parameter configuration below to reproduce 'results.json' and 'vis.json'.

import recursiveHierarchicalClustering as rhc
import recursiveHierarchicalClusteringFast as rhcFast
data = rhc.getSidNgramMap(inputPath)
treeData = rhcFast.run(inputPath, data, outPath)

environment: Jupyter Notebook

inputPath: I added your input.txt file to one directory and set inputPath = '/home/chenruihao/test_clustering/input.txt'

outPath: I didn't find a description of outPath, but I did find outputPath, which is "The directory to place all temporary files as well as the final result." I assume outPath and outputPath both refer to the directory where output files are stored, so I set outPath = '/home/chenruihao/test_clustering/output/'

I get the error below when I run the above code:

/home/chenruihao/test_clustering/recursiveHierarchicalClustering.py:247: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  result = np.linalg.lstsq(A, y)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-57-3760f95de317> in <module>
----> 1 treeData = rhcFast.run(inputPath, data, outPath)

~/test_clustering/recursiveHierarchicalClusteringFast.py in run(ngramPath, sid_seq, outPath)
    416 
    417     hc = HCClustering(
--> 418         matrix, sid_seq, outPath, [], idxToSid,
    419         sizeThreshold=0.05 * len(sid_seq), idfMap=idfMap)
    420     result = hc.runDiana()

~/test_clustering/recursiveHierarchicalClusteringFast.py in runDiana(self)
    337                     matrix = calculateDistance.partialMatrix(
    338                         sids,
--> 339                         rhc.excludeFeatures(rhc.getIdf(self.sid_seq, sids),
    340                                             newExclusions),
    341                         ngramPath,

NameError: name 'ngramPath' is not defined

Q1: How may I fix this error?
Fix attempt: 'ngramPath' is referenced in 'recursiveHierarchicalClusteringFast.py', so I hard-coded it as follows:

  1. The ngramPath argument of the run function looks like it is the same as sys.argv[1], and by definition ngramPath is the path to the computed pattern dataset, so I hard-coded ngramPath = '/home/chenruihao/test_clustering/input.txt', the same as inputPath, but I still got the above error... Would love to hear your thoughts. (A possible workaround is sketched after this item.)
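A workaround I am considering (purely a guess at the module's globals, not a confirmed fix): since the NameError comes from a global lookup of ngramPath inside recursiveHierarchicalClusteringFast.py, setting it as a module-level attribute before calling run() might get past the error:

import recursiveHierarchicalClustering as rhc
import recursiveHierarchicalClusteringFast as rhcFast

inputPath = '/home/chenruihao/test_clustering/input.txt'
outPath = '/home/chenruihao/test_clustering/output/'

# Workaround sketch: inject ngramPath into the module's global namespace so the
# lookup inside runDiana() can resolve it. This is a guess, not a vetted fix.
rhcFast.ngramPath = inputPath

data = rhc.getSidNgramMap(inputPath)
treeData = rhcFast.run(inputPath, data, outPath)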

Q2: I also want to understand which user_ids were clustered, their cluster membership, and their corresponding action-gap-action patterns, similar to the issue discussed in another thread. Would it be possible to answer my question (as well as the question in that thread) just from the result.json file, rather than by modifying the code?

My understanding is that in result.json, for each level of the cluster tree,

  • key = 1 stores the user_ids that were clustered at that level;
  • key = 2 (exclusions) stores the action-gap-action/token members of the cluster.

Thanks!
Anthony

Python version update

Thanks for sharing this very interesting project. I'd love to try it for my analysis.
I'm on Python 3.7.3 and am having trouble running the example code with input.txt.
The error message is:

  File "recursiveHierarchicalClustering.py", line 73
    def splitCluster((baseCluster, diameter, baseSum, cid), matrix):
                     ^
SyntaxError: invalid syntax

My guess is that this is a Python version issue. I'm new to Python though... Looking forward to any help here.
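For reference, tuple parameters in function signatures (the (baseCluster, diameter, baseSum, cid) in the error) are Python 2 only; they were removed in Python 3 by PEP 3113. A Python 3-compatible rewrite of that one signature would look roughly like the sketch below, though the rest of the file would need the same treatment, so this is only a sketch:

# Python 2 (as in the repo):
#   def splitCluster((baseCluster, diameter, baseSum, cid), matrix):
# Python 3: accept the tuple as a single argument and unpack it inside.
def splitCluster(clusterInfo, matrix):
    baseCluster, diameter, baseSum, cid = clusterInfo
    ...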
Thanks!

Input variables

In the input, why does it take the frequency of actions? Doesn't the 2016 paper use time gaps between the actions?

Utilizing multiple servers

Hi,

I'm running your project with 10,000 users on 2 servers using shared storage.
I've set threadNum to 5 and minPerSlice to 1000.

However, when I look at the servers' status with the 'htop' command, I find that the project doesn't utilize both servers at the same time, but switches usage back and forth between them.

Is this the behaviour you intended, or did I set the parameters (threadNum and minPerSlice) in the wrong way?

In addition, I would like to know if there are any guidelines on how to set threadNum and minPerSlice to effectively utilize many servers.

Thanks for your help in advance!

Using multiple servers

Hello, I am very impressed by your wonderful research!
Thank you for sharing it!

However, I ran into a problem when using multiple servers.
I assumed that the temporary files (.pkl files) would also be created on the remote servers. Even though no temporary files are created on the remote servers, the code still seems to look for them there, which causes errors.

Again, thank you for your work.

Cluster quality results?

Hi,

Massive thanks for this great tool; it works absolutely fine with my data. In the paper, you mentioned that you experimented with different values of k to create the k-gram sequences. What metric would you recommend for evaluating these clusters?

For example, if I experiment with k in [1, 2, 3, 4, 5], I would have 5 sets of results (assuming I don't include time gaps at this stage, as that would double the number of results). How would I decide which clustering is best? Is it simply the modularity score? If so, each cluster has a modularity value, but is there a way to amalgamate that over an entire set of results?
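In case it clarifies what I mean by amalgamating: one option I am considering is a size-weighted average of the per-cluster modularity scores, where the weighting choice is entirely my own assumption:

# Hypothetical aggregation: weight each cluster's modularity by its size, then
# compare the resulting single score across the k = 1..5 runs.
def weighted_modularity(clusters):
    """clusters: list of (num_users_in_cluster, modularity_of_cluster)."""
    total = sum(size for size, _ in clusters)
    return sum(size * q for size, q in clusters) / total

runs = {1: [(120, 0.31), (80, 0.27)],   # made-up numbers purely for illustration
        2: [(150, 0.42), (50, 0.18)]}
best_k = max(runs, key=lambda k: weighted_modularity(runs[k]))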

Sharing the visualization

Hi @xychang
Is there any way I could easily save and share my visualization with coworkers, and allow them to click on clusters to see the details about the features, without them having to run the command?

Thank you

Just to verify my test result

Hi,

I was finally able to run it through with the sample data.
I just wanted to verify that my result looks as expected (visualization attached). Very cool visualization, by the way!

Unable to run on sample data provided

Hi @xychang,
Thank you for providing this work. I wanted to test this approach for a project I am working on, but first I tried to simply run the code with the example input provided.

python recursiveHierarchicalClustering.py input.txt output/ 0.05

However, I got a ZeroDivisionError and was not able to create the result.json output for further examination.

Here's what my output looks like:

[LOG]: total users 200
[LOG]: starting in localhost for tmp_1570654341_root
[LOG]: 2019-10-09 16:52:21.887000 computing matrix for output/tmp_1570654341_root
[LOG]: start new thread 1
[LOG]: 2019-10-09 16:52:25.918000 matrix computation finished for output/tmp_1570654341_root
computing matrix taking 4.037000s
matrix size 1
Traceback (most recent call last):
  File "recursiveHierarchicalClustering.py", line 975, in <module>
    treeData = runDiana(outPath, sid_seq, matrixPath=matrixPath)
  File "recursiveHierarchicalClustering.py", line 779, in runDiana
    baseMatrix, matrixBasics)
  File "recursiveHierarchicalClustering.py", line 529, in evaluateModularity
    return (firstEntry - secondEntry / m) / m
ZeroDivisionError: float division by zero

System info:
Python: 2.7.14
numpy: 1.14.0

Can you please help me resolve this issue?
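For reference, the division that fails is the last line of evaluateModularity shown in the traceback, so m is apparently 0 for this input. A defensive guard like the sketch below would avoid the crash, though it is my own workaround, not necessarily the intended fix, and it may only mask whatever makes m zero in the first place (perhaps related to the "matrix size 1" line in the log):

# Hypothetical guard around the expression from recursiveHierarchicalClustering.py:529.
# If m (the normalisation term) is 0, return 0.0 instead of dividing by zero.
def safe_modularity(firstEntry, secondEntry, m):
    if m == 0:
        return 0.0
    return (firstEntry - secondEntry / m) / m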

Verifying results

Hey,
I was using your awesome clickstream clustering engine when I noticed something interesting.

Here is what I did:
I am trying to verify the results of the algorithm, so I perform the following check:

  1. After running the algorithm, open the result.json file.
  2. For all leaf nodes in result.json, find the list of exclusions, for example:
     ["t", [["l", [48, 167, 201, 283, 434, 468, 672, 883, 916, 970, 1015, 1271],
     {"exclusionsScore": [1285.0, 336.0208333333333, 0.0, 0.0], "exclusions": ["S2319", "S674", "S3690", "S3361"]}],
  3. To verify the results, I look up each user in this cluster (for instance user ID 48) in their respective input file (the input file contains the actual log of actions performed by users, which is used as input to the algorithm) to verify that they actually performed at least one of the ["S2319", "S674", "S3690", "S3361"] sequences. (A minimal lookup sketch follows this list.)
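For completeness, this is the lookup sketch I use. It assumes, based only on the excerpt above, that result.json is a nested ["t", [...]] / ["l", [user_ids], {...}] structure, so the node tags, field names, and file path are guesses at the format:

import json

def leaf_clusters(node):
    """Yield (user_ids, exclusions) for every leaf ("l") node in result.json.
    Assumes internal nodes look like ["t", [child, ...]] and leaves like
    ["l", [user_ids...], {"exclusions": [...], ...}], as in the excerpt above."""
    tag = node[0]
    if tag == "l":
        yield node[1], node[2].get("exclusions", [])
    elif tag == "t":
        for child in node[1]:
            yield from leaf_clusters(child)

with open("output/result.json") as f:
    tree = json.load(f)
for user_ids, exclusions in leaf_clusters(tree):
    print(len(user_ids), exclusions)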

Here are the results:
When I do this verification, I find that about 20% of users do not have any of their cluster's sequences in the input file, meaning they did not perform any of the sequences of actions of the cluster they belong to.

Here is what I expected:
Does this result make sense? Shouldn't users perform at least one sequence that appears in the cluster they belong to?
Thank you very much
