This repository hosts exploratory code for the Wikipedia Revision Dataset, as part of the Cross Edit Pattern Detection Project.
It is forked from a Google corporate repository, and changes will be pushed regularly.
Author: Haoran Fei ([email protected])
Host: Zainan Zhou ([email protected])
Date: June 8th, 2020
Python3: GPL-Compatible License. GPL-compatible doesn’t mean that we’re distributing Python under the GPL. All Python licenses, unlike the GPL, let you distribute a modified version without making your changes open source.
Pandas: New BSD License.
Matplotlib: License based on PSF license.
Load the first JSON data file and run the article-based analysis:
$ python3 article_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the article-based analysis:
$ python3 article_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
Load the first JSON data file and run the author-based analysis:
$ python3 author_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Load all 37 JSON data files and run the author-based analysis:
$ python3 author_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
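The commands above suggest that `##` in `--path` stands for a two-digit segment index and that `--start`/`--stop` select a half-open range of segments (`--stop 1` loads one file, `--stop 37` loads all 37). The sketch below shows one way such a pattern could be expanded. It is an illustration under that assumption, not the scripts' actual code; `expand_segment_paths` is a hypothetical helper name.

```python
def expand_segment_paths(pattern, start, stop, placeholder="##"):
    """Return one path per segment index in [start, stop).

    Assumption: the placeholder is replaced by the zero-padded index,
    padded to the placeholder's own width (two digits for "##").
    """
    if placeholder not in pattern:
        raise ValueError(f"pattern has no {placeholder!r} placeholder: {pattern}")
    width = len(placeholder)
    return [
        pattern.replace(placeholder, str(index).zfill(width))
        for index in range(start, stop)
    ]


paths = expand_segment_paths("./data/1023/segment-000##-of-00037.json", 0, 37)
print(len(paths))
print(paths[0])
print(paths[-1])
```

Under this reading, `--start 0 --stop 37` covers segments `00000` through `00036`.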
A window will be flagged as an anomaly if it satisfies the following condition:
M: the metric considered. Mean and median are currently supported.
W: the window frame under consideration.
S: the complete dataset of the given key. This can be all edits on the same article/by the same author, depending on the key used.
k: either 1 or -1. It is 1 if we are concerned with abnormally high values only, and -1 if we are concerned with abnormally low values only.
t: a percentage threshold for flagging anomalies. Currently set at 50%.
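The condition itself is not reproduced in this text. One reading consistent with the definitions above is that a window is flagged when k * (M(W) - M(S)) / |M(S)| exceeds t. The sketch below implements that reading. The exact form of the inequality, the strict comparison, and the zero-baseline handling are assumptions, and `is_anomalous` is a hypothetical name rather than a function from the analysis scripts.

```python
from statistics import mean, median

# Metrics the README lists as currently supported.
METRICS = {"mean": mean, "median": median}


def is_anomalous(window, dataset, metric="mean", k=1, t=0.5):
    """Flag a window whose metric deviates from the baseline by more than t.

    window  -- values in the window frame W
    dataset -- all values for the key, S
    k       -- 1 to catch abnormally high values, -1 for abnormally low ones
    t       -- threshold as a fraction (0.5 means 50%)
    """
    m = METRICS[metric]
    baseline = m(dataset)
    if baseline == 0:
        # Assumption: the relative difference is undefined here, so do not flag.
        return False
    relative_diff = (m(window) - baseline) / abs(baseline)
    return k * relative_diff > t


edits = [1, 2, 3, 10, 12]
print(is_anomalous([10, 12], edits, metric="mean", k=1))
print(is_anomalous([1, 2], edits, metric="mean", k=-1))
```

With k = 1 only upward deviations can be flagged, and with k = -1 only downward ones, which matches the description of k above.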
All log files are located in the cross-edits-analysis/log directory. Each subdirectory holds the logs for the corresponding analysis script.
Format of log line: Anomaly of (metric name) of (column name) detected for (key: this can be article/author or article/author pair) during period from (starting time of window) to (ending time of window), with a (value) percent difference from baseline.
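As an illustration of the log format, the sketch below builds one line from its fields. `format_anomaly_log` is a hypothetical helper, and the column name, key, timestamps, and one-decimal rounding in the example are made-up values, not taken from real logs.

```python
def format_anomaly_log(metric, column, key, window_start, window_end, percent_diff):
    """Build a log line following the format described above."""
    return (
        f"Anomaly of {metric} of {column} detected for {key} "
        f"during period from {window_start} to {window_end}, "
        f"with a {percent_diff:.1f} percent difference from baseline."
    )


# Hypothetical field values, for illustration only.
line = format_anomaly_log(
    metric="mean",
    column="example_column",
    key="example_article",
    window_start="2020-06-01 00:00",
    window_end="2020-06-02 00:00",
    percent_diff=96.4,
)
print(line)
```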