
WikiChron

WikiChron is a web tool for the analysis and visualization of the evolution of wiki online communities.

It analyzes the edit history dump of a wiki and gives you nice graphs plotting that data.

Development

Install

Dependencies

Install instructions

Tkinter

If you are using Linux (and probably Mac OS X too), you will likely need to install the tkinter utility manually. You can check whether tkinter is already installed on your system by following these instructions.

To install it on Ubuntu or derivatives, you can use this command:

sudo apt-get install python3-tk

If you are using pyenv, you should follow the instructions in this Stack Overflow answer.

igraph

The python-igraph dependency needs to compile some C code, so before installing it you need the development libraries for Python, libxml2 and zlib, as well as a C compiler toolchain.

For Ubuntu 18.04 (bionic) or newer versions you can directly install igraph for python3 with:

sudo apt-get install python3-igraph

For older versions of Ubuntu (16.04 and derivatives), you need to install the following dependencies before you can compile and install igraph with pip:

sudo apt-get install build-essential python3-dev libxml2 libxml2-dev zlib1g-dev

Other deps

After that, simply run: pip3 install -r requirements.txt. pip will install all the remaining dependencies you need.

On Linux versions where igraph for python3 is not available from your package manager, use the requirements file that includes the igraph package so that it gets built and installed with pip: pip3 install -r requirements+igraph.txt

Using a virtual environment

A good practice is to use a virtual environment to isolate the development environment from your personal setup. This avoids issues with mismatched Python versions, pip packages installed in the wrong place, commands requiring sudo privileges, and so on.

To do this, first install virtualenv, either from your package manager or from pip.

Then, create a new virtual environment in the repo root folder with: virtualenv -p python3 venv/

Activate the virtual environment: source venv/bin/activate

And finally, install dependencies here: pip install -r requirements.txt

Input wiki data

The source data for WikiChron will likely come from an XML file with the full edit history of the wikis you want to analyze. Go here if you want to learn more about Wikimedia XML dumps.

To get such an XML dump, you can follow the instructions explained in WikiChron's wiki.

Process the dump

Next, you'll need to convert that raw XML data into a processed CSV that WikiChron can load and work with.

For that transformation, you should use our wiki-dump-parser script. You can find a short guide on how to use it in this page of WikiChron's wiki.

Provide some metadata of the wiki

WikiChron needs one last thing before it can serve the visualization of a wiki for you.

You need a wikis.json file in your data_dir/ directory with some metadata of the wikis you want to explore, like the number of pages, the number of users, the user ids of the bots, etc.
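As an illustration only (the exact field names WikiChron expects may differ; the ones below are assumptions), a minimal Python sketch that loads and sanity-checks such a metadata file could look like this:

```python
import json

# Hypothetical wikis.json content; the exact field names WikiChron
# expects may differ -- these are illustrative assumptions.
SAMPLE_WIKIS_JSON = """
[
  {
    "name": "Good Luck Charlie Wiki",
    "data": "good_luck_charlie.csv",
    "pages": 1204,
    "users": 356,
    "botsids": [12, 407]
  }
]
"""

def load_wiki_metadata(raw_json):
    """Parse the wiki metadata and check that the assumed keys are present."""
    wikis = json.loads(raw_json)
    for wiki in wikis:
        for key in ("name", "data", "pages", "users", "botsids"):
            if key not in wiki:
                raise ValueError(f"wiki entry missing '{key}': {wiki}")
    return wikis
```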

You can find some helpful instructions on how to edit this file, or generate it automatically with a script, in this page of WikiChron's wiki.

Run the application

Use: python3 -m wikichron or python3 wikichron/app.py

The webapp will be locally available under http://127.0.0.1:8880/app/

Optionally, you can specify a directory with the csv data of the wikis you want to analyze with the environment variable: WIKICHRON_DATA_DIR.

For instance, suppose that your data is stored in /var/tmp; you can launch wikichron using that directory with:

WIKICHRON_DATA_DIR='/var/tmp' python3 wikichron/app.py

It will show all the files ending in .csv as wikis available to analyze and plot.
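As a sketch of that discovery step (not WikiChron's actual implementation), listing the available wikis amounts to scanning the data directory for .csv files:

```python
import os

def available_wikis(data_dir):
    """Return the wikis available to analyze: every *.csv file in data_dir
    (a sketch of how WikiChron scans WIKICHRON_DATA_DIR, not its actual code)."""
    return sorted(f for f in os.listdir(data_dir) if f.endswith('.csv'))
```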

Development environment

To get error messages, backtraces and automatic reloading when the source code changes, you must set the environment variable FLASK_ENV to 'development', i.e. run export FLASK_ENV=development before launching app.py.

There is a simple but handy script called run_develop.sh which sets up the development environment and launches the app locally.

You can get more information on this in the Flask documentation.

Deployment

The easiest way is to use Docker.

Otherwise, follow the Dash instructions: https://plot.ly/dash/deployment and inspect the deploy.sh script, which launches the app with the latest code in master and provides the appropriate arguments. Check it out and modify it to suit your needs.

gunicorn config

For the deployment you need to set some configuration in a file called gunicorn_config.py.

You can start by copying the sample config file located in this repo and then edit the config parameters to suit your specific needs:

cp sample_gunicorn_config.py gunicorn_config.py

The documentation about the gunicorn settings allowed in this file can be found in the official gunicorn documentation.

The environment variable WIKICHRON_DATA_DIR is passed through directly to WikiChron and sets the directory where WikiChron will look for the wiki data files, as explained previously in the Run the application section.

Setup cache

If you want to run WikiChron in production, you should set up a Redis server and add the corresponding parameters to the cache.py file.

Look at the FlaskCaching documentation for more information about caching.

Flask deployment config

This webapp uses some configurable parameters related to the underlying Flask instance, such as the hostname, port and IP address for the cache. They need to be set in a file called "production_config.cfg", which should be located inside the directory called "wikichron". Example values for those parameters are in the file called "sample_production_config.cfg", so simply copy that file and edit it accordingly:

cp wikichron/sample_production_config.cfg wikichron/production_config.cfg

Docker

You can use Docker to deploy WikiChron. The sample configurations are already set up to get Docker Compose working smoothly.

Just copy the sample config files to the appropriate names (as explained above) and run:

docker-compose up

This will start up the wikichron webserver as well as a redis instance to use for the cache.

If, for any reason, you wanted to start a standalone docker instance for wikichron (without redis), you could use a command similar to this one:

docker run --mount type=bind,source=/var/data/wiki_dumps/csv,target=/var/data/wiki_dumps/csv -p 8080:8080 -it wikichron:latest

Third-party licenses

Font Awesome

We are using icons from the font-awesome repository. These icons are subject to the Creative Commons Attribution 4.0 International license. You can find the terms of their license here. In particular, we are using the following icons: share-alt-solid, info-circle

Modifications in font awesome icons

  • The file: share.svg is a modification of the share-alt-solid.svg file provided by fontawesome.

Publications

WikiChron is used for science and, accordingly, we have presented the tool at some scientific conferences. Please cite us if you use the tool for your research work:

  • Abel Serrano, Javier Arroyo, and Samer Hassan. 2018. Webtool for the Analysis and Visualization of the Evolution of Wiki Online Communities. In Proceedings of the European Conference on Information Systems 2018 (ECIS '18). 10 pages.
  • Abel Serrano, Javier Arroyo, and Samer Hassan. 2018. Participation Inequality in Wikis: A Temporal Analysis Using WikiChron. In Proceedings of the 14th International Symposium on Open Collaboration (OpenSym '18). ACM, New York, NY, USA, Article 12, 7 pages. DOI: https://doi.org/10.1145/3233391.3233536.
  • Youssef El Faqir, Javier Arroyo, and Abel Serrano. 2019. Visualization of the evolution of collaboration and communication networks in wikis. In Proceedings of the 15th International Symposium on Open Collaboration (OpenSym '19). ACM, New York, NY, USA, Article 11, 10 pages. DOI: https://doi.org/10.1145/3306446.3340834


wikichron's Issues

More metrics on different levels of active editors

We are thinking that more selective metrics regarding active editors might be added, especially considering larger wikis.

So, following WikiMedia's Wikistats metrics (enter here for an example), we could have active editors who have made {>= 1 edit, >= 5 edits, >= 25 edits, >= 100 edits}. Right now we only have "active users", which we consider to be users who have made >= 1 edit (as explained in our wiki).

The next level would be to have "very active users": editors who have made >= 100 edits. But we should consider whether this is meaningful for the vast majority of the wikis that we are studying.
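A sketch of how these activity levels could be computed, assuming we already have the number of edits per user for a given month (the data layout here is an assumption, not WikiChron's actual one):

```python
def active_editors_by_level(edits_per_user, thresholds=(1, 5, 25, 100)):
    """Count how many editors reach each Wikistats-style activity level.

    edits_per_user maps user id -> number of edits in the month under study
    (an assumed data layout, for illustration only)."""
    counts = {threshold: 0 for threshold in thresholds}
    for edits in edits_per_user.values():
        for threshold in thresholds:
            if edits >= threshold:
                counts[threshold] += 1
    return counts
```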

The user should be able to sort the metrics plots

A way to do it would be to let them sort the metric names by "drag and drop" in the container where the selected metrics appear.

Perhaps, to make it intuitive for the users, the box that contains the metric name should also have a correlative number starting from 1.

Wild guess. Ask the interface designer.

Add info about metrics computation, assumptions and decisions taken

The computation of the metrics assumes several aspects that should be clarified somewhere in WikiChron, probably in an info page.

A (probably) non-exhaustive list of aspects and decisions taken that should be clarified/explained are:

  • Definition of page, article, user, active user, anonymous user, contribution (are there any exceptions to the definitions? how are they computed?)
  • How are the edits (and other metrics) of anonymous users computed (grouped by IP address?)
  • What happens with bot edits (are they removed from the metrics)? How are bots identified?
  • How are the monthly counts computed, especially in the case of "relative" dates? By natural months or by "relative" months?
  • Mathematical formula of each and every metric computed

Remove bot activity from the metrics

Since the tool is interested in analyzing/visualizing human activity in wikis, metrics must not consider bots or bot activity.

Hence, bots should be removed from user counts (active users, new users, new registered users, etc.), from activity counts (global edits, article edits, edits in user talk, edits in article talk, pages edited monthly...) and from ratios (e.g. in edits per user, bot activity should not be taken into account in the numerator and bots should not be counted in the denominator).
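A minimal sketch of the filtering step, assuming edits are rows with a contributor id and that the bot ids come from the wikis.json metadata (both assumptions for illustration):

```python
def remove_bot_activity(edits, bot_ids):
    """Drop every revision made by a bot before computing any metric.

    `edits` is assumed to be a list of dicts with a 'contributor_id' field,
    a simplified stand-in for the rows of WikiChron's processed csv."""
    bot_ids = set(bot_ids)
    return [edit for edit in edits if edit['contributor_id'] not in bot_ids]
```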

Find out why the app is restarted every first launch in debug mode

For some reason, the code contained in main:

# create and config Dash instance
app = create_app()

# set layout, import startup js and bind callbacks
set_up_app(app)

# start Dash instance
start_app(app)

is executed twice every time the app is launched.
I'm not sure whether this is an expected behaviour, nor whether it's because of Dash or because of our code. This is what we need to find out.

Be able to share current visualization with others

We want a way to share the selected params with other people, or to save the current selection for further analysis. We are currently not interested in user management for saving and sharing sessions, so this should be implemented through a unique URL.
Using URL routing along with query string would be a good option to implement this.
See https://plot.ly/dash/urls for documentation on how Dash supports URL routing.
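A possible sketch of that approach using the standard urllib.parse module; the parameter names 'wikis' and 'metrics' are made up for illustration:

```python
from urllib.parse import urlencode, parse_qs, urlparse

def share_url(base, wikis, metrics):
    """Encode the current selection into a query string so that the
    visualization can be shared as a plain URL.
    The parameter names 'wikis' and 'metrics' are illustrative assumptions."""
    query = urlencode({'wikis': ','.join(wikis), 'metrics': ','.join(metrics)})
    return f'{base}?{query}'

def parse_share_url(url):
    """Recover the selection back from a shared URL."""
    params = parse_qs(urlparse(url).query)
    return params['wikis'][0].split(','), params['metrics'][0].split(',')
```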

Move from gdc.Tab to dcc.Tab?

Dash core components have recently included the Tab component. Explore whether it suits our needs and, if so, move to it.

Allow users to download a csv with the data selected

The csv will have the metrics for all the selected wikis, and for each wiki it will include an absolute and a relative timestamp.

In this way, users can analyze the data further if they want to, for example, apply some time series analysis methods.

Categorize wikis

Categories for wikis by pages number:

  • Large: wiki > 100 000 pages
  • Big: 100 000 > wiki > 10 000
  • Medium: 10 000 > wiki > 1000
  • Small: 1000 > wiki
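These bins translate directly into code; a sketch:

```python
def categorize_wiki(pages):
    """Classify a wiki by its number of pages, following the bins above."""
    if pages > 100_000:
        return 'Large'
    elif pages > 10_000:
        return 'Big'
    elif pages > 1_000:
        return 'Medium'
    else:
        return 'Small'
```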

[New metric] Percentage of anonymous edits

The percentage is computed as 100 * (edits_by_anonymous_users / edits).

It would be interesting as a monthly metric (considering only the edits of each month) and as a cumulative version (considering the edits from the creation of the wiki until the end of each month).
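A sketch of both variants, assuming month-aligned lists of anonymous and total edit counts:

```python
from itertools import accumulate

def anonymous_edit_percentage(monthly_anon, monthly_total):
    """Return (monthly, cumulative) percentages of anonymous edits.
    Both inputs are month-aligned lists of edit counts."""
    monthly = [100 * a / t if t else 0
               for a, t in zip(monthly_anon, monthly_total)]
    cumulative = [100 * a / t if t else 0
                  for a, t in zip(accumulate(monthly_anon),
                                  accumulate(monthly_total))]
    return monthly, cumulative
```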

The slider behaviour should be refined

There are some issues that could be refined. Namely:

  • The endpoint of the slider should have a tag.
  • The tags in absolute dates should not be every 5 elements, but every 6 or every 12 elements (this way, they work on a yearly basis). Furthermore, to make them correspond with the vertical lines in the plot, they should mark the beginning of the year (January).
  • Consecutive date tags may collapse in some cases. Consider rotating the tags, if possible.
  • It would be nice to see the date at the current position while you are sliding the handle.

[New metric] 10:90 ratio

To measure how the work is distributed between "core" contributors and "occasional" contributors.

Add info of each wiki in the list of wikis for selection

First, change the name of the wiki to an operative name. Is there a unique id name for each wiki? If not, don't panic: the rest of the information shown should disambiguate the wiki.
Information that could be shown is:

  • Year-month of creation
  • Number of users (as of the dump date)
  • Number of articles (as of the dump date)
  • Year-month of the dump
  • Link to the wiki home (it could be added as a hyperlink in the wiki name)
  • Logo (it could also link to the wiki home).

When the non-oldest wikis have a lifespan that is not a subset of the oldest wiki's, the graphs do not show all the data

Suppose we have a wiki A whose dump goes from May 2010 to Jun 2017, and another wiki B whose dump goes from Jan 2018 to Sep 2018. In this case, the slider will only show and select dates for the oldest wiki (A) and won't allow selecting dates for the youngest wiki (B), with the result that wiki B is not shown in the graphs.

This can happen with any combination of wikis where the oldest wiki lifespan doesn't contain all the other's.

A solution would be to take the union of the lifespans of all the wikis and use the resulting lifespan for the slider.
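A sketch of that union, representing each lifespan as a pair of (year, month) tuples:

```python
def combined_lifespan(lifespans):
    """Union of the (first_month, last_month) ranges of all selected wikis,
    so the slider covers every wiki's data. Months are (year, month) tuples,
    which compare correctly in chronological order."""
    return (min(start for start, _ in lifespans),
            max(end for _, end in lifespans))
```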

[New metric] Ratios between top users

Two new metrics to measure the degree of inequality of workload between top users:

  • Ratio: 1% user / 5% user
  • Ratio: Top contributor / 5% user

Being:

  • 1% user -> user who occupies the 1% position of top contributors
  • 5% user -> user who occupies the 5% position of top contributors
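One possible interpretation of these positional ratios in code (exactly how the "1% position" is rounded to an index is an assumption here):

```python
from math import ceil

def top_position_ratio(contributions, pct_a, pct_b):
    """Ratio between the contributions of the user at the pct_a% position and
    the user at the pct_b% position of the ranking of top contributors
    (e.g. pct_a=1, pct_b=5 gives the '1% user / 5% user' ratio).
    Rounding the percentage to a rank via ceil() is an assumption."""
    ranked = sorted(contributions, reverse=True)
    pos_a = max(ceil(len(ranked) * pct_a / 100) - 1, 0)
    pos_b = max(ceil(len(ranked) * pct_b / 100) - 1, 0)
    return ranked[pos_a] / ranked[pos_b]
```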

New metrics for core users?

First, some concerns about some "distribution of work" metrics.

The metrics that use a ratio of percentile positions of top contributors are not only hard to explain, but also a bit meaningless, since they literally "personalize" the ratio. By personalize, I mean that you are comparing the contributions of two individual (top) contributors. Is that really relevant?

Perhaps it is much more relevant (and, btw, easier to explain) to proceed in a similar fashion to the 10:90 ratio and compare, using a ratio, the contributions of, for example, the top 1% of contributors against the contributors from 1% to 5%. Similarly, we can compare the top 5% against the contributors from 5% to 10%.

Alternatively, we can plot the percentage of contributors that accounts for several percentages of the contributions:

  • 99% of contributions
  • 95% of contributions
  • 90% of contributions
  • 80% of contributions

These metrics are used in Figure 5 of "Temporal Analysis on Contribution Inequality in OpenStreetMap" by Yang et al. (2015). We will show each percentage in a separate plot.

We can also consider a similar metric but inverting the role of contributors and contributions. That is, the percentage of contributions made by the top % of contributors:

  • top contributor
  • 1% of top contributors
  • 5% of top contributors
  • 10% of top contributors

I think that these last two sets of metrics should be clearer to understand by visual inspection and also easier to explain in words.
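A sketch of the second set of metrics, the percentage of contributions made by the top pct% of contributors (how the cut-off is rounded is an assumption):

```python
def share_of_top_contributors(contributions, pct):
    """Percentage of all contributions made by the top pct% of contributors.
    At least one contributor (the top one) is always included; rounding the
    cut-off with round() is an illustrative assumption."""
    ranked = sorted(contributions, reverse=True)
    top_k = max(round(len(ranked) * pct / 100), 1)
    return 100 * sum(ranked[:top_k]) / sum(ranked)
```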

Length of the x axis when "downloading as png" the plots

The length of the x axis depends on the length of the legend (which in turn depends on the length of the names of the wikis). If you download a plot with wikis with short names, the x axis is longer than if the wikis have long names. And no legend appears if the plot only has one wiki (so you get a plot with the longest possible x axis).

If possible, consider using a fixed length for the axes (at least the length of the plots with no legend) and a variable length for the legend (which could depend on the names of the wikis). Alternatively, the legend could be placed below the plot, but the height of the axes should not change either.

Refine time axis selection

Relative/Absolute dates should be a switch in the right side, probably close to the slider with the time range.

Both time series should be computed at the same time (when the COMPARE button is pushed) and the switch should be on the right side, so that the user can change between both possible views whenever he or she wants.

In my opinion, if you visualize just one wiki the default mode should be absolute dates, but if you compare more than one, the default mode should be relative. By "default mode" I mean what should be shown when you push COMPARE. The user should not have to anticipate how to show the time series beforehand, but should be able to casually change from one to the other at will.

Regarding the slider, it changes with the absolute/relative selection.

  • The slider numbers should change from integers to dates or vice versa.
  • The handles should be reset to all the time range. Probably more sophisticated behavior is not needed.

Organize the list of metrics for selection in sub-lists.

Tentative organization suggested (some metrics may not appear right now). The name of the list is followed by the list with the metrics. The "monthly/accum" means that both metric versions should appear.

New Pages (monthly / accum)
- Pages
- Articles
- Article talk
- User Talk
- Others

Edits (monthly / accum)
- In pages (old "General")
- In articles
- In article talk
- In user Talk
- Others (?)
- Edited pages (probably only non accum)
- Edited articles

Users (monthly / accum)
- New users
- New anonymous users
- New registered users
- Active users
- Active anonymous users
- Active registered users

Ratios (monthly / accum)
- Page edits per user
- Article edits per user
- Edits per pages (edits understood as page edits)
- Edits per articles (edits understood as article edits)

Think about whether it is possible to elegantly show the accumulated and non-accumulated metrics without duplicating the number of lines (elements). Perhaps the elements could be inside a "table" with three columns: the name, a column with non-accumulated checkboxes, and a column with accumulated checkboxes. If a metric has no "accumulated" version, no checkbox appears in that column.

Think about whether we should use the word "accumulated" or "cumulative".

Bot metrics

It is not a priority in WikiChron, but bot activity could be shown using a few metrics:

  • Bot edits (considering edits in general and not distinguishing whether they edit articles or not)
  • Bot users

Implement Lorenz asymmetry coefficient as a metric of distribution of participation

It seems that it could provide insight about the origin of the inequality: "If the LAC is less than 1, the inequality is primarily due to the relatively many small or poor individuals. If the LAC is greater than 1, the inequality is primarily due to the few largest or wealthiest individuals." It remains to be seen what happens if the origin is due to both factors, as is often the case in wiki data.

https://en.wikipedia.org/wiki/Lorenz_asymmetry_coefficient

More details in the original article:
https://doi.org/10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2

It is funny that it was proposed in an ecology journal, but interestingly it has been used in other fields, including wealth distribution, computing, etc.
https://scholar.google.es/scholar?cites=12769940368116922741&as_sdt=2005&sciodt=0,5&hl=en
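A direct implementation of the coefficient from the Damgaard & Weiner formula; it assumes at least one value lies strictly below the mean and no value equals the mean exactly, so it is a sketch rather than production code:

```python
def lorenz_asymmetry(values):
    """Lorenz asymmetry coefficient, after Damgaard & Weiner (2000).

    values are per-contributor contribution counts. Assumes at least one
    value below the mean and no value exactly equal to the mean."""
    x = sorted(values)
    n = len(x)
    mu = sum(x) / n
    m = sum(1 for v in x if v < mu)        # number of values below the mean
    delta = (mu - x[m - 1]) / (x[m] - x[m - 1])
    below_sum = sum(x[:m])                 # L_m: mass held by those values
    total = sum(x)                         # L_n: total mass
    return (m + delta) / n + (below_sum + delta * x[m]) / total
```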

Gaps in time series cause that they are not shown correctly

If you show a metric that has zero values in the middle (not at the end or at the beginning), those values seem to be removed for some reason and the resulting time series is shorter (and delayed on the time axis).

Example from the good luck charlie wiki. The delay can be seen in the vertical reference lines.
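A possible fix is to reindex each series over the full month range so that missing months become explicit zeros; a sketch:

```python
def fill_gaps(series, first_month, last_month):
    """Reindex a monthly series so months with no data show up as 0 instead
    of being dropped. `series` maps (year, month) -> value; the range is
    inclusive on both ends."""
    filled = []
    year, month = first_month
    while (year, month) <= last_month:
        filled.append(series.get((year, month), 0))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return filled
```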

