
BigBang

BigBang is a toolkit for studying communications data from collaborative projects. It currently supports analyzing mailing lists from Sourceforge, Mailman, ListServ (version 16.5 and 17), Pipermail (version 0.09), Hypermail (version 2.4.0) or .mbox files.

Complete documentation for BigBang can be found on ReadTheDocs.


Background

Many Standards Development Organizations (SDOs) have working groups that organize themselves through mailing lists. This mailing list data is a valuable source of research insights but can be challenging to gather and analyze. BigBang is an open source toolkit for studying processes of open collaboration and deliberation via analysis of the communications records. Its tools for collecting, analyzing, and visualizing mailing list data are used by a community of information policy researchers to study participation trends and interaction in these settings.

Three things BigBang Does

  • Ingress. Tools for collecting data from SDOs, especially their mailing lists.
  • Analysis. Tools for (pre)processing the data to produce useful insights.
  • Usability/Visualization. Tools for visualizing and interacting with data.

Institutional Collaboration

BigBang has been developed by a growing team of researchers spread across many universities and institutions, including UC Berkeley, University of Amsterdam, and New York University. Its development has been funded by Article 19 and Germany's Prototype Fund.

In addition to its scholarly use, BigBang has been building relationships with SDOs themselves. In 2021, the Internet Architecture Board hosted a workshop on Analyzing IETF Data, in which BigBang was featured as a tool for IAB to develop insights into internet governance.

BigBang as Research Software

BigBang is research software -- written by scholars for our research purposes.

It is part of the Scientific Python ecosystem, drawing on many other open source scientific software libraries, such as NumPy, Matplotlib, Pandas, and Jupyter Notebook.

BigBang is a reflexive process. Several of the core developers are also qualitative scholars of socio-technical systems and institutions. Researchers commonly combine BigBang with participant observation in the SDOs they are studying. BigBang is governed by a steering committee of its core developers.

Installation

You need to have Git and pip (for Python 3) installed.

Clone the repository and create a virtualenv:

git clone https://github.com/datactive/bigbang.git
cd bigbang
python3 -m venv env
# activate the virtualenv
. env/bin/activate

Inside the virtualenv, install BigBang:

pip install ".[dev]"

When you're done, you can deactivate the virtualenv:

deactivate

A video tutorial also walks through the installation: BigBang Video Tutorial

Usage

There are several Jupyter notebooks in the examples/ directory of this repository. To open them and begin exploring, run the following commands in the root directory of this repository:

. env/bin/activate
jupyter notebook --notebook-dir=examples/

BigBang contains scripts that make it easy to collect data from a variety of sources. For example, to collect data from an open mailing list archive hosted by Mailman, use:

bigbang collect-mail --url https://mail.python.org/pipermail/scipy-dev/

You can also give this command a file with several urls, one per line. One of these is provided in the examples/ directory.

bigbang collect-mail --file examples/urls.txt

Once the data has been collected, BigBang has functions to support analysis.

You can read more about the data sources supported by BigBang in the documentation.

Development

Unit tests

To run the automated unit tests, use: pytest tests/unit.

Our current goal is code coverage of 60%. Add new unit tests within tests/unit. Unit tests run quickly, without relying on network requests.

Documentation

Docstrings are preferred, so that auto-generated web-based documentation will be possible (#412). You can follow the Google style guide for docstrings.

Formatting

Run pre-commit install to automatically apply black, flake8, and isort to all Python code files for consistent formatting across developers. We try to follow the PEP8 style guide.

Community

If you are interested in participating in BigBang development or would like support from the core development team, please subscribe to the bigbang-dev mailing list and let us know your suggestions, questions, requests and comments. A development chatroom is also available.

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone.

Publications

These academic publications use BigBang as part of their methods:

License

MIT, see LICENSE for its text. This license may be changed at any time according to the principles of the project Governance.

Acknowledgements

This project is funded by Article 19 and Germany's Prototype Fund.

bigbang's People

Contributors

agrawalraj, christovis, davelester, davidberra, debugger22, dwins, effyli, emilienschultz, falahat, hargup, huitseeker, jack005, jesscxu, micahflee, mriduls, nasiff, nllz, npdoty, paulolimac, priyankaiitg, sbenthall, seekshreyas, seliopou, vsporeddy


bigbang's Issues

.mbox input processing

Current scripts pull from a Mailman web archive. But in many cases researchers have access to a .mbox file. This should be an option for the parser.
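As a sketch of what .mbox support could look like, Python's standard-library mailbox module already handles the parsing; the load_mbox helper and sample message below are illustrative, not BigBang's actual API:

```python
import mailbox
import os
import tempfile

SAMPLE = """From alice@example.com Thu Jan  1 00:00:00 2015
From: alice@example.com
Subject: hello
Message-ID: <1@example.com>

Hi all.
"""

def load_mbox(path):
    """Return a list of (sender, subject) tuples from an .mbox file."""
    box = mailbox.mbox(path)
    return [(msg["From"], msg["Subject"]) for msg in box]

# Write a one-message sample archive and parse it back.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "list.mbox")
    with open(path, "w") as f:
        f.write(SAMPLE)
    messages = load_mbox(path)
```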

resolving personal identity over multiple email addresses

See #43 for initial proposal.

Often the same person will write to a mailing list using different email addresses. In some cases, it will be desirable to consolidate these messages into a set that's believed to all be written by one person, over several addresses.

This is not a trivial problem. There are a number of ways we could support this kind of functionality:

  • A purely manual approach, with methods to support the researcher eyeballing addresses for similarity and combining them as part of manual data cleaning.
  • A heuristic algorithm that implements rules derived from empirical examples, provided out of the box to speed up analysis.
  • A trained classifier. This would involve labeling a data set with cases of duplicate email addresses and then training a classifier that could resolve duplicates. There are lots of available features here, including the text and timing of messages. This is a machine learning problem in its own right.

I'm in favor of quick and dirty approaches to this but want to keep the aspiration towards sweet statistical beauty alive as well.
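A minimal sketch of the quick-and-dirty end of that spectrum, assuming a simple normalization rule (lowercasing and stripping a '+tag' from the local part); the function names are illustrative:

```python
from collections import defaultdict

def normalize(address):
    """Crude normalization: lowercase and strip a '+tag' from the local part."""
    local, _, domain = address.lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def group_addresses(addresses):
    """Group addresses that normalize to the same canonical form."""
    groups = defaultdict(set)
    for a in addresses:
        groups[normalize(a)].add(a)
    return dict(groups)

groups = group_addresses([
    "Jane.Doe+lists@example.com",
    "jane.doe@example.com",
    "someone@else.org",
])
```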

extract thread information from mailing list archive

Given a complete mailing list archive, it should be possible to decompose it into multiple individual threads.

Can sort them by thread length and number of participants, for example. A good task: find the bikeshed discussions.
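A sketch of the decomposition, assuming each message carries an In-Reply-To parent; plain dictionaries stand in for parsed messages:

```python
from collections import defaultdict

def build_threads(messages):
    """messages: dicts with 'id' and optional 'in_reply_to'.
    Returns {root message id: [ids of all messages in that thread]}."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}

    def root(mid):
        # Walk up the reply chain until a message with no parent.
        while parent.get(mid):
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for m in messages:
        threads[root(m["id"])].append(m["id"])
    return dict(threads)

threads = build_threads([
    {"id": "a"},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "b"},
    {"id": "d"},
])
```

Sorting `threads.values()` by length then gives the bikeshed candidates.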

improve ascendancy computation efficiency

The current method of computing changing ascendancy over a large range of messages is very inefficient. I believe it's possible to get an order-of-magnitude improvement with some optimizations.

improve error handling in notebooks

For walking through the demos, it would be particularly useful if common errors are caught and interpreted. For example, in the Plot Activity notebook, if you don't already have the archives downloaded, you'll get an IndexError (list index out of range), when what you actually need to be told is that the archives for that mailing list are missing.

So, fix the error handling in that case. But also, have a convention for how to catch and display common errors for notebook/demo purposes.
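One possible convention, sketched with a hypothetical load_archive helper: catch the low-level exception and re-raise it with a message that tells the notebook user what to do.

```python
def load_archive(path):
    """Load an archive file, translating a missing file into a clear message."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        raise RuntimeError(
            f"No archive found at {path!r}; run the collection script first."
        ) from None

# In a notebook, the user would see the actionable message, not an IndexError.
try:
    load_archive("/no/such/archive.mbox")
except RuntimeError as e:
    message = str(e)
```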

search list archives by term

Stuart Geiger requests having the ability to query for lexical terms and return messages that have included those terms in email. For example, looking for 'bots' in the Wikipedia mail archives.

This is a good first target use case for including message parsing functionality.
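A sketch of term search over already-parsed messages; the 'subject' and 'body' keys are an assumed representation, not BigBang's schema:

```python
def search_messages(messages, term):
    """Return messages whose subject or body contains the term (case-insensitive)."""
    term = term.lower()
    return [m for m in messages
            if term in m.get("subject", "").lower()
            or term in m.get("body", "").lower()]

hits = search_messages([
    {"subject": "Bot policy", "body": "Discussing bots on wiki."},
    {"subject": "Meeting notes", "body": "Agenda for Tuesday."},
], "bots")
```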

facilitate use of pandas for functionality involving named columns

See #43 for original comment.

@npdoty notes some awkwardness in the representation of activity broken down by user that is currently used in the IPython notebooks, which is produced by code in process.py.

The problems come up when trying to display data about participation along with the names of the message senders.

A natural way to think about this is as tables with named columns--something that Pandas provides. We suspect that it would be cleaner to produce these graphs working with a suitably constructed Pandas DataFrame.

One way to implement this would be as a method on the object representation of a mailing list archive (#46).
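A sketch of the DataFrame approach, with illustrative column names rather than BigBang's actual schema:

```python
import pandas as pd

# Hypothetical per-message records; the column names are illustrative.
messages = pd.DataFrame([
    {"From": "Alice <a@example.com>", "Date": "2015-01-01"},
    {"From": "Alice <a@example.com>", "Date": "2015-01-02"},
    {"From": "Bob <b@example.com>", "Date": "2015-01-02"},
])

# Message counts per sender, with the sender name kept as a labeled index,
# so plots never lose track of who is who.
activity = messages.groupby("From").size().sort_values(ascending=False)
```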

Betweenness centrality and community membership study

Create an IPython notebook that explores the relationship between betweenness centrality and community membership in the interaction graphs of multiple lists.

A) For the same combined graph and time period, record which mailing lists each sender wrote to. This is their "community membership".

B) For a combined interaction graph created from several mailing lists over some time period, rank the participants according to their betweenness centrality.

http://networkx.github.io/documentation/networkx-1.9.1/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality

C) Is membership in many communities correlated with betweenness centrality? Divide the dataset into 30-day periods, compute the number of communities (A) and betweenness centrality (B) for each participant. Plot the correlation between A and B for each time period.
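The A and B steps above can be sketched with NetworkX on toy data; the list-tagged edges are made up for illustration:

```python
from collections import defaultdict

import networkx as nx

# Toy interaction edges tagged with the mailing list they occurred on.
edges = [
    ("alice", "bob", "list-a"),
    ("bob", "carol", "list-a"),
    ("alice", "carol", "list-b"),
    ("carol", "dave", "list-b"),
]

# (A) community membership: which lists each sender wrote to.
membership = defaultdict(set)
for u, v, lst in edges:
    membership[u].add(lst)
    membership[v].add(lst)

# (B) betweenness centrality on the combined graph.
G = nx.Graph((u, v) for u, v, _ in edges)
centrality = nx.betweenness_centrality(G)
ranked = sorted(centrality, key=centrality.get, reverse=True)
```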

plot list participation over time

For a mailing list, plot overall participation (in terms of number of messages sent) over time.

Use this to build out a solid representation of a single mailing list within BigBang, and integrate it with the archive retrieval script.
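With pandas, the per-period counting reduces to a resample; the dates below are made up, and calling .plot() on the result would draw the time series:

```python
import pandas as pd

# Hypothetical message timestamps for one list.
dates = pd.to_datetime([
    "2015-01-03", "2015-01-10", "2015-01-20",
    "2015-02-02", "2015-02-14",
    "2015-03-07",
])

# Messages per month ("MS" = month-start frequency); counts.plot() would
# render the participation curve.
counts = pd.Series(1, index=dates).resample("MS").sum()
```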

directory of .els import

One use case is that a researcher has a directory full of .els files.

This should be a possible input method for the parser.

Fix Show Interaction Graph notebook to use arbitrary window

Currently the Show Interaction Graph notebook selects archives from the file system based on an arbitrarily chosen month's archive.

This is unclean. Instead, it should point to the archive object representation (see #46 ) and then take a starting date and time window.

basic install docs and howto

The README should describe basic installation and provide Getting Started instructions for collecting and visualizing an email list.

It would be great if this was also on the github.io page for this project.

refactor process.py code

The process.py code is difficult to navigate due to the switching between pandas and numpy.

It would be best to rationalize this before writing tests.

Analyze time-series data with component analysis

As per Ariel Nunez's suggestion, for bucketed time series data like the daily mailing list activity counts, there should be machinery for doing a component analysis of the activity.

This means turning a time stamp into a feature vector. The featurization framework should be flexible enough to support periodic features (such as 'Mondays'), progression over time (i.e. days from origin), and landmark events (such as release dates).

Use PCA or a similar algorithm to get a sense of the contribution of each feature to the total activity. Document this process in an IPython notebook, and include comments on the limitations of the linear model.
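A sketch of the featurization step, with an assumed origin date and landmark list; the feature layout is illustrative:

```python
from datetime import datetime

ORIGIN = datetime(2015, 1, 1)        # hypothetical start of the archive
LANDMARKS = [datetime(2015, 3, 1)]   # e.g. a release date (assumed)

def featurize(ts):
    """Turn a timestamp into a feature vector:
    [is_monday, days since origin, days since nearest past landmark (-1 if none)].
    """
    past = [(ts - lm).days for lm in LANDMARKS if lm <= ts]
    return [
        1 if ts.weekday() == 0 else 0,
        (ts - ORIGIN).days,
        min(past) if past else -1,
    ]

vec = featurize(datetime(2015, 3, 9))  # a Monday, 8 days after the landmark
```

Stacking these vectors for every bucketed timestamp yields the matrix PCA operates on.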

persistent data store for preprocessed email archives

Both downloaded mail .txt.gz files and .mbox files are a raw data format.

Preprocessing this data into a more structured form is a step that can take place before further downstream processing, with the result cached on the file system.

This should involve a Python class for an archive. In addition to save and load functions, it should expose the data in a sensible way and provide helper functions.
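A minimal sketch of such a class, using JSON as a stand-in for whatever serialization the real implementation chooses; the interface is hypothetical:

```python
import json
import os
import tempfile

class Archive:
    """Sketch of a preprocessed-archive class with save/load caching."""

    def __init__(self, messages):
        self.messages = messages  # list of dicts: already-parsed messages

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.messages, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))

    def senders(self):
        """Example helper: the distinct senders in this archive."""
        return sorted({m["from"] for m in self.messages})

# Round-trip through the cache format.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "archive.json")
    Archive([{"from": "a@example.com"}, {"from": "b@example.com"}]).save(path)
    restored = Archive.load(path)

senders = restored.senders()
```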

parse out multiple references into separate ids

When there are multiple References in an email, these should be parsed out into a list of message IDs.

When these dictionary representations are turned into graphs, multiple References should be processed as multiple edges rather than aggregated into a single output node.
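Splitting the header is straightforward, since message IDs are whitespace-separated and angle-bracket-delimited; a sketch:

```python
def parse_references(header):
    """Split a References header value into a list of message IDs."""
    if not header:
        return []
    # Keep only well-formed <...> tokens; folding whitespace is handled by split().
    return [tok for tok in header.split()
            if tok.startswith("<") and tok.endswith(">")]

refs = parse_references("<1@example.com> <2@example.com>\n <3@example.com>")
```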

line 3, 4 in conda-setup.sh problem with pip

After creating the BigBang environment and trying to run conda-setup.sh, I run into the following error message:

conda-setup.sh: line 3: pip: command not found
conda-setup.sh: line 4: pip: command not found

bus factor measure

Carrying over some from #10 into a new ticket that reflects the more advanced functionality.

"Given an ordering over participants 1st, 2nd, 3rd most active, and given a distribution of threads involving each participant, it would be nice to have a breakdown of average and max thread size given the conditions that the 1st (or first and second, or first/second/third) are not participating.

That would be an indication of the project's bus factor.
http://en.wikipedia.org/wiki/Bus_factor"
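A sketch of the computation, representing each thread simply as its set of participants:

```python
from collections import Counter

def thread_sizes_without_top(threads, k):
    """threads: list of participant sets. Return the sizes of the threads that
    would survive if the k most active participants all stopped posting."""
    activity = Counter(p for t in threads for p in t)
    top = {p for p, _ in activity.most_common(k)}
    return [len(t - top) for t in threads if t - top]

threads = [
    {"alice", "bob"},
    {"alice", "carol", "dave"},
    {"alice", "bob", "carol"},
]
# Remove the single most active participant (alice) and see what remains.
surviving = thread_sizes_without_top(threads, 1)
```

Averaging and maxing `surviving` for k = 1, 2, 3 gives the breakdown described above.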

improve error handling on collect_mail

Right now there are no built-in checks or error responses to bad requests in the collect mail script.

Since these kinds of problems add noise, there should be better checks in place.
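A sketch of what those checks could look like using only the standard library; fetch_archive_page is a hypothetical name, not the script's actual API:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_archive_page(url, timeout=10):
    """Fetch one archive page, returning (body, error_message)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except HTTPError as e:        # server responded, but with an error status
        return None, f"server returned {e.code} for {url}"
    except URLError as e:         # DNS failure, refused connection, etc.
        return None, f"could not reach {url}: {e.reason}"

# The .invalid TLD is reserved and never resolves, so this fails cleanly.
body, err = fetch_archive_page("http://nonexistent.invalid/", timeout=5)
```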

convert reply graph to interaction graph

Now there is a way of taking a month's archive and turning it into an nx.Graph representing which messages reply to which.

Next I need a tool to turn this reply-graph into a graph of interactions between users.

In this interaction graph, nodes will represent mailing list users, keyed by the 'From:' field of the emails they send. Weighted edges to other nodes e.g. A -> B will show how many times A replied to B's message.
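The reply-to-interaction conversion can be sketched as a weighted edge count, with dictionaries standing in for parsed messages:

```python
from collections import Counter

def interaction_graph(messages):
    """messages: dicts with 'id', 'from', and optional 'in_reply_to'.
    Returns weighted edges {(replier, original_sender): reply count}."""
    sender = {m["id"]: m["from"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender:
            edges[(m["from"], sender[parent])] += 1
    return dict(edges)

edges = interaction_graph([
    {"id": "1", "from": "A"},
    {"id": "2", "from": "B", "in_reply_to": "1"},
    {"id": "3", "from": "A", "in_reply_to": "2"},
    {"id": "4", "from": "B", "in_reply_to": "3"},
])
```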

Object representation of an archive

Related to #39. An in-memory object representation of a mailing list archive is needed so that we can cleanly support inquiries such as #42 with a well-documented class interface.

document installation/setup with Anaconda

Using Anaconda may greatly ease installation headaches since it comes with the scientific packages installed. Provide documentation for using this Python distribution.

IPython notebook 'examples' directory

As a way of providing an inroad to using the tool, create an 'examples' directory with IPython notebooks.

This can replace some of the hacky work done currently in /bin.

get a first pass at a gender participation statistic

There are some name-based gender recognition libraries in Python. (Could also use writing-style based libraries.) While not perfect, this could be a decent first pass on metrics for rates of participation between genders over time.

analysis of cascade dynamics in email threads

A research question based on conversations with @wazaahhh

Taking threads as a unit of analysis, look at their size and properties as subgraphs of the larger interaction graph, such as the clustering coefficient (indicating a tight conversation between participants or a far-ranging one?). Also, look at the effectiveness of different participants in triggering cascades.

For sizes of cascades, look at several metrics and test to see their values as a function of cascade size. Plot them. Regression analysis later?

Teasing apart the contributions of different actors and how their interactions change over time and as a function of who they are interacting with.

Granovetter, Stanford, and Leskovec as background theory.

Derive yearly summary graphs for visualization of community growth

For a mailing list archive (or several), be able to summarize activity on each mailing list by year and process into a separate graph.

Script out visualization of these discussions as a series over time. Use consistent layout and coloring algorithms for each year in order to visualize community growth.

Fit a Hawkes process to the time series data

Thomas Maillart among others has suggested that the time series of responses on open source activity can be effectively modeled as a Hawkes process.

This page demonstrates R code for fitting this distribution to time series data (from Bitcoin data)

http://jheusser.github.io/2013/09/08/hawkes.html

Replicate this work for the time series data from mailing lists. Compute AIC and BIC criteria. This is the foundation of further work on statistical model selection for this data.
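For reference, the conditional intensity of a Hawkes process with an exponential kernel is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)); a direct translation, with made-up parameter values:

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of a Hawkes process with exponential kernel:
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

# Two past events at t=0 and t=1; intensity evaluated at t=2.
lam = hawkes_intensity(2.0, [0.0, 1.0], mu=0.5, alpha=1.0, beta=1.0)
```

Fitting mu, alpha, beta to the message timestamps by maximum likelihood is the step the R code in the linked post performs.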

60% test coverage

Automated testing is important. Do not close the 0.1 milestone without 60% automated test coverage of code.

test likelihood of power law versus log normal distribution on thread replies

There is a little bit of controversy over whether everything purported to have a power law distribution actually has one. It might just be log-normal, which is more like a null hypothesis.

http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/491.html
http://arxiv.org/abs/0706.1062

Email response times are one of those things that might be power-law distributed:
http://andrewgelman.com/2006/02/24/i_have_nothing_1/

It would be good to test the response times within these mailing list threads and look at the likelihood ratios of power law vs. log-normal generating processes.
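A sketch of the comparison, computing maximum-likelihood fits for both models directly (following the continuous power-law MLE of Clauset et al., the second link above); the sample data is made up:

```python
import math

def loglik_powerlaw(xs, xmin):
    """Log-likelihood of a continuous power law fit by MLE above xmin."""
    n = len(xs)
    alpha = 1 + n / sum(math.log(x / xmin) for x in xs)
    return sum(math.log((alpha - 1) / xmin) - alpha * math.log(x / xmin)
               for x in xs)

def loglik_lognormal(xs):
    """Log-likelihood of a lognormal fit by MLE."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((l - mu) ** 2 for l in logs) / n
    return sum(
        -math.log(x) - 0.5 * math.log(2 * math.pi * sigma2)
        - (math.log(x) - mu) ** 2 / (2 * sigma2)
        for x in xs
    )

# A positive ratio favors the power law; negative favors the lognormal.
xs = [1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0]
ratio = loglik_powerlaw(xs, xmin=1.0) - loglik_lognormal(xs)
```

Clauset et al. also derive a significance test for this ratio, which the real analysis should include.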

matplotlib graph output

There should be the option of outputting a visualization of the resulting output graph in matplotlib. Try using NetworkX's Atlas visualization example. This could be cleaner than fiddling with Gephi when there are multiple unconnected subgraphs. Also, it would be good to order the graphs chronologically.
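A minimal sketch with NetworkX's matplotlib drawing support, using the Agg backend so it runs headless; the graph data is a toy example:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import networkx as nx

# A toy reply graph with two disconnected components.
G = nx.Graph([("a", "b"), ("b", "c"), ("d", "e")])

fig, ax = plt.subplots()
nx.draw_networkx(G, ax=ax, node_color="lightblue")
ax.set_axis_off()

out_path = os.path.join(tempfile.gettempdir(), "interaction_graph.png")
fig.savefig(out_path)
```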

collect from W3C mailing list archives (HTML)

While W3C has .mbox download archives, those are restricted (to protect the more detailed header information). It would be nice to have a crawler that could download W3C archives in their Web-accessible, HTML form.

testing framework

Bring in nosetests. It would be good to approach this project using test-driven development, as there are a lot of finicky bits.
