
BigBang

BigBang is a toolkit for studying communications data from collaborative projects. It currently supports analyzing mailing lists from Sourceforge, Mailman, ListServ (version 16.5 and 17), Pipermail (version 0.09), Hypermail (version 2.4.0) or .mbox files.

Complete documentation for BigBang can be found on ReadTheDocs.


Background

Many Standards Development Organizations (SDOs) have working groups that organize themselves through mailing lists. This mailing list data is a valuable source of research insights but can be challenging to gather and analyze. BigBang is an open source toolkit for studying processes of open collaboration and deliberation via analysis of the communications records. Its tools for collecting, analyzing, and visualizing mailing list data are used by a community of information policy researchers to study participation trends and interaction in these settings.

Three things BigBang Does

  • Ingress. Tools for collecting data from SDOs, especially their mailing lists.
  • Analysis. Tools for (pre)processing the data to produce useful insights.
  • Usability/Visualization. Tools for visualizing and interacting with data.

Institutional Collaboration

BigBang has been developed by a growing team of researchers spread across many universities and institutions, including UC Berkeley, University of Amsterdam, and New York University. Its development has been funded by Article 19 and Germany's Prototype Fund.

In addition to its scholarly use, BigBang has been building relationships with SDOs themselves. In 2021, the Internet Architecture Board hosted a workshop on Analyzing IETF Data, in which BigBang was featured as a tool for IAB to develop insights into internet governance.

BigBang as Research Software

BigBang is research software -- written by scholars for our research purposes.

It is part of the Scientific Python ecosystem, drawing on many other open source scientific software libraries, such as NumPy, Matplotlib, Pandas, and Jupyter Notebook.

BigBang is a reflexive process. Several of the core developers are also qualitative scholars of socio-technical systems and institutions. Researchers commonly combine BigBang with participant observation in the SDOs they are studying. BigBang is governed by a steering committee of its core developers.

Installation

You need to have Git and pip (for Python 3) installed.

Clone the repository and create a virtualenv:

git clone https://github.com/datactive/bigbang.git
cd bigbang
python3 -m venv env
# activate the virtualenv
. env/bin/activate

Inside the virtualenv, install BigBang:

pip install ".[dev]"

When you're done, you can deactivate the virtualenv:

deactivate

A video tutorial also walks through the installation: BigBang Video Tutorial

Usage

There are several Jupyter notebooks in the examples/ directory of this repository. To open them and begin exploring, run the following commands in the root directory of this repository:

. env/bin/activate
jupyter notebook --notebook-dir=examples/

BigBang contains scripts that make it easy to collect data from a variety of sources. For example, to collect data from an open mailing list archive hosted by Mailman, use:

bigbang collect-mail --url https://mail.python.org/pipermail/scipy-dev/

You can also give this command a file with several urls, one per line. One of these is provided in the examples/ directory.

bigbang collect-mail --file examples/urls.txt

Once the data has been collected, BigBang has functions to support analysis.

You can read more about the data sources supported by BigBang in the documentation.

Development

Unit tests

To run the automated unit tests, use: pytest tests/unit.

Our current goal is code coverage of 60%. Add new unit tests within tests/unit. Unit tests run quickly, without relying on network requests.

Documentation

Docstrings are preferred, so that auto-generated web-based documentation will be possible (#412). You can follow the Google style guide for docstrings.

Formatting

Run pre-commit install to automatically apply black, flake8, and isort to all Python code files for consistent formatting across developers. We try to follow the PEP8 style guide.

Community

If you are interested in participating in BigBang development or would like support from the core development team, please subscribe to the bigbang-dev mailing list and let us know your suggestions, questions, requests and comments. A development chatroom is also available.

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone.

Publications

These academic publications use BigBang as part of their methods:

License

MIT, see LICENSE for its text. This license may be changed at any time according to the principles of the project Governance.

Acknowledgements

This project is funded by Article 19 and Germany's Prototype Fund.

bigbang's People

Contributors

agrawalraj, christovis, davelester, davidberra, debugger22, dwins, effyli, emilienschultz, falahat, hargup, huitseeker, jack005, jesscxu, micahflee, mriduls, nasiff, nllz, npdoty, paulolimac, priyankaiitg, sbenthall, seekshreyas, seliopou, vsporeddy


bigbang's Issues

.mbox input processing

Current scripts pull from a Mailman web archive. But in many cases researchers have access to a .mbox file. This should be an option for the parser.
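As a sketch of what .mbox support could look like, Python's standard-library mailbox module already handles the parsing; the load_mbox helper and sample message below are illustrative, not BigBang's actual API:

```python
import mailbox
import os
import tempfile

SAMPLE = """From alice@example.com Thu Jan  1 00:00:00 2015
From: alice@example.com
Subject: hello
Message-ID: <1@example.com>

Hi all.
"""

def load_mbox(path):
    """Return a list of (sender, subject) tuples from an .mbox file."""
    box = mailbox.mbox(path)
    return [(msg["From"], msg["Subject"]) for msg in box]

# Write a one-message sample archive and parse it back.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "list.mbox")
    with open(path, "w") as f:
        f.write(SAMPLE)
    messages = load_mbox(path)
```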

resolving personal identity over multiple email addresses

See #43 for initial proposal.

Often the same person will write to a mailing list using different email addresses. In some cases, it will be desirable to consolidate these messages into a set that's believed to all be written by one person, over several addresses.

This is not a trivial problem. There are a number of ways we could support this kind of functionality:

  • A purely manual approach, with methods to support the researcher eyeballing addresses for similarity and combining them as part of manual data cleaning.
  • A heuristic algorithm that implements rules derived from empirical examples, provided out of the box to speed up analysis.
  • A trained classifier. This would involve labeling a data set with cases of duplicate email addresses and then training a classifier that could resolve duplicates. There are lots of available features here, including the text and timing of messages. This is a machine learning problem in its own right.

I'm in favor of quick and dirty approaches to this but want to keep the aspiration towards sweet statistical beauty alive as well.
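A minimal sketch of the quick-and-dirty end of that spectrum, assuming a simple normalization rule (lowercasing and stripping a '+tag' from the local part); the function names are illustrative:

```python
from collections import defaultdict

def normalize(address):
    """Crude normalization: lowercase and strip a '+tag' from the local part."""
    local, _, domain = address.lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def group_addresses(addresses):
    """Group addresses that normalize to the same canonical form."""
    groups = defaultdict(set)
    for a in addresses:
        groups[normalize(a)].add(a)
    return dict(groups)

groups = group_addresses([
    "Jane.Doe+lists@example.com",
    "jane.doe@example.com",
    "someone@else.org",
])
```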

extract thread information from mailing list archive

Given a complete mailing list archive, it should be possible to decompose it into multiple individual threads.

Can sort them by thread length and number of participants, for example. A good task: find the bikeshed discussions.
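A sketch of the decomposition, assuming each message carries an In-Reply-To parent; plain dictionaries stand in for parsed messages:

```python
from collections import defaultdict

def build_threads(messages):
    """messages: dicts with 'id' and optional 'in_reply_to'.
    Returns {root message id: [ids of all messages in that thread]}."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}

    def root(mid):
        # Walk up the reply chain until a message with no parent.
        while parent.get(mid):
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for m in messages:
        threads[root(m["id"])].append(m["id"])
    return dict(threads)

threads = build_threads([
    {"id": "a"},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "b"},
    {"id": "d"},
])
```

Sorting `threads.values()` by length then gives the bikeshed candidates.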

improve ascendancy computation efficiency

The current method of computing changing ascendancy over a large range of messages is very inefficient. I believe it's possible to get an order-of-magnitude improvement with some optimizations.

improve error handling in notebooks

For walking through the demos, it would be particularly useful if common errors are caught and interpreted. For example, in the Plot Activity notebook, if you don't already have the archives downloaded, you'll get an IndexError (list index out of range), when what you actually need to be told is that the archives for that mailing list are missing.

So, fix the error handling in that case. But also, have a convention for how to catch and display common errors for notebook/demo purposes.
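One possible convention, sketched with a hypothetical load_archive helper: catch the low-level exception and re-raise it with a message that tells the notebook user what to do.

```python
def load_archive(path):
    """Load an archive file, translating a missing file into a clear message."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        raise RuntimeError(
            f"No archive found at {path!r}; run the collection script first."
        ) from None

# In a notebook, the user would see the actionable message, not an IndexError.
try:
    load_archive("/no/such/archive.mbox")
except RuntimeError as e:
    message = str(e)
```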

search list archives by term

Stuart Geiger requests having the ability to query for lexical terms and return messages that have included those terms in email. For example, looking for 'bots' in the Wikipedia mail archives.

This is a good first target use case for including message parsing functionality.
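A sketch of term search over already-parsed messages; the 'subject' and 'body' keys are an assumed representation, not BigBang's schema:

```python
def search_messages(messages, term):
    """Return messages whose subject or body contains the term (case-insensitive)."""
    term = term.lower()
    return [m for m in messages
            if term in m.get("subject", "").lower()
            or term in m.get("body", "").lower()]

hits = search_messages([
    {"subject": "Bot policy", "body": "Discussing bots on wiki."},
    {"subject": "Meeting notes", "body": "Agenda for Tuesday."},
], "bots")
```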

facilitate use of pandas for functionality involving named columns

See #43 for original comment.

@npdoty notes some awkwardness in the representation of activity broken down by user that is currently used in the IPython notebooks, which is produced by code in process.py.

The problems come up when trying to display data about participation along with the names of the message senders.

A natural way to think about this is as tables with named columns--something that Pandas provides. We suspect that it would be cleaner to produce these graphs working with a suitably constructed Pandas DataFrame.

One way to implement this would be as a method on the object representation of a mailing list archive (#46).
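A sketch of the DataFrame approach, with illustrative column names rather than BigBang's actual schema:

```python
import pandas as pd

# Hypothetical per-message records; the column names are illustrative.
messages = pd.DataFrame([
    {"From": "Alice <a@example.com>", "Date": "2015-01-01"},
    {"From": "Alice <a@example.com>", "Date": "2015-01-02"},
    {"From": "Bob <b@example.com>", "Date": "2015-01-02"},
])

# Message counts per sender, with the sender name kept as a labeled index,
# so plots never lose track of who is who.
activity = messages.groupby("From").size().sort_values(ascending=False)
```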

Betweenness centrality and community membership study

Create an IPython notebook that explores the relationship between betweenness centrality and community membership in the interaction graphs of multiple lists.

A) For the same combined graph and time period, record which mailing lists each sender wrote to. This is their "community membership".

B) For a combined interaction graph created from several mailing lists over some time period, rank the participants according to their betweenness centrality.

http://networkx.github.io/documentation/networkx-1.9.1/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality

C) Is membership in many communities correlated with betweenness centrality? Divide the dataset into 30-day periods, compute the number of communities (A) and betweenness centrality (B) for each participant. Plot the correlation between A and B for each time period.
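The A and B steps above can be sketched with NetworkX on toy data; the list-tagged edges are made up for illustration:

```python
from collections import defaultdict

import networkx as nx

# Toy interaction edges tagged with the mailing list they occurred on.
edges = [
    ("alice", "bob", "list-a"),
    ("bob", "carol", "list-a"),
    ("alice", "carol", "list-b"),
    ("carol", "dave", "list-b"),
]

# (A) community membership: which lists each sender wrote to.
membership = defaultdict(set)
for u, v, lst in edges:
    membership[u].add(lst)
    membership[v].add(lst)

# (B) betweenness centrality on the combined graph.
G = nx.Graph((u, v) for u, v, _ in edges)
centrality = nx.betweenness_centrality(G)
ranked = sorted(centrality, key=centrality.get, reverse=True)
```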

plot list participation over time

For a mailing list, plot overall participation (in terms of number of messages sent) over time.

Use this to build out a solid representation of a single mailing list within BigBang, and integrate it with the archive retrieval script.
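With pandas, the per-period counting reduces to a resample; the dates below are made up, and calling .plot() on the result would draw the time series:

```python
import pandas as pd

# Hypothetical message timestamps for one list.
dates = pd.to_datetime([
    "2015-01-03", "2015-01-10", "2015-01-20",
    "2015-02-02", "2015-02-14",
    "2015-03-07",
])

# Messages per month ("MS" = month-start frequency); counts.plot() would
# render the participation curve.
counts = pd.Series(1, index=dates).resample("MS").sum()
```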

directory of .els import

One use case is that a researcher has a directory full of .els files.

This should be a possible input method for the parser.

Fix Show Interaction Graph notebook to use arbitrary window

Currently the Show Interaction Graph notebook selects archives from the file system based on an arbitrarily chosen month's archive.

This is unclean. Instead, it should point to the archive object representation (see #46 ) and then take a starting date and time window.

basic install docs and howto

The README should describe basic installation and provide Getting Started instructions for collecting and visualizing an email list.

It would be great if this was also on the github.io page for this project.

refactor process.py code

The process.py code is difficult to navigate due to the switching between pandas and numpy.

It would be best to rationalize this before writing tests.

Analyze time-series data with component analysis

As per Ariel Nunez's suggestion, for bucketed time series data like the daily mailing list activity counts, there should be machinery for doing a component analysis of the activity.

This means turning a time stamp into a feature vector. The featurization framework should be flexible enough to support periodic features (such as 'Mondays'), progression over time (i.e. days from origin), and landmark events (such as release dates).

Use PCA or a similar algorithm to get a sense of the contribution of each feature to the total activity. Document this process in an IPython notebook, and include comments on the limitations of the linear model.
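A sketch of the featurization step, with an assumed origin date and landmark list; the feature layout is illustrative:

```python
from datetime import datetime

ORIGIN = datetime(2015, 1, 1)        # hypothetical start of the archive
LANDMARKS = [datetime(2015, 3, 1)]   # e.g. a release date (assumed)

def featurize(ts):
    """Turn a timestamp into a feature vector:
    [is_monday, days since origin, days since nearest past landmark (-1 if none)].
    """
    past = [(ts - lm).days for lm in LANDMARKS if lm <= ts]
    return [
        1 if ts.weekday() == 0 else 0,
        (ts - ORIGIN).days,
        min(past) if past else -1,
    ]

vec = featurize(datetime(2015, 3, 9))  # a Monday, 8 days after the landmark
```

Stacking these vectors for every bucketed timestamp yields the matrix PCA operates on.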

persistent data store for preprocessed email archives

Both downloaded mail .txt.gz files and .mbox files are a raw data format.

Preprocessing this data into a more structured form is a step that can take place before further downstream processing, with the result cached on the file system.

This should involve a Python class for an archive. In addition to save and load functions, it should expose the data in a sensible way and provide helper functions.
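A minimal sketch of such a class, using JSON as a stand-in for whatever serialization the real implementation chooses; the interface is hypothetical:

```python
import json
import os
import tempfile

class Archive:
    """Sketch of a preprocessed-archive class with save/load caching."""

    def __init__(self, messages):
        self.messages = messages  # list of dicts: already-parsed messages

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.messages, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))

    def senders(self):
        """Example helper: the distinct senders in this archive."""
        return sorted({m["from"] for m in self.messages})

# Round-trip through the cache format.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "archive.json")
    Archive([{"from": "a@example.com"}, {"from": "b@example.com"}]).save(path)
    restored = Archive.load(path)

senders = restored.senders()
```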

parse out multiple references into separate ids

When there are multiple References in an email, these should be parsed out into a list of message IDs.

When these dictionary representations are turned into graphs, multiple References should be processed as multiple edges rather than aggregated into a single output node.
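Splitting the header is straightforward, since message IDs are whitespace-separated and angle-bracket-delimited; a sketch:

```python
def parse_references(header):
    """Split a References header value into a list of message IDs."""
    if not header:
        return []
    # Keep only well-formed <...> tokens; folding whitespace is handled by split().
    return [tok for tok in header.split()
            if tok.startswith("<") and tok.endswith(">")]

refs = parse_references("<1@example.com> <2@example.com>\n <3@example.com>")
```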

line 3, 4 in conda-setup.sh problem with pip

After creating the BigBang environment and trying to run conda-setup.sh, I run into the following error message:

conda-setup.sh: line 3: pip: command not found
conda-setup.sh: line 4: pip: command not found

bus factor measure

Carrying over some from #10 into a new ticket that reflects the more advanced functionality.

"Given an ordering over participants 1st, 2nd, 3rd most active, and given a distribution of threads involving each participant, it would be nice to have a breakdown of average and max thread size given the conditions that the 1st (or first and second, or first/second/third) are not participating.

That would be an indication of the project's bus factor.
http://en.wikipedia.org/wiki/Bus_factor"
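A sketch of the computation, representing each thread simply as its set of participants:

```python
from collections import Counter

def thread_sizes_without_top(threads, k):
    """threads: list of participant sets. Return the sizes of the threads that
    would survive if the k most active participants all stopped posting."""
    activity = Counter(p for t in threads for p in t)
    top = {p for p, _ in activity.most_common(k)}
    return [len(t - top) for t in threads if t - top]

threads = [
    {"alice", "bob"},
    {"alice", "carol", "dave"},
    {"alice", "bob", "carol"},
]
# Remove the single most active participant (alice) and see what remains.
surviving = thread_sizes_without_top(threads, 1)
```

Averaging and maxing `surviving` for k = 1, 2, 3 gives the breakdown described above.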

improve error handling on collect_mail

Right now there are no built-in checks or error responses to bad requests in the collect mail script.

Since these kinds of problems add noise, there should be better checks in place.
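A sketch of what those checks could look like using only the standard library; fetch_archive_page is a hypothetical name, not the script's actual API:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_archive_page(url, timeout=10):
    """Fetch one archive page, returning (body, error_message)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except HTTPError as e:        # server responded, but with an error status
        return None, f"server returned {e.code} for {url}"
    except URLError as e:         # DNS failure, refused connection, etc.
        return None, f"could not reach {url}: {e.reason}"

# The .invalid TLD is reserved and never resolves, so this fails cleanly.
body, err = fetch_archive_page("http://nonexistent.invalid/", timeout=5)
```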

convert reply graph to interaction graph

Now there is a way of taking a month's archive and turning it into an nx.Graph representing which messages reply to which.

Next I need a tool to turn this reply-graph into a graph of interactions between users.

In this interaction graph, nodes will represent mailing list users, keyed by the 'From:' field of the emails they send. Weighted edges to other nodes e.g. A -> B will show how many times A replied to B's message.
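The reply-to-interaction conversion can be sketched as a weighted edge count, with dictionaries standing in for parsed messages:

```python
from collections import Counter

def interaction_graph(messages):
    """messages: dicts with 'id', 'from', and optional 'in_reply_to'.
    Returns weighted edges {(replier, original_sender): reply count}."""
    sender = {m["id"]: m["from"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender:
            edges[(m["from"], sender[parent])] += 1
    return dict(edges)

edges = interaction_graph([
    {"id": "1", "from": "A"},
    {"id": "2", "from": "B", "in_reply_to": "1"},
    {"id": "3", "from": "A", "in_reply_to": "2"},
    {"id": "4", "from": "B", "in_reply_to": "3"},
])
```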

Object representation of an archive

Related to #39. An in-memory object representation of a mailing list archive is needed so that we can cleanly support inquiries such as #42 with a well-documented class interface.

document installation/setup with Anaconda

Using Anaconda may greatly ease installation headaches since it comes with the scientific packages installed. Provide documentation for using this Python distribution.

IPython notebook 'examples' directory

As a way of providing an inroad to using the tool, create an 'examples' directory with IPython notebooks.

This can replace some of the hacky work done currently in /bin.

get a first pass at a gender participation statistic

There are some name-based gender recognition libraries in Python. (Could also use writing-style based libraries.) While not perfect, this could be a decent first pass on metrics for rates of participation between genders over time.

analysis of cascade dynamics in email threads

A research question based on conversations with @wazaahhh

Taking threads as a unit of analysis, look at their size and properties as subgraphs of the larger interaction graph, such as the clustering coefficient (indicating a tight conversation between participants or a far-ranging one?). Also, look at the effectiveness of different participants in triggering cascades.

For sizes of cascades, look at several metrics and test to see their values as a function of cascade size. Plot them. Regression analysis later?

Teasing apart the contributions of different actors and how their interactions change over time and as a function of who they are interacting with.

Granovetter, Stanford, and Leskovec as background theory.

Derive yearly summary graphs for visualization of community growth

For a mailing list archive (or several), be able to summarize activity on each mailing list by year and process into a separate graph.

Script out visualization of these discussions as a series over time. Use consistent layout and coloring algorithms for each year in order to visualize community growth.

Fit a Hawkes process to the time series data

Thomas Maillart among others has suggested that the time series of responses on open source activity can be effectively modeled as a Hawkes process.

This page demonstrates R code for fitting this distribution to time series data (from Bitcoin data)

http://jheusser.github.io/2013/09/08/hawkes.html

Replicate this work for the time series data from mailing lists. Compute AIC and BIC criteria. This is the foundation of further work on statistical model selection for this data.
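For reference, the conditional intensity of a Hawkes process with an exponential kernel is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)); a direct translation, with made-up parameter values:

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of a Hawkes process with exponential kernel:
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

# Two past events at t=0 and t=1; intensity evaluated at t=2.
lam = hawkes_intensity(2.0, [0.0, 1.0], mu=0.5, alpha=1.0, beta=1.0)
```

Fitting mu, alpha, beta to the message timestamps by maximum likelihood is the step the R code in the linked post performs.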

60% test coverage

Automated testing is important. Do not close the 0.1 milestone without 60% automated test coverage of code.

test likelihood of power law versus log normal distribution on thread replies

There is a little bit of controversy over whether everything purported to have a power law distribution actually has one. It might just be log-normal, which is more like a null hypothesis.

http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/491.html
http://arxiv.org/abs/0706.1062

Email response times are one of those things that might be power-law distributed:
http://andrewgelman.com/2006/02/24/i_have_nothing_1/

It would be good to test the response times within these mailing list threads and look at the likelihood ratios of power law vs. log-normal generating processes.
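A sketch of the comparison, computing maximum-likelihood fits for both models directly (following the continuous power-law MLE of Clauset et al., the second link above); the sample data is made up:

```python
import math

def loglik_powerlaw(xs, xmin):
    """Log-likelihood of a continuous power law fit by MLE above xmin."""
    n = len(xs)
    alpha = 1 + n / sum(math.log(x / xmin) for x in xs)
    return sum(math.log((alpha - 1) / xmin) - alpha * math.log(x / xmin)
               for x in xs)

def loglik_lognormal(xs):
    """Log-likelihood of a lognormal fit by MLE."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((l - mu) ** 2 for l in logs) / n
    return sum(
        -math.log(x) - 0.5 * math.log(2 * math.pi * sigma2)
        - (math.log(x) - mu) ** 2 / (2 * sigma2)
        for x in xs
    )

# A positive ratio favors the power law; negative favors the lognormal.
xs = [1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0]
ratio = loglik_powerlaw(xs, xmin=1.0) - loglik_lognormal(xs)
```

Clauset et al. also derive a significance test for this ratio, which the real analysis should include.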

matplotlib graph output

There should be the option of outputting a visualization of the resulting output graph in matplotlib. Try using NetworkX's Atlas visualization example. This could be cleaner than fiddling with Gephi when there are multiple unconnected subgraphs. Also, it would be good to order the graphs chronologically.
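A minimal sketch with NetworkX's matplotlib drawing support, using the Agg backend so it runs headless; the graph data is a toy example:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import networkx as nx

# A toy reply graph with two disconnected components.
G = nx.Graph([("a", "b"), ("b", "c"), ("d", "e")])

fig, ax = plt.subplots()
nx.draw_networkx(G, ax=ax, node_color="lightblue")
ax.set_axis_off()

out_path = os.path.join(tempfile.gettempdir(), "interaction_graph.png")
fig.savefig(out_path)
```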

collect from W3C mailing list archives (HTML)

While W3C has .mbox download archives, those are restricted (to protect the more detailed header information). It would be nice to have a crawler that could download W3C archives in their Web-accessible, HTML form.

testing framework

Bring in nosetests. It would be good to approach this project using test-driven development, as there are a lot of finicky bits.
