Variationist

MIT License v0.1.4 Python 3.9+ Documentation Tutorials

🕵️‍♀️ Variationist is a highly modular, flexible, and customizable tool to analyze and explore language variation and bias in written language data. It allows researchers, from NLP practitioners to linguists and social scientists, to seamlessly investigate language use across many dimensions and a wide range of use cases.

Alan Ramponi, Camilla Casula and Stefano Menini. 2024. Variationist: Exploring Multifaceted Variation and Bias in Written Language Data. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 346–354, Bangkok, Thailand. ACL.

Installation

Python package

🕵️‍♀️ Variationist can be installed as a Python package from PyPI using pip as follows:

pip install variationist

Installing from source

Alternatively, 🕵️‍♀️ Variationist can be installed from source as follows:

  1. Clone this repository to a local path:
git clone https://github.com/dhfbk/variationist.git
  2. Create an environment with your preferred package manager. We used Python 3.9 and the dependencies listed in requirements.txt. If you use conda, you can just run the following commands from the root of the project:
conda create --name variationist python=3.9         # create the environment
conda activate variationist                         # activate the environment
pip install --user -r requirements.txt              # install the required packages

Quickstart

🕵️‍♀️ Variationist works in a few lines of code and supports a wide variety of use cases across many dimensions. Below is an introductory example of how it can be used to explore variation and bias on a very simple dataset with a single text column and a single variable.

1) Import 🕵️‍♀️ Variationist

We first import the main classes useful for computation and visualization as follows:

from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

A brief description of the classes follows:

  • Inspector (and InspectorArgs): orchestrates the analysis, from importing and tokenizing the data to calculating the metrics and creating outputs for each text column, variable, and combination thereof. It relies on InspectorArgs, a dataclass through which the user specifies the various arguments of the analysis.
  • Visualizer (and VisualizerArgs): orchestrates the creation of a variety of interactive charts showing up to five dimensions, based on the results and metadata from a prior analysis with Inspector. It relies on VisualizerArgs, a class storing the visualization-specific arguments.

2) Define and run the Inspector

We now inspect the data. For this example, we use a column text and just a single label variable (with the default nominal variable type and the default general variable semantics); however, note that 🕵️‍♀️ Variationist can seamlessly handle a potentially unlimited number of variables and up to two text columns during computation. We use npw_pmi as our association metric and rely on single tokens as our unit of information, using the default tokenizer. We also request some preprocessing steps (stopword removal in English and lowercasing). The output is stored in the res variable, but it can alternatively be serialized to a .json file for later use.

# Define the inspector arguments
ins_args = InspectorArgs(text_names=["text"], var_names=["label"], 
    metrics=["npw_pmi"], n_tokens=1, language="en", stopwords=True, lowercase=True)

# Run the inspector and get the results
res = Inspector(dataset="data.tsv", args=ins_args).inspect()

3) Define and run the Visualizer

Finally, we visualize the results. The visualizer currently handles the creation of interactive charts for more than 30 combinations of variable type and semantics, showing up to five dimensions, two of which are naturally fixed: the units (nominal) and their metric scores (quantitative). For this example, we write the results to the output folder charts in HTML format (i.e., the default and suggested one for the sake of interactivity).

# Define the visualizer arguments
vis_args = VisualizerArgs(output_folder="charts", output_formats=["html"])

# Create interactive charts for all metrics
charts = Visualizer(input_json=res, args=vis_args).create()

Optionally, interactive charts can be visualized in notebooks by just taking the object returned from the create() function. For instance, if the object is stored in a variable named charts, visualization is as simple as evaluating the following expression in the notebook: charts[$METRIC][$CHART_TYPE], where $METRIC is the metric of interest and $CHART_TYPE is a specific chart type associated with that metric.

Tutorials

You can find our tutorials to learn how to better leverage 🕵️‍♀️ Variationist in the examples/ folder.

There you can also find a set of interesting case studies using real-world datasets! 📈

Documentation

You can find more information on specific topics in the following documents:

  • Input dataset: from .tsv or .csv files to pandas dataframes and Hugging Face datasets
  • Units: from tokens and n-grams to co-occurrences with windows and duplicate handling
  • Tokenizers: from a whitespace tokenizer to Hugging Face tokenizers and custom ones
  • Variables: possible variable types and variable semantics, and their interdependence
  • Metrics: from basic statistics to lexical diversity, association metrics, and custom ones
  • Charts: from scatter charts to choropleth maps, from heatmaps to temporal line plots and others
  • Custom components: how to define your own components

Technical documentation for 🕵️‍♀️ Variationist is also available at: https://variationist.readthedocs.io/en/latest/.

Video

A short introductory video is available here.

Roadmap

🕵️‍♀️ Variationist aims to be as accessible as possible to researchers from a wide range of fields. We thus aim to provide the following features in future releases:

  • An easy-to-use graphical user interface to be installed locally or used through Hugging Face Spaces;
  • Extension of the unit concept to also cover linguistic aspects beyond the lexical level.

Citation

If you use 🕵️‍♀️ Variationist in your work, please cite our paper as follows:

@inproceedings{ramponi-etal-2024-variationist,
    title = "Variationist: Exploring Multifaceted Variation and Bias in Written Language Data",
    author = "Ramponi, Alan and Casula, Camilla and Menini, Stefano",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.33",
    pages = "346--354"
}

Contributors

alanramponi · ca-milla · stefanomenini

Issues

Empty tokens in whitespace tokenizer

The whitespace tokenizer adds an extra empty token at the end of each sentence:

E.g.
['After', 'finding', 'peculiar', 'key', 'three', 'smart', 'adventurous', 'kids', 'launch', 'quest', 'uncover', 'whereabouts', 'coveted', 'archaeological', 'treasure', '']
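A likely cause (this is a hypothetical reconstruction, not the library's actual code, and the function names below are illustrative): splitting on a literal space character keeps the empty string produced by trailing whitespace, while Python's argument-less str.split() splits on runs of whitespace and drops leading/trailing empties.

```python
def naive_whitespace_tokenize(text: str) -> list[str]:
    # Splitting on a literal space preserves empty tokens at the edges,
    # so a trailing space or newline yields a final '' token.
    return text.split(" ")


def fixed_whitespace_tokenize(text: str) -> list[str]:
    # split() with no argument splits on runs of whitespace and
    # discards leading/trailing empty strings.
    return text.split()


sentence = "three smart adventurous kids "  # note the trailing space
print(naive_whitespace_tokenize(sentence))  # ['three', 'smart', 'adventurous', 'kids', '']
print(fixed_whitespace_tokenize(sentence))  # ['three', 'smart', 'adventurous', 'kids']
```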

Handle label sparsity for temporal/spatial/quantitative using "granularity"

Some variables (e.g., dates in the standard Twitter format YYYY-MM-DDTHH:MM:SS.000Z) are likely to take a different value for each text, making the final results for the metrics sparse, uninformative, or even useless (e.g., a PMI rank for each exact datetime). It would make sense to run the computation on a given granularity instead (e.g., "year", "year-month", "year-month-day").

I strongly believe any preprocessing should be left to the user (we cannot handle every data variant the user might think of!), but datetimes are a fairly standard case (we can support 2-3 common formats and document them), and temporal aggregation seems like a feature users would appreciate!

Extra: the same principle can be applied to spatial data with coordinates, for which a given granularity would instead be an integer denoting kilometers. Using the Haversine formula and creating a set of bounding boxes based on the granularity, the results would make much more sense.

Note: this issue is complementary to the var_subsets feature; here we are still working on the variable values before determining which ones are of interest to the user.
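The temporal aggregation described above could be sketched as a simple mapping from a Twitter-style datetime string to a coarser bucket; the function name and granularity labels below are illustrative, not an actual API of the library:

```python
from datetime import datetime


def truncate_datetime(value: str, granularity: str) -> str:
    """Map a Twitter-style datetime (YYYY-MM-DDTHH:MM:SS.000Z) to a
    coarser bucket such as "year" or "year-month" (hypothetical sketch)."""
    dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    formats = {
        "year": "%Y",
        "year-month": "%Y-%m",
        "year-month-day": "%Y-%m-%d",
    }
    return dt.strftime(formats[granularity])


print(truncate_datetime("2024-08-14T09:30:15.000Z", "year-month"))  # 2024-08
```

With such bucketing, texts sharing a year-month would be grouped together, so metrics like PMI are computed over a handful of informative values rather than one value per exact timestamp.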

Add bi- & tri-gram support

Add support for n-grams (n>1) in tokenization, so metrics can be calculated on them in addition to single tokens.
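A minimal sketch of what n-gram extraction over a token sequence could look like (the helper name is illustrative, not the library's eventual API):

```python
def ngrams(tokens, n):
    # Zip n staggered views of the token list; each tuple is one n-gram.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["peculiar", "key", "three", "smart"]
print(ngrams(tokens, 2))
# [('peculiar', 'key'), ('key', 'three'), ('three', 'smart')]
```

With n=1 this reduces to single tokens (as 1-tuples), so the same code path could serve both the existing unigram case and the requested bi-/tri-grams.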
