arabica's Introduction

pypi License: MIT

Arabica

Python package for exploratory text data analysis

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include social media conversations, product reviews, research metadata, central bankers’ communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing:

  • Descriptive n-gram analysis: n-gram frequencies
  • Time-series n-gram analysis: n-gram frequencies over a period
  • Text visualization: n-gram heatmap, line plot, word cloud
  • Sentiment analysis: VADER sentiment classifier
  • Financial sentiment analysis: with FinVADER
  • Structural breaks identification: Jenks Optimization Method

It automatically strips punctuation from the input text. It can also apply all or a selected combination of the following cleaning operations:

  • Remove digits from the text
  • Remove the standard list(s) of stopwords
  • Remove an additional list of stop words

Arabica works with texts of languages based on the Latin alphabet, uses cleantext for punctuation cleaning, and enables stop words removal for languages in the NLTK corpus of stopwords.
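The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not Arabica's implementation: the function name `clean_text`, its parameters, and the tiny `EXAMPLE_STOPWORDS` set are assumptions made for the example (Arabica itself relies on cleantext and the NLTK stop word corpus).

```python
import re
import string

# Hypothetical stop word list for illustration only;
# Arabica draws its lists from the NLTK corpus of stopwords.
EXAMPLE_STOPWORDS = {"the", "a", "is", "of", "and"}


def clean_text(text, remove_digits=True, stopwords=EXAMPLE_STOPWORDS, skip=()):
    """Strip ASCII punctuation, then optionally digits and stop words."""
    # Punctuation is always removed, mirroring the automatic step.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        text = re.sub(r"\d+", "", text)
    # Merge the standard list with the additional user-supplied words.
    drop = {w.lower() for w in stopwords} | {w.lower() for w in skip}
    tokens = [t for t in text.split() if t.lower() not in drop]
    return " ".join(tokens)


print(clean_text("The price of coffee rose 12% in 2021, analysts say.", skip=["say"]))
# prints "price coffee rose in analysts"
```

Passing the extra words through a separate `skip` argument keeps the standard list untouched, which is the same split the package makes between its `stopwords` and `skip` parameters.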

It reads dates in the following date and datetime formats:

  • US-style: MM/DD/YYYY (2013-12-31, Feb-09-2009, 2013-12-31 11:46:17, etc.)
  • European-style: DD/MM/YYYY (2013-31-12, 09-Feb-2009, 2013-31-12 11:46:17, etc.)
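The difference between the two styles is the order of day and month, which can be illustrated with `datetime.strptime`. The sketch below is an assumption-laden example rather than Arabica's parser: the `FORMATS` table and the `parse_date` helper are hypothetical and cover only the sample strings quoted above. Month abbreviations such as `Feb` assume an English locale.

```python
from datetime import datetime

# Hypothetical format table covering only the examples in this README.
FORMATS = {
    "us": ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%b-%d-%Y"],
    "eur": ["%Y-%d-%m %H:%M:%S", "%Y-%d-%m", "%d-%b-%Y"],
}


def parse_date(value, date_format):
    """Try each known pattern for the chosen style and return a datetime."""
    for fmt in FORMATS[date_format]:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # Pattern did not match; try the next one.
    raise ValueError(f"Unrecognized {date_format!r} date: {value!r}")


print(parse_date("2013-12-31", "us"))    # 2013-12-31 00:00:00
print(parse_date("2013-31-12", "eur"))   # 2013-12-31 00:00:00
print(parse_date("09-Feb-2009", "eur"))  # 2009-02-09 00:00:00
```

A string such as `2013-05-06` parses under both styles but yields different dates, which is why the style has to be stated explicitly.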

Installation

Arabica requires Python 3.8 - 3.10 and the following packages:

  • NLTK: stop words removal
  • cleantext: text cleaning
  • wordcloud: word cloud visualization
  • plotnine: heatmaps and line graphs
  • matplotlib: word clouds and graphical operations
  • vaderSentiment: sentiment analysis
  • finvader: financial sentiment analysis
  • jenkspy: breakpoint identification

To install using pip, use:

pip install arabica

Usage

  • Import the library:
from arabica import arabica_freq
from arabica import cappuccino
from arabica import coffee_break 
  • Choose a method:

arabica_freq enables a specific set of cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period.

def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 date_format: str,         # Date format: 'eur' - European, 'us' - American
                 time_freq: str,           # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int,           # Maximum of most frequent n-grams displayed for each period
                 stopwords: [],            # Languages for stop words
                 skip: [],                 # Remove additional stop words
                 numbers: bool = False,    # Remove numbers
                 lower_case: bool = False  # Lowercase text
) 
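To make "n-gram frequencies over a period" concrete, here is a minimal standard-library sketch of the idea. It is not how arabica_freq is implemented: the helper names `ngrams` and `ngram_freq_by_period`, the `(period, text)` input shape, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_period(records, n=1, max_words=2):
    """Count n-grams per period and keep the max_words most frequent ones.

    records: iterable of (period_label, text) pairs.
    Returns {period_label: [(ngram, count), ...]} sorted by period.
    """
    counts = defaultdict(Counter)
    for period, text in records:
        counts[period].update(ngrams(text.lower().split(), n))
    return {p: c.most_common(max_words) for p, c in sorted(counts.items())}


# Hypothetical sample data: one label per aggregation period.
records = [
    ("2021", "coffee prices rise"),
    ("2021", "coffee prices fall"),
    ("2022", "tea prices rise"),
]
result = ngram_freq_by_period(records, n=2, max_words=1)
print(result["2021"])  # [('coffee prices', 2)]
```

The `max_words` cut-off here plays the same role as the parameter of that name in the signature above: it limits how many of the most frequent n-grams are reported for each period.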

cappuccino enables cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.

def cappuccino(text: str,                # Text
               time: str,                # Time
               date_format: str,         # Date format: 'eur' - European, 'us' - American
               plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,           # Maximum of most frequent n-grams displayed for each period
               stopwords: [],            # Languages for stop words
               skip: [],                 # Remove additional stop words
               numbers: bool = False,    # Remove numbers
               lower_case: bool = False  # Lowercase text
)

coffee_break provides sentiment analysis and breakpoint identification in aggregated time series of sentiment. The implemented models are:

  • VADER is a lexicon and rule-based sentiment classifier attuned explicitly to general language expressed in social media

  • FinVADER improves VADER's classification accuracy on financial texts, including two financial lexicons

Break points in the time series are identified with the Fisher-Jenks algorithm (Jenks, 1977. Optimal data classification for choropleth maps).
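The optimization behind Fisher-Jenks can be sketched as a small dynamic program: sort the values, then choose class boundaries that minimize the total within-class sum of squared deviations. The pure-Python version below is an illustrative sketch for small inputs only. It is not the jenkspy or Arabica implementation, and the function name `jenks_breaks`, the `n_classes` parameter, and the sample values are assumptions.

```python
def jenks_breaks(values, n_classes):
    """Return the upper bound of each class in an optimal 1-D partition."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] from its own mean."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: minimal total SSD when data[:j] is split into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # last class is data[i:j]
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk back through the stored split points to recover class bounds.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


# Hypothetical aggregated sentiment scores forming three clear groups.
scores = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90, 0.95]
print(jenks_breaks(scores, 3))  # [0.12, 0.52, 0.95]
```

This version runs in cubic time in the worst case, so it is meant only to show what is being optimized; a production library would be the practical choice for real data.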

def coffee_break(text: str,                 # Text
                 time: str,                 # Time
                 date_format: str,          # Date format: 'eur' - European, 'us' - American
                 model: str,                # Sentiment classifier, 'vader' - general language, 'finvader' - financial text                
                 skip: [],                  # Remove additional stop words
                 preprocess: bool = False,  # Clean data from numbers and punctuation
                 time_freq: str,            # Aggregation period: 'Y'/'M'
                 n_breaks: int              # Number of breakpoints: min. 2
)
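The "aggregated time series of sentiment" that breakpoints are computed on can be pictured as a per-period average of document-level scores. The sketch below assumes the scores already exist (for example, one compound score per text) and only shows the aggregation step. The function `aggregate_sentiment`, its input shape, and the sample values are hypothetical and not taken from Arabica.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_sentiment(rows, time_freq="Y"):
    """Average sentiment scores per year ('Y') or per month ('M').

    rows: iterable of (datetime, score) pairs.
    Returns {period_label: mean_score} sorted by period.
    """
    label_format = {"Y": "%Y", "M": "%Y-%m"}[time_freq]
    buckets = defaultdict(list)
    for timestamp, score in rows:
        buckets[timestamp.strftime(label_format)].append(score)
    return {period: mean(scores) for period, scores in sorted(buckets.items())}


# Hypothetical document-level scores.
rows = [
    (datetime(2020, 1, 5), 0.5),
    (datetime(2020, 6, 1), -0.5),
    (datetime(2021, 3, 1), 0.25),
]
print(aggregate_sentiment(rows, "Y"))  # {'2020': 0.0, '2021': 0.25}
```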

Documentation, examples and tutorials

For more coding examples, see these tutorials:

General use:

  • Sentiment Analysis and Structural Breaks in Time-Series Text Data
  • Visualization Module in Arabica Speeds Up Text Data Exploration
  • Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis

Applications:

  • Business Intelligence: Customer Satisfaction Measurement with N-gram and Sentiment Analysis
  • Research meta-data analysis: Research Article Meta-data Description Made Quick and Easy
  • Media coverage text mining
  • Social media analysis

💬 Please visit here for any questions, issues, bugs, and suggestions.

Citation

Using arabica in a paper or thesis? Please cite this paper:

@article{Koráb:2024,
  author   = {Koráb, P. and Poměnková, J.},
  title    = {Arabica: A Python package for exploratory analysis of text data},
  journal  = {Journal of Open Source Software},
  volume   = {9},
  number   = {97},
  pages    = {6186},
  year     = {2024},
  doi      = {10.21105/joss.06186},
}

arabica's Issues

Mizani v0.10.0

It appears that in the new update of the Mizani package, "multitype_sort" was removed. This causes an import error when I try to pip install the package. I can successfully run the package if I downgrade Mizani to v0.9.2.

Does this support Arabic?!!

Does this package support analysis of text data in the Arabic language?! I came in here thinking that it must!! And if not, don't you think the package name will be a major source of confusion?!!

Missing modules after install

I tried installing a few times with pip install arabica (Windows). There were no errors, but some modules are missing, for example coffee_break. Here are the modules that got installed:

11/12/2023  09:54    <DIR>          .
11/12/2023  10:21    <DIR>          ..
11/12/2023  09:54            10,748 arabica_freq.py
11/12/2023  09:54            23,228 cappuccino.py
11/12/2023  09:54             1,684 clean_ngram.py
11/12/2023  09:54               146 clean_numbers.py
11/12/2023  09:54             1,850 group.py
11/12/2023  09:54               460 preprocess.py
11/12/2023  09:54               449 stopwords.py
11/12/2023  09:54                77 __init__.py
11/12/2023  09:54    <DIR>          __pycache__

Automated tests

Requirement for the JOSS journal. Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
