TF-IDF, Sigma and Other (Experimental) Texts Analysing Tools

TF-IDF and Sigma analysis tools written in Python. Results are written to *.xlsx spreadsheets for detailed analysis.

TF-IDF analysis detects the most "important" words in a given text of a text corpus (a set of articles, etc.). These "important" words are those that occur frequently in a particular document but rarely in the other documents of the same corpus.
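To make the idea concrete, here is a minimal sketch of the textbook TF-IDF formula (term frequency times the logarithm of inverse document frequency). It is not necessarily the exact weighting used by tf-idf.py; the function name `tf_idf` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: score} dict per document.

    `corpus` is a list of documents, each given as a list of tokens.
    Hypothetical helper for illustration, not taken from the repository.
    """
    n_docs = len(corpus)

    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = share of the document taken by the word,
            # idf = log(number of documents / documents containing the word)
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy corpus (made up for this example).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
]
ranks = tf_idf(corpus)

# Print the three highest-ranked words of the first document.
for word, score in sorted(ranks[0].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {score:.3f}")
```

A word such as "the", which appears in every document, gets a score of zero under this formula, while words unique to one document rank highest.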

While TF-IDF analysis suits a set of articles, Sigma analysis finds the most "important" words in a single, usually large, text (a book, a long document, etc.).
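This README does not spell out the Sigma formula, so the sketch below is an assumption: it uses one common formulation, in which a word's score is the standard deviation of the gaps between its consecutive occurrences divided by the mean gap. Words that cluster in certain parts of a text get a high score, while evenly spread words score close to zero. The function name `sigma_scores` and the `min_count` threshold are hypothetical.

```python
import statistics
from collections import defaultdict


def sigma_scores(tokens, min_count=3):
    """Score words by how unevenly they are spread through `tokens`.

    Assumed formulation: population standard deviation of the gaps between
    consecutive occurrences, normalized by the mean gap. Words seen fewer
    than `min_count` times are skipped because they have too few gaps.
    """
    # Record every position at which each word occurs.
    positions = defaultdict(list)
    for index, word in enumerate(tokens):
        positions[word].append(index)

    scores = {}
    for word, where in positions.items():
        if len(where) < min_count:
            continue
        gaps = [b - a for a, b in zip(where, where[1:])]
        scores[word] = statistics.pstdev(gaps) / statistics.mean(gaps)
    return scores


# Toy text (made up): "dragon" is clustered, "the" is spread almost evenly.
text = ["the", "a"] * 10 + ["dragon"] * 3 + ["the", "b"] * 10 + ["dragon"]
scores = sigma_scores(text)

for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word}: {scores[word]:.2f}")
```

In this toy text the clustered word "dragon" ranks above the evenly spread "the", which is the behavior the method relies on to pick out topical words in a long text.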

There are a couple of more advanced scripts:

  • Matrix output for Gephi in gephi.py. A sample output file, gephi.csv, is included in this repository.
  • Horizontal visibility graph building with hor-vis-graph.py. A couple of sample files are included in the hor-vis-graph/ directory.
  • Other experiments (see below)
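For readers unfamiliar with horizontal visibility graphs, the following is a minimal sketch of the standard construction on a numeric series: two points are linked when every value strictly between them is lower than both. How hor-vis-graph.py turns a text into such a series is not described here, so the input series and the function name `horizontal_visibility_edges` are assumptions for illustration.

```python
def horizontal_visibility_edges(series):
    """Return the edges (i, j), i < j, of the horizontal visibility graph.

    Points i and j are connected when every value strictly between them
    is lower than min(series[i], series[j]).
    """
    edges = []
    n = len(series)
    for i in range(n - 1):
        highest_between = float("-inf")  # tallest value seen between i and j
        for j in range(i + 1, n):
            if highest_between < min(series[i], series[j]):
                edges.append((i, j))
            highest_between = max(highest_between, series[j])
            # Once a value at least as tall as series[i] appears,
            # nothing further to the right is visible from i.
            if highest_between >= series[i]:
                break
    return edges


# Toy series (made up for this example).
edges = horizontal_visibility_edges([3, 1, 2, 4])
print(edges)  # [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
```

An edge list of this kind can be written out as a CSV file and loaded into Gephi for visualization.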

Preview / Examples

Experimental semantic network builder (main concepts from this article):

Graph

TF-IDF applied to a corpus of news articles:

Excel Spreadsheet

Sigma method applied to the book "The Hunger Games":

Excel Spreadsheet

Analysis of an article about Putin with a horizontal visibility graph, against a corpus of other articles:

Graph

Usage

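  1. Install Python 3, clone the repository, and enter the repository directory with cd edu-tf-idf.
  2. Install the required dependencies: pip3 install -r requirements.txt.
  3. Place the texts to analyze in the texts/ directory (a couple are already there).
  4. Run an analyzer, for example with the py tf-idf.py command (there are several scripts; see the examples below). py is the Python launcher on Windows; on other systems, python3 may be needed instead.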

Example

TF-IDF: Run the program (by default, picks texts from texts/news):
py tf-idf.py

Result:

Reading texts...
Done! Computing TF-IDF ranks...
Progressing text 2225/2225
Done! Writing results...
Writing worksheet 2225/2225
Done!

The output goes to the tf-idf.xlsx file, ready for analysis.

Sigma method (by default, picks texts from texts/books):
py sigma.py

The result goes to the sigma.xlsx file.

Horizontal Visibility Graph (exports to the hor-vis-graph/ directory, picks texts from texts/news):
py hor-vis-graph.py

Check the result in the hor-vis-graph/ directory and visualize it using Gephi.

Individual Text Analysis

Run the experimental semantic network builder with:

py analyze_text.py texts/news/tech/001.txt

Check the result in the analyzed/<text-title> directory and visualize it using Gephi.

License

MIT © Nikita Savchenko
