Giter VIP home page Giter VIP logo

philosophy-nlp-text-classification's Introduction

title

Philosophy NLP & Text Classification

A book of philosophy represents an effort to systematically organize one's thought about the world. Using the data from the history of philosophy to classify texts thus enables us to, by proxy, classify how people think about the world. Where some projects focus on sentiment analysis, here we focus on conceptual, or ideological analysis.

This project uses 51 texts spanning 10 schools of philosophical thought Based on these, we develop classification models, word vectors, and general EDA. This can then be used to understand user's worldviews by comparing them to historical schools of thought. And once we understand a person's worldview, there is no limit to what we can do with that information - from advertising to political campaigning through to self-exploration and therapy. To see these tools in action, check out philosophydata.com

Repo Contents

Throughout the project, we implemented the OSEMN procedure for data science

Obtaining the data

Our data came from two primary sources: Project Gutenberg and our own collection of pdfs. These pdfs were transformed into txt files for easier processing. Overall, we had 51 books, 31 authors, and 10 schools of philosophy in our dataset.

Scrubbing the data

Working with books is hard! There were any number of textual irregularities, repetitive but meaningless phrases, and other oddities we had to correct for. This was especially true of the texts that came from pdfs, as many times headers, page numbers, or footnotes would be included in the main body, often interrupting sentences. To clean this up, we did extensive cleaning in our cleaning notebook. This included:

  • removing roman numerals and odd characters
  • removing sentences that were too short to have real meaning
  • removing sentences in non-English languages
  • consolidating different characters to one and the same (making sure '&' was 'and', for example)
  • building an extensive dictionary of textual corruptions and their corrected forms for ad hoc correction

Exploring the data

We explored the data in our preliminary EDA steps. There, we generated word frequency tables, bigram frequency tables, and word clouds. The word clouds, for example, are here:

title

Our w2v work also helped explore the data as it brought to light a lot of oddities in the texts (odd words correlated with normal ones) and helped us see how words were used in the corpus.

In the course of our work on the dash app, we developed visualization functions that could generate basic textual statistics for every school, author, and title in the corpus.

Modeling the data

Our work on modeling took two primary forms - Bayesian models and deep NLP. Overall our aim was to maximize accuracy. We were able to attain 77% accuracy with Bayesian models and just shy of 80% accuracy with neural networks utilizing LSTM. A confusion matrix is presented below.

title

Our w2v models, based on pre-trained GloVe vectors and then trained on our corpus, were able to create meaningful and illuminating vectors for each school and for the corpus as a whole. Below, for example, is a comparison of the base GloVe vector's interpretation of 'philosophy', as compared to the trained model's interpretation of the same.

title

Note that 'metaphysics' is a term used by non-philosophers to describe things like astrology and spiritualism. Philosophers, however, use it in a more specific way where it becomes a key concept of the discipline. Thus our trained vectors identify metaphysics and philosophy where the base GloVe model does not.

Interpreting the data

What does all this mean? Well, for one thing, it means that given any specific body of text (the larger the better) we can correctly identify the philosophical leanings of the author 80% of the time. This means that we can tell whether they value reason (analytic philosophy) or think it is forever limited (continental philosophy), whether they prefer their senses (empiricism) or conceptual analysis (rationalism), or whether they are left-leaning economically (communism) or right-leaning (capitalism).

With this knowledge, we can then make sure to speak to our audiences in the ways that they will understand and identify with. Don't produce detailed logical arguments for people who value their gut instincts, don't go on and on about conceptual definitions when your clients want to hear the facts and figures.

The model can be used to identify the philosophical leanings of potential clients, customers, or employees (in the classification phase) and to better understand those leanings (in the exploration phase, via w2v analysis). This can facilitate more productive conversations, better customer relations, and stronger employee productivity.

Conclusion & Next Steps

Overall, we were able to attain about 80% accuracy over 10 classes - a solid result. Our w2v models gave clear pictures of their respective schools, though they were less helpful at the level of authors since they had less text to go on there. The data was visualized and made available for use in classification and exploration on a dash app. The repo for that app is here.

The solid accuracy scores and wide-open potential applications of the model made the project seem like a promising one for further development. Our next steps include:

  • working with the Twitter API to build a function that can take a Twitter user's handle and generate a classification for that user.
  • adding additional texts to the corpus, especially those associated with Stoicism, eastern philosophy (Confucianism, Taoism, Buddhism), and contemporary political movements.
  • building a recommender system to recommend books or other media based on a user's philosophical leanings.

Thank you for your interest. If you are interested in contributing to the project or have questions about it, please feel free to email me at [email protected] or to reach out on Twitter - @aslidsiksoraksi.

philosophy-nlp-text-classification's People

Contributors

kcalizadeh avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.