
topic-modelling-map-reduce-algorithms-lsi-lda-and-hdp's Introduction

NLP-TOPIC-MODELLING-

RNN, LSTM and multiple topic-modelling map-reduce algorithms

Set up a virtual environment and install all packages listed in nlp_requirements.txt:

  1. cd to the directory where nlp_requirements.txt is located
  2. Activate your virtualenv, e.g. source bin/activate
  3. Run pip install -r nlp_requirements.txt in your shell

PRE-PROCESSING:

  1. Batch Processing (Time Series Preprocessing.py): Datasets arrive in different formats, including huge continuous text and time series data. (Sometimes the data format must first be converted, e.g. from core pandas Series, BigQuery data and various databases.)
     • Text data can be split into separate files of fewer than 1 lakh (100,000) words each.
     • Time series data is divided by year, month, week or day, according to the data size.
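The two batching rules above can be sketched with the standard library alone. This is a minimal illustration, not the contents of Time Series Preprocessing.py: the function names `split_text` and `bucket_by_period`, the `(date, value)` record layout and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

MAX_WORDS = 100_000  # roughly "1 lakh" words per chunk (assumed default)


def split_text(text, max_words=MAX_WORDS):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def bucket_by_period(records, period="month"):
    """Group (date, value) records by year, month, ISO week, or day."""
    def key(d):
        if period == "year":
            return (d.year,)
        if period == "month":
            return (d.year, d.month)
        if period == "week":
            iso = d.isocalendar()
            return (iso[0], iso[1])  # ISO year and ISO week number
        if period == "day":
            return (d.year, d.month, d.day)
        raise ValueError(f"unknown period: {period}")

    buckets = defaultdict(list)
    for d, value in records:
        buckets[key(d)].append(value)
    return dict(buckets)


# Hypothetical sample data, only to show the shape of the results.
chunks = split_text("one two three four five", max_words=2)
records = [(date(2018, 1, 5), "a"), (date(2018, 1, 20), "b"), (date(2018, 2, 1), "c")]
monthly = bucket_by_period(records, "month")
```

Each chunk could then be written to its own file, and the period would be chosen so that each bucket stays a manageable size.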

  2. NLP (Word Processing.py)
     • Tokenization: Divide each sentence into smaller parts (tokens) and remove punctuation.
     • Lemmatization: Reduce each word to its root form. (Stemming does this by cutting off the ends and beginnings of words.)
     • Removing Stop-Words: Remove common words such as prepositions from the tokens.
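The three steps above can be sketched in a few lines. This is an illustration under loud assumptions: the stop-word list is a tiny hypothetical one, and `crude_stem` is a naive suffix stripper standing in for a real lemmatizer (a library such as NLTK or spaCy would normally be used). None of these names come from Word Processing.py.

```python
import re

# Tiny illustrative stop-word list (hypothetical); real projects usually
# take a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "for", "with"}

# Naive suffix rules standing in for a real lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(sentence):
    """Lower-case the text and keep alphanumeric runs, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def crude_stem(token):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(sentence):
    """Tokenize, drop stop-words, then reduce each remaining token."""
    return [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]


tokens = preprocess("Traders are buying tokens in the Bitcoin markets!")
print(tokens)  # ['trader', 'buy', 'token', 'bitcoin', 'market']
```

Suffix stripping will mangle some words (for example, irregular plurals), which is why a dictionary-based lemmatizer is usually preferred for real data.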

  3. Map-Reduce Topic Modelling (Topic Modelling.py): Label unlabelled text through topic modelling by extracting the most frequent tokens and those with the highest embedding values.
     • Dictionary: The set of all tokens in the documents, stored in gensim dictionary format.
     • Corpus: The word-embedding step, which converts each document's words to vector form.

1] LSI Model: The top 10 (or N) topics are considered and represented as concepts or terms in graphical format, e.g. for the Bitcoin Tweets dataset:

[(0, '0.802*"bitcoin" + 0.257*"bitcoin" + 0.257*"blockchain" + 0.248*"crypto" + 0.233*"cryptocurrency" + 0.203*"ethereum" + 0.164*"SCREEN_NAME" + 0.070*"airdrop" + 0.056*"token" + 0.055*"price"'),
 (1, '0.377*"bitcoin" + -0.307*"token" + -0.270*"bitcoin" + -0.269*"freetoken" + -0.263*"30;000" + -0.263*"15;000" + -0.261*"worth" + -0.251*"blockchain" + -0.229*"crypto" + -0.202*"airdrop"'),
 (2, '0.344*"bitcoin" + -0.288*"SCREEN_NAME" + 0.264*"30;000" + 0.264*"15;000" + -0.262*"blockchain" + 0.262*"worth" + 0.251*"freetoken" + -0.241*"ethereum" + -0.225*"bitcoin" + 0.215*"token"'),
 (3, '-0.855*"SCREEN_NAME" + -0.211*"airdrop" + -0.167*"cybersecurity" + 0.157*"blockchain" + -0.155*"bounty" + 0.149*"crypto" + -0.098*"token" + 0.095*"price" + -0.085*"freetoken" + -0.079*"altcoin"'),
 (4, '0.476*"ratio" + -0.391*"cybersecurity" + 0.257*"hitbtc" + .236*"SCREEN_NAME" + -0.196*"cryptocurrency" + -0.193*"ethereum" + -0.186*"airdrop" + 0.183*"price" + 0.159*"trading" + 0.150*"arbitraj"'),
 (5, '0.563*"ratio" + 0.279*"hitbtc" + 0.246*"ethereum" + 0.227*"cybersecurity" + -0.194*"trading" + -0.181*"crypto" + 0.175*"arbingtool" + 0.175*"arbitraj" + 0.174*"arbitrage" + 0.173*"cryptocurrency"'),
 (6, '0.407*"bitcoin" + -0.385*"escort" + -0.254*"costarica" + -0.229*"mexico" + -0.207*"guatemala" + -0.207*"\u2605\u2605\u2605" + -0.195*"cryptocurrency" + -0.177*"nowplaying" + -0.171*"\u279c" + -0.171*"blockchain"'),
 (7, '-0.271*"stock" + -0.262*"investment" + -0.250*"makemoneyonline" + -0.246*"trading" + -0.236*"workfromhome" + -0.235*"block" + -0.231*"internetmarketing" + -0.231*"strategy" + -0.230*"affiliate" + -0.229*"paypal"'),
 (8, '0.676*"airdrop" + -0.251*"cybersecurity" + -0.228*"blockchain" + 0.225*"javatoken" + -0.159*"SCREEN_NAME" + 0.144*"crypto" + 0.141*"bounty" + 0.115*"joining" + 0.114*"limited" + 0.114*"jtokens"'),
 (9, '0.491*"crypto" + -0.370*"bitcoin" + 0.220*"cybersecurity" + 0.213*"fintech" + 0.190*"cryptocurrency" + -0.184*"escort" + 0.176*"market" + 0.175*"wallet" + -0.166*"\u2605\u2605\u2605" + 0.158*"money2020"')]

2] LDA Model: Discovers topics from documents, treating each topic as an independent distribution. A disadvantage of this model is that the optimal solution is not reached in a small number of iterations. Example: three datasets of tweets were analysed, with outputs in HTML format in the Bitcoin LDA, Facebook LDA and Etherium LDA files.

3] HDP Model: All topics are considered and learned from the tokens to improve the result. It is an advanced version of the LDA model, used to find the exact concepts and terms discussed in huge documents. Outputs are in image format.

*Coherence Value: Used to find the value of N, where N is the number of topics considered in topic modelling.

topic-modelling-map-reduce-algorithms-lsi-lda-and-hdp's People

Contributors

abhishekeb211
