
MP3: Implementing the PLSA Algorithm

DUE: OCT 23, 2022 at 11:59pm

In this MP, you will implement the PLSA algorithm discussed in lectures 9.7 and 9.8 of the Text Mining Coursera course. You are not required to implement the PLSA algorithm with a background model (we will run tests assuming the background model has not been implemented). You are provided with two data sets in the data folder: test.txt and dblp.txt. You can assume that each line is a separate document.

The test.txt data set contains 1000 lines, where each line is a document. The first 100 documents are labeled with the topic they belong to, either 0 (“Seattle”) or 1 (“Chicago”). Each of the first 100 documents is generated by sampling entirely from its labeled topic (i.e., generated from one topic only). The remaining 900 documents are generated from a mixture of the “Seattle” and “Chicago” topics. Run your code on this data set to check whether your EM algorithm returns reasonable results.

The dblp.txt data set is made up of research paper abstracts from DBLP. Each line is a document. Make sure to test your code on the simpler test.txt data set before running it on dblp.txt.
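As a quick sanity check, here is a minimal sketch for loading a data set (assuming the files live in the local data folder and, as stated above, that each line is one document; the variable names are illustrative, not part of the plsa.py skeleton):

```python
# A minimal sketch, assuming the data files live in a local "data" folder.
with open("data/test.txt") as f:
    documents = [line.strip().split() for line in f if line.strip()]

print(len(documents))      # number of documents (one per non-empty line)
print(documents[0][:10])   # first ten tokens of the first document
```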

You are provided with a code skeleton, plsa.py. Do not change anything in the __init__() function, but feel free to change anything in the main() function to test your code.

Some suggested tips:

  1. Using matrices is strongly encouraged (writing nested for-loops for the calculations is painful).
  2. The Python libraries numpy and scipy are useful for matrix-based calculations.

Writing the PLSA Algorithm:

The original PLSA model does not contain a background model. This MP is also based on the original PLSA model, so you do not have to worry about the background model. However, the lectures cover PLSA with a background model, so you should not attempt to map the formulas on the lecture slides directly to code. Instead, you need to adjust the formulas on the slides by assuming that there is zero probability that the background model is chosen, i.e., set λB to zero. If you do this, you will see that the E-step essentially only computes a distribution over the k topics for z=1, ..., k, given each word, i.e., p(z=i|w). The M-step is also simpler, since p(Z=B|w) is zero for all words (because λB=0). If you are still confused, please take a look at Section 3 of Chase Geigle's note on EM [2] and the note below.
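For reference, with λB = 0 the updates reduce to the standard PLSA EM updates. A sketch in generic notation follows (π_{d,j} is the coverage of topic j in document d, p(w|θ_j) is topic j's word distribution, and c(w,d) is the count of word w in document d; the lecture slides may use slightly different symbols):

```latex
% E-step: for each document d, word w, and topic j = 1, ..., k
p(z_{d,w} = j) = \frac{\pi_{d,j}\, p(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}\, p(w \mid \theta_{j'})}

% M-step: re-estimate topic coverage and topic word distributions
\pi_{d,j} \propto \sum_{w} c(w, d)\, p(z_{d,w} = j)
p(w \mid \theta_j) \propto \sum_{d} c(w, d)\, p(z_{d,w} = j)
```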

The main data structures involved in the implementation of this EM algorithm are three matrices:

  1. T (topics by words): this is the set of parameters characterizing topic content that we denoted by θi's. Each element is the probability of a particular word in a particular topic.

  2. D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denoted by pij's. Each element is the probability that a particular topic is covered in a particular document.

  3. Z (hidden variables): for every document, we need one Z that represents the probability that each word in the document was generated from a particular topic, so for any document this is a "word-by-topic" matrix encoding p(Z|w) for that document. Z is the matrix computed in the E-step (based on matrices T and D, which represent our parameters). Note that we need a different Z for each document, so we allocate one Z matrix per document. If we do so, the M-step simply uses all of these Z matrices, together with the word counts in each document, to re-estimate all the parameters, i.e., to update matrices T and D based on Z. Thus, at a high level, this is what happens in the algorithm (a minimal code sketch follows this list):

    • T and D are initialized.
    • E-step computes all Z's based on T and D.
    • M-step uses all Z's to update T and D.
    • We iterate until the likelihood doesn't change much, at which point we use T and D as our output. Note that the Z's are also very useful (can you imagine some applications of the Z's?).
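Below is a minimal numpy sketch of one EM iteration organized around the matrices T, D, and Z described above. The variable names, shapes, and the function em_iteration are illustrative assumptions, not part of the provided plsa.py skeleton; your implementation will differ in details such as how word counts are stored.

```python
import numpy as np

def em_iteration(counts, D, T):
    """One EM iteration. Shapes (illustrative assumptions, not the skeleton's):
         counts: (n_docs, n_words)   word counts c(w, d)
         D:      (n_docs, n_topics)  document-topic coverage p(topic | doc)
         T:      (n_topics, n_words) topic-word distributions p(word | topic)
    """
    T_new = np.zeros_like(T)
    D_new = np.zeros_like(D)

    for d in range(counts.shape[0]):
        # E-step for document d: Z[w, j] = p(z = j | w), proportional to D[d, j] * T[j, w].
        Z = T.T * D[d]                        # (n_words, n_topics)
        Z /= Z.sum(axis=1, keepdims=True)     # normalize over topics (guard against zeros in practice)

        # M-step contributions: weight each p(z | w) by the word count c(w, d).
        weighted = counts[d][:, np.newaxis] * Z   # (n_words, n_topics)
        T_new += weighted.T                       # accumulate over documents
        D_new[d] = weighted.sum(axis=0)           # accumulate over words

    # Renormalize so each row of T and each row of D sums to one.
    T_new /= T_new.sum(axis=1, keepdims=True)
    D_new /= D_new.sum(axis=1, keepdims=True)
    return D_new, T_new
```

The per-document loop is kept for clarity; with numpy you can also batch the E-step across all documents at once for speed.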

Resources:

[1] Cheng’s note on the EM algorithm.
[2] Chase Geigle’s note on the EM algorithm, which includes a derivation of the EM algorithm (see Section 4).
[3] Qiaozhu Mei’s note on the EM algorithm for PLSA, which includes a different derivation of the EM algorithm.

THIS MP IS CODING AND DEBUGGING HEAVY! PLEASE DON'T LEAVE IT TILL THE LAST MINUTE!
