text_mining_at_lasi19's Introduction

LASI'19 workshop on Text Mining for Learning Content Analysis

This repository stores materials for the Text Mining for Learning Content Analysis workshop organized at the Learning Analytics Summer Institute 2019 (LASI'19), held at the University of British Columbia, Vancouver, Canada, on June 17-19, 2019.

The stored R scripts cover 3 topics:

  • General text mining (TM) workflow, exemplified through a binary text classification task. It covers the overall TM process: text preprocessing, building a few different classification models, and testing the best model. Scripts covering this topic:

    • preprocess_20News_dataset.R
    • newsgroup_classifier.R
    • tm_utils.R
  • Introduction to word vectors (word embeddings). The aim is to become familiar with the notion of word vectors by exploring a pre-built word vector model, specifically a GloVe model with 300 dimensions. The t-SNE dimensionality reduction technique is used to visualize word vectors in 2D space. Relevant scripts are:

    • exploring_word_vectors.R
    • word_vec_utils.R
  • Using word vectors for text classification. This covers two ways of using a pre-built word vector model to create input for a classification algorithm: i) using a weighted average of word vectors to form document vectors; ii) using Word Mover's Distance to compute the similarity of documents based on their word vectors. The pre-built GloVe model introduced in topic 2 is used in this topic as well. Scripts that cover this topic:

    • newsgroup_GloVe_classifier.R
    • tm_utils.R
    • word_vec_utils.R
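
To make option (i) of the third topic more concrete, here is a minimal sketch of building document vectors as a weighted average of word vectors. The workshop scripts are written in R; this sketch uses plain Python purely for illustration and is not taken from them. The toy 3-dimensional vectors, the IDF-style weighting, and the names `TOY_VECTORS`, `idf_weights`, `doc_vector` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy 3-dimensional "word vectors"; the values are invented for illustration
# (the GloVe model used in the workshop has 300 dimensions).
TOY_VECTORS = {
    "engine": [0.9, 0.1, 0.0],
    "car":    [0.8, 0.2, 0.0],
    "orbit":  [0.0, 0.1, 0.9],
    "rocket": [0.1, 0.0, 0.8],
}

def idf_weights(docs):
    """Assumed weighting scheme: log(N / document frequency) + 1 per word."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}

def doc_vector(tokens, vectors, weights):
    """Weighted average of the vectors of all in-vocabulary tokens."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # out-of-vocabulary words are skipped
        w = weights.get(tok, 1.0)
        total = [t + w * v for t, v in zip(total, vectors[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known words: all-zero vector
    return [t / weight_sum for t in total]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three tiny tokenized "documents"; "the" has no vector and is ignored.
docs = [["car", "engine", "the"], ["rocket", "orbit"], ["car", "rocket"]]
weights = idf_weights(docs)
doc_vecs = [doc_vector(d, TOY_VECTORS, weights) for d in docs]
```

Each resulting document vector has the same length as the word vectors, so the rows of `doc_vecs` can serve directly as feature vectors for a classifier. In this toy data, the first document ends up closer (by cosine similarity) to the third, with which it shares "car", than to the second.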

Note also that some prebuilt models are available in the 'models' folder. They are made available so that we do not need to wait for models to build during the workshop.

The first and third topics are based on the 20 Newsgroups dataset. This dataset, widely used in text mining tasks and benchmarks, is a collection of approximately 20,000 newsgroup documents (forum posts), partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The CSV files in the data/20news folder are derived from this dataset (subsetted and pre-processed).
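
As a rough illustration of the kind of preprocessing the first topic starts with, the sketch below turns raw posts into a small document-term matrix. It is written in plain Python for illustration only (the workshop scripts are in R) and does not reproduce what `preprocess_20News_dataset.R` does. The stopword list, the tokenization rule, the example posts, and the function names are assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "for", "my"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords and one-letter tokens."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower())
            if tok not in STOPWORDS and len(tok) > 1]

def document_term_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in texts[i]."""
    token_lists = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Two invented posts standing in for newsgroup documents.
posts = [
    "The engine of my car is LOUD!",
    "Is the shuttle launch delayed? The launch window is short.",
]
vocab, dtm = document_term_matrix(posts)
```

Here `vocab` holds the sorted distinct terms and each row of `dtm` holds one post's term counts. A classifier would be trained on such rows, usually after further weighting such as TF-IDF.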

Slides that introduce relevant concepts and methods are available at the links given below. The slides also cover some recent research work in Learning Analytics that was either partially or fully based on TM methods and techniques.

If interested in learning more, you may want to check materials from the previous edition of this workshop, held at LASI'18.
