domainadaption

0. Available data

Amazon product reviews in four categories: books, dvd, electronics, and kitchen & housewares.
Each category contains 1000 positive reviews, 1000 negative reviews, and a varying number of unlabeled reviews.
Data is available here.

├── sorted_data_acl/
    ├── books/
    │   ├── negative.review
    │   ├── positive.review
    │   ├── unlabeled.review
    ├── dvd/
    │   ├── negative.review
    │   ├── positive.review
    │   ├── unlabeled.review
    ├── electronics/
    │   ├── negative.review
    │   ├── positive.review
    │   ├── unlabeled.review
    ├── kitchen_&_housewares/
        ├── negative.review
        ├── positive.review
        ├── unlabeled.review

1. Create Embeddings

1.1. Run preprocess_dataset_for_embeddings.py

This will create a reviews_forEmbedding.txt file in each category folder. The file contains all reviews (positive, negative and unlabeled) of that category, with one sentence of a review per line. The sentences contain no punctuation or other special characters.
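
The cleaning step described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function name `review_to_sentences` is hypothetical, sentence splitting on `.`, `!` and `?` is a naive heuristic, and lowercasing is an assumption that the README does not state.

```python
import re


def review_to_sentences(review_text):
    """Split a review into cleaned sentences (hypothetical helper)."""
    # Naive sentence split on runs of '.', '!' or '?'.
    raw_sentences = re.split(r"[.!?]+", review_text)
    cleaned = []
    for sentence in raw_sentences:
        # Assumption: lowercase, then replace everything that is not a
        # letter, digit or whitespace with a space.
        sentence = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
        # Collapse runs of whitespace into single spaces.
        sentence = " ".join(sentence.split())
        if sentence:
            cleaned.append(sentence)
    return cleaned
```

Each returned sentence would then be written on its own line of reviews_forEmbedding.txt.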

1.2. Run sorted_data_acl/merge_reviews.sh

This will merge the reviews_forEmbedding.txt files of all categories into one file and store it in the sorted_data_acl/all/ folder.
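
The repository does this step with a shell script. For readers who prefer Python, an equivalent merge might be sketched as follows; the function name `merge_review_files` and its parameters are hypothetical and not part of the repository.

```python
from pathlib import Path


def merge_review_files(root, categories, filename="reviews_forEmbedding.txt"):
    """Concatenate one file per category into <root>/all/<filename>."""
    root = Path(root)
    out_dir = root / "all"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as out:
        for category in categories:
            with (root / category / filename).open(encoding="utf-8") as src:
                for line in src:
                    # Make sure every sentence ends with a newline so that
                    # files without a trailing newline do not fuse lines.
                    out.write(line if line.endswith("\n") else line + "\n")
    return out_path
```

It would be called with the dataset root and the list of category folder names, for example `merge_review_files("sorted_data_acl", ["books", "dvd", "electronics"])`.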

1.3. Run create_word_embeddings.sh

This will create word embeddings for each category (including all) of the reviews using GloVe. In particular, this creates the following 4 files in each category folder:

  • reviews.vocab: word count per category in the format word -> count
  • reviews.cooccur: co-occurrence matrix of words
  • reviews.cooccur.shuf: shuffled co-occurrence matrix
  • reviews.vectors.txt: word embeddings per category in the format word -> vector

1.4. Run build_embedding_dictionary.py

This will create a Python dictionary in the format word -> vector from each reviews.vectors.txt file and store it in a corresponding reviews.vectors.pkl file.
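
A minimal sketch of this conversion is shown below. It assumes the usual GloVe text output, where each line holds a word followed by its space-separated vector components. The function names are hypothetical, and vectors are kept as plain lists of floats here; the actual script may well store NumPy arrays instead.

```python
import pickle


def load_glove_vectors(path):
    """Read a GloVe text file into a dict mapping word -> list of floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def save_embeddings(embeddings, path):
    """Pickle the word -> vector dictionary."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)
```

For one category this would be `save_embeddings(load_glove_vectors("reviews.vectors.txt"), "reviews.vectors.pkl")`.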

2. Transform Text Reviews into Embedded Reviews

2.1. Run preprocess_dataset.py

This will create the files reviews_positive.txt, ratings_positive.txt, reviews_negative.txt and ratings_negative.txt in each category folder. The files contain the respective reviews and ratings, with one review or rating per line.
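
The extraction could be sketched as follows. Note the assumption: the README does not describe the layout of the raw .review files, so this sketch assumes pseudo-XML records with `<review>`, `<rating>` and `<review_text>` tags. The function name is hypothetical.

```python
import re

# Assumed record layout: <review> ... <rating>5.0</rating> ...
# <review_text>...</review_text> ... </review>
REVIEW_RE = re.compile(r"<review>(.*?)</review>", re.DOTALL)
RATING_RE = re.compile(r"<rating>(.*?)</rating>", re.DOTALL)
TEXT_RE = re.compile(r"<review_text>(.*?)</review_text>", re.DOTALL)


def extract_reviews_and_ratings(raw):
    """Return parallel lists of review texts and rating strings."""
    reviews, ratings = [], []
    for block in REVIEW_RE.findall(raw):
        rating = RATING_RE.search(block)
        text = TEXT_RE.search(block)
        if rating is None or text is None:
            continue  # skip incomplete records
        ratings.append(rating.group(1).strip())
        # Put each review on a single line by collapsing all whitespace.
        reviews.append(" ".join(text.group(1).split()))
    return reviews, ratings
```

Writing `"\n".join(reviews)` and `"\n".join(ratings)` to the respective files then gives one review or rating per line.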

2.2. Run merge_preprocessed_reviews.py

This will merge the preprocessed reviews from all four categories into the all/ folder.

2.3. Run embed_reviews.py

This will transform the text reviews into embedded reviews by converting each word into a vector using the dictionaries from the previous steps. The resulting matrices will be stored in reviews_positive.npy and reviews_negative.npy.
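
The embedding step might be sketched like this. Several details are assumptions that the README does not specify: reviews are padded with zeros or truncated to a fixed `max_len`, words missing from the dictionary are skipped, and the function names are hypothetical.

```python
import numpy as np


def embed_review(review, embeddings, max_len, dim):
    """Turn one review into a (max_len, dim) matrix of word vectors."""
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    row = 0
    for word in review.split():
        if row == max_len:
            break  # assumption: truncate long reviews
        vector = embeddings.get(word)
        if vector is None:
            continue  # assumption: skip out-of-vocabulary words
        matrix[row] = vector
        row += 1
    return matrix  # unused rows stay zero (padding)


def embed_reviews(reviews, embeddings, max_len, dim):
    """Stack all (non-empty list of) reviews into (n, max_len, dim)."""
    return np.stack([embed_review(r, embeddings, max_len, dim) for r in reviews])
```

The result can be stored with `np.save("reviews_positive.npy", embedded)` and loaded again with `np.load`.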

3. Classify Sentiments

3.1. Run sentiment_classification.py

This will train a neural network to classify the sentiments in each category.
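
The README does not state the network architecture or the framework used. Purely as an illustration, the sketch below trains a one-hidden-layer network with plain NumPy and gradient descent on a binary cross-entropy loss. Everything in it (function names, layer size, learning rate, and the idea of feeding mean-pooled review embeddings) is an assumption rather than the repository's method.

```python
import numpy as np


def train_mlp(X, y, hidden=16, lr=0.5, epochs=300, seed=0):
    """Train a tiny tanh MLP for binary labels y in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                   # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # predicted P(positive)
        g = (p - y) / n                            # dLoss/dlogit for mean BCE
        grad_h = np.outer(g, W2) * (1.0 - h ** 2)  # backprop through tanh
        W2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ grad_h)
        b1 -= lr * grad_h.sum(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    """Return 1 for positive and 0 for negative sentiment."""
    W1, b1, W2, b2 = params
    return ((np.tanh(X @ W1 + b1) @ W2 + b2) > 0).astype(int)
```

With embedded reviews of shape (n, max_len, dim), one simple input representation would be the mean over the word axis, `features = reviews.mean(axis=1)`, with label 1 for positive and 0 for negative reviews.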

Contributors

lorenzoritter
