
pom's Introduction

Overview

This dataset is an annotated variant of the Persuasive Opinion Multimedia (POM) corpus. It was developed for the opinion prediction task and includes opinion annotations at the expression and word levels. Expression-level annotations label the textual span of the opinion. Word-level annotations (e.g. holder, target, polarity) label the word components of the opinion. Further details can be found in (Garcia et al. 2019 (1)). As part of preprocessing, punctuation was removed from the text of the original corpus. The dataset is stored as a pickled pandas MultiIndex DataFrame.

The hierarchical index structure can be understood according to the tuple which forms the MultiIndex object. The first element of the index is one of the following values: features, labels, level_0, seq_level_labels_lvl1 or words.

Each row in words is indexed by the following tuple of values: (index_text, id_sentence, level_1) where index_text indexes the raw filename for each movie review, id_sentence indexes the sentences in the review, and level_1 indexes each word in each sentence. This same tuple indexes the rows of each of the following pieces of data in the dataframe.
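As a rough illustration of this layout, the sketch below builds a tiny stand-in DataFrame with the same row index names and a three-level column index, then round-trips it through pickle. The file name `pom_toy.pkl`, the sub-level name `word`, the placement of the tuple's first element on the column axis, and all values are assumptions made for illustration; they are not taken from the actual dataset.

```python
import os
import tempfile

import pandas as pd

# Row index: (index_text, id_sentence, level_1), as described above.
rows = pd.MultiIndex.from_tuples(
    [("review_a", 0, 0), ("review_a", 0, 1), ("review_a", 1, 0)],
    names=["index_text", "id_sentence", "level_1"],
)

# Column index: assumed to carry the top-level group as its first element.
# The lower-level names used here are hypothetical placeholders.
cols = pd.MultiIndex.from_tuples(
    [
        ("words", "word", 0),
        ("labels", "label_video_sentiment", 0),
        ("features", "intervals", 0),
        ("features", "intervals", 1),
    ]
)

toy = pd.DataFrame(
    [
        ["this", 5.0, 0.0, 0.3],
        ["movie", 5.0, 0.3, 0.7],
        ["great", 5.0, 1.2, 1.6],
    ],
    index=rows,
    columns=cols,
)

# Round-trip through pickle, the storage format named above.
path = os.path.join(tempfile.mkdtemp(), "pom_toy.pkl")  # hypothetical file name
toy.to_pickle(path)
df = pd.read_pickle(path)

# List the top-level groups and pull out one sentence of one review.
top_level_groups = sorted(df.columns.get_level_values(0).unique())
first_sentence = df.xs(("review_a", 0), level=("index_text", "id_sentence"))
print(top_level_groups)
print(first_sentence)
```

With the real file, only the `pd.read_pickle` call and the inspection lines would be needed; the index names should be checked against the loaded object before relying on them.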

Features

Features consist of the tuple (features, [feature name], dimension), where dimension counts the columns that make up a particular feature. This data comes from the original POM corpus (Park et al. 2014) but was re-aligned so that it could be incorporated into this dataset.

| feature name | feature type | dimensions |
| --- | --- | --- |
| feature_COAVAREP | audio | 43 |
| feature_FACET 4.1 | video | 43 |
| feature_FACET 4.2 | video | 36 |
| feature_glove_vectors | text | 300 |
| intervals | word start, word stop | 2 |
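A minimal sketch of how one feature block might be selected, assuming the (features, [feature name], dimension) tuple sits on the column axis and that dimensions are numbered from 0. The helper `feature_block` and the zero-filled toy frame are hypothetical; only the feature names and dimension counts come from the table above.

```python
import numpy as np
import pandas as pd

# Dimension counts taken from the table above.
FEATURE_DIMS = {
    "feature_COAVAREP": 43,
    "feature_FACET 4.1": 43,
    "feature_FACET 4.2": 36,
    "feature_glove_vectors": 300,
    "intervals": 2,
}

# Build columns ("features", feature name, dimension) for every feature.
cols = pd.MultiIndex.from_tuples(
    [("features", name, d) for name, dim in FEATURE_DIMS.items() for d in range(dim)]
)
n_words = 4  # toy number of word rows
toy = pd.DataFrame(np.zeros((n_words, len(cols))), columns=cols)


def feature_block(df, name):
    """Return the 2-D array (words x dimensions) for one feature."""
    return df["features"][name].to_numpy()


glove = feature_block(toy, "feature_glove_vectors")
shapes = {name: feature_block(toy, name).shape for name in FEATURE_DIMS}
print(shapes)
```

Comparing the resulting shapes against the table is a quick sanity check after loading the real file.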

Video labels

Video labels consist of the tuple (labels, [label name], dimension), where dimension counts the columns that make up a particular label. This data also comes from the original POM corpus (Park et al. 2014) and was re-aligned.

| label name | dimensions |
| --- | --- |
| label_video_personality | 16 |
| label_video_persuasion | 1 |
| label_video_sentiment | 1 |

Opinion labels

Opinion labels consist of the tuple (seq_level_labels_lvl1, seq_level_labels_lvl2, [label]). The field label consists of all holders, polarities, and targets in the dataset. Each label is boolean, except the sentence-level 4_levels_polarity label, which takes the value '0' (no opinion), '1' (negative opinion), or '2' (positive opinion).

| label | granularity |
| --- | --- |
| 4_levels_polarity | sentence-level |
| Actor | expression-level |
| Atmosphere and mood | expression-level |
| Character design | expression-level |
| Composer - Singer - Soundmaker | expression-level |
| Director | expression-level |
| Music and Sound effects | expression-level |
| Negative | expression-level |
| Negative_levels | expression-level |
| Neutral | expression-level |
| Other | expression-level |
| Other people involved in movie making | expression-level |
| Overall | expression-level |
| Polarity | word-level |
| Positive | expression-level |
| Positive_levels | expression-level |
| Price | expression-level |
| Producer | expression-level |
| Screenplay | expression-level |
| Target | word-level |
| Token | word-level |
| Very_Negative | expression-level |
| Very_Positive | expression-level |
| Vision and Special effect | expression-level |

There are two aggregate expression-level labels: Negative_levels and Positive_levels. Negative_levels takes the value '1' only if Negative or Very_Negative takes the value '1' at the expression level; Positive_levels does the same for Positive and Very_Positive.
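The aggregation can be sketched as a logical OR over the two underlying labels. The toy values below are made up, and the flat column layout is a simplification of the MultiIndex used in the dataset.

```python
import pandas as pd

# Toy expression-level labels for five words (values are made up).
labels = pd.DataFrame(
    {
        "Negative": [0, 1, 0, 0, 0],
        "Very_Negative": [0, 0, 1, 0, 0],
        "Positive": [0, 0, 0, 1, 0],
        "Very_Positive": [0, 0, 0, 0, 0],
    }
).astype(bool)

# Aggregate labels: true when either the plain or the "Very_" label is true.
labels["Negative_levels"] = labels["Negative"] | labels["Very_Negative"]
labels["Positive_levels"] = labels["Positive"] | labels["Very_Positive"]

print(labels[["Negative_levels", "Positive_levels"]].astype(int))
```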

An example of a sentence from the dataset is:

This movie came out a few years ago and it is awesome

This sentence has a 4_levels_polarity of '2' because it contains the positive expression "it is awesome". The target word is "it", so this word has a value of '1' for the label Target. Finally, "it is" refers to the overall film, so the words "it" and "is" both have values of '1' for the labels Very_Positive, Positive_levels, and Overall.
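For illustration only, the example sentence can be written out as one row per word with the label values described above. The flat column names and the choice to repeat the sentence-level label on every word row are assumptions of this sketch, not the layout of the real file.

```python
import pandas as pd

words = "This movie came out a few years ago and it is awesome".split()

# One row per word; label values follow the description above.
sentence = pd.DataFrame({"word": words})
sentence["Target"] = sentence["word"].eq("it")
for label in ["Very_Positive", "Positive_levels", "Overall"]:
    sentence[label] = sentence["word"].isin(["it", "is"])

# Sentence-level label, repeated on every word row in this toy layout.
sentence["4_levels_polarity"] = "2"

target_words = sentence.loc[sentence["Target"], "word"].tolist()
overall_words = sentence.loc[sentence["Overall"], "word"].tolist()
print(target_words, overall_words)
```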

Considerations

Researchers should keep in mind that this dataset differs from the original POM dataset because of the following data processing steps:

  1. Annotators did not take into account the video portion of the dataset during annotation. Only the transcripts of each review were considered.
  2. While the original dataset contained punctuation (e.g. silent pauses), this dataset does not contain punctuation and only provides sentence segmentation. This could be of significant importance for those who want to use certain audio features from the CMU SDK -- such as pause (Park et al. 2014).
  3. Because punctuation has been removed, the Levenshtein distance was used to re-match the annotated transcripts with the transcripts of the original dataset.
  4. Finally, the annotated transcripts were re-integrated with the remaining features in the original POM dataset.
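The re-matching step can be pictured with a small sketch: compute the Levenshtein (edit) distance between an annotated sentence and each candidate from the original transcripts, and keep the closest one. This is not the authors' code; the function names, the lower-casing, and the example strings are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def closest_match(annotated, originals):
    """Return the original transcript line with the smallest edit distance."""
    return min(originals, key=lambda s: levenshtein(annotated.lower(), s.lower()))


# Made-up original lines with punctuation, and an unpunctuated annotated line.
originals = [
    "This movie came out a few years ago, and it is awesome.",
    "I would not recommend it to anyone.",
]
annotated = "This movie came out a few years ago and it is awesome"
best = closest_match(annotated, originals)
print(best)
```

A pure-Python implementation is used here to keep the sketch self-contained; for a full corpus a compiled edit-distance library would likely be faster.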

Download Link

The dataset is available for download through registration at the following link:

http://service.tsi.telecom-paristech.fr/cgi-bin/user-service/subscribe.cgi?form=&license=1&ident=POM

If prompted to sign in, click 'Cancel' to navigate to the registration page.

Filezilla is the recommended FTP client. Please make sure to use the following configuration when connecting to the server.

[Two images showing the connection configuration; not reproduced here.]

Acknowledgement

The documentation of this dataset and its issues, and code to parse the data were contributed by Tanvi Dinkar.

Contact Information

Please direct any questions or concerns regarding this dataset to Chloé Clavel ([email protected]) or Tanvi Dinkar ([email protected]).

Citation information

@article{garcia2019multimodal,
  title={A multimodal movie review corpus for fine-grained opinion mining},
  author={Garcia, Alexandre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1902.10102},
  year={2019}
}

@article{garcia2019token,
  title={From the token to the review: A hierarchical multimodal approach to opinion mining},
  author={Garcia, Alexandre and Colombo, Pierre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1908.11216},
  year={2019}
}

@inproceedings{park2014computational,
  title={Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach},
  author={Park, Sunghyun and Shim, Han Suk and Chatterjee, Moitreya and Sagae, Kenji and Morency, Louis-Philippe},
  booktitle={Proceedings of the 16th International Conference on Multimodal Interaction},
  pages={50--57},
  year={2014}
}

pom's People

Contributors

eusip, tdinkar

pom's Issues

Dataset

Thank you very much for your work on this dataset. I registered nearly a day ago. When will I receive the dataset?

About license

Hi, thank you for sharing this repository.
I have a question about license.
What is the license for this dataset?
Thanks.
