
A Movie behind a Script

Data story: https://epfl-ada-2018.github.io/ada2018-movie-scripts/

Abstract

Cinema has come a long way from the public screening of ten of the Lumière brothers' short films in Paris on 28 December 1895 to Avatar, a movie that cost more than 200 million dollars to make and grossed more than 2 billion dollars, the highest of all time. Historically, works of art have reflected our society and our culture. Throughout this time, the script has remained the cornerstone of a movie. What script must one write for a movie to prosper? We have data spanning nearly 100 years and covering over 4,000 movies to try to solve this puzzle. Using the OpenSubtitles dataset to analyze movie scripts and the IMDb dataset to measure their popularity, we hope to provide better insight into what makes a good or a bad movie.

Research questions

  • What significant features in the movie's script must a good movie have?
    • Is a lot of dialogue present?
    • How large is the spread of words used?
    • Must it employ a particular set of words?
  • Based on certain features of the script, can we predict the corresponding movie's popularity?
    • We measure popularity by IMDb rating, number of votes and box office revenue.
  • What are the common features in good and bad movies?
    • Can we find similarities between good (or bad) movies based on their script's features?
    • Can we therefore define new movie categories (as an alternative to genre, for example)?
  • Bonus: can we relate the language used by reviewers on IMDb to the movie's script and genre?

Dataset

  • OpenSubtitles: consists of 3.74 million subtitle files across 62 languages and covers a total of 152,939 movies or TV episodes. The dataset is 31 GB in size and is provided in XML format.

The paper OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles explains that the original subtitle files came in the following textual format and were then converted to XML.

An extract of a subtitle file:

3
00:00:39,299 --> 00:00:41,099
Sir, we're getting
a distress call

4
00:00:41,168 --> 00:00:42,634
from a civilian aircraft.

5
00:00:46,540 --> 00:00:49,641
CIC visually confirms
a Cessna 172.

For each subtitle, a subtitle file contains a sequence number, the start and end timestamps between which the subtitle is shown, and the text displayed.
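As an illustration of this layout, here is a minimal Python sketch that parses the textual format of the extract above into records. The names `Subtitle`, `parse_srt` and `dialogue_time` are hypothetical helpers and not part of the project. The sketch assumes blocks are separated by blank lines; the XML files in the dataset would need a different reader.

```python
import re
from dataclasses import dataclass

# Matches a timing line such as "00:00:39,299 --> 00:00:41,099".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Subtitle:
    index: int    # sequence number of the subtitle
    start: float  # seconds from the start of the movie
    end: float    # seconds from the start of the movie
    text: str     # displayed text, line breaks joined by spaces


def _seconds(h, m, s, ms):
    """Convert the four timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(raw):
    """Parse subtitle blocks separated by blank lines (hypothetical helper)."""
    subtitles = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue  # skip malformed blocks
        match = TIMING.fullmatch(lines[1])
        if match is None:
            continue
        g = match.groups()
        subtitles.append(
            Subtitle(int(lines[0]), _seconds(*g[:4]), _seconds(*g[4:]),
                     " ".join(lines[2:]))
        )
    return subtitles


def dialogue_time(subtitles):
    """Total time during which a subtitle is displayed, in seconds."""
    return sum(s.end - s.start for s in subtitles)


EXTRACT = """3
00:00:39,299 --> 00:00:41,099
Sir, we're getting
a distress call

4
00:00:41,168 --> 00:00:42,634
from a civilian aircraft.

5
00:00:46,540 --> 00:00:49,641
CIC visually confirms
a Cessna 172.
"""

subs = parse_srt(EXTRACT)
for sub in subs:
    print(sub.index, sub.text)
print(round(dialogue_time(subs), 3), "seconds of displayed dialogue")
```

Summing the display durations gives a rough proxy for dialogue time. It measures how long text is on screen, not how long characters actually speak.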

To estimate the complexity of a movie's script, we intend to count word frequencies and the number of distinct words in a movie, and to compute the average sentence length.
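These measures could be computed along the following lines. This is only a sketch: `script_statistics` is a hypothetical name, the tokenizer keeps alphabetic words only (numbers are dropped), and sentences are split naively on `.`, `!` and `?`, which miscounts abbreviations such as "Mr.".

```python
import re
from collections import Counter


def script_statistics(text):
    """Return simple lexical statistics for a script (hypothetical helper)."""
    # Lowercased alphabetic tokens, allowing one inner apostrophe ("we're").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    frequencies = Counter(words)
    return {
        "word_count": len(words),
        "distinct_words": len(frequencies),
        "most_common": frequencies.most_common(3),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


sample = (
    "Sir, we're getting a distress call from a civilian aircraft. "
    "CIC visually confirms a Cessna 172."
)
stats = script_statistics(sample)
print(stats)
```

The ratio of distinct words to total words is one possible measure of the spread of words used. It depends strongly on script length, so it is best compared across scripts of similar size.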

  • IMDb Datasets: contain information on movies and shows, cast and crew (actors, directors and writers), TV episodes, and the ratings and votes for each title. Each dataset is a gzipped, tab-separated-values (TSV) file in the UTF-8 character set.

Each movie or show in the OpenSubtitles dataset is identified by its IMDb identifier, which allows us to enrich the OpenSubtitles dataset with the IMDb dataset.
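The enrichment step could look roughly like the sketch below. It rests on several assumptions: the ratings file uses the columns `tconst`, `averageRating` and `numVotes` with `\N` marking missing values; the subtitle side stores the identifier as a bare number that must be converted to the `tt`-prefixed, zero-padded form; and the script features are already in a dictionary. The function names and sample rows are made up for illustration.

```python
import csv
import io


def to_tconst(imdb_id):
    """Convert a numeric identifier to the 'tt'-prefixed form (assumed layout)."""
    return f"tt{int(imdb_id):07d}"


def load_ratings(handle):
    """Read a ratings TSV from a text handle into {tconst: (rating, votes)}."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    ratings = {}
    for row in reader:
        # Assumption: missing values are written as the literal string \N.
        if row["averageRating"] == r"\N" or row["numVotes"] == r"\N":
            continue
        ratings[row["tconst"]] = (float(row["averageRating"]), int(row["numVotes"]))
    return ratings


def enrich(script_features, ratings):
    """Attach rating and votes to each script that has a matching title."""
    enriched = {}
    for imdb_id, features in script_features.items():
        rating = ratings.get(to_tconst(imdb_id))
        if rating is not None:  # inner join: drop scripts without a rating
            enriched[imdb_id] = {**features, "rating": rating[0], "votes": rating[1]}
    return enriched


# Made-up rows for illustration only, not real IMDb data.
SAMPLE_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t1500\n"
    "tt0000002\t6.1\t200\n"
)
ratings = load_ratings(io.StringIO(SAMPLE_TSV))
features = {"1": {"distinct_words": 13}, "3": {"distinct_words": 40}}
enriched = enrich(features, ratings)
print(enriched)
```

For a real gzipped file, the same function can be given a handle from `gzip.open(path, "rt", encoding="utf-8")`, where `path` points to the downloaded ratings file.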

A list of internal milestones up until project milestone 2

04.11

  • Set up the Git repository and project skeleton.

11.11

  • Find the most convenient way to store and look up our data.
  • Download the data.
  • Clean our dataset.
  • Develop methods to analyze the subtitles: counting words, spread of words, mean length of sentences, dialogue time.

18.11

  • Test our methods.
  • Gather results.
  • Start analysis.

25.11

  • Set up our goals and plans for the next milestone.

Questions for TAs

  • Is there a good library/way to analyze text?
  • Are we allowed to use programming languages or tools other than Python, for instance NVivo?

Contributors

  • Xabier: plotting graphs during data analysis, running tests, preliminary data analysis, commenting results, topic detection.
  • Martin: plotting graphs during data analysis, problem formulation, coming up with algorithms, coding up algorithms, sentiment analysis.
  • Adrian: plotting graphs during data analysis, writing the data story, managing the git repository, final presentation speaker.
