STORYLENS

The multistream corpus (StoryLens) created for the evaluation of Recognyze in the InVID project.

CITATION

If you use this corpus in your evaluations, please cite the following paper (BibTeX):

   @inproceedings{brasoveanu2018wims,
        author = {Adrian M. P. Bra{\c{s}}oveanu and Lyndon J.B. Nixon and Albert Weichselbraun},
        title  = {StoryLens: A Multiple Views Corpus for Location and Event Detection},
        booktitle = {Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics (WIMS 2018)},
        address = {Novi Sad, Serbia},
        publisher = {ACM},
        year   = {2018},
        date   = {25-27 June 2018}
   }

A MULTISTREAM CORPUS

A multistream corpus contains content from different types of streams.

The current corpus contains annotations based on the following stream types:

  • news - 100 documents
  • twitter - 200 documents
  • youtube - 100 documents

We may add more documents over time.

DOCUMENTS

The YouTube, Twitter, and news media documents are not provided with this corpus for copyright reasons.

The original documents can be retrieved by crawling their URLs. To allow third parties to do this, we provide lists of document IDs in the following folder: List.

The Twitter partition of the corpus contains only the annotations due to copyright restrictions, but the actual texts of the tweets can be downloaded by ID using free scripts (Tweet Downloader by ID example: https://gist.github.com/giacbrd/b996cfe2f1d24752f23bd119fdd678f2).
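As a rough illustration of how a list of IDs could be turned into URLs for crawling, here is a minimal sketch. It assumes that each list holds one platform-native ID per line (a tweet ID or a YouTube video ID); the actual layout of the files in the List folder may differ. The function names and the file name in the usage comment are hypothetical.

```python
from typing import List


def parse_ids(text: str) -> List[str]:
    """Split the contents of an ID list into IDs, skipping blank lines.

    Assumption: one ID per line.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def tweet_url(tweet_id: str) -> str:
    """Build a URL for a tweet from its numeric ID."""
    return f"https://twitter.com/i/web/status/{tweet_id}"


def youtube_url(video_id: str) -> str:
    """Build a watch URL for a YouTube video from its video ID."""
    return f"https://www.youtube.com/watch?v={video_id}"


# Usage (hypothetical file name):
# with open("twitter_ids.txt", encoding="utf-8") as handle:
#     urls = [tweet_url(i) for i in parse_ids(handle.read())]
```

Fetching the tweet text itself still requires a downloader such as the one linked above.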

ONTOLOGY

The focus is on location entities; therefore, all types of conflicts between locations and other entity types are included.

The annotations taken into account when building the gold standard files are the following:

  • Natural Location (LOC) - e.g., Danube River, Alps
  • Geo-Political Entity (GPE) - e.g., Vienna, Austria
  • Facility (FAC) - e.g., Brooklyn Bridge, Interstate 66
  • Person (PER) - e.g., Prince Charles, Donald Trump
  • Organization (ORG) - e.g., Google, Apple
  • Product (PROD) - e.g., iPhone, Samsung Galaxy 8
  • Work (WORK) - e.g., Mona Lisa, Star Trek
  • Event (EVENT) - e.g., 9/11, Grenfell Tower fire
  • Miscellaneous (MISC) - any other type of entity

The ontology can be found here: Recognyze Ontology.
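As a rough illustration, the entity types listed above could be represented in code as follows. The names `EntityType`, `LOCATION_TYPES`, and `is_location` are hypothetical and not part of the corpus or the Recognyze ontology. Grouping LOC, GPE, and FAC as the location types is an assumption made for this sketch.

```python
from enum import Enum


class EntityType(Enum):
    """Entity tags from the list above, mapped to their descriptions."""
    LOC = "Natural Location"
    GPE = "Geo-Political Entity"
    FAC = "Facility"
    PER = "Person"
    ORG = "Organization"
    PROD = "Product"
    WORK = "Work"
    EVENT = "Event"
    MISC = "Miscellaneous"


# Assumption: the place-like types are treated as the "location" group.
LOCATION_TYPES = frozenset({EntityType.LOC, EntityType.GPE, EntityType.FAC})


def is_location(tag: str) -> bool:
    """Return True if a tag such as "GPE" is in the location group.

    Raises KeyError for a tag that is not in the list.
    """
    return EntityType[tag] in LOCATION_TYPES
```

With this sketch, `is_location("GPE")` is true and `is_location("PER")` is false.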

ANNOTATION GUIDELINE

The Annotation Guideline is based on the TAC and ACE guidelines.

It can be found in the following folder: Guideline.

GOLD

The Gold folder contains the judged results.

The links provided are based on the LIVE DBpedia version that was current during annotation (September - December 2017), which would correspond to DBpedia 2017-10 or 2018-04. Link changes can therefore occur.

If you find one of the following error types, please feel free to contact us so that we can update the corpus:

  • New entities that were not annotated
  • Different possibilities to annotate various entities
  • New links (where no entity was found before or where NIL entities currently exist)

LENSES

The Lenses folder contains some example lenses.

We currently provide:

  • Long - longest match for any entity
  • Embedded - includes embedded entities
  • DBpediaLens - lens related to a certain DBpedia version (e.g., 2016-10 or 2016-04); currently in preparation
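The difference between the Long and Embedded lenses can be sketched as a filter over annotation spans. This is only an illustration under stated assumptions, not the implementation used to build the lenses: the `Span` structure with character offsets and the function names are hypothetical, and the actual lens files may use a different format.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    tag: str    # entity type, e.g. "ORG" or "GPE"


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies inside `outer` and is strictly shorter."""
    return (outer.start <= inner.start
            and inner.end <= outer.end
            and (outer.end - outer.start) > (inner.end - inner.start))


def long_lens(spans: List[Span]) -> List[Span]:
    """Long view: keep only spans not nested inside a longer span."""
    return [s for s in spans if not any(contains(o, s) for o in spans)]


def embedded_lens(spans: List[Span]) -> List[Span]:
    """Embedded view: keep every span, including nested ones."""
    return sorted(spans)


# "University of Vienna": the whole string as ORG, "Vienna" as GPE.
spans = [Span(0, 20, "ORG"), Span(14, 20, "GPE")]
print(long_lens(spans))      # only the ORG span remains
print(embedded_lens(spans))  # both spans are kept
```

In this example the Long view drops the nested "Vienna" mention, while the Embedded view retains it alongside the longer organization mention.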

For future versions of the corpus we will also include:

  • events - arguably only named events (EVENT) such as Grenfell Tower Disaster
  • stories - the narratives focused around big events

UPDATES

Because the publication associated with this dataset is still under review and the DBpedia LIVE version used during annotation is not available as a dump, we reserve the right to change small parts of this dataset in the near future.

Example updates might include:

  • New entities - typically entities detected during evaluations or reported by third-party users
  • New Links - if available
  • New Lenses - if needed for a particular use case

TWEET DOWNLOADER

To download the full tweets, please use any tweet downloader, for example Tweet Downloader.

OTHER FORMATS

If you need this corpus in formats other than the ones we provide, please contact us.

NOTES

The official version is published on GitHub without the original documents for copyright reasons.

If you plan to use this corpus in an evaluation suite, please contact us.

If you discover errors in this dataset (e.g., missing annotations, wrong types, etc.), feel free to contact us and we will update it.

COPYRIGHT

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
