【Cofacts 真的假的】Open Datasets

To facilitate academic research and analysis of fact-checking on closed messaging platforms, Cofacts releases all instant messages and replies in its database to the public domain under the CC0 license. Everyone can freely distribute and use the dataset.

Files and formats

The open data files are in the data/ directory. All CSV files are UTF-8 encoded and compressed in a zip file.
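As a rough illustration of reading one of these files, the sketch below opens a zip archive and parses a UTF-8 CSV inside it using only the Python standard library. The helper name `read_zipped_csv` is hypothetical, and the sketch assumes the archive contains a single CSV unless a member name is passed explicitly.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_path, member=None):
    """Return the rows of one CSV inside a zip archive as a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: without an explicit member name, use the first file in the archive.
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # The CSV files are UTF-8 encoded; newline="" lets csv handle line endings.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Each returned row is a dict keyed by the CSV header, so all values are strings; numeric fields need to be converted by the caller.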

Fields across different entities

  • userIdsha (string) Hashed user identifier.
  • appId (string) Possible values:
    • LEGACY_APP: Articles collected before 2017-03.
    • RUMORS_LINE_BOT: Articles collected with the current LINE bot client after 2017-03.

The two fields together identify a unique user across different CSV files. For instance, if one row (a reply) in replies.csv and another row (a feedback) in article_reply_feedbacks.csv have identical userIdsha and appId values, the reply and the feedback were submitted by the same user.
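The composite key described above can be sketched as follows. The function names `user_key` and `group_by_user` are hypothetical, and the sketch assumes every row is a dict that carries both a `userIdsha` and an `appId` column, as named in the field list above.

```python
from collections import defaultdict


def user_key(row):
    """Composite key that identifies one user across CSV files."""
    return (row["userIdsha"], row["appId"])


def group_by_user(*tables):
    """Map each (userIdsha, appId) key to all rows that user submitted."""
    grouped = defaultdict(list)
    for table in tables:
        for row in table:
            grouped[user_key(row)].append(row)
    return dict(grouped)
```

Rows that share the same hash but have different `appId` values end up under different keys, which matches the rule that both fields are needed to identify a user.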

Fields

articles.csv

The instant messages LINE bot users submitted into the database.

| Field | Data type | Description |
| --- | --- | --- |
| id | String | |
| references | Enum string | Where the message is from. Currently the only possible value is LINE. |
| userIdsha | String | Author of the article. |
| appId | String | |
| tags | Text | Reserved for category labels; currently empty. |
| normalArticleReplyCount | Integer | The number of replies associated with this article, excluding deleted reply associations. |
| text | Text | The instant message text. |
| hyperlinks | Text | Reserved; currently empty. |
| createdAt | ISO time string | When the article was submitted to the database. |
| updatedAt | ISO time string | Reserved; currently identical to createdAt. |
| lastRequestedAt | ISO time string | The submission time of the last reply_request sent for the article before the article was replied to. |

article_replies.csv

Articles and replies are in a has-and-belongs-to-many relationship. That is, an article can have multiple replies, and a reply can be connected to multiple similar articles.

article_replies is the "join table" between articles and replies, bringing articleId and replyId together, along with other useful properties related to this connection between an article and a reply.

One pair of articleId, replyId will map to exactly one article_reply.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | Relates to the id field of articles. |
| replyId | String | Relates to the id field of replies. |
| userId | String | The user who connected the reply to the article. |
| negativeFeedbackCount | Integer | Number of article_reply_feedbacks that have score -1. |
| positiveFeedbackCount | Integer | Number of article_reply_feedbacks that have score 1. |
| replyType | Enum string | Duplicated from the type field of replies. |
| appId | String | |
| status | Enum string | NORMAL: The reply and article are connected. DELETED: The reply is no longer connected to the article. |
| createdAt | ISO time string | The time when the reply was connected to the article. |
| updatedAt | ISO time string | The latest date when the reply's status was updated. |
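To make the join-table idea concrete, here is a minimal sketch that looks up the replies currently connected to one article. The function name `replies_for_article` is hypothetical, and the sketch assumes each table has already been loaded as a list of dicts keyed by the column names listed above.

```python
def replies_for_article(article_id, article_replies, replies):
    """Return the replies currently connected to the given article."""
    replies_by_id = {reply["id"]: reply for reply in replies}
    return [
        replies_by_id[link["replyId"]]
        for link in article_replies
        if link["articleId"] == article_id
        and link["status"] == "NORMAL"  # skip DELETED connections
        and link["replyId"] in replies_by_id  # tolerate dangling references
    ]
```

Filtering on `status` mirrors how normalArticleReplyCount excludes deleted associations; drop that condition to include historical connections as well.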

replies.csv

Editor's reply to the article.

| Field | Data type | Description |
| --- | --- | --- |
| id | String | |
| type | Enum string | Type of the reply chosen by the editor. RUMOR: The article contains rumor. NON_RUMOR: The article contains fact. OPINIONATED: The article contains personal opinions. NOT_ARTICLE: The article should not be processed by Cofacts. |
| reference | Text | For RUMOR and NON_RUMOR replies: the reference that supports the chosen type and text. For OPINIONATED replies: references containing perspectives different from the article. For NOT_ARTICLE: empty string. |
| userId | String | The editor who authored this reply. |
| appId | String | |
| text | Text | Reply text written by the editor. |
| createdAt | ISO time string | When the reply was written. |

reply_requests.csv

Before an article is replied to, users may submit reply_requests to indicate that they want this article to be answered.

When an article is first submitted to the database, a reply request is also created. Any further queries for the same article submit new reply_requests.

A user can only submit one reply request to an article.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | The target of the request. |
| reason | Text | The reason why the user wants to submit this reply request. |
| positiveFeedbackCount | Text | Number of editors who think the reason is reasonable. |
| negativeFeedbackCount | Text | Number of editors who think the reason is nonsense. |
| createdAt | ISO time string | When the reply request was issued. |
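Because a user can submit only one reply request per article, the number of reply_requests rows for an article can serve as a rough measure of how many users asked about it. A minimal sketch, assuming the rows are already loaded as dicts (the function name `request_counts` is hypothetical):

```python
from collections import Counter


def request_counts(reply_requests):
    """Count how many reply requests each article received."""
    return Counter(row["articleId"] for row in reply_requests)
```

Calling `request_counts(rows).most_common(10)` would then list the ten most-requested article ids.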

article_reply_feedbacks.csv

Editors and LINE bot users can express whether a reply is useful by submitting article_reply_feedbacks toward an article_reply with score 1 or -1.

The feedback is submitted toward an article_reply, that is, the connection between an article and a reply, because a reply can be connected to multiple articles. A reply that makes sense for one article is not necessarily useful in answering another. Therefore, feedback for a reply connected to different articles is counted separately for each article.

| Field | Data type | Description |
| --- | --- | --- |
| articleId | String | Relates to the articleId of the target article_reply. |
| replyId | String | Relates to the replyId of the target article_reply. |
| score | Integer | 1: Useful. -1: Not useful. |
| comment | Text | Why the user chose this score for the article reply. |
| createdAt | ISO time string | When the feedback was submitted. |
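The per-connection counting described above can be sketched like this: feedback is tallied per (articleId, replyId) pair rather than per reply. The function name `tally_feedback` and the output keys `positive` and `negative` are hypothetical, and the sketch assumes `score` arrives as a string such as "1" or "-1" when read from CSV.

```python
from collections import defaultdict


def tally_feedback(feedbacks):
    """Count positive and negative feedback for each (articleId, replyId) pair."""
    totals = defaultdict(lambda: {"positive": 0, "negative": 0})
    for feedback in feedbacks:
        key = (feedback["articleId"], feedback["replyId"])
        score = int(feedback["score"])  # CSV values are strings
        if score == 1:
            totals[key]["positive"] += 1
        elif score == -1:
            totals[key]["negative"] += 1
    return dict(totals)
```

The resulting totals could be compared against positiveFeedbackCount and negativeFeedbackCount in article_replies.csv as a sanity check, although whether the two always agree is not something this document states.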

License

CC0
To the extent possible under law, g0v Cofacts Project has waived all copyright and related or neighboring rights to Cofacts Dataset. This work is published from: Taiwan.

⚠ [NOTICE] Caveats of using this data ⚠

The methodology we use to collect this data (i.e. how Cofacts works) may affect the credibility of the data.

[Figure: How Cofacts works]

Please keep in mind that all data in this dataset is user-generated and is therefore not free from noise and sampling bias arising from these sources:

  • The distribution of Cofacts' users may not reflect the real distribution of all LINE users in Taiwan.
  • Users may not use Cofacts the way we intend. Some articles may not be actual messages circulating in the LINE network.
  • Replies may contain factual errors. All replies should be regarded merely as "responses to the original message (article) that provide a different point of view". They are neither the "truth" nor the editor's personal opinion.
  • Malicious users may also send garbage articles into the database. (Previous incident report)
  • The programs that collect the data and generate the dataset may contain errors, so the dataset may be systematically inaccurate.

Lastly, the dataset is provided without warranty.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.

Generating opendata files

We generate the opendata files by backing up the production DB to a local machine, then running this script on the local machine.

According to rumors-deploy, the production DB raw data should be available in rumors-deploy/volumes/db-production. (Staging is in db-staging instead.)

To back up the production DB, tar rumors-deploy/volumes/db-production, download the archive to the local machine, extract it, and put the contents in the esdata directory in this project's root. At this point esdata should contain only the nodes directory.

Run this to spin up a local Elasticsearch instance for the backed-up files:

$ docker-compose up

This spins up Elasticsearch on localhost:62223, with Kibana available at localhost:62224, using the data in esdata.

Lastly, run this to generate the files in the data/ directory:

$ npm start
