Giter VIP home page Giter VIP logo

covid-weibo's Introduction

Weibo COVID dataset

Sina Weibo (新浪微博), commonly referred to as "Chinese Twitter", is a micro-blogging site. This Weibo dataset was used in Analysis of misinformation during the COVID-19 outbreak in China: cultural, social and political entanglements. You can access the ppaer through this link. The data was crawled on the Weibo platform from December 7, 2019, to April 4, 2020.

The data is crawled in two phases, covering a total of 4,047,389 Weibo posts. The first crawler ran on February 26, 2020, and collected 3.3 million Weibo posts from January 18, 2020, to February 26, 2020. The second crawler ran on April 4, crawling from December 7, 2020, to April 4, 2020, to complement the original dataset. The temporal distribution of the Weibo data is shown in Figure 1.

Figure 1. Weibo data temporal distribution

How to acquire the dataset

This Weibo data was collected for and can only be used for research purposes. Please complete google form to acquire this corpus. We will email you the instructions to download the data.

If you use this data in your research, please cite the following article. Please reach out to Yujia Zhai at [email protected] for questions related to the data.

Leng, Yan, Yujia Zhai, Shaojing Sun, Yifei Wu, Jordan Selzer, Sharon Strover, Julia Fensel, Alex Pentland, and Ying Ding. "Analysis of misinformation during the COVID-19 outbreak in China: cultural, social and political entanglements." arXiv preprint arXiv:2005.10414 (2020).

@ARTICLE {9340553,
author = {Y. Leng and Y. Zhai and S. Sun and Y. Wu and J. Selzer and S. Strover and H. Zhang and A. Chen and Y. Ding},
journal = {IEEE Transactions on Big Data},
title = {Misinformation during the COVID-19 outbreak in China: cultural, social and political entanglements},
year = {5555},
volume = {},
number = {01},
issn = {2332-7790},
pages = {1-1},
keywords = {covid-19;blogs;social networking (online);pandemics;media;urban areas;big data},
doi = {10.1109/TBDATA.2021.3055758},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {jan}
}

Data collection method

The Weibo data is obtained by a python crawler. The crawler automatically uses Weibo's advanced search function for keyword indexing.

The keywords we used included:

  • COVID-19
  • novel coronavirus(新型冠状病毒)
  • corona(新冠)
  • epidemics(疫情)
  • novel pneumonia(新型肺炎)
  • pneumonia in Wuhan(武汉+肺炎)

The crawler program automatically entered one of the keywords into the query box and set the query time range to be a specific hour. As illustrated in Figure 2, the crawler sets the time range from January 18, 2020, 00:00:00 to January 18, 2020, 01:00:00.

For each query, the time range increased by one hour, and each query searched all the new posts within an hour. Each query returned a maximum of 50 pages, each contained around 20 posts. If the number posts exceed the page limits, we cannot fully collect the information due to the limitations of the search function.

weibo_search.png

Figure 2. Weibo's advanced search function for keyword indexing

Metadata of the Weibo data

We illustrate a weibo in Figure 3, which contains the user name, content, timestamp, number of reposts, comments, and likes.

Alt text

  • Figure 3. A screenshot of one Weibo post *

We present the Weibo metadata in Table 1 and Figure 4.

Date field Description
weibo_id the unique identifier of the weibo
user_id the unique identifier for user
from the type of the device sent this weibo
content content of the weibo
timestamp posting time
repost the original weibo reposted by this weibo, containing the content/timestamp/image/user_id. Note: not all weibo has this variable

Figure 4. Metadata of Weibo data

covid-weibo's People

Contributors

yleng avatar zhaihulu avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.