
fakenewsnet's Introduction

FakeNewsNet

*** We will never ask for money to share the datasets. If someone claims that s/he has all the raw data and wants a payment, please be careful. ***

We released a tool, FakeNewsTracker, for collecting, analyzing, and visualizing fake news and its dissemination on social media. Check it out!

The latest dataset paper, with a detailed analysis of the dataset, can be found at FakeNewsNet.

Please use the current, up-to-date version of the dataset.

The previous version of the dataset is available in the old-version branch of this repository.

Overview

The complete dataset cannot be distributed because of Twitter privacy policies and news publisher copyrights. Social engagements and user information are not disclosed because of Twitter policy. This code repository can be used to download news articles from the publishing websites and the relevant social media data from Twitter.

The minimal version of the latest dataset provided in this repo (located in the dataset folder) includes the following files:

  • politifact_fake.csv - Samples related to fake news collected from PolitiFact
  • politifact_real.csv - Samples related to real news collected from PolitiFact
  • gossipcop_fake.csv - Samples related to fake news collected from GossipCop
  • gossipcop_real.csv - Samples related to real news collected from GossipCop

Each of the above CSV files is comma-separated and has the following columns (a short loading sketch follows the list):

  • id - Unique identifier for each news article
  • url - URL of the article on the web that published the news
  • title - Title of the news article
  • tweet_ids - Tweet IDs of tweets sharing the news. This field is a list of tweet IDs separated by tabs.
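
A minimal loading sketch in Python, assuming pandas is installed; the file path and column names follow the list above:

import pandas as pd

# Load one of the ground-truth CSV files from the dataset folder.
df = pd.read_csv("dataset/politifact_fake.csv")

# tweet_ids is a tab-separated string; split it into a list of IDs.
# Some rows may have no tweets, so guard against missing values.
df["tweet_ids"] = df["tweet_ids"].fillna("").apply(
    lambda s: [t for t in s.split("\t") if t])

print(df[["id", "title"]].head())
print("total tweet ids:", df["tweet_ids"].map(len).sum())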

Installation

Requirements:

The data download scripts are written in Python and require Python 3.6+ to run.

Twitter API keys are required for collecting data from Twitter. Use the following link to obtain Twitter API keys:
https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html

The scripts read keys from the tweet_keys_file.json file located in the code/resources folder, so the API keys need to be added to tweet_keys_file.json. Provide the keys as an array of JSON objects with the attributes app_key, app_secret, oauth_token, and oauth_token_secret, as shown in the sample file.
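
A sketch of tweet_keys_file.json following the array-of-objects description above; the values are placeholders, not real keys, and the sample file shipped in code/resources is authoritative for the exact layout:

[
  {
    "app_key": "YOUR_CONSUMER_KEY",
    "app_secret": "YOUR_CONSUMER_SECRET",
    "oauth_token": "YOUR_ACCESS_TOKEN",
    "oauth_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
  },
  {
    "app_key": "SECOND_APP_CONSUMER_KEY",
    "app_secret": "SECOND_APP_CONSUMER_SECRET",
    "oauth_token": "SECOND_APP_ACCESS_TOKEN",
    "oauth_token_secret": "SECOND_APP_ACCESS_TOKEN_SECRET"
  }
]

Providing several key pairs lets the key management server spread Twitter rate limits across the parallel download processes.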

Install all the libraries in requirements.txt using the following command

pip install -r requirements.txt

Configuration:

FakeNewsNet contains 2 datasets collected using ground truth from PolitiFact and GossipCop.

The config.json file can be used to configure and collect only certain parts of the dataset. The following attributes can be configured (a sample configuration follows the lists below):

  • num_process - (default: 4) This attribute indicates the number of parallel processes used to collect data.

  • tweet_keys_file - Path of the file in which the available Twitter API keys are configured (the tweet_keys_file.json described above)

  • data_collection_choice - An array of choices of the various parts of the dataset. Configure it to download only certain parts of the dataset.
    Available values are
    {"news_source": "politifact", "label": "fake"}, {"news_source": "politifact", "label": "real"}, {"news_source": "gossipcop", "label": "fake"}, {"news_source": "gossipcop", "label": "real"}

  • data_features_to_collect - FakeNewsNet has multiple dimensions of data (News + Social). This configuration allows one to download the desired dimensions of the dataset. It is an array field and can take the following values.

    • news_articles : Downloads the news articles of the dataset.
    • tweets : Downloads the tweet objects posted sharing the news on Twitter. This makes use of the Twitter API.
    • retweets : Downloads the retweets of the tweets provided in the dataset.
    • user_profile : Downloads the user profile information of the users involved in tweets. To download user profiles, tweet objects need to be downloaded first in order to identify the users involved.
    • user_timeline_tweets : Downloads up to 200 recent tweets from each user's timeline. To download a user's recent tweets, tweet objects need to be downloaded first in order to identify the users involved.
    • user_followers : Downloads the follower IDs of the users involved in tweets. To download follower IDs, tweet objects need to be downloaded first in order to identify the users involved.
    • user_following : Downloads the following (friend) IDs of the users involved in tweets. To download following IDs, tweet objects need to be downloaded first in order to identify the users involved.
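
A sample config.json combining the attributes above (a sketch, not the shipped file): dump_location, dataset_dir, and num_twitter_keys are taken from the configuration snippet quoted in the issues further down this page, and all paths and choices are illustrative.

{
  "num_process": 4,
  "num_twitter_keys": 1,
  "tweet_keys_file": "resources/tweet_keys_file.json",
  "dump_location": "fakenewsnet_dataset",
  "dataset_dir": "../dataset",
  "data_collection_choice": [
    {"news_source": "politifact", "label": "fake"},
    {"news_source": "politifact", "label": "real"}
  ],
  "data_features_to_collect": ["news_articles", "tweets", "retweets"]
}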

Running Code

In order to collect the dataset quickly, the code makes use of process parallelism, and to synchronize Twitter key rate limits across multiple Python processes, a lightweight Flask application is used as a key management server. Execute the following commands inside the code folder:

nohup python -m resource_server.app &> keys_server.out&

The above command will start the Flask server on port 5000 by default.
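
To confirm the key server is listening, here is a quick check (a sketch assuming the default port; any HTTP response, even a 404, means the Flask process is up):

import requests

try:
    # Any status code at all means the key management server is reachable.
    resp = requests.get("http://localhost:5000/", timeout=5)
    print("key server reachable, status:", resp.status_code)
except requests.exceptions.ConnectionError:
    print("key server is not running; start it with the command above")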

Configuration should be completed before proceeding to the next step!

Execute the following command to start data collection,

nohup python main.py &> data_collection.out&

Logs are written in the same folder in a file named data_collection_<timestamp>.log and can be used for debugging.

The dataset will be downloaded into the directory specified in config.json, and progress can be monitored in the data_collection.out file.

Dataset Structure

The downloaded dataset will have the following folder structure,

├── gossipcop
│   ├── fake
│   │   ├── gossipcop-1
│   │   │   ├── news content.json
│   │   │   ├── tweets
│   │   │   │   ├── 886941526458347521.json
│   │   │   │   ├── 887096424105627648.json
│   │   │   │   └── ....
│   │   │   └── retweets
│   │   │       ├── 887096424105627648.json
│   │   │       └── ....
│   │   └── ....
│   └── real
│       ├── gossipcop-1
│       │   ├── news content.json
│       │   ├── tweets
│       │   └── retweets
│       └── ....
├── politifact
│   ├── fake
│   │   ├── politifact-1
│   │   │   ├── news content.json
│   │   │   ├── tweets
│   │   │   └── retweets
│   │   └── ....
│   └── real
│       ├── politifact-2
│       │   ├── news content.json
│       │   ├── tweets
│       │   └── retweets
│       └── ....
├── user_profiles
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
├── user_timeline_tweets
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
├── user_followers
│   ├── 374136824.json
│   ├── 937649414600101889.json
│   └── ....
└── user_following
    ├── 374136824.json
    ├── 937649414600101889.json
    └── ....

News Content

news content.json: This JSON file includes all the meta information of the news article collected using the provided news source URL. It is a JSON object with attributes including the following (a short reading sketch follows the list):

  • text is the text of the body of the news article.
  • images is a list of the URLs of all the images on the news article web page.
  • publish date indicates the date on which the news article was published.
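
A minimal reading sketch; the folder id is illustrative, the publish_date key name is an assumption based on the list above, and .get is used because attributes may be missing for articles that failed to crawl:

import json

# Path follows the folder layout shown in the tree above (illustrative id).
path = "fakenewsnet_dataset/politifact/fake/politifact-1/news content.json"
with open(path, encoding="utf-8") as f:
    article = json.load(f)

print(sorted(article.keys()))           # see which attributes were crawled
print(article.get("publish_date"))      # key name assumed; check the keys above
print(len(article.get("images", [])), "images")
print(article.get("text", "")[:300])    # first 300 characters of the body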

Social Context

tweets folder: This folder contains all tweets related to the news sample, i.e. the tweet objects for all the tweet IDs provided in the tweet_ids attribute of the dataset CSV. All files in this folder are named <tweet_id>.json. Each <tweet_id>.json file is a JSON file with the format described at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html.

retweets folder: This folder contains the retweets of all tweets posted sharing a particular news article. It contains files named <tweet_id>.json, each holding an array of the retweets of that particular tweet. Each object in the retweets array has the format described at https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-retweets-id.

user_profiles folder: This folder contains the user profiles of all users posting tweets related to the news articles. The same folder is used for both data sources (PolitiFact and GossipCop). It contains files named <user_id>.json with the JSON format described at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.html.

user_timeline_tweets folder: This folder contains files representing the timelines of users posting tweets related to fake and real news. All files in the folder are named <user_id>.json and hold a JSON array of up to 200 recent tweets of the user, with the format described at https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html.

user_followers folder: This folder contains the follower IDs of all users posting tweets related to the news articles. The same folder is used for both data sources (PolitiFact and GossipCop). It contains files named <user_id>.json with JSON data containing user_id and followers attributes.

user_following folder: This folder contains the following (friend) IDs of all users posting tweets related to the news articles. The same folder is used for both data sources (PolitiFact and GossipCop). It contains files named <user_id>.json with JSON data containing user_id and following attributes.
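
A traversal sketch over the downloaded tree that counts tweets and retweets per article; the folder names follow the structure above, the retweets file layout ({"retweets": [...]}) follows the issues further down this page, and articles whose tweets were not collected are simply skipped:

import json
import os

def engagement_counts(root, source, label):
    """Yield (news_id, n_tweets, n_retweets) for every downloaded article."""
    base = os.path.join(root, source, label)
    for news_id in sorted(os.listdir(base)):
        tweets_dir = os.path.join(base, news_id, "tweets")
        retweets_dir = os.path.join(base, news_id, "retweets")
        n_tweets = len(os.listdir(tweets_dir)) if os.path.isdir(tweets_dir) else 0
        n_retweets = 0
        if os.path.isdir(retweets_dir):
            for fname in os.listdir(retweets_dir):
                with open(os.path.join(retweets_dir, fname), encoding="utf-8") as f:
                    n_retweets += len(json.load(f).get("retweets", []))
        yield news_id, n_tweets, n_retweets

for row in engagement_counts("fakenewsnet_dataset", "politifact", "fake"):
    print(row)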

References

If you use this dataset, please cite the following papers:

@article{shu2018fakenewsnet,
  title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
  author={Shu, Kai and  Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
  journal={arXiv preprint arXiv:1809.01286},
  year={2018}
}
@article{shu2017fake,
  title={Fake News Detection on Social Media: A Data Mining Perspective},
  author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
  journal={ACM SIGKDD Explorations Newsletter},
  volume={19},
  number={1},
  pages={22--36},
  year={2017},
  publisher={ACM}
}
@article{shu2017exploiting,
  title={Exploiting Tri-Relationship for Fake News Detection},
  author={Shu, Kai and Wang, Suhang and Liu, Huan},
  journal={arXiv preprint arXiv:1712.07709},
  year={2017}
}

(C) 2019 Arizona Board of Regents on Behalf of ASU

fakenewsnet's People

Contributors

kaidmml, mdepak


fakenewsnet's Issues

Help

Hello! Is it possible to set up another country?

no tweets retweets

I am only getting news.json, not the tweets and retweets.
I filled in the Twitter keys; any help?
Thanks

Size of dataset

Hello

I have a question regarding downloading the dataset.
As downloading is very slow for me, I would like to know how far along I am in the downloading process.
So far I am downloading the politifact part of the dataset, including tweets and retweets. The part I have already downloaded to disk currently takes up 7.1 GB of storage.
Does anyone know how big the final dataset will be?

Speeding up the download

I have been running the code non-stop for about two weeks now, and I get the feeling that it will take an even longer time to get the dataset ready.

When posting a question about data collection limits on the Twitter dev forum, it was pointed out that the code is using a suboptimal lookup for the tweet gathering. Forum post
I wanted to bring this to attention so that the collection process could be sped up for everyone using this dataset.
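
For context, the suboptimal pattern is hydrating tweets one call per ID; the statuses/lookup endpoint accepts up to 100 IDs per call. Below is a hedged sketch of batched hydration with Twython; it is not part of the repository code, the credentials are placeholders, and the IDs are illustrative.

from twython import Twython

# Placeholder credentials; use the keys from tweet_keys_file.json.
twitter = Twython("APP_KEY", "APP_SECRET", "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET")

tweet_ids = ["886941526458347521", "887096424105627648"]  # illustrative IDs

# statuses/lookup hydrates up to 100 tweets per request, which needs far
# fewer API calls than one show_status request per tweet id.
for i in range(0, len(tweet_ids), 100):
    batch = tweet_ids[i:i + 100]
    tweets = twitter.lookup_status(id=",".join(batch))
    for tweet in tweets:
        print(tweet["id_str"], tweet["user"]["screen_name"])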

Getting "No connection could be made because the target machine actively refused it'" error on resuming

I was downloading the dataset as instructed.
All the news content got downloaded and now the tweets/retweets stuff was getting downloaded when my laptop decided to restart overnight (it was 33% of approx 165k done last I had checked).

So I started it again in the morning by removing the news articles option from the config.json
It has been running for an hour now, but with just the following error message:

" File "\FakeNewsNet\code\util\TwythonConnector.py", line 65, in get_resource_index
    response = requests.get(self.url + resource_type)
  File "C:\Users\asus\Anaconda3\envs\thesis\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\asus\Anaconda3\envs\thesis\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\asus\Anaconda3\envs\thesis\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\asus\Anaconda3\envs\thesis\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\asus\Anaconda3\envs\thesis\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /get-keys?resource_type=get_retweet (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002601A614048>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))"

Is this because these were done already? Right now it has done 6%, all with this error. So will it go until 33% and then resume downloading the remaining part?
Or is something wrong?
It says it failed to establish a new connection because the target machine refused it. What is the reason for that? I do not think there is a problem with my Twitter API keys.

retweets

Hi, is the code able to get all the retweets of a tweet, or just the most recent 100?

Twitter's REST API seems to only be able to get the first 100. Do you know if there's a way to work around that restriction?

Is there the complete dataset ready to download?

Hi,
Given the difficulties of crawling the data, is the complete dataset ready to download anywhere? Or is it possible for someone who has been able to successfully download it to upload the dataset?

Dataset decay

Hi

I was wondering if you could upload a new version of the dataset. Since you seem to be crawling the data from GossipCop and PolitiFact, it would be nice to get an updated dataset.

For example, regarding the GossipCop data:
out of the 5323 news items, only 1749 still have tweet data attributed to them. For every other item, the calls came back empty, as the tweets were either deleted or hidden.
This is understandable, as spreading fake news is against Twitter guidelines, as far as I am aware.

But because of this, it would be nice to get a refreshed version of the dataset, maybe with new news items.

Thank you in advance

news_content incomplete due to unavailable URLs

Hi!

I encountered an issue with the news content data. I have collected the fake news content of both PolitiFact and GossipCop. However, I noticed that a substantial number of URLs in politifact_fake.csv and gossipcop_fake.csv have become unavailable or have changed, making it impossible to collect these articles. This happened for 120 fake PolitiFact and 845 fake GossipCop articles.

I was wondering if there is any way to collect all the (fake) news content data as used in your latest dataset paper.

Thank you!

Marieke

Any way to filter the real news-related images?

Thanks for sharing. It seems that most crawled images in a news article are not related to the content (e.g., the website logo); is there any convenient way to filter out the unrelated images and keep only the genuinely news-related ones? Thanks!

Instructions to download for windows systems

Hi, I'm trying to download the dataset on my Windows machine. I am having trouble starting up the Flask server and the main Python file. Instructions on how to download would be fantastic!

Problem with crawling retweets.

The tweets are being crawled just fine, but I am having trouble with the retweets. I receive a 403 Forbidden error or an IndexError: list index out of range and cannot crawl anything. I can't quite figure out why this error is thrown at
retweets = connection.get_retweets(id=tweet.tweet_id, count=100, cursor=-1). There should also be no problem with providing the Twitter API keys in the tweet_keys_file.txt file, since the tweets themselves are being collected correctly and some of the retweets also do get collected. As a result, most retweets aren't being collected, with the .json file just containing {"retweets": []}, though some are. Below are the full errors:

ERROR:root:Exception in getting retweets for tweet id 960225064590327808 using connection <Twython: f9U42MLkHhl1V7o4aO1cGsNtf>
Traceback (most recent call last):
  File "FakeNewsNet\code\retweet_collection.py", line 19, in dump_retweets_job
    retweets = connection.get_retweets(id=tweet.tweet_id, count=100, cursor=-1)
  File "Anaconda3\lib\site-packages\twython\endpoints.py", line 85, in get_retweets
    params=params)
  File "Anaconda3\lib\site-packages\twython\api.py", line 270, in get
    return self.request(endpoint, params=params, version=version)
  File "Anaconda3\lib\site-packages\twython\api.py", line 264, in request
    api_call=url)
  File "Anaconda3\lib\site-packages\twython\api.py", line 199, in _request
    retry_after=response.headers.get('X-Rate-Limit-Reset'))
twython.exceptions.TwythonError: Twitter API returned a 403 (Forbidden), Forbidden.

The other error:

ERROR:root:Exception in getting retweets for tweet id 920382448294400000 using connection None
Traceback (most recent call last):
  File "FakeNewsNet\code\retweet_collection.py", line 18, in dump_retweets_job
    connection = twython_connector.get_twython_connection("get_retweet")
  File "FakeNewsNet\code\util\TwythonConnector.py", line 61, in get_twython_connection
    return self.streams[resource_index]
IndexError: list index out of range

Tweet propagation chain

Hi, thanks for sharing FakeNewsNet project.

I would like to know if there is any option to create a tweet propagation chain. Something like: User A creates Tweet 1, User B retweets Tweet 1 resulting in Tweet1-Retweet, User C retweets Tweet1-Retweet, and so on. This would give us a tree- or graph-like structure.

Thanks much.
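
For what it's worth, the public retweets endpoint links every retweet back to the original tweet only, so only a two-level (star-shaped) tree can be rebuilt from the downloaded files; deeper retweet-of-retweet chains are not recoverable from this data. A sketch under that assumption, using the folder layout described above:

import json
import os

def retweet_tree(article_dir):
    """Map each original tweet id to the users who retweeted it (two levels only)."""
    tree = {}
    retweets_dir = os.path.join(article_dir, "retweets")
    if not os.path.isdir(retweets_dir):
        return tree
    for fname in os.listdir(retweets_dir):
        tweet_id = fname.replace(".json", "")
        with open(os.path.join(retweets_dir, fname), encoding="utf-8") as f:
            retweets = json.load(f).get("retweets", [])
        tree[tweet_id] = [rt["user"]["id_str"] for rt in retweets]
    return tree

print(retweet_tree("fakenewsnet_dataset/politifact/fake/politifact-1"))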

Release of latest FakeNewsNet dataset

Thank you for releasing a public dataset for studying the problem of fake news detection and spreading on social media.

I was wondering when we can access the latest version of the dataset as described in your latest arXiv paper?

Thanks

Number of images

Hi, maybe it wasn't so clear to me how many images are available in the tweets. How many tweets include multimedia content like images?

'& was unexpected at this time'

After running the command "nohup python -m resource_server.app &> keys_server.out&", I get the error "& was unexpected at this time" and I don't know how to solve it. The command is run within the code folder, and I have the Twitter keys in "tweet_keys_file.json". Any help would be appreciated.

Can I run the code on Windows?

Hi, I ran python main.py in the Windows command prompt instead of nohup python -m resource_server.app &> keys_server.out&, but it returns the following error:
OverflowError: Python int too large to convert to C long

Is it possible to run the code on Windows, or is it a must to use Linux?
Or do the numbers exceed sys.maxsize?

Sorry if I have asked a stupid question...
Thanks!

Problems downloading simultaneously!

Hi all,

I have 10 pairs of keys.

  1. Should I just put them all into tweet_keys_file.txt, or should I have ten txt files with one pair in each file?

  2. How do I set config.json after finishing step 1?

"dump_location": "fakenewsnet_dataset",
"dataset_dir": "../dataset",
"tweet_keys_file": "resources/tweet_keys_file.txt",
"num_process": 4,
"num_twitter_keys": 1,

Regards

cannot download!

data_collection.out:

File "/home/lrn/Documents/FakeNewsNet-master/code/util/util.py", line 39, in init
self.twython_connector = TwythonConnector("localhost:5000", tweet_keys_file)
File "/home/lrn/Documents/FakeNewsNet-master/code/util/TwythonConnector.py", line 13, in init
self.init_twython_objects(key_file)
File "/home/lrn/Documents/FakeNewsNet-master/code/util/TwythonConnector.py", line 29, in init_twython_objects
oauth_token=line[2], oauth_token_secret=line[3]))
IndexError: list index out of range

Request for code for collecting social engagement

Hi,

I am interested in the social engagement (number of likes, replies, tweets, retweets) of the news articles, not necessarily in information about the users. I saw that code for collecting these dimensions can be provided upon request. Is there any way to obtain this code?

Thank you,
Marieke

Not able to download the data. Could you please help with how to download the data?

The data_collection.out file shows the following information:
nohup: ignoring input

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
....

How to use user_id from User.txt

Can someone explain to me how to use user_id from User.txt?

Is it possible to retrieve the metadata of the user with this id? (maybe using Twitter Search API)

FakeNewsNet/Data/BuzzFeed/User.txt

I believe these are hashed user_ids. Am I right?

Number of news shares is exactly the same for all fake and real news documents

Based on the explanations in the README file and using the *NewsUser.txt and *UserUser.txt files, I computed some statistics about the documents in the dataset, specifically the number of times that news articles, whether fake or real, were shared. Here are the results:

Fake PolitiFact: 120
Real PolitiFact: 120
Sum PolitiFact: 240
---------------------
Fake Buzzfeed: 91
Real Buzzfeed: 91
Sum Buzzfeed: 182
---------------------
Sum all: 422
Fake spread count: 20683
Real spread count: 20683
---------------------
Fake affected count: 639982
Real affected count: 639982

All the numbers are exactly the same for both fake and real news documents, and for both PolitiFact and BuzzFeed. I am wondering if I did something wrong or if there is an issue with the number of shares in the dataset?

How to see how many processes are running?

I have set 28 pairs of keys in the key JSON file and 100 processes in the config JSON file. The script is running. How can I know the number of processes and keys actually being used? I am using Ubuntu 18.04 with Python 3.6.

twython

I have an issue with this statement:

from twython import Twython
ImportError: No module named twython

Any help? Is it some version issue?
thanks

Downloading stuck!

I just got this in the morning. What happened?

2019-12-06 12:02:49,204 23325 news_content_collection ERROR Exception in getting data from url http://dailyfeed.news/barack-obama-tweets-sick-attack-on-john-mccain-says-he-should-have-died/
Traceback (most recent call last):
File "/home/wentao/Data/FakeNewNet-multi/FakeNewsNet-master/code/news_content_collection.py", line 49, in crawl_link_article
article.parse()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 191, in parse
self.throw_if_not_downloaded_verbose()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 532, in throw_if_not_downloaded_verbose
(self.download_exception_msg, self.url))
newspaper.article.ArticleException: Article download() failed with 404 Client Error: Not Found for url: https://corporatedispatch.com/barack-obama-tweets-sick-attack-on-john-mccain-says-he-should-have-died on URL http://dailyfeed.news/barack-obama-tweets-sick-attack-on-john-mccain-says-he-should-have-died/
2019-12-06 12:03:14,778 23325 news_content_collection ERROR Exception in getting data from url http://therightists.com/gretchen-carlson-the-2nd-amendment-was-written-before-guns-were-invented/
Traceback (most recent call last):
File "/home/wentao/Data/FakeNewNet-multi/FakeNewsNet-master/code/news_content_collection.py", line 49, in crawl_link_article
article.parse()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 191, in parse
self.throw_if_not_downloaded_verbose()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 532, in throw_if_not_downloaded_verbose
(self.download_exception_msg, self.url))
newspaper.article.ArticleException: Article download() failed with ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) on URL http://therightists.com/gretchen-carlson-the-2nd-amendment-was-written-before-guns-were-invented/
2019-12-06 12:03:21,730 23325 news_content_collection ERROR Exception in getting data from url http://yournewswire.com/pope-francis-jesus-metaphorical/
Traceback (most recent call last):
File "/home/wentao/Data/FakeNewNet-multi/FakeNewsNet-master/code/news_content_collection.py", line 49, in crawl_link_article
article.parse()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 191, in parse
self.throw_if_not_downloaded_verbose()
File "/home/wentao/.local/lib/python3.6/site-packages/newspaper/article.py", line 532, in throw_if_not_downloaded_verbose
(self.download_exception_msg, self.url))
newspaper.article.ArticleException: Article download() failed with 410 Client Error: Gone for url: https://newspunch.com/pope-francis-jesus-not-literal/ on URL http://yournewswire.com/pope-francis-jesus-metaphorical/

unable to download user profiles features

When I try to download the user profiles after downloading the tweets, the following problem arises:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
nohup: ignoring input
Traceback (most recent call last):
  File "main.py", line 70, in <module>
    download_dataset()
  File "main.py", line 66, in download_dataset
    data_collector.collect_data(data_choices)
  File "/root/FakeNewsNet-master/code/user_profile_collection.py", line 149, in collect_data
    "{}/{}/{}".format(self.config.dump_location, choice["news_source"], choice["label"])))
  File "/root/FakeNewsNet-master/code/user_profile_collection.py", line 25, in get_user_ids_in_folder
    tweet_object = json.load(open("{}/{}".format(tweets_dir, tweet_file)))
  File "/usr/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Would you please help me solve this problem?

Trying to parse data for fake news detection

Hello,
I am trying to work with your data for my Master's thesis, but I am not able to get a proper dataset. The dataset you provide is in .json format, which I converted into .csv format, but the column names do not match due to some missing columns, so the data is not displayed properly. Is there any possibility that you could share the latest dataset, or help me by listing the features you used for your project?
Thanks in advance :)

How to resume downloading? And, only news content.json is downloading?

Hi,
Thank you for the repo. Really appreciate it!

I have been trying to download the entire dataset as per the instructions. When I run main.py, it downloads only 432 fake and 624 real articles. Is that all of the dataset that we have available now (everything else removed)? It only downloads the news content.json, even though I have the other options specified in config.json as well. Why is this happening? How can I obtain the engagement data?

Also, it just stops at that point (I guess after downloading 7 GossipCop items, so 432+624+7). It has been 10 hours and it hasn't downloaded anything since then. I am guessing it is because of the limits imposed by the Twitter API? But then when will it resume? Will it resume automatically, or do I need to start it again (by running the two given commands)? If I do start it again, will it resume downloading the remaining part or start all over again?

News content only dataset

Hi @KaiDMML! Thank you for making FakeNewsNet available.

I'm trying to use the dataset with only the news content, but I'm running into some problems when reproducing the collection. Lots of sites have been closed or moved to other domains, so I can't download the article contents.

I understand the Twitter privacy policy concerns, but is it possible to provide a text-only dataset with the news contents?

Thanks! :bowtie:

I need the PolitiFact & GossipCop URLs

Hello,
I need the URL links for each row in the CSV files from GossipCop & PolitiFact from which you started collecting data. How can I access these page URLs? Can I use the id column? How?
This is very important for my research. Thank you for your response.

License

I have seen this and it looks like a great effort.
I would like to ask if you could tell me about the license info for testing and usage. Thank you!

some tweet objects are duplicated in fake and real (at the same time)

I was doing some preprocessing, and I found that there can be matching tweet objects (containing the same tweet_id, creation time, user_id, etc.); the only thing that differs is the label (although the news pieces are about two completely different things).
Please let me know if I'm wrong.
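
A quick way to check for such overlaps directly from the ground-truth CSVs (a sketch; the tab-separated tweet_ids field is described in the Overview above):

import pandas as pd

def tweet_id_set(csv_path):
    """Collect every tweet id referenced in one ground-truth CSV."""
    ids = set()
    for cell in pd.read_csv(csv_path)["tweet_ids"].dropna():
        ids.update(cell.split("\t"))
    return ids

fake = tweet_id_set("dataset/politifact_fake.csv")
real = tweet_id_set("dataset/politifact_real.csv")
print(len(fake & real), "tweet ids appear under both labels")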

Suggestion: better example in readme for *NewsUser.txt files

I think it may be better to make clear what the columns in this file are. In the given example, since both of the numbers in the second and third columns are 1, it is a little confusing which column corresponds to the user and which column is the number of shares.

PolitiFactNewsUser.txt: the news-user relationship. For example, '240 1 1' means news 240 is posted/spreaded by user 1 for 1 time.

very slow downloading of followers data!

Hi, despite very good download speeds for the news content and tweets, downloading the other data is awful. For example, after around every minute of downloading user_followers data, the downloading process pauses for 15 minutes. It means it would take more than a year to download such data!
It seems the downloading methods for tweets and for the other data are different...

Some of news contain tweets which aren't related

Dear @KaiDMML

I'm trying to use this dataset for my research.
I investigated some tweets and found that some are not related to the news at all.

For example, in the real category of PolitiFact, articles from CQ.com had many Japanese tweets with https://t.co/XXXXXX links.
politifact8005 is one of CQ.com's articles and it has many tweets, but most are just entries for promotional marketing campaigns (example tweet id: 1021190359525847040). Other tweets also refer to completely unrelated topics.

Also, I believe that news content.json contains a login error. Instead of containing the article data, it only contains the text of the login page:

Need help? Contact the CQ Hotline at (800) 678-8511 or [email protected]

I can confirm a similar phenomenon in all the other categories.
Is this intended? I am currently filtering those cases by using unicodedata.east_asian_width().
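
A sketch of the width-based filter described above; the threshold is arbitrary, and the function simply flags tweets dominated by wide/fullwidth (e.g. CJK) characters:

import unicodedata

def wide_char_ratio(text):
    """Fraction of characters classified as wide or fullwidth (e.g. CJK)."""
    if not text:
        return 0.0
    wide = sum(1 for ch in text if unicodedata.east_asian_width(ch) in ("W", "F"))
    return wide / len(text)

def looks_unrelated(tweet_text, threshold=0.3):
    # Flag tweets that are mostly CJK text; the threshold is chosen arbitrarily.
    return wide_char_ratio(tweet_text) > threshold

print(looks_unrelated("プレゼント企画に応募します！ https://t.co/XXXXXX"))  # True
print(looks_unrelated("Breaking news about the election"))                # False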

Download Time

Hello!

I am trying to download the dataset. I have managed to download the news content (it already took a long time), and I am now in the process of retrieving tweets. However, I have been running it for 10 hours now and it only says 35% downloaded; is that normal?

Also, this error recurs all the time:

  • ERROR:root:exception in collecting tweet objects [...] twython.exceptions.TwythonError: Twitter API returned a 403 (Forbidden), User has been suspended.

Can somebody help me out?

Thank you!

Marion

Fail to access the twitter developer account

Somebody for help!!!

Twitter API keys are essential for collecting data from Twitter, but my developer account application was not approved. And it seems that there is only one chance to apply for the account.
I wonder if there is a simpler way to obtain the data?

Thank you!

problem with retweet files

Hi, the contents of almost all the retweet files are {"retweets": []}, without any data. What is the problem? Is it normal?
