Trouble parsing Twarc dump

I'm trying to use birdspotter from the command line on Windows under Anaconda 3, and it can't seem to parse the tweets I've collected using Twarc. The output is the following (NB: the first six lines are repeated many times, perhaps once per tweet):

  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
Traceback (most recent call last):
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\indexes\", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\Scripts\birdspotter", line 30, in 
    bs = BirdSpotter(dumpPath, quiet=quiet)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\", line 55, in __init__
    self.extractTweets(path,  tweetLimit = tweetLimit, embeddings=embeddings)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\", line 241, in extractTweets
    userDataframe = pd.DataFrame(user_list).fillna(0).set_index('user_id')
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\", line 4178, in set_index
    level = frame[col]._values
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\indexes\", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

Looking at the BirdSpotter code, it's not immediately clear why this is being thrown. The user ID property is present in the tweets, but the property .text is not. In this dataset (collected over the last couple of weeks) it is .full_text instead, perhaps because the API has changed. Maybe that's the real issue.

This fails on a single tweet too (first of the collection).

Any tips on how I can get this up and running?
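
One possible workaround, assuming the only mismatch is the field name, is to copy full_text into text before passing the file to birdspotter. The sketch below is not part of birdspotter: the function names are hypothetical, and it assumes the dump holds one JSON object per line.

```python
import json


def add_text_field(tweet):
    """Return a copy of the tweet with 'text' filled from 'full_text' if missing."""
    fixed = dict(tweet)
    if "text" not in fixed and "full_text" in fixed:
        fixed["text"] = fixed["full_text"]
    # Retweets and quotes nest another tweet object that needs the same fix.
    for key in ("retweeted_status", "quoted_status"):
        if isinstance(fixed.get(key), dict):
            fixed[key] = add_text_field(fixed[key])
    return fixed


def normalise_dump(src_path, dst_path):
    """Rewrite a JSONL dump line by line, adding 'text' where it is missing."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(json.dumps(add_text_field(json.loads(line))) + "\n")
```

The rewritten file keeps full_text as well, so other tools that expect it keep working.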



Error when trying to use BirdSpotter on specialised Twitter Dump


Hello, I am trying to use BirdSpotter with a Twitter dump that I created myself.


It shows the following error:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The Twitter dump is in jsonl format.

Here is the Twitter Dump -

{\ "created_at\ ": \ "Sun Jan 10 23:55:50 0000 2021\ ", \ "id\ ": \ "1348418302952017922, \ "id_str\ ": \ "1348418302952017922\ ", \ "text\ ": \ "RT @Cloudphish: Arlington Board of Realtors\\n#phishing #CyberSecurity #infosec #technology\\n\ ", \ "truncated\ ": False, \ "entities\ ": {\ "hashtags\ ": [{\ "text\ ": \ "phishing\ ", \ "indices\ ": [44, 53]}, {\ "text\ ": \ "CyberSecurity\ ", \ "indices\ ": [54, 68]}, {\ "text\ ": \ "infosec\ ", \ "indices\ ": [69, 77]}, {\ "text\ ": \ "technology\ ", \ "indices\ ": [78, 89]}], \ "symbols\ ": [], \ "user_mentions\ ": [{\ "screen_name\ ": \ "Cloudphish\ ", \ "name\ ": \ "Cloudphish Anti Phishing\ ", \ "id\ ": 1141376699835330560, \ "id_str\ ": \ "1141376699835330560\ ", \ "indices\ ": [3, 14]}], \ "urls\ ": [{\ "url\ ": \ "\ ", \ "expanded_url\ ": \ "\ ", \ "display_url\ ": \ "star\ ", \ "indices\ ": [90, 113]}]}, \ "source\ ": \ "<a href=\ "\ ", \ "in_reply_to_status_id\ ": None, \ "in_reply_to_status_id_str\ ": None, \ "in_reply_to_user_id\ ": None, \ "in_reply_to_id_str\ ": None, \ "in_reply_to_screen_name\ ": None, \ "user\ ": {\ "id\ ": 1142424032794406912, \ "id_str\ ": \ "1142424032794406912\ ", \ "name\ ": \ "Cyber Security News\ ", \ "screen_name\ ": \ "CyberSecurityN8\ ", \ "location\ ": \ "\ ", \ "description\ ": \ "The place for InfoSec, CyberSecurity, DevSecOps, DataSecurity and many more!!! 
Stay tuned.\ ", \ "url\ ": None, \ "entities\ ": {\ "description\ ": {\ "urls\ ": []}}, \ "protected\ ": False, \ "followers_count\ ": 25712, \ "friends_count\ ": 2, \ "listed_count\ ": 349, \ "created_at\ ": \ "Sat Jun 22 13:28:09 0000 2019\ ", \ "favourites_count\ ": 0, \ "utc_offset\ ": None, \ "time_zone\ ": None, \ "geo_enabled\ ": False, \ "verified\ ": False, \ "statuses_count\ ": 1187335, \ "lang\ ": None, \ "contributors_enabled\ ": False, \ "is_translator\ ": False, \ "is_translation_enabled\ ": False, \ "profile_background_color\ ": \ "F5F8FA\ ", \ "profile_background_image_url\ ": None, \ "profile_background_image_url_https\ ": None, \ "profile_background_tile\ ": False, \ "profile_image_url\ ": \ " 1EO_normal.jpg\ ", \ "profile_image_url_https\ ": \ " 1EO_normal.jpg\ ", \ "profile_banner_url\ ": \ "\ ", \ "profile_link_color\ ": \ "1DA1F2\ ", \ "profile_sidebar_border_color\ ": \ "C0DEED\ ", \ "profile_sidebar_fill_color\ ": \ "DDEEF6\ ", \ "profile_text_color\ ": \ "333333\ ", \ "profile_use_background_image\ ": True, \ "has_extended_profile\ ": False, \ "default_profile\ ": True, \ "default_profile_image\ ": False, \ "following\ ": False, \ "follow_request_sent\ ": False, \ "notifications\ ": False, \ "translator_type\ ": \ "none\ ", \ "withheld_in_countries\ ": []}, \ "geo\ ": None, \ "coordinates\ ": None, \ "place\ ": None, \ "contributors\ ": None, \ "is_quote_status\ ": False, \ "retweet_count\ ": 19, \ "favorite_count\ ": 0, \ "favorited\ ": False, \ "retweeted\ ": False, \ "possibly_sensitive\ ": False, \ "possibly_sensitive_appealable\ ": False, \ "lang\ ": \ "en\ ", \ "user_id\ ": 1348418306215211013, \ "retweeted_status\ ": {\ "created_at\ ": \ "Sun Jan 10 23:55:37 +0000 2021\ ", \ "id\ ": 1348418248770015232, \ "id_str\ ": \ "1348418248770015232\ ", \ "text\ ": \ "Arlington Board of Realtors\\n#phishing #CyberSecurity #infosec #technology\\n\ ", \ "truncated\ ": False, \ "entities\ ": {\ "hashtags\ ": [{\ "text\ ": \ "phishing\ ", \ 
"indices\ ": [28, 37]}, {\ "text\ ": \ "CyberSecurity\ ", \ "indices\ ": [38, 52]}, {\ "text\ ": \ "infosec\ ", \ "indices\ ": [53, 61]}, {\ "text\ ": \ "technology\ ", \ "indices\ ": [62, 73]}], \ "symbols\ ": [], \ "user_mentions\ ": [], \ "urls\ ": [{\ "url\ ": \ "\ ", \ "expanded_url\ ": \ "\ ", \ "display_url\ ": \ "star\ ", \ "indices\ ": [74, 97]}]}, \ "source\ ": \ "<a href=\ "\ ", \ "in_reply_to_status_id\ ": None, \ "in_reply_to_status_id_str\ ": None, \ "in_reply_to_user_id\ ": None, \ "in_reply_to_id_str\ ": None, \ "in_reply_to_screen_name\ ": None, \ "user\ ": {\ "id\ ": 1141376699835330560, \ "id_str\ ": \ "1141376699835330560\ ", \ "name\ ": \ "Cloudphish Anti Phishing\ ", \ "screen_name\ ": \ "Cloudphish\ ", \ "location\ ": \ "Bedford, MA\ ", \ "description\ ": \ "Cloudphish provides cloud based email validation protecting against all forms of email phishing like spear phishing, spoofing, impersonation and ceo fraud. \ud83d\udee1\ufe0f\ ", \ "url\ ": \ "\ ", \ "entities\ ": {\ "url\ ": {\ "urls\ ": [{\ "url\ ": \ "\ ", \ "expanded_url\ ": \ "\ ", \ "display_url\ ": \ "\ ", \ "indices\ ": [0, 23]}]}, \ "description\ ": {\ "urls\ ": []}}, \ "protected\ ": False, \ "followers_count\ ": 30, \ "friends_count\ ": 123, \ "listed_count\ ": 1, \ "created_at\ ": \ "Wed Jun 19 16:06:25 0000 2019\ ", \ "favourites_count\ ": 29, \ "utc_offset\ ": None, \ "time_zone\ ": None, \ "geo_enabled\ ": False, \ "verified\ ": False, \ "statuses_count\ ": 650, \ "lang\ ": None, \ "contributors_enabled\ ": False, \ "is_translator\ ": False, \ "is_translation_enabled\ ": False, \ "profile_background_color\ ": \ "000000\ ", \ "profile_background_image_url\ ": \ "\ ", \ "profile_background_image_url_https\ ": \ "\ ", \ "profile_background_tile\ ": False, \ "profile_image_url\ ": \ "\ ", \ "profile_image_url_https\ ": \ "\ ", \ "profile_banner_url\ ": \ "\ ", \ "profile_link_color\ ": \ "673AB7\ ", \ "profile_sidebar_border_color\ ": \ "000000\ ", \ 
"profile_sidebar_fill_color\ ": \ "000000\ ", \ "profile_text_color\ ": \ "000000\ ", \ "profile_use_background_image\ ": False, \ "has_extended_profile\ ": True, \ "default_profile\ ": False, \ "default_profile_image\ ": False, \ "following\ ": False, \ "follow_request_sent\ ": False, \ "notifications\ ": False, \ "translator_type\ ": \ "none\ ", \ "withheld_in_countries\ ": []}, \ "geo\ ": None, \ "coordinates\ ": None, \ "place\ ": None, \ "contributors\ ": None, \ "is_quote_status\ ": False, \ "retweet_count\ ": 2, \ "favorite_count\ ": 0, \ "favorited\ ": False, \ "retweeted\ ": False, \ "possibly_sensitive\ ": False, \ "possibly_sensitive_appealable\ ": False, \ "lang\ ": \ "en\ "}}

This is the BirdSpotter code:

from birdspotter import BirdSpotter
import ast
import json
import logging

input2 = str(input('File : '))
logging.basicConfig(level=logging.DEBUG)
bs = BirdSpotter(input2)
labeledUsers = bs.getLabeledUsers(out='./output.csv')
cascades = bs.getCascadesDataFrame()
bs.featureDataframe[['screen_name', 'botness']].sort_values(by='botness', ascending=False)
print('No Errors')

I would be really grateful if I got a quick response.

Thanks in advance
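
The error says the parser did not find a double-quoted property name at the second character of line 1. In the dump as pasted, the quotes appear escaped and the values use Python literals (False, None) rather than JSON ones (false, null). That suggests the file may have been written from a Python string representation instead of with json.dumps, though this is only a guess from the paste. The sketch below shows one way to find the first offending line and to write tweets as valid JSONL; both helper names are hypothetical.

```python
import json


def first_invalid_line(path):
    """Return (line_number, error_message) for the first non-JSON line, or None."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                return number, str(err)
    return None


def write_jsonl(tweets, path):
    """Write an iterable of dicts as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for tweet in tweets:
            # json.dumps emits false/null and double quotes, unlike str(dict).
            handle.write(json.dumps(tweet) + "\n")
```

If first_invalid_line returns None for a file, every non-blank line in it parsed as JSON.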

Loading output from twitter-intact-stream failed

Hi, I used the crawler from twitter-intact-stream to collect tweets. I then uncompressed the output file, added the extension .jsonl, and loaded it with birdspotter. The following error occurred:

Extracting raw tweets: 6186it [00:03, 1872.15it/s]
Traceback (most recent call last):
File "", line 1, in
File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/", line 56, in __init__
self.extractTweets(path, tweetLimit = tweetLimit, embeddings=embeddings)
File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/", line 241, in extractTweets
for temp_user, temp_tweet, temp_content, temp_description, temp_cascade in itertools.chain(*map(self.process_tweet, tqdm(raw_tweets, desc="Extracting raw tweets"))):
File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/", line 142, in process_tweet
temp_text = (j['text'] if 'text' in j.keys() else j['full_text'])
KeyError: 'full_text'
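
The traceback shows process_tweet falling back to j['full_text'] when 'text' is absent, so a record with neither key raises this KeyError. Twitter's streaming endpoints can interleave non-tweet messages such as delete and limit notices with the tweets. Whether twitter-intact-stream writes those to its output is an assumption here. A minimal sketch for separating such records before loading (split_tweets is a hypothetical helper, not a birdspotter function):

```python
import json


def split_tweets(lines):
    """Separate parsed records into tweets (with a text field) and everything else."""
    tweets, others = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict) and ("text" in record or "full_text" in record):
            tweets.append(record)
        else:
            others.append(record)
    return tweets, others
```

Inspecting the first few entries of the second list should show what kind of record birdspotter failed on.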

ValueError: empty vocabulary; perhaps the documents only contain stop words

Yes, I am back.

I was trying to use BirdSpotter with another Twitter dump and I got this error.

This is my jsonl file -


{"created_at": "Sun Jan 10 23:57:57 +0000 2021", "id": 1348418836060643332, "id_str": "1348418836060643332", "text": "RT @randomsakuga: Key Animation:\nSeries: Rise of the Teenage Mutant Ninja Turtles (2019)\n\n h…", "truncated": false, "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "randomsakuga", "name": "randomsakuga", "id": 835969639851126784, "id_str": "835969639851126784", "indices": [3, 16]}], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [33, 56]}, {"url": "", "expanded_url": "", "display_url": "…", "indices": [114, 137]}]}, "source": "<a href='' rel='nofollow'>Twitter for Android</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 2809289914, "id_str": "2809289914", "name": "peppy ", "screen_name": "fluttershoot", "location": "she/they whtvr! 23", "description": "hi im peppy! goblin, goth, clown, monster, freak enthusiast this is my personal twitter so i post everything! 
|| pfp by @EliseraArt ", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 628, "friends_count": 2396, "listed_count": 4, "created_at": "Sun Oct 05 23:02:50 +0000 2014", "favourites_count": 80025, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": false, "statuses_count": 10299, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "C0DEED", "profile_background_image_url": "", "profile_background_image_url_https": "", "profile_background_tile": false, "profile_image_url": "", "profile_image_url_https": "", "profile_banner_url": "", "profile_link_color": "1DA1F2", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "has_extended_profile": true, "default_profile": true, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "null", "withheld_in_countries": []}, "geo": null, "coordinates": null, "place": null, "contributors": null, "retweeted_status": {"created_at": "Sun Jan 10 17:00:40 +0000 2021", "id": 1348313822810013708, "id_str": "1348313822810013708", "text": "Key Animation:\nSeries: Rise of the Teenage Mutant Ninja Turtles (2019)…", "truncated": true, "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [15, 38]}, {"url": "", "expanded_url": "", "display_url": "…", "indices": [96, 119]}]}, "source": "<a href='' rel='nofollow'>Hootsuite Inc.</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 835969639851126784, "id_str": "835969639851126784", "name": "randomsakuga", "screen_name": "randomsakuga", "location": "", "description": "Providing some 
good animation on your timeline. The medias are taken from @sakugabooru", "url": "", "entities": {"url": {"urls": [{"url": "", "expanded_url": "", "display_url": "", "indices": [0, 23]}]}, "description": {"urls": []}}, "protected": false, "followers_count": 258337, "friends_count": 33, "listed_count": 1618, "created_at": "Sun Feb 26 21:47:48 +0000 2017", "favourites_count": 6, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": true, "statuses_count": 9668, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "000000", "profile_background_image_url": "", "profile_background_image_url_https": "", "profile_background_tile": false, "profile_image_url": "", "profile_image_url_https": "", "profile_banner_url": "", "profile_link_color": "ABB8C2", "profile_sidebar_border_color": "000000", "profile_sidebar_fill_color": "000000", "profile_text_color": "000000", "profile_use_background_image": false, "has_extended_profile": false, "default_profile": false, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "null", "withheld_in_countries": []}, "geo": null, "coordinates": null, "place": null, "contributors": null, "is_quote_status": false, "retweet_count": 2009, "favorite_count": 8910, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}, "is_quote_status": false, "retweet_count": 2009, "favorite_count": 0, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}
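
The message in the title matches the one scikit-learn's text vectorizers raise when no tokens are left to build a vocabulary from. Which field birdspotter vectorizes is not shown here, so the sketch below only counts candidate tokens in a few fields of a tweet. The choice of fields is an assumption, and the pattern (words of two or more characters) is modelled on scikit-learn's default token pattern.

```python
import json
import re

# Words of at least two word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def token_counts(tweet):
    """Count candidate tokens in the text, user description and hashtags of a tweet."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    hashtags = [tag.get("text", "") for tag in entities.get("hashtags", [])]
    fields = {
        "text": tweet.get("text") or tweet.get("full_text") or "",
        "description": user.get("description") or "",
        "hashtags": " ".join(hashtags),
    }
    return {name: len(TOKEN.findall(value)) for name, value in fields.items()}
```

In the tweet pasted above the hashtags list is empty, so that count would be zero. Whether that is what triggers the error has not been verified.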

Trouble with formats/filenames and downloading resources

(This was posted as a comment in issue #1 but is reposted here as a new issue.)

Running on a Mac I got this output trying to run birdspotter on a single tweet:

`Thanks for the pointers - I used the rename .full_text to .text trick to get a bit further, but am still having a little trouble, now on a Mac, not just on Windows 10. Can you tell me if this error is related, or should it be a different Issue:

(election) derek@orac ➜ arson ~/Library/Python/3.7/bin/birdspotter -i one_tweet.json -o botscores/overall
Starting Tweet Extraction
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/derek/Library/Python/3.7/bin/birdspotter", line 30, in
    bs = BirdSpotter(dumpPath, quiet=quiet)
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/birdspotter/", line 55, in __init__
    self.extractTweets(path, tweetLimit = tweetLimit, embeddings=embeddings)
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/birdspotter/", line 241, in extractTweets
    userDataframe = pd.DataFrame(user_list).fillna(0).set_index('user_id')
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/pandas/core/", line 4411, in set_index
    raise KeyError("None of {} are in the columns".format(missing))
KeyError: "None of ['user_id'] are in the columns"
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.user.id_str'
894082981534384129
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc ''
894082981534384100
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.id_str'
1211893587702497280
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.id'
1211893587702497300

I thought I'd try looking at the user id and tweet id fields, and found that the id and id_str values are different, which I didn't expect. I use id_str in all my code, just to ensure I don't get rounding errors, but I'm not sure if it's absolutely necessary - do these results imply it is?

This is all using one tweet, which I'll attach as txt.`
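
The differing id and id_str values are consistent with the numeric id being read as a double-precision float, which represents integers exactly only up to 2**53. Tools that parse JSON numbers that way round larger IDs. Python's json module keeps integers exact, as this small check illustrates. The JSON string is constructed from the user ID in the session above, on the assumption that the numeric id in the file equals id_str.

```python
import json

raw = '{"id": 894082981534384129, "id_str": "894082981534384129"}'
tweet = json.loads(raw)

exact = tweet["id"]            # Python parses JSON integers without rounding
as_double = int(float(exact))  # what a double-based parser would end up with

print("id matches id_str:", exact == int(tweet["id_str"]))
print("survives a round trip through float:", as_double == exact)
print("larger than 2**53:", exact > 2 ** 53)
```

So using id_str, or a parser that keeps integers exact, does matter once the data passes through a tool that stores numbers as doubles.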

I tried renaming one_tweet.json to one_tweet.jsonl and that got me further:
(election) derek@orac ➜ arson ~/Library/Python/3.7/bin/birdspotter -i one_tweet.jsonl -o botscores/one_tweet.csv
Starting Tweet Extraction
1it [00:00, 138.51it/s]
Reformatting cascades
100%|██████████| 1/1 [00:00<00:00, 44.78it/s]
Downloading Fasttext embeddings
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
Getting influence scores of users, with alpha of None, with time decay of -6.8e-05, with beta of 1.0
100%|██████████| 1/1 [00:00<00:00, 215.57it/s]
I also found that the output needs to be a file, not a directory as I had expected.

I don't know if all the resources that needed to be downloaded were downloaded, because it ran very quickly (and on Windows, it had spent some time downloading a 2Gb library of some sort for some of the text analysis - wiki-news-300d-1M.vec and

Is it running correctly now, even with the SSL error?
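
The log suggests the Fasttext download did not complete, since the urlopen error appears right after "Downloading Fasttext embeddings". "unable to get local issuer certificate" generally means Python could not find a CA bundle to verify the server against. The sketch below is a generic diagnostic using only the standard library, not a birdspotter feature, and the function name is hypothetical. It reports where this Python installation looks for CA certificates.

```python
import os
import ssl


def ca_bundle_report():
    """Describe the default CA certificate locations of this Python installation."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for label, location in (("cafile", paths.cafile), ("capath", paths.capath)):
        report[label] = {
            "path": location,
            "exists": bool(location) and os.path.exists(location),
        }
    # Certificates in a capath directory are loaded lazily, so this count
    # can be zero even when verification would succeed.
    report["loaded_ca_certs"] = ssl.create_default_context().cert_store_stats()["x509_ca"]
    return report


if __name__ == "__main__":
    for key, value in ca_bundle_report().items():
        print(key, value)
```

If neither location exists, installing a CA bundle for that Python is the usual remedy. For example, if this Python came from the python.org macOS installer, that installer ships an "Install Certificates.command" script for this purpose.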

