behavioral-ds / birdspotter

BirdSpotter is a Python package that provides an influence and bot detection toolkit for Twitter.

License: MIT License

Python 54.10% Jupyter Notebook 45.90%

birdspotter's Introduction

`birdspotter`: A tool to measure social attributes of Twitter users

birdspotter is a Python package providing a toolkit to measure the social influence and botness of Twitter users. It takes a Twitter dump in json or jsonl format as input and produces measures for:

Social Influence: The relative amount that one user can cause another user to adopt a behaviour, such as retweeting.
Botness: The amount that a user appears automated.

References:

Rohit Ram, Quyu Kong, and Marian-Andrei Rizoiu. 2021. Birdspotter: A Tool for Analyzing and Labeling Twitter Users. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3437963.3441695

Rizoiu, M.A., Graham, T., Zhang, R., Zhang, Y., Ackland, R. and Xie, L. # DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 US Presidential Debate. In Twelfth International AAAI Conference on Web and Social Media (ICWSM'18), 2018. https://arxiv.org/abs/1802.09808

Installation

pip3 install birdspotter

`birdspotter` requires a python version `>=3`.

Basic Usage

To use `birdspotter` on your own Twitter dump, replace './tweets.20150430-223406.jsonl' with the path to your dump, e.g. './path/to/tweet/dump.json'. In this example we use the nltk twitter-sample dataset found on Kaggle. It can be downloaded here. Note that the file extension needs to be changed from `.json` to `.jsonl`.

from birdspotter import BirdSpotter
bs = BirdSpotter('./tweets.20150430-223406.jsonl')
# This may take a few minutes, go grab a coffee...
labeledUsers = bs.getLabeledUsers(out='./output.csv')

After extracting the tweets, getLabeledUsers() returns a pandas dataframe with the influence and botness labels of users, and writes a csv file if a path is specified (e.g. ./output.csv).

`birdspotter` relies on the Fasttext word embeddings wiki-news-300d-1M.vec, which will automatically be downloaded if not available in the current directory (`./`) or a relative data folder (`./data/`).
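Because a `.jsonl` file holds one JSON object per line, it can be worth checking a dump before starting the (slow) extraction. The following is a minimal sketch using only the standard library; the helper name `count_valid_jsonl_lines` is hypothetical and not part of birdspotter.

```python
import json

def count_valid_jsonl_lines(lines):
    """Return (valid, invalid) counts for an iterable of JSONL lines."""
    valid, invalid = 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        # A tweet record is expected to be a JSON object (a dict in Python)
        if isinstance(record, dict):
            valid += 1
        else:
            invalid += 1
    return valid, invalid

sample = ['{"id": 1, "text": "hello"}', "{'id': 2}", ""]
print(count_valid_jsonl_lines(sample))
```

For a real dump, the same function can be applied to an open file handle, for example `with open(path) as f: count_valid_jsonl_lines(f)`.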

Get Cascades Data

After extracting the tweets, the retweet cascades are accessible by using:

cascades = bs.getCascadesDataFrame()

This dataframe includes the expected structure of the retweet cascade as given by Rizoiu et al. (2018) via the column `expected_parent` in this dataframe.

Analysis

We can now check the users with the highest (and lowest) botness:

bs.featureDataframe[['screen_name', 'botness']].sort_values(by='botness', ascending=False)

user_id	screen_name	botness
233703296	fletchersamf	0.909293
83661026	Dawnhomes	0.903377
791998387	rnmmm_	0.896476
889935433	hillsidepaul	0.893082
389418311	tabbycats4	0.884687
...	...	...
430023390	LooWeeeza	0.179779
2343382280	AntiLibDems	0.165497
244302832	LewMarshallsay	0.163851
258468459	emelyeppparker	0.156063
246382492	DanShatford	0.152157

We visit some of these accounts to see if their botness aligns with our intuition. On inspection, we see that Dawnhomes retweets at an exceptional rate and has a conspiratorial vibe. This seems to be automated to some extent. rnmmm_ spams retweets of a single account, suggesting it is also automated. On the other side, DanShatford seems like a real human, who tweets occasionally and has pictures of himself and friends on his profile.

In the same way we can check the users with the highest (and lowest) influence:

bs.featureDataframe[['screen_name', 'influence']].sort_values(by='influence', ascending=False)

user_id	screen_name	influence
43503	JamesWallis	491.000000
7076492	Glinner	487.562406
27110209	GeorgetteLock	430.000000
2384252054	djhenshall	274.724062
603915132	stephcraig_	269.555732
...	...	...
424432213	ErikZoha	1.000000
424395065	robevansz	1.000000
424296291	TathamJoanne	1.000000
423794014	matty2992	1.000000
355044262	DWTODWFA	1.000000

Again, we visit some of these accounts to verify our intuitions. Glinner is a blogger who writes long articles and shares these through her twitter account. It seems reasonable that she is influential. JamesWallis is a CEO, lecturer and writer, so his influence score also seems to fit.

We can also see the interaction between botness and influence by plotting this:

import matplotlib.pyplot as plt  # needed for plt.scatter below
import seaborn as sns
# We first get the influence in percentile form
bs.featureDataframe['influence percentile'] = bs.featureDataframe['influence'].rank(pct=True)

# We map the follower counts to colours
colors = sns.light_palette("#a1cfcf", input="hex", as_cmap=True)(bs.featureDataframe['followers_count'])

# We finally plot
g = sns.JointGrid(data=bs.featureDataframe, x="botness", y="influence percentile")
g = g.plot_joint(plt.scatter, color=colors, edgecolor="#a1cfcf")
g.plot_marginals(sns.distplot, kde=False, color="#c9245d")

We can see from the above plot that only a fraction of the users are considered to have influence, which is consistent with how users behave on Twitter, where many tweets do not garner retweets. The marginal distribution along the top of the x-axis suggests that botness is approximately normal, with a longer tail toward the left. Finally, the hue of the nodes shows that higher influence is correlated with higher follower counts, although there are apparent exceptions.
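To make the 'influence percentile' column concrete, here is a pure-Python sketch of what `rank(pct=True)` computes under pandas' default tie handling, where tied values share their average rank. The function name `percentile_rank` and the toy scores are illustrative assumptions, not part of birdspotter.

```python
def percentile_rank(values):
    """Average-rank percentiles: rank / n, with ties sharing their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the run over every entry tied with the current value
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return [r / n for r in ranks]

scores = [1.0, 1.0, 1.0, 5.0, 10.0]  # toy influence scores
print(percentile_rank(scores))  # [0.4, 0.4, 0.4, 0.8, 1.0]
```

Users who share the minimum influence of 1.0 all land on the same low percentile, which is why the scatter plot shows a dense band at the bottom.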

How to train the classifier with your own botness data

birdspotter provides functionality for training the botness detector with your own training data. After extracting the tweets, we run:

bs.getBotAnnotationTemplate('./annotation_file.csv')

This produces a csv, with an empty column isbot to be annotated by a human, as below:

	screen_name	user_id	isbot
0	007_Rebooted	232358211
1	0151Sam64	351532518
2	0192am	621237594
3	01EddyCordero	324670614
4	052Erik	2807483094
...	...	...	...
11719	zoommonk	304081311
11720	zosephh	453328842
11721	zoumrouda	384483993
11722	zwartekat	14442577
11723	zygoticdeb	21486509
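If labels are already available from another source, the template could be filled in programmatically rather than by hand. This is a sketch only: the helper name `fill_isbot`, the `known_labels` mapping, the placeholder screen names, the comma delimiter, and the use of `1`/`0` as label values are all assumptions. Check which values `trainClassifierModel` expects before relying on it.

```python
import csv
import io

def fill_isbot(template_text, known_labels):
    """Fill the 'isbot' column for users that appear in known_labels."""
    reader = csv.DictReader(io.StringIO(template_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        label = known_labels.get(row["screen_name"])
        if label is not None:
            row["isbot"] = label  # users without a label keep an empty cell
        writer.writerow(row)
    return out.getvalue()

# Placeholder rows in the same column layout as the template above
template = (
    ",screen_name,user_id,isbot\n"
    "0,example_user_a,1001,\n"
    "1,example_user_b,1002,\n"
)
filled = fill_isbot(template, {"example_user_b": 1})
print(filled)
```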

Once annotated, the botness detector can be trained with:

bs.trainClassifierModel('./annotation_file.csv')

Finally, to get the new botness scores we run:

bs.getBotness()

Advanced Usage

Defining your own word embeddings

birdspotter provides functionality for defining your own word embeddings. For example:

customEmbedding # A mapping such as a dict() representing word embeddings
bs = BirdSpotter('./tweets.20150430-223406.jsonl', embeddings=customEmbedding)

Embeddings can be set through several methods; refer to setWord2VecEmbeddings.

Note the default bot training data uses the wiki-news-300d-1M.vec and as such we would need to retrain the bot detector for alternative word embeddings.
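As an illustration of what such a mapping might look like, the sketch below parses fastText-style `.vec` text (a header line with the vocabulary size and dimension, then one word and its vector per line) into a plain dict. The helper name `load_vec` is hypothetical, and whether birdspotter expects lists or numpy arrays as values is not stated here, so treat the value type as an assumption.

```python
import io

def load_vec(handle, limit=None):
    """Parse fastText-style .vec text into {word: [float, ...]}."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        embeddings[word] = [float(v) for v in values]
        if limit is not None and len(embeddings) >= limit:
            break
    return embeddings

# Toy three-dimensional vectors standing in for a real .vec file
toy = io.StringIO("2 3\nbird 0.1 0.2 0.3\nbot -0.5 0.0 0.5\n")
customEmbedding = load_vec(toy)
print(customEmbedding["bot"])  # [-0.5, 0.0, 0.5]
```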

Alternatives to python

Command-line usage

birdspotter can be accessed through the command-line to return a csv, with the recipe below:

birdspotter ./path/to/twitter/dump.json ./path/to/output/directory/

R usage

birdspotter functionality can be accessed in R via the reticulate package. reticulate still requires a python installation on your system and birdspotter to be installed. The following produces the same results as the standard usage.

install.packages("reticulate")
library(reticulate)
use_python(Sys.which("python3"))
birdspotter <- import("birdspotter")
bs <- birdspotter$BirdSpotter("./tweets.20150430-223406.jsonl")
bs$getLabeledUsers(out = './output.csv')

Acknowledgements

The development of this package was partially supported through a UTS Data Science Institute seed grant.

birdspotter's Issues

Documentation link in the README

I noticed that the README does not have a link to the documentation. Is it possible to link that?

Trouble parsing Twarc dump

I'm trying to use birdspotter from the commandline on Windows under Anaconda 3, and it can't seem to parse the tweets I've collected using Twarc. The output is the following (NB the first six lines are repeated many times (as many times as there are tweets, perhaps?)):

  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\BirdSpotter.py", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\BirdSpotter.py", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\BirdSpotter.py", line 184, in extractTweets
    temp_content = {'status_text':j['text'], 'user_id' : j['user']['id']}
Traceback (most recent call last):
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\indexes\base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\Scripts\birdspotter", line 30, in 
    bs = BirdSpotter(dumpPath, quiet=quiet)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\BirdSpotter.py", line 55, in __init__
    self.extractTweets(path,  tweetLimit = tweetLimit, embeddings=embeddings)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\birdspotter\BirdSpotter.py", line 241, in extractTweets
    userDataframe = pd.DataFrame(user_list).fillna(0).set_index('user_id')
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\frame.py", line 4178, in set_index
    level = frame[col]._values
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\derek\Documents\tools\Anaconda3\envs\election\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

Looking at the BirdSpotter code, it's not immediately clear why this is being thrown. The property .user.id is valid in the tweets, but the property .text is not - instead it's .full_text in this dataset (collected over the last couple of weeks - perhaps APIs have changed). Maybe that's the real issue.

This fails on a single tweet too (first of the collection).

Any tips on how I can get this up and running?

Thanks.

What is the threshold for the bot score ?

Hello @rohitram96, as part of a project I am running birdspotter on 1 million Tweet IDs. I had a small question.

What is the threshold of the bot score ?

Trouble with formats/filenames and downloading resources

(This was posted as a comment in issue #1 but is reposted here as a new issue.)

Running on a Mac I got this output trying to run birdspotter on a single tweet:

Thanks for the pointers - I used the rename .full_text to .text trick to get a bit further, but am still having a little trouble, now on a Mac, not just on Windows 10. Can you tell me if this error is related, or should it be a different Issue:

(election) derek@orac ➜ arson ~/Library/Python/3.7/bin/birdspotter -i one_tweet.json -o botscores/overall
Starting Tweet Extraction
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/derek/Library/Python/3.7/bin/birdspotter", line 30, in
    bs = BirdSpotter(dumpPath, quiet=quiet)
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/birdspotter/BirdSpotter.py", line 55, in __init__
    self.extractTweets(path, tweetLimit = tweetLimit, embeddings=embeddings)
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/birdspotter/BirdSpotter.py", line 241, in extractTweets
    userDataframe = pd.DataFrame(user_list).fillna(0).set_index('user_id')
  File "/Users/derek/Library/Python/3.7/lib/python/site-packages/pandas/core/frame.py", line 4411, in set_index
    raise KeyError("None of {} are in the columns".format(missing))
KeyError: "None of ['user_id'] are in the columns"
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.user.id_str'
894082981534384129
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.user.id'
894082981534384100
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.id_str'
1211893587702497280
(election) derek@orac ➜ arson cat one_tweet.json| jq -rc '.id'
1211893587702497300

I thought I'd try looking at the user id and tweet id fields, and found that the id and id_str values are different, which I didn't expect. I use id_str in all my code, just to ensure I don't get rounding errors, but I'm not sure if it's absolutely necessary - do these results imply it is?

This is all using one tweet, which I'll attach as txt.
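The id/id_str mismatch is consistent with 64-bit floating-point rounding: a tool that stores JSON numbers as doubles cannot represent every integer above 2^53. That the jq build used here behaves this way is an assumption. The short sketch below, using the user id from the output above, shows the effect in Python, where integers are parsed exactly.

```python
import json

raw = '{"id": 894082981534384129, "id_str": "894082981534384129"}'
user = json.loads(raw)

exact = user["id"]             # Python ints have arbitrary precision
as_double = int(float(exact))  # what a double-based parser would retain

print(exact > 2**53)                 # True: outside the exactly representable range
print(as_double == exact)            # False: the low digits were rounded away
print(str(exact) == user["id_str"])  # True: the string form is always exact
```

Using `id_str` therefore remains the safer choice whenever a dump passes through tools that may parse numbers as floating point.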

I tried renaming one_tweet.json to one_tweet.jsonl and that got me further:
(election) derek@orac ➜ arson ~/Library/Python/3.7/bin/birdspotter -i one_tweet.jsonl -o botscores/one_tweet.csv
Starting Tweet Extraction
1it [00:00, 138.51it/s]
Reformatting cascades
100%|██████████| 1/1 [00:00<00:00, 44.78it/s]
Downloading Fasttext embeddings
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
Getting influence scores of users, with alpha of None, with time decay of -6.8e-05, with beta of 1.0
100%|██████████| 1/1 [00:00<00:00, 215.57it/s]
I also found the output needs to be a file rather than a directory, as I'd expected.

I don't know if all the resources that needed to be downloaded were downloaded, because it ran very quickly (and on Windows, it had spent some time downloading a 2Gb library of some sort for some of the text analysis - wiki-news-300d-1M.vec and .vec.zip).

Is it running correctly now, even with the SSL error?
one_tweet.jsonl.txt

ValueError: empty vocabulary; perhaps the documents only contain stop words

Yes. I am back.

I was trying to use BirdSpotter with another Twitter dump and I got this error.

This is my jsonl file -

MY JSON FILE

{"created_at": "Sun Jan 10 23:57:57 +0000 2021", "id": 1348418836060643332, "id_str": "1348418836060643332", "text": "RT @randomsakuga: Key Animation: https://t.co/Rl0OlpzfN6\nSeries: Rise of the Teenage Mutant Ninja Turtles (2019)\n\nhttps://t.co/LhTssYyf3A h…", "truncated": false, "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "randomsakuga", "name": "randomsakuga", "id": 835969639851126784, "id_str": "835969639851126784", "indices": [3, 16]}], "urls": [{"url": "https://t.co/Rl0OlpzfN6", "expanded_url": "https://pastebin.com/raw/tqwCF1Ue", "display_url": "pastebin.com/raw/tqwCF1Ue", "indices": [33, 56]}, {"url": "https://t.co/LhTssYyf3A", "expanded_url": "https://www.sakugabooru.com/post/show/107024", "display_url": "sakugabooru.com/post/show/1070…", "indices": [114, 137]}]}, "source": "<a href='http://twitter.com/download/android' rel='nofollow'>Twitter for Android</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 2809289914, "id_str": "2809289914", "name": "peppy ", "screen_name": "fluttershoot", "location": "she/they whtvr! 23", "description": "hi im peppy! goblin, goth, clown, monster, freak enthusiast this is my personal twitter so i post everything! 
|| pfp by @EliseraArt ", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 628, "friends_count": 2396, "listed_count": 4, "created_at": "Sun Oct 05 23:02:50 +0000 2014", "favourites_count": 80025, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": false, "statuses_count": 10299, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": false, "profile_image_url": "http://pbs.twimg.com/profile_images/1386826629217927168/OcENzEZW_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1386826629217927168/OcENzEZW_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2809289914/1629929357", "profile_link_color": "1DA1F2", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "has_extended_profile": true, "default_profile": true, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "null", "withheld_in_countries": []}, "geo": null, "coordinates": null, "place": null, "contributors": null, "retweeted_status": {"created_at": "Sun Jan 10 17:00:40 +0000 2021", "id": 1348313822810013708, "id_str": "1348313822810013708", "text": "Key Animation: https://t.co/Rl0OlpzfN6\nSeries: Rise of the Teenage Mutant Ninja Turtles (2019)… https://t.co/iBxNCrJNYp", "truncated": true, "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [{"url": "https://t.co/Rl0OlpzfN6", "expanded_url": "https://pastebin.com/raw/tqwCF1Ue", "display_url": "pastebin.com/raw/tqwCF1Ue", "indices": [15, 38]}, {"url": 
"https://t.co/iBxNCrJNYp", "expanded_url": "https://twitter.com/i/web/status/1348313822810013708", "display_url": "twitter.com/i/web/status/1…", "indices": [96, 119]}]}, "source": "<a href='https://www.hootsuite.com' rel='nofollow'>Hootsuite Inc.</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 835969639851126784, "id_str": "835969639851126784", "name": "randomsakuga", "screen_name": "randomsakuga", "location": "", "description": "Providing some good animation on your timeline. The medias are taken from @sakugabooru", "url": "https://t.co/vw6ZEEYgAU", "entities": {"url": {"urls": [{"url": "https://t.co/vw6ZEEYgAU", "expanded_url": "https://sakugabooru.com/post", "display_url": "sakugabooru.com/post", "indices": [0, 23]}]}, "description": {"urls": []}}, "protected": false, "followers_count": 258337, "friends_count": 33, "listed_count": 1618, "created_at": "Sun Feb 26 21:47:48 +0000 2017", "favourites_count": 6, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": true, "statuses_count": 9668, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "000000", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": false, "profile_image_url": "http://pbs.twimg.com/profile_images/840302059362615296/TaVA2uei_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/840302059362615296/TaVA2uei_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/835969639851126784/1489177763", "profile_link_color": "ABB8C2", "profile_sidebar_border_color": "000000", "profile_sidebar_fill_color": "000000", "profile_text_color": "000000", "profile_use_background_image": false, 
"has_extended_profile": false, "default_profile": false, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "null", "withheld_in_countries": []}, "geo": null, "coordinates": null, "place": null, "contributors": null, "is_quote_status": false, "retweet_count": 2009, "favorite_count": 8910, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}, "is_quote_status": false, "retweet_count": 2009, "favorite_count": 0, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}

Error when trying to use BirdSpotter on specialised Twitter Dump

Introduction

Hello, I am trying to use BirdSpotter with a Twitter dump that I created myself.

Error

It shows the following error:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The Twitter dump is in jsonl format.

Here is the Twitter Dump -

{\ "created_at\ ": \ "Sun Jan 10 23:55:50 0000 2021\ ", \ "id\ ": \ "1348418302952017922, \ "id_str\ ": \ "1348418302952017922\ ", \ "text\ ": \ "RT @Cloudphish: Arlington Board of Realtors\\n#phishing #CyberSecurity #infosec #technology\\nhttps://t.co/EegBr4KviP\ ", \ "truncated\ ": False, \ "entities\ ": {\ "hashtags\ ": [{\ "text\ ": \ "phishing\ ", \ "indices\ ": [44, 53]}, {\ "text\ ": \ "CyberSecurity\ ", \ "indices\ ": [54, 68]}, {\ "text\ ": \ "infosec\ ", \ "indices\ ": [69, 77]}, {\ "text\ ": \ "technology\ ", \ "indices\ ": [78, 89]}], \ "symbols\ ": [], \ "user_mentions\ ": [{\ "screen_name\ ": \ "Cloudphish\ ", \ "name\ ": \ "Cloudphish Anti Phishing\ ", \ "id\ ": 1141376699835330560, \ "id_str\ ": \ "1141376699835330560\ ", \ "indices\ ": [3, 14]}], \ "urls\ ": [{\ "url\ ": \ "https://t.co/EegBr4KviP\ ", \ "expanded_url\ ": \ "https://www.star telegram.com/homes/article248319155.html\ ", \ "display_url\ ": \ "star telegram.com/homes/article2\ ", \ "indices\ ": [90, 113]}]}, \ "source\ ": \ "<a href=\ "https://homeinc.com/\ ", \ "in_reply_to_status_id\ ": None, \ "in_reply_to_status_id_str\ ": None, \ "in_reply_to_user_id\ ": None, \ "in_reply_to_id_str\ ": None, \ "in_reply_to_screen_name\ ": None, \ "user\ ": {\ "id\ ": 1142424032794406912, \ "id_str\ ": \ "1142424032794406912\ ", \ "name\ ": \ "Cyber Security News\ ", \ "screen_name\ ": \ "CyberSecurityN8\ ", \ "location\ ": \ "\ ", \ "description\ ": \ "The place for InfoSec, CyberSecurity, DevSecOps, DataSecurity and many more!!! 
Stay tuned.\ ", \ "url\ ": None, \ "entities\ ": {\ "description\ ": {\ "urls\ ": []}}, \ "protected\ ": False, \ "followers_count\ ": 25712, \ "friends_count\ ": 2, \ "listed_count\ ": 349, \ "created_at\ ": \ "Sat Jun 22 13:28:09 0000 2019\ ", \ "favourites_count\ ": 0, \ "utc_offset\ ": None, \ "time_zone\ ": None, \ "geo_enabled\ ": False, \ "verified\ ": False, \ "statuses_count\ ": 1187335, \ "lang\ ": None, \ "contributors_enabled\ ": False, \ "is_translator\ ": False, \ "is_translation_enabled\ ": False, \ "profile_background_color\ ": \ "F5F8FA\ ", \ "profile_background_image_url\ ": None, \ "profile_background_image_url_https\ ": None, \ "profile_background_tile\ ": False, \ "profile_image_url\ ": \ "http://pbs.twimg.com/profile_images/1197135188473475074/8svI 1EO_normal.jpg\ ", \ "profile_image_url_https\ ": \ "https://pbs.twimg.com/profile_images/1197135188473475074/8svI 1EO_normal.jpg\ ", \ "profile_banner_url\ ": \ "https://pbs.twimg.com/profile_banners/1142424032794406912/1574254287\ ", \ "profile_link_color\ ": \ "1DA1F2\ ", \ "profile_sidebar_border_color\ ": \ "C0DEED\ ", \ "profile_sidebar_fill_color\ ": \ "DDEEF6\ ", \ "profile_text_color\ ": \ "333333\ ", \ "profile_use_background_image\ ": True, \ "has_extended_profile\ ": False, \ "default_profile\ ": True, \ "default_profile_image\ ": False, \ "following\ ": False, \ "follow_request_sent\ ": False, \ "notifications\ ": False, \ "translator_type\ ": \ "none\ ", \ "withheld_in_countries\ ": []}, \ "geo\ ": None, \ "coordinates\ ": None, \ "place\ ": None, \ "contributors\ ": None, \ "is_quote_status\ ": False, \ "retweet_count\ ": 19, \ "favorite_count\ ": 0, \ "favorited\ ": False, \ "retweeted\ ": False, \ "possibly_sensitive\ ": False, \ "possibly_sensitive_appealable\ ": False, \ "lang\ ": \ "en\ ", \ "user_id\ ": 1348418306215211013, \ "retweeted_status\ ": {\ "created_at\ ": \ "Sun Jan 10 23:55:37 +0000 2021\ ", \ "id\ ": 1348418248770015232, \ "id_str\ ": \ "1348418248770015232\ ", \ 
"text\ ": \ "Arlington Board of Realtors\\n#phishing #CyberSecurity #infosec #technology\\nhttps://t.co/EegBr4KviP\ ", \ "truncated\ ": False, \ "entities\ ": {\ "hashtags\ ": [{\ "text\ ": \ "phishing\ ", \ "indices\ ": [28, 37]}, {\ "text\ ": \ "CyberSecurity\ ", \ "indices\ ": [38, 52]}, {\ "text\ ": \ "infosec\ ", \ "indices\ ": [53, 61]}, {\ "text\ ": \ "technology\ ", \ "indices\ ": [62, 73]}], \ "symbols\ ": [], \ "user_mentions\ ": [], \ "urls\ ": [{\ "url\ ": \ "https://t.co/EegBr4KviP\ ", \ "expanded_url\ ": \ "https://www.startelegram.com/homes/article248319155.html\ ", \ "display_url\ ": \ "star telegram.com/homes/article2\ ", \ "indices\ ": [74, 97]}]}, \ "source\ ": \ "<a href=\ "https://cloudphish.com/\ ", \ "in_reply_to_status_id\ ": None, \ "in_reply_to_status_id_str\ ": None, \ "in_reply_to_user_id\ ": None, \ "in_reply_to_id_str\ ": None, \ "in_reply_to_screen_name\ ": None, \ "user\ ": {\ "id\ ": 1141376699835330560, \ "id_str\ ": \ "1141376699835330560\ ", \ "name\ ": \ "Cloudphish Anti Phishing\ ", \ "screen_name\ ": \ "Cloudphish\ ", \ "location\ ": \ "Bedford, MA\ ", \ "description\ ": \ "Cloudphish provides cloud based email validation protecting against all forms of email phishing like spear phishing, spoofing, impersonation and ceo fraud. 
\ud83d\udee1\ufe0f\ ", \ "url\ ": \ "https://t.co/FL4MKWEFHE\ ", \ "entities\ ": {\ "url\ ": {\ "urls\ ": [{\ "url\ ": \ "https://t.co/FL4MKWEFHE\ ", \ "expanded_url\ ": \ "https://cloudphish.com\ ", \ "display_url\ ": \ "cloudphish.com\ ", \ "indices\ ": [0, 23]}]}, \ "description\ ": {\ "urls\ ": []}}, \ "protected\ ": False, \ "followers_count\ ": 30, \ "friends_count\ ": 123, \ "listed_count\ ": 1, \ "created_at\ ": \ "Wed Jun 19 16:06:25 0000 2019\ ", \ "favourites_count\ ": 29, \ "utc_offset\ ": None, \ "time_zone\ ": None, \ "geo_enabled\ ": False, \ "verified\ ": False, \ "statuses_count\ ": 650, \ "lang\ ": None, \ "contributors_enabled\ ": False, \ "is_translator\ ": False, \ "is_translation_enabled\ ": False, \ "profile_background_color\ ": \ "000000\ ", \ "profile_background_image_url\ ": \ "http://abs.twimg.com/images/themes/theme1/bg.png\ ", \ "profile_background_image_url_https\ ": \ "https://abs.twimg.com/images/themes/theme1/bg.png\ ", \ "profile_background_tile\ ": False, \ "profile_image_url\ ": \ "http://pbs.twimg.com/profile_images/1141377536699637766/hVGGx0Ju_normal.png\ ", \ "profile_image_url_https\ ": \ "https://pbs.twimg.com/profile_images/1141377536699637766/hVGGx0Ju_normal.png\ ", \ "profile_banner_url\ ": \ "https://pbs.twimg.com/profile_banners/1141376699835330560/1560960628\ ", \ "profile_link_color\ ": \ "673AB7\ ", \ "profile_sidebar_border_color\ ": \ "000000\ ", \ "profile_sidebar_fill_color\ ": \ "000000\ ", \ "profile_text_color\ ": \ "000000\ ", \ "profile_use_background_image\ ": False, \ "has_extended_profile\ ": True, \ "default_profile\ ": False, \ "default_profile_image\ ": False, \ "following\ ": False, \ "follow_request_sent\ ": False, \ "notifications\ ": False, \ "translator_type\ ": \ "none\ ", \ "withheld_in_countries\ ": []}, \ "geo\ ": None, \ "coordinates\ ": None, \ "place\ ": None, \ "contributors\ ": None, \ "is_quote_status\ ": False, \ "retweet_count\ ": 2, \ "favorite_count\ ": 0, \ "favorited\ ": False, \ 
"retweeted\ ": False, \ "possibly_sensitive\ ": False, \ "possibly_sensitive_appealable\ ": False, \ "lang\ ": \ "en\ "}}

This is the BirdSpotter code:

from birdspotter import BirdSpotter
import ast
import json
import logging
input2 = str(input('File : '))
logging.basicConfig(level=logging.DEBUG)
bs = BirdSpotter(input2)
labeledUsers = bs.getLabeledUsers(out='./output.csv')
cascades = bs.getCascadesDataFrame()
bs.featureDataframe[['screen_name', 'botness']].sort_values(by='botness', ascending=False)
print('No Errors')

I would be really grateful if I got a quick response.

Thanks in advance
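One plausible cause, judging from the Python-style `False`/`None` literals and the backslash-escaped quotes in the dump above, is that the records were written with something other than `json.dumps` (for example `str()` on a dict, or a second round of escaping). This is a guess rather than a confirmed diagnosis. The sketch below uses a made-up `tweet` dict to show the difference between the two ways of serialising a record.

```python
import json

# Made-up record for illustration only
tweet = {"id": 1, "truncated": False, "url": None, "text": "RT @a: hi"}

bad_line = str(tweet)          # Python repr: single quotes, False, None
good_line = json.dumps(tweet)  # valid JSON: double quotes, false, null

try:
    json.loads(bad_line)
    bad_parses = True
except json.JSONDecodeError as err:
    bad_parses = False
    print("rejected:", err.msg)

print(json.loads(good_line) == tweet)  # True: round-trips cleanly
```

Writing one `json.dumps(tweet)` result per line produces a file that any JSONL reader should accept.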

Loading output from twitter-intact-stream failed

Hi, I used the crawler from twitter-intact-stream to collect tweets. Then I uncompressed the output file, added the extension .jsonl, and loaded it with birdspotter. The following error occurred:

Extracting raw tweets: 6186it [00:03, 1872.15it/s]
Traceback (most recent call last):
  File "", line 1, in
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 56, in __init__
    self.extractTweets(path, tweetLimit = tweetLimit, embeddings=embeddings)
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 241, in extractTweets
    for temp_user, temp_tweet, temp_content, temp_description, temp_cascade in itertools.chain(*map(self.process_tweet, tqdm(raw_tweets, desc="Extracting raw tweets"))):
  File "/home/tam/anaconda3/lib/python3.8/site-packages/birdspotter/BirdSpotter.py", line 142, in process_tweet
    temp_text = (j['text'] if 'text' in j.keys() else j['full_text'])
KeyError: 'full_text'
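A possible workaround, assuming the failing line is a stream record that is not a tweet at all (streaming output can include records such as deletion notices that have neither field), is to filter the file before loading it. The helper name `keep_tweet_records` and the sample records are hypothetical.

```python
import json

def keep_tweet_records(lines):
    """Yield only lines whose JSON object has 'text' or 'full_text'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
        if isinstance(record, dict) and ("text" in record or "full_text" in record):
            yield line

stream = [
    '{"id": 1, "text": "hello"}',
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "full_text": "a longer tweet"}',
]
kept = list(keep_tweet_records(stream))
print(len(kept))  # 2
```

The kept lines can then be written to a new `.jsonl` file, one per line, and passed to BirdSpotter as usual.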

BirdSpotter does not have a licence blurb in README; is MIT the right license?

See license blurb in evently

Add other hawkes kernels to influence quantification (namely PL)

Add flag to determine which kernel is used in influence measurement.

KeyError: "['botness', 'influence'] not in index"

botness, influence do not exist in featureDataframe

Installation fails on macOS Mojave 10.14.6 because of xgboost=0.81 dependency

xgboost needs to be updated to its stable format so that it can be installed on OSX more readily.