
academictwitteR's Introduction

academictwitteR

Note: this repo is now ARCHIVED due to changes to the Twitter API. The paid API means open-source development of this package is no longer feasible.


Repo containing code for the R package academictwitteR to collect tweets from the v2 API endpoint for the Academic Research Product Track.

To cite package ‘academictwitteR’ in publications use:

  • Barrie, Christopher and Ho, Justin Chun-ting. (2021). academictwitteR: an R package to access the Twitter Academic Research Product Track v2 API endpoint. Journal of Open Source Software, 6(62), 3272, https://doi.org/10.21105/joss.03272

A BibTeX entry for LaTeX users is:

@article{BarrieHo2021,
  doi = {10.21105/joss.03272},
  url = {https://doi.org/10.21105/joss.03272},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {62},
  pages = {3272},
  author = {Christopher Barrie and Justin Chun-ting Ho},
  title = {academictwitteR: an R package to access the Twitter Academic Research Product Track v2 API endpoint},
  journal = {Journal of Open Source Software}
}

  

Installation

You can install the package with:

install.packages("academictwitteR")

Alternatively, you can install the development version with:

devtools::install_github("cjbarrie/academictwitteR", build_vignettes = TRUE)

Get started by reading vignette("academictwitteR-intro").

To use the package, it first needs to be loaded with:

library(academictwitteR)

The academictwitteR package has been designed with the efficient storage of data in mind. Queries to the API include arguments to specify whether tweets should be stored as a .rds file (with the file argument) or as separate JSON files for tweet- and user-level information (with the data_path argument).

Tweets are returned as a data.frame object and, when a file argument has been included, will also be saved as a .rds file.

When collecting large amounts of data, we recommend the workflow described below, which allows the user to: 1) efficiently store authorization credentials; 2) efficiently store returned data; 3) bind the data into a data.frame object or tibble; 4) resume collection in case of interruption; and 5) update collection when needed.

Authorization

The first task is to set authorization credentials with the set_bearer() function, which allows the user to store their bearer token in the .Renviron file.

To do so, use:

set_bearer()

and enter authorization credentials as below:
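The credentials end up as an entry in your .Renviron file along the following lines (a sketch; the token value is a placeholder, and TWITTER_BEARER is the environment variable the package reads):

TWITTER_BEARER=YOURBEARERTOKENHERE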

This will mean that the bearer token is automatically called during API calls. It also avoids the inadvisable practice of hard-coding authorization credentials into scripts.

See the vignette documentation vignette("academictwitteR-auth") for further information on obtaining a bearer token.

Collection

The workhorse function is get_all_tweets(), which is able to collect tweets matching a specific search query or all tweets by a specific set of users.

tweets <-
  get_all_tweets(
    query = "#BlackLivesMatter",
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2020-01-05T00:00:00Z",
    file = "blmtweets",
    data_path = "data/",
    n = 1000000
  )
  

Here, we are collecting tweets containing a hashtag related to the Black Lives Matter movement over the period January 1, 2020 to January 5, 2020.

We have also set an upper limit of one million tweets. When collecting large amounts of Twitter data, we recommend including a data_path and setting bind_tweets = FALSE so that data is stored as JSON files and can be bound at a later stage upon completion of the API query.
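For example, a minimal sketch of that storage-first pattern (the data/ directory and query are illustrative) might look like:

get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  data_path = "data/",
  bind_tweets = FALSE,
  n = 1000000
)

# Once collection is complete, bind the stored JSONs into a data.frame
tweets <- bind_tweets(data_path = "data/")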

tweets <-
  get_all_tweets(
    users = c("jack", "cbarrie"),
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2020-01-05T00:00:00Z",
    file = "blmtweets",
    n = 1000
  )
  

Whereas here we are not specifying a search query and instead are requesting all tweets by users @jack and @cbarrie over the period January 1, 2020 to January 5, 2020. Here, we set an upper limit of 1000 tweets.

The search query and user query arguments can be combined in a single API call as so:

get_all_tweets(
  query = "twitter",
  users = c("cbarrie", "jack"),
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-05-01T00:00:00Z",
  n = 1000
)

Here, we would be collecting tweets containing the word "twitter" by users @jack and @cbarrie over the period January 1, 2020 to May 1, 2020.

get_all_tweets(
  query = c("twitter", "social"),
  users = c("cbarrie", "jack"),
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-05-01T00:00:00Z",
  n = 1000
)

While here we are collecting tweets by users @jack and @cbarrie over the period January 1, 2020 to May 1, 2020 containing the words "twitter" or "social."

Note that the "AND" operator is implicit when specifying more than one character string in the query. See here for information on building queries for search tweets. Thus, when searching for all elements of a character string, a call may look like:

get_all_tweets(
  query = c("twitter social"),
  users = c("cbarrie", "jack"),
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-05-01T00:00:00Z",
  n = 1000
)

This will capture tweets containing both the words "twitter" and "social." The same logic applies to hashtag queries.

Whereas if we specify our query as separate elements of a character vector like this:

get_all_tweets(
  query = c("twitter", "social"),
  users = c("cbarrie", "jack"),
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-05-01T00:00:00Z",
  n = 1000
)

This will capture tweets by users @cbarrie or @jack containing the words "twitter" or "social."

Finally, we may wish to query an exact phrase. To do so, we can either wrap the phrase in escaped quotes, e.g., query = "\"Black Lives Matter\"", or we can use the optional parameter exact_phrase = T (in the development version) to search for tweets containing the exact phrase string:

tweets <-
  get_all_tweets(
    query = "Black Lives Matter",
    exact_phrase = T,
    start_tweets = "2021-01-04T00:00:00Z",
    end_tweets = "2021-01-04T00:45:00Z",
    n = Inf
  )
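The escaped-quotes alternative mentioned above would look like this (a sketch using the same dates):

tweets <-
  get_all_tweets(
    query = "\"Black Lives Matter\"",
    start_tweets = "2021-01-04T00:00:00Z",
    end_tweets = "2021-01-04T00:45:00Z",
    n = Inf
  )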

See the vignette documentation vignette("academictwitteR-build") for further information on building more complex API calls.

Data storage

Files are stored as JSON files in the specified directory when a data_path is specified. Tweet-level data is stored in files beginning "data_"; user-level data is stored in files beginning "users_".

If a filename is supplied, the functions will save the resulting tweet-level information as a .rds file.

Functions always return a data.frame object unless a data_path is specified and bind_tweets is set to FALSE. When collecting large amounts of data, we recommend using the data_path option with bind_tweets = FALSE. This mitigates potential data loss in case the query is interrupted.

See the vignette documentation vignette("academictwitteR-intro") for further information on data storage conventions.

Reformatting

Users can then use the bind_tweets convenience function to bundle the JSONs into a data.frame object for analysis in R, as follows:

tweets <- bind_tweets(data_path = "data/")
users <- bind_tweets(data_path = "data/", user = TRUE)

To bind JSONs into tidy format, users can also specify a tidy output format.

bind_tweets(data_path = "tweetdata", output_format = "tidy")

See the vignette documentation vignette("academictwitteR-tidy") for further information on alternative output formats.

Interruption and Continuation

The package offers two functions to deal with interruption and to continue a previous data collection session. If you have set a data_path and export_query was set to TRUE during the original collection, you can use resume_collection() to resume a previously interrupted collection session. An example would be:

resume_collection(data_path = "data")

If a previous data collection session is completed, you can use update_collection() to continue data collection with a new end date. This function is particularly useful for getting data for ongoing events. An example would be:

update_collection(data_path = "data", end_tweets = "2020-05-10T00:00:00Z")

Note on v2 Twitter API

For more information on the parameters and fields available from the v2 Twitter API endpoint see: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all.

Arguments

get_all_tweets() accepts a range of arguments, which can be combined to generate a more precise query.

Arguments Description
query Search query or queries e.g. "cat"
exact_phrase If TRUE, only tweets matching the exact phrase will be returned
users string or character vector, user handles to collect tweets from the specified users
reply_to string or character vector, user handles to collect replies to the specified users
retweets_of string or character vector, user handles to collect retweets of tweets by the specified users
exclude string or character vector, tweets containing the keyword(s) will be excluded
is_retweet If TRUE, only retweets will be returned; if FALSE, retweets will not be returned, only tweets will be returned; if NULL, both retweets and tweets will be returned.
is_reply If TRUE, only reply tweets will be returned
is_quote If TRUE, only quote tweets will be returned
is_verified If TRUE, only tweets whose authors are verified by Twitter will be returned
remove_promoted If TRUE, tweets created for promotion only on ads.twitter.com are removed
has_hashtags If TRUE, only tweets containing hashtags will be returned
has_cashtags If TRUE, only tweets containing cashtags will be returned
has_links If TRUE, only tweets containing links and media will be returned
has_mentions If TRUE, only tweets containing mentions will be returned
has_media If TRUE, only tweets containing a recognized media object, such as a photo, GIF, or video, as determined by Twitter will be returned
has_images If TRUE, only tweets containing a recognized URL to an image will be returned
has_videos If TRUE, only tweets containing native Twitter videos, uploaded directly to Twitter, will be returned
has_geo If TRUE, only tweets containing Tweet-specific geolocation data provided by the Twitter user will be returned
place Name of place e.g. "London"
country Name of country as ISO alpha-2 code e.g. "GB"
point_radius A vector giving a point's coordinates (longitude, latitude) and a point radius distance (in miles)
bbox A vector of four bounding box coordinates from west longitude to north latitude
lang A single BCP 47 language identifier e.g. "fr"
url string, return tweets containing specified url
conversation_id string, return tweets that share the specified conversation ID
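As an illustration, several of these arguments can be combined in a single call; the query, dates, and filter values below are purely illustrative:

tweets <-
  get_all_tweets(
    query = "vaccine",
    start_tweets = "2021-01-01T00:00:00Z",
    end_tweets = "2021-01-03T00:00:00Z",
    lang = "en",
    country = "GB",
    is_retweet = FALSE,
    has_media = TRUE,
    n = 500
  )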

Batch Compliance

There are three functions to work with Twitter's Batch Compliance endpoints: create_compliance_job() creates a new compliance job and uploads the dataset; list_compliance_jobs() lists all created jobs and their job status; get_compliance_result() downloads the result.
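A rough sketch of that workflow is below. The argument names (x, type, id), the tweet_ids vector, and the job ID are assumptions made for illustration; check the function documentation for the exact formals:

# tweet_ids is assumed to be a character vector of tweet IDs to check
create_compliance_job(x = tweet_ids, type = "tweets")

# List submitted jobs and their status; note the ID of the completed job
list_compliance_jobs()

# Once the job has completed, download the result using its job ID (placeholder shown)
results <- get_compliance_result(id = "1234567890")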

Acknowledgements

Function originally inspired by Gist from https://github.com/schochastics.

Code of Conduct

Please note that the academictwitteR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

academictwitteR's People

Contributors

chainsawriot, cjbarrie, justinchuntingho, medewitt, noeliarico, t-davidson, timbmk


academictwitteR's Issues

Random sample of tweets function

Feature:
The current Academic Research Product Track endpoint fetches all tweets corresponding to a particular query over a specified date range. For some applications, we'd like just a smaller random subsample over the date range in question.

Solution
A function that generates a random sequence of datetimes over the full date range specified and then fetches tweets corresponding to the particular query within those random datetime intervals.
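A rough sketch of the idea using only the existing get_all_tweets() function; the number of windows, window length, query, and date range are all illustrative:

set.seed(123)
start <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2020-06-01 00:00:00", tz = "UTC")

# Draw 20 random 10-minute windows within the date range
window_starts <- sort(start + runif(20, 0, as.numeric(difftime(end, start, units = "secs"))))

sampled <- lapply(window_starts, function(t) {
  get_all_tweets(
    query = "#BlackLivesMatter",
    start_tweets = format(t, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    end_tweets = format(t + 600, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    n = 100
  )
})

tweets_sample <- dplyr::bind_rows(sampled)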

All wrappers to `get_all_tweets` do not pass all parameters to `get_all_tweets`

Describe the bug

All get_* functions are untested. The parameter verbose behaves inconsistently among these functions. Besides, should warnings be silenced when verbose is FALSE?

To Reproduce

require(academictwitteR)
#> Loading required package: academictwitteR

bt <- "AAAAAAAAAYOURBEARERTOKENHERE"

data_dir <- paste0(tempdir(), "/", paste0(sample(letters, 12), collapse = ""))

## There is still one message
get_all_tweets("#commtwitter", start_tweets = "2021-06-01T00:00:00Z", 
               end_tweets = "2021-06-05T00:00:00Z", bind_tweets = FALSE, 
               data_path = data_dir, bearer_token = bt, verbose = FALSE)
#> Data stored as JSONs: use bind_tweets_json function to bundle into data.frame
unlink(data_dir)

## Should warnings be silenced?
z <- get_all_tweets("#commtwitter", start_tweets = "2021-06-01T00:00:00Z", 
               end_tweets = "2021-06-05T00:00:00Z", bind_tweets = TRUE, 
               data_path = NULL, bearer_token = bt, verbose = FALSE)
#> Warning: Recommended to specify a data path in order to mitigate data loss when
#> ingesting large amounts of data.
#> Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
#> available in local memory if assigned to an object.


data_dir <- paste0(tempdir(), "/", paste0(sample(letters, 12), collapse = ""))

## verbose is not honoured (actually, bind_tweets is not honoured as well)
x <- get_user_tweets("cbarrie", start_tweets = "2021-06-01T00:00:00Z", 
                end_tweets = "2021-06-05T00:00:00Z", bearer_token = bt,
                data_path = data_dir, verbose = FALSE, bind_tweets = FALSE)
#> Warning: Recommended to specify a data path in order to mitigate data loss when
#> ingesting large amounts of data.

#> Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
#> available in local memory if assigned to an object.
#> query: <(from:cbarrie)>: (tweets captured this page: 4). Total pages queried: 1. Total tweets ingested: 4. 
#> This is the last page for (from:cbarrie) : finishing collection.
## There is still a data frame; but this is a separate issue
class(x)
#> [1] "data.frame"
unlink(data_dir)

Created on 2021-06-07 by the reprex package (v2.0.0)

Expected behavior

Consistent behavior when verbose is FALSE; preferably no warnings too (probably only experts will need to set verbose to FALSE).

Waiting Time in get_user_tweets function

Dear community,

I've noted the following behavior when pulling tweets through the get_user_tweets function for a number of users > 1000.
Within each user everything is fine and it goes fast, as indicated by the sys.sleep of 3.1. However, once I pass from user[i] to user[i+1] the progress bar takes several minutes. Is this an intended behavior?

Best,
Christoph

Incorporate build_query into build_user_query

Currently get_all_tweets() handles additional tweet parameters by passing these to build_query() function inside the get_all_tweets() function itself.

The get_user_tweets() function does not behave in the same way and does not permit building additional query parameters within the function.

This is related to #110 . Either we need to resurrect build_user_query() and properly incorporate it or we need to make build_query compatible with building user queries.

Entire Tweet, Location, and Language in "get_all_tweets"

Entire Tweet, Location, and Language
The tweets retrieved by get_all_tweets are not complete: the text only covers up to the point where the query keywords appear.
The country is specified but is not reflected in the retrieved data, and the languages are mixed.

To Reproduce
Steps to reproduce the behavior:

  1. devtools::install_github("cjbarrie/academictwitteR", build_vignettes = TRUE)
     bearer_token <- "##########"
  2. Then load the package
  3. Then run the following code:
     tweets <- get_all_tweets("Autonomous Driving OR self driving cars OR autopilot",
                              "2020-06-01T00:00:00Z",
                              "2020-07-01T00:00:00Z", is_retweet = FALSE, lang = "en", country = "US",
                              bearer_token, data_path = "data/", bind_tweets = FALSE)
  4. See error
     I have attached screenshots of the data. It is not an error, but I am not able to access the required information that was available with rtweet for the standard package.

Expected behavior
I expected that I would get all the words of the tweets, as I have to do a sentiment analysis. Also, the language is not specific and the location cannot be confirmed. If the instruction to return only English is not honoured, then I guess the country variable is also not considered.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: iOS
  • Version 10.15.6

Additional context
Upon inspection I found that the full tweets are still accessible if there is only one keyword. If there is more than one keyword separated with "OR", then none of the mentioned parameters are fulfilled and the tweets are not complete. The text stops a few words after the point where the query has been matched.
(Four screenshots attached.)

NOTE: The screenshots do not contain the entire data. Only enough to show the problem

cannot bind json


Contributor Code of Conduct

Is your feature request related to a problem? Please describe.

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

I don't see a contributor code of conduct on the README page or in the vignettes.

Describe the solution you'd like

Clear guidelines for those wishing to contribute to the package.

Describe alternatives you've considered

The usethis package has a pretty vanilla contributor code of conduct that represents a good starting place.
You can test it out using the usethis package.

Check out:

usethis::use_code_of_conduct()

Additional context

Part of JOSS review see openjournals/joss-reviews#3272

Total tweets ingested stops printing [Minor bug]

Describe the bug
I'm using get_user_tweets to collect data from a set of users. The code prints out information on the queries, including a field titled "Total tweets ingested". At some point in the data collection this value stops being printed. This does not appear to affect either "tweets captured this page" or "Total pages queried". This does not appear to have any effect on the data collection process itself.

This example shows the print out and how it changes from pages X, Y, and Z (pseudonyms).

query: <from:X>: (tweets captured this page: 500). Total pages queried: 529. Total tweets ingested: 258458. 
query: <from:X>: (tweets captured this page: 500). Total pages queried: 530. Total tweets ingested: 258958. 
query: <from:X>: (tweets captured this page: ). Total pages queried: 531. Total tweets ingested: . 
next_token is now NULL for X  moving to next account 
query: <from:Y>: (tweets captured this page: 2). Total pages queried: 532. Total tweets ingested: . 
next_token is now NULL for Y moving to next account 
query: <from:Z>: (tweets captured this page: 500). Total pages queried: 533. Total tweets ingested: . 
query: <from:Z>: (tweets captured this page: 500). Total pages queried: 534. Total tweets ingested: . 

To Reproduce
Steps to reproduce the behavior:
Run get_user_tweets for a large number of accounts over a large timeframe. The problem seems to occur after a couple of hundred thousand tweets have been processed. I'm not sure whether this also affects other functions.

Expected behavior
The total number of tweets ingested should continue to print until the process is finished.

Desktop (please complete the following information):

  • iOs, Macbook Pro
  • RStudio and latest version of the academictwitteR from CRAN.

resume_collection error

I'm having issues resuming collection when the collection gets interrupted. When I try to resume the collection only 1 page is queried with no tweets returned. Oddly enough, bind_tweet_jsons works fine and returns the correct number of tweets, but the tweets themselves span the entire query from 1/1/2019 - 12/31/2019. Not sure where I'm going wrong.

(screenshot attached)

Here's my query:
tweets_19_20 <-
get_all_tweets(
tagquery,
"2019-01-01T00:00:00Z",
"2020-01-01T00:00:00Z",
bearer_token,
lang = "en",
is_retweet = FALSE,
is_quote = FALSE,
file = "tweets_19_20",
data_path = "filepath",
bind_tweets = FALSE
)

Here's the resume_collection statement:
resume_collection(data_path = "filepath", bearer_token)

Note: I've substituted the actual data_path with "filepath". Any advice?

retweets are cut off mid sentence

Hi,
when using your package to scrape tweets, it seems like retweets are cut off in the middle of the sentence and the text just ends with "...", although when checking on the profile that retweeted something the full text is displayed. Everything else works fine and original tweets are fully displayed. This seems to be the case when using get_user_tweets as well as get_hashtag_tweets.

N of tweets to be fetched

Feature: Allow n to be specified so that the function doesn’t automatically fetch all tweets possible.

Solution: When upper limit is reached after x number of pagination runs, the function would break.

get_user_tweets timestamp issue

The timestamps returned by get_user_tweets are incorrect and do not match the real creation time of the tweets that are pulled.

Here is the code I used:

News_Traffic_2008_2021 <-
get_user_tweets(
"NEWS1130Traffic",
"2008-01-01T00:00:00Z",
"2021-3-31T11:59:59Z",
bearer_token,
file = "News_Traffic_2010-2015"
)

The API seems to be returning questionable timestamps (as shown in the screenshot below). I confirmed that these tweets were posted at different times.

For example, this tweet’s timestamp on Twitter (2018-12-09 03:27:00) does not match the timestamp returned with the API ( 2018-12-09 08:00:00).

(screenshot attached)

"get_all_tweets" coercing environment

Hello! Thank you so much for making this package! It is so useful!

I haven't been able to get the "get_all_tweets" function to work, however. This is my code:

ccs_query <- c("carbon capture", "carbon-capture", "#carboncapture")

ccs_tweets <- get_all_tweets(
ccsquery,
"2020-05-19T00:00:00Z",
"2021-05-18T00:00:00Z",
Bearer_Token
)

But I keep getting this error message:

Error in as.vector(x, "character") :
cannot coerce type 'environment' to vector of type 'character'

I am not sure what this is trying to coerce. Any ideas of what could be going on?

Collision between multiple fetches

Describe the bug
If someone fetches data without deleting the data folder afterwards, and then fetches data again with another query/username, the current parse data function will return everything in the folder (old and new data).

Expected behavior
Parse and return only the new data.

Error Merging JSON Files

Can't merge saved JSON files.
Error report: "Error: Argument 3 can't be a list containing data frames" when running loop starting with "for (i in seq_along(files)) {".
Error stems from the line: "df.all <- bind_rows(df.all, df)"

get_radius_tweets confusion

Hello,

I am using the get_radius_tweets command and am getting tweets that aren't in my specified radius. For example, see the code below:

test1 <- get_radius_tweets(
  c("#proudboys", "#pyob", "proud boys"),
  radius = c(-121.4962, 38.5799, 10), # Sacramento city center
  start_tweets = "2020-11-03T00:00:00Z",
  end_tweets = "2020-11-10T10:00:00Z",
  bearer_token = bearer_token,
  data_path = "data/"
)

When I inspect the geo-coordinates, most are blank. Those that are not blank are nowhere near Sacramento. For example, one tweet has the following geo-coordinates: c(-70.03333333, 12.51666667). Am I doing something wrong?

Thanks!

Error code 400 - no specification

I ran the get_hashtag_tweets code and received a code 400 error message but I cannot find what issue that refers to.

bearer_token <- "redacted" # Insert bearer token

get_hashtag_tweets("#AmyConeyBarrett", "2020-10-19T00:00:00Z", "2020-19-29T00:00:00Z", bearer_token)
Error in get_tweets(q = query, n = 500, start_time = start_tweets, end_time = end_tweets, :
something went wrong. Status code: 400

Example

Hello

Is it possible that you upload a more detailed example of the function performance?

I have been trying to scrape tweets for the #BLM hashtag but I am getting weird results. I don't know how the function scrapes the tweets, but in some samples I was getting tweets coming from a small number of users with a lot of retweets. Does the function take random tweets in that interval? Is there a way to control for the country we are interested in?

Also, for the JSON files I can't really find a way to convert them to data frames properly.

Thanks a lot in advance!

429 Error - get_user_tweets

Hi Christopher Barrie and Justin Chun-ting Ho,

First of all, thank you for providing this package! I just used your package to download tweets from a large list of accounts (502 accounts separated into ~100 accounts per list to reduce the number of tweets per request).

However, when downloading tweets from one list (it doesn't matter which list I specify), it stops after iterating through approx. 30 accounts. The error message is "Error in get_tweets(q = query, n = 500, start_time = start_tweets, end_time = end_tweets, : something went wrong. Status code: 429"

The code 429 refers to "too many requests", even though my limit isn't reached at all. So if the maximum amount isn't reached, the problem is probably due to the rate at which I request data. Can I somehow specify e.g. "RetryonLimit = TRUE", or do you have a workaround for this problem?

Many Thanks in advance and keep up the awesome work!

Kind regards

Juilian

What's the best way to get tidy data or parse JSON output?

Is there a simple way to get tidy data or parse the JSON output from functions in this package? I may be missing something here but all methods that work with other Twitter-R packages aren't working here. I'm resorting to the following code:

json_file <- rjson::fromJSON(file = "./data/data.json")
df <- stringi::stri_list2matrix(json_file, byrow = TRUE, fill = "") %>% data.frame()

Error in if (httr::headers(r)$`x-rate-limit-remaining` == "1") { : argument is of length zero

Describe the bug
I was trying to build an archive of a huge number of tweets for two hashtags (ca. 700,000) from 2013 onwards, and I first ran into memory problems (Ubuntu 18.04, RStudio Server droplet on DigitalOcean). After checking the memory during the construction of the RDS file, I saw that I needed to increase the memory. Running the same query with double the memory (2 GB) led to the same problem. Now I have tried to query only one hashtag (ca. 300,000 tweets) and the memory seems to be OK. But now I get the following error message:

Error in if (httr::headers(r)$x-rate-limit-remaining == "1") { : argument is of length zero

Expected behavior
Produce an .RDS file.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Browser: Safari
  • Version 14.0.3

Failed to install 'academictwitteR' from GitHub

Describe the bug
(screenshot of the error attached)

To Reproduce
Steps to reproduce the behavior:

  1. Open a new RStudio session
  2. Type devtools::install_github("cjbarrie/academictwitteR", build_vignettes = TRUE) in the console, as explained in https://github.com/cjbarrie/academictwitteR/blob/master/README.md#installation
  3. Press "Enter"
  4. See error

Expected behavior
I was expecting that the package would be installed.

Screenshots
See above.

Desktop (please complete the following information):
(screenshot attached)

Issue with bearer_token error message

I installed the development package with devtools::install_github("cjbarrie/academictwitteR")
and loaded the library by running library("academictwitteR"). After that I ran the following commands:
bearer_token <- "long sequence of numbers and letters" (it ran with no error and got registered in Values)
boycott_tweets <- get_hashtag_tweets("#boycott", bearer_token) and received the following:

Error in get_tweets(q = query, n = 500, start_time = start_tweets, end_time = end_tweets, :
bearer token must be specified.

I had specified it already. However, I went to my Twitter developer project and revoked the bearer token.
Having changed the bearer token in my code chunk, I ran it again, and bearer_token was replaced with the new one in Values.
I now run boycott_tweets <- get_hashtag_tweets("#boycott", bearer_token) but it feeds me back the same error as above.

Desktop:

  • OS: iOS Big Sur Version 11.2.2 (20D80)
  • Browser safari Version 14.0.3 (16610.4.3.1.4)

CRAN Submission TODO

  • Update vignettes (if necessary)
  • Run devtools::check_win_release() and devtools::check_win_devel()
  • Update cran-comments.md
  • Submit using devtools::release()

Enhancements TODO

The following list outlines multiple enhancements, with check box lists to record progress. The name in brackets indicates the dev responsible for each enhancement/change

  • Deprecate get_hashtag_tweets and use the new function get_all_tweets as the foundation (because hashtag queries are recognized as such with "#" before the string query, so they can be incorporated into get_all_tweets, which takes both string queries and hashtags). (CHRIS TO DO)

  • In the get_all_tweets and get_user_tweets functions, the line saveRDS(df.all, file = file) should be changed to saveRDS(df.all, file = paste0(file, ".rds")) so that the user only needs to name the .rds file (rather than e.g. specify file = "tweets.rds") (CHRIS TO DO)

  • Add functionality or at least a vignette explainer on how to combine multiple queries e.g. tweets <- get_all_tweets("(happy OR happiness) place_country:GB place:Manchester -birthday -is:retweet", "2021-01-01T00:00:00Z", "2021-01-01T01:00:00Z", bearer_token) (CHRIS TO DO)

  • Documentation: add vignette describing the thinking behind how the package delivers data. Is done with large-scale data collection in mind. So encourages data storage as JSONs on the fly, in named folders. But also has options to deliver dataframes etc. Discussion of file naming conventions and data delivery should come first in the vignette explainer. (CHRIS TO DO)

  • Add OR and AND options for hashtag and string queries; i.e., to allow the user to request tweets containing both strings or hashtags, or either string or hashtag. Details of AND and OR logic here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. AND is the default when multiple strings are specified. Can add an OR option by e.g. taking strings and, when OR is TRUE, calling paste(strings, collapse = " OR "). GROUPING logic could also be incorporated. (CHRIS TO DO)

  • The main tweet retrieval functions only return results as data frames when they complete. Data loss is, of course, mitigated, by specifying data path field to store data as we go, but can we also return a dataframe object or .rds file as well if the function is interrupted or fails? (JUSTIN TO DO)

    • Another option here is to make a new argument e.g. bind_json = T in the main get_all_tweet function (and associated functions), which would mean JSONs are only bound if the user specified this. If they specify bind_json=F then JSONs will just be generated and not bound into a dataframe. Can then use the convenience binding functions listed below to bind when wanted. Added as bind_tweets
  • Add a bind JSON convenience function for user and tweet-level information. Useful if stored only as JSONs and user wants to bind into a dataframe after retrieval, or if the data collection is interrupted. (JUSTIN TO DO)

  • Add a "get last tweet" convenience function. Useful if data collection is interrupted and they want to recommence retrieval from the last tweet (or user) collected. (JUSTIN TO DO)

  • Include standalone functions for has: conjunction-required operators. See https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query for details. This can be achieved by repurposing the get_all_tweets workhorse functions and adding e.g. q = paste("has:media" ,query). This can be done by creating separate functions for:

    • has:media (get_media_tweets)
    • has:mentions (get_mentions_tweets)
    • has:images (get_images_tweets)
    • has:videos (get_videos_tweets)
    • has:geo (get_geo_tweets)
  • Integrate conjunction-required operators (i.e., from "is:retweet" downwards in this list: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#types) into the main get_all_tweets function as arguments. Can be achieved by series of ifelse() calls e.g., imgq <- ifelse(hasimage==T, "has:image ", NULL) then paste to query with q <- paste0(imgq, query). Other functions for e.g., get_media_tweets can remain as "shortcut" functions. (JUSTIN TODO)

  • Include standalone functions for

    • url: (get_url_tweets) TODO: fix the paste() call to get e.g. 'url: "url.com"' properly specified
    • to: (get_to_tweets)
    • retweets_of: (get_retweets_of_user)
  • Add functionality for:

    • place: (get_place_tweets)
    • place_country: (get_country_tweets)
    • point_radius: (get_radius_tweets)
    • bounding_box: (get_bb_tweets)
    • langs: (get_lang_tweets)
      • Include convenience function to look up languages?
        (CHRIS TO DO)
  • Add exclude retweets option with -is:retweet (JUSTIN TO DO)

  • Add exclude promoted tweets option with -is:nullcast (JUSTIN TO DO)

  • Allow negation operators (i.e., when searching for strings but not some strings: e.g., query = cat #meme -grumpy) (JUSTIN TO DO)

  • Add bio: and bio_name and has:profile_geo user search functionality to user search function. (JUSTIN TO DO)

  • Add sample: as an option to get_all_tweets function. This gives a % random sample (e.g., sample:10 gives 10% sample) rather than all tweets. Useful if retrieved data is likely to be large. Or if testing queries. Details here: https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule. (JUSTIN TO DO) NOT POSSIBLE WITH FULL ARCHIVE SEARCH

  • Add context: operators to get tweets annotated by particular context. Details here: https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule.

    • Include also a convenience function that looks up the context and domain entity ids?
      (JUSTIN TO DO)
      NOT POSSIBLE WITH FULL ARCHIVE SEARCH
  • Add Zenodo release and obtain DOI. (CHRIS TO DO)

  • Add functionality for querying conversation_id field NOT POSSIBLE WITH FULL ARCHIVE SEARCH

  • Add max. number of tweets argument

  • Add user handle to data.frame automatically

  • Incorporate build_user_query into get_user_tweets function

  • Rate limiting handling: manual specification of sleep between calls and/or trycatch addition?

  • Get data in tidy format?

  • Fix RTs getting cut off in main tweet_ data

  • Add random sample function (see https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255/11 for how to discussion)

Utilize environment variables instead of writing files

Rather than writing the token to disk in an obfuscated directory, it would be preferable to write to the environment or read from an .Renviron file. See https://httr.r-lib.org/articles/api-packages.html#authentication-1

For example.

set_bearer <- function(token) {
  pat <- Sys.getenv('TWITTER_BEARER')
  if (identical(pat, "")) {
    stop("Please set env var TWITTER_BEARER to your Academic Twitter API bearer token",
         call. = FALSE)
  }

  pat
}


get_bearer <- function() {
    Sys.getenv("TWITTER_BEARER")
}

You could utilize an .Renviron file so that set_bearer() doesn't need to be used, e.g.

TWITTER_BEARER=my-access-token-here

#' Manage your bearer token

"get_all_tweets" doesn't work for me

Describe the bug
Thanks a lot for providing this package.
I was replicating the following code:

# Insert bearer token

bearer_token<-"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

tweets <- get_all_tweets("apples OR oranges","2020-01-01T00:00:00Z","2020-01-05T00:00:00Z",bearer_token)

But I got the following error message:

Error: Argument 1 can't be a list containing data frames

Can you please explain why I got this message?

Thanks,

Best.

Unable to download the program

I am simply trying to download the package, but RStudio keeps giving me the error that I have copied and pasted below:

Expected behavior
I expected the package to download from GitHub.

What I am running:

  • OS: 10.12.6
  • RStudio is up to date as well; version 1.4.1103

Error Code:

devtools::install_github("cjbarrie/academictwitteR")
Downloading GitHub repo cjbarrie/academictwitteR@HEAD
✓ checking for file ‘/private/var/folders/3w/1ym4cc9j2qb4mp9mhxv261kr0000gn/T/Rtmp2IIXF5/remotes22539d23b54/cjbarrie-academictwitteR-aedd5fa/DESCRIPTION’ (550ms)
─ preparing ‘academictwitteR’:
✓ checking DESCRIPTION meta-information ...
Warning in file(con, "r") :
cannot open file '/var/db/timezone/zoneinfo/+VERSION': No such file or directory
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘academictwitteR_0.0.0.9000.tar.gz’

dyld: lazy symbol binding failed: Symbol not found: _utimensat
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _utimensat
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libR.dylib (which was built for Mac OS X 10.13)
Expected in: /usr/lib/libSystem.B.dylib

/Library/Frameworks/R.framework/Resources/bin/INSTALL: line 34: 641 Done echo 'tools:::.install_packages()'
642 Abort trap: 6 | R_DEFAULT_PACKAGES= LC_COLLATE=C "${R_HOME}/bin/R" $myArgs --no-echo --args ${args}
Error: Failed to install 'academictwitteR' from GitHub:
(converted from warning) installation of package ‘/var/folders/3w/1ym4cc9j2qb4mp9mhxv261kr0000gn/T//Rtmp2IIXF5/file22519352574/academictwitteR_0.0.0.9000.tar.gz’ had non-zero exit status

Issue #get_hashtag_tweets(data = ) returned rds/dataframe

Describe the bug
The execution of get_hashtag_tweets returns a series of independent JSON files. Supplying a filename to the data argument returns no RDS file. I attempted it with .rds and without. Does jsonlite need loading in order to access the dataframe/tibble? The README documentation shows that, after authentication, the file returned from the v2 API endpoint results in a data frame with viewable headings.

To Reproduce
Steps to reproduce the behavior:
get_hashtag_tweets("#Blacktwitter",
start_tweets = "2021-01-06T00:00:00Z",
end_tweets = "2021-01-13T23:59:59Z",
bearer_token = bear_token,
data_path = "Desktop/data/bti.rds")

bti <- get_hashtag_tweets("#Insurrection",
start_tweets = "2021-01-06T00:00:00Z",
end_tweets = "2021-01-13T23:59:59Z",
bearer_token = bear_token,
data_path = "Desktop/data/bti")

Expected behavior
I expected a data frame or tibble to use within the rtweet package or the tidytext package to extract data. Do the JSON files stored in the data folder require further processing with a different package or function in the RStudio environment? I even tried to save the get_hashtag_tweets result into an object. That option resulted in a request timeout error.

Desktop (please complete the following information):
OS: iOS Big Sur Version 11.2.2
Browser safari Version 14.0.3

Function not found

After loading the devtools library and running the example script with my bearer token, I get

could not find function "get_hashtag_tweets"

Is there something I'm missing?

Incorporate username into main data_* payload

At present, the payload returned as data_* JSONs (and bound into a data.frame when bind_tweets = F) does not contain the username (username) of the user, just the user ID (author_id).

The username is contained in the secondary users_* JSON files. Other libraries, e.g. here seem to contain solutions to this, by flattening the files returned and re-incorporating the username field.

A potentially slow solution would be to:

  • Call bind_user_jsons() on the users_* JSONs, filter by id (which refers to the author_id in the data_* JSONs).
  • Rename id to author_id
  • Get unique usernames alongside author_id
  • Merge with tweet-level data.frame by author_id

This oddity of how the payload is returned may soon be fixed by Twitter. Rather than including an additional argument to the main get tweets functions to get usernames, the best approach may be to create a convenience function to recover usernames.

This could just be called join_usernames() or similar. The function would take the unique author_ids from the tweet-level data, bind the users_* JSONs, rename id to author_id, get usernames alongside author_ids by filtering on author_id, and merge back into the tweets data.frame.

So the function would just require two formals, something like:

join_usernames(tweetsdf, data_path = "data/")

, where tweetsdf is the data.frame object of returned tweets and the data_path just points to where the users_* JSONs are stored.
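A minimal sketch of such a function, assuming bind_user_jsons() returns a data.frame with id and username columns; the implementation below is illustrative, not part of the package:

join_usernames <- function(tweetsdf, data_path = "data/") {
  users <- bind_user_jsons(data_path)                      # bind users_* JSONs
  lookup <- unique(data.frame(author_id = users$id,
                              username  = users$username)) # one row per author_id
  merge(tweetsdf, lookup, by = "author_id", all.x = TRUE)  # left join back onto tweets
}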

get_all_tweets: error if no tweets are returned

get_all_tweets gives an error and stops if no tweets are retrieved for a specific time window:

Tweets will be bound in local memory as well as stored as JSONs.Directory already exists. Existing JSON files may be parsed and returned, choose a new path if this is not intended.query: <#Figliuolo -is:retweet lang:it>: (tweets captured this page: ). Total pages queried: 1. Total tweets ingested: .
Error in if (ntweets > n) { : argument is of length zero

To Reproduce
get_all_tweets("#Figliuolo",
               "2020-12-01T00:00:00Z",
               "2020-12-31T00:00:00Z",
               bearer_token, n = 100000, data_path = newDir, lang = "it")

I'm using the github version 0.2.0

Cannot find function 'get_all_tweets"


TODO

The following list outlines multiple enhancements, with check box lists to record progress.

  • Add unit testing for authorization-required functions
  • Update package citation details
  • Ensure documentation and vignettes consistent esp. with new get_bearer() and set_bearer() functions
  • The main tweet retrieval functions only return results as data frames when they complete. Data loss is mitigated by specifying data path field to store data as we go, but can we also return a dataframe object or .rds file as well if the function is interrupted or fails?
  • Fix RTs getting cut off in main tweet_ data. Requires re-incorporating full RT text by re-querying these tweet IDs? Could just add an additional convenience function.
  • Add functionality for querying conversation_id field
  • Add user handle to data frame automatically (currently stored in user_* return)
  • Incorporate build_user_query into get_user_tweets function
  • Improve rate limiting handling: manual specification of sleep between calls and/or trycatch addition?
  • Add random sample function (see https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255/11 for how to discussion)

bind_user_jsons() cannot resolve data_path without a trailing slash

Describe the bug
bind_user_jsons is untested. It doesn't behave like bind_tweet_jsons, which can handle a data_path without a trailing slash.

To Reproduce

require(academictwitteR)
#> Loading required package: academictwitteR

empty_dir <- tempdir()
empty_dir
#> [1] "/tmp/RtmpPAJNq4"
my_cars <- mtcars
my_cars$model <- rownames(my_cars)

jsonlite::write_json(my_cars, 
                     path = file.path(empty_dir, "data_1.json"))
jsonlite::write_json(my_cars, 
                     path = file.path(empty_dir, "data_2.json"))
jsonlite::write_json(my_cars, 
                     path = file.path(empty_dir, "users_1.json"))
jsonlite::write_json(my_cars, 
                     path = file.path(empty_dir, "users_2.json"))
x <- bind_tweet_jsons(empty_dir)
#> ================================================================================
x <- bind_tweet_jsons(paste0(empty_dir, "/"))
#> ================================================================================

x <- bind_user_jsons(empty_dir)
#> Warning in open.connection(con, "rb"): cannot open file '/tmp/
#> RtmpPAJNq4users_1.json': No such file or directory
#> Error in open.connection(con, "rb"): cannot open the connection
x <- bind_user_jsons(paste0(empty_dir, "/"))
#> ================================================================================

Created on 2021-06-06 by the reprex package (v2.0.0)

Expected behavior

bind_user_jsons() should be able to resolve a data_path without a trailing slash.

Continue previous query

Issue: If the system crashes amidst a search, there is currently no easy way to continue the data download without changing the query. It is possible to continue by reading the existing tweets and finding the last one fetched by date/tweet id. It would be great to have a function to do that automatically.

Additional use case: The function could also be used for Twitter data monitoring/long-period data collection. For example, it could be used to set up a loop to search for new tweets every day, as sketched below. This could be particularly useful for tweets that have a tendency to be removed (e.g. bots, false information, other censored content, etc.).
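A sketch of that kind of monitoring loop, assuming an initial collection already exists under "data/" with export_query = TRUE; the daily schedule and path are illustrative:

repeat {
  update_collection(
    data_path = "data",
    end_tweets = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  )
  Sys.sleep(60 * 60 * 24)  # wait a day before collecting newly posted tweets
}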

resurrect build_user_query or permit user query building in build_query

The function build_user_query() in v0.1.0 allowed users to build user queries, taking a vector of usernames and searching their tweets while e.g. removing RTs or specifying other parameters. This was deprecated in a recent PR.

For consistency, we need to resurrect build_user_query() or make it possible to take a vector of usernames with the build_query() function and append the additional tweet parameters to each username.

The expected behaviour (as it behaves in the current CRAN release) is as follows:

require(academictwitteR)
#> Loading required package: academictwitteR

bt <- "AAAAAAAAAYOURBEARERTOKENHERE"

users <- c("cbarrie", "justin_ct_ho")

users_params <-
  build_user_query(users,
                   is_retweet = F,
                   has_media = T,
                   lang = "en")

users_params
#> [1] "cbarrie -is:retweet has:media lang:en"     
#> [2] "justin_ct_ho -is:retweet has:media lang:en"

Continuous Integration Fails on R Versions < 3.4

This I learned: isFALSE was introduced in R 3.4.2. Right now in your DESCRIPTION file, you have R > 3 as the only restriction on the version of R users are required to have. In order to ensure this works for earlier versions of R, you would need to change the following isFALSE sections to !isTRUE, or the more robust !is.null(x) && !isTRUE(x):

if(isFALSE(is_retweet)) {

if(isFALSE(is_retweet)) {

The CRAN checks missed this because they only check 1 version past, current version, and devel versions of R.
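For illustration, the more robust form suggested above applied to the snippet shown (same variable name) would be:

if (!is.null(is_retweet) && !isTRUE(is_retweet)) {
  # ... handle the is_retweet = FALSE case without relying on isFALSE() ...
}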

openjournals/joss-reviews#3272

Error messaging when no date specified

When the date is not specified, you get a warning saying "bearer token must be specified"

e.g.: tweets <- get_all_tweets("has:images @cbarrie", , bearer_token)

Need to update the error message to say that the date is not specified.

Allow specified tweet and user fields

Feature: Allow user to fetch only specified tweet- and user-level fields

Solution: Add options that alter query parameters to include only specified user fields. Could be in the form of a series of additional functions, e.g. get_users and get_tweets, as well as more specific ones, e.g. get_user_metrics to just capture user following, listed, and tweet counts etc., or, for tweets, get_tweet_entities to just capture tweet annotations, mentions, hashtags etc.

rate limit error

Describe the bug

I got an error that seems to be related to the rate limit. I was using an lapply to apply get_hashtag_tweets to a list of search terms.

Error message:
Error in if (httr::headers(r)$x-rate-limit-remaining == "1") { :
argument is of length zero

To Reproduce
I don't know if I can reproduce it because I am not sure which rate limit I hit. After waiting a bit, I retried the search term that failed, and it worked after waiting.

Expected behavior
expected the search to return data without an error.

Automated Testing / Unit Testing Framework for JOSS

Is your feature request related to a problem? Please describe.
CRAN build checks appear to be the only automated testing implemented in the package. Additionally, many of the examples in the documentation are wrapped in "don't run" which means they are not checked on CRAN (which makes sense for CRAN checks, but not for ensuring unit tests).

Describe the solution you'd like

Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?

Ideally, continuous integration could be deployed to verify continuity of functionality with new pulls. This could be implemented through GitHub Actions. Additionally, automated tests could be written to verify that functions are working properly. This is challenging with API-related packages, but it could be done by minimally testing features that don't require an API key or, with the use of secrets, by testing some basic API features. An example is available in a forthcoming PR. Totally open to pushback on this point.

Describe alternatives you've considered
Testthat or tinytest both offer good options for unit testing.
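As a sketch of what a token-gated test could look like with testthat (the query, dates, and the TWITTER_BEARER check are illustrative; TWITTER_BEARER matches the environment variable used elsewhere in this repo):

library(testthat)

test_that("get_all_tweets returns a data.frame", {
  skip_if(Sys.getenv("TWITTER_BEARER") == "", "No bearer token available")
  out <- get_all_tweets(
    query = "#rstats",
    start_tweets = "2021-06-01T00:00:00Z",
    end_tweets = "2021-06-01T01:00:00Z",
    n = 10
  )
  expect_true(is.data.frame(out))
})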

Additional context
See openjournals/joss-reviews#3272

"AND logic" not working

Hi,
I tried the "OR logic" multiple times. Never encountered any issue. However, I tried the "AND logic" recently, and it is not working. I used the following command:

df <- get_all_tweets("mhealth AND children", "2021-01-05T00:00:00z", "2021-06-06T00:00:00z", lang = "en", bearer_token)

Note: The "OR logic" is working fine in the above command.

get_hashtag_tweets ERROR after loading for a while (Rstudio)

I'm doing research on Twitter sentiment from tweets about 10 different stocks, so I need to collect tweets and I'm using the academictwitteR package to collect them with get_hashtag_tweets. If I run this code it runs for a while, as I need tweets about 10 different stocks from 1/11/2019 until 1/03/2021. After running for a while, 500 per line (in the console), it just stops and gives me the 503 error, but every time I check, the servers of Twitter are online. Does anybody know how I can fix this?

PS: I'm using RStudio and I have an academic researcher account. I tried using the rtweet package but I can't use search_fullarchive as you need a premium or enterprise account. Attached is the error I get after running for a while.
(screenshot attached)

something went wrong. Status code: 403

Describe the bug
My code is simple-

install.packages("academictwitteR")
library(academictwitteR)

bearer_token <- "********************************************************"

tweets <-
  get_all_tweets(
    "#BLM OR #BlackLivesMatter",
    "2014-01-01T00:00:00Z",
    "2020-01-01T00:00:00Z",
    bearer_token,
    data_path = "data/",
    bind_tweets = FALSE
  )

Then I am getting this:

(screenshot of the error attached)

Can anyone please help?

To Reproduce
Steps to reproduce the behaviour:

  1. Open a new file.
  2. Write down the provided code and run.
  3. Get the error

Expected behavior
Should have got some values (tweets).
Screenshots
(screenshot attached)
