Comments (32)
import GetOldTweets3 as got
import pandas as pd
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, Query):
    '''
    Downloads all tweets from a certain month in three sessions in order to avoid sending too many requests.
    Date format = 'yyyy-mm-dd'.
    Query = string.
    '''
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    tenth = since + timedelta(days=10)
    twentieth = since + timedelta(days=20)

    print('starting first download')
    first = got.manager.TweetCriteria().setQuerySearch(Query).setSince(since.strftime('%Y-%m-%d')).setUntil(tenth.strftime('%Y-%m-%d'))
    firstdownload = got.manager.TweetManager.getTweets(first)
    firstlist = [[tweet.date, tweet.text] for tweet in firstdownload]
    df_1 = pd.DataFrame.from_records(firstlist, columns=["date", "tweet"])
    #df_1.to_csv("%s_1.csv" % SinceDate)
    time.sleep(600)

    print('starting second download')
    second = got.manager.TweetCriteria().setQuerySearch(Query).setSince(tenth.strftime('%Y-%m-%d')).setUntil(twentieth.strftime('%Y-%m-%d'))
    seconddownload = got.manager.TweetManager.getTweets(second)
    secondlist = [[tweet.date, tweet.text] for tweet in seconddownload]
    df_2 = pd.DataFrame.from_records(secondlist, columns=["date", "tweet"])
    #df_2.to_csv("%s_2.csv" % SinceDate)
    time.sleep(600)

    print('starting third download')
    third = got.manager.TweetCriteria().setQuerySearch(Query).setSince(twentieth.strftime('%Y-%m-%d')).setUntil(until.strftime('%Y-%m-%d'))
    thirddownload = got.manager.TweetManager.getTweets(third)
    thirdlist = [[tweet.date, tweet.text] for tweet in thirddownload]
    df_3 = pd.DataFrame.from_records(thirdlist, columns=["date", "tweet"])
    #df_3.to_csv("%s_3.csv" % SinceDate)

    df = pd.concat([df_1, df_2, df_3])
    df.to_csv("%s.csv" % SinceDate)
    return df
I wrote this function in order to download all tweets (matching a certain query) from a month in three sections, while sleeping in between requests. This allows me to download 20,000+ tweets in under an hour. Obviously, the function above takes SinceDate and adds 10 and 20 days, but you guys can change the format as needed for your own projects. Hope it's helpful!
#------
#Example:
#DownloadTweets('2019-01-01', '2019-01-31', 'klimaat')
from getoldtweets3.
I've been playing around with .SetMaxTweets and also tried using time.delay but still unable to collate a dataset bigger than like 10,000.
The same. I am not sure what happened. The problem is the query search I am trying to run definitely has more than 10,000 in a 24h period.
Can you search by the hour? that would be annoying but allow to still scrape the data for a whole day.
Since they are mimicking the advanced search on Twitter, the smallest time unit is a day -- you can see the options included in twitter's advanced search by checking out the website.
For ranges larger than a day, it is easy to loop through days and have it sleep. I couldn't find the download rate limit for Twitter's server. However, when I had it sleep 16 minutes between two days, it seemed to recover from the Too Many Requests error.
My problem is when I download a single day of tweets for a common word. I don't think it has much to do with the Advanced Search functionality anymore, which is a good thing. Similar to max tweets, there should be a way to cap the downloads while keeping the search alive -- maybe that's the issue in your case @lethalbeans? -- and a placeholder for resuming the download after sleep.
from getoldtweets3.
I'm still working so haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py
Kinda sloppy because it includes the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351 - 355).
from getoldtweets3.
Hey guys, playing around with time.sleep.
Does anyone know how to find out exactly how long I need to wait to retry?

Yes, you can get it! But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can get the time value (in seconds) from the "Retry-After" header of the response. Another way to handle it is to use some arbitrary value and increase it if you get the error again.

Ah okay, let me dig into that. One last thing mate - for the Retry-After response header, which file in the code should I be looking at?
Appreciate it and thank you!
No problem!
You should change GetOldTweets3/manager/TweetManager.py:
try:
    response = opener.open(url)
    #--- HERE ---#
    #response.status()
    #response.headers()
    jsonResponse = response.read()
except Exception as e:
    print("An error occured during an HTTP request:", str(e))
    print("Try to open in browser: https://twitter.com/search?q=%s&src=typd" % urllib.parse.quote(urlGetData))
    sys.exit()
You should try something along those lines. Take a look on the web for the correct code, but it should be something like that. Feel free to email me.
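Putting that advice together, here is a hedged sketch of a retry wrapper around the urllib call. The function names, the retry loop, and the 60-second fallback are my own assumptions, not part of GetOldTweets3:

```python
import time
import urllib.request
import urllib.error

def retry_wait(headers, default_wait=60):
    """Seconds to sleep, taken from the Retry-After header if present."""
    return int(headers.get("Retry-After", default_wait))

def open_with_retry(url, max_retries=3):
    """Open a URL, sleeping on HTTP 429 instead of exiting."""
    for attempt in range(max_retries):
        try:
            return urllib.request.urlopen(url)
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            wait = retry_wait(e.headers)
            print("Got 429, sleeping %d seconds before retrying" % wait)
            time.sleep(wait)
    raise RuntimeError("Still rate-limited after %d retries" % max_retries)
```

The idea is simply to replace the library's `sys.exit()` branch with this kind of loop so a 429 pauses the scrape rather than killing it.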
from getoldtweets3.
I am sadly having the same issue. It used to work great but not anymore. Any solution from the more experienced coders?
from getoldtweets3.
I'm having the same issue. Is there a way to make the code sleep in order to pace the number of requests?
from getoldtweets3.
Same issue here too, I couldn't find a feature that allows the scraping process to sleep.
from getoldtweets3.
Would using a proxy do the trick? If so, how? I tried to set a proxy in PyCharm and in the general settings but no luck.
from getoldtweets3.
I've been playing around with .SetMaxTweets and also tried using time.delay but still unable to collate a dataset bigger than like 10,000.
The same. I am not sure what happened. The problem is the query search I am trying to run definitely has more than 10,000 in a 24h period.
Can you search by the hour? that would be annoying but allow to still scrape the data for a whole day.
It's not even working with 10k now :/ I'm guessing Twitter's team have put in countermeasures.
from getoldtweets3.
I tried to alter the source code with time.delay but got a different error:
An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable
from getoldtweets3.
@elkalhor thanks I've managed to get 20k so far, aiming for 100k at least.
Yeah that would be good. The current code seems to use try catch exceptions, and ends the code abruptly if the json request is not fulfilled. I'm currently trying to implement ratelimit:
https://github.com/tomasbasham/ratelimit
from getoldtweets3.
I would propose changing the current exiting of the script to a unified exception. That way the user can decide to catch the exception and use any method to retry the request later on. Using sys.exit() without any error code seems like the worst way to handle it.
Besides that, the approach from @lethalbeans seems like a really good idea to me. Do you have progress on that?
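A minimal sketch of what that unified exception could look like; the names here are hypothetical, not from the library:

```python
class TweetManagerError(Exception):
    """Hypothetical exception the library could raise instead of calling sys.exit()."""
    def __init__(self, message, url=None):
        super().__init__(message)
        self.url = url

def fetch_with_retries(fetch, retries=3):
    """Caller-side retry loop, possible once the library raises instead of exiting."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except TweetManagerError as e:
            last_error = e
            print("Request failed (%s), attempt %d of %d" % (e, attempt + 1, retries))
    raise last_error
```

The point is that once the error surfaces as an exception, each user can pick their own retry, sleep, or resume strategy instead of having the process die.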
from getoldtweets3.
@ekalhor, this should do the trick
from getoldtweets3.
@lethalbeans did you get ratelimit to work?
from getoldtweets3.
Has anybody found a solution for this? I get the same error after 15,000 to 20,000 tweets.
from getoldtweets3.
I used ratelimit to solve the 429 problem, but I eventually got another error:
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
To use rate limit:
from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI...
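If the ratelimit package itself isn't behaving, the same throttling can be hand-rolled with the standard library. A minimal sketch; the 30-calls-per-minute figure just mirrors the snippet above and is not a documented Twitter limit, and `call_api` is a hypothetical stand-in for the real request:

```python
import time
from functools import wraps

def throttle(calls, period):
    """Allow at most `calls` invocations per `period` seconds, sleeping otherwise."""
    timestamps = []
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Forget calls that are older than the window.
            while timestamps and now - timestamps[0] >= period:
                timestamps.pop(0)
            if len(timestamps) >= calls:
                # Sleep until the oldest call leaves the window.
                wait = period - (now - timestamps[0])
                if wait > 0:
                    time.sleep(wait)
                timestamps.pop(0)
            timestamps.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical stand-in for got.manager.TweetManager.getTweets(criteria):
@throttle(calls=30, period=60)
def call_api(criteria):
    return criteria
```

Keeping the timestamp list outside the wrapper means the limit is shared across all calls to the decorated function, which is what you want for one scraper process.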
from getoldtweets3.
I tried it and still got the "too many requests" error. Could you please share your complete code? (I mean the callAPI function.)
from getoldtweets3.
I used ratelimit to solve the 429 problem, but I eventually got another error:
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
To use rate limit:
from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI...
Has anyone solved the certificate expired error or know what's causing it? I can't tell if it's a time limit-related issue or something else. I'm also getting the certificate expired error, but it has happened after 700 tweets, 4100 tweets, 0 tweets...
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>
from getoldtweets3.
Hi @libbyh @meixingdg,
I have been able to solve the SSL certificate issue with the following two lines at the beginning of the TweetManager.py file:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
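One caveat: `_create_unverified_context` turns certificate checking off entirely. If the "certificate has expired" error comes from a stale local trust store rather than from Twitter, pointing Python at a fresh CA bundle keeps verification on. A hedged sketch (the helper name is my own):

```python
import ssl

def make_https_context(cafile=None, verify=True):
    """Return an SSLContext for urllib-based requests.

    verify=False reproduces the workaround above (no certificate checks, so no
    protection against man-in-the-middle attacks). Passing a fresh CA bundle
    via `cafile` keeps verification on, which is usually the better fix when
    the expiry error comes from an outdated local trust store.
    """
    if not verify:
        return ssl._create_unverified_context()
    return ssl.create_default_context(cafile=cafile)
```

A current bundle can be supplied via `cafile`, for example the one shipped by the third-party certifi package.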
I used ratelimit to solve the 429 problem, but I eventually got another error:
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
To use rate limit:
from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI...
I'm trying to use ratelimit with the modifications you described here, but for now ratelimit does not seem to work on my side.
I'm still working so haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py
Kinda sloppy because it includes the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351 - 355).
Lines 277-279 and lines 351-355 are not related, right? Then are you sure ratelimit is working on your side? With only lines 351-355 and the time.sleep set to 15 minutes, I think the 429 issue is solved.
EDIT: Here is now a new error I sometimes get when downloading tweets: An error occured during an HTTP request: <urlopen error [Errno -3] Temporary failure in name resolution>
from getoldtweets3.
@meixingdg Did you figure it out? I get the 429 error after 10,000 tweets. I have tried retry and sleep but it didn't work.
from getoldtweets3.
Did anyone figure out a good solution? I have been trying to get tweets for a popular hashtag, say #coronavirus, and ran into the same issue. I have been limiting the search to each day, yet some days there are so many tweets that I run into this issue. I used to get the 429 error; now it has switched to a 503 error. As someone stated, each time I retry, the total number of tweets before the error is even smaller. I even went as small as 50, put in a sleep for a minute, then began gathering again. It still fails after about 2,000 tweets have been gathered. On the other extreme, I tried to get 10,000, then put it to sleep for 65 minutes. Still not much luck. Kinda stuck. Any thoughts or solutions?
from getoldtweets3.
Hi, has anybody found a solution to this yet? I am trying to download the tweets for a hot topic for a single day; however, any number of requests above 10,000 gives me the 429 error... @Clairedevries does your function solve this issue, and can it download hundreds of thousands of tweets for a single day?
from getoldtweets3.
Yes. I downloaded 312,000 tweets in a single day by running the function above. I ran the function 12 times, once for every month in 2014. You might not be able to use this exact function as it might not work for your code and was specifically made for my project, but it shows how you can run a function and have it sleep in between in order to avoid errors. It might not work if there are more than 100,000 tweets in a single day though.
from getoldtweets3.
I used the same function def DownloadTweets(SinceDate, UntilDate, Query) as well. In case there are too many tweets and you get an error message, break the dates down into smaller intervals and save the data to csv files in small increments.
The issue I find is that the geo information is not available. I am aware that not all tweets have geo information, but when downloading the data using the Twitter API we get a small % of tweets having the geocode info.
from getoldtweets3.
I noticed there is a buffer option in the library. By using it I could update a .csv file for every 10 tweets returned by the library. Even in the cases where I got some error, the number was satisfactory for me. Basically what I did was:

def partial_results(tweets):
    print(tweets.text)

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)
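To actually persist partial results instead of printing them, the receiveBuffer callback can append each batch to a CSV as it arrives. A sketch; I'm assuming the callback receives a batch of tweet objects with .date and .text attributes, so treat the signature as an assumption:

```python
import csv

def make_csv_buffer(path):
    """Return a receiveBuffer callback that appends each batch of tweets to `path`."""
    def receive(tweets):
        with open(path, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for tweet in tweets:
                writer.writerow([tweet.date, tweet.text])
    return receive

# Hypothetical usage with a tweetCriteria built as in the comments above:
# got.manager.TweetManager.getTweets(
#     tweetCriteria, bufferLength=10, receiveBuffer=make_csv_buffer("partial.csv"))
```

Because the file is opened in append mode on every batch, whatever was fetched before a 429 or SSL error is already safely on disk.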
from getoldtweets3.
Hey guys, playing around with time.sleep.
Does anyone know how to find out exactly how long I need to wait to retry?
from getoldtweets3.
Hey guys, playing around with time.sleep.
Does anyone know how to find out exactly how long I need to wait to retry?
Yes, you can get it! But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can get the time value (in seconds) from the "Retry-After" header of the response. Another way to handle it is to use some arbitrary value and increase it if you get the error again.
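That "arbitrary value, increased on repeated errors" idea is plain exponential backoff. A minimal sketch; the base delay, growth factor, and cap are all assumptions you would tune:

```python
import time

def backoff_delays(base=60, factor=2, cap=960):
    """Yield sleep durations that grow by `factor` on each retry, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

def fetch_with_backoff(fetch, max_attempts=5, base=60):
    """Retry `fetch`, sleeping longer after each failure."""
    delays = backoff_delays(base=base)
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(next(delays))
```

With base=60 the waits go 60, 120, 240, 480, 960 seconds; the 960-second cap roughly matches the 16-minute sleep reported to work earlier in the thread.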
from getoldtweets3.
Hey guys, playing around with time.sleep.
Does anyone know how to find out exactly how long I need to wait to retry?

Yes, you can get it! But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can get the time value (in seconds) from the "Retry-After" header of the response. Another way to handle it is to use some arbitrary value and increase it if you get the error again.
Ah okay, let me dig into that. One last thing mate - for the retry-after response header, which file in the code should I be looking at?
Appreciate it and thank you!
from getoldtweets3.
I wrote this function in order to download all tweets (matching a certain query) from a month in three sections, while sleeping in between requests. This allows me to download 20,000+ tweets in under an hour. Obviously, the following function takes SinceDate and adds 10 and 20 days, but you guys can change the format as needed for your own projects. Hope it's helpful!
#------
Thank you @Clairedevries! I adopted and altered your code. The function now waits after each day for a specified amount of sleep time. 15 minutes sleep should be on the safe side given the API rate limits. This too does not work for too many tweets (say >100k) in a single day.
import GetOldTweets3 as got
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, query, sleep=900, maxtweet=0):
    # create a list of day numbers
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    days = list(range(0, (datetime.strptime(UntilDate, '%Y-%m-%d') - since).days + 1))
    tweets = []
    for day in days:
        init = got.manager.TweetCriteria().setQuerySearch(query).setSince((since + timedelta(days=day)).strftime('%Y-%m-%d')).setUntil((since + timedelta(days=day + 1)).strftime('%Y-%m-%d')).setMaxTweets(maxtweet)
        get = got.manager.TweetManager.getTweets(init)
        tweets.append([[tweet.id, tweet.date, tweet.text] for tweet in get])
        print("day", day + 1, "of", len(days), "completed")
        print("sleeping for", sleep, "seconds")
        time.sleep(sleep)
    # flatten list
    tweets = [tweet for sublist in tweets for tweet in sublist]
    return tweets
#%%
since = "2020-02-27"
until = "2020-03-01"
tweets = DownloadTweets(since, until, query='trump', maxtweet=10, sleep=10)
from getoldtweets3.
(quoting the DownloadTweets function and example posted at the top of the thread)
I'm getting a very, very small number of tweets when I use this function, roughly 20-30 for an entire month long period (and there should definitely be more tweets than that for the queries I'm making). Any ideas?
from getoldtweets3.
Related Issues (20)
- [Question] Is it posisble to get a tweet based on Twitter Status ID? HOT 1
- [Question] still 404 error.. HOT 7
- HTTP Error 403: Forbidden HOT 1
- Import and Installation Issue
- Tweets sent as replies to a specific account
- Replies to certain tweets HOT 2
- Fetching Tweets based on Time stamps
- Related to emojis
- [Discussion] Option for setting max number of tweets per day HOT 1
- Cannot get sensitive tweets HOT 1
- How can I export results to a csv file? HOT 1
- search for exact term
- Problem iterating through multiple days and storing output because of high amount of daily tweets HOT 1
- Unrelated, uncorrectly dated, duplicated tweets in retrieved data. Advertisements/Spam? HOT 2
- Query
- .setUntil() always yields zero results
- It is not a bug but, Is there a way to gather only tweets? (excluding replies etc..) HOT 1
- Not getting all the tweets. HOT 7
- HTTP Error, Gives 404 but the URL is working HOT 144
- [Question] Using tweets downloaded from GetOldtweets