Comments (32)

Clairedevries commented on July 30, 2024
import time
from datetime import datetime, timedelta

import GetOldTweets3 as got
import pandas as pd

def DownloadTweets(SinceDate, UntilDate, Query):
    '''
    Downloads all tweets from a certain month in three sessions in order to
    avoid sending too many requests.
    Date format = 'yyyy-mm-dd'.
    Query = string.
    '''
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    tenth = since + timedelta(days=10)
    twentieth = since + timedelta(days=20)

    # session 1: first ten days of the month
    print('starting first download')
    first = got.manager.TweetCriteria().setQuerySearch(Query).setSince(since.strftime('%Y-%m-%d')).setUntil(tenth.strftime('%Y-%m-%d'))
    firstdownload = got.manager.TweetManager.getTweets(first)
    firstlist = [[tweet.date, tweet.text] for tweet in firstdownload]
    df_1 = pd.DataFrame.from_records(firstlist, columns=["date", "tweet"])
    #df_1.to_csv("%s_1.csv" % SinceDate)

    time.sleep(600)  # pause 10 minutes between sessions

    # session 2: days 11-20
    print('starting second download')
    second = got.manager.TweetCriteria().setQuerySearch(Query).setSince(tenth.strftime('%Y-%m-%d')).setUntil(twentieth.strftime('%Y-%m-%d'))
    seconddownload = got.manager.TweetManager.getTweets(second)
    secondlist = [[tweet.date, tweet.text] for tweet in seconddownload]
    df_2 = pd.DataFrame.from_records(secondlist, columns=["date", "tweet"])
    #df_2.to_csv("%s_2.csv" % SinceDate)

    time.sleep(600)

    # session 3: day 21 up to UntilDate
    print('starting third download')
    third = got.manager.TweetCriteria().setQuerySearch(Query).setSince(twentieth.strftime('%Y-%m-%d')).setUntil(until.strftime('%Y-%m-%d'))
    thirddownload = got.manager.TweetManager.getTweets(third)
    thirdlist = [[tweet.date, tweet.text] for tweet in thirddownload]
    df_3 = pd.DataFrame.from_records(thirdlist, columns=["date", "tweet"])
    #df_3.to_csv("%s_3.csv" % SinceDate)

    # combine the three sessions and save the whole month to one csv
    df = pd.concat([df_1, df_2, df_3])
    df.to_csv("%s.csv" % SinceDate)

    return df

I wrote this function in order to download all tweets matching a certain query from a month in three sections, sleeping in between requests. This lets me download 20,000+ tweets in under an hour. The function above simply takes SinceDate and adds 10 and 20 days, but you can change the format as needed for your own projects. Hope it's helpful!

#------
#Example:
#DownloadTweets('2019-01-01', '2019-01-31', 'klimaat')

ekalhor commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.
>
> The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
> Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

Since they are mimicking Twitter's advanced search, the smallest time unit is a day -- you can see the available options by checking out Twitter's advanced search page.

For ranges larger than a day, it is easy to loop through the days and have it sleep in between. I couldn't find the download rate limit for Twitter's server, but when I had it sleep 16 minutes between two days, it seemed to recover from the Too Many Requests error.

My problem is when I download one day of tweets for a common word; I don't think it has much to do with the advanced search functionality anymore, which is a good thing. Similar to max tweets, there should be a way to cap the downloads while keeping the search alive (maybe that's the issue in your case, @lethalbeans?), with a placeholder for resuming the download after sleeping.
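
The day-by-day loop looks roughly like this (a minimal sketch with a made-up helper name; @tredmill posts a fuller version further down this thread):

import time
from datetime import datetime, timedelta

import GetOldTweets3 as got

def tweets_for_range(query, since, until, pause=960):
    # hypothetical helper: fetch one day at a time, pausing ~16 minutes between days
    day = datetime.strptime(since, '%Y-%m-%d')
    end = datetime.strptime(until, '%Y-%m-%d')
    results = []
    while day < end:
        criteria = got.manager.TweetCriteria().setQuerySearch(query).setSince(day.strftime('%Y-%m-%d')).setUntil((day + timedelta(days=1)).strftime('%Y-%m-%d'))
        results += got.manager.TweetManager.getTweets(criteria)
        day += timedelta(days=1)
        if day < end:
            time.sleep(pause)
    return results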

libbyh commented on July 30, 2024

I'm still working on it, so I haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py

It's kinda sloppy because it includes both the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351-355).
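
For reference, the catch-and-sleep idea looks roughly like this (a minimal sketch, not the exact code from the link above; the wrapper name and the 900-second pause are assumptions):

import time
import urllib.error

def open_with_retry(opener, url, max_retries=3, pause=900):
    # hypothetical wrapper: retry on HTTP 429 instead of exiting,
    # sleeping `pause` seconds before each retry
    for attempt in range(max_retries):
        try:
            return opener.open(url)
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < max_retries - 1:
                print("got a 429, sleeping %d seconds" % pause)
                time.sleep(pause)
            else:
                raise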

cefasgarciapereira commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?
>
> Yes, you can get it!
> But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.
>
> Ah okay, let me dig into that. One last thing mate - for the Retry-After response header, which file in the code should I be looking at?
>
> Appreciate it and thank you!

No problem!

You should change GetOldTweets3/manager/TweetManager.py:

        try:
            response = opener.open(url)
            #--- HERE ---#
            #response.status()
            #response.headers()
            jsonResponse = response.read()
        except Exception as e:
            print("An error occured during an HTTP request:", str(e))
            print("Try to open in browser: https://twitter.com/search?q=%s&src=typd" % urllib.parse.quote(urlGetData))
            sys.exit()

You should try something along those lines. Take a look on the web for the exact code, but it should be something like that. Feel free to email me.
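
For example, the try/except above could be reworked along these lines (a sketch, assuming the server actually sends a Retry-After header with the 429; urllib.error.HTTPError exposes both the status code and the headers):

import time
import urllib.error

while True:
    try:
        response = opener.open(url)
        jsonResponse = response.read()
        break
    except urllib.error.HTTPError as e:
        if e.code == 429:
            # honor Retry-After if present; the 60-second fallback is arbitrary
            wait = int(e.headers.get("Retry-After", 60))
            print("rate limited, retrying in %d seconds" % wait)
            time.sleep(wait)
        else:
            print("An error occurred during an HTTP request:", str(e))
            raise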

Jetstarkiller commented on July 30, 2024

I am sadly having the same issue. It used to work great but not anymore. Any solution from the more experienced coders?

klaralindahl commented on July 30, 2024

I'm having the same issue. Is there a way to make the code sleep in order to pace the number of requests?

JerGag commented on July 30, 2024

Same issue here too, I couldn't find a feature that allows the scraping process to sleep.

Jetstarkiller commented on July 30, 2024

Would using a proxy do the trick? If so, how?
I tried to set a proxy in PyCharm and in the general settings, but no luck.

brndnsy commented on July 30, 2024

I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

Jetstarkiller commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

brndnsy commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.
>
> The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
> Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

It's not even working with 10k now :/ I'm guessing Twitter's team has put in countermeasures.

brndnsy commented on July 30, 2024

I tried to alter the source code with time.sleep but got a different error:

An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable

brndnsy commented on July 30, 2024

@ekalhor thanks, I've managed to get 20k so far, aiming for 100k at least.

Yeah, that would be good. The current code uses try/except and ends abruptly if the JSON request is not fulfilled. I'm currently trying to implement ratelimit:

https://github.com/tomasbasham/ratelimit

sebimarkgraf commented on July 30, 2024

I would propose changing the script's current exit behaviour to a unified exception.
That way the user can decide to catch the exception and use whatever method they like to retry the request later.
Using sys.exit() without any error code seems like the worst way to handle it.
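
Something like this would do it (a sketch; the exception name is made up here, not part of the library):

# hypothetical exception type replacing the sys.exit() call
class TweetManagerError(Exception):
    '''Raised when an HTTP request to Twitter fails, instead of exiting.'''
    def __init__(self, message, url=None):
        super().__init__(message)
        self.url = url

# the except block in TweetManager would then become:
#     raise TweetManagerError(str(e), url=url)
# so callers can catch it and decide for themselves whether to sleep and retry.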

Besides that, the approach from @lethalbeans seems like a really good idea to me. Have you made progress on that?

bcornet1 commented on July 30, 2024

@ekalhor, this should do the trick

#3 (comment)

libbyh commented on July 30, 2024

@lethalbeans did you get ratelimit to work?

mohamadre3a commented on July 30, 2024

Has anybody found a solution for this? I get the same error when I fetch 15,000 to 20,000 tweets.

libbyh commented on July 30, 2024

I used ratelimit to solve the 429 problem, but I eventually got another error:

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>

To use ratelimit:

from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60

# throttle the function that performs the actual HTTP request
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI(url):
    ...

mohamadre3a commented on July 30, 2024

I tried it and still got the "too many requests" error.
Could you please share your complete code? (I mean the callAPI function.)

meixingdg commented on July 30, 2024

> (quoting @libbyh's ratelimit comment from above)

Has anyone solved the certificate expired error, or does anyone know what's causing it? I can't tell if it's a time-limit-related issue or something else. I'm also getting the certificate expired error, but it has happened after 700 tweets, 4,100 tweets, 0 tweets...

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>

MichaelKarpe commented on July 30, 2024

Hi @libbyh @meixingdg,
I was able to solve the SSL certificate error with the following two lines at the beginning of the TweetManager.py file:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
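
Note that this disables certificate verification entirely. A gentler workaround, assuming the certifi package is installed, is to point urllib's default HTTPS context at certifi's up-to-date CA bundle instead:

import ssl
import certifi

# keep verification on, but use certifi's current CA bundle
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())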

> (quoting @libbyh's ratelimit comment from above)

I'm trying to use ratelimit with the modifications you described here, but so far ratelimit does not seem to work on my side.

> (quoting @libbyh's comment from above linking the modified TweetManager.py)

Lines 277-279 and lines 351-355 are not related, right? Then are you sure ratelimit is working on your side? With only lines 351-355 and the time.sleep set to 15 minutes, I think the 429 issue is solved.

EDIT: Here is a new error I now sometimes get when downloading tweets: An error occured during an HTTP request: <urlopen error [Errno -3] Temporary failure in name resolution>

preetham-salehundam commented on July 30, 2024

@meixingdg Did you figure it out? I get the 429 error after 10,000 tweets. I have tried retry and sleep, but it didn't work.

spavank commented on July 30, 2024

Did anyone figure out a good solution? I have been trying to get tweets for a popular hashtag, say #coronavirus, and ran into the same issue. I have been limiting the search to each day, yet on some days there are so many tweets that I still run into this issue. I used to get the 429 error; now it has switched to a 503 error. As someone stated, each time I retry, the total number of tweets gathered before the error is even smaller. I went as small as 50, put in a sleep for a minute, then began gathering again; it still fails after about 2,000 tweets have been gathered. At the other extreme, I tried to get 10,000, then slept for 65 minutes. Still not much luck. Kinda stuck. Any thoughts or solutions?

ArjunAcharya0311 commented on July 30, 2024

Hi, has anybody found a solution to this yet? I am trying to download the tweets for a hot topic for a single day; however, any request for more than 10,000 tweets shows me the 429 error... @Clairedevries does your function solve this issue? And can it download hundreds of thousands of tweets for a single day?

Clairedevries commented on July 30, 2024

Yes. I downloaded 312,000 tweets in a single day by running the function above. I ran the function 12 times, once for every month in 2014. You might not be able to use this exact function, as it was made specifically for my project, but it shows how you can run a function and have it sleep in between in order to avoid errors. It might not work if there are more than 100,000 tweets in a single day, though.
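
Concretely, the driver loop can look something like this (a sketch of the approach, not the exact script; the calendar-based month-end handling is added here for illustration):

import calendar

# sketch: one DownloadTweets call per month of 2014
for month in range(1, 13):
    last_day = calendar.monthrange(2014, month)[1]
    DownloadTweets('2014-%02d-01' % month, '2014-%02d-%02d' % (month, last_day), 'klimaat')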

badaouisaad commented on July 30, 2024

I used the same function def DownloadTweets(SinceDate, UntilDate, Query) as well. In case there are too many tweets and you get an error message, break the dates down into smaller intervals and save the data to csv files in small increments.
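
In other words, something along these lines (a sketch; the helper name and the five-day chunk size are arbitrary, and it assumes a DownloadTweets-style function that takes a since/until pair and writes its own csv):

from datetime import datetime, timedelta

def download_in_chunks(since, until, query, chunk_days=5):
    # hypothetical helper: walk the [since, until] range in small windows,
    # producing one csv per window via DownloadTweets
    start = datetime.strptime(since, '%Y-%m-%d')
    end = datetime.strptime(until, '%Y-%m-%d')
    while start < end:
        stop = min(start + timedelta(days=chunk_days), end)
        DownloadTweets(start.strftime('%Y-%m-%d'), stop.strftime('%Y-%m-%d'), query)
        start = stop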

The issue I find is that the geo information is not available. I am aware that not all tweets have geo information, but when downloading the data using the Twitter API we do get a small % of tweets that have the geocode info.

cefasgarciapereira commented on July 30, 2024

I noticed there is a buffer option in the library. By using it, I could update a .csv file for every 10 tweets returned by the library. Even in cases where I got some error, the number was satisfactory for me. Basically what I did was:

def partial_results(tweets):
    # receiveBuffer is called with a list of bufferLength tweets at a time
    for tweet in tweets:
        print(tweet.text)

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)
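
To actually update a .csv with each batch, as described above, the callback can append instead of print (a sketch; the tweets.csv filename is arbitrary):

import csv

def partial_results(tweets):
    # called with each batch of bufferLength tweets; append them to disk
    # ('tweets.csv' is an arbitrary filename for this example)
    with open('tweets.csv', 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([[t.date, t.text] for t in tweets])

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)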

darrenlimweiyang commented on July 30, 2024

Hey guys, playing around with time.sleep.

Does anyone know how to find out exactly how long I need to wait to retry?

cefasgarciapereira commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?

Yes, you can get it!

But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

darrenlimweiyang commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?
>
> Yes, you can get it!
>
> But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

Ah okay, let me dig into that. One last thing mate - for the Retry-After response header, which file in the code should I be looking at?

Appreciate it and thank you!

tredmill commented on July 30, 2024

> (quoting @Clairedevries's DownloadTweets comment from above)

Thank you @Clairedevries! I adopted and altered your code. The function now waits after each day for a specified amount of sleep time; 15 minutes of sleep should be on the safe side given the API rate limits. This too does not work for too many tweets (say >100k) in a single day.

import GetOldTweets3 as got
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, query, sleep=900, maxtweet=0):
    # build a list of day offsets covering [SinceDate, UntilDate]
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    days = list(range((until - since).days + 1))
    tweets = []

    for day in days:
        # one-day window: [since + day, since + day + 1); maxtweet=0 means no cap
        init = got.manager.TweetCriteria().setQuerySearch(query).setSince((since + timedelta(days=day)).strftime('%Y-%m-%d')).setUntil((since + timedelta(days=day + 1)).strftime('%Y-%m-%d')).setMaxTweets(maxtweet)
        get = got.manager.TweetManager.getTweets(init)
        tweets.append([[tweet.id, tweet.date, tweet.text] for tweet in get])
        print("day", day + 1, "of", len(days), "completed")
        print("sleeping for", sleep, "seconds")
        time.sleep(sleep)
    # flatten the per-day lists into one list of tweets
    tweets = [tweet for sublist in tweets for tweet in sublist]
    return tweets

#%%
since = "2020-02-27"
until = "2020-03-01"

tweets = DownloadTweets(since, until, query='trump', maxtweet=10, sleep=10)

rayms commented on July 30, 2024
> (quoting @Clairedevries's DownloadTweets function and example from above)

I'm getting a very, very small number of tweets when I use this function, roughly 20-30 for an entire month-long period (and there should definitely be more tweets than that for the queries I'm making). Any ideas?
