Comments (32)

Clairedevries commented on July 30, 2024
import time
from datetime import datetime, timedelta

import GetOldTweets3 as got
import pandas as pd

def DownloadTweets(SinceDate, UntilDate, Query):
    '''
    Downloads all tweets from a certain month in three sessions in order to
    avoid sending too many requests.
    Date format = 'yyyy-mm-dd'.
    Query = string.
    '''
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    tenth = since + timedelta(days=10)
    twentieth = since + timedelta(days=20)

    # session 1: first ten days of the month
    print('starting first download')
    first = got.manager.TweetCriteria().setQuerySearch(Query).setSince(since.strftime('%Y-%m-%d')).setUntil(tenth.strftime('%Y-%m-%d'))
    firstdownload = got.manager.TweetManager.getTweets(first)
    firstlist = [[tweet.date, tweet.text] for tweet in firstdownload]
    df_1 = pd.DataFrame.from_records(firstlist, columns=["date", "tweet"])
    #df_1.to_csv("%s_1.csv" % SinceDate)

    time.sleep(600)  # pause 10 minutes between sessions

    # session 2: days 11-20
    print('starting second download')
    second = got.manager.TweetCriteria().setQuerySearch(Query).setSince(tenth.strftime('%Y-%m-%d')).setUntil(twentieth.strftime('%Y-%m-%d'))
    seconddownload = got.manager.TweetManager.getTweets(second)
    secondlist = [[tweet.date, tweet.text] for tweet in seconddownload]
    df_2 = pd.DataFrame.from_records(secondlist, columns=["date", "tweet"])
    #df_2.to_csv("%s_2.csv" % SinceDate)

    time.sleep(600)

    # session 3: day 21 up to UntilDate
    print('starting third download')
    third = got.manager.TweetCriteria().setQuerySearch(Query).setSince(twentieth.strftime('%Y-%m-%d')).setUntil(until.strftime('%Y-%m-%d'))
    thirddownload = got.manager.TweetManager.getTweets(third)
    thirdlist = [[tweet.date, tweet.text] for tweet in thirddownload]
    df_3 = pd.DataFrame.from_records(thirdlist, columns=["date", "tweet"])
    #df_3.to_csv("%s_3.csv" % SinceDate)

    # combine the three sessions and save the whole month to one csv
    df = pd.concat([df_1, df_2, df_3])
    df.to_csv("%s.csv" % SinceDate)

    return df

I wrote this function in order to download all tweets matching a certain query from a month in three sections, sleeping in between requests. This lets me download 20,000+ tweets in under an hour. The function above simply takes SinceDate and adds 10 and 20 days, but you can change the format as needed for your own projects. Hope it's helpful!

#------
#Example:
#DownloadTweets('2019-01-01', '2019-01-31', 'klimaat')

ekalhor commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.
>
> The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
> Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

Since they are mimicking Twitter's advanced search, the smallest time unit is a day -- you can see the available options by checking out Twitter's advanced search page.

For ranges larger than a day, it is easy to loop through the days and have it sleep in between. I couldn't find the download rate limit for Twitter's server, but when I had it sleep 16 minutes between two days, it seemed to recover from the Too Many Requests error.

My problem is when I download one day of tweets for a common word; I don't think it has much to do with the advanced search functionality anymore, which is a good thing. Similar to max tweets, there should be a way to cap the downloads while keeping the search alive (maybe that's the issue in your case, @lethalbeans?), with a placeholder for resuming the download after sleeping.
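
The day-by-day loop looks roughly like this (a minimal sketch with a made-up helper name; @tredmill posts a fuller version further down this thread):

import time
from datetime import datetime, timedelta

import GetOldTweets3 as got

def tweets_for_range(query, since, until, pause=960):
    # hypothetical helper: fetch one day at a time, pausing ~16 minutes between days
    day = datetime.strptime(since, '%Y-%m-%d')
    end = datetime.strptime(until, '%Y-%m-%d')
    results = []
    while day < end:
        criteria = got.manager.TweetCriteria().setQuerySearch(query).setSince(day.strftime('%Y-%m-%d')).setUntil((day + timedelta(days=1)).strftime('%Y-%m-%d'))
        results += got.manager.TweetManager.getTweets(criteria)
        day += timedelta(days=1)
        if day < end:
            time.sleep(pause)
    return results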

libbyh commented on July 30, 2024

I'm still working on it, so I haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py

It's kinda sloppy because it includes both the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351-355).
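
For reference, the catch-and-sleep idea looks roughly like this (a minimal sketch, not the exact code from the link above; the wrapper name and the 900-second pause are assumptions):

import time
import urllib.error

def open_with_retry(opener, url, max_retries=3, pause=900):
    # hypothetical wrapper: retry on HTTP 429 instead of exiting,
    # sleeping `pause` seconds before each retry
    for attempt in range(max_retries):
        try:
            return opener.open(url)
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < max_retries - 1:
                print("got a 429, sleeping %d seconds" % pause)
                time.sleep(pause)
            else:
                raise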

cefasgarciapereira commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?
>
> Yes, you can get it!
> But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.
>
> Ah okay, let me dig into that. One last thing mate - for the Retry-After response header, which file in the code should I be looking at?
>
> Appreciate it and thank you!

No problem!

You should change GetOldTweets3/manager/TweetManager.py:

        try:
            response = opener.open(url)
            #--- HERE ---#
            #response.status()
            #response.headers()
            jsonResponse = response.read()
        except Exception as e:
            print("An error occured during an HTTP request:", str(e))
            print("Try to open in browser: https://twitter.com/search?q=%s&src=typd" % urllib.parse.quote(urlGetData))
            sys.exit()

You should try something along those lines. Take a look on the web for the exact code, but it should be something like that. Feel free to email me.
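
For example, the try/except above could be reworked along these lines (a sketch, assuming the server actually sends a Retry-After header with the 429; urllib.error.HTTPError exposes both the status code and the headers):

import time
import urllib.error

while True:
    try:
        response = opener.open(url)
        jsonResponse = response.read()
        break
    except urllib.error.HTTPError as e:
        if e.code == 429:
            # honor Retry-After if present; the 60-second fallback is arbitrary
            wait = int(e.headers.get("Retry-After", 60))
            print("rate limited, retrying in %d seconds" % wait)
            time.sleep(wait)
        else:
            print("An error occurred during an HTTP request:", str(e))
            raise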

Jetstarkiller commented on July 30, 2024

I am sadly having the same issue. It used to work great but not anymore. Any solution from the more experienced coders?

klaralindahl commented on July 30, 2024

I'm having the same issue. Is there a way to make the code sleep in order to pace the number of requests?

JerGag commented on July 30, 2024

Same issue here too, I couldn't find a feature that allows the scraping process to sleep.

Jetstarkiller commented on July 30, 2024

Would using a proxy do the trick? If so, how?
I tried to set a proxy in PyCharm and in the general settings, but no luck.

brndnsy commented on July 30, 2024

I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

Jetstarkiller commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

brndnsy commented on July 30, 2024

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.
>
> The same. I am not sure what happened. The problem is that the query search I am trying to run definitely has more than 10,000 tweets in a 24h period.
> Can you search by the hour? That would be annoying but would still allow scraping the data for a whole day.

It's not even working with 10k now :/ I'm guessing Twitter's team has put in countermeasures.

brndnsy commented on July 30, 2024

I tried to alter the source code with time.sleep but got a different error:

An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable

brndnsy commented on July 30, 2024

@ekalhor thanks, I've managed to get 20k so far, aiming for 100k at least.

Yeah, that would be good. The current code uses try/except and ends abruptly if the JSON request is not fulfilled. I'm currently trying to implement ratelimit:

https://github.com/tomasbasham/ratelimit

sebimarkgraf commented on July 30, 2024

I would propose changing the script's current exit behaviour to a unified exception.
That way the user can decide to catch the exception and use whatever method they like to retry the request later.
Using sys.exit() without any error code seems like the worst way to handle it.
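
Something like this would do it (a sketch; the exception name is made up here, not part of the library):

# hypothetical exception type replacing the sys.exit() call
class TweetManagerError(Exception):
    '''Raised when an HTTP request to Twitter fails, instead of exiting.'''
    def __init__(self, message, url=None):
        super().__init__(message)
        self.url = url

# the except block in TweetManager would then become:
#     raise TweetManagerError(str(e), url=url)
# so callers can catch it and decide for themselves whether to sleep and retry.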

Besides that, the approach from @lethalbeans seems like a really good idea to me. Have you made progress on that?

bcornet1 commented on July 30, 2024

@ekalhor, this should do the trick

#3 (comment)

libbyh commented on July 30, 2024

@lethalbeans did you get ratelimit to work?

mohamadre3a commented on July 30, 2024

Has anybody found a solution for this? I get the same error when I fetch 15,000 to 20,000 tweets.

libbyh commented on July 30, 2024

I used ratelimit to solve the 429 problem, but I eventually got another error:

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>

To use ratelimit:

from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60

# throttle the function that performs the actual HTTP request
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI(url):
    ...

mohamadre3a commented on July 30, 2024

I tried it and still got the "too many requests" error.
Could you please share your complete code? (I mean the callAPI function.)

meixingdg commented on July 30, 2024

> (quoting @libbyh's ratelimit comment from above)

Has anyone solved the certificate expired error, or does anyone know what's causing it? I can't tell if it's a time-limit-related issue or something else. I'm also getting the certificate expired error, but it has happened after 700 tweets, 4,100 tweets, 0 tweets...

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>

MichaelKarpe commented on July 30, 2024

Hi @libbyh @meixingdg,
I was able to solve the SSL certificate error with the following two lines at the beginning of the TweetManager.py file:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
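
Note that this disables certificate verification entirely. A gentler workaround, assuming the certifi package is installed, is to point urllib's default HTTPS context at certifi's up-to-date CA bundle instead:

import ssl
import certifi

# keep verification on, but use certifi's current CA bundle
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())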

> (quoting @libbyh's ratelimit comment from above)

I'm trying to use ratelimit with the modifications you described here, but so far ratelimit does not seem to work on my side.

> (quoting @libbyh's comment from above linking the modified TweetManager.py)

Lines 277-279 and lines 351-355 are not related, right? Then are you sure ratelimit is working on your side? With only lines 351-355 and the time.sleep set to 15 minutes, I think the 429 issue is solved.

EDIT: Here is a new error I now sometimes get when downloading tweets: An error occured during an HTTP request: <urlopen error [Errno -3] Temporary failure in name resolution>

preetham-salehundam commented on July 30, 2024

@meixingdg Did you figure it out? I get the 429 error after 10,000 tweets. I have tried retry and sleep, but it didn't work.

spavank commented on July 30, 2024

Did anyone figure out a good solution? I have been trying to get tweets for a popular hashtag, say #coronavirus, and ran into the same issue. I have been limiting the search to each day, yet on some days there are so many tweets that I still run into this issue. I used to get the 429 error; now it has switched to a 503 error. As someone stated, each time I retry, the total number of tweets gathered before the error is even smaller. I went as small as 50, put in a sleep for a minute, then began gathering again; it still fails after about 2,000 tweets have been gathered. At the other extreme, I tried to get 10,000, then slept for 65 minutes. Still not much luck. Kinda stuck. Any thoughts or solutions?

ArjunAcharya0311 commented on July 30, 2024

Hi, has anybody found a solution to this yet? I am trying to download the tweets for a hot topic for a single day; however, any request for more than 10,000 tweets shows me the 429 error... @Clairedevries does your function solve this issue? And can it download hundreds of thousands of tweets for a single day?

Clairedevries commented on July 30, 2024

Yes. I downloaded 312,000 tweets in a single day by running the function above. I ran the function 12 times, once for every month in 2014. You might not be able to use this exact function, as it was made specifically for my project, but it shows how you can run a function and have it sleep in between in order to avoid errors. It might not work if there are more than 100,000 tweets in a single day, though.
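
Concretely, the driver loop can look something like this (a sketch of the approach, not the exact script; the calendar-based month-end handling is added here for illustration):

import calendar

# sketch: one DownloadTweets call per month of 2014
for month in range(1, 13):
    last_day = calendar.monthrange(2014, month)[1]
    DownloadTweets('2014-%02d-01' % month, '2014-%02d-%02d' % (month, last_day), 'klimaat')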

badaouisaad commented on July 30, 2024

I used the same function def DownloadTweets(SinceDate, UntilDate, Query) as well. In case there are too many tweets and you get an error message, break the dates down into smaller intervals and save the data to csv files in small increments.
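
In other words, something along these lines (a sketch; the helper name and the five-day chunk size are arbitrary, and it assumes a DownloadTweets-style function that takes a since/until pair and writes its own csv):

from datetime import datetime, timedelta

def download_in_chunks(since, until, query, chunk_days=5):
    # hypothetical helper: walk the [since, until] range in small windows,
    # producing one csv per window via DownloadTweets
    start = datetime.strptime(since, '%Y-%m-%d')
    end = datetime.strptime(until, '%Y-%m-%d')
    while start < end:
        stop = min(start + timedelta(days=chunk_days), end)
        DownloadTweets(start.strftime('%Y-%m-%d'), stop.strftime('%Y-%m-%d'), query)
        start = stop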

The issue I find is that the geo information is not available. I am aware that not all tweets have geo information, but when downloading the data using the Twitter API we do get a small % of tweets that have the geocode info.

cefasgarciapereira commented on July 30, 2024

I noticed there is a buffer option in the library. By using it, I could update a .csv file for every 10 tweets returned by the library. Even in cases where I got some error, the number was satisfactory for me. Basically what I did was:

def partial_results(tweets):
    # receiveBuffer is called with a list of bufferLength tweets at a time
    for tweet in tweets:
        print(tweet.text)

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)
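
To actually update a .csv with each batch, as described above, the callback can append instead of print (a sketch; the tweets.csv filename is arbitrary):

import csv

def partial_results(tweets):
    # called with each batch of bufferLength tweets; append them to disk
    # ('tweets.csv' is an arbitrary filename for this example)
    with open('tweets.csv', 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([[t.date, t.text] for t in tweets])

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)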

darrenlimweiyang commented on July 30, 2024

Hey guys, playing around with time.sleep.

Does anyone know how to find out exactly how long I need to wait to retry?

cefasgarciapereira commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?

Yes, you can get it!

But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

darrenlimweiyang commented on July 30, 2024

> Hey guys, playing around with time.sleep.
> Does anyone know how to find out exactly how long I need to wait to retry?
>
> Yes, you can get it!
>
> But... the library stops the program when it gets an error. What you can do is download the source code and change it yourself. If the response has a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

Ah okay, let me dig into that. One last thing mate - for the Retry-After response header, which file in the code should I be looking at?

Appreciate it and thank you!

tredmill commented on July 30, 2024

> (quoting @Clairedevries's DownloadTweets comment from above)

Thank you @Clairedevries! I adopted and altered your code. The function now waits after each day for a specified amount of sleep time; 15 minutes of sleep should be on the safe side given the API rate limits. This too does not work for too many tweets (say >100k) in a single day.

import GetOldTweets3 as got
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, query, sleep=900, maxtweet=0):
    # build a list of day offsets covering [SinceDate, UntilDate]
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    days = list(range((until - since).days + 1))
    tweets = []

    for day in days:
        # one-day window: [since + day, since + day + 1); maxtweet=0 means no cap
        init = got.manager.TweetCriteria().setQuerySearch(query).setSince((since + timedelta(days=day)).strftime('%Y-%m-%d')).setUntil((since + timedelta(days=day + 1)).strftime('%Y-%m-%d')).setMaxTweets(maxtweet)
        get = got.manager.TweetManager.getTweets(init)
        tweets.append([[tweet.id, tweet.date, tweet.text] for tweet in get])
        print("day", day + 1, "of", len(days), "completed")
        print("sleeping for", sleep, "seconds")
        time.sleep(sleep)
    # flatten the per-day lists into one list of tweets
    tweets = [tweet for sublist in tweets for tweet in sublist]
    return tweets

#%%
since = "2020-02-27"
until = "2020-03-01"

tweets = DownloadTweets(since, until, query='trump', maxtweet=10, sleep=10)

rayms commented on July 30, 2024
> (quoting @Clairedevries's DownloadTweets function and example from above)

I'm getting a very, very small number of tweets when I use this function, roughly 20-30 for an entire month-long period (and there should definitely be more tweets than that for the queries I'm making). Any ideas?
