martinkbeck / twitterscraper
Repository containing all files relevant to my basic and advanced tweet scraping articles.
Hi Martin,
Thank you so much for your Medium blog on this tool. This tool is super useful, and you did a great job describing how to use snscrape. I am just curious, do you know if you can filter retweets and replies with this module? Or if there is a way to know if the Tweet you are getting back is a RT, a reply, a part of a thread, etc.
Thanks so much in advance.
JB
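For reference, a minimal sketch of how retweets, replies, and quotes can be told apart: snscrape's Tweet objects carry `inReplyToUser`, `quotedTweet`, and `retweetedTweet` fields (they appear in the snippets later in this thread), so a filter can check whether each one is set. Stand-in objects are used here so the logic runs without scraping.

```python
# Classify a scraped tweet by which optional fields are populated.
# The attribute names match snscrape's Tweet objects; the SimpleNamespace
# stand-ins below are hypothetical substitutes so this runs offline.
from types import SimpleNamespace

def classify(tweet):
    """Return a rough label for a scraped tweet."""
    if getattr(tweet, "retweetedTweet", None) is not None:
        return "retweet"
    if getattr(tweet, "inReplyToUser", None) is not None:
        return "reply"
    if getattr(tweet, "quotedTweet", None) is not None:
        return "quote"
    return "original"

# Hypothetical stand-ins mimicking snscrape's Tweet attributes:
plain = SimpleNamespace(retweetedTweet=None, inReplyToUser=None, quotedTweet=None)
reply = SimpleNamespace(retweetedTweet=None, inReplyToUser="someone", quotedTweet=None)

print(classify(plain))  # original
print(classify(reply))  # reply
```

Detecting membership in a longer thread is harder; checking whether `inReplyToUser` equals the tweet's own author is one common heuristic.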
Hello, thank you very much for this tutorial, it's great. I would like to ask how I can make a query for a particular conversation using its conversation ID?
Thanks in advance.
Good day.
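One hedged sketch: snscrape passes the query string through to Twitter search, which supports a `conversation_id:` operator (this operator is an assumption about Twitter's search syntax, not something documented by snscrape itself), so a whole thread can in principle be pulled by ID.

```python
# Build a search query for one conversation. The conversation_id: operator
# is Twitter search syntax; the ID below is hypothetical.
conversation_id = "1234567890123456789"  # hypothetical conversation ID
query = f"conversation_id:{conversation_id}"
print(query)

# Usage (not executed here, requires network access):
# import snscrape.modules.twitter as sntwitter
# for tweet in sntwitter.TwitterSearchScraper(query).get_items():
#     print(tweet.date, tweet.content)
```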
import os
os.environ["http_proxy"] = "http://127.0.0.1:56916"
os.environ["https_proxy"] = "http://127.0.0.1:56916"
import snscrape.modules.twitter as sntwitter
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

# Scrape data for a specific user
# Creating list to append tweet data
tweets_list1 = []
# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:QCompounding').get_items()):  # CharlieMunger00 Mayhem4Markets QCompounding
    if i > 20:  # number of tweets you want to scrape
        break
    tweets_list1.append([tweet.date, tweet.content, tweet.user.username, tweet.likeCount,
                         tweet.user.displayname, tweet.lang, tweet.hashtags, tweet.mentionedUsers,
                         tweet.inReplyToUser, tweet.quotedTweet, tweet.retweetedTweet, tweet.media])

# Creating a dataframe from the tweets list above
tweets_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Text', 'Username', 'Like Count',
                                                 'Display Name', 'Language', 'hashtags', 'mentionedUsers',
                                                 'inReplyToUser', 'quotedTweet', 'retweetedTweet', 'media'])
tf = tweets_df1[tweets_df1['inReplyToUser'].isnull()]

from urllib.request import urlretrieve
tf = tweets_df1[tweets_df1['media'].isnull() == False]
for i in range(tf.shape[0]):
    try:
        kk = str(i) + 'i'
        urlretrieve(tf.iloc[i, -1][0].fullUrl, "d:/data/photo2/{}.jpg".format(kk))
    except Exception:
        continue
`  File "e:\temp\ipykernel_16024\2936908550.py", line 14, in <cell line: 14>
    for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:QCompounding').get_items()): # CharlieMunger00 Mayhem4Markets QCompounding
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\modules\twitter.py", line 680, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', params, paginationParams, cursor = self._cursor):
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\modules\twitter.py", line 369, in _iter_api_data
    obj = self._get_api_data(endpoint, reqParams)
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\modules\twitter.py", line 338, in _get_api_data
    self._ensure_guest_token()
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\modules\twitter.py", line 301, in _ensure_guest_token
    r = self._get(self._baseUrl if url is None else url, headers = {'User-Agent': self._userAgent}, responseOkCallback = self._check_guest_token_response)
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\base.py", line 216, in _get
    return self._request('GET', *args, **kwargs)
  File "D:\anaconda3\envs\tensorflow\lib\site-packages\snscrape\base.py", line 212, in _request
    raise ScraperException(msg)
ScraperException: 4 requests to https://twitter.com/search?f=live&lang=en&q=from%3AQCompounding&src=spelling_expansion_revert_click failed, giving up.`
Hi! Thanks for your super helpful Jupyter Notebook and Medium tutorial. Really appreciate the time and effort you put into this! Quick question, how do you scrape multiple users in a list? I would ideally like to iterate through a list of usernames and use your code below:
`# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:jack').get_items()):
    if i > maxTweets:
        break
    tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.user.username])`
I tried to iterate through my list like below, but I think I'm doing something wrong.
list = [user1, user2, user3, ...]
i = 0
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:list[i]').get_items()):
Would appreciate any advice! Thank you :)
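A sketch of the fix: the quotes around `'from:list[i]'` make it a literal string, so the username is never substituted. Building one query string per username (the names below are hypothetical) avoids that.

```python
# Build one snscrape search query per username using an f-string, so the
# variable is actually substituted into the query text.
users = ["jack", "user2", "user3"]  # hypothetical usernames
queries = [f"from:{u}" for u in users]
print(queries[0])  # from:jack

# Usage (not executed here, requires network access):
# import snscrape.modules.twitter as sntwitter
# for q in queries:
#     for i, tweet in enumerate(sntwitter.TwitterSearchScraper(q).get_items()):
#         ...
```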
Please update the CLI code run through os.system. It doesn't seem to work with retweetCount or likeCount; when I run these CLI commands in a .ipynb notebook they fail. Please help. Thank you.
Hi, I'm trying to extract tweets combining the geocode filter and the since filter but every time I run it I end up having this error: 'Unable to find guest token'. I've run this same search using Tweepy and I do get a lot of tweets but because of the time constraint I'm very interested in making it run with this scraper. Do you know why could this be happening?
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('covid geocode:"34.052235,-118.243683,10km" since:2021-12-24').get_items()):
    if i > maxTweets:
        break
    tweets_list1.append([tweet.url, tweet.date, tweet.id, tweet.content,
                         tweet.user.username, tweet.replyCount, tweet.retweetCount,
                         tweet.likeCount, tweet.quoteCount, tweet.source, tweet.media,
                         tweet.retweetedTweet, tweet.mentionedUsers])
print('Complete')
Also, if I want to append the coordinates to the dataframe or the country/city, what attribute of tweet. should I use?
Thanks a lot!
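On the coordinates question, a hedged sketch: some snscrape versions expose `coordinates` and `place` attributes on Tweet objects (these attribute names are an assumption; check the Tweet dataclass of your installed version). A stand-in object is used here so the extraction logic runs offline.

```python
# Extract location fields from a tweet-like object. The nested attribute
# names (coordinates.longitude, place.fullName) are assumptions about
# snscrape's Tweet dataclass; the object below is a hypothetical stand-in.
from types import SimpleNamespace

tweet = SimpleNamespace(
    coordinates=SimpleNamespace(longitude=-118.243683, latitude=34.052235),
    place=SimpleNamespace(fullName="Los Angeles, CA"),
)

row = [tweet.coordinates.longitude if tweet.coordinates else None,
       tweet.coordinates.latitude if tweet.coordinates else None,
       tweet.place.fullName if tweet.place else None]
print(row)
```

Note that most real tweets have no GPS coordinates attached, so expect many `None` values.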
Is there any way to get tweets from a specific time span for a specific user? I've tried different things in a Jupyter notebook and only gotten blank dataframes back. Thank you.
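For reference, a sketch of one way to do this: combine the `from:`, `since:`, and `until:` search operators in a single query string (the username and dates below are hypothetical). A blank dataframe usually means the query matched nothing, so printing the query and pasting it into Twitter's search box is a quick sanity check.

```python
# Combine user and date-range operators in one snscrape search query.
username = "jack"  # hypothetical username
query = f"from:{username} since:2021-01-01 until:2021-02-01"
print(query)  # from:jack since:2021-01-01 until:2021-02-01
```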
Hi, for some reason I am getting a lot of irrelevant tweets when I run this code.
I've got the dataframe set up to show which keyword was used to scrape the tweet. I get a wall of relevant tweets from each user with the keyword listed, and then a whole bunch of irrelevant tweets for which keyword column is blank. Can anybody tell why?
# Imports
import snscrape.modules.twitter as sntwitter
import pandas as pd
# Query by text search
# Setting variables to be used below
maxTweets = 500
# Creating list to append tweet data to
tweets_list2 = []
# Creating lists from SearchWords and TwitterHandles txt files:
keywords_list = open("SearchWords.txt", mode='r', encoding='utf-8').read().splitlines()
users_list = open("TwitterHandles.txt", mode='r', encoding='utf-8').read().splitlines()
# Using TwitterSearchScraper to scrape data and append tweets to list
for n, k in enumerate(users_list):
    for m, j in enumerate(keywords_list):
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper('{} from:{} since:2020-07-07 until:2021-07-07'.format(keywords_list[m], users_list[n])).get_items()):
            if i > maxTweets:
                break
            tweets_list2.append([tweet.url, tweet.date, tweet.id, tweet.content, tweet.user.username, tweet.retweetedTweet, keywords_list[m]])
# Creating a dataframe from the tweets list above
tweets_df2 = pd.DataFrame(tweets_list2, columns=['URL', 'Datetime', 'Tweet Id', 'Text', 'Username', 'Retweet', 'Keywords'])
# Display first 5 entries from dataframe
tweets_df2.head()
# Export dataframe into a CSV
tweets_df2.to_csv('text-query-tweets9.csv', sep=',', index=False)
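As a side note, the nested `enumerate` loops over users and keywords can be sketched more simply with `itertools.product`, which yields every (user, keyword) pair; the handles and keywords below are hypothetical.

```python
# Build one search query per (user, keyword) pair with itertools.product,
# replacing the doubly nested enumerate loops.
from itertools import product

users_list = ["alice", "bob"]         # hypothetical handles
keywords_list = ["covid", "vaccine"]  # hypothetical keywords

queries = [f"{kw} from:{user} since:2020-07-07 until:2021-07-07"
           for user, kw in product(users_list, keywords_list)]
print(len(queries))  # 4, one query per pair
print(queries[0])    # covid from:alice since:2020-07-07 until:2021-07-07
```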
I am getting this issue when scraping tweets with Tweepy. Can you help me resolve it?
Hi! First of all, thank you very much for this tool, this is helping me with my dissertation so much!
I'm having a problem while scraping tweets about a certain topic for a specified period of time. When I try to get tweets from, say, 01/05/2019 to 31/05/2019, I only get tweets up to the 30th.
For my dissertation I needed 10k tweets a day for the past 3 years. I built a function and it worked perfectly, extracting around 300k tweets a month, but in the end I found out that I always miss the last day of each month.
To add all the missing days, I need to download tweets from the last day of each month, but from what I understand I always have to specify a "since" date and an "until" date. So to get tweets from 1st May I specify from 01/05/19 to 02/05/19. The problem is that for the 31st of May I cannot specify an until date, as it would be the 32nd, which doesn't exist.
Am I missing something? How can I get tweets from just a specific day?
p.s. if I set the same date as both the since and until date, it doesn't work.
Thank you in advance
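A sketch of how the month-end problem can be avoided: compute the `until` date with a `timedelta` instead of adding 1 to the day number, since `datetime` rolls 31 May over to 1 June automatically. Because the `until:` bound is exclusive, this yields exactly one day of tweets. The query below is hypothetical.

```python
# Compute an exclusive "until" bound one day after "since"; timedelta
# handles month and year rollover, so 2019-05-31 becomes 2019-06-01.
from datetime import date, timedelta

since = date(2019, 5, 31)
until = since + timedelta(days=1)
query = f"covid from:someuser since:{since} until:{until}"  # hypothetical query
print(until)  # 2019-06-01
```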
Hello @MartinBeckUT, I read your article on Medium; by the way, it is a fantastic one. I tried your code but the CSV and JSON files are blank.
Despite several attempts, none of them work. Always getting the same error.
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=Hello%20from%3Abarackobama%20since%3A2011-01-01%20until%3A2016-12-20&src=typd
An exception has occurred, use %tb to see the full traceback.
SystemExit
Hi Martin
Thank you for the good work you are doing.
I was wondering if there is a limit to the number of tweets one can scrape with snscrape. What about the date, any limit?
Thank you.
Is it possible to scrape the number of likes and retweets for each tweet?
Thank you.
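Yes; a minimal sketch: snscrape's Tweet objects carry `likeCount` and `retweetCount` fields (they also appear in earlier snippets in this thread), so each can be appended to the row per tweet. Stand-in objects keep this runnable without scraping.

```python
# Collect engagement counts from tweet-like objects. The likeCount and
# retweetCount attribute names match snscrape's Tweet objects; the
# SimpleNamespace stand-ins below are hypothetical.
from types import SimpleNamespace

tweets = [SimpleNamespace(content="hi", likeCount=3, retweetCount=1),
          SimpleNamespace(content="yo", likeCount=7, retweetCount=2)]

rows = [(t.content, t.likeCount, t.retweetCount) for t in tweets]
print(rows)  # [('hi', 3, 1), ('yo', 7, 2)]
```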