godkingjay / selenium-twitter-scraper

This is a Twitter Scraper which uses Selenium for scraping tweets. It is capable of scraping tweets from home, user profile, hashtag, query or search, and advanced searches.

License: Apache License 2.0

Jupyter Notebook 52.40% Python 47.60%

scraper selenium-scraper twitter twitter-scraper web-crawling hacktoberfest hacktoberfest-accepted collaborate selenium

selenium-twitter-scraper

Setup

Install dependencies

pip install -r requirements.txt

Authentication Options

Using Environment Variable

Rename .env.example to .env.
Open .env and update environment variables

TWITTER_USERNAME=# Your Twitter Handle or Username (e.g. @username)
TWITTER_PASSWORD=# Your Twitter Password

Authentication in Terminal

Pass a username and password on the command line.

python scraper --user=@elonmusk --password=password123

No Authentication Provided

If you do not specify a username and password, the program prompts you to enter them.

Twitter Username: @username
Password: password123

Authentication Sequence Priority

1. Authentication provided in terminal.
2. Authentication provided in environment variables.
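The priority above can be sketched as a small lookup function. This is only an illustration, not the project's actual code: the function name `resolve_credentials` and its parameters are hypothetical, and falling back per field (rather than for the pair as a whole) is an assumption. The variable names `TWITTER_USERNAME` and `TWITTER_PASSWORD` are the ones from `.env`.

```python
import os
from getpass import getpass


def resolve_credentials(cli_user=None, cli_password=None, env=None,
                        prompt=input, secret_prompt=getpass):
    """Return (username, password) following the documented priority.

    Hypothetical helper: names and per-field fallback are assumptions.
    """
    env = os.environ if env is None else env

    # 1. Values given on the command line win.
    # 2. Otherwise use the values loaded from the environment (.env).
    username = cli_user or env.get("TWITTER_USERNAME")
    password = cli_password or env.get("TWITTER_PASSWORD")

    # 3. Anything still missing is requested interactively.
    if not username:
        username = prompt("Twitter Username: ")
    if not password:
        password = secret_prompt("Password: ")
    return username, password
```

Passing `prompt` and `secret_prompt` as parameters keeps the function easy to exercise without a terminal.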

Usage

Show Help

python scraper --help

Basic usage

python scraper

Set the maximum number of tweets (defaults to 50).

python scraper --tweets=500   # Scrape 500 Tweets

Options and Arguments

usage: python scraper [option] ... [arg] ...

authentication options: description
--user                  : Your Twitter account handle.
                          e.g.
                          --user=@username

--password              : Your Twitter account password.
                          e.g.
                          --password=password123

options:                description
-t, --tweets            : Number of tweets to scrape (default: 50).
                          e.g.
                            -t 500
                            --tweets=500

-u, --username          : Twitter username.
                          Scrape tweets from a user's profile.
                          e.g.
                            -u elonmusk
                            --username=@elonmusk

-ht, --hashtag          : Twitter hashtag.
                          Scrape tweets from a hashtag.
                          e.g.
                            -ht javascript
                            --hashtag=javascript

-q, --query             : Twitter query or search.
                          Scrape tweets from a query or search.
                          e.g.
                            -q "Philippine Marites"
                            --query="Jak Roberto anti selos"

-a, --add               : Additional data to scrape and
                          save in the .csv file.

                          values:
                          pd - poster's followers and following

                          e.g.
                            -a "pd"
                            --add="pd"

                          NOTE: Values must be separated by commas.

--latest                : Twitter latest tweets (default: True).
                          Note: Only for hashtag-based
                          and query-based scraping.
                          usage:
                            python scraper -t 500 -ht=python --latest

--top                   : Twitter top tweets (default: False).
                          Note: Only for hashtag-based
                          and query-based scraping.
                          usage:
                            python scraper -t 500 -ht=python --top

-ntl, --no_tweets_limit : Set no limit to the number of tweets to scrape
                          (will scrape until no more tweets are available).
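For readers who want to see how options like these fit together, here is a minimal `argparse` sketch that accepts the flags listed above. It is an assumption-laden illustration, not the scraper's real parser: the helper names `build_parser` and `parse`, and the derived `mode` attribute, are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="python scraper")
    parser.add_argument("--user", help="Twitter account handle used to log in")
    parser.add_argument("--password", help="Twitter account password")
    parser.add_argument("-t", "--tweets", type=int, default=50)
    parser.add_argument("-u", "--username")
    parser.add_argument("-ht", "--hashtag")
    parser.add_argument("-q", "--query")
    parser.add_argument("-a", "--add", default="")
    parser.add_argument("--latest", action="store_true")
    parser.add_argument("--top", action="store_true")
    parser.add_argument("-ntl", "--no_tweets_limit", action="store_true")
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    # "latest" is the documented default, so only --top switches the mode.
    args.mode = "top" if args.top else "latest"
    # --add takes comma-separated values such as "pd".
    args.add = [value.strip() for value in args.add.split(",") if value.strip()]
    return args
```

For example, `parse(["-t", "100", "-ht", "python", "--top"])` yields `tweets == 100`, `hashtag == "python"`, and `mode == "top"`.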

Sample Scraping Commands

Custom Limit Scraping

python scraper -t 500

User Profile Scraping

python scraper -t 100 -u elonmusk

Hashtag Scraping

Latest

python scraper -t 100 -ht python --latest

Top
```
python scraper -t 100 -ht python --top
```

Query or Search Scraping (also works with Twitter's advanced search)

Latest

python scraper -t 100 -q "Jak Roberto Anti Selos" --latest

Top

python scraper -t 100 -q "International News" --top

Advanced Search Scraping
- For tweets mentioning @elonmusk:
```
python scraper --query="(@elonmusk)"
```
- For tweets that mention @elonmusk with at least 1000 replies from January 1, 2020 to August 31, 2023:
```
python scraper --query="(@elonmusk) min_replies:1000 until:2023-08-31 since:2020-01-01"
```
- To perform other advanced searches, set up the query in Twitter's Advanced Search and copy the resulting query string into the program:
- Twitter Advanced Search
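The query strings in the examples above can also be assembled programmatically. The sketch below is a hypothetical helper (the name `build_query` and its parameters are not part of the scraper); it only covers the operators that appear in the examples: a mention, `min_replies:`, `until:`, and `since:`.

```python
def build_query(mention=None, min_replies=None, since=None, until=None):
    """Assemble an advanced search string from the operators shown above.

    Hypothetical helper: dates are expected as YYYY-MM-DD strings.
    """
    parts = []
    if mention:
        # Accept the handle with or without a leading "@".
        parts.append(f"(@{mention.lstrip('@')})")
    if min_replies is not None:
        parts.append(f"min_replies:{min_replies}")
    if until:
        parts.append(f"until:{until}")
    if since:
        parts.append(f"since:{since}")
    return " ".join(parts)
```

The result can then be passed to the program with `--query`, quoted so the shell treats it as a single argument.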

Scrape Additional Data

python scraper --add="pd"

Values	Description
pd	Tweet poster's id, followers, and following count.

selenium-twitter-scraper's Issues

Add examples to search for @userhandle mentions

The README and examples cover searching for text, searching for hashtags, and listing posts from a user.

However, maybe include an example of searching for @UserHandle mentions as well?

That would complete the program.

(request) Allow command-line USERNAME and PASSWORD input

python scraper -u="hello" -p="password" -t 100 -ht python --latest

This would allow users to rotate between different Twitter accounts when scraping, preventing floods and bans.

By default the program would use the account in .env, but if credentials are provided on the command line, it would use those instead.

Scraping gets stuck if you don't provide a tweet count

When I try to scrape with the following line
python scraper --user=@username--password=password --query="(queryparam) until:2024-05-19 since:2024-04-19"
it gets stuck and cannot continue.

The same happens if I give a higher number such as 10000. It grabs some of the data but then gets stuck again at "waiting to access older tweets".

I think the problem occurs in the "waiting to access older tweets" part. It never gets a response for that part again.
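One generic way to avoid waiting forever in a loop like this is to cap the number of consecutive attempts that return nothing new. The sketch below is not taken from the scraper: `collect_tweets`, `fetch_batch`, and `max_stalls` are hypothetical names, and `fetch_batch` stands in for whatever step scrolls the page and returns the tweets currently visible.

```python
def collect_tweets(fetch_batch, max_tweets=None, max_stalls=3):
    """Collect unique tweets until a limit is hit or progress stops.

    Hypothetical sketch: fetch_batch() returns a list of hashable tweets.
    """
    seen, tweets, stalls = set(), [], 0
    while (max_tweets is None or len(tweets) < max_tweets) and stalls < max_stalls:
        added = 0
        for tweet in fetch_batch():
            if tweet in seen:
                continue
            if max_tweets is not None and len(tweets) >= max_tweets:
                break
            seen.add(tweet)
            tweets.append(tweet)
            added += 1
        # Reset the counter on progress; otherwise count one more stall
        # so the loop ends instead of waiting indefinitely.
        stalls = 0 if added else stalls + 1
    return tweets
```

With `max_tweets=None` the loop still terminates, because it stops after `max_stalls` empty rounds in a row.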

Encode Pictures in CSV File

Hello

I was just wondering if the script could be updated to also include pictures/attachments in the output, perhaps through Base64 encoding?

Thanks.
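As a rough illustration of what this request would involve, the sketch below Base64-encodes raw image bytes into a CSV cell and decodes them again. The function names and the `image_base64` column are hypothetical, and the scraper does not necessarily expose image bytes this way. Note that Base64 text is about one third larger than the binary data it encodes.

```python
import base64
import csv
import io


def image_to_cell(image_bytes):
    """Encode raw image bytes as ASCII text that is safe in a CSV cell."""
    return base64.b64encode(image_bytes).decode("ascii")


def cell_to_image(cell):
    """Decode a CSV cell produced by image_to_cell back into bytes."""
    return base64.b64decode(cell)


def write_rows(rows):
    """Write (content, image_bytes) pairs to CSV text (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["content", "image_base64"])
    for content, image_bytes in rows:
        writer.writerow([content, image_to_cell(image_bytes)])
    return buffer.getvalue()
```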

How to reconstruct the original Tweet ?

Hello everyone !

Here is the result I get in the CSV for this tweet :

My issue is that I want to be able to reconstruct the original text of the tweet. The mentions and emojis are filtered out, which is nice for me because I do want to treat them independently, but what I'm missing is an indication of the character position in the tweet where each one belongs, so that I can put the "👍" back in its place, as well as the mentions in order.

I know that when I used the API, before they made it hard to use, it provided the positions for everything, as well as the tweet with the original text.

Is there a possibility to add that? Or is it already possible to get that information with this project and I missed it?

Thanks ! ♥
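To make the request concrete, this is how reconstruction could work if positions were available. The data layout is purely hypothetical: it assumes the CSV text is the original with mentions and emojis removed, and that each removed token comes with its character offset in the original tweet. The scraper's current output does not provide these offsets.

```python
def reconstruct(stripped_text, tokens):
    """Re-insert removed tokens at their original character offsets.

    tokens is a list of (position, token) pairs, where position is the
    offset of the token in the original tweet (hypothetical format).
    """
    result = stripped_text
    # Inserting in ascending order keeps later offsets valid, because all
    # earlier tokens are already back in place when each one is inserted.
    for position, token in sorted(tokens):
        result = result[:position] + token + result[position:]
    return result
```

For instance, the stripped text `"Thanks  "` with tokens `[(7, "@jay"), (12, "👍")]` would be rebuilt as `"Thanks @jay 👍"`.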

Twitter @user mentions and emojis/emoticons missing

Nice work! But it seems like user twitter handle mentions i.e. @elon @jay are not recorded in the .csv outputs.

Also emojis / emoticons are removed as well.

Are we able to have them in the "content" or in 2 new columns like "mentions" and "emojis"?

Getting error Login Failed: 'NoneType' object is not iterable

When I follow the README instructions, I get the error below. Steps:
1. Get the code.
2. Install requirements.txt.
3. Update the .env file.
4. Run the command: python scraper --user=@myusername --password=mypassword

Loading .env file
Loaded .env file


Initializing Twitter Scraper...
Setup WebDriver...
Initializing FirefoxDriver...
WebDriver Setup Complete

Logging in to Twitter...

Login Failed: 'NoneType' object is not iterable

Max tweets

Hi!
Nice app!

Is there a way to get all the historical data?
something like " -t all " (from -t 50)

(request) Scrape each post/tweet [TWEET_URL], optionally [POSTER_TWITTER_ID], [TWEET_ID]

The [TWEET_URL] is important for referencing every post/tweet that has been scraped.
Can this be done?

Additionally, [POSTER_TWITTER_ID] and [TWEET_ID] are also useful data, if possible.

This should make the scraper very useful and powerful.

I can't specify proxy for selenium

As a user, I'd like to be able to specify the proxy used by Selenium.

Because I want to host this script on a server, and then use a proxy I have at home as a public IP that Twitter will see.

Here is an MR that answers this need: #13
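For context, one common way to route a Chrome-based Selenium session through a proxy is Chromium's `--proxy-server` command-line switch. The sketch below only builds that argument and does not reflect how #13 implements it. The helper name `chrome_proxy_arguments` is hypothetical, and defaulting to `http://` when no scheme is given is an assumption.

```python
from urllib.parse import urlsplit


def chrome_proxy_arguments(proxy):
    """Return the Chrome arguments needed to use the given proxy.

    Hypothetical helper: accepts "host:port" or "scheme://host:port".
    """
    proxy = proxy.strip()
    if not proxy:
        return []  # no proxy requested
    if "://" not in proxy:
        proxy = "http://" + proxy  # assumed default scheme
    parts = urlsplit(proxy)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy must include a host and a port: {proxy!r}")
    return [f"--proxy-server={parts.scheme}://{parts.hostname}:{parts.port}"]


# Usage with Selenium (requires the selenium package and a Chrome driver):
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   for argument in chrome_proxy_arguments("203.0.113.5:3128"):
#       options.add_argument(argument)
#   driver = webdriver.Chrome(options=options)
```

Firefox does not use this switch, so a FirefoxDriver setup would need its own proxy configuration.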

(request) Scrape each poster's [FOLLOWERS] and [FOLLOWING] count?

Is it possible to get the user's [FOLLOWERS] and [FOLLOWING] count as 2 new columns of data in the .csv?

Very useful information when scraping.

Thanks

Failed Logging in to Twitter Require Single-Use Code

Loading .env file
Loaded .env file


Initializing Twitter Scraper...
Setup WebDriver...
Initializing ChromeDriver...
WebDriver Setup Complete

Logging in to Twitter...

Login Failed: This may be due to the following:

- Internet connection is unstable
- Username is incorrect
- Password is incorrect

I believe the error occurs because Twitter sends a single-use code to my email, like two-factor authentication (although I have disabled two-factor authentication).

Would it be possible to add one more step to submit the single-use code during login?