
godkingjay / selenium-twitter-scraper


This is a Twitter scraper that uses Selenium to scrape tweets. It can scrape tweets from the home timeline, a user profile, a hashtag, a query or search, and advanced searches.

License: Apache License 2.0

Jupyter Notebook 52.40% Python 47.60%
scraper selenium-scraper twitter twitter-scraper web-crawling hacktoberfest hacktoberfest-accepted collaborate selenium

selenium-twitter-scraper

Setup

  1. Install dependencies
pip install -r requirements.txt

Authentication Options

Using Environment Variable

  1. Rename .env.example to .env.

  2. Open .env and update environment variables

TWITTER_USERNAME=# Your Twitter Handle (e.g. @username)
TWITTER_USERNAME=# Your Twitter Username
TWITTER_PASSWORD=# Your Twitter Password
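As a rough illustration of how values like these could be read, the sketch below parses `KEY=VALUE` lines from .env-style text using only the standard library. The helper name `parse_env` is hypothetical, and the project itself may load the file differently.

```python
def parse_env(text):
    """Parse KEY=VALUE lines (hypothetical helper, not the project's loader)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and lines without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Placeholder credentials, matching the examples used in this README.
sample = "TWITTER_USERNAME=@username\nTWITTER_PASSWORD=password123\n"
config = parse_env(sample)
print(config["TWITTER_USERNAME"])
```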

Authentication in Terminal

  • Add a username and password to the command line.
python scraper --user=@elonmusk --password=password123

No Authentication Provided

  • If you do not specify a username and password, the program prompts you to enter them.
Twitter Username: @username
Password: password123

Authentication Sequence Priority

1. Authentication provided in terminal.
2. Authentication provided in environment variables.
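The priority order above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `resolve_credentials` and its parameters are hypothetical, and resolving each field independently is an assumption. When neither source provides a value, it falls back to an interactive prompt, as described under "No Authentication Provided".

```python
import getpass


def resolve_credentials(cli_user=None, cli_password=None, env=None,
                        ask_user=input, ask_password=getpass.getpass):
    """Pick credentials: terminal first, then environment, then a prompt.

    Hypothetical helper; `env` is a plain dict standing in for the environment.
    """
    env = env or {}
    user = cli_user or env.get("TWITTER_USERNAME")
    password = cli_password or env.get("TWITTER_PASSWORD")
    if not user:
        user = ask_user("Twitter Username: ")
    if not password:
        password = ask_password("Password: ")
    return user, password


env_values = {"TWITTER_USERNAME": "@env_user", "TWITTER_PASSWORD": "env_pass"}
# Values given in the terminal win over the environment.
creds = resolve_credentials("@cli_user", "cli_pass", env=env_values)
print(creds[0])
```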

Usage

  • Show Help
python scraper --help
  • Basic usage
python scraper
  • Set the maximum number of tweets (defaults to 50).
python scraper --tweets=500   # Scrape 500 Tweets
  • Options and Arguments
usage: python scraper [option] ... [arg] ...

authentication options  description
--user                  : Your twitter account Handle.
                          e.g.
                          --user=@username

--password              : Your twitter account password.
                          e.g.
                          --password=password123

options:                description
-t, --tweets            : Number of tweets to scrape (default: 50).
                          e.g.
                            -t 500
                            --tweets=500

-u, --username          : Twitter username.
                          Scrape tweets from a user's profile.
                          e.g.
                            -u elonmusk
                            --username=@elonmusk

-ht, --hashtag          : Twitter hashtag.
                          Scrape tweets from a hashtag.
                          e.g.
                            -ht javascript
                            --hashtag=javascript

-q, --query             : Twitter query or search.
                          Scrape tweets from a query or search.
                          e.g.
                            -q "Philippine Marites"
                            --query="Jak Roberto anti selos"

-a, --add               : Additional data to scrape and
                          save in the .csv file.

                          values:
                          pd - poster's followers and following

                          e.g.
                            -a "pd"
                            --add="pd"

                          NOTE: Values must be separated by commas.

--latest                : Twitter latest tweets (default: True).
                          Note: Only for hashtag-based
                          and query-based scraping.
                          usage:
                            python scraper -t 500 -ht=python --latest

--top                   : Twitter top tweets (default: False).
                          Note: Only for hashtag-based
                          and query-based scraping.
                          usage:
                            python scraper -t 500 -ht=python --top

-ntl, --no_tweets_limit : Set no limit to the number of tweets to scrape
                          (will scrape until no more tweets are available).
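For readers who want to see how a command line like this can be declared, here is a minimal `argparse` sketch that mirrors the flags listed above. It is an assumption about structure, not the project's actual parser; `build_parser` is a hypothetical name, and deriving the search mode from `--top` (since latest is the documented default) is an illustrative choice.

```python
import argparse


def build_parser():
    """Declare the documented flags (hypothetical sketch, not the real parser)."""
    parser = argparse.ArgumentParser(prog="scraper")
    parser.add_argument("--user", help="Twitter account handle used to log in")
    parser.add_argument("--password", help="Twitter account password")
    parser.add_argument("-t", "--tweets", type=int, default=50)
    parser.add_argument("-u", "--username", help="profile to scrape")
    parser.add_argument("-ht", "--hashtag")
    parser.add_argument("-q", "--query")
    parser.add_argument("-a", "--add", default="", help="comma-separated values")
    parser.add_argument("--latest", action="store_true")
    parser.add_argument("--top", action="store_true")
    parser.add_argument("-ntl", "--no_tweets_limit", action="store_true")
    return parser


args = build_parser().parse_args(["-t", "500", "-ht", "python", "--top", "-a", "pd"])
# Latest is the documented default, so only an explicit --top switches modes.
mode = "top" if args.top else "latest"
# Values for --add are comma-separated; empty entries are dropped.
extras = [value for value in args.add.split(",") if value]
print(args.tweets, args.hashtag, mode, extras)
```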

Sample Scraping Commands

  • Custom Limit Scraping
python scraper -t 500
  • User Profile Scraping
python scraper -t 100 -u elonmusk
  • Hashtag Scraping

    • Latest

      python scraper -t 100 -ht python --latest
    • Top

      python scraper -t 100 -ht python --top
  • Query or Search Scraping (Also works with twitter's advanced search.)

    • Latest

      python scraper -t 100 -q "Jak Roberto Anti Selos" --latest
    • Top

      python scraper -t 100 -q "International News" --top
  • Advanced Search Scraping

    • For tweets mentioning @elonmusk:

      python scraper --query="(@elonmusk)"
    • For tweets that mention @elonmusk with at least 1000 replies, from January 1, 2020 to August 31, 2023:

      python scraper --query="(@elonmusk) min_replies:1000 until:2023-08-31 since:2020-01-01"
    • For more complex searches, set up the query in Twitter's Advanced Search and copy the resulting query string into the program:

    • Twitter Advanced Search Image

  • Scrape Additional Data

python scraper --add="pd"
Values  Description
pd      Tweet poster's id, followers, and following count.
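The advanced search strings shown in the sample commands above can also be assembled programmatically. The sketch below is only an illustration: `build_query` is a hypothetical helper, and it covers just the operators that appear in this README (`min_replies`, `until`, `since`).

```python
from datetime import date


def build_query(mention, min_replies=None, until=None, since=None):
    """Build a search string like the README examples (hypothetical helper)."""
    parts = [f"(@{mention.lstrip('@')})"]
    if min_replies is not None:
        parts.append(f"min_replies:{min_replies}")
    if until is not None:
        parts.append(f"until:{until.isoformat()}")
    if since is not None:
        parts.append(f"since:{since.isoformat()}")
    return " ".join(parts)


query = build_query("elonmusk", min_replies=1000,
                    until=date(2023, 8, 31), since=date(2020, 1, 1))
print(query)  # (@elonmusk) min_replies:1000 until:2023-08-31 since:2020-01-01
```

The resulting string would then be passed to the program with `--query`.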

selenium-twitter-scraper's People

Contributors

godkingjay, hovanhoa, magiprince, nautsimon, pierreminiggio


selenium-twitter-scraper's Issues

(request) Allow command-line USERNAME and PASSWORD input

python scraper -u="hello" -p="password" -t 100 -ht python --latest

This would allow users to rotate between different Twitter accounts when scraping, helping to avoid flooding and bans.

By default it would use the account in .env, but credentials provided on the command line would take precedence.

Scraping gets stuck if you don't provide a tweet count

When I try to scrape with the following command:
python scraper --user=@username --password=password --query="(queryparam) until:2024-05-19 since:2024-04-19"
it gets stuck, as you can see in the following image:
image

The process cannot continue.
The same happens if I give a higher number such as 10000: it grabs some data, then gets stuck again at "waiting to access older tweets".

I think the problem occurs in the "waiting to access older tweets" step; it never gets a response after that point.

Encode Pictures in CSV File

Hello

I was just wondering if the script could be updated to also include pictures/attachments in the output, perhaps through base64 encoding?

Thanks.
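As a small illustration of what this request would involve, the sketch below base64-encodes some image bytes and writes them into a CSV row using the standard library. The column name `image_base64` and the helper `encode_image` are hypothetical; nothing here reflects an existing feature of the scraper.

```python
import base64
import csv
import io


def encode_image(data: bytes) -> str:
    """Return the bytes as an ASCII base64 string that is safe inside a CSV cell."""
    return base64.b64encode(data).decode("ascii")


# Stand-in bytes; a real image would be read from disk or downloaded first.
image_bytes = b"\x89PNG fake bytes"
row = {"content": "example tweet", "image_base64": encode_image(image_bytes)}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Base64 grows the data by roughly a third, so storing a file path or URL next to the tweet may be more practical for large attachments.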

How to reconstruct the original Tweet ?

Hello everyone !

Here is the result I get in the CSV for this tweet :

image

My issue is that I want to reconstruct the original text of the tweet. The mentions and emojis are filtered out, which is nice for me because I do want to treat them independently, but what I'm missing is an indication of the character position in the tweet where each one belongs, so that I can place the "👍" back in its place, as well as the mentions in order.

I know that when I used the API, before it became hard to use, it provided the positions for everything, as well as the original text of the tweet.

Is it possible to add that? Or is it already possible to get that information with this project and I missed it?

Thanks ! ♥
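To make the request concrete, here is a minimal sketch of recording character offsets for mentions so that they can be put back later. It is not something the project provides: `mention_spans` and `restore` are hypothetical helpers, and the simple `@\w+` pattern is an assumption about what counts as a mention.

```python
import re

MENTION_RE = re.compile(r"@\w+")


def mention_spans(text):
    """Return (mention, start offset) pairs in order of appearance."""
    return [(m.group(), m.start()) for m in MENTION_RE.finditer(text)]


def restore(stripped, spans):
    """Re-insert mentions at their original offsets, left to right."""
    for token, start in sorted(spans, key=lambda span: span[1]):
        stripped = stripped[:start] + token + stripped[start:]
    return stripped


text = "Thanks @jay and @elon \U0001F44D"
spans = mention_spans(text)
stripped = MENTION_RE.sub("", text)  # text with the mentions filtered out
print(spans)
print(restore(stripped, spans) == text)
```

Because the offsets refer to the original text, re-inserting in ascending order restores each mention at the right place. The same idea would apply to emojis if their offsets were recorded too.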

Twitter @user mentions and emojis/emoticons missing

Nice work! But it seems that Twitter handle mentions, e.g. @elon or @jay, are not recorded in the .csv output.

Emojis/emoticons are removed as well.

Could we have them in the "content" column, or in two new columns such as "mentions" and "emojis"?

Getting error Login Failed: 'NoneType' object is not iterable

When I follow the README instructions, I get the error shown below. Steps:
  1. Get the code.
  2. Install requirements.txt.
  3. Update the .env file.
  4. Run the command: python scraper --user=@myusername --password=mypassword

Loading .env file
Loaded .env file


Initializing Twitter Scraper...
Setup WebDriver...
Initializing FirefoxDriver...
WebDriver Setup Complete

Logging in to Twitter...

Login Failed: 'NoneType' object is not iterable

Max tweets

Hi!
Nice app!

Is there a way to get all the historical data?
Something like "-t all" (instead of -t 50).

I can't specify proxy for selenium

As a user, I'd like to be able to specify the proxy used by Selenium.

I want to host this script on a server and route traffic through a proxy I have at home, so that its public IP is the one Twitter sees.

Here is an MR that addresses this need: #13
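For context, one common way to point a Chromium-based browser at a proxy is the `--proxy-server` command-line switch, which Selenium can pass along through `ChromeOptions.add_argument`. The sketch below only builds and validates that argument; `chrome_proxy_argument` is a hypothetical helper, the address is a documentation placeholder, and this is not necessarily how the referenced MR implements it.

```python
from urllib.parse import urlsplit


def chrome_proxy_argument(proxy_url):
    """Turn a proxy URL into a Chromium --proxy-server switch (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname or not parts.port:
        raise ValueError("expected a proxy URL like http://host:port")
    return f"--proxy-server={parts.scheme}://{parts.hostname}:{parts.port}"


# 203.0.113.10 is a reserved documentation address, used here as a placeholder.
argument = chrome_proxy_argument("http://203.0.113.10:8080")
print(argument)

# With Selenium installed, the switch would be applied roughly like this:
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   options.add_argument(argument)
#   driver = webdriver.Chrome(options=options)
```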

Login to Twitter fails: single-use code required

Loading .env file
Loaded .env file


Initializing Twitter Scraper...
Setup WebDriver...
Initializing ChromeDriver...
WebDriver Setup Complete

Logging in to Twitter...

Login Failed: This may be due to the following:

- Internet connection is unstable
- Username is incorrect
- Password is incorrect

I believe the error occurs because Twitter sends a single-use code to my email, like two-factor auth (although I have disabled two-factor auth).

Would it be possible to add one more step to submit the single-use code during login?
