alexzhangji / twitter-insight-llm
Twitter data scraping, embedding based image search and more.
I tried a few times, but only 50 likes were downloaded. The last few lines of the log are below.
`2024-04-19 02:27:04,735 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements {'using': 'css selector', 'value': "div[data-testid='videoPlayer']"}
2024-04-19 02:27:04,738 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements HTTP/1.1" 200 0
2024-04-19 02:27:04,739 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,739 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,739 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements {'using': 'css selector', 'value': "div[data-testid='tweetPhoto']"}
2024-04-19 02:27:04,742 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements HTTP/1.1" 200 0
2024-04-19 02:27:04,742 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,742 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,742 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements {'using': 'css selector', 'value': "div[data-testid='videoPlayer']"}
2024-04-19 02:27:04,745 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements HTTP/1.1" 200 0
2024-04-19 02:27:04,745 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,745 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,745 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements {'using': 'css selector', 'value': "div[data-testid='tweetPhoto']"}
2024-04-19 02:27:04,747 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/elements HTTP/1.1" 200 0
2024-04-19 02:27:04,747 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,748 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,748 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element {'using': 'css selector', 'value': "div[data-testid='reply']"}
2024-04-19 02:27:04,751 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element HTTP/1.1" 200 0
2024-04-19 02:27:04,751 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":{"element-6066-11e4-a52e-4f735466cecf":"f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.584"}} | headers=HTTPHeaderDict({'Content-Length': '127', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,751 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,751 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/execute/sync {'script': '/* getAttribute */return (function(){return (function(){var d=this||self;function f(a,b){function c(...', 'args': [{'element-6066-11e4-a52e-4f735466cecf': 'f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.584'}, 'aria-label']}
2024-04-19 02:27:04,770 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/execute/sync HTTP/1.1" 200 0
2024-04-19 02:27:04,770 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":"19 Replies. Reply"} | headers=HTTPHeaderDict({'Content-Length': '29', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,770 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,770 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element {'using': 'css selector', 'value': "div[data-testid='retweet']"}
2024-04-19 02:27:04,774 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element HTTP/1.1" 200 0
2024-04-19 02:27:04,774 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":{"element-6066-11e4-a52e-4f735466cecf":"f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.585"}} | headers=HTTPHeaderDict({'Content-Length': '127', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,774 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,774 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/execute/sync {'script': '/* getAttribute */return (function(){return (function(){var d=this||self;function f(a,b){function c(...', 'args': [{'element-6066-11e4-a52e-4f735466cecf': 'f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.585'}, 'aria-label']}
2024-04-19 02:27:04,776 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/execute/sync HTTP/1.1" 200 0
2024-04-19 02:27:04,776 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=200 | data={"value":"89 reposts. Repost"} | headers=HTTPHeaderDict({'Content-Length': '30', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,776 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,776 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://localhost:50146/session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element {'using': 'css selector', 'value': "div[data-testid='like']"}
2024-04-19 02:27:04,780 - urllib3.connectionpool - DEBUG - http://localhost:50146 "POST /session/861c99898cb0272e5e265c5f1b0d503b/element/f.E668CD9998144E0499B79B48775B5205.d.F61E3435C244E04795D5639F5028A5C9.e.520/element HTTP/1.1" 404 0
2024-04-19 02:27:04,780 - selenium.webdriver.remote.remote_connection - DEBUG - Remote response: status=404 | data={"value":{"error":"no such element","message":"no such element: Unable to locate element: {"method":"css selector","selector":"div[data-testid='like']"}\n (Session info: chrome=124.0.6367.60)","stacktrace":"0 chromedriver 0x00000001031be934 chromedriver + 4368692\n1 chromedriver 0x00000001031b6dc8 chromedriver + 4337096\n2 chromedriver 0x0000000102ddac04 chromedriver + 289796\n3 chromedriver 0x0000000102e1ce00 chromedriver + 560640\n4 chromedriver 0x0000000102e13368 chromedriver + 521064\n5 chromedriver 0x0000000102e555ec chromedriver + 792044\n6 chromedriver 0x0000000102e11ab4 chromedriver + 514740\n7 chromedriver 0x0000000102e1250c chromedriver + 517388\n8 chromedriver 0x0000000103182e50 chromedriver + 4124240\n9 chromedriver 0x0000000103187c40 chromedriver + 4144192\n10 chromedriver 0x0000000103168818 chromedriver + 4016152\n11 chromedriver 0x0000000103188570 chromedriver + 4146544\n12 chromedriver 0x000000010315a2cc chromedriver + 3957452\n13 chromedriver 0x00000001031a7eb8 chromedriver + 4275896\n14 chromedriver 0x00000001031a8034 chromedriver + 4276276\n15 chromedriver 0x00000001031b6a28 chromedriver + 4336168\n16 libsystem_pthread.dylib 0x000000019c7f3fa8 _pthread_start + 148\n17 libsystem_pthread.dylib 0x000000019c7eeda0 thread_start + 8\n"}} | headers=HTTPHeaderDict({'Content-Length': '1699', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-04-19 02:27:04,780 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2024-04-19 02:27:04,914 - main - INFO -
Done saving to data/tweets_2024-04-19_02.xlsx. Total of 50 unique tweets.`
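The last request in the log above asks for `div[data-testid='like']` and gets a 404 "no such element" response. One possible way to make that lookup more tolerant is to try several selectors and accept the first that matches. The following is only a sketch: the helper names `find_first` and `like_label` and the `LIKE_SELECTORS` list are hypothetical, and the `unlike` test id is an assumption about how the page marks already-liked tweets, not something the log confirms.

```python
# Hypothetical fallback list; only the first entry appears in the log above.
LIKE_SELECTORS = [
    "div[data-testid='like']",
    "[data-testid='like']",
    "[data-testid='unlike']",  # assumption: used for tweets already liked
]


def find_first(parent, selectors):
    """Return the first element matched by any selector, or None."""
    for selector in selectors:
        # find_elements returns an empty list instead of raising
        # NoSuchElementException; "css selector" is the value of
        # selenium's By.CSS_SELECTOR.
        matches = parent.find_elements("css selector", selector)
        if matches:
            return matches[0]
    return None


def like_label(tweet):
    """Return the like button's aria-label, or None if no button is found."""
    button = find_first(tweet, LIKE_SELECTORS)
    if button is None:
        return None
    return button.get_attribute("aria-label")
```

Here `tweet` would be the Selenium `WebElement` for one tweet. Returning `None` lets the caller record a missing count instead of stopping the run.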
2024-03-08 10:42:16,951 - main - INFO - Tweet: <selenium.webdriver.remote.webelement.WebElement (session="6d68c615a31a66014517db5a18a144ed", element="f.4558563DBA9B5CDD7FCE8259F38DCFD1.d.A0B7EFFDA9F1D15135C1B32CAF8A42EB.e.12597")>
Traceback (most recent call last):
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\tenacity\__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "E:\Twitter-Insight-LLM-main\twitter_data_ingestion.py", line 152, in _process_tweet
    author_name, author_handle = self._extract_author_details(tweet)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Twitter-Insight-LLM-main\twitter_data_ingestion.py", line 240, in _extract_author_details
    author_details = self._get_element_text(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Twitter-Insight-LLM-main\twitter_data_ingestion.py", line 196, in _get_element_text
    return parent.find_element(By.XPATH, selector).text
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\selenium\webdriver\remote\webelement.py", line 417, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\selenium\webdriver\remote\webelement.py", line 395, in _execute
    return self._parent.execute(command, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 347, in execute
    self.error_handler.check_response(response)
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found
  (Session info: chrome=122.0.6261.112); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#stale-element-reference-exception
Stacktrace:
GetHandleVerifier [0x00007FF656AFAD32+56930]
(No symbol) [0x00007FF656A6F632]
(No symbol) [0x00007FF6569242E5]
(No symbol) [0x00007FF656929261]
(No symbol) [0x00007FF65692B6EB]
(No symbol) [0x00007FF65692B7B0]
(No symbol) [0x00007FF65696955C]
(No symbol) [0x00007FF656969A2C]
(No symbol) [0x00007FF65695F13C]
(No symbol) [0x00007FF65698BCDF]
(No symbol) [0x00007FF65695F09A]
(No symbol) [0x00007FF65698BEB0]
(No symbol) [0x00007FF6569A81E2]
(No symbol) [0x00007FF65698BA43]
(No symbol) [0x00007FF65695D438]
(No symbol) [0x00007FF65695E4D1]
GetHandleVerifier [0x00007FF656E76ABD+3709933]
GetHandleVerifier [0x00007FF656ECFFFD+4075821]
GetHandleVerifier [0x00007FF656EC818F+4043455]
GetHandleVerifier [0x00007FF656B99766+706710]
(No symbol) [0x00007FF656A7B90F]
(No symbol) [0x00007FF656A76AF4]
(No symbol) [0x00007FF656A76C4C]
(No symbol) [0x00007FF656A66904]
BaseThreadInitThunk [0x00007FF8AB1B7344+20]
RtlUserThreadStart [0x00007FF8ABE026B1+33]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Twitter-Insight-LLM-main\twitter_data_ingestion.py", line 311, in <module>
    scraper.fetch_tweets(
  File "E:\Twitter-Insight-LLM-main\twitter_data_ingestion.py", line 55, in fetch_tweets
    row = self._process_tweet(tweet)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\tenacity\__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\tenacity\__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\abc\PycharmProjects\pythonProject.venv\Lib\site-packages\tenacity\__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x173c9675e80 state=finished raised StaleElementReferenceException>]
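In the traceback above, a single tweet that keeps raising `StaleElementReferenceException` exhausts the retries, and the resulting `tenacity.RetryError` propagates out of `fetch_tweets` and ends the whole run. One possible mitigation is to log and skip the failing tweet. The sketch below is plain Python and not the project's code: `process_all`, `process_one`, and `skip_on` are hypothetical names, and in the real scraper `skip_on` would presumably be `(tenacity.RetryError,)`.

```python
import logging

logger = logging.getLogger(__name__)


def process_all(items, process_one, skip_on=(Exception,)):
    """Process each item; log and skip failures instead of aborting.

    Returns the successfully processed rows and the number skipped.
    """
    rows = []
    skipped = 0
    for item in items:
        try:
            rows.append(process_one(item))
        except skip_on as exc:
            # One bad item (for example a tweet whose DOM node went stale)
            # should not discard everything collected so far.
            skipped += 1
            logger.warning("Skipping item after failure: %r", exc)
    return rows, skipped
```

Keeping `skip_on` narrow is deliberate: unrelated errors still surface instead of being silently swallowed.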
Traceback (most recent call last):
  File "d:\codes.vscode\twitter_ingection.py", line 311, in <module>
    scraper.fetch_tweets(
  File "d:\codes.vscode\twitter_ingection.py", line 74, in fetch_tweets
    self._save_to_json(row, filename=f"{cur_filename}.json")
  File "d:\codes.vscode\twitter_ingection.py", line 292, in _save_to_json
    with open(filename, "a", encoding="utf-8") as file:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'data/tweets_2024-04-12_17-53-05.json'
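Opening a file in append mode creates the file but not its parent directory, so the `FileNotFoundError` above is consistent with the `data/` folder not existing in the current working directory. A minimal sketch of a save helper that creates the directory first follows. The function name `append_json_line` and the one-object-per-line format are assumptions, not the project's actual `_save_to_json`.

```python
import json
from pathlib import Path


def append_json_line(row, filename):
    """Append one JSON object per line, creating parent folders if needed."""
    path = Path(filename)
    # Create e.g. "data/" when it is missing; a no-op if it already exists.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as file:
        json.dump(row, file, ensure_ascii=False)
        file.write("\n")
```

Creating a `data` folder by hand next to the script, and running the script from that directory, should have the same effect.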