
google-scraper's Issues

[Question] Why batching keywords instead of processing them individually?

Issue

The current implementation processes keywords in batches of 20:

batch = Sidekiq::Batch.new
batch.on(:success, CrawlJobCallbackService, batch_id: batch.bid, user_id: user_id)
batch.jobs do
  keywords.each_slice(BATCH_SIZE) do |batch_keywords|
    CrawlGoogleSearchWorker.perform_async(batch_keywords, user_id, batch.bid)
  end
end

Each worker then has to loop through its batch of keywords:

keywords.each.with_index do |keyword, index|
  browser = Watir::Browser.new(:chrome, headless: true, proxy: proxy(index % total_proxies))
  browser.goto("#{GOOGLE_SEARCH_URL}#{keyword}")
  unless browser.element(id: 'result-stats').present? # rubocop:disable Rails/Blank
    failed_keywords << keyword
    next
  end
  crawl_results << crawl_data(keyword, browser.html)
  browser.quit
rescue Net::ReadTimeout
  failed_keywords << keyword
end

The downside of this loop is that any unhandled error can stop the processing of all the remaining keywords in the batch 💥

Expected

  • Keywords should first be stored in the database with a flag to track the scraping status.
  • A distinct background job to scrape the Google page is then scheduled per keyword (a sketch follows the list of benefits below).

The benefits are:

  • Fast CSV upload request without losing any data.
  • Observability via the Sidekiq dashboard is possible.
  • Retries can be managed at the keyword level.
  • The scraping of each keyword is isolated, i.e., one could error while others could be successful.
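For illustration, a minimal sketch of this flow could look like the following. The Keyword model, its status values, CrawlKeywordWorker, and fetch_google_result are hypothetical names for this sketch, not part of the submission:

require 'csv'

class KeywordsController < ApplicationController
  # Upload stays fast: it only persists keywords and enqueues jobs.
  def create
    CSV.read(params[:file].path).flatten.compact.each do |content|
      keyword = current_user.keywords.create!(content: content, status: :pending)
      CrawlKeywordWorker.perform_async(keyword.id)
    end
    redirect_to keywords_path, notice: 'CSV uploaded'
  end
end

class CrawlKeywordWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3 # retries are managed per keyword

  # One job per keyword: an error here fails only this job, its status
  # stays visible in the database, and the job shows up individually in
  # the Sidekiq dashboard.
  def perform(keyword_id)
    keyword = Keyword.find(keyword_id)
    html = fetch_google_result(keyword.content) # scraping logic as in the submission
    keyword.update!(status: :success, source: html)
  rescue Net::ReadTimeout
    keyword.update!(status: :failed)
  end
end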

[Chore] Add database and Redis setup instructions to the README

Issue

While the README has detailed information on how you approached the technical challenge of scraping, following the instructions was not sufficient to run the application. I needed both Postgres and Redis running, and I also had to run the Rails commands to create the Postgres database and run the migrations.

Expected

The README file must contain all the information developers need to set up the application in their local environments, at a minimum: the required Ruby / Node versions and the database(s) setup steps.

Note
Using Docker Compose to set up the database services would make the project setup easier.
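A minimal sketch of such a setup; the image versions, credentials, and service names below are illustrative, not taken from the submission:

# docker-compose.yml
version: '3'
services:
  db:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: password
    ports:
      - '5432:5432'
  redis:
    image: redis:6
    ports:
      - '6379:6379'

Developers could then start both services with `docker-compose up -d` before running the Rails setup commands.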

Start of the code review process 👋

Hello Tuan 👋, thank you for your effort on the code submission. I am Olivier at Nimble, and I am happy to be the reviewer for our code review session.

During the review process, I would like to learn more about your decisions, so I will create issues in the areas where your submission could be improved. Since solving every possible problem would take too long, I will prioritize the most important ones.

At the same time, please keep in mind that this is a bi-directional process, and I would love to hear back from you as well. Therefore, do not hesitate to ask questions or share any opinions about the implementation during the process.

We expect the code review process to be completed within 2-3 days. As a result, please make sure you are responsive during this process. If you need more time, please let us know as soon as possible so we can plan accordingly.

If we are aligned on any issue and you would like to correct it, please address it using a proper git flow (create a new branch, open a Pull Request (PR) per issue, and merge the code when you are ready), and I will follow up on those fixes. Just so you know, you don't have to close any of the issues I created after merging your PRs; I will verify and close them for you once they pass. 😇

In the end, I do hope that you find the process enjoyable. Good luck and happy coding. 🤘

[Bug] CSV upload fails

Issue

Upon uploading a CSV file, the UI shows a success message:

[Screenshot: the UI success message shown after uploading a CSV file]

However, the web server log shows an authorization issue:

[Screenshot: web server log showing the authorization error]

As a result, I have not been able to verify the scraping results yet 😅

[Chore] Increase test coverage

Issue

Automated tests do not cover the core business logic (CSV upload, scraping).

Expected

While 100% test coverage is not required for this code challenge, all critical paths of the application should be unit-tested and, ideally, UI-tested.

Note
The main challenge is to test the scraper reliably, i.e., the tests should not make real network requests (see the sketch below).
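One possible approach is to exercise the parsing logic against a saved HTML fixture instead of a live Google page. This is only a sketch, assuming RSpec; the fixture path and the expected result shape are illustrative, while crawl_data(keyword, html) mirrors the call visible in the submission:

RSpec.describe CrawlGoogleSearchWorker do
  describe 'parsing a stored results page' do
    it 'extracts the crawl data for a keyword without hitting the network' do
      # A Google results page saved once by hand into spec/fixtures
      html = File.read(Rails.root.join('spec/fixtures/google_search_result.html'))

      # `send` is used here assuming crawl_data is a private method
      result = described_class.new.send(:crawl_data, 'hello world', html)

      expect(result[:keyword]).to eq('hello world')
    end
  end
end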

[Feature] Render the stored HTML in the UI

Issue

While the Google search result content is stored in the database, users can only copy the HTML as source code; the rendered page itself is not viewable:

.d-flex.justify-content-between.align-items-center.mb-4[data-controller='clipboard']
  h3 = @crawl_result.keyword
  = hidden_field_tag :source, @crawl_result.source, data: { 'clipboard-target': 'source' }, readonly: true
  button.btn.btn-outline-success.my-2[data-action='clipboard#copy' data-clipboard-target='button']
    | Copy Source HTML
    i.fa-solid.fa-clipboard.ms-2

As a result, a user cannot verify if the scraping results are correct.

Expected

The HTML content should be viewable on the application, e.g., there could be a view of the content on a new page or in a modal.

Note
Rendering HTML content that originates from another page is a bit of a challenge in itself, which is exactly why it is an expected feature of the application (see the sketch below).
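For instance, one option (a sketch, not the only approach) is to render the stored source inside a sandboxed iframe on the detail page, reusing the @crawl_result.source attribute from the template above, so that the scraped page's scripts and styles cannot affect the application:

/ Renders the stored Google page in isolation; the empty sandbox
/ attribute disables scripts and same-origin access inside the frame.
iframe[srcdoc=@crawl_result.source sandbox='' width='100%' height='600']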
