
google-scraper's Issues

[Question] Why batching keywords instead of processing them individually?

Issue

The current implementation processes keywords in batches of 20:

batch = Sidekiq::Batch.new
batch.on(:success, CrawlJobCallbackService, batch_id: batch.bid, user_id: user_id)
batch.jobs do
  keywords.each_slice(BATCH_SIZE) do |batch_keywords|
    CrawlGoogleSearchWorker.perform_async(batch_keywords, user_id, batch.bid)
  end
end

Each worker then has to loop through its batch of keywords:

keywords.each.with_index do |keyword, index|
  browser = Watir::Browser.new(:chrome, headless: true, proxy: proxy(index % total_proxies))
  browser.goto("#{GOOGLE_SEARCH_URL}#{keyword}")
  unless browser.element(id: 'result-stats').present? # rubocop:disable Rails/Blank
    failed_keywords << keyword
    next
  end
  crawl_results << crawl_data(keyword, browser.html)
  browser.quit
rescue Net::ReadTimeout
  failed_keywords << keyword
end

The downside of this loop is that any unhandled error can stop the processing of all the remaining keywords in the batch 💥

Expected

  • Keywords should first be stored in the database with a flag to track the scraping status.
  • A distinct background job to scrape the Google page is then scheduled per keyword (a sketch follows the list of benefits below).

The benefits are:

  • Fast CSV upload request without losing any data.
  • Observability via the Sidekiq dashboard is possible.
  • Retries can be managed at the keyword level.
  • The scraping of each keyword is isolated, i.e., one could error while others could be successful.
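For illustration, a minimal sketch of this flow could look like the following. The Keyword model, its status values, CrawlKeywordWorker, and fetch_google_result are hypothetical names for this sketch, not part of the submission:

require 'csv'

class KeywordsController < ApplicationController
  # Upload stays fast: it only persists keywords and enqueues jobs.
  def create
    CSV.read(params[:file].path).flatten.compact.each do |content|
      keyword = current_user.keywords.create!(content: content, status: :pending)
      CrawlKeywordWorker.perform_async(keyword.id)
    end
    redirect_to keywords_path, notice: 'CSV uploaded'
  end
end

class CrawlKeywordWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3 # retries are managed per keyword

  # One job per keyword: an error here fails only this job, its status
  # stays visible in the database, and the job shows up individually in
  # the Sidekiq dashboard.
  def perform(keyword_id)
    keyword = Keyword.find(keyword_id)
    html = fetch_google_result(keyword.content) # scraping logic as in the submission
    keyword.update!(status: :success, source: html)
  rescue Net::ReadTimeout
    keyword.update!(status: :failed)
  end
end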

[Chore] Add database and Redis setup instructions to the README

Issue

While the README has detailed information on how you approached the technical challenge of scraping, following the instructions was not sufficient to run the application. I needed both Postgres and Redis running, and I also had to run the Rails commands to create the Postgres database and run the migrations.

Expected

The README file must contain all the information developers need to set up the application in their local environments, at a minimum: the required Ruby / Node versions and the database(s) setup steps.

Note
Using Docker Compose to set up the database services would make the project setup easier.
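A minimal sketch of such a setup; the image versions, credentials, and service names below are illustrative, not taken from the submission:

# docker-compose.yml
version: '3'
services:
  db:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: password
    ports:
      - '5432:5432'
  redis:
    image: redis:6
    ports:
      - '6379:6379'

Developers could then start both services with `docker-compose up -d` before running the Rails setup commands.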

Start of the code review process 👋

Hello Tuan 👋, thank you for your effort on the code submission. I am Olivier at Nimble, and I am happy to be the reviewer for our code review session.

During the review process, I would like to learn more about your decisions, so I will create issues in the areas where your submission could be improved. Since solving every possible problem would take too long, I will prioritize the most important ones.

At the same time, please keep in mind that this is a bi-directional process, and I would love to hear back from you as well. Therefore, do not hesitate to ask questions or share any opinions about the implementation during the process.

We expect the code review process to be completed within 2-3 days. As a result, please make sure you are responsive during this process. If you need more time, please let us know as soon as possible so we can plan accordingly.

If we are aligned on any issue and you would like to correct it, please address it using a proper git flow (create a new branch, open a Pull Request (PR) per issue, and merge the code when you are ready), and I will follow up on those fixes. Just so you know, you don't have to close any of the issues I created after merging your PRs; I will verify and close them for you once they pass. 😇

In the end, I do hope that you find the process enjoyable. Good luck and happy coding. 🤘

[Bug] CSV upload fails

Issue

Upon uploading a CSV file, the UI shows a success message:

[Screenshot: the UI success message shown after uploading a CSV file]

However, the web server log shows an authorization issue:

[Screenshot: web server log showing the authorization error]

As a result, I have not been able to verify the scraping results yet 😅

[Chore] Increase test coverage

Issue

Automated tests do not cover the core business logic (CSV upload, scraping).

Expected

While 100% test coverage is not required for this code challenge, all critical paths of the application should be unit-tested and, ideally, UI-tested.

Note
The main challenge is to test the scraper reliably, i.e., the tests should not make real network requests (see the sketch below).
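One possible approach is to exercise the parsing logic against a saved HTML fixture instead of a live Google page. This is only a sketch, assuming RSpec; the fixture path and the expected result shape are illustrative, while crawl_data(keyword, html) mirrors the call visible in the submission:

RSpec.describe CrawlGoogleSearchWorker do
  describe 'parsing a stored results page' do
    it 'extracts the crawl data for a keyword without hitting the network' do
      # A Google results page saved once by hand into spec/fixtures
      html = File.read(Rails.root.join('spec/fixtures/google_search_result.html'))

      # `send` is used here assuming crawl_data is a private method
      result = described_class.new.send(:crawl_data, 'hello world', html)

      expect(result[:keyword]).to eq('hello world')
    end
  end
end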

[Feature] Render the stored HTML in the UI

Issue

While the Google search result content is stored in the database, users can only copy the HTML as source code; the rendered page itself is not viewable:

.d-flex.justify-content-between.align-items-center.mb-4[data-controller='clipboard']
  h3 = @crawl_result.keyword
  = hidden_field_tag :source, @crawl_result.source, data: { 'clipboard-target': 'source' }, readonly: true
  button.btn.btn-outline-success.my-2[data-action='clipboard#copy' data-clipboard-target='button']
    | Copy Source HTML
    i.fa-solid.fa-clipboard.ms-2

As a result, a user cannot verify if the scraping results are correct.

Expected

The HTML content should be viewable on the application, e.g., there could be a view of the content on a new page or in a modal.

Note
Rendering HTML content that originates from another page is a bit of a challenge in itself, which is exactly why it is an expected feature of the application (see the sketch below).
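For instance, one option (a sketch, not the only approach) is to render the stored source inside a sandboxed iframe on the detail page, reusing the @crawl_result.source attribute from the template above, so that the scraped page's scripts and styles cannot affect the application:

/ Renders the stored Google page in isolation; the empty sandbox
/ attribute disables scripts and same-origin access inside the frame.
iframe[srcdoc=@crawl_result.source sandbox='' width='100%' height='600']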
