
screen-scout's Issues

Controlling concurrency

Limiting how many pages are processed at the same time could improve performance, especially when working with a large number of pages.
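One possible way to cap the number of pages in flight is a small worker pool. The sketch below is not taken from the script; `mapWithConcurrency`, `limit`, and `worker` are hypothetical names, and `worker` stands in for whatever function processes a single page.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` tasks in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims items one at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results; // same order as `items`
}
```

Results keep the input order, and a rejected `worker` call rejects the whole batch; collecting per-page errors instead would be a reasonable variation.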

Optimizing performance for large-scale crawling

If you intend to scale the script, consider adding concurrent page processing, rate limiting, or a delay between requests to control load and respect server limits.
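A delay between requests could be sketched as a pacer that hands out start times a fixed interval apart. `RequestPacer`, `paced`, and `minIntervalMs` are illustrative names rather than part of the script, and the interval would need tuning to the target server.

```typescript
// Hypothetical pacer: spaces request start times at least `minIntervalMs` apart.
class RequestPacer {
  private readonly minIntervalMs: number;
  private nextSlot = 0; // earliest timestamp (ms) at which the next request may start

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next slot and return how long the caller should wait before sending.
  reserve(now: number = Date.now()): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.minIntervalMs;
    return start - now;
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Usage sketch: wait for the reserved slot, then run the request.
async function paced<T>(pacer: RequestPacer, request: () => Promise<T>): Promise<T> {
  await sleep(pacer.reserve());
  return request();
}
```

Because slots are reserved synchronously, several concurrent callers sharing one pacer still start their requests at least `minIntervalMs` apart.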

relates to #2

Bug: pages not fully loaded

The crawler does not wait for all responses and sometimes takes screenshots of pages that are still loading.

For example:
sysndd dbmr unibe ch-API_2023-12-10T12-28-44-521Z

Solution:

  • set a parameter to handle max wait time
  • check if Puppeteer can wait for the whole page to load
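Both points could be covered by the options Puppeteer's `page.goto` accepts: `waitUntil: "networkidle0"` waits until there have been no network connections for a short period, and `timeout` caps the wait in milliseconds. The sketch below is one way to wire that up; `loadFully` and `maxWaitMs` are hypothetical names, and `NavigablePage` is a minimal stand-in type that the real Puppeteer page object is assumed to satisfy.

```typescript
// Minimal structural type for the part of a page used here
// (assumption: Puppeteer's Page object fits this shape).
interface NavigablePage {
  goto(
    url: string,
    options: { waitUntil: "networkidle0"; timeout: number },
  ): Promise<unknown>;
}

// Hypothetical helper: navigate and report whether the page settled in time.
async function loadFully(
  page: NavigablePage,
  url: string,
  maxWaitMs: number = 30000,
): Promise<boolean> {
  try {
    // Resolves once the network has gone idle; rejects if `maxWaitMs` elapses first.
    await page.goto(url, { waitUntil: "networkidle0", timeout: maxWaitMs });
    return true;
  } catch {
    // Navigation failed or did not settle in time; the caller decides
    // whether to skip the screenshot or take it anyway.
    return false;
  }
}
```

Pages that keep a connection open indefinitely (polling, websockets) may never reach network idle, so the max wait time matters in practice.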

Feature: Logging, progress monitoring, and command line feedback

More detailed logging, such as the current depth level and number of pages processed, can aid in progress tracking and debugging.

Making feedback available in the command line for each step (e.g., "Launching browser", "Processing URL:...", "Closing browser") would improve the script's usability.
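A progress line of this kind could be built by a small formatter. The field names in `CrawlProgress` and the output layout are assumptions made for illustration, not the script's existing log format.

```typescript
// Hypothetical snapshot of crawl state at the time of logging.
interface CrawlProgress {
  depth: number;     // current recursion depth
  maxDepth: number;  // configured depth limit
  processed: number; // pages finished so far
  queued: number;    // pages still waiting
}

// Build one command-line feedback line for a step such as "Launching browser".
function formatProgress(step: string, progress: CrawlProgress): string {
  const { depth, maxDepth, processed, queued } = progress;
  return `[depth ${depth}/${maxDepth}] [${processed} processed, ${queued} queued] ${step}`;
}
```

For example, `formatProgress("Processing URL: https://example.org", { depth: 1, maxDepth: 3, processed: 4, queued: 7 })` yields `[depth 1/3] [4 processed, 7 queued] Processing URL: https://example.org`.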

Feature: implement ChatGPT description of screenshots

The screenshots could be processed by ChatGPT to generate descriptions of the website.
This implementation will require:

  • a config file
  • likely a second, independent script
  • tailored API requests to produce, for example, Markdown-formatted files

Browser instance reuse and resource management

Each URL currently launches a new browser instance. This can be time-consuming, especially with deep recursion or a large number of pages. Use the same browser instance for all pages if possible. This may necessitate reorganizing how browser and page instances are managed.
Ensure that resources are released properly. Close the page after taking a screenshot, but keep the browser open if you are going to reuse it.
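The reuse pattern described here might look roughly like the following. `captureAll` and `capture` are hypothetical names, and the two interfaces are minimal stand-ins that Puppeteer's browser and page objects are assumed to satisfy (`newPage()` and `close()` exist on the real objects).

```typescript
// Minimal stand-in types (assumption: Puppeteer's Browser and Page fit them).
interface ClosablePage {
  close(): Promise<void>;
}
interface ReusableBrowser<P extends ClosablePage> {
  newPage(): Promise<P>;
  close(): Promise<void>;
}

// Hypothetical driver: one browser for the whole run, one short-lived page per URL.
async function captureAll<P extends ClosablePage>(
  browser: ReusableBrowser<P>,
  urls: readonly string[],
  capture: (page: P, url: string) => Promise<void>,
): Promise<void> {
  try {
    for (const url of urls) {
      const page = await browser.newPage();
      try {
        await capture(page, url); // e.g. navigate and take the screenshot
      } finally {
        await page.close(); // release the tab even if capture failed
      }
    }
  } finally {
    await browser.close(); // close the browser once, after all pages
  }
}
```

The nested `finally` blocks are the design choice worth noting: a failing page still gets closed, and the browser is closed exactly once regardless of errors.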

Customizable timeout

It might be useful to add a setting to configure the timeout for page loading, as some pages may take longer to load than others.
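Such a setting could arrive as a string from a command-line flag or config file. A minimal sketch of validating it follows; `resolveTimeoutMs` and the 30000 ms default are assumptions for illustration, not values taken from the script.

```typescript
// Assumed default, in milliseconds.
const DEFAULT_TIMEOUT_MS = 30000;

// Hypothetical parser: turn a raw setting into a page-load timeout in milliseconds.
function resolveTimeoutMs(
  raw: string | undefined,
  fallback: number = DEFAULT_TIMEOUT_MS,
): number {
  // Missing or blank setting: use the default.
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) {
    throw new RangeError(
      `Invalid timeout "${raw}": expected a non-negative number of milliseconds`,
    );
  }
  return Math.round(value);
}
```

Rejecting bad input early keeps a typo such as `30s` from silently turning into an unexpected timeout later in the run.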

Bug: Some pages are not saved as images

For example, the output "sysndd.dbmr.unibe.ch-Genes-HGNC".
This seems to be caused by certain characters in the URL, such as ":", when the URL is parsed with a regex to build the file name.
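One way to avoid the problem is to replace every character outside a conservative safe set rather than listing the problematic ones. The function name `urlToFileStem` and the URL in the usage note are hypothetical; this is a sketch, not the script's current naming logic.

```typescript
// Hypothetical helper: derive a file-name stem from a URL using only
// letters, digits, ".", "_" and "-".
function urlToFileStem(rawUrl: string): string {
  return rawUrl
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, "") // drop the scheme, e.g. "https://"
    .replace(/[^A-Za-z0-9._-]+/g, "-")       // collapse runs of unsafe characters (":", "/", "?", "#", ...)
    .replace(/^-+|-+$/g, "");                // trim leading and trailing separators
}
```

With an illustrative URL such as `https://sysndd.dbmr.unibe.ch/Genes/HGNC:1234`, this gives `sysndd.dbmr.unibe.ch-Genes-HGNC-1234`, so the part after the colon is no longer lost.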

Make file naming robust with long URLs

The screenshot file naming should be robust with long URLs, including those with query parameters and hashes.
A possible solution is to create a metadata file that maps each full URL to a shortened, unique file name.
Implement this carefully so that the names follow the filename conventions of different operating systems.
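A possible shape for this is a truncated, sanitized stem plus a short hash of the full URL, with a manifest object that could be written out as the metadata file. Everything named below (`hashUrl`, `shortFileName`, `buildManifest`, the 80-character cap, the `.png` extension) is an assumption for illustration; the hash is an FNV-1a-style function over UTF-16 code units, chosen only because it needs no imports.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units, as 8 hex digits.
function hashUrl(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Assumed cap, chosen to stay well under the common 255-byte file name limit.
const MAX_STEM_LENGTH = 80;

// Short, portable file name: sanitized stem, truncated, plus a hash of the full URL.
function shortFileName(url: string): string {
  const stem = url
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, "")
    .replace(/[^A-Za-z0-9._-]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, MAX_STEM_LENGTH);
  return `${stem}-${hashUrl(url)}.png`;
}

// Map each file name back to its full URL; refuse to continue on a name clash.
function buildManifest(urls: readonly string[]): Record<string, string> {
  const manifest: Record<string, string> = {};
  for (const url of urls) {
    const name = shortFileName(url);
    if (name in manifest && manifest[name] !== url) {
      throw new Error(`File name collision for ${url} and ${manifest[name]}`);
    }
    manifest[name] = url;
  }
  return manifest;
}
```

A 32-bit hash can collide on large crawls, which is why the manifest builder checks for clashes; a longer digest would be a reasonable swap if that becomes an issue.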

Closing resources correctly

After all browser tabs/pages have been processed, ensure that they are properly closed. This aids in the efficient management of resources.
