apify / actor-scraper Goto Github PK

House of Apify Scrapers. Generic scraping actors with a simple UI to handle complex web crawling and scraping use cases.

Home Page: https://docs.apify.com/scraping

JavaScript 98.67% Dockerfile 1.33%

web-scraping apify

actor-scraper's Introduction

Apify Scrapers

This repository houses all of Apify generic actors that are used for simplified scraping using a pre-defined, schema validated UI input instead of the typical JSON input used in other actors.

Web Scraper

Web Scraper (apify/web-scraper) is a ready-made solution for scraping the web using the Chrome browser. It takes away all the work necessary to set up a browser for crawling, controls the browser automatically and produces machine readable results in several common formats.

Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a simple web UI and a little of basic JavaScript, you can tweak it to serve almost any scraping need.

Puppeteer Scraper

Puppeteer Scraper (apify/puppeteer-scraper) is the most powerful scraper tool in our arsenal (aside from developing your own actors). It uses the Puppeteer library to programmatically control a headless Chrome browser and it can make it do almost anything. If using the Web Scraper does not cut it, Puppeteer Scraper is what you need.

Puppeteer is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with the Puppeteer Scraper.

If you need either a faster, or a simpler tool, see the Cheerio Scraper for speed, or Web Scraper for simplicity.

Cheerio Scraper

Cheerio Scraper (apify/cheerio-scraper) is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages and then parsing and inspecting the HTML using the Cheerio library. It's blazing fast.

Cheerio is a server-side version of the popular jQuery library, that does not run in the browser, but instead constructs a DOM out of a HTML string and then provides the user with API to work with that DOM.

Cheerio Scraper is ideal for scraping websites that do not rely on client-side JavaScript to serve their content. It can be as much as 20 times faster than using a full browser solution such as Puppeteer.

Scraper Tools

A library that houses logic common to all the scrapers.

actor-scraper's People

Contributors

Stargazers

Watchers

actor-scraper's Issues

Tutorial improvements

The getting started guide doesn't mention how to enqueue requests manually, I found it somewhere in the SDK docs. Maybe giving an example of enqueueing requests manually in the getting started guide would be helpful for people starting out with writing a web-scraper.
Again about the context.enqueueRequest() function I was not sure if I can use it asynchronously or not? for example, is it ok to pass an async function to each():

subCategoriesSelector.each(async function() {
            // Suffix sub category link with a limit of 96
            var scLink = $(this).attr('href').replace('#', '?limit=96');
            //log.info(`SubCategory Link: ${scLink}`);
            await context.enqueueRequest({
                url:`${baseURL}/${scLink}`,
                userData: {
                    "label": "LIST"
                    }
                }
            );

Pass Puppeteer events to pageFunction

For example Network iddle is useful for scraping of AJAX websites.

Fix broken links

https://apify.com/apify/cheerio-scraper - link to https://sdk.apify.com/docs/api/autoscaledpool
https://apify.com/apify/puppeteer-scraper - links to https://sdk.apify.com/docs/api/autoscaledpool and https://sdk.apify.com/docs/api/puppeteerpool

Consider using wait for network2 in puppeteer/web-scrapers default one

Many pages are not fully loaded with current configuration and network0 works for them. We should test this on some use cases to see if the websites won't timeout.

Ideas to improve input schema sections

Added sections to web-scraper input schema ->

Link for example proxy and browser masking to https://help.apify.com/en/articles/1961361-several-tips-how-to-bypass-website-anti-scraping-protections

Before releasing this make sure that markdown in section description is supported at Apify app.

Rename this repo to "actor-web-scraper"

For consistency with other actor projects...

Actor cannot be started with default input

If I make a task for this actor and hit the run button, I got an error message that Start URLs is not defined even there are two. Then I also have to set Max pages per crawl.

Cheerio scraper - pass `proxyInfo` to `evaledPageFunction`

Make it consistent with CheerioCrawler documentation

Add a commented-out "debugger;" statement to pageFunction prefill to Web Scraper

With a comment describing what is it good for. This will help users to quickly learn about this cool feature.

Evaluation failed when running: `npm run local`

Following the instructions to run locally returns the following error:

Error: Evaluation failed: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/".


    at <anonymous>:6:43
    at ExecutionContext.evaluateHandle (act-crawler/node_modules/puppeteer/lib/ExecutionContext.js:66:13)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:188:7)

Here's the RESULTS-1.json:

[
  {
    "id": 1,
    "label": "",
    "url": "https://news.ycombinator.com",
    "loadedUrl": "https://news.ycombinator.com/",
    "uniqueKey": "https://news.ycombinator.com",
    "referrerId": null,
    "requestedAt": "2017-12-28T09:34:17.387Z",
    "loadingStartedAt": "2017-12-28T09:34:18.609Z",
    "loadingFinishedAt": "2017-12-28T09:34:18.627Z",
    "responseStatus": 200,
    "responseHeaders": {
      "date": "Thu, 28 Dec 2017 09:34:18 GMT",
      "content-encoding": "gzip",
      "referrer-policy": "origin",
      "server": "cloudflare-nginx",
      "x-frame-options": "DENY",
      "content-type": "text/html; charset=utf-8",
      "status": "200",
      "x-xss-protection": "1; mode=block",
      "cache-control": "private; max-age=0",
      "content-security-policy": "default-src 'self'; script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/; frame-src 'self' https://www.google.com/recaptcha/; style-src 'self' 'unsafe-inline'; img-src 'self' 'unsafe-inline'",
      "strict-transport-security": "max-age=31556900",
      "cf-ray": "3d4385df4ee52fcf-MAA",
      "vary": "Accept-Encoding",
      "x-content-type-options": "nosniff"
    },
    "pageFunctionStartedAt": null,
    "pageFunctionFinishedAt": null,
    "type": "START_URL",
    "depth": 0,
    "pageFunctionResult": null,
    "interceptRequestData": null,
    "downloadedBytes": 196571,
    "willLoad": true,
    "queuePosition": "LAST",
    "errorInfo": [
      "Error: Evaluation failed: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/\".\n\n\n    at <anonymous>:6:43\n    at ExecutionContext.evaluateHandle (/private/tmp/act-crawler/node_modules/puppeteer/lib/ExecutionContext.js:66:13)\n    at <anonymous>\n    at process._tickCallback (internal/process/next_tick.js:188:7)",
      "Error: Evaluation failed: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/\".\n\n\n    at <anonymous>:6:43\n    at ExecutionContext.evaluateHandle (/private/tmp/act-crawler/node_modules/puppeteer/lib/ExecutionContext.js:66:13)\n    at <anonymous>\n    at process._tickCallback (internal/process/next_tick.js:188:7)",
      "Error: Evaluation failed: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/\".\n\n\n    at <anonymous>:6:43\n    at ExecutionContext.evaluateHandle (/private/tmp/act-crawler/node_modules/puppeteer/lib/ExecutionContext.js:66:13)\n    at <anonymous>\n    at process._tickCallback (internal/process/next_tick.js:188:7)",
      "Error: Evaluation failed: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: \"script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/\".\n\n\n    at <anonymous>:6:43\n    at ExecutionContext.evaluateHandle (/private/tmp/act-crawler/node_modules/puppeteer/lib/ExecutionContext.js:66:13)\n    at <anonymous>\n    at process._tickCallback (internal/process/next_tick.js:188:7)"
    ],
    "retryCount": 3,
    "skipOutput": false,
    "outputSeqNum": 1
  }
]

Finish webhook URL

Hi,

Are there any chances we could get the crawler's webhook feature finishWebhookUrl implemented in this act?

Thanks

POST requests are treated as GET

Facing issues with the HTTP_REFERER header for an apify/web-scraper Task, I tried to dump $_SERVER on a test page after a POST request enqueued as

context.enqueueRequest(
    {
        url: "https://example.org/print_post_request.php",
        payload: "fname=lol&lname=wut",
        headers: {
            "Referer": "http://example.org"
        },
           method: "POST"
    }
);

but the output stated REQUEST_METHOD => GET. Then I checked setting the StartURL method to POST and the output was the same.

I conclude that POST settings are completely ignored by apify/web-scraper.

Add method to retire browser

If proxy IP doesn't work a user needs a way to retire the browser to obtain a new IP. This should wait for SessionPool to be added and this will simply invalidate the session.

Food for thought: How to use this actor for simple testing of scraping?

The legacy Crawler had one use case that doesn't seem to work nicely in this actor. Let's say you wanted to check if some webpage can be easily scraped or not. So you just created a new crawler, set the Start URL to the URL you wanted to check, click Run and see a screenshot of the page. When you do the same in this actor, the live view doesn't even show up and the actor finishes straight away. Is there some way to make this case work?

Allow using named storage

Two users already had a use-case where they want to store persistent data. Right now they have to use raw API which is not user friendly.

Docs for INPUT_SCHEMA.json

Where is the documentation for how the INPUT_SCHEMA.json fields get render on the task UI?

Cheerio scraper - pass `session` and `proxyInfo` to `evaledPrepareRequestFunction`

Make it consistent with CheerioCrawler documentation

Puppeteer Scraper Injecting jquery by default

I was the puppetter scraper tutorial. https://docs.apify.com/scraping/puppeteer-scraper

I was testing injecting qjuery. When I removed the line:
await Apify.utils.puppeteer.injectJQuery(page); // <-------- Inject jQuery.

I am still getting all of the results in the dataset!

PuppeteerCrawler - Page function timeout - "maximum": 360 is unnecessarily restrictive.

Why put such a strict limit there? I understand that Puppeteer can accumulate garbage and crash on memory over looong amount of time but that's nowhere near 360 seconds.

Improve help input fields help texts

The copy in input schema needs improvements, many things are not clear, documentation is missing.

ACTOR: The actor run has reached the timeout of 300 seconds, aborting it. You can increase the timeout in run options.

Why am i getting this error message with Puppeteer Scraper? I cannot find this error in the source code.

https://github.com/apifytech/actor-scraper
https://github.com/apifytech/apify-js
ACTOR: The actor run has reached the timeout of 300 seconds, aborting it. You can increase the timeout in run options.

Add PhantomJS to SDK as 8x More Efficient than Puppeteer

Based on my testing, the PhantomJS actor on the apify platform is 8x more efficient than puppeteer actor. With the same configuration and 70 minutes of crawling, PhantomJS crawled 4830 pages, while puppeteer 604. Your customers could get more value by using PhantonJS.

https://my.apify.com/tasks/mDp69rvzdfC7LGk36#/runs/cycaJnAzPqzvmDq6q - PhantomJS
https://my.apify.com/tasks/h4RBR5Z72rnrwryPY#/runs/zPSyicXMJejz2rjX2 - puppeteer

Readme of web-scraper is missing information on how to use underscore in page function

The functions section of the readme contains info about jquery, but not about underscore
https://www.dropbox.com/s/izu694xzrxik5fz/Screenshot%202019-04-04%2010.39.42.png?dl=0

Expose session and proxyInfo to pageFunction of cheerio and puppeteer scrapers

Since both have useSessionPool: true, and you have access to Apify inside, useful for using requestAsBrowser using the same session

API not working for OUTPUT Log

The API URL to view the output log for a task doe not return the log.
https://api.apify.com/v2/key-value-stores/vDLfk985Hqw8SiXh3/records/OUTPUT?disableRedirect=1

Instead returns:

{
  "error": {
    "type": "record-not-found",
    "message": "Record was not found"
  }
}

Request: Add email notification when complete

I would like to have the option to receive an email notification when the task is scheduled, starts and ends. I would prefer the default email address to be my account email but have a field or additional addresses.

Since you already have the email actor, the code already seems to be there.

Change wait until to multi select

JSON input is not user-friendly. This needs to be implemented at the platform first and enabled in input schema.

Web-scraper: There is no 'URL #fragments identify unique pages'

This means that on a website which uses hashtags to control it's routing, web-scraper does not work. Example would be this website https://dagsordener.vejle.dk/ (I can provide a testing configured Task for this).

what is the use case to not useRequestQueue

i am trying to understand the use case when you would want to not useRequestQueue. what value does this provide to the user?

Add maxConcurrency to input

CheerioScraper - pass custom data also to "Prepare request function"

Now it's only available in "Page function". Customers requested this (see https://app.intercom.com/a/apps/kod1r788/inbox/inbox/conversation/26871291243)

Web Scraper: waitFor options - timeout vs. timeoutMillis

Default puppeteer page.waitFor function has parameter timeout in options, where you can set a timeout in ms.

We should have the same param or at least document it that we have a different one in the web-scraper readme.

Not able to access DOM using global window/document

Hi,

I am going through the documentation and on the below mentioned URL, you have written that window & document can be accessed. Now I don't want to inject jQuery as I can do all the stuff in Javascript only. But I am not able to access window/document

https://apify.com/apify/web-scraper#page-function

While document is returning object as below:

{
2021-01-25T13:54:00.587Z location: {
2021-01-25T13:54:00.587Z ancestorOrigins: {},
2021-01-25T13:54:00.588Z href: 'chrome-error://chromewebdata/',
2021-01-25T13:54:00.588Z origin: 'null',
2021-01-25T13:54:00.589Z protocol: 'chrome-error:',
2021-01-25T13:54:00.589Z host: 'chromewebdata',
2021-01-25T13:54:00.589Z hostname: 'chromewebdata',
2021-01-25T13:54:00.590Z port: '',
2021-01-25T13:54:00.590Z pathname: '/',
2021-01-25T13:54:00.591Z search: '',
2021-01-25T13:54:00.591Z hash: '',
2021-01-25T13:54:00.591Z assign: {},
2021-01-25T13:54:00.592Z reload: {},
2021-01-25T13:54:00.592Z replace: {},
2021-01-25T13:54:00.593Z toString: {}
2021-01-25T13:54:00.593Z }
2021-01-25T13:54:00.594Z }

While I was expecting the actual DOM object. Please help me out on this.

Upgrade Puppeteer to latest

The current version 0.13.0 is almost 3 months old. It would be good to move to the latest version 1.0.0 which would be more stable.

Add support for clickable divs and buttons

Currently only a tags are supported. Support for divs with href attribute is needed.
Is this something which is a quick fix? If someone can point in the right direction, then maybe I can add the fix and send out a PR.
Thanks

Ignore SSL errors did not work for Puppeteer Scraper

Run log

Not sure why it did not work, the options seem to passed correctly.

Can't remove a Link selector

When I remove the value of a Link selector ('a'), save the configuration and reload the page, the value is back again set to 'a'

Development mode breaks on request error

I have not researched this further, but it seems that when an error is thrown or some other issue happens in the browser, proxy is not correctly closed, or it tries to reconnect or I don't know what and it causes the scraper to get into an infinite loop of restarting browsers. More info in SDK slack channel.

Don't know what is input in the documentation about, there is no input in the UI that it is referring to, solved with custom data
Also I wasn't using the Pseudo URLs box but I wasn't sure if I should uncheck the 'Use Request Queue' checkbox or not in that case as I was adding to the request queue in the code.