worker's People

Contributors

ziflex

worker's Issues

How to avoid the Cloudflare CAPTCHA?

For example, I tried to open a site behind Cloudflare, and it returned the error below. How can I avoid Cloudflare CAPTCHA errors? Or how could I install the Privacy Pass Chrome extension provided by Cloudflare?

{
    "text": "RETURN DOCUMENT(@url, {driver: 'cdp'})",
    "params": {
        "url": "https://some.site.behind.cloudflare/"
      }
}
"Please enable cookies.
One more step
Another way to prevent getting this page in the future is to use Privacy Pass. You may need to download version 2.0 now from the [Chrome Web Store](https://chrome.google.com/webstore/detail/privacy-pass/ajhmfdgkijocedmfjonnpjfojldioehi/)."

Docker Images not synced

Hello, first of all, thank you for this amazing project. You clearly have a vision behind it, and it is shaping up to be awesome! About this issue: the container from the published image won't start; my guess is that the entrypoint or init string is not found. When I build the image locally everything works, so this is not urgent for me.

feat(api): return Server IP

Hello, it might be interesting to return the IP address the request was executed from, for tracking and debugging of regular scrapes.

This would let you notice spam coming from an IP address, understand whether it has been banned, etc.

Unexpected error with the "try it" script

When I run the standard "try it" script, Ferret doesn't find the element with the ".chartTrack__title" selector.

LET doc = DOCUMENT('https://soundcloud.com/charts/top', {
    driver: 'cdp'
})

WAIT_ELEMENT(doc, '.chartTrack__details', 5000)

LET tracks = ELEMENTS(doc, '.chartTrack__details')

FOR track IN tracks
    RETURN {
        artist: TRIM(INNER_TEXT(track, '.chartTrack__username')),
        track: TRIM(INNER_TEXT(track, '.chartTrack__title'))
    }

Build seems broken

(on an intel mac)

1. Docker Hub => OK

docker run -d -p 8080:8080 montferret/worker

get http://localhost:8080/info

{
    "ip": "xxxx",
    "version": {
        "worker": "1.18.0",
        "chrome": {
            "browser": "HeadlessChrome/99.0.4844.0",
            "protocol": "1.3",
            "v8": "9.9.115",
            "webkit": "537.36 (@007241ce2e6c8e5a7b306cc36c730cd07cd38825)"
        },
        "ferret": "0.16.6"
    }
}

post http://localhost:8080

{
    "text": "WAIT(RAND(5000,0)) LET doc = DOCUMENT('https://news.ycombinator.com/', {driver: 'cdp',viewport: {width: 1920,height: 1080}}) WAIT_ELEMENT(doc, '.titlelink', 5000) LET elements = (FOR el IN ELEMENTS(doc, '.title .titlelink')RETURN {'article': {'title': TRIM(el.innerText)}}) RETURN elements"
}

Success

2. My build => KO

git clone https://github.com/MontFerret/worker.git  
docker build -t mybuild/worker ./ 
docker run -d -p 8081:8080 mybuild/worker

get http://localhost:8081/info

{
    "ip": "xxxxxx",
    "version": {
        "worker": "1.18.0-1-g6331581",
        "chrome": {
            "browser": "HeadlessChrome/99.0.4844.0",
            "protocol": "1.3",
            "v8": "9.9.115",
            "webkit": "537.36 (@007241ce2e6c8e5a7b306cc36c730cd07cd38825)"
        },
        "ferret": "0.16.6"
    }
}

post http://localhost:8081

{
    "text": "RETURN 0"
}

Success

post http://localhost:8081

{
    "text": "WAIT(RAND(5000,0)) LET doc = DOCUMENT('https://news.ycombinator.com/', {driver: 'cdp',viewport: {width: 1920,height: 1080}}) WAIT_ELEMENT(doc, '.titlelink', 5000) LET elements = (FOR el IN ELEMENTS(doc, '.title .titlelink')RETURN {'article': {'title': TRIM(el.innerText)}}) RETURN elements"
}

Error: socket hang up
Docker Exited


sync.(*Pool).Get(0x10521e0)
	/usr/local/go/src/sync/pool.go:132 +0x25 fp=0xc00033c5e8 sp=0xc00033c5b0 pc=0x46e445
github.com/wI2L/jettison.encodeSortedMap(0x10521a0, {0xc000617000, 0x1, 0x1000}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:415 +0x7a fp=0xc00033c728 sp=0xc00033c5e8 pc=0x7a7f1a
github.com/wI2L/jettison.encodeMap(0xc0008016c8?, {0xc000617000, 0x0, 0x1000}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:364 +0x345 fp=0xc00033c808 sp=0xc00033c728 pc=0x7a7ac5
github.com/wI2L/jettison.newMapInstr.func1(0x10520e0?, {0xc000617000?, 0xc000478900?, 0xc000617000?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:400 +0x72 fp=0xc00033c898 sp=0xc00033c808 pc=0x7adcb2
github.com/wI2L/jettison.wrapInlineInstr.func1(0xc00021f7d0, {0xc000617000?, 0x7f86c053cd28?, 0x40?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:406 +0x65 fp=0xc00033c908 sp=0xc00033c898 pc=0x7adec5
github.com/wI2L/jettison.marshalJSON({0x9ae8c0?, 0xc00021f7d0?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, 0x0, 0x0})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:167 +0xd9 fp=0xc00033c9d0 sp=0xc00033c908 pc=0x7aee79
github.com/wI2L/jettison.MarshalOpts({0x9ae8c0, 0xc00021f7d0}, {0xc00033cab8, 0x1, 0xa644a0?})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:142 +0x1a9 fp=0xc00033ca90 sp=0xc00033c9d0 pc=0x7aec69
github.com/MontFerret/ferret/pkg/runtime/values.(*Object).MarshalJSON(0xc00033cb08?)
	/go/pkg/mod/github.com/!mont!ferret/[email protected]/pkg/runtime/values/object.go:47 +0x45 fp=0xc00033cad0 sp=0xc00033ca90 pc=0x7bab45
github.com/wI2L/jettison.encodeJSONMarshaler({0xa644a0?, 0xc00014e5a8}, {0xc000616000, 0x1, 0x1000}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:692 +0x86 fp=0xc00033cb68 sp=0xc00033cad0 pc=0x7aa146
github.com/wI2L/jettison.encodeMarshaler(0xc0001c8a00, {0xc000616000, 0x1, 0x1000}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:668 +0x359 fp=0xc00033cc18 sp=0xc00033cb68 pc=0x7a9c99
github.com/wI2L/jettison.newJSONMarshalerInstr.func1(0x0?, {0xc000616000?, 0x4167eb?, 0x108a720?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:241 +0x76 fp=0xc00033cca8 sp=0xc00033cc18 pc=0x7aca56
github.com/wI2L/jettison.encodeArray(0xc0001c8a00, {0xc000616000?, 0x969459?, 0xb1af88?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:312 +0x1ae fp=0xc00033cd40 sp=0xc00033cca8 pc=0x7a74ce
github.com/wI2L/jettison.encodeSlice(0xc00033ce48?, {0xc000616000?, 0x0, 0xc00018fa10?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...}, ...)
	/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:267 +0xe6 fp=0xc00033cdd8 sp=0xc00033cd40 pc=0x7a6ea6
github.com/wI2L/jettison.newSliceInstr.func1(0x41299a?, {0xc000616000?, 0x7f86c053cd28?, 0x40?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, ...})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:364 +0x5c fp=0xc00033ce58 sp=0xc00033cdd8 pc=0x7ad93c
github.com/wI2L/jettison.marshalJSON({0x969440?, 0xc000801698?}, {{0xbb4028, 0xc000192008}, {0xabcb10, 0x23}, 0x5, 0x80, 0x0, 0x0})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:167 +0xd9 fp=0xc00033cf20 sp=0xc00033ce58 pc=0x7aee79
github.com/wI2L/jettison.MarshalOpts({0x969440, 0xc000801698}, {0xc00033d008, 0x1, 0xc0004b7b60?})
	/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:142 +0x1a9 fp=0xc00033cfe0 sp=0xc00033cf20 pc=0x7aec69
github.com/MontFerret/ferret/pkg/runtime/values.(*Array).MarshalJSON(0xc0006538f0?)
	/go/pkg/mod/github.com/!mont!ferret/[email protected]/pkg/runtime/values/array.go:42 +0x56 fp=0xc00033d020 sp=0xc00033cfe0 pc=0x7b4616
github.com/MontFerret/ferret/pkg/runtime.(*Program).Run(0xc000653b60, {0xbb4098, 0xc000653bf0}, {0xc0008b5240?, 0x0?, 0x0?})
	/go/pkg/mod/github.com/!mont!ferret/[email protected]/pkg/runtime/program.go:99 +0x366 fp=0xc00033d1c8 sp=0xc00033d020 pc=0x7be5e6
github.com/MontFerret/worker/pkg/worker.(*Worker).DoQuery(0xc0002e6180, {0xbb4098, 0xc00008a7b0}, {{0xc0000b8140?, 0xc00008a7e0?}, 0x0?})
	/go/src/github.com/MontFerret/worker/pkg/worker/worker.go:72 +0x1ae fp=0xc00033d258 sp=0xc00033d1c8 pc=0x8fcaae
github.com/MontFerret/worker/internal/controllers.(*Worker).runScript(0xc00019c1d8, {0xbc3310, 0xc000010180})
	/go/src/github.com/MontFerret/worker/internal/controllers/worker.go:52 +0x33b fp=0xc00033d308 sp=0xc00033d258 pc=0x8fe11b
github.com/MontFerret/worker/internal/controllers.(*Worker).runScript-fm({0xbc3310?, 0xc000010180?})
.....

enhancement: Info Request

Hey @ziflex ,

Sometimes we are billed per request on the hosting of an API.

Would it be possible to have an option on POST / to retrieve the info in the same request?

{ 
  data: ...
  info: ...
}

I had a quick look at making a PR... I have to dig into Go a little first 😅

It's my last step to align all my containers on your worker and trash my fork 🤞

Returns 200 and success even when the URL requested is down or offline the second time [Cache issue?]

Hello,

First of all, great work creating a declarative scraper. We were testing worker using Docker, started with

docker run -d -p 8080:8080 montferret/worker

and it's running great. We sent a POST request to it with the payload below and got 200 OK, which is good.

{ "text": "LET doc = DOCUMENT(@url, { driver: \"cdp\", userAgent: \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome 76.0.3809.87 Safari/537.36\"}) RETURN {}", "params": { "url": "http://192.168.0.10/test" } }

However, the problem is that when the URL is down/offline (we intentionally took the site http://192.168.0.10/test down), we still get the same 200 OK. It looks like the previously successful request is cached, since http://192.168.0.10/test was running when the very first request went through. [If we restart the Docker container while http://192.168.0.10/test is down and send a fresh request, it shows net::ERR_ADDRESS_UNREACHABLE as expected and works correctly.]

Not sure whether this is due to Chrome caching or Ferret caching?

If it is a cache, is there a way to disable it so that every request hits the live URL instead of using the cached version?

If there is a flag, how do we pass it to the Docker image?

Appreciate your help, thanks in advance.

Add support of async execution

We need to add the ability to execute queries asynchronously, i.e. without blocking the HTTP request.

The design is pretty simple:

  • Add a [POST] /async method (or another name, naming is hard :)) that receives a payload similar to [POST] /, with an additional .callback property that holds a valid URL (we need to validate it before responding):
type AsyncScript struct {
    Script
    Callback string `json:"callback"`
}
  • If everything is OK, the response must be a UUID value representing the job ID.
  • Once Worker receives an async payload, it schedules it internally using a thread pool.
  • When Worker is done with a job, whether it failed or not, it should call the given callback URL with the following payload:
type AsyncResult struct {
    Status string // succeeded or failed
    Data   []byte // script result or error
}
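
The design above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the worker's implementation: `validateCallback`, `newJobID`, `startPool`, the `job` type, and the JSON tags on `AsyncResult` are all hypothetical, the "thread pool" is modelled as a fixed number of goroutines reading from a channel, and the `run` function stands in for actual script execution.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/url"
)

// AsyncResult mirrors the callback payload from the proposal.
// The JSON tags are an assumption; encoding/json emits []byte as base64.
type AsyncResult struct {
	Status string `json:"status"` // "succeeded" or "failed"
	Data   []byte `json:"data"`   // script result or error message
}

// job is one scheduled script plus the URL to report its outcome to.
type job struct {
	id, text, callback string
}

// validateCallback accepts only absolute http(s) URLs that name a host.
func validateCallback(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return errors.New("callback must be an absolute http(s) URL")
	}
	return nil
}

// newJobID returns a random version-4 UUID string built from crypto/rand.
func newJobID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// startPool launches n goroutines that execute jobs and POST the outcome
// to each job's callback URL, whether the script succeeded or failed.
func startPool(n int, jobs <-chan job, run func(text string) ([]byte, error)) {
	for i := 0; i < n; i++ {
		go func() {
			for j := range jobs {
				res := AsyncResult{Status: "succeeded"}
				out, err := run(j.text)
				if err != nil {
					res.Status, out = "failed", []byte(err.Error())
				}
				res.Data = out
				body, _ := json.Marshal(res)
				resp, err := http.Post(j.callback, "application/json", bytes.NewReader(body))
				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}
}

func main() {
	// Validate first, then hand out a job ID; no network traffic happens here.
	if err := validateCallback("https://example.com/hook"); err != nil {
		panic(err)
	}
	id, err := newJobID()
	if err != nil {
		panic(err)
	}
	fmt.Println("job id:", id)
}
```

A real handler would validate the callback, generate the ID, push the job onto the channel, and return the ID immediately; retries and timeouts for the callback request are left out of this sketch.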

feat(api): screenshot if failed

Hey, still in the spirit of debugging: it could be interesting to have a POST option that captures a screenshot of the document's state when a script fails.

I'm just sharing ideas, feel free to close the issue.

Chrome no headless option

It might be nice to be able to choose a non-headless Chrome at build time for some use cases.

lite authentication

quick idea

We will regularly host this Docker image behind an auth layer managed by Google, Amazon, or similar. For cases where we want to host it on a dedicated server, do you think it would be interesting to have a basic micro authentication system?

external proxy support

Thanks for the worker component; it fits my needs and I'm loving it.

I'd like to know how to use the worker component with an external proxy.

Any tips or idea will be appreciated.

thanks in advance.

Update Chrome to 83.0.4103.0

Worker is using a pretty outdated version of Chrome that has some issues. Let's update to the most recent stable one (the one the Puppeteer team is using).

microbox/chromium-headless:83.0.4103.0

Caching

We need to be able to cache frequently used (compiled) queries using a simple LFU algorithm.
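
A minimal sketch of a simple LFU cache in Go (1.18+ for generics), keyed by query text. The `LFU` type and its methods are hypothetical names, the string values stand in for compiled programs, and eviction uses a linear scan, which is simple but O(n); it is not safe for concurrent use without a mutex.

```go
package main

import "fmt"

// entry holds a cached value and how often it has been written or read.
type entry[V any] struct {
	value V
	hits  int
}

// LFU is a fixed-capacity cache that evicts the least frequently used key.
type LFU[V any] struct {
	capacity int
	items    map[string]*entry[V]
}

// NewLFU creates an empty cache that holds at most capacity entries.
func NewLFU[V any](capacity int) *LFU[V] {
	return &LFU[V]{capacity: capacity, items: make(map[string]*entry[V])}
}

// Get returns the cached value and bumps its usage counter.
func (c *LFU[V]) Get(key string) (V, bool) {
	e, ok := c.items[key]
	if !ok {
		var zero V
		return zero, false
	}
	e.hits++
	return e.value, true
}

// Put stores a value, evicting the least used entry when the cache is full.
func (c *LFU[V]) Put(key string, value V) {
	if c.capacity <= 0 {
		return
	}
	if e, ok := c.items[key]; ok {
		e.value = value
		e.hits++
		return
	}
	if len(c.items) >= c.capacity {
		c.evict()
	}
	c.items[key] = &entry[V]{value: value, hits: 1}
}

// evict removes one entry with the lowest usage counter (linear scan).
func (c *LFU[V]) evict() {
	victim, lowest := "", -1
	for k, e := range c.items {
		if lowest == -1 || e.hits < lowest {
			victim, lowest = k, e.hits
		}
	}
	delete(c.items, victim)
}

func main() {
	cache := NewLFU[string](2)
	cache.Put("RETURN 1", "compiled-1")
	cache.Put("RETURN 2", "compiled-2")
	cache.Get("RETURN 1")               // "RETURN 1" now has 2 hits
	cache.Put("RETURN 3", "compiled-3") // cache is full: evicts "RETURN 2" (1 hit)
	_, ok := cache.Get("RETURN 2")
	fmt.Println(ok) // false
}
```

When several entries share the lowest counter, this version evicts an arbitrary one of them because Go map iteration order is unspecified; a production cache would usually break ties by recency.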

Make fails

I am trying to run 'make' so that I can use a local Chrome instance for CDP, but it fails with the following error. Any pointers?

go build -v -o ./bin/worker
-ldflags "-X main.version=v1.10.0-1-gfa6fdbf -X main.ferretVersion=0.15.0"
./main.go
main.go:13:2: use of internal package not allowed
main.go:14:2: use of internal package not allowed
Makefile:8: recipe for target 'compile' failed
make: *** [compile] Error 1

Create proxy sidecars

Once #5 is done, we need to provide some proxy sidecars that implement an interface to popular queue services like:

  • RabbitMQ
  • Redis
  • AWS SQS
  • Google Cloud Pub/Sub
  • Azure Service Bus

Any additional proxies will be created if there is such a need.
