creeper's People

Contributors

wspl

creeper's Issues

Feature: Next Page Node - functional node for directing next page

page = "http://example.com/info?page=1"
demo[]: page -> $(".example")
    text: $(".title").html
    @next: $("a.next").href

I am considering another way to handle page-number navigation: simulating a user's click on the next-page link. We can add a @next node that indicates the next-page link. The page director would then switch to the next page automatically once the current page has no more content.

New grammar feature - Functional node: used to assist crawling. The node name starts with @. It is less readable than private nodes.
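The next-page behaviour described above could be sketched roughly as follows. This is not creeper's implementation: PageResult, FetchFunc and CrawlAll are hypothetical names, and the fetch function stands in for "load the page and apply the pattern, including the @next selector". The sketch only shows the control flow: keep following the next link until a page yields none, and stop if a link repeats.

```go
package main

import (
	"errors"
	"fmt"
)

// PageResult is a hypothetical container for what one page yields:
// the extracted items and the URL found by the @next selector ("" if none).
type PageResult struct {
	Items []string
	Next  string
}

// FetchFunc is a hypothetical stand-in for loading a page and applying the pattern.
type FetchFunc func(url string) (PageResult, error)

// CrawlAll follows Next links from start until a page has no next link.
// It refuses to revisit a URL, so a cyclic chain of next links cannot loop forever.
func CrawlAll(start string, fetch FetchFunc) ([]string, error) {
	var all []string
	seen := make(map[string]bool)
	for url := start; url != ""; {
		if seen[url] {
			return all, errors.New("next-page cycle detected at " + url)
		}
		seen[url] = true
		res, err := fetch(url)
		if err != nil {
			return all, err
		}
		all = append(all, res.Items...)
		url = res.Next
	}
	return all, nil
}

func main() {
	// In-memory pages replace real HTTP requests in this sketch.
	pages := map[string]PageResult{
		"http://example.com/info?page=1": {Items: []string{"a", "b"}, Next: "http://example.com/info?page=2"},
		"http://example.com/info?page=2": {Items: []string{"c"}},
	}
	fetch := func(url string) (PageResult, error) {
		p, ok := pages[url]
		if !ok {
			return PageResult{}, fmt.Errorf("no such page: %s", url)
		}
		return p, nil
	}
	items, err := CrawlAll("http://example.com/info?page=1", fetch)
	fmt.Println(items, err)
}
```

The cycle guard is a design choice of the sketch, not something the proposal mentions; without it, a site whose last page links back to an earlier one would never terminate.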

Concurrency?

Is it possible to scrape pages concurrently with creeper?
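Whether creeper itself supports this is not stated in this thread. Independently of creeper, a common Go approach is a small worker pool around whatever function fetches one page. The sketch below assumes a hypothetical FetchAll helper and a caller-supplied fetch function; neither is part of creeper's API.

```go
package main

import (
	"fmt"
	"sync"
)

// Result pairs a URL with what was fetched from it, or the error.
type Result struct {
	URL  string
	Body string
	Err  error
}

// FetchAll runs fetch for every URL using at most `workers` goroutines
// and returns the results in the same order as urls.
func FetchAll(urls []string, workers int, fetch func(string) (string, error)) []Result {
	if workers < 1 {
		workers = 1
	}
	results := make([]Result, len(urls))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes never overlap.
			for i := range jobs {
				body, err := fetch(urls[i])
				results[i] = Result{URL: urls[i], Body: body, Err: err}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// A fake fetcher keeps the sketch runnable without network access.
	fake := func(u string) (string, error) { return "body of " + u, nil }
	for _, r := range FetchAll([]string{"http://example.com/a", "http://example.com/b"}, 2, fake) {
		fmt.Println(r.URL, r.Body, r.Err)
	}
}
```

Bounding the number of workers keeps the load on the target site predictable, which usually matters more for a crawler than raw throughput.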

@next node function implementation

There are some obstacles in implementing the functional part of @next. They have stumped me for a long time:

  • InitSelector's loop call
  • Waiting until the page cycle ends, and blocking the whole cycle when there is no next page

How to parse JSON structures

Crawling complex pages probably requires both an HTML parser and a JSON parser. I found that pattern files support only the HTML parser. How can I use this framework to parse JSON structures, or extend its functionality myself? Thanks.
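Until pattern files can describe JSON, one option is to decode such responses separately with the standard encoding/json package. A minimal sketch, assuming a made-up response shape: the Listing and Item types and their field names are illustrative only and do not come from creeper or any real site.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item and Listing mirror a hypothetical JSON response:
// {"items":[{"title":"...","url":"..."}],"next":"..."}
type Item struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

type Listing struct {
	Items []Item `json:"items"`
	Next  string `json:"next"`
}

// ParseListing decodes a response body into a Listing.
func ParseListing(body string) (Listing, error) {
	var l Listing
	err := json.Unmarshal([]byte(body), &l)
	return l, err
}

func main() {
	body := `{"items":[{"title":"Hello","url":"http://example.com/1"}],"next":""}`
	l, err := ParseListing(body)
	fmt.Println(l.Items, l.Next, err)
}
```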

Problem since new commits?

Hi,

I just copied and pasted your example code (hacker_news). Yesterday it worked. Today, with the new sources, it no longer works :(

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x45cdb6]

goroutine 1 [running]:
panic(0x65db60, 0xc42000c190)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/wspl/creeper.(*Node).Value(0x0, 0x0, 0xc42001acd0, 0x0, 0xc4200340b8)
        /home/ubuntu/workspace/src/github.com/wspl/creeper/node.go:118 +0x26
github.com/wspl/creeper.(*Creeper).Each(0xc42001acd0, 0x6cc428)
        /home/ubuntu/workspace/src/github.com/wspl/creeper/creeper.go:74 +0x8b
main.main()
        /home/ubuntu/workspace/main.go:12 +0x73
exit status 2

(A simple println alone works :) )

simple http get

If the website requires login, this crawler does not work.

func (p *Page) Body() (string, error) {
    u, err := p.Url()
    if err != nil {
        return "", err
    }
    if v, e := p.Node.Creeper.CacheGet(u); e {
        return v, nil
    }
    res, err := http.Get(u)
    if err != nil {
        return "", err
    }
    defer res.Body.Close()
    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return "", err
    }
    sb := string(body)
    p.Node.Creeper.CacheSet(u, sb)
    return sb, nil
}

go get

go get github.com/wspl/creeper

github.com/PuerkitoBio/goquery

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x1880e6d3a41e pc=0xf0eb]

runtime stack:
runtime.throw(0x4971c0, 0x2a)
/usr/local/go/src/runtime/panic.go:547 +0x90
runtime.sigpanic()
/usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
runtime.unlock(0x982540)
/usr/local/go/src/runtime/lock_sema.go:107 +0x14b
runtime.(*mheap).alloc_m(0x982540, 0x1, 0x10000000010, 0xeed928)
/usr/local/go/src/runtime/mheap.go:492 +0x314
runtime.(*mheap).alloc.func1()
/usr/local/go/src/runtime/mheap.go:502 +0x41
runtime.systemstack(0xc82047fe58)
/usr/local/go/src/runtime/asm_amd64.s:307 +0xab
runtime.(*mheap).alloc(0x982540, 0x1, 0x10000000010, 0xed8f)
/usr/local/go/src/runtime/mheap.go:503 +0x63
runtime.(*mcentral).grow(0x983f10, 0x0)
/usr/local/go/src/runtime/mcentral.go:209 +0x93
runtime.(*mcentral).cacheSpan(0x983f10, 0xeed928)
/usr/local/go/src/runtime/mcentral.go:89 +0x47d
runtime.(*mcache).refill(0xaf4000, 0x10, 0xeed928)
/usr/local/go/src/runtime/mcache.go:119 +0xcc
runtime.mallocgc.func2()
/usr/local/go/src/runtime/malloc.go:642 +0x2b
runtime.systemstack(0xc820025500)
/usr/local/go/src/runtime/asm_amd64.s:291 +0x79
runtime.mstart()
/usr/local/go/src/runtime/proc.go:1051

goroutine 1 [running]:
runtime.systemstack_switch()
/usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc821c79140 sp=0xc821c79138
runtime.mallocgc(0xf0, 0x438dc0, 0x0, 0x438dc0)
/usr/local/go/src/runtime/malloc.go:643 +0x869 fp=0xc821c79218 sp=0xc821c79140
