zhshch2002 / goribot

[Crawler/Scraper for Golang]🕷A lightweight, distributed-friendly Golang crawler framework.

Home Page: https://github.com/zhshch2002/gospider

License: Apache License 2.0

Go 97.76% JavaScript 2.24%
golang-library golang spider-framework spider spiderbasic go crawler scrapy scraper

goribot's Introduction

Goribot

A lightweight, distributed-friendly Golang crawler framework.

Full documentation

!! Warning !!

Goribot has been migrated to Gospider (github.com/zhshch2002/gospider). Gospider fixes some scheduling issues and moves the network request code into a separate repository. This repository will be kept, but new users are advised to use Gospider.

🚀Feature

Version warning

Goribot supports only Go 1.13 and later.

👜Get Goribot

go get -u github.com/zhshch2002/goribot

Goribot also has an earlier development version. If you need that version, pull the tag v0.0.1.

⚡Build your first project

package main

import (
	"fmt"
	"github.com/zhshch2002/goribot"
)

func main() {
	s := goribot.NewSpider()

	s.AddTask(
		goribot.GetReq("https://httpbin.org/get"),
		func(ctx *goribot.Context) {
			fmt.Println(ctx.Resp.Text)
			fmt.Println(ctx.Resp.Json("headers.User-Agent"))
		},
	)

	s.Run()
}

🎉Done

You can now use Goribot. To learn more, see Getting Started.

🙏Thanks

Many thanks to the projects above for their help🙏.

goribot's People

Contributors

dependabot[bot], fossabot, zhshch2002

goribot's Issues

The spider exits before a task has finished?

package main

import (
	"fmt"
	"github.com/zhshch2002/goribot"
	"crawlab/service/babyTree/parser"
)

func main() {
	s := goribot.NewSpider(
		//goribot.Limiter(
		//	true,
		//	&goribot.LimitRule{
		//		Glob: "*.babytree.com",
		//		Allow: goribot.Allow,
		//		Rate: 2,
		//		//Delay: 5 * time.Second,
		//		RandomDelay: 5 * time.Second,
		//		Parallelism: 3,
		//		MaxReq: 3,
		//		MaxDepth: 5,
		//	},
		//),
		goribot.SetDepthFirst(true),
		goribot.RefererFiller(),
		goribot.RandomUserAgent(),
		goribot.SpiderLogPrint(),
	)

	s.AddTask(
		goribot.Get("http://www.babytree.com/difang/allCities.php"),
		parser.CityList,
	)

	s.OnItem(func(i interface{}) interface{} {
		fmt.Printf("Item : %+v\r\n", i)
		return i
	})

	s.Run()
}

package parser

import (
	"fmt"
	"github.com/PuerkitoBio/goquery"
	"github.com/zhshch2002/goribot"
	"crawlab/models"
	"net/url"
	"strings"
)

func CityList(ctx *goribot.Context) {
	if ctx.Resp.Dom != nil {
		cityList := make([]map[string]string, 2)
		cityList[0] = map[string]string{"city": "北京", "py": "beijing"}
		cityList[1] = map[string]string{"city": "上海", "py": "shanghai"}

		ctx.Resp.Dom.Find("a").Each(func(i int, sel *goquery.Selection) {
			href, exists := sel.Attr("href")
			index := strings.LastIndex(href, "?location=")
			if index == -1 && exists {
				u, _ := url.Parse(href)
				i2 := strings.Index(u.Host, ".")
				u2 := u.Host[:i2]
				i3 := strings.Index(u2, "-city")
				if i3 > -1 {
					py := u2[:i3]
					cityList = append(cityList, map[string]string{"city": sel.Text(), "py": py})
				}
			}
		})
		fmt.Printf("city Lists : %+v\r\n", cityList)
		for _, c := range cityList {
			py := c["py"]
			uri := "http://www.babytree.com/community/" + py + "/index_1.html#topicpos"
			ctx.AddItem(models.CityList{py, uri})
		}
	}
}
This is the output printed at run time:
image

image

Why are not all of the items in the map printed? Did the Run method exit before finishing?

A question about storing crawl results

The code is structured as follows:

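func main() {
	...
	s.AddTask(goribot.GetReq(url), f1)
	s.Run()
}

// f1 handles the first response and schedules the follow-up request.
var f1 = func(ctx *goribot.Context) {
	...
	// url and file_name are text fields, so they are read with String() rather than Int().
	url := ctx.Resp.Json("url").String()
	file_name := ctx.Resp.Json("file_name").String()
	file_size := ctx.Resp.Json("file_size").Int()
	ctx.AddTask(goribot.GetReq(url), f2)
}

// f2 handles the follow-up response.
var f2 = func(ctx *goribot.Context) {
	file_download_url := ctx.Resp.Json("file_download_url")
}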

file_name and file_size are obtained in f1, and file_download_url is obtained in f2. How can these three fields be output together as a single record?
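One possible approach, sketched below without goribot itself, is to build the second callback as a closure inside the first, so the values found in f1 are still in scope when f2 runs. The Handler type, the map-based responses, and the names FileInfo, makeF2, and crawl are hypothetical stand-ins for illustration; they are not goribot's API.

```go
package main

import "fmt"

// FileInfo gathers the fields produced by both callbacks.
type FileInfo struct {
	Name        string
	Size        string
	DownloadURL string
}

// Handler is a hypothetical stand-in for func(ctx *goribot.Context);
// the map plays the role of a parsed JSON response.
type Handler func(resp map[string]string)

// makeF2 returns the second-stage callback. It closes over the values
// found by the first stage, so the full record is assembled in one place.
func makeF2(name, size string, out *[]FileInfo) Handler {
	return func(resp map[string]string) {
		*out = append(*out, FileInfo{
			Name:        name,
			Size:        size,
			DownloadURL: resp["file_download_url"],
		})
	}
}

// f1 reads file_name and file_size, then returns the callback that
// should handle the follow-up response.
func f1(resp map[string]string, out *[]FileInfo) Handler {
	return makeF2(resp["file_name"], resp["file_size"], out)
}

// crawl simulates the two responses with made-up sample data.
func crawl() []FileInfo {
	var items []FileInfo
	first := map[string]string{"file_name": "a.zip", "file_size": "1024"}
	second := map[string]string{"file_download_url": "https://example.com/a.zip"}

	next := f1(first, &items) // stage one captures name and size
	next(second)              // stage two adds the download URL
	return items
}

func main() {
	fmt.Printf("%+v\n", crawl())
}
```

In the issue's code this would roughly correspond to passing a function literal to ctx.AddTask inside f1 instead of the package-level f2, and emitting the combined record once the second response arrives.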

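Confusion about the format of the requestData parameter in PostJsonReq(urladdr string, requestData interface{})

Taken literally, PostJsonReq should build a POST request whose body is JSON. However, the request fails when InfoRaw is passed as requestData and succeeds when InfoMap is passed.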

func PostJsonReq(urladdr string, requestData interface{}) *Request {
	body, err := json.Marshal(requestData)
	req := PostReq(urladdr, bytes.NewReader(body))
	if req.Err == nil {
		req.Err = err
	}
	req.SetHeader("Content-Type", "application/json")
	return req
}

The difference comes mainly from how json.Marshal() encodes the two values:

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

var InfoRaw = `{"account":"XXX","password":"YYY"}`
var InfoMap = map[string]string{
	"account":  "XXX",
	"password": "YYY",
}

// xx2JSON marshals requestData and prints the resulting JSON text.
func xx2JSON(requestData interface{}) {
	body, err := json.Marshal(requestData)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}

func main() {
	xx2JSON(InfoRaw) // "{\"account\":\"XXX\",\"password\":\"YYY\"}"
	xx2JSON(InfoMap) // {"account":"XXX","password":"YYY"}
}

So should PostJsonReq use gjson.Parse(InfoRaw).String() internally?
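As a caller-side alternative that needs only the standard library, a string that already holds JSON can be wrapped in json.RawMessage, which json.Marshal emits as-is instead of quoting it. The sketch below illustrates the difference; the helper names encode and encodeRaw are hypothetical, and this is not a description of how goribot handles the value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode marshals any value and returns the JSON text,
// or the error message if marshalling fails.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

// encodeRaw treats s as already-encoded JSON. json.RawMessage tells
// json.Marshal to pass the bytes through rather than quote them as a string.
func encodeRaw(s string) string {
	return encode(json.RawMessage(s))
}

func main() {
	raw := `{"account":"XXX","password":"YYY"}`

	fmt.Println(encode(raw))    // quoted and escaped: a JSON string
	fmt.Println(encodeRaw(raw)) // passed through: a JSON object
}
```

With this approach a call such as PostJsonReq(urladdr, json.RawMessage(InfoRaw)) would send the object itself, since the function shown above hands requestData straight to json.Marshal.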

Suggestion: SetParam should keep the existing params

With the following code, the existing foo=bar is currently dropped:

req := goribot.
	GetReq("https://example.com?foo=bar&ping=pong").
	SetParam(map[string]string{
		"ping": "pong",
	})

I suggest adding a method func (s *Request) AddParam(key, value string) *Request
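The merging behavior being asked for can be sketched with the standard net/url package: parse the URL, update one key in its existing query, and write the query back. The function name addParam is hypothetical and operates on a plain URL string; it is only an illustration of the idea, not goribot's implementation of the proposed AddParam.

```go
package main

import (
	"fmt"
	"net/url"
)

// addParam sets key=value on rawURL while keeping every query
// parameter that is already present.
func addParam(rawURL, key, value string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	q := u.Query()    // copy of the existing parameters
	q.Set(key, value) // replace or add a single key
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, err := addParam("https://example.com?foo=bar&ping=pong", "ping", "pang")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Note that Values.Encode sorts parameters by key, so the original ordering of the query string is not preserved.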

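A question about crawling

I have one more question.
I am crawling the following page:
https://github.com/search?q=go&type=Topics
I then specify
a topic to start crawling from, startTopic (for example: golang-library),
and a topic to stop at, endTopic (for example: google-cloud-storage, which is on page 3).

The crawler should start at startTopic and stop crawling once it reaches endTopic.

My implementation is roughly as follows: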

import (
	"fmt"
	"strconv"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/zhshch2002/goribot"
)

var (
	startTopic = "XXXXX"
	endTopic   = "YYYYY"
)

func main1() {
	s := goribot.NewSpider(
		goribot.Limiter(true, &goribot.LimitRule{
			//Glob: "*.github.com",
			Rate: 2,
		}),
		goribot.RefererFiller(),
		goribot.RandomUserAgent(),
	)

	s.AddTask(goribot.Get("https://github.com/search?q=go&type=Topics"), func(ctx *goribot.Context) {
		// Text() returns a string, so convert it before using it as the loop bound.
		totalPage, err := strconv.Atoi(strings.TrimSpace(ctx.Resp.Dom.Find("XXXXXXX").Text()))
		if err != nil {
			return
		}

		// f builds the URL of result page p.
		f := func(p int) string {
			return fmt.Sprintf("https://github.com/search?p=%v&q=go&type=Topics", p)
		}

		for i := 1; i <= totalPage; i++ {
			ctx.AddTask(goribot.Get(f(i)), func(ctx *goribot.Context) {
				ctx.Resp.Dom.Find("tbody[id^=normalthread]").Each(func(i int, s *goquery.Selection) {

					topic := s.Find("XXXX").Text()

					// ..............................

				})

			})
		}

	})

	s.Run()
}

However, the check logic never works correctly: even after endTopic has been reached, the crawler keeps crawling the pages that follow.
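A start/end window like this depends on the topics being visited in page order. If all page tasks are queued up front and their callbacks can run in any order, a flag set on reaching endTopic may come too late to stop the later pages; that is one possible explanation, not a confirmed diagnosis. The sketch below shows only the windowing logic over pages that are already in order. The function collectRange and the sample data are hypothetical.

```go
package main

import "fmt"

// collectRange walks the pages in order and returns every topic from
// start through end, inclusive. It stops as soon as end is seen, so
// later pages are never examined.
func collectRange(pages [][]string, start, end string) []string {
	var out []string
	started := false
	for _, page := range pages {
		for _, topic := range page {
			if !started {
				if topic != start {
					continue // still before the window
				}
				started = true
			}
			out = append(out, topic)
			if topic == end {
				return out // window closed: stop immediately
			}
		}
	}
	return out // end was never found; return what was collected
}

func main() {
	// Made-up sample data standing in for three result pages.
	pages := [][]string{
		{"go", "golang-library"},
		{"gin", "grpc"},
		{"google-cloud-storage", "gorm"},
	}
	fmt.Println(collectRange(pages, "golang-library", "google-cloud-storage"))
}
```

In a crawler, the same idea suggests requesting page n+1 only from inside the callback for page n, and only while endTopic has not yet been seen, rather than adding every page in a loop.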

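Suggestion: handle cookies automatically

At present it seems that cookies are not saved and sent automatically, so the cookie value has to be set by hand on every request. Could cookies be handled automatically once the first task has set them?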
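For comparison, this is how automatic cookie handling looks with Go's standard library alone: an http.Client given a cookiejar.Jar stores cookies from responses and sends them on later requests to the same host. The sketch runs against a local test server; the paths /login and /whoami, the cookie name, and the helper names are hypothetical, and nothing here describes goribot's behavior.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newTestServer starts a local server: /login sets a session cookie,
// /whoami echoes the cookie value it receives.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
	})
	mux.HandleFunc("/whoami", func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("session")
		if err != nil {
			fmt.Fprint(w, "no cookie")
			return
		}
		fmt.Fprint(w, c.Value)
	})
	return httptest.NewServer(mux)
}

// fetchWithJar makes two requests with one client. The jar saves the
// cookie from the first response and attaches it to the second request.
func fetchWithJar() (string, error) {
	srv := newTestServer()
	defer srv.Close()

	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Jar: jar}

	resp, err := client.Get(srv.URL + "/login")
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	resp, err = client.Get(srv.URL + "/whoami")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := fetchWithJar()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```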
