Refactor visitHTMLLink() about cariddi HOT 7 CLOSED

edoardottt commented on August 17, 2024

Refactor visitHTMLLink()

from cariddi.

Comments (7)

edoardottt commented on August 17, 2024 1

aaah okok now I got it ahaha (I thought you were referring to this with the word 'context': https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/response.go#L36).
Anyway, yes, I know. In the initial releases I tried to implement the current behavior using that struct, but without success: I ran into difficulties with -intensive mode, ignoring URLs... But of course, if you want to use them, that's perfectly fine. I mean, they are built for this kind of usage :)

from cariddi.

cyb3rjerry commented on August 17, 2024 1

Sounds good! I'll try Colly's out-of-the-box config and, if it doesn't work, I'll go back to passing our custom struct.

from cariddi.

edoardottt commented on August 17, 2024 1

Actually, the implementation of -intensive, ignoring URLs, etc. should be done in CreateColly, passing the Scan struct and working with the fields and methods of the collector object.
We should also have clear test cases for URL ignoring and intensive mode, as it's really easy to break them.
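
A minimal sketch of what such test cases could look like, written as a table of inputs and expected results. Everything here is an assumption for illustration: `shouldVisit`, `visitCase` and `cases` are hypothetical names, not cariddi's code; "ignoring" is modeled as skipping any URL that contains an ignore word, and intensive mode is modeled as also accepting subdomains of the target.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldVisit is a hypothetical stand-in for the scope checks discussed in
// this thread; it is not cariddi's actual implementation.
// Assumptions: a URL is ignored if it contains any ignore word, and
// intensive mode widens the scope from the exact target host to its
// subdomains.
func shouldVisit(rawURL, target string, intensive bool, ignore []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return false
	}
	for _, word := range ignore {
		if word != "" && strings.Contains(rawURL, word) {
			return false
		}
	}
	host := u.Hostname()
	if host == target {
		return true
	}
	return intensive && strings.HasSuffix(host, "."+target)
}

// visitCase is one row of the test table.
type visitCase struct {
	name      string
	url       string
	intensive bool
	ignore    []string
	want      bool
}

// cases lists the behaviors that are easy to break when refactoring.
func cases() []visitCase {
	ignore := []string{"logout"}
	return []visitCase{
		{"same host", "https://example.com/about", false, nil, true},
		{"ignored word", "https://example.com/logout", false, ignore, false},
		{"subdomain, normal mode", "https://blog.example.com/", false, nil, false},
		{"subdomain, intensive mode", "https://blog.example.com/", true, nil, true},
		{"lookalike domain", "https://evil-example.com/", true, nil, false},
	}
}

func main() {
	for _, tc := range cases() {
		got := shouldVisit(tc.url, "example.com", tc.intensive, tc.ignore)
		status := "ok"
		if got != tc.want {
			status = "FAIL"
		}
		fmt.Printf("%-28s got=%-5v want=%-5v %s\n", tc.name, got, tc.want, status)
	}
}
```

In a real test file the same table would typically be looped over inside a `testing.T` test with `t.Run(tc.name, ...)`, so each row is reported separately.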

from cariddi.

cyb3rjerry commented on August 17, 2024

@edoardottt Quick question before I get too deep into this one: is there a reason why we don't use the e.Request.Visit(e.Attr("...")) colly method instead of making our own version of it?

After looking at the code, I notice that most of the cases are covered by it.

It supports URL ignore
It supports checking if the URL has already been visited
Using the e.Request.Visit method would be a lot more efficient

from cariddi.

edoardottt commented on August 17, 2024

Maybe I'm not understanding what you mean... This is the implementation of Request.Visit:

// Visit continues Collector's collecting job by creating a
// request and preserves the Context of the previous request.
// Visit also calls the previously provided callbacks
func (r *Request) Visit(URL string) error {
	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
}

ref: https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/request.go#L119

from cariddi.

cyb3rjerry commented on August 17, 2024

Yes, but it uses the context that's defined at the collector's creation (URLs to ignore, for example).

All of the "ignoring checks" that are done manually here:

cariddi/pkg/crawler/colly.go

Lines 381 to 383 in 4d59028

// visitHTMLLink checks if the collector should visit a link or not.
func visitHTMLLink(link, protocolTemp, targetTemp, target string, intensive, ignoreBool, debug bool,
	ignoreSlice []string, finalResults *[]string, e *colly.HTMLElement, c *colly.Collector) {

are already taken into account via the context if you use e.Request.Visit(e...)

We're currently doing manually something that Colly does by default when you use e.Request.Visit.

For example, we don't need to check whether the URL that's currently being scanned is "allowed" to be scanned or not, as colly already does this using the context values.

In essence, what I'm trying to say is that we don't need to check whether a URL should be scraped, whether a given subdomain should be scraped, whether a URL should be ignored, ... This can all be set at the Collector's creation, and colly will then (without us having to code it) make sure all the conditions are respected before continuing to a page. e.Request.Visit by default makes sure it's not out of bounds.
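
A rough sketch of that idea, assuming the v1 import path `github.com/gocolly/colly`. The helper `newCollector`, the domain and the regex are hypothetical placeholders, not cariddi's code; only `colly.NewCollector`, `colly.AllowedDomains`, `colly.DisallowedURLFilters`, `OnHTML` and `e.Request.Visit` are colly calls.

```go
package main

import (
	"log"
	"regexp"

	"github.com/gocolly/colly"
)

// newCollector is a hypothetical helper: all scope rules are declared once,
// when the collector is created.
func newCollector(domain string, ignore *regexp.Regexp) *colly.Collector {
	c := colly.NewCollector(
		// Requests to any other host are rejected by colly.
		colly.AllowedDomains(domain),
		// URLs matching this regex are rejected by colly.
		colly.DisallowedURLFilters(ignore),
	)

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// No manual checks here: Visit returns an error for URLs that are
		// out of scope, filtered, or already visited, so the error is
		// deliberately dropped in this sketch.
		_ = e.Request.Visit(e.Attr("href"))
	})

	return c
}

func main() {
	// Placeholder target and ignore pattern.
	c := newCollector("example.com", regexp.MustCompile(`logout|\.pdf$`))
	if err := c.Visit("https://example.com/"); err != nil {
		log.Println(err)
	}
}
```

Whether this covers -intensive mode as well is left open here; that would depend on how the allowed domains or URL filters are derived from the target.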

from cariddi.

cyb3rjerry commented on August 17, 2024

Shouldn't be too crazy to do via DisallowedURLFilters, which can take a regex.
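
As a sketch of how a plain list of ignore words could be turned into such a regex, using only the standard library: `ignoreFilter` is a hypothetical helper, and treating each word as a literal substring is an assumption about how ignoring should behave.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ignoreFilter (hypothetical helper) builds one regular expression that
// matches any URL containing one of the given words. Words are escaped with
// regexp.QuoteMeta so that characters like "." match literally.
// It returns nil when there is nothing to ignore.
func ignoreFilter(words []string) *regexp.Regexp {
	escaped := make([]string, 0, len(words))
	for _, w := range words {
		if w = strings.TrimSpace(w); w != "" {
			escaped = append(escaped, regexp.QuoteMeta(w))
		}
	}
	if len(escaped) == 0 {
		return nil
	}
	return regexp.MustCompile(strings.Join(escaped, "|"))
}

func main() {
	filter := ignoreFilter([]string{"logout", ".pdf"})
	urls := []string{
		"https://example.com/about",
		"https://example.com/logout",
		"https://example.com/files/report.pdf",
	}
	for _, u := range urls {
		fmt.Printf("%s ignored=%v\n", u, filter.MatchString(u))
	}
}
```

The resulting `*regexp.Regexp` is the type that colly.DisallowedURLFilters accepts; a nil result should be skipped rather than passed in.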

from cariddi.
