Comments (7)

edoardottt commented on August 17, 2024

aaah okok now I got it ahaha (I thought you were referring to this with the word 'context': https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/response.go#L36).
Anyway, yes, I know. In the initial releases I tried to implement the current behavior using that struct, but without success. I ran into difficulties with -intensive mode, ignoring URLs... but of course if you want to use them it's super okay. I mean, they are built for these types of usage :)

from cariddi.

cyb3rjerry commented on August 17, 2024

Sounds good! I'll try using Colly's OOTB config, and if it doesn't work I'll go back to passing our custom struct.

edoardottt commented on August 17, 2024

Actually, the implementation of -intensive, ignoring URLs, etc. should be done in CreateColly, passing the Scan struct and playing with the fields and methods of the collector object.
We should also have clear test cases for URL ignoring and intensive mode, as it's really easy to break them.

cyb3rjerry commented on August 17, 2024

@edoardottt Quick question before I get too deep into this one: is there a reason why we don't use colly's e.Request.Visit(e.Attr("...")) method instead of making our own version of it?

After looking at the code, I noticed that it already covers most of our cases:

  1. It supports ignoring URLs
  2. It supports checking whether the URL has already been visited
  3. Using the e.Request.Visit method would be a lot more efficient

edoardottt commented on August 17, 2024

Maybe I'm not understanding what you mean... This is the implementation of Request.Visit:

// Visit continues Collector's collecting job by creating a
// request and preserves the Context of the previous request.
// Visit also calls the previously provided callbacks
func (r *Request) Visit(URL string) error {
	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
}

ref: https://github.com/gocolly/colly/blob/947eeead97b39d46ce2c89b06164c78b39d25759/request.go#L119

cyb3rjerry commented on August 17, 2024

Yes, but it uses the context that's defined at the collector's creation (URLs to ignore, for example).

All of the "ignoring checks" that are done manually here:

// visitHTMLLink checks if the collector should visit a link or not.
func visitHTMLLink(link, protocolTemp, targetTemp, target string, intensive, ignoreBool, debug bool,
ignoreSlice []string, finalResults *[]string, e *colly.HTMLElement, c *colly.Collector) {

are already taken into account via the context if you use e.Request.Visit(e...)

We're currently doing manually something that colly does by default when you use e.Request.Visit.

For example, we don't need to check whether the URL that's currently being scanned is "allowed" to be scanned or not, as colly already does this using the context values.

In essence, what I'm trying to say is that we don't need to check whether a URL should be scraped, whether a given subdomain should be scraped, whether a URL should be ignored, and so on. This can all be set at the Collector's creation, and colly will then (without us having to code it) make sure all the conditions are respected before continuing to a page. e.Request.Visit by default makes sure the URL is not out of bounds.

cyb3rjerry commented on August 17, 2024

Shouldn't be too crazy to do via DisallowedURLFilters, which can take a regex.
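A rough sketch of how the ignore list could be turned into such a regex, using only the standard library. The helper name buildIgnoreFilter and the example words are hypothetical, and it assumes the ignore semantics are "skip any URL that contains at least one of the words"; it is not how cariddi currently does it.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildIgnoreFilter is a hypothetical helper (not part of cariddi or colly).
// It turns a list of ignore words into one regex that matches any URL
// containing at least one of them. It returns nil when there is nothing
// to ignore, because an empty pattern would match every URL.
func buildIgnoreFilter(words []string) *regexp.Regexp {
	escaped := make([]string, 0, len(words))
	for _, w := range words {
		if w = strings.TrimSpace(w); w != "" {
			// Escape regex metacharacters so "admin.php" matches literally.
			escaped = append(escaped, regexp.QuoteMeta(w))
		}
	}
	if len(escaped) == 0 {
		return nil
	}
	return regexp.MustCompile(strings.Join(escaped, "|"))
}

func main() {
	filter := buildIgnoreFilter([]string{"logout", "admin.php"})
	// Assumption: with colly, a non-nil filter could be passed as
	// colly.DisallowedURLFilters(filter) when the collector is created.
	for _, u := range []string{
		"https://example.com/logout",
		"https://example.com/home",
	} {
		fmt.Println(u, "ignored:", filter.MatchString(u))
	}
}
```

The nil return for an empty list matters: the caller would have to skip the filter option entirely in that case, otherwise every URL would be rejected.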
