
Comments (5)

commented on August 24, 2024

WRT the session variables, it may also be worth considering the Google Analytics etc. URL params, like "utm_source"; those attributes will be added by lots of websites and social media tools to all outbound URLs and could probably be safely stripped during canonicalisation.
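As a rough illustration of what stripping such parameters could look like, here is a minimal sketch using only the standard library. The function name `strip_tracking_params` and the default `utm_` prefix are assumptions made for this example; they are not part of w3lib or Scrapy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url, prefixes=("utm_",)):
    """Drop query parameters whose names start with one of `prefixes`.

    Hypothetical helper for illustration only. The prefix list is an
    assumption and would need tuning per crawl.
    """
    prefixes = tuple(prefixes)
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(prefixes)
    ]
    # urlencode re-encodes the remaining pairs, so escaping may be normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params(
    "http://example.com/a?utm_source=twitter&id=1&utm_medium=social"
))
```

Parameters that do not match a prefix keep their original order.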

from w3lib.

kmike commented on August 24, 2024

Nice links, thanks!

> removal of userinfo

What does it mean? username/password?

> dots and slashes in path and hostname

Could you please give an example? What's wrong with e.g. dots in hostname?

> spaces succeeding and preceding the URL

Arguably this is an issue with link extraction, not with canonicalization. URLs shouldn't have such whitespace. See also: https://github.com/scrapy/scrapy/issues/1614.
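To illustrate handling this at extraction time rather than in the canonicalizer, a minimal sketch could trim the whitespace characters that HTML5 allows around attribute values. The helper name `clean_extracted_url` is hypothetical, and the character set is an assumption based on the HTML5 definition of ASCII whitespace.

```python
# Space, tab, line feed, carriage return, form feed.
HTML5_WHITESPACE = " \t\n\r\x0c"


def clean_extracted_url(raw):
    """Trim leading and trailing HTML5 whitespace from an extracted href."""
    return raw.strip(HTML5_WHITESPACE)


print(repr(clean_extracted_url("  http://example.com/page \n")))
```

Whitespace inside the URL is left alone; only the ends are trimmed.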

> common session id variables and their values

This would be a very good feature to have, but we can't just blindly strip some known session_id parameter names and values by default. See also: scrapy/scrapy#1560.

> ip v6 canonicalization

A good call.
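For reference, one possible way to normalize an IPv6 host literal is to let the standard `ipaddress` module produce the compressed, lowercase form. This is only a sketch; the function name `canonicalize_ipv6_host` is made up for this example and is not a w3lib API.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def canonicalize_ipv6_host(url):
    """Rewrite a bracketed IPv6 host into its compressed lowercase form.

    URLs without a bracketed host, or with a literal that does not parse
    as IPv6, are returned unchanged.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    start, end = netloc.find("["), netloc.find("]")
    if start == -1 or end < start:
        return url
    try:
        address = ipaddress.IPv6Address(netloc[start + 1:end])
    except ValueError:
        return url
    # Keep userinfo before "[" and the port after "]" untouched.
    netloc = netloc[:start + 1] + address.compressed + netloc[end:]
    return urlunsplit(parts._replace(netloc=netloc))


print(canonicalize_ipv6_host(
    "http://[2001:0DB8:0000:0000:0000:0000:0000:0001]:8080/x"
))
```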


sibiryakov commented on August 24, 2024

yes, userinfo is username and password.
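A minimal sketch of what removing userinfo could mean in practice, using only the standard library. The helper `strip_userinfo` is hypothetical and is not an existing w3lib function.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(url):
    """Remove the "user:password@" part of the authority, if present."""
    parts = urlsplit(url)
    # Everything after the last "@" is host[:port]. With no "@" in the
    # netloc, rpartition returns the whole string as the third element.
    host_and_port = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=host_and_port))


print(strip_userinfo("http://user:pass@example.com:8080/path?q=1"))
```

Whether this is safe depends on the task, since credentials in the URL can change what the server returns.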

> Could you please give an example? What's wrong with e.g. dots in hostname?

For example, google.com. (note the trailing dot).
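To make the trailing-dot case concrete, here is a small sketch that drops a trailing dot from the hostname. The name `strip_trailing_host_dot` is an assumption for this example. Treating `google.com.` and `google.com` as equivalent is itself an assumption that may not hold for every server.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_host_dot(url):
    """Drop trailing dots from the hostname, keeping userinfo and port."""
    parts = urlsplit(url)
    userinfo, at, host_and_port = parts.netloc.rpartition("@")
    if host_and_port.startswith("["):
        # Bracketed IPv6 literal: nothing to strip.
        return url
    host, colon, port = host_and_port.partition(":")
    netloc = userinfo + at + host.rstrip(".") + colon + port
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_trailing_host_dot("http://google.com./search"))
```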

obviously these are task-dependent issues, but there is no mechanism to enable such behaviour.


kmike commented on August 24, 2024

> those attributes will be added by lots of websites and social media tools to all outbound URLs and could probably be safely stripped during canonicalisation.

I don't quite like doing this all by default without an option to turn it off. So the main question for now is how to make this behavior overridable, so that users can implement such rules themselves, without having to modify w3lib or scrapy. We can also provide something by default, but I think it should be a next step, and a separate task.

> obviously these are task-dependent issues, but there is no mechanism to enable such behaviour.

Yep, it is discussed in scrapy/scrapy#1560. Changing canonicalize_url to do these actions is not enough; there should be a mechanism in Scrapy to enable it, and it should also be customizable. This part of the ticket is a duplicate of scrapy/scrapy#1560, so it is probably better to keep discussion of this feature there. The problem is known, but there has been no concrete proposal on how to fix it so far.
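Purely as an illustration of what "customizable" could look like, and not as a proposal made in this thread or in scrapy/scrapy#1560, canonicalization could be modelled as a user-supplied list of rules applied in order. All names below (`canonicalize`, `drop_params`, `sort_query`) are hypothetical and do not exist in w3lib or Scrapy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_params(names):
    """Build a rule that removes the given query parameter names."""
    names = frozenset(names)

    def rule(url):
        parts = urlsplit(url)
        kept = [
            (name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in names
        ]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    return rule


def sort_query(url):
    """Rule that sorts query parameters by name, then by value."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def canonicalize(url, rules=()):
    """Apply each rule in order. With no rules the URL is unchanged."""
    for rule in rules:
        url = rule(url)
    return url


# The caller decides which rules are on; nothing is stripped by default.
rules = [drop_params({"sid"}), sort_query]
print(canonicalize("http://example.com/a?b=2&sid=abc&a=1", rules))
```

With this shape, task-dependent behaviour such as session id removal stays opt-in, which matches the concern about not stripping anything blindly by default.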


kmike commented on August 24, 2024

> yes, userinfo is username and password.

Is it a real issue in practice? I understand why it can help, but I can also see how it can break some of the use cases if scrapy/scrapy#1466 gets merged, if we do it by default.
