
w3lib's Introduction

https://scrapy.org/img/scrapylogo.png

Scrapy

Overview

Scrapy is a BSD-licensed, fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, twitter, mail list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

w3lib's People

Contributors

angelikiboura, arturgaspar, burnzz, dangra, digenis, djayb6, dmippolitov, dvdbng, elacuesta, eliasdorneles, fladi, gallaecio, gsweiz, hugovk, jvanasco, kmike, koutoftimer, laerte, lins05, lopuhin, lucywang000, nasirhjafri, nyov, pablohoffman, preetwinder, redapple, shaneaevans, sibiryakov, stav, wrar

w3lib's Issues

Tests fail with PyPy 2.0beta1

Traceback:

.......F...........................
======================================================================
FAIL: test_invalid_utf8_encoded_body_with_valid_utf8_BOM (w3lib.tests.test_encoding.HtmlConversionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kmike/svn/w3lib/w3lib/tests/test_encoding.py", line 115, in test_invalid_utf8_encoded_body_with_valid_utf8_BOM
    'utf-8',u'WORD\ufffd\ufffd')
  File "/Users/kmike/svn/w3lib/w3lib/tests/test_encoding.py", line 96, in _assert_encoding
    self.assertEqual(body_unicode, expected_unicode)
AssertionError: u'WORD\ufffd' != u'WORD\ufffd\ufffd'
- WORD\ufffd
+ WORD\ufffd\ufffd
?      +

Bug in NEW add_or_replace_parameter (w3lib 1.20.0)

Hi everybody,

The new add_or_replace_parameter doesn't work correctly for repeated ("list") parameters in a URL, e.g.:

>>> from w3lib.url import add_or_replace_parameter
>>> add_or_replace_parameter('http://example.com/?a=1&a=2', 'b', 1)
'http://example.com/?a=2&b=1'

I expect http://example.com/?a=1&a=2&b=1 but get http://example.com/?a=2&b=1; I don't have this problem with w3lib 1.19.0.
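
A minimal stdlib-only sketch of that behaviour, preserving repeated keys (the function name is illustrative, not the w3lib API):

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_or_replace_parameter_keep_duplicates(url, name, value):
    # Re-serialize the query from a list of pairs so duplicate keys survive.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    query.append((str(name), str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))

# add_or_replace_parameter_keep_duplicates('http://example.com/?a=1&a=2', 'b', 1)
# -> 'http://example.com/?a=1&a=2&b=1'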

remove_tags_with_content not removing javascript content

I am using Python 3.5 & w3lib 1.14.2

Let's say we have the following HTML:

Before JS <script type="text/javascript"> var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-24343546-1']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })(); </script> After JS

And we use the following code to strip out the above section:

full_text_no_script = remove_tags_with_content(full_text_dirty[0], ('script',))

Assuming full_text_dirty[0] contains the HTML above, the variable full_text_no_script ends up containing the following content:

Before JS var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-24343546-1']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })(); After JS
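
A hedged stdlib-only workaround (not the w3lib implementation): strip <script> elements together with their contents before any further tag removal:

import re

_SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE)

def drop_script_blocks(html_text):
    # Remove each <script ...>...</script> block, including its contents.
    return _SCRIPT_RE.sub("", html_text)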

HTML Entities and Numeric character references in URL

URLs on some sites erroneously contain valid "safe" characters in an invalid way, and the standard Python library is unable to deal with this; therefore it might be nice if w3lib could. For example, the hash character # normally marks the beginning of the fragment, but it is possible that the URL contains a Numeric Character Reference (NCR) such as &#174;.

w3lib.url.safe_url_string() uses urllib.quote() with the following safe chars:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

and the following (invalid) url does not get altered:

>>> url = "/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad"
>>> assert url == safe_url_string(url)
>>> 

urlparse.urldefrag() is confused:

>>> urlparse.urldefrag(url)
('/Pioneer_Speakers_with_iPod&reg;~iPhone&', '174;_Dock?id=123#ipad')

Since safe_url_string() is used in SgmlLinkExtractor, for example with canonicalization turned on, we get fragment misinterpretation as the first hash triggers the slice:

/Pioneer_Speakers_with_iPod&reg;~iPhone&

Using urllib.quote() directly does not work since it encodes all hashes, including the fragment hash:

>>> print urllib.quote(url)
/Pioneer_Speakers_with_iPod%26reg%3B%7EiPhone%26%23174%3B_Dock%3Fid%3D123%23ipad

What is perhaps needed is an entity/NCR regex that first converts the references and then does the safe encoding, so that in the end we get:

/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad
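
A hedged sketch of that idea (the function name and the safe character set are assumptions, not the w3lib API):

import re
from urllib.parse import quote

_ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);?")

def safe_url_string_with_entities(url, safe="/?:@&=+$,;~!*'()#%"):
    # Percent-encode entity/NCR references first, then quote the rest,
    # leaving real delimiters such as the fragment hash untouched.
    encoded = _ENTITY_RE.sub(lambda m: quote(m.group(0), safe=""), url)
    return quote(encoded, safe=safe)

# safe_url_string_with_entities("/Pioneer_Speakers_with_iPod&reg;~iPhone&#174;_Dock?id=123#ipad")
# -> '/Pioneer_Speakers_with_iPod%26reg%3B~iPhone%26%23174%3B_Dock?id=123#ipad'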

Scrapy can not auto detect GBK html encoding

Hi,

Thanks you guys for the great framework.

I am using Scrapy to crawl multiple sites, which use different encodings.
One site is encoded as 'gbk', and the encoding is declared in the HTML meta tag, but Scrapy cannot auto-detect it.

I tried Beautiful Soup and it parses the page correctly, so I dug into w3lib and found that the pattern
_BODY_ENCODING_BYTES_RE cannot find the encoding declared in the meta tag.

HTML snippet as below:

b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'

My test:

>>> from w3lib.encoding import html_body_declared_encoding
>>> b
b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
>>> html_body_declared_encoding(b)
>>> enc = html_body_declared_encoding(b)
>>> enc
>>> print('"%s"' % enc)
"None"
>>> soup = BeautifulSoup(b)
>>> soup.title
<title>网站地图</title>
>>> soup.original_encoding
'gbk'
>>>
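
A hedged fallback sketch (not w3lib's actual pattern): a looser charset sniff that also matches meta attributes written without hyphens, as in the snippet above:

import re

_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE)

def sniff_declared_charset(body_bytes):
    # Encoding declarations must appear early, so only look at the head.
    match = _CHARSET_RE.search(body_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None

# sniff_declared_charset(b) -> 'gbk' for the snippet above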

Fix CI issue on PyPy 3 and the Rust compiler

https://travis-ci.org/github/scrapy/w3lib/jobs/759457398:

  error: Can not find Rust compiler
  
      =============================DEBUG ASSISTANCE=============================
      If you are seeing a compilation error please try the following steps to
      successfully install cryptography:
      1) Upgrade to the latest pip and try again. This will fix errors for most
         users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
      2) Read https://cryptography.io/en/latest/installation.html for specific
         instructions for your platform.
      3) Check our frequently asked questions for more information:
         https://cryptography.io/en/latest/faq.html
      4) Ensure you have a recent Rust toolchain installed:
         https://cryptography.io/en/latest/installation.html#rust
      5) If you are experiencing issues with Rust for *this release only* you may
         set the environment variable `CRYPTOGRAPHY_DONT_BUILD_RUST=1`.
      =============================DEBUG ASSISTANCE=============================
  
  ----------------------------------------
  ERROR: Failed building wheel for cryptography

Note: Once fixed, re-run the CI jobs of affected pull requests.

remove_tags removes text in between < >

>>> from w3lib.html import remove_tags
>>> TEXT = u'<span itemprop="name"> <Afgeschermd2> </span>'
>>> remove_tags(TEXT)
u'  '
>>> remove_tags(TEXT, keep=['Afgeschermd2'])
u' <Afgeschermd2> '

`canonicalize_url` use of `safe_url_string` breaks when an encoded hash character is encountered

canonicalize_url will decode all percent-encoded elements in a string.

If a hash # is present as a percent-encoded entity (%23) it will be decoded... however it shouldn't be, as it is an RFC delimiter and fundamentally changes the URL structure, turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash will point to a different URL; another is that running canonicalization on the output will return a different URL.

example:

>>> import w3lib.url
>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
>>> canonical = w3lib.url.canonicalize_url(url)
>>> print canonical
https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
>>> canonical2 = w3lib.url.canonicalize_url(canonical)
>>> print canonical2
https://example.com/path/to/foo%20bar:%20biz%20

What is presented as a fragment in "canonical", #76,%20bang%202017, is part of the valid URL, not a fragment, and is discarded when canonicalize_url is run again.
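
A hedged sketch of a fix along those lines, protecting the percent-encoded hash the same way the existing _unquotepath protects %2F and %3F (Python 3 urllib shown; not the actual w3lib code):

from urllib.parse import unquote

def _unquotepath_keeping_delimiters(path):
    # Keep %2F, %3F and %23 escaped so '/', '?' and '#' delimiters are not
    # reintroduced into the path by unquoting.
    for reserved in ('2f', '2F', '3f', '3F', '23'):
        path = path.replace('%' + reserved, '%25' + reserved.upper())
    return unquote(path)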

canonicalize_url breaks certain url(s)

The url /cmp/Supermercados-Dia%25 is incorrectly unquoted into /cmp/Supermercados-Dia%

The problem happens in:

def _unquotepath(path):
    for reserved in ('2f', '2F', '3f', '3F'):
        path = path.replace('%' + reserved, '%25' + reserved.upper())
    return urllib.unquote(path)

Many schemes for URL

In w3lib/w3lib/url.py, the following method checks for only 3 schemes, but there are many other schemes out there that form valid URLs:

def is_url(text):
    return text.partition("://")[0] in ('file', 'http', 'https')
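
A hedged sketch of a more permissive check that accepts any RFC 3986-style scheme (an illustration, not a proposed final implementation):

import re
from urllib.parse import urlparse

_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*$")

def is_url(text):
    # Accept any syntactically valid scheme instead of a hard-coded whitelist.
    parts = urlparse(text.strip())
    return bool(_SCHEME_RE.match(parts.scheme)) and bool(parts.netloc or parts.path)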

html encoded urls

The HTML5 spec (possibly HTML 4 as well) allows URLs to encode an ampersand character as an entity, which the browser will automatically decode.

To a browser, these two urls are identical:

url1 = "http://example.com?foo=bar&bar=foo"
url2 = "http://example.com?foo=bar&amp;bar=foo"

To w3lib (and urlparse) they are not:

print w3lib.url.canonicalize_url(url1)
# http://example.com/?bar=foo&foo=bar

print w3lib.url.canonicalize_url(url2)
# http://example.com/?amp=&bar=foo&foo=bar
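
A hedged stdlib sketch: unescaping HTML entities in an extracted URL before canonicalizing turns &amp; back into a plain & separator:

from html import unescape

url2 = "http://example.com?foo=bar&amp;bar=foo"
print(unescape(url2))
# http://example.com?foo=bar&bar=foo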

test_add_or_replace_parameter fails on Python 3.6.13, 3.7.10, 3.8.8, 3.9.2 due to CVE-2021-23336 fix

In Python 3.6.13, 3.7.10, 3.8.8, and 3.9.2, urllib.parse.parse_qsl no longer treats ; as a separator by default.

Python 3.8.8 (default, Feb 19 2021, 11:04:50) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.parse_qsl('arg1=v1;arg2=v2')
[('arg1', 'v1;arg2=v2')]

This causes UrlTests.test_add_or_replace_parameter to fail.

__________________________________________________________ UrlTests.test_add_or_replace_parameter ___________________________________________________________

self = <tests.test_url.UrlTests testMethod=test_add_or_replace_parameter>

    def test_add_or_replace_parameter(self):
        url = 'http://domain/test'
        self.assertEqual(add_or_replace_parameter(url, 'arg', 'v'),
                         'http://domain/test?arg=v')
        url = 'http://domain/test?arg1=v1&arg2=v2&arg3=v3'
        self.assertEqual(add_or_replace_parameter(url, 'arg4', 'v4'),
                         'http://domain/test?arg1=v1&arg2=v2&arg3=v3&arg4=v4')
        self.assertEqual(add_or_replace_parameter(url, 'arg3', 'nv3'),
                         'http://domain/test?arg1=v1&arg2=v2&arg3=nv3')
    
        url = 'http://domain/test?arg1=v1;arg2=v2'
>       self.assertEqual(add_or_replace_parameter(url, 'arg1', 'v3'),
                         'http://domain/test?arg1=v3&arg2=v2')
E       AssertionError: 'http://domain/test?arg1=v3' != 'http://domain/test?arg1=v3&arg2=v2'
E       - http://domain/test?arg1=v3
E       + http://domain/test?arg1=v3&arg2=v2
E       ?                           ++++++++

tests/test_url.py:303: AssertionError
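
In the patched Python versions, parse_qsl accepts a separator argument; passing ';' explicitly restores the old behaviour for semicolon-delimited queries (one possible direction for fixing the test, not necessarily the final fix):

from urllib.parse import parse_qsl

print(parse_qsl('arg1=v1;arg2=v2', separator=';'))
# [('arg1', 'v1'), ('arg2', 'v2')]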

w3lib.http.headers_raw_to_dict doesn't give the output I expect

My raw data is "Content-type: text/html\n\rAccept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"

I expect to see the output as
"{'Content-type': ['text/html'], 'Accept': ['text/html', 'application/xhtml+xml', 'application/xml;q=0.9', 'image/webp', '*/*;q=0.8']}"

but I got
{'Content-type': ['text/html'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8']}

I think w3lib.http.headers_raw_to_dict should handle the comma.
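
A hedged post-processing sketch for splitting a comma-separated header value (note that commas are legitimate inside some header values, e.g. dates, so this cannot be applied blindly):

def split_header_value(value):
    # Split on commas and strip surrounding whitespace from each part.
    return [part.strip() for part in value.split(",")]

# split_header_value('text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')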

thanks and best regards

multiple parameters for add_or_replace_parameter

When working with add_or_replace_parameter, we often end up writing something like:

url = add_or_replace_parameter(url, 'q', 'query')
url = add_or_replace_parameter(url, 'location', 'US')
url = add_or_replace_parameter(url, 'count', 10)

What do you think of a similar function that works with a dictionary?

url = add_or_replace_parameters(url, {
  'q': 'query',
  'location': 'US',
  'count': 10
})
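
A hedged sketch of the proposed helper, built on top of the existing function (the plural name follows the suggestion above):

from w3lib.url import add_or_replace_parameter

def add_or_replace_parameters(url, parameters):
    # Apply add_or_replace_parameter once per key/value pair.
    for name, value in parameters.items():
        url = add_or_replace_parameter(url, name, value)
    return url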

url_query_cleaner and fragments

Hi guys, why does url_query_cleaner remove the fragment part? Is that expected?

In [5]: url_query_cleaner('http://domain.tld/?bla=123#123123', ['bla'], remove=True)
Out[5]: 'http://domain.tld/'
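
A hedged workaround if the fragment should be kept: split it off first, clean the query, then re-attach it:

from urllib.parse import urldefrag
from w3lib.url import url_query_cleaner

def clean_query_keeping_fragment(url, *args, **kwargs):
    base, fragment = urldefrag(url)
    cleaned = url_query_cleaner(base, *args, **kwargs)
    return (cleaned + "#" + fragment) if fragment else cleaned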

Pipe symbol ("|") is not percent encoded

Pipe symbol ("|") is in reserved symbols list in url.py https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L67 and is not percent encoded by safe_url_string which used by scrapy to download urls.
RFC mentioned in url.py https://www.ietf.org/rfc/rfc3986.txt doesn't contain "|" in reserved symbols:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

I've found a site (in the Alexa top 20, possibly using the Play framework) which has such links with pipes, and it answers with HTTP code 400 (Bad Request) if "|" is not percent-encoded in the URL.
Is this a bug? How can I avoid it properly? For now I just removed "|" from url.py itself.
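
A hedged workaround sketch: re-quote the URL with a safe set that excludes "|", so only the pipe gets percent-encoded while existing escapes are left alone:

from urllib.parse import quote

def encode_pipes(url):
    # '%' stays in the safe set so already percent-encoded sequences survive.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%~")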

`canonicalize_url` should not accept whitespace at the beginning of a URI

Accepting whitespace at the beginning of a URI introduces a crawling issue: unique URIs are generated at each depth and the crawler gets stuck in a loop.

When an extracted URI begins with whitespace it is treated as a relative URI and concatenated to the end of the current URI, which leads to a new unique URI at each crawl depth.

But, according to RFC 2396 [page 38], the scheme part of a URI should begin with an ALPHA character.

"The syntax for URI scheme has been changed to require that all schemes begin with an alpha character."

After further investigation, the core Python urlparse function doesn't follow the RFC strictly; it relaxes some rules. For example, the Python 3.5 documentation says:

"Changed in version 3.3: The fragment is now parsed for all URL schemes (unless allow_fragment is false), in accordance with RFC 3986. Previously, a whitelist of schemes that support fragments existed."

However, a better implementation should be in accordance with RFC 3986 (which updates the RFC 2396 mentioned above): urlparse should not accept whitespace at the beginning of a URI.

By comparison, the Ruby implementation of RFC 2396 doesn't allow whitespace at the beginning of absolute or relative URIs and throws an exception: "bad URI(is not URI?)".

To fix that at the Scrapy level, I suggest the following.

In scrapy/utils/url.py, add at the beginning of canonicalize_url:

if re.match(r'[a-zA-Z]', url[:1]) is None:
    raise ValueError('Bad URI (is not a valid URI?)')

And a test in tests/test_utils_url.py:

def test_rfc2396_scheme_should_begin_with_alpha(self):
    self.assertRaises(ValueError, canonicalize_url, " http://www.example.com")

But raising an exception breaks a lot of tests and can introduce unexpected behavior. You could simply .strip() the URL, but that won't make it RFC-compliant.

Redirection may not work depending on order of 'content' and 'http-equiv' in meta tag

Description

Scrapy may not handle redirection depending on the order of the content and http-equiv attributes of the <meta> tag.

Steps to Reproduce

  1. Create two sample pages and serve them using a simple HTTP server:
echo '<html><head><meta content="0;url=dummy.html" http-equiv="refresh"></head></html>' > index1.html
echo '<html><head><meta http-equiv="refresh" content="0;url=dummy.html"></head></html>' > index2.html
python3 -m http.server -d .
  2. On another terminal open the Scrapy shell:
scrapy shell
>>> fetch('http://localhost:8000')
2021-01-29 21:24:22 [scrapy.core.engine] INFO: Spider opened
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/robots.txt> (referer: None)
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2021-01-29 21:24:22 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2021-01-29 21:24:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000> (referer: None)

>>> fetch('http://localhost:8000/index1.html')
2021-01-29 21:24:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8000/index1.html> (referer: None)

>>> fetch('http://localhost:8000/index2.html')
2021-01-29 21:24:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://localhost:8000/dummy.html> from <GET http://localhost:8000/index2.html>
2021-01-29 21:24:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8000/dummy.html> (referer: None)

Expected behavior:

Redirection happens in both cases.

Actual behavior:

Redirection only happens in the second case (http-equiv, content), not in the first (content, http-equiv).

Reproduces how often:

Always.

Versions

Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.3.1
Platform     : Linux-4.4.0-18362-Microsoft-x86_64-with-glibc2.29
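
A hedged sketch of an order-agnostic extraction (illustrative only, not the w3lib/Scrapy implementation): locate <meta> tags first, then check http-equiv and content as independent attributes:

import re

_META_TAG_RE = re.compile(r"<meta\s[^>]*>", re.IGNORECASE | re.DOTALL)
_REFRESH_RE = re.compile(r"""http-equiv\s*=\s*["']?\s*refresh""", re.IGNORECASE)
_CONTENT_RE = re.compile(r"""content\s*=\s*["']([^"']*)["']""", re.IGNORECASE)

def find_meta_refresh_content(html_text):
    # Return the content attribute of the first meta refresh tag, regardless
    # of whether content or http-equiv comes first, or None if there is none.
    for tag in _META_TAG_RE.findall(html_text):
        if _REFRESH_RE.search(tag):
            match = _CONTENT_RE.search(tag)
            if match:
                return match.group(1)
    return None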

Why does html_body_declared_encoding() convert 'ISO-8859-1' to 'cp1252'?

w3lib.encoding.html_body_declared_encoding() takes this markup which clearly states the encoding to be "ISO-8859-1":

<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">

and returns to the caller 'cp1252'.

The following dict provides the translation... but why is this even happening:

DEFAULT_ENCODING_TRANSLATION = {...
    'latin_1': 'cp1252',...}

Here is a shell log to demonstrate:

stav@maia:$ scrapy shell http://www.highrisefire.com/career.html
2013-03-31 21:20:00-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: def)
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> le = SgmlLinkExtractor(allow=r'job', restrict_xpaths='//p')
>>> response.encoding
'cp1252'
>>> le.extract_links(response)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/srv/scrapy/scrapy/scrapy/contrib/linkextractors/sgml.py", line 124
    ).encode(response.encoding)
  File "/usr/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x96'
    in position 674: character maps to <undefined>
>>>
>>> le.extract_links(response.replace(encoding='ISO-8859-1'))
[Link(url='http://www.highrisefire.com/job1.html', text=u'Get Details',...]

remove_tags_with_content bug

Using this:

remove_tags_with_content("""<div class="hqimg_related"><div class="to_page"><span>热点栏目</span> <s class='hotSe'></s>""", ("s",))

I get this result:

<div class="hqimg_related"><div class="to_page">

There seems to be a mistake somewhere, so I modified the expression to:

tags = '|'.join([r'<%s\s+.*?</%s>|<%s\s*/>' % (tag, tag, tag) for tag in which_ones])

and finally it works fine.

It's not a good idea to parse HTML text using regular expressions

In w3lib.html regular expressions are used to parse HTML texts:

_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)

However, this is definitely incorrect when it involves commented content, e.g.:

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'

Introducing "heavier" utilities like lxml would solve this issue easily, but that might be an awful idea as w3lib aims to be lightweight & fast.
Or maybe we could implement some quick parser merely for eliminating the commented parts.

Any ideas?
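
One hedged direction for the "quick parser" idea: strip HTML comments before applying the existing regexes, so commented-out tags such as the <base> above are not matched (illustrative sketch only):

import re

_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(html_text):
    # Drop everything between <!-- and -->, including the markers.
    return _COMMENT_RE.sub("", html_text)

# get_base_url(strip_html_comments('<!-- <base href="http://example.com/" /> -->'))
# would then no longer pick up the commented-out base tag.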

w3lib/encoding.py:45: DeprecationWarning: Flags not at the start of the expression

I'm seeing this warning on Python 3.7. Importing the module is enough to show these warnings.

> python -Wall
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site.py:165: DeprecationWarning: 'U' mode is deprecated
  f = open(fullname, "rU")
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site.py:165: DeprecationWarning: 'U' mode is deprecated
  f = open(fullname, "rU")
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site.py:165: DeprecationWarning: 'U' mode is deprecated
  f = open(fullname, "rU")
Python 3.7.0 (default, Sep 12 2018, 18:30:08) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import w3lib.encoding
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site-packages/w3lib/encoding.py:45: DeprecationWarning: Flags not at the start of the expression '<\\s*(?:meta(?x)(?:\\s' (truncated)
  _BODY_ENCODING_STR_RE = re.compile(_BODY_ENCODING_PATTERN, re.I)
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site-packages/w3lib/encoding.py:45: DeprecationWarning: Flags not at the start of the expression '<\\s*(?:meta(?x)(?:\\s' (truncated)
  _BODY_ENCODING_STR_RE = re.compile(_BODY_ENCODING_PATTERN, re.I)
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site-packages/w3lib/encoding.py:46: DeprecationWarning: Flags not at the start of the expression b'<\\s*(?:meta(?x)(?:\\s' (truncated)
  _BODY_ENCODING_BYTES_RE = re.compile(_BODY_ENCODING_PATTERN.encode('ascii'), re.I)
/home/mhaase/.local/share/virtualenvs/starbelly-21k54S-f/lib/python3.7/site-packages/w3lib/encoding.py:46: DeprecationWarning: Flags not at the start of the expression b'<\\s*(?:meta(?x)(?:\\s' (truncated)
  _BODY_ENCODING_BYTES_RE = re.compile(_BODY_ENCODING_PATTERN.encode('ascii'), re.I)

This deprecation appears to result from BPO #22493: inline flags may not be used in the middle of a regex anymore. I cannot see where encoding.py is using inline flags, though.

Issue in safe_url_encoding

Hi,

I hope this message finds you well.

I found the following issue; I'm not sure if it's a bug, but it didn't behave like this in the previous version (1.22).

Before on 1.22.0:

safe_url_string("foo://BAR/foobar") == "foo://BAR/foobar"

But now on 2.0.1:

safe_url_string("foo://BAR/foobar") == "foo://bar/foobar"

I understand that the domain is not case sensitive, but is this the desired behavior of the function?

Cheers,

scrapy gunzip inclusion?

Would Scrapy's gunzip and is_gzipped functions (scrapy/utils/gz.py) fit in w3lib?
If is_gzipped took headers instead of a response as its argument, there would be no other dependency.

should `canonicalize_url` treat path parameters like query string parameters?

This is sort of an edge case, as very few websites use path parameters anymore; however, some do.

For those unfamiliar, they're contained in urlparse()[3] or urlparse().params. The RFCs basically describe them as parameters specific to the last path segment, and they can be key=value pairs or raw values.

Very few systems still use them, but some do. For example, Amazon used them from launch until the early 2000s to handle cookieless sessions and much of what now goes in query strings. A handful of Java servers use them for sessions too (e.g. JSESSIONID).
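
For reference, a small stdlib demonstration of where path parameters show up (the example URL is made up):

from urllib.parse import urlparse

# Path parameters live between ';' and '?' in the last path segment.
parts = urlparse("http://example.com/catalog/item;jsessionid=ABC123?color=red")
print(parts.params)  # 'jsessionid=ABC123'
print(parts.query)   # 'color=red'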

Add sphinx documentation

Something simple that lists the available functions and uses autodoc to pull in their docstrings should be enough.

w3lib.html.remove_entities has a confusing name

The function says in its docstring:

Remove entities from the given text by converting them to
corresponding unicode character.

I understand that it actually makes the entities go away (by converting them!), but common sense tells me that "remove" means taking something away, and I would expect the function to remove those characters from the text.

I suggest renaming remove_entities to convert_entities so it's clear what it does. (It even already uses a helper function named convert_entity!)

add_or_replace_parameter() when URL doesn't end on slash

Is this the expected behavior for add_or_replace_parameter?

>>> from w3lib.url import add_or_replace_parameter
>>> add_or_replace_parameter('http://example.com', 'foo', 'bar')
'http://example.com?foo=bar'

Or should add_or_replace_parameter add a slash right after the netloc? Like:

http://example.com/?foo=bar

Chrome seems to automatically add that slash, so I'm not sure if this should be handled by w3lib or by the http client.

html.get_meta_refresh incorrectly parses url

I have an example of a meta refresh tag which is correctly parsed by major browsers such as Chrome and Safari, but not by html.get_meta_refresh. Link: http://www.moveomed.de

<meta http-equiv="refresh" content="0; URL=
http://www.domain.de/index.php" />

Note the newline between URL= and http://

This returns (0.0, 'http://moveomed.de/%0Ahttp://moveomed.de/index.php') whereas it should have returned (0.0, 'http://moveomed.de/index.php')
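
A hedged post-processing sketch: browsers strip ASCII tab and newline characters from URLs, so the same could be done on the value extracted from the meta tag (illustrative only):

import re

def strip_url_whitespace(url):
    # Remove embedded tabs/newlines and surrounding whitespace, as browsers do.
    return re.sub(r"[\t\r\n]", "", url).strip()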

add_or_replace_parameter lacks control of what is encoded

add_or_replace_parameter uses the urlencode function to convert a string to a percent-encoded ASCII string:
https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L229

If the value of the parameter is already percent-encoded (in my scenario, the value came from a JSON response of an API), it is not possible to specify safe characters that should not be encoded. This argument is available in urlencode's signature:
https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode

My suggestion is to include a safe argument in add_or_replace_parameter to improve control over the encoding of your values:
Current: w3lib.url.add_or_replace_parameter(url, name, new_value)
New: w3lib.url.add_or_replace_parameter(url, name, new_value, safe='')

In [1]: from urllib.parse import urlencode
   ...: from w3lib.url import add_or_replace_parameter
   ...: # This came from an API response already encoded
   ...: full_hierarchy = 'Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'
   ...: url = 'http://example.com'

In [2]: urlencode({'hierarchy': full_hierarchy})
Out[2]: 'hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'

In [3]: urlencode({'hierarchy': full_hierarchy}, safe='+%')
Out[3]: 'hierarchy=Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'

In [4]: add_or_replace_parameter(url, 'hierarchy', full_hierarchy)
Out[4]: 'http://example.com?hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'
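
A hedged stdlib sketch of the proposed signature (not the current w3lib implementation), forwarding safe to urlencode:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_or_replace_parameter_with_safe(url, name, new_value, safe=''):
    # Forward `safe` to urlencode so already-encoded values can be preserved.
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[name] = new_value
    return urlunsplit(parts._replace(query=urlencode(query, safe=safe)))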

headers_raw_to_dict, problem with multiple headers with same name

Hello. I have a problem with headers_raw_to_dict.
As you can see in the code snippet below, there are two X-Robots-Tag headers in the raw headers, but only one in the resulting dict. Thanks for your attention.

>>> with open('../sandbox/response_headers', 'rb') as f:
...    headers = f.read()
... 
>>> headers
b'Accept-Ranges: bytes\r\nEtag: "23b8cacf-1635-546b539ee0f1a"\r\nContent-Type: text/html; charset=ISO-8859-1\r\nLast-Modified: Sun, 22 Jan 2017 21:04:18 GMT\r\nX-Robots-Tag: noindex, nofollow\r\nX-Robots-Tag: otherbot: index, nofollow\r\nServer: Apache\r\nDate: Mon, 30 Jan 2017 20:43:35 GMT\r\nContent-Language: it-IT\r\n'

>>> from w3lib.http import headers_raw_to_dict
>>> import pprint

>>> headers_dict = headers_raw_to_dict(headers)
>>> pprint.pprint(headers_dict)
{b'Accept-Ranges': [b'bytes'],
 b'Content-Language': [b'it-IT'],
 b'Content-Type': [b'text/html; charset=ISO-8859-1'],
 b'Date': [b'Mon, 30 Jan 2017 20:43:35 GMT'],
 b'Etag': [b'"23b8cacf-1635-546b539ee0f1a"'],
 b'Last-Modified': [b'Sun, 22 Jan 2017 21:04:18 GMT'],
 b'Server': [b'Apache'],
 b'X-Robots-Tag': [b'otherbot: index, nofollow']}

The problem is here:

    headers = headers_raw.splitlines()
    headers_tuples = [header.split(b':', 1) for header in headers]
    return dict([
        (header_item[0].strip(), [header_item[1].strip()])
        for header_item in headers_tuples
        if len(header_item) == 2
    ])
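
A hedged sketch of a variant that keeps repeated headers by accumulating values per key (assuming bytes input, as above):

def headers_raw_to_dict_multi(headers_raw):
    result = {}
    for header in headers_raw.splitlines():
        parts = header.split(b":", 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].strip(), parts[1].strip()
        # Append to any existing list instead of overwriting it.
        result.setdefault(key, []).append(value)
    return result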

safe_url_string URL-encodes already-encoded username and password, breaking idempotency

The documentation claims that calling safe_url_string on an already “safe” URL will return the URL unmodified, but this breaks when the username or password includes a %.

>>> url = 'http://%25user:%25pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%2525user:%2525pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%252525user:%252525pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%25252525user:%25252525pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%2525252525user:%2525252525pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%252525252525user:%252525252525pass@host'
>>> url = w3lib.url.safe_url_string(url); url
'http://%25252525252525user:%25252525252525pass@host'

Header values must be of type str or bytes

I have a puzzle:
why doesn't headers_raw_to_dict return this?

>>> import w3lib.http
>>> w3lib.http.headers_raw_to_dict(b"Content-type: text/html\n\rAccept: gzip\n\n")   
{'Content-type': 'text/html', 'Accept': 'gzip'}

Now it returns this:

>>> import w3lib.http
>>> w3lib.http.headers_raw_to_dict(b"Content-type: text/html\n\rAccept: gzip\n\n")   
{'Content-type': ['text/html'], 'Accept': ['gzip']}

I use headers_raw_to_dict when I want to copy request headers from Chrome:

In [31]: copy_from_chrome = """Accept:text/html,application/xhtml+xml,applicati
    ...: on/xml;q=0.9,image/webp,*/*;q=0.8^M
    ...: Accept-Encoding:gzip, deflate, sdch^M
    ...: Accept-Language:zh-CN,zh;q=0.8^M
    ...: Cache-Control:max-age=0^M
    ...: Connection:keep-alive^M
    ...: Cookie:username-pes-8888="2|1:0|10:1494207240|17:username-pes-8888|48:
    ...: ODdmYWI4NmQtNDA0OC00Y2YzLTg3ZjYtOWE3Mzk0YmRiZTA2|3284b8f38c8d142ac8e71
    ...: 21c4dfd6f04d7548ccb6680f56e74a32c5f3f9dc3d4"^M
    ...: Host:pes^M
    ...: Upgrade-Insecure-Requests:1^M
    ...: User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHT
    ...: ML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"""

In [32]: from w3lib.http import headers_raw_to_dict

In [33]: headers = headers_raw_to_dict(copy_from_chrome)
In [35]: headers
Out[35]:
{'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/
*;q=0.8'],
 'Accept-Encoding': ['gzip, deflate, sdch'],
 'Accept-Language': ['zh-CN,zh;q=0.8'],
 'Cache-Control': ['max-age=0'],
 'Connection': ['keep-alive'],
 'Cookie': ['username-pes-8888="2|1:0|10:1494207240|17:username-pes-8888|48:ODdm
YWI4NmQtNDA0OC00Y2YzLTg3ZjYtOWE3Mzk0YmRiZTA2|3284b8f38c8d142ac8e7121c4dfd6f04d75
48ccb6680f56e74a32c5f3f9dc3d4"'],
 'Host': ['pes'],
 'Upgrade-Insecure-Requests': ['1'],
 'User-Agent': ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/57.0.2987.133 Safari/537.36']}

Then I use these headers with requests:

In [36]: import requests

In [37]: z = requests.get(url,headers=headers)

but I get "Header values must be of type str or bytes",
so I need to do this:

In [39]: {i:headers[i][0] for i in headers}
Out[39]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*
;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'zh-CN,zh;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Cookie': 'username-pes-8888="2|1:0|10:1494207240|17:username-pes-8888|48:ODdmY
WI4NmQtNDA0OC00Y2YzLTg3ZjYtOWE3Mzk0YmRiZTA2|3284b8f38c8d142ac8e7121c4dfd6f04d754
8ccb6680f56e74a32c5f3f9dc3d4"',
 'Host': 'pes',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, l
ike Gecko) Chrome/57.0.2987.133 Safari/537.36'}

deprecate w3lib.form.encode_multipart

w3lib.form.encode_multipart has the following issues:

  • there are no tests;
  • it uses a fixed boundary that can appear in the uploaded data;
  • it escapes neither keys nor filenames, so they can break the uploading code;
  • it doesn't support unicode keys and values (they are encoded implicitly), so users must figure out RFC details themselves if they have non-ASCII input;
  • a minor one: it doesn't support integer values.

w3lib.form.encode_multipart is based on distutils code; that code was never intended to be general multipart encoding code, and there was a veto on distutils development for several years (it was lifted not so long ago), so the code in CPython trunk is not much better.

As a part of #13 I copied distutils code from CPython trunk, wrote some tests and backported the code to make it work in Python 2.x.

Instead of fixing these issues, I propose to deprecate w3lib.form.encode_multipart and point users to the urllib3 implementation. It has a similar interface and handles more edge cases; it is used by the requests library, works in Python 2.x and 3.x, and looks well supported.

basic_auth_header uses the wrong flavor of base64

I have reason to believe that basic_auth_header is wrong in using urlsafe_b64encode (which replaces +/ with -_) instead of b64encode.

The first specification of HTTP basic auth, according to Wikipedia, is HTTP 1.0, which does not mention any special flavor of base64 and points to RFC-1521 for a definition of base64, which describes regular base64. The latest HTTP basic auth specification, according to Wikipedia, is RFC-7617, which similarly does not specify any special flavor of base64 and points to section 4 of RFC-4648, which also describes regular base64.

I traced the origin of this bug, and it has been there at least since the first Git commit of Scrapy.

>>> from w3lib.http import basic_auth_header

Actual:

>>> basic_auth_header('aa~aa¿', '')
b'Basic YWF-YWG_Og=='

Expected:

>>> basic_auth_header('aa~aa¿', '')
b'Basic YWF+YWG/Og=='

I believe this bug only affects ASCII credentials that include the >, ? or ~ characters in certain positions.

For richer encodings like UTF-8, which is what basic_auth_header uses (and which makes sense as a default, though it should rightly be configurable), many more characters can be affected (e.g. ¿ in the example above).
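
For reference, the difference between the two base64 flavours on the credentials above can be reproduced with the standard library (byte values chosen to match the example):

import base64

creds = b"aa~aa\xbf:"  # user, empty password, joined by ':'
print(base64.b64encode(creds))          # b'YWF+YWG/Og==' (standard alphabet)
print(base64.urlsafe_b64encode(creds))  # b'YWF-YWG_Og==' (URL-safe alphabet)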

/w3lib/url.py

Hi,

I'm not sure if it's just me or if there is a bug in /w3lib/url.py:
unicode_to_str is not defined there.
It used to be imported there
(from w3lib.util import unicode_to_str).
Please advise.

w3lib.url.safe_url_string incorrectly encodes IDNA domain with port

Steps to reproduce:

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://新华网.**')
'http://xn--xkrr14bows.xn--fiqs8s'
>>> safe_url_string('http://新华网.**:80')
'http://xn--xkrr14bows.xn--:80-u68dy61b'

safe_url_string('http://新华网.**:80')
expected result:

'http://xn--xkrr14bows.xn--fiqs8s:80'

real result:

'http://xn--xkrr14bows.xn--:80-u68dy61b'

Related code:

netloc = parts.netloc.encode('idna')

Maybe IDNA encoding should be done on the hostname rather than the netloc.
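
A hedged sketch of that suggestion: IDNA-encode only the hostname and re-attach the port (illustrative, not the w3lib fix):

from urllib.parse import urlsplit

def idna_netloc(url):
    parts = urlsplit(url)
    host = parts.hostname.encode('idna').decode('ascii')
    # Re-attach the port (and, if needed, userinfo) after encoding the host only.
    return host if parts.port is None else '%s:%d' % (host, parts.port)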

w3lib 1.14 breaks scrapy 0.24.6 usage of urlparse.urldefrag

$ pip install scrapy==0.24.6

This currently pulls in w3lib 1.14.0 or 1.14.1.

$ python -c "from scrapy import Request; req = Request(url='http://www.example.com')"

Gives:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/avish/.virtualenvs/w3lib-bug/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/home/avish/.virtualenvs/w3lib-bug/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 52, in _set_url
    self._url = escape_ajax(safe_url_string(url))
  File "/home/avish/.virtualenvs/w3lib-bug/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 102, in escape_ajax
    defrag, frag = urlparse.urldefrag(url)
AttributeError: 'function' object has no attribute 'urldefrag'

Downgrading to w3lib==1.13.0 fixes this issue.

The problem doesn't happen with scrapy>=1.0 (but 1.0 introduced breaking changes so some projects still use 0.24)
