
gazpacho's People

Contributors

deepsource-autofix[bot], deepsourcebot, eklipse18, filips123, jeffsieu, jnmclarty, maxhumber, mickm3n, rexdivakar, saibot94, scrambldchannel, terop, tusharnankani, vyshakhj, vyshakhj-tw


gazpacho's Issues

Unused modules imported and other things that make linters sad

Describe the bug
Running Flake8 highlighted a couple of unused imports and other little things that aren't serious but can be cleaned up pretty easily.

To Reproduce

Run flake8 on the codebase

Expected behavior
flake8 should run without warnings; the unused imports should be removed

Environment:

  • OS: Linux
  • Version: master

Additional context

Nothing serious, just a bit of a cleanup. PR raised - #52

Proxies

Is your feature request related to a problem? Please describe.

Really just a question, followed by a request: does gazpacho currently support the use of proxies? (If not, it would be great to include them.)
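
In the meantime, a rough workaround might be to install a global proxy opener before calling get. A minimal sketch, assuming gazpacho fetches pages through urllib.request under the hood (the proxy address here is hypothetical):

import urllib.request

from gazpacho import get

# hypothetical proxy address; substitute a real one
proxy = urllib.request.ProxyHandler({"https": "http://203.0.113.1:8080"})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)  # applies to all subsequent urlopen calls

html = get("https://scrape.world/books")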

Describe the solution you'd like

Describe alternatives you've considered

Additional information

Auto isort > black > mypy on pushes, merges, and releases

Is your feature request related to a problem? Please describe.

Right now I manually run:

isort gazpacho
black .
mypy gazpacho

To make sure that the types are appropriate and the code is black.

Describe the solution you'd like

This should be performed automatically on pushes, merges, and releases
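
A minimal sketch of a script that a CI workflow could invoke on those events; the filename is hypothetical and the commands mirror the manual steps above, while the wiring to pushes, merges, and releases would still live in the CI config:

# check.py (hypothetical): run the same three tools in sequence, fail fast
import subprocess
import sys

for command in (["isort", "gazpacho"], ["black", "."], ["mypy", "gazpacho"]):
    result = subprocess.run(command)
    if result.returncode != 0:
        sys.exit(result.returncode)  # a non-zero exit fails the CI job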

Add release versions to GitHub?

$ git tag v0.7.2 && git push --tags 🎉 🎈

I really like this project. I think that adding releases to the repository can help the project grow in popularity. I'd like to see that!

find an element with an attribute regardless of value

Is your feature request related to a problem? Please describe.

I have a div element with class='' and other div elements at the same level with class='whatever'.
I can't find a way to get all of the div elements that have a class attribute, regardless of its value.

Describe the solution you'd like

soup.find('div', attrs={'class':''}, partial=True, mode='all')
should return a list with all the 'div' elements, but that is not the case

Describe alternatives you've considered

I tried to get the divs inside a div, but wasn't able to find a solution for that either.
Maybe the solution is to do two mode='all' finds and concatenate the lists; a one-find variant is sketched below.
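
A rough sketch of that kind of workaround, using a single find plus a filter (untested; it assumes find with mode='all' returns a list and that each result exposes an .attrs dict):

from gazpacho import Soup

html = '<div class="">a</div><div class="whatever">b</div><div>c</div>'
soup = Soup(html)

# grab every div, then keep only the ones that carry a class attribute at all
divs = soup.find('div', mode='all') or []
divs_with_class = [div for div in divs if 'class' in div.attrs]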

Additional information

None

Finding tags returns entire html

Describe the bug

Using soup.find on particular website(s) returns the entire html instead of the matching tag(s)

Steps to reproduce the issue

Look for ul tag with attribute class="cves" (<ul class="cves">) on https://mariadb.com/kb/en/security/

from gazpacho import Soup

endpoint = "https://mariadb.com/kb/en/security/"
html_dump = Soup.get(endpoint)
sample = html_dump.find('ul', attrs={'class': 'cves'}, mode='all')

sample contains the entire html document

Expected behavior

sample should contain the contents of the tag <ul class="cves">, which in this case would be rows of <li>-s, listing the CVEs and corresponding fixed versions in MariaDB, something like:

<ul class="cves">
  <li>..</li>
  ...
  <li>..</li>
</ul>

Environment:

  • OS: Ubuntu Linux 18.04
  • Version: gazpacho 1.1, python 3.6.9

Additional information

Using BeautifulSoup on the same html_dump did get the job done, although the <li>-tags are weirdly nested together.

from bs4 import BeautifulSoup
# html_dump from above Soup.get(endpoint)
bs_soup = BeautifulSoup(html_dump.html, 'html.parser')
ul_cves = bs_soup.find_all('ul','cves')

ul_cves contains strangely nested <li>-s, from which it was still possible to extract the rows of <li>-s I was looking for.

<ul class="cves">
  <li>
    <li>
    ...
  </li></li>
</ul>

Parser is unable to capture attrs that have nested quote marks of the same type

Describe the bug
Came across this issue in the wild. If there is a ">" character in an attribute, the parser will misinterpret it as the closing tag, and the parsed text will include some strings from the attributes.

To Reproduce
Code to reproduce the behaviour:

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div"}).text
'2"}">text'

Expected behavior

>>> import gazpacho
>>> html = '<div tooltip-content="{"id": "7", "graph": "1->2"}">text</div>'
>>> soup = gazpacho.Soup(html)
>>> soup.find("div").text
'text'

Environment:

  • OS: macOS
  • Version: 10.15.6

Was just recommended this library and am a huge fan of the api you came up with, thanks a lot for this project!

Can't parse some HTML entries

Describe the bug

Can't parse some entries: there are 40 entries on every page, but some are not being parsed correctly.

Steps to reproduce the issue

from gazpacho import get, Soup

for i in range(1, 15):
    link = f'https://1337x.to/category-search/aladdin/Movies/{i}/'
    html = get(link)
    soup = Soup(html)
    body = soup.find("tbody")

    # extracting all the entries in the body;
    # there are 40 entries on every page (the last one can have fewer)
    entries = body.find("tr", mode='all')[::-1]

    # but for some pages it can't retrieve all the entries for some reason
    print(f'{len(entries)} entries -> {link}')

Expected behavior

See 40 entries for every page

Environment:

Arch Linux - 5.13.10-arch1-1
Python - 3.9.6
Gazpacho - 1.1

API suggestion: soup.all("div") and soup.first("div")

The default auto behavior of .find() doesn't work for me, because it means I can't trust my code not to start throwing errors if the page I am scraping adds another matching element, or drops the number of elements down to one (triggering a change in return type).

I know I can do this:

div = soup.find("div", mode="first")
# Or this:
divs = soup.find("div", mode="all")

But having function parameters that change the return type is still a bit weird - not great for code hinting and suchlike.

Changing how .find() works would be a backwards incompatible change, which isn't good now that you're past the 1.0 release. I suggest adding two new methods instead:

div = soup.first("div") # Returns a single element
# Or:
divs = soup.all("div") # Returns a list of elements

This would be consistent with your existing API design (promoting the mode arguments to first class method names) and could be implemented without breaking existing code.
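
For what it's worth, a minimal sketch of how the two proposed methods could delegate to the existing find (the subclass name and method names are just the proposal, not anything the library ships):

from gazpacho import Soup

class ModalSoup(Soup):
    def first(self, tag, attrs=None, partial=True):
        # always a single element, or None
        return self.find(tag, attrs, partial=partial, mode="first")

    def all(self, tag, attrs=None, partial=True):
        # always a list, possibly empty
        return self.find(tag, attrs, partial=partial, mode="all") or []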

Add typehints to code


Can we add type hints to the code? Maybe then we can run mypy on it?

It may help to fish out hidden bugs.

.find not working on non-closing tags like <img src='hi.png'>

Describe the bug
Find isn't working properly on tags that don't close

To Reproduce
Code to reproduce the behaviour:

from gazpacho import Soup, get

html = """
<div>
  <span>Blah</span>
  <p>Blah Blah</p>
  <img src='hi.png'>
  <br/>
  <img src='sup.png'>
</div>
"""

soup = Soup(html)
imgs = soup.find("img")
imgs[0].attrs['src']

Expected behavior
Should yield: 'hi.png'
Right now it errors with: TypeError: 'Soup' object is not subscriptable

Environment:

  • OS: macOS
  • Version: 0.9

Can't install whl files

Describe the bug

Hi,

There was a pull request (#48) to add whl publishing but it appears to have been lost somewhere in a merge on October 31st, 2020. (v1.1...master). Therefore, no wheels have been published for 1.1.

This causes the installation error on my system that the PR was meant to address.

Expected behavior

Install gazpacho with a wheel, not a tar.gz. Please re-add the whl publishing.

Environment:

  • OS: Windows 10

from gazpacho import get, Soup

from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho'

import gazpacho does work.

In VS Code:

from gazpacho import get, Soup
#import gazpacho #works
'''
url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
#soup = soup.get(url)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]
'''
Traceback (most recent call last):
  File "c:/Passport_G/Rob_justpy/jptutorial/gazpacho.py", line 1, in <module>
    from gazpacho import get, Soup
  File "c:\Passport_G\Rob_justpy\jptutorial\gazpacho.py", line 1, in <module>
    from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho' (c:\Passport_G\Rob_justpy\jptutorial\gazpacho.py)

Environment:

Windows 10

from command line
(jp) C:\Passport_G\Rob_justpy\jptutorial>pip install -U gazpacho
Processing c:\users\rober\appdata\local\pip\cache\wheels\db\6b\a2\486f272d5e523b56bd19817c14ef35ec1850644dea78f9dd76\gazpacho-1.1-py3-none-any.whl
Installing collected packages: gazpacho
Successfully installed gazpacho-1.1
WARNING: You are using pip version 20.2.4; however, version 20.3.3 is available.
You should consider upgrading via the 'c:\passport_g\rob_justpy\jptutorial\jp\scripts\python.exe -m pip install --upgrade pip' command.

(jp) C:\Passport_G\Rob_justpy\jptutorial>
(jp) C:\Passport_G\Rob_justpy\jptutorial>python
Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

Warning:
This Python interpreter is in a conda environment, but the environment has
not been activated. Libraries may fail to load. To activate this environment
please see https://conda.io/activation

Type "help", "copyright", "credits" or "license" for more information.

>>> import gazpacho
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Passport_G\Rob_justpy\jptutorial\gazpacho.py", line 1, in <module>
    from gazpacho import get, Soup
ImportError: cannot import name 'get' from 'gazpacho' (C:\Passport_G\Rob_justpy\jptutorial\gazpacho.py)

Support non-utf-8 encodings

Thank you for your nice project!

Please add an encoding argument to decode pages that are not utf-8 encoded.

content = response.read().decode("utf-8")

I tried an EUC-KR encoded page and got an error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 95: invalid start byte
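
A hypothetical sketch of what that change could look like inside get (the signature and body here are guesses at the proposal, not the current API):

import urllib.request

def get(url, params=None, headers=None, encoding="utf-8"):
    # ... build and send the request as gazpacho already does ...
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)

# callers could then write: html = get(url, encoding="euc-kr")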

mypy gazpacho/get.py:35: error

Describe the bug

Error when running mypy gazpacho

Error

This is the error:

max@mbp gazpacho % mypy gazpacho
gazpacho/get.py:35: error: Argument 1 to "update" of "dict" has incompatible type "Optional[Dict[str, Any]]"; expected "Mapping[str, str]"
Found 1 error in 1 file (checked 4 source files)
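
A guess at the kind of change that would satisfy mypy, assuming line 35 merges optional user-supplied headers into a default dict (all names here are made up):

from typing import Dict, Optional

def build_headers(user_headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    headers: Dict[str, str] = {"User-Agent": "gazpacho"}
    if user_headers:  # narrow the Optional before calling update
        headers.update(user_headers)
    return headers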

Expected behavior

No errors 🙈

Environment:

  • OS: macOS
  • Version: 1.1.1-beta

Battle testing the format function

Is your feature request related to a problem? Please describe.

I'm worried that the format function is brittle.

Describe the solution you'd like

It should always return html. And never fail. I could use some help writing more tests for this function (that is run on every repr and str call).
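
A starting point might be a parametrized test that throws awkward inputs at repr (pytest assumed; the cases are guesses at fragile spots):

import pytest
from gazpacho import Soup

@pytest.mark.parametrize("html", [
    "<div></div>",
    "<ul><li>Item</li><li>Item</li></ul>",
    "<img src='image.png'>",  # void tag
    "<div><p>unclosed",       # malformed input
])
def test_repr_returns_html_and_never_raises(html):
    assert isinstance(repr(Soup(html)), str)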

Auto publish docs on release

Is your feature request related to a problem? Please describe.

gazpacho uses Portray to publish the documentation at https://gazpacho.xyz/

Describe the solution you'd like

This should happen automatically on new releases (perhaps with TravisCI)

Describe alternatives you've considered

Right now I have to manually run:

portray on_github_pages

To publish...

Making gazpacho PEP 561 compatible

Describe the bug
Although gazpacho is now type hinted, trying to use gazpacho types in another package (quote) causes this error:

quote/quote.py:3: error: Skipping analyzing 'gazpacho': found module but no type hints or library stubs

To Reproduce
Code to reproduce the behaviour:

mypy quote

Expected behavior

Shouldn't throw an error!

Environment:

  • OS: macOS
  • Version: 1.0
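
For context, PEP 561 compatibility mostly comes down to shipping an empty py.typed marker file inside the package. A sketch assuming a setuptools build (gazpacho may well use a different build tool with an equivalent option):

# setup.py sketch: ship the py.typed marker so mypy trusts the inline hints
from setuptools import setup

setup(
    name="gazpacho",
    packages=["gazpacho"],
    package_data={"gazpacho": ["py.typed"]},  # empty file, per PEP 561
    zip_safe=False,  # type checkers can't read hints out of zipped installs
)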

Pls make `find()` always return a list

Hi! This library is cool, but I've started using it and immediately stumbled upon one difficulty:
Soup.find() returns a list of Soups if it finds multiple tags, a single Soup object if it finds a single tag, and None if it finds no tags.
This makes it impossible to seamlessly use find() in for expressions and comprehensions like the one in the "Books" example.

Imagine I need to parse multiple pages, each one containing an unknown amount of books in the 0 to N range.
To handle this seamlessly I need to write a 3-branch if expression or somehow catch the TypeError with nested try blocks.

This is what happens with the example when it finds only one book:

In [7]: from gazpacho import get, Soup
   ...: 
   ...: url = 'https://scrape.world/books'
   ...: html = get(url)
   ...: soup = Soup(html)
   ...: books = soup.find('div', {'class': 'book-early'})
   ...: 
   ...: def parse(book):
   ...:     name = book.find('h4').text
   ...:     price = float(book.find('p').text[1:].split(' ')[0])
   ...:     return name, price
   ...: 
   ...: [parse(book) for book in books]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-16cc6dbabde2> in <module>
     11     return name, price
     12 
---> 13 [parse(book) for book in books]

TypeError: 'Soup' object is not iterable

So, could you please make find() return an empty list if no tags are found, and a list with one element if only one tag is found?

Stacktoberfest

#Stacktoberfest

Introducing a campaign that I'm calling #Stacktoberfest 🥫

If you're a fan of gazpacho and want to help evangelize the package, this campaign is for you!

Question Bank

I've created a question bank and have already committed ~30 answers myself.

There are several questions in the bank that I still think deserve modern gazpacho answers.

Contributing

If you decide to answer any of the questions in the bank (or find another one that you think deserves a gazpacho answer), please submit a PR with a link to your answer!

Importantly, these answers should be high quality (we want to convince users that gazpacho > bs4), respectful, and the opposite of obnoxious.

Example

I found this question by searching for popular [web-scraping], [python] questions. It has 55k views, 19 upvotes and the original link is dead. Given that it gets a lot of traffic, I thought it deserved a new modern answer... here it is:


The original link posted by OP is dead... but here's how you might scrape table data with gazpacho:

Step 1 - import Soup and download the html:

from gazpacho import Soup

url = "https://en.wikipedia.org/wiki/List_of_multiple_Olympic_gold_medalists"
soup = Soup.get(url)

Step 2 - Find the table and table rows:

table = soup.find("table", {"class": "wikitable sortable"}, mode="first")
trs = table.find("tr")[1:]

Step 3 - Parse each row with a function to extract desired data:

def parse_tr(tr):
    return {
        "name": tr.find("td")[0].text,
        "country": tr.find("td")[1].text,
        "medals": int(tr.find("td")[-1].text)
    }

data = [parse_tr(tr) for tr in trs]
sorted(data, key=lambda x: x["medals"], reverse=True)

Looking forward to your contributions!

Improve 'partial' documentation

Is your feature request related to a problem? Please describe.

I don't understand the statement "Element attributes are partially matched by default." Does it mean attrs={"id": "foo"} will match attrs={"id": "foob"}?

Describe the solution you'd like

Better description with examples of what would/would not be matched with partial=True vs False.
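
For what it's worth, my reading is that partial=True treats the queried attribute value as a substring of the element's value, which examples like these could confirm or refute:

from gazpacho import Soup

html = '<div class="book-hardcover">x</div>'
soup = Soup(html)

soup.find("div", {"class": "book-"}, partial=True)   # matches: "book-" is a substring
soup.find("div", {"class": "book-"}, partial=False)  # no match: values must be equal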

Describe alternatives you've considered

n/a

Additional information

n/a

get fails on urls containing unicode characters

Describe the bug

I get a UnicodeEncodeError when calling get() with a URL that contains Unicode characters.

To Reproduce
Code to reproduce the behaviour:

from gazpacho import get
url = 'https://worldofwarcraft.com/en-us/character/us/stormrage/drãke'
html = get(url)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 36: ordinal not in range(128)

Expected behavior

get() should succeed without throwing an exception

Environment:

  • OS: Linux
  • Version: [e.g. 0.9]

.text returns nested data instead of trailing data

Describe the bug
When trying to get text from a tag, gazpacho returns an empty string

To Reproduce
Code to reproduce the behaviour:

from gazpacho import Soup

html  = '<a href="/Sorasful?source=gig_cards&referrer_gig_slug=edit-mixing-and-mastering&ref_ctx_id=42d34014-b499-46fa-a1d3-04318b12fecc" rel="nofollow noopener noreferrer" target="_self"><span>by </span>Sorasful</a>'

soup = Soup(html)

print(soup.text)
# prints nothing

print(soup.find('a').text)
# prints "by"

Expected behavior
Should return "by Sorasful"

Environment:

  • OS: Windows 10
  • Version: Python3.8

find fails on nested empty tags

Describe the bug
The find method gets confused on empty element tags (img, meta, etc...)

To Reproduce
Code to reproduce the behaviour:

from gazpacho import Soup

html = '''
<div class="foo-list">
  <a class="foo" href="/foo/1">
    <div class="foo-image-container">
      <img src="image.jpg">
    </div>
  </a>
  <a class="foo" href="/foo/2">
    <div class="foo-image-container">
      <img src="image.jpg">
    </div>
  </a>
</div>
'''

soup = Soup(html)
soup.find('a', {'class': "foo"})

Expected behavior
find should be able to "find" a list of two a tags. Instead the full blob is getting returned.

Environment:

  • OS: macOS
  • Version: 0.8

[Error] "HTTPError" has no attribute "msg"

Describe the bug
Error when running mypy:

gazpacho/get.py:68: error: "HTTPError" has no attribute "msg"
Found 1 error in 1 file (checked 5 source files)

To Reproduce
Code to reproduce the behaviour:

mypy gazpacho

Expected behavior

mypy should run without error

Environment:

  • OS: macOS
  • Version: 1.0

.text is empty on Soup creation

Describe the bug

When I create a Soup object from html that contains an HTML entity, .text is empty.

To Reproduce

Calling .text returns an empty string:

from gazpacho import Soup

html = """<p>&pound;682m</p>"""

soup = Soup(html)
print(soup.text)
''

Expected behavior

Should output:

print(soup.text)
'£682m'

Environment:

  • OS: macOS
  • Version: 1.1

Additional context

Inspired by this S/O question

AttributeError: 'Soup' object has no attribute 'decode'

I tried this code:

all_urls = [link.attrs['href'] for link in Soup(get(browser_link)).find('a')]

and I got AttributeError: 'Soup' object has no attribute 'decode'. What should I check? Where is the mistake in my code?

Full info:

File "C:\webscraper\lib\site-packages\gazpacho\get.py", line 29, in get
    url = sanitize(url)
  File "C:\webscraper\lib\site-packages\gazpacho\utils.py", line 128, in sanitize
    scheme, netloc, path, query, fragment = urlsplit(url)
  File "C:\Program Files\Python39\lib\urllib\parse.py", line 455, in urlsplit
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\Program Files\Python39\lib\urllib\parse.py", line 125, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\Program Files\Python39\lib\urllib\parse.py", line 109, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\Program Files\Python39\lib\urllib\parse.py", line 109, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'Soup' object has no attribute 'decode'

Parser misinterprets ">"


Does not find autoclosing tag (ie: <img />)

Soup.find('img') fails to find the tag.

To Reproduce

s = Soup('<div><img src="" /></div>')
print(s.find('img'))
# None

Expected behavior
It should return the img tag.

As a matter of fact

s = Soup('<div><img src="" ></img></div>')

works.
Environment:

  • OS: Linux

Find the parent of a node

Is your feature request related to a problem? Please describe.

I would like to be able to find the parent of a node.

Describe the solution you'd like

I think soup.parent would be a nice UI.
For example if we have:

<ul>
     <li class="my-class"></li>
</ul>

We can get the ul tag with ul_tag = soup.find("li", attrs={"class": "my-class"}).parent

Describe alternatives you've considered

An alternative would be some nested filtering perhaps like:

li = soup.find("li", "my-class")
ul = soup.find("ul", with_child=li)

Additional information

Thanks for your work on the package!

separate find into find and find_one

Is your feature request related to a problem? Please describe.
Right now it's hard to reason about the behaviour of the find method. If it finds one element it will return a Soup object, if it finds more than one it will return a list of Soup objects.

Describe the solution you'd like
Separate find into a find method and find_one method.

Describe alternatives you've considered
Keep it and YOLO?

Additional context
Conversation with Michael Kennedy:

If I were designing the api, i'd have that always return a List[Node] (or whatever the class is). Then add two methods:

  • find() -> List[Node]
  • find_one() -> Optional[Node]
  • one() -> Node (exception if there are zero, or two or more, nodes)

"first" mode on "find" causes exception if no elements found

Describe the bug
When using the find method on the Soup class, and there are no results found, an exception is thrown if you used mode="first".

To Reproduce
Code to reproduce the behaviour:

from gazpacho import Soup
s = Soup("<p>test</p>")
s.find("a", mode="first")

Expected behavior
It should return None.

Environment:

  • OS: Linux - Ubuntu
  • Version: 19.10

Additional context
None

No longer maintained

Hi there! Just wanted to throw out there that your README.md states that this project is actively maintained, while your last commits were some years ago. I think it might be useful to remove the 'actively maintained' part from there :p. Cheers!

Pretty Print!

Is your feature request related to a problem? Please describe.

gazpacho should be able to take html that looks like this:

html = """<ul><li>Item</li><li>Item</li></ul>"""

Describe the solution you'd like

And through some kind of magic turn it into this:

<ul>
  <li>Item</li>
  <li>Item</li>
</ul>

Describe alternatives you've considered

A quick prototype:

from xml.dom.minidom import parseString as string_to_dom

def prettify(string, html=True):
    dom = string_to_dom(string)
    ugly = dom.toprettyxml(indent="  ")
    # drop the blank lines that toprettyxml likes to insert
    split = list(filter(lambda x: len(x.strip()), ugly.split('\n')))
    if html:
        # drop the leading <?xml ?> declaration
        split = split[1:]
    pretty = '\n'.join(split)
    return pretty
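
Running the prototype on the flat example above should produce the desired rendering:

html = """<ul><li>Item</li><li>Item</li></ul>"""
print(prettify(html))
# <ul>
#   <li>Item</li>
#   <li>Item</li>
# </ul>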

attrs method output is changed when using find

find changes the content of attrs

When using the find method on a Soup object, the content of attrs is overwritten by the parameter attrs in find.

Steps to reproduce the issue

Try the following:

from gazpacho import Soup

div = Soup("<div id='my_id' />").find("div")
print(div.attrs)
div.find("span", {"id": "invalid_id"})
print(div.attrs)

The expected output is the following, because we print the attributes of the div twice:

{'id': 'my_id'}
{'id': 'my_id'}

But instead you actually receive:

{'id': 'my_id'}
{'id': 'invalid_id'}

which is wrong.

Environment:

  • OS: Linux
  • Version: 1.1

My current workaround is to save the attributes before I execute find.

Improve issue and feature request templates

Is your feature request related to a problem? Please describe.
Improve the .github issue template

Describe the solution you'd like
I would like a better issue and feature request template in the .github folder. The format I would like: the bolded headings become proper sections, and the help lines below them become comments.

Describe alternatives you've considered
None

Additional context
What I would like is instead of:

---
name: Bug report
about: Create a report to help gazpacho improve
title: ''
labels: ''
assignees: ''
---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Code to reproduce the behaviour:

```python

```

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment:**
 - OS: [macOS, Linux, Windows]
 - Version: [e.g. 0.8.1]

**Additional context**
Add any other context about the problem here.

It should be something like:

---
name: Bug report
about: Create a report to help gazpacho improve
title: ''
labels: ''
assignees: ''
---

## Describe the bug
<!-- A clear and concise description of what the bug is. -->

## To Reproduce
<!-- Code to reproduce the behaviour: -->

```python
# code
```

## Expected behavior
<!-- A clear and concise description of what you expected to happen. -->

**Environment:**
 - OS: [macOS, Linux, Windows]
 - Version: [e.g. 0.8.1]

## Additional context
<!-- Add any other context about the problem here. Delete this section if not applicable -->

Or something like this

lxml requirement is not installed with gazpacho

Describe the bug
It seems that the lxml dependency is not installed by this package on installation.

To Reproduce
I just pip installed the project and this code (which is in the docs) fails.

from gazpacho import get, Soup
import pandas as pd

url = 'https://www.capfriendly.com/browse/active/2020/salary?p=1'
response = get(url)
soup = Soup(response)
df = pd.read_html(str(soup.find('table')))[0]
print(df[['PLAYER', 'TEAM', 'SALARY', 'AGE']].head(3))

This is the error.

venv/lib/python3.7/site-packages/pandas/io/html.py in _parser_dispatch(flavor)
    846     else:
    847         if not _HAS_LXML:
--> 848             raise ImportError("lxml not found, please install it")
    849     return _valid_parsers[flavor]
    850 

ImportError: lxml not found, please install it

Expected behavior

Everything here is fixed with a manual pip install lxml. But if lxml is a dependency then I would expect it to be installed automatically when gazpacho is installed.

Environment:

  • OS: macOS
  • Version: 0.9.2

"in" support

Is your feature request related to a problem? Please describe.

I'd like to use gazpacho for testing the HTML output of my Django application. A natural way to do this would be to search for a given HTML chunk. Therefore I'd like to be able to use the in operator on a Soup to check for that HTML chunk. This probably has utility outside of tests.

Describe the solution you'd like

I imagine this working something like:

response = self.client.get('/')  # django's test client
assert response.status_code == HTTPStatus.OK
body = response.content.decode()
assert '<h1><a href="/">Home page</a></h1>' in Soup(body)
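
As a stopgap outside the library, a naive text-level version is possible, though it only normalizes whitespace between tags and is not element-aware, so it is much weaker than the proposal:

import re

def html_contains(page_html, chunk):
    # collapse whitespace between tags so formatting differences don't matter
    squash = lambda s: re.sub(r">\s+<", "><", s.strip())
    return squash(chunk) in squash(page_html)

# html_contains(body, '<h1><a href="/">Home page</a></h1>')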

Describe alternatives you've considered

It's possible to reproduce this with find() and making assertions on the contents of the found node, but much more complicated since it requires assertions for each node in the tree.

Additional information

Django already supports a similar assertion called assertInHTML. However this relies on normalizing the HTML text and making text assertions, so it's clunky around matching the actual elements.

Get all the child elements of a Soup object

Is your feature request related to a problem? Please describe.
I would like to try adding a .children() method to the Soup object that can list all the child elements of the Soup object.

Describe the solution you'd like
I would make a regex pattern to match each inner element and return a list of Soup() objects with those elements. I might also try to make an option for recurse or not.

Describe alternatives you've considered
All that I can think of is doing the same thing mentioned above in the scraping code

Additional context
None

Use typing.Literal for mode

Is your feature request related to a problem? Please describe.

Only certain "mode" strings are accepted by Soup.find(), and underneath that, Soup._triage(). The type annotation str lets typos through, which only surface as a runtime error.

Describe the solution you'd like

Use typing.Literal to specify the known mode names, so that such bugs can be caught at type checking time.
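
A sketch of the annotation (the mode names follow the "auto"/"first"/"all" values used elsewhere in these issues; the signature itself is an assumption):

from typing import List, Literal, Optional, Union

Mode = Literal["auto", "first", "all"]

class Soup:
    def find(
        self,
        tag: str,
        attrs: Optional[dict] = None,
        *,
        partial: bool = True,
        mode: Mode = "auto",
    ) -> Union["Soup", List["Soup"], None]:
        ...

# mypy would now flag a typo like soup.find("a", mode="frist") at check time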

Describe alternatives you've considered
n/a

Additional context
n/a

A select function similar to Beautiful Soup's

Is your feature request related to a problem? Please describe.
It's great to be able to run find and then find within the initial result, but it seems more readable to be able to find based on CSS selectors.

Describe the solution you'd like

selector = '.foo img.bar'
soup.select(selector) # this would return any img item with the class "bar" inside of an object with the class "foo"

User Agent Rotation / Faking

Is your feature request related to a problem? Please describe.

It might be nice if gazpacho had the ability to rotate/fake a user agent

Describe the solution you'd like

Sort of like this but more primitive. (Importantly gazpacho does not want to take on any dependencies)

Additional context

Right now gazpacho just spoofs the latest Firefox User Agent
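
A dependency-free sketch of what rotation might look like (the agent strings are illustrative, and it assumes get honors a headers parameter for User-Agent):

import random

from gazpacho import get

# a small, hand-maintained pool of plausible agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

html = get("https://scrape.world/books", headers={"User-Agent": random.choice(USER_AGENTS)})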

Enable strict matching for find

Describe the bug
Right now match has the ability to be strict. This functionality is presently not enabled for find.

To Reproduce
Code to reproduce the behaviour:

from gazpacho import Soup, match

match({'foo': 'bar'}, {'foo': 'bar baz'})
# True

match({'foo': 'bar'}, {'foo': 'bar baz'}, strict=True)
# False

Expected behavior
The find method should be forgiving (partial match) to protect ease of use and maintain backwards compatibility, but there should be an argument to enable strict/exact matching that piggybacks on match.

Environment:

  • OS: macOS
  • Version: 0.7.2

Format/Pretty Print can't handle void tags

Describe the bug

Soup can handle and format matched tags no problem:

from gazpacho import Soup
html = """<ul><li>Item 1</li><li>Item 2</li></ul>"""
Soup(html)

Which correctly formats to:

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>

But it can't handle void tags (like img)...

To Reproduce

For example, this bit of html:

html = """<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">"""
Soup(html)

Will fail to format on print:

<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">

Expected behavior

Ideally Soup formats it as:

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>
<img src="image.png">

Environment:

  • OS: macOS
  • Version: 1.1

Additional context

The problem has to do with the underlying parseString function being unable to handle void tags:

from xml.dom.minidom import parseString as string_to_dom
string_to_dom(html)

Possible solution: turn void tags into self-closing tags on input, and then transform them back to void tags on print.
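
A sketch of that input-side conversion (the void-tag list is abridged and the regex is untested against edge cases like ">" inside attribute values):

import re

VOID_TAGS = ["area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta"]

def close_void_tags(html):
    # rewrite <img src="..."> as <img src="..."/> so minidom can parse it;
    # the lookbehind leaves already self-closed tags alone
    pattern = re.compile(rf"<({'|'.join(VOID_TAGS)})(\b[^>]*?)(?<!/)>")
    return pattern.sub(r"<\1\2/>", html)

# close_void_tags('<ul><li>Item 1</li></ul><img src="image.png">')
# -> '<ul><li>Item 1</li></ul><img src="image.png"/>'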
