Comments (3)
You need to strip the script and style tags manually.
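from selectolax.parser import HTMLParser

def get_text(html):
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    # Remove <script> and <style> elements so their contents are not part of the text.
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    # Join the remaining text nodes, one per line.
    text = tree.body.text(separator='\n')
    return text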
from selectolax.
@mpage-uga
In the HTML spec's serialization algorithm, the "If current node is a Text node" section says:
If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally. Otherwise, append the value of current node's data IDL attribute, escaped as described below.
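The rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the spec text, not selectolax code: `RAW_TEXT_PARENTS`, `escape_text`, and `serialize_text_node` are hypothetical names, and the escaping step assumes the spec's rules for text outside attributes (`&`, U+00A0, `<`, `>`).

```python
# Parents whose text children are serialized literally, per the quoted spec text.
RAW_TEXT_PARENTS = {"style", "script", "xmp", "iframe", "noembed", "noframes", "plaintext"}


def escape_text(data):
    """Escape text-node data (assumed non-attribute escaping rules)."""
    return (
        data.replace("&", "&amp;")  # must run first so later entities are not re-escaped
        .replace("\u00a0", "&nbsp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def serialize_text_node(parent_tag, data, scripting_enabled=True):
    """Return the serialized form of a text node given its parent's tag name."""
    parent = parent_tag.lower()
    if parent in RAW_TEXT_PARENTS or (parent == "noscript" and scripting_enabled):
        return data  # appended literally
    return escape_text(data)


print(serialize_text_node("script", "if (a < b) {}"))
print(serialize_text_node("p", "a < b & c"))
```

Because script and style contents are ordinary text nodes that are serialized literally, they show up in extracted text unless the elements are removed first.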
Perfect!
Thanks for this very good tool.