Comments (3)
Is it reproducible when you parse different HTML files?
Selectolax calls Modest directly when removing nodes (selectolax/selectolax/modest/node.pxi, lines 491 to 494 in 6b67223).
I think your particular HTML file violates some of the standards and Modest can't process it properly.
If you have more examples, please upload them too.
I think the real problem is that your function can call decompose on already removed objects.
You can try using decompose(recursive=False).
Unfortunately, selectolax is a very thin wrapper over Modest and it does not check for such problems.
I think removing the same node multiple times corrupts memory.
You always need to keep in mind that traverse iterates over all objects, and some of them may already have been deleted or modified.
The nodes_to_remove array contains a parent and some of its children. When you decompose recursively, the child nodes are removed along with the parent. On the next iteration, you try to remove a child object that no longer exists.
This is a common problem: lexbor/lexbor#132 (comment)
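One way to avoid the double removal described above is to filter the list so that only the topmost scheduled nodes are decomposed. The sketch below is a minimal illustration under stated assumptions: ToyNode and topmost_only are hypothetical stand-ins written for this example, not selectolax's Node API, and the real library may need a different way to compare nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # keep identity-based equality and hashing so nodes fit in a set
class ToyNode:
    """Hypothetical stand-in for a parsed HTML node (NOT selectolax's Node)."""
    tag: str
    parent: Optional["ToyNode"] = None
    children: List["ToyNode"] = field(default_factory=list)
    removed: bool = False

    def add(self, tag):
        child = ToyNode(tag, parent=self)
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def decompose(self):
        """Detach the whole subtree; raise instead of silently removing twice."""
        if self.removed:
            raise RuntimeError(f"<{self.tag}> was already removed")
        if self.parent is not None:
            self.parent.children.remove(self)
        for node in self.walk():
            node.removed = True


def topmost_only(nodes):
    """Return nodes that have no ancestor in the same list, without duplicates."""
    scheduled = set(nodes)
    seen = set()
    keep = []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        ancestor = node.parent
        while ancestor is not None and ancestor not in scheduled:
            ancestor = ancestor.parent
        if ancestor is None:  # reached the root without meeting a scheduled ancestor
            keep.append(node)
    return keep


root = ToyNode("body")
div = root.add("div")
span = div.add("span")
p = root.add("p")

nodes_to_remove = [div, span, p]  # contains a parent (div) and its child (span)
safe = topmost_only(nodes_to_remove)
for node in safe:
    node.decompose()  # span goes away together with div, exactly once
remaining = [n.tag for n in root.walk()]
```

In this toy model, decomposing every entry of nodes_to_remove directly would raise on span, which mirrors the situation where the real library is asked to free a node that no longer exists.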
Thank you, much clearer now!