Comments (2)
So, a little investigation writeup:
To construct the context object for the request handler, HttpCrawler._runRequestHandler() calls this._parseResponse(), which, in turn, calls this._parseHTML().
_parseHTML() is overridden in CheerioCrawler, and returns
{
    get body() {
        return isXml ? "..." : $.html({ decodeEntities: false })
    },
    // ...
}
Apparently, this was added to "save memory for highly parallel runs" (source). However, we currently don't get this effect anyway: the result of _parseHTML() is immediately destructured here, so by the time requestHandler runs, context.body is already a string, not a getter. (You can easily verify this by putting a breakpoint, a debugger; statement, or a console.log() call into the getter and checking when, and with what call stack, it is called.)
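The eager evaluation described above can be reproduced without Crawlee at all. The following is a minimal sketch, assuming the parse result is spread or destructured into the context; `parseHtmlLike` and `getterCalls` are hypothetical names used only for illustration, not part of Crawlee.

```typescript
// Counts how many times the (expensive) serialization runs.
let getterCalls = 0;

// Hypothetical stand-in for the object returned by _parseHTML().
function parseHtmlLike(): { readonly body: string } {
    return {
        get body(): string {
            getterCalls += 1;
            return "<html>serialized</html>";
        },
    };
}

// Creating the object does not run the getter yet.
const parsedLazily = parseHtmlLike();
const callsBeforeSpread = getterCalls;

// Spreading (like destructuring) reads every own enumerable property,
// so the getter runs right here, before any request handler would.
const context = { ...parsedLazily };
const callsAfterSpread = getterCalls;

// On the copy, `body` is a plain data property holding a string.
const descriptor = Object.getOwnPropertyDescriptor(context, "body");
const isStillGetter = typeof descriptor?.get === "function";
```

In this sketch the serialization has already happened by the time `context` exists, so the laziness of the getter no longer saves any memory.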
Also, I don't think there's a way to have $.html() return what we want when a website responds with JSON under content-type: text/html:
/* consider a website that responds with
   content-type: text/html
   {"foo": "<p>bar < " baz</p>"}
*/
const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: false }));
    },
});
// --> {"foo": "<p>bar < " baz</p>"}
const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: true }));
    },
});
// --> {"foo": "<p>bar < " baz</p>"}
const crawler = new HttpCrawler({
    requestHandler({ body }) {
        console.log(body.toString("utf8"));
    },
});
// --> {"foo": "<p>bar < " baz</p>"}
The first case is the current behavior of ctx.body, and imo it's bad, because it breaks HTML. It's also really confusing that CheerioCrawlingContext.body and HttpCrawlingContext.body return different values (that's kind of what prompted the original report by @gullmar).
The second case is probably also unjustifiable, since it would change the current behavior heavily (websites probably contain a lot of quotes, &s, and idk what else cheerio decides to escape).
Therefore, I propose we remove the body getter and instead return the original body buffer via .toString("utf8"), so that we get the same data as HttpCrawler but keep body as a string to avoid breaking Actors.
I'm also OK with calling this a wontfix, since a website returning JSON with content-type: text/html is just weird.
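The proposal above (decode the raw bytes once and expose them as a string) might look roughly like the sketch below. It is not Crawlee's actual code: `ParsedResult` and `buildParsedResult` are hypothetical names, the payload is made up for illustration, and `TextDecoder` is used here as a stand-in for calling `.toString("utf8")` on a Node.js Buffer.

```typescript
// Hypothetical shape of the parse result; not Crawlee's actual types.
interface ParsedResult {
    body: string;
}

// Assumption: `rawBody` holds the response bytes exactly as received.
function buildParsedResult(rawBody: Uint8Array): ParsedResult {
    // Decode once, bypassing any HTML serializer, so nothing in the
    // payload gets escaped, unescaped, or otherwise rewritten.
    return { body: new TextDecoder("utf-8").decode(rawBody) };
}

// Made-up JSON payload served with content-type: text/html.
const payload = '{"foo": "<p>bar &lt; &quot; baz</p>"}';
const parsed = buildParsedResult(new TextEncoder().encode(payload));
```

With this approach, `parsed.body` is still a plain string, so handlers that treat body as a string keep working, and the value matches what the server actually sent.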
Possibly (quite remotely) related to #2317