
Comments (2)

mvolfik commented on May 25, 2024

So, a little investigation writeup:

To construct the context object for the request handler, HttpCrawler._runRequestHandler() calls this._parseResponse(), which, in turn, calls this._parseHTML().

_parseHTML() is overridden in CheerioCrawler, and returns

{ 
  get body() {
    return isXml ? "..." : $.html({decodeEntities: true})
  },
  // ...
}

Apparently, this was added there to "save memory for highly parallel runs" (source). However, we currently don't get this effect anyway, since the result of _parseHTML() is immediately destructured here, so by the time requestHandler runs, context.body is already a string and not the getter. (You can trivially verify this by putting a breakpoint, a debugger; statement, or a console.log() into the getter and checking at what moment, and with what call stack, it is called.)
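The destructuring point can be reproduced without Crawlee at all. The following is a minimal sketch in plain JavaScript; parseHtmlSketch, calls, and the returned string are hypothetical stand-ins for illustration, not the actual Crawlee internals.

```javascript
// Hypothetical stand-in for the object returned by _parseHTML().
// `calls` records each time the getter actually runs.
const calls = [];

function parseHtmlSketch() {
    return {
        get body() {
            calls.push("getter evaluated");
            return "<html>serialized</html>";
        },
    };
}

// Destructuring reads the property, so the getter runs right here,
// long before any request handler would touch `body`.
const { body } = parseHtmlSketch();
const callsAfterDestructuring = calls.length;

// The context built from the destructured value holds a plain string,
// not a lazy accessor.
const context = { body };
const descriptor = Object.getOwnPropertyDescriptor(context, "body");

console.log(callsAfterDestructuring); // 1
console.log(typeof descriptor.get);   // "undefined" (data property, no getter)
console.log(typeof context.body);     // "string"
console.log(calls.length);            // 1 (reading context.body did not call the getter again)
```

Under these assumptions, the serialization cost is paid once at destructuring time, whether or not the handler ever reads body.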

Also, I don't think there's a way to have $.html() return what we want. Consider a website that responds with a JSON body and content-type: text/html:

/* consider a website that responds with
content-type: text/html
{"foo": "<p>bar &lt; &quot; baz</p>"}
*/

// Case 1: decodeEntities: true (what the body getter uses today)
const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: true }));
    },
});
// --> {"foo": "<p>bar < " baz</p>"}

// Case 2: decodeEntities: false
const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: false }));
    },
});
// --> {&quot;foo&quot;: &quot;<p>bar &lt; &quot; baz</p>&quot;}

// Case 3: HttpCrawler, raw body decoded as UTF-8
const crawler = new HttpCrawler({
    requestHandler({ body }) {
        console.log(body.toString("utf8"));
    },
});
// --> {"foo": "<p>bar &lt; &quot; baz</p>"}

The first case is the current behavior of ctx.body, and imo it's bad, because it breaks the HTML. Also, it's really confusing that CheerioCrawlingContext.body and HttpCrawlingContext.body return different values (that's kind of what prompted the original report by @gullmar).

The second case is probably also unjustifiable, since it would change current behavior heavily (websites probably contain a lot of quotes, &s and idk what else cheerio decides to escape).


Therefore, I propose we remove the body getter and instead return the original body buffer's .toString("utf8"), so that we get the same data as HttpCrawler while keeping body a string to avoid breaking Actors.
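A rough sketch of what this proposal amounts to, using only Node's built-in Buffer. buildContextSketch and rawBody are hypothetical names for illustration, and the sketch assumes the response is UTF-8 encoded; it is not the actual Crawlee code.

```javascript
// Raw response bytes, as an HTTP client would hand them over (assumed UTF-8).
const rawBody = Buffer.from('{"foo": "<p>bar &lt; &quot; baz</p>"}', "utf8");

// Hypothetical replacement for the getter: decode the buffer directly,
// with no round trip through cheerio's serializer.
function buildContextSketch(buffer) {
    return { body: buffer.toString("utf8") };
}

const cheerioLikeContext = buildContextSketch(rawBody);

console.log(typeof cheerioLikeContext.body);                       // "string"
console.log(cheerioLikeContext.body === rawBody.toString("utf8")); // true
console.log(cheerioLikeContext.body.includes("&lt;"));             // true (entities left untouched)
```

Because the string never passes through an HTML serializer, entities in the payload are neither decoded nor re-escaped.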

It is also ok with me to call this a wontfix, since a website returning JSON with content-type: text/html is just weird.

from crawlee.

barjin commented on May 25, 2024

Possibly (quite remotely) related to #2317
