
domhandler's Introduction

domhandler

The DOM handler creates a tree containing all nodes of a page. The tree can be manipulated using the domutils or cheerio libraries and rendered using dom-serializer.

Usage

const handler = new DomHandler([ <func> callback(err, dom), ] [ <obj> options ]);
// const parser = new Parser(handler[, options]);

Available options are described below.

Example

const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const rawHtml =
    "Xyz <script language= javascript>var foo = '<<bar>>';</script><!--<!-- Waah! -- -->";
const handler = new DomHandler((error, dom) => {
    if (error) {
        // Handle error
    } else {
        // Parsing completed, do something
        console.log(dom);
    }
});
const parser = new Parser(handler);
parser.write(rawHtml);
parser.end();

Output:

[
    {
        data: "Xyz ",
        type: "text",
    },
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<<bar>>';",
                type: "text",
            },
        ],
    },
    {
        data: "<!-- Waah! -- ",
        type: "comment",
    },
];

Option: withStartIndices

Add a startIndex property to nodes. When the parser is used in a non-streaming fashion, startIndex is an integer indicating the position of the start of the node in the document. The default value is false.

Option: withEndIndices

Add an endIndex property to nodes. When the parser is used in a non-streaming fashion, endIndex is an integer indicating the position of the end of the node in the document. The default value is false.
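As a rough illustration of what the two index options make possible, the sketch below cuts each node's original markup back out of the source string. The nodes are hand-written stand-ins rather than real parser output, sourceOf is a hypothetical helper, and endIndex is assumed to be inclusive (the index of the node's last character).

```javascript
// Hand-written stand-ins for nodes parsed with both index options enabled;
// real nodes would come from DomHandler, not from literals like these.
const html = "<p>Hi</p><!--x-->";
const nodes = [
    { type: "tag", name: "p", startIndex: 0, endIndex: 8 },
    { type: "comment", data: "x", startIndex: 9, endIndex: 16 },
];

// Hypothetical helper. endIndex is treated as inclusive here, hence the + 1.
function sourceOf(node, source) {
    return source.slice(node.startIndex, node.endIndex + 1);
}

const slices = nodes.map((node) => sourceOf(node, html));
console.log(slices);
```

If endIndex turned out to be exclusive in a given version, the + 1 would have to go; checking one known node against the source string is a cheap way to find out.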


License: BSD-2-Clause

Security contact information

To report a security vulnerability, please use the Tidelift security contact. Tidelift will coordinate the fix and disclosure.

domhandler for enterprise

Available as part of the Tidelift Subscription

The maintainers of domhandler and thousands of other packages are working with Tidelift to deliver commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use. Learn more.

domhandler's People

Contributors

aknuds1, awwright, bryant1410, cvrebert, delgan, dependabot-preview[bot], dependabot[bot], ericjeney, fb55, greenkeeperio-bot, jenil94, jugglinmike, nageshlop, neet, orta, phated, rachelmulligan, sidx1024, trysound, voxpelli


domhandler's Issues

Add support for DOM standard interfaces

This is a great library and it's already been extremely useful for a number of things. 🙂👍

Is there a design reason why the APIs deviate from the DOM standard, or is it just the way it ended up?

I won't go into too much detail here, but the fact that property names and functions have different names and types is extremely disruptive, if you're trying to integrate with existing code and tests, etc. - and especially in TypeScript.

To be clear, I'm not asking for or expecting a full implementation of the DOM standard - I'm not asking for any new features per se. But even code that requires a subset of a DOM interface does not immediately work without remapping the node model to something compatible first.

Would you be at all open to changing this? I might be able to help. (It would be a breaking change, of course.)

(I apologize if this has already been asked and answered - it seems unlikely I could be the first person to ask, but I did search your issues and, to my surprise, I didn't find anything.)

Distinguish between Node.childNodes and Element.children

This package is proving useful in a script I'm writing to convert SVG files to config files! One issue I noticed, though, is that it appears to conflate children and childNodes.

The childNodes getter should be available on all node types and return a list of nodes (Node[]). (See https://developer.mozilla.org/en-US/docs/Web/API/Node/childNodes)

A children getter, however, should only be available on Elements, and it should only return a list of Element child nodes (aka HTMLCollection), excluding Text and Comment nodes. If a setter is provided, it should only take HTMLCollection as an argument (See https://developer.mozilla.org/en-US/docs/Web/API/Element/children)

In src/node.ts, there's a comment above the childNodes getter that reads "Same as children. DOM spec-compatible alias", but I don't see where that's stated in https://dom.spec.whatwg.org. On the contrary, at https://dom.spec.whatwg.org/#dom-parentnode-children, it appears to be in agreement with the Mozilla developer documentation.
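Until the two properties are separated, one possible workaround is to filter the child list down to elements by hand. This is only a sketch: elementChildren is a hypothetical helper, the svg object is a hand-written stand-in for a parsed element, and the set of element type strings ("tag", "script", "style") is an assumption based on the types seen in parsed output.

```javascript
// Node types treated as elements in this sketch (an assumption, see above).
const ELEMENT_TYPES = new Set(["tag", "script", "style"]);

// Hypothetical helper: element-only view of a node's child list,
// similar in spirit to the DOM's Element.children.
function elementChildren(node) {
    const all = node.childNodes || node.children || [];
    return all.filter((child) => ELEMENT_TYPES.has(child.type));
}

// Hand-written stand-in for a parsed <svg> element.
const svg = {
    type: "tag",
    name: "svg",
    children: [
        { type: "text", data: "\n  " },
        { type: "tag", name: "path", children: [] },
        { type: "comment", data: " note " },
        { type: "tag", name: "circle", children: [] },
    ],
};

const names = elementChildren(svg).map((el) => el.name);
console.log(names);
```

The result is a plain array, not a live HTMLCollection, so it would need to be recomputed after the tree changes.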

How to delete an element?

What is the best way to delete an element?

Example: Remove the <book id="bk102"> element from

<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
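A minimal sketch of one way this could be done, assuming nodes carry parent, prev, next and children links: find the element, then unlink it from its parent and siblings. Both findElement and detach are hypothetical helpers written for this illustration, and the tree is hand-built rather than parsed. The domutils library mentioned in the README has a removeElement function intended for this job.

```javascript
// Hypothetical helper: depth-first search for the first element matching a test.
function findElement(nodes, test) {
    for (const node of nodes) {
        if (node.type === "tag" && test(node)) return node;
        const hit = findElement(node.children || [], test);
        if (hit) return hit;
    }
    return null;
}

// Hypothetical helper: unlink a node from its parent and its siblings.
function detach(node) {
    if (node.prev) node.prev.next = node.next;
    if (node.next) node.next.prev = node.prev;
    if (node.parent) {
        const siblings = node.parent.children;
        const index = siblings.indexOf(node);
        if (index !== -1) siblings.splice(index, 1);
    }
    node.parent = node.prev = node.next = null;
}

// Hand-built stand-in for the parsed <catalog> document (whitespace text nodes omitted).
const catalog = { type: "tag", name: "catalog", attribs: {}, children: [], parent: null, prev: null, next: null };
const bk101 = { type: "tag", name: "book", attribs: { id: "bk101" }, children: [], parent: catalog, prev: null, next: null };
const bk102 = { type: "tag", name: "book", attribs: { id: "bk102" }, children: [], parent: catalog, prev: bk101, next: null };
bk101.next = bk102;
catalog.children.push(bk101, bk102);

const target = findElement([catalog], (el) => el.attribs.id === "bk102");
detach(target);
const remainingIds = catalog.children.map((el) => el.attribs.id);
console.log(remainingIds);
```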

Open Source license?

Are you giving people permission to use this code in their projects? If so, please let us know. The best way is to put an open source license in your code project, or to indicate that you are giving permission for people to copy, use, and modify your code.

Here's the license text. You can just add it as a file in your project or put it in the README. This way we'll know you are giving permission to use the code, and it will also require users of your code to maintain your copyright, so that when your code is used you get credit for the code you created and shared.

http://opensource.org/licenses/BSD-2-Clause

Thanks!
Gil

Cannot read property 'Tag' of undefined.

I have a React.js project which indirectly uses domhandler v4.2.0, via cheerio I believe.

It has worked fine for months, and then suddenly my project started throwing this error when I try to build it.

C:\product-app\node_modules\domhandler\lib\node.js:32
    [domelementtype_1.ElementType.Tag, 1],
                                  ^

TypeError: Cannot read property 'Tag' of undefined
    at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\node.js:32:35)
    at Module._compile (internal/modules/cjs/loader.js:1158:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
    at Module.load (internal/modules/cjs/loader.js:1002:32)
    at Function.Module._load (internal/modules/cjs/loader.js:901:14)
    at Module.require (internal/modules/cjs/loader.js:1044:19)
    at require (internal/modules/cjs/helpers.js:77:18)
    at Object.<anonymous> (C:\product-app\node_modules\domhandler\lib\index.js:15:14)
    at Module._compile (internal/modules/cjs/loader.js:1158:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)

Any ideas what might be causing it?

I'm running node.js 12.16.1 with typescript 3.9.10.

How to get modified HTML?

I am modifying all the objects inside the handler function, and I want to get the modified HTML back. How can I do that?
There is no complete usage example in this documentation or in the htmlparser2 documentation.
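The README names dom-serializer as the library for rendering a tree back to markup, so in real code the modified dom would be handed to it. To show the overall shape without depending on that package, here is a toy stand-in: toHtml is a hypothetical function that skips escaping, void elements and self-closing tags, and the dom array is hand-written rather than produced by the parser.

```javascript
// Toy serializer for illustration only: no escaping, no void elements,
// no self-closing tags. A real serializer handles those cases.
function toHtml(nodes) {
    return nodes
        .map((node) => {
            if (node.type === "text") return node.data;
            if (node.type === "comment") return `<!--${node.data}-->`;
            const attrs = Object.entries(node.attribs || {})
                .map(([key, value]) => ` ${key}="${value}"`)
                .join("");
            return `<${node.name}${attrs}>${toHtml(node.children || [])}</${node.name}>`;
        })
        .join("");
}

// Hand-written stand-in for the dom array passed to the handler callback.
const dom = [
    {
        type: "tag",
        name: "p",
        attribs: { class: "intro" },
        children: [{ type: "text", data: "Hello" }],
    },
];

// Modify the tree in place, then serialize it back to a string.
dom[0].attribs.class = "lead";
dom[0].children[0].data = "Hello, world";
const output = toHtml(dom);
console.log(output);
```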

TypeError: document.children.find is not a function

This happens when running serverside.
In my case inside Java Nashorn.
I typically have to Polyfill stuff.
So that might be the case here too.

Going to continue looking for "DOM" polyfills. Any suggestions on the way? :)

`withDomLvl1` option support prevents use in some browsers.

The withDomLvl1 option uses code, particularly the declarative syntax for NodePrototype (https://github.com/fb55/domhandler/blob/master/index.js#L78-L90), that causes issues in some browsers.

Can this be abstracted away to another module?

test/cases/24-with-start-indices failing

Hi, I'm having this failure of the last test with nodejs v4.6.1:

  1)  withStartIndices adds correct startIndex properties:
     TypeError: Cannot read property 'startIndex' of null
    at DomHandler._addDomElement (/root/debian/node-cheerio/node-domhandler/index.js:71:36)
    at DomHandler.onprocessinginstruction (/root/debian/node-cheerio/node-domhandler/index.js:175:7)
    at Parser.ondeclaration (/usr/lib/nodejs/htmlparser2/lib/Parser.js:254:13)
    at Tokenizer._stateInDeclaration (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:336:13)
    at Tokenizer._parse (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:674:9)
    at Tokenizer.write (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:627:7)
    at Tokenizer.end (/usr/lib/nodejs/htmlparser2/lib/Tokenizer.js:820:17)
    at Parser.end (/usr/lib/nodejs/htmlparser2/lib/Parser.js:322:18)
    at Parser.parseComplete (/usr/lib/nodejs/htmlparser2/lib/Parser.js:314:7)
    at Context.<anonymous> (/root/debian/node-cheerio/node-domhandler/test/tests.js:46:10)
    at callFn (/usr/lib/nodejs/mocha/lib/runnable.js:223:21)
    at Test.Runnable.run (/usr/lib/nodejs/mocha/lib/runnable.js:216:7)
    at Runner.runTest (/usr/lib/nodejs/mocha/lib/runner.js:373:10)
    at /usr/lib/nodejs/mocha/lib/runner.js:451:12
    at next (/usr/lib/nodejs/mocha/lib/runner.js:298:14)
    at /usr/lib/nodejs/mocha/lib/runner.js:308:7
    at next (/usr/lib/nodejs/mocha/lib/runner.js:246:23)
    at Immediate._onImmediate (/usr/lib/nodejs/mocha/lib/runner.js:275:5)
    at processImmediate [as _immediateCallback] (timers.js:383:17)

Any idea why this could happen? Thanks, Paolo

version 4.2.0 error

$ npm update             
npm ERR! code ETARGET
npm ERR! notarget No matching version found for domhandler@^4.2.0.
npm ERR! notarget In most cases you or one of your dependencies are requesting
npm ERR! notarget a package version that doesn't exist.

incorrect version of @types/domhandler

You use "@types/htmlparser2": "^3.10.1", which in turn uses @types/[email protected].
@types/domhandler declares a DomElement interface, but the implementation in your library doesn't export it, which is why the build always fails.

https://github.com/DefinitelyTyped/DefinitelyTyped/blob/02db5ccb68be79df3f24cfc323bad5a609ff4d5f/types/domutils/index.d.ts

node_modules/@types/domutils/index.d.ts:6:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?

6 import { DomElement } from "domhandler";
           ~~~~~~~~~~

node_modules/@types/htmlparser2/index.d.ts:17:10 - error TS2614: Module '"project/node_modules/domhandler/lib"' has no exported member 'DomElement'. Did you mean to use 'import DomElement from "project/node_modules/domhandler/lib"' instead?

17 export { DomElement, DomHandlerOptions, DomHandler, Element, Node } from 'domhandler';
            ~~~~~~~~~~

node_modules/@types/sanitize-html/index.d.ts:17:10 - error TS2459: Module '"project/node_modules/htmlparser2/lib"' declares 'Options' locally, but it is not exported.

17 import { Options } from "htmlparser2";
            ~~~~~~~

  node_modules/htmlparser2/lib/index.d.ts:5:14
    5 declare type Options = ParserOptions & DomHandlerOptions;
                   ~~~~~~~
    'Options' is declared here.

Update npm version

Hi

I'm using your great product, and I really need the new "EndIndices" feature.
I'm using the code as an npm package, but currently I have to point my config file at the git repository itself because the version on npm has not been updated.

Can you please update the npm package to the latest version?

Many thanks
Ysrael

patch release - Add npmignore test

The fix made in #51 hasn't been applied to major version 2 of this package. It would be nice for a version 2.4.3 to be published with such a fix as well so that dependents of v2 can also reap the benefit.

Incorrect implementation in README

This code imports DomHandler, however it is never used in the following code.

const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const rawHtml =
    "Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->";
const handler = new htmlparser.DomHandler(function(error, dom) {
    if (error) {
        // Handle error
    } else {
        // Parsing completed, do something
        console.log(dom);
    }
});
const parser = new Parser(handler);
parser.write(rawHtml);
parser.end();

different parsed result from README

Hi, I tried the example code in the README, but there is no comment element after parsing.

in README:

[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<bar>';<",
                type: "text",
            },
        ],
    },
    {
        data: "<!-- Waah! -- ",
        type: "comment",
    },
];

with [email protected] & [email protected]

[
    // ignoring first element
    {
        type: "script",
        name: "script",
        attribs: {
            language: "javascript",
        },
        children: [
            {
                data: "var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->",
                type: "text",
            },
        ],
    }
];

I am not sure which result I should expect.

Thanks in advance!

Feature request: attribs indices

Would it be possible, when withStartIndices and withEndIndices are set, for a node to have an attribsIndices property (or something like that) containing the start and end indices of each attribute's name and value?

It also would be great to have not only offsets but line and column numbers as well.

Option to include character indexes in output?

Would it be possible to add an option to include the start index of each node? That is, the node's starting character index in the original markup string.

For example...

Xyz <script language= javascript>var foo = '<<bar>>';< /  script><!--<!-- Waah! -- -->
[{
    data: 'Xyz ',
    type: 'text',
    startIndex: 0
}, {
    type: 'script',
    name: 'script',
    startIndex: 4,
    attribs: {
        language: 'javascript'
    },
    children: [{
        data: 'var foo = \'<bar>\';<',
        type: 'text',
        startIndex: 33
    }]
}, {
    data: '<!-- Waah! -- ',
    type: 'comment',
    startIndex: 65
}]

I know this value is available on the htmlparser.Parser instance (you can look in parser.startIndex during an onopentag call)... but I don't know if a DomHandler instance could access this property, because its onopentag function doesn't have access to the parser instance that's using it... is there a way?

Feature request: Identify missing ending tags

With a document such as this:

<!doctype html>
<html lang="en">
<title>My Document</title>
<h1>Title</h1>

Notice that it is missing the </html> end tag. Since this tag is optional I often omit it. It would be great if domhandler had some way to indicate that a tag is missing the ending. Something like closing: false would suffice.

Domhandler is not a constructor

I followed the readme, but when I use const handler = new DomHandler(() => {}); I get TypeError: DomHandler is not a constructor

children and childNodes should not be identical

Currently, children and childNodes refer to the same thing, which is what browser-DOM calls childNodes. That's not spec compliant - children is an HTMLCollection containing only elements, without things like text nodes and comments (spec).

I can PR this if you're interested. The change would break backwards compatibility with this library, but the current behaviour breaks compliance with the spec, and it surprised me quite a bit.

serializer serializes elements removed through DomUtils

When you parse this code that contains a duplicate HTML element through parser.parseDocument():

<html>
  <body>
      <h1>Foo</h1>
  </body>
</html>

<html>
  <body>
      <h1>Bar</h1>
  </body>
</html>

... and run this code over the returned document which removes every html element after the first one...

const elements = DomUtils.getChildren(document)

let found = false
for (const child of elements) {
	if (found) {
		DomUtils.removeElement(child)
		continue
	}

	if (child.tagName === 'html') found = true
}

The children do not exist on the syntax tree anymore. But if you then serialize the document using dom-serializer the last HTML tag is back.

I think this has to do with the prev and next helper functions still having a reference to the second html element, but I am unable to confirm this as the htmlparser2 playground (https://astexplorer.net/#/2AmVrGuGVJ) is not capable of outputting json and uses version 5.0.1.

I know HTML documents are not supposed to have more than one html element, which is why I want to remove the extra ones automatically.
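One possible contributing factor, not confirmed anywhere in this report, is that the loop removes items from the same array it is iterating over. That assumes getChildren returns the live children array and that removeElement splices nodes out of it. The sketch below reproduces the effect with plain strings standing in for nodes; removeAfterFirstHtml is a hypothetical function written for this illustration.

```javascript
// Plain strings stand in for the document's top-level nodes.
// `iterate` decides whether the loop walks the live array or a snapshot of it.
function removeAfterFirstHtml(children, iterate) {
    let found = false;
    for (const child of iterate(children)) {
        if (found) {
            // Stand-in for DomUtils.removeElement(child).
            children.splice(children.indexOf(child), 1);
            continue;
        }
        if (child === "html#1") found = true;
    }
    return children;
}

const makeChildren = () => ["html#1", "text#1", "html#2", "text#2"];

// Iterating the live array: each removal shifts later items left,
// so the loop steps over the item that slid into the removed slot.
const live = removeAfterFirstHtml(makeChildren(), (list) => list);

// Iterating a snapshot: every original child is visited exactly once.
const snapshot = removeAfterFirstHtml(makeChildren(), (list) => [...list]);

console.log(live, snapshot);
```

In this model the second html entry survives the live iteration and is removed when iterating over a copy. Whether the same thing happens in the reported case depends on the assumptions above.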

Node#cloneNode does not inherit source indices

Thanks for continuing to maintain this project and adding new features. Currently, Node#cloneNode does not copy the indices from the original object, regardless of whether they are set.

Current behaviour

const [elm] = parseDOM(
  `<div>
    <p>
      Hello world
    </p>
  </div>`, {
  withEndIndices: true,
  withStartIndices: true,
});

assert(elm.startIndex === 0);
assert(elm.endIndex   === 48);

const newElm = elm.cloneNode(true);
newElm.startIndex // --> null :(
newElm.endIndex // --> null :(

Expected behaviour

cloned node to inherit startIndex or endIndex from the original object

Warning about mutating [[Prototype]] of elements Objects

Hello.

When using Firefox with other libraries that make use of domhandler, the following warning message is sometimes printed to the console:

mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the correct initial [[Prototype]] value using Object.create

This comes from these lines:

if (this._options.withDomLvl1) {
    element.__proto__ = element.type === "tag" ? ElementPrototype : NodePrototype;
}

More information about this warning can be found on the MDN documentation or in this SO question.

I would like to know whether you were aware of this and whether you think it needs a fix, or whether implementing the prototype mutation this way is an intentional choice.
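For comparison, here is a minimal sketch of the approach the warning suggests: choose the prototype when the object is created instead of assigning __proto__ afterwards. The NodePrototype below is a stand-in with a single getter, not the library's actual prototype, and createNode is a hypothetical factory.

```javascript
// Stand-in prototype; the library's real NodePrototype is not reproduced here.
const NodePrototype = {
    get firstChild() {
        return this.children ? this.children[0] || null : null;
    },
};

// Hypothetical factory: the prototype is fixed at creation time,
// so no later __proto__ assignment is needed.
function createNode(properties) {
    return Object.assign(Object.create(NodePrototype), properties);
}

const text = { type: "text", data: "hi" };
const element = createNode({ type: "tag", name: "p", children: [text] });

console.log(Object.getPrototypeOf(element) === NodePrototype);
console.log(element.firstChild === text);
```

The trade-off is that the properties have to be known, or at least copied, at creation time, which may not fit how the handler currently builds its nodes.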

DomHandler with async callback

const handler = new DomHandler(null, null, async (element) => {
I currently need a DomHandler that also accepts an async function, so that I can use await inside it.

Publish new npm version

The current npm version is behind master and doesn't have the most recent bug fix. Could you publish a new version? Thanks!

Use LF instead of CRLF

warning: CRLF will be replaced by LF in 
jshint/node_modules/htmlparser2/node_modules/domhandler/index.js.

Normalize HTML entities

It would be a great feature if the parser could resolve HTML entities. For example, if the parser is passed &#x0061;, the resulting "data" for the text node would be 'a' instead of the entity. Similarly, named entities like &nbsp; could be resolved to their Unicode equivalent characters.
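To make the request concrete, here is a rough sketch of the kind of decoding being asked for, applied to a text node's data after parsing. decodeEntities is a hypothetical function and the NAMED table holds only a handful of entities for illustration. htmlparser2's parser also has a decodeEntities option, which may already cover this depending on the version in use.

```javascript
// Tiny illustrative table; a real decoder would cover the full HTML entity list.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', nbsp: "\u00A0" };

// Hypothetical helper: resolves numeric (hex and decimal) and a few named entities.
function decodeEntities(text) {
    return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
        if (body[0] === "#") {
            const isHex = body[1] === "x" || body[1] === "X";
            const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
            // Leave out-of-range code points untouched instead of throwing.
            return code <= 0x10ffff ? String.fromCodePoint(code) : match;
        }
        // Unknown names are left as they were written.
        return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
    });
}

const decoded = decodeEntities("&#x0061;&#98;&nbsp;&amp;&unknown;");
console.log(JSON.stringify(decoded));
```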

"attribs" is an uncomfortable compromise between "attributes" and "attrs"

This could very well be a 'close, won't fix' issue, and if so, that's ok.

But, I wanted to point out how unusual it is to use attribs as the property name that stores HTML attributes. I would much prefer attrs or attributes. If this could be changed, it would be nice.

Like I said though, I understand how complex it can be to change after release.
