taoqf / node-html-parser

This project forked from ashi009/node-fast-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.

License: MIT License

JavaScript 1.26% TypeScript 0.48% HTML 98.26%

node-html-parser's Introduction

Fast HTML Parser

Fast HTML Parser is a very fast HTML parser that generates a simplified DOM tree with element query support.

By design, it aims to parse massive HTML files at the lowest possible cost, so performance is the top priority. For this reason, some malformed HTML may not be parsed correctly, but the most common errors are covered (e.g. HTML4-style unclosed <li>, <td>, etc.).

Install

npm install --save node-html-parser

Note: when using Fast HTML Parser in a TypeScript project, the minimum supported TypeScript version is ^4.1.2.

Performance

-- 2022-08-10

html-parser     :24.1595 ms/file ± 18.7667
htmljs-parser   :4.72064 ms/file ± 5.67689
html-dom-parser :2.18055 ms/file ± 2.96136
html5parser     :1.69639 ms/file ± 2.17111
cheerio         :12.2122 ms/file ± 8.10916
parse5          :6.50626 ms/file ± 4.02352
htmlparser2     :2.38179 ms/file ± 3.42389
htmlparser      :17.4820 ms/file ± 128.041
high5           :3.95188 ms/file ± 2.52313
node-html-parser:2.04288 ms/file ± 1.25203
node-html-parser (last release):2.00527 ms/file ± 1.21317

Tested with htmlparser-benchmark.

Usage

import { parse } from 'node-html-parser';

const root = parse('<ul id="list"><li>Hello World</li></ul>');

console.log(root.firstChild.structure);
// ul#list
//   li
//     #text

console.log(root.querySelector('#list'));
// { tagName: 'ul',
//   rawAttrs: 'id="list"',
//   childNodes:
//    [ { tagName: 'li',
//        rawAttrs: '',
//        childNodes: [Object],
//        classNames: [] } ],
//   id: 'list',
//   classNames: [] }
console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>
root.set_content('<li>Hello World</li>');
root.toString();	// <li>Hello World</li>

var HTMLParser = require('node-html-parser');

var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');

Global Methods

parse(data[, options])

Parse the data provided, and return the root of the generated DOM.

data, data to parse

options, parse options

{
  lowerCaseTagName: false,		// convert tag name to lower case (hurts performance heavily)
  comment: false,           		// retrieve comments (hurts performance slightly)
  fixNestedATags: false,    		// fix invalid nested <a> HTML tags
  parseNoneClosedTags: false, 	// close unclosed HTML tags instead of removing them
  voidTag: {
    tags: ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'],	// optional and case insensitive, default value is ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr']
    closingSlash: true	// optional, default false. void tag serialisation, add a final slash <br/>
  },
  blockTextElements: {
    script: true,		// keep text content when parsing
    noscript: true,		// keep text content when parsing
    style: true,		// keep text content when parsing
    pre: true			// keep text content when parsing
  }
}

valid(data[, options])

Parse the data provided, return true if the given data is valid, and return false if not.

Class

classDiagram
direction TB
class HTMLElement{
	this trimRight()
	this removeWhitespace()
	Node[] querySelectorAll(string selector)
	Node querySelector(string selector)
	HTMLElement[] getElementsByTagName(string tagName)
	Node closest(string selector)
	Node appendChild(Node node)
	this insertAdjacentHTML('beforebegin' | 'afterbegin' | 'beforeend' | 'afterend' where, string html)
	this setAttribute(string key, string value)
	this setAttributes(Record<string, string> attrs)
	this removeAttribute(string key)
	string getAttribute(string key)
	this exchangeChild(Node oldNode, Node newNode)
	this removeChild(Node node)
	string toString()
	this set_content(string content)
	this set_content(Node content)
	this set_content(Node[] content)
	this remove()
	this replaceWith((string | Node)[] ...nodes)
	ClassList classList
	HTMLElement clone()
	HTMLElement getElementById(string id)
	string text
	string rawText
	string tagName
	string structuredText
	string structure
	Node firstChild
	Node lastChild
	Node nextSibling
	HTMLElement nextElementSibling
	Node previousSibling
	HTMLElement previousElementSibling
	string innerHTML
	string outerHTML
	string textContent
	Record<string, string> attributes
	[number, number] range
}
class Node{
	<<abstract>>
	string toString()
	Node clone()
	this remove()
	number nodeType
	string innerText
	string textContent
}
class ClassList{
	add(string c)
	replace(string c1, string c2)
	remove(string c)
	toggle(string c)
	boolean contains(string c)
	number length
	string[] value
	string toString()
}
class CommentNode{
	CommentNode clone()
	string toString()
}
class TextNode{
	TextNode clone()
	string toString()
	string rawText
	string trimmedRawText
	string trimmedText
	string text
	boolean isWhitespace
}
Node --|> HTMLElement
Node --|> CommentNode
Node --|> TextNode
Node ..> ClassList

HTMLElement Methods

trimRight()

Trim element from right (in block) after seeing pattern in a TextNode.

removeWhitespace()

Remove whitespace in this subtree.

querySelectorAll(selector)

Query CSS selector to find matching nodes.

Note: Full range of CSS3 selectors supported since v3.0.0.

querySelector(selector)

Query CSS Selector to find matching node. null if not found.

getElementsByTagName(tagName)

Get all elements with the specified tagName.

Note: Use * for all elements.

closest(selector)

Query closest element by css selector. null if not found.

appendChild(node)

Append a child node to childNodes.

insertAdjacentHTML(where, html)

Parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.

setAttribute(key: string, value: string)

Set value to key attribute.

setAttributes(attrs: Record<string, string>)

Set attributes of the element.

removeAttribute(key: string)

Remove key attribute.

getAttribute(key: string)

Get key attribute. undefined if not set.
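Because a missing attribute yields undefined rather than null or an empty string, callers may want to normalise the result before using it. The following is a minimal sketch that relies only on the getAttribute behaviour described above; the helper name getAttributeOr and its fallback parameter are hypothetical and not part of the library.

```javascript
// Hypothetical helper (not part of node-html-parser): read an attribute and
// return `fallback` when getAttribute() reports undefined (attribute not set).
function getAttributeOr(element, key, fallback = null) {
  const value = element.getAttribute(key);
  // An empty string is a real value (e.g. title=""), so only undefined falls back.
  return value === undefined ? fallback : value;
}
```

With a parsed tree, this could be called as getAttributeOr(root.querySelector('a'), 'href', '#').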

exchangeChild(oldNode: Node, newNode: Node)

Exchanges given child with new child.

removeChild(node: Node)

Remove child node.

toString()

Same as outerHTML

set_content(content: string | Node | Node[])

Set content. Notice: Do not set content of the root node.

remove()

Remove current element.

replaceWith(...nodes: (string | Node)[])

Replace current element with other node(s).

classList

classList.add(className: string)

Add class name.

classList.replace(old: string, new: string)

Replace class name with another one.

classList.remove(className: string)

Remove class name.

classList.toggle(className: string):void

Toggle class. Remove it if it is already included, otherwise add.

classList.contains(className: string): boolean

Returns true if the classname is already in the classList.

classList.value

Get class names.

clone()

Clone a node.

getElementById(id: string): HTMLElement | null

Get element by its ID.

HTMLElement Properties

text

Get unescaped text value of current node and its children. Like innerText. (slow for the first time)

rawText

Get escaped (as-is) text value of current node and its children. May have &amp; in it. (fast)

tagName

Get or Set tag name of HTMLElement. Notice: the returned value would be an uppercase string.

structuredText

Get structured Text.

structure

Get DOM structure.

firstChild

Get first child node. undefined if no child.

lastChild

Get last child node. undefined if no child.

innerHTML

Set or Get innerHTML.

outerHTML

Get outerHTML.

nextSibling

Returns a reference to the next child node of the current element's parent. null if not found.

nextElementSibling

Returns a reference to the next child element of the current element's parent. null if not found.

previousSibling

Returns a reference to the previous child node of the current element's parent. null if not found.

previousElementSibling

Returns a reference to the previous child element of the current element's parent. null if not found.

textContent

Get or Set textContent of current element, more efficient than set_content.

attributes

Get all attributes of current element. Notice: do not try to change the returned value.

range

Corresponding source code start and end indexes (e.g. [ 0, 40 ]).

node-html-parser's Issues

README should make it clear that `parse('<A>foo</A>')` is not supported due to case sensitivity

Here's a quick example showing this core bug and undocumented gotcha, which I imagine a large percentage of users would encounter. I'm not sure whether there is a known option that fixes this; I could not find one.

> const { parse } = require('node-html-parser');
> const root = parse('<A href="https://bitcoin.org/en/exchanges">https://bitcoin.org/en/exchanges</A>');
> root.querySelector('a')
null

vs.

> const { parse } = require('node-html-parser');
> const root = parse('<A href="https://bitcoin.org/en/exchanges">https://bitcoin.org/en/exchanges</A>'.toLowerCase());
> root.querySelector('a')
HTMLElement {
  childNodes: [
    TextNode {
      childNodes: [],
      nodeType: 3,
      rawText: 'https://bitcoin.org/en/exchanges'
    }
  ],
  tagName: 'a',
  rawAttrs: 'href="https://bitcoin.org/en/exchanges"',
  parentNode: null,
  classNames: [],
  nodeType: 1
}
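Calling .toLowerCase() on the whole document, as in the second snippet, also lowercases URLs and text content. The lowerCaseTagName parse option documented in the README is probably the first thing to try. As an alternative, the sketch below lowercases only tag names before parsing. It is a rough pre-processing pass, not part of the library: the function name lowerCaseTagNames is hypothetical, and the regular expression does not understand comments, script bodies, or a "<" inside an attribute value.

```javascript
// Hypothetical pre-processing step: lowercase tag names only, leaving
// attribute names, attribute values, and text content untouched.
// Assumption: a plain regex pass is good enough for the input at hand.
function lowerCaseTagNames(html) {
  return html.replace(
    /<(\/?)([A-Za-z][A-Za-z0-9-]*)/g,
    (match, slash, name) => `<${slash}${name.toLowerCase()}`
  );
}
```

The result can then be passed to parse() in place of the original string.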

support for xmlns:xlink

The xmlns:xlink attribute is converted to xmlns xlink when toString() is called.

querySelectorAll acting strange

<dl>
  <dt>A</dt>
  <dd>B</dd>
  <dt>C</dt>
  <dd>D</dt>
</dl>

Issue 1

HTMLParser(contentAbove).querySelectorAll('dl > dt');

This fails to retrieve anything. I presume the implementation is not exactly like the real querySelectorAll: it doesn't understand the direct-child combinator and wants to do the more expensive any-descendant match. Fair enough, I can get around that by testing the parent node and throwing away the nodes that don't match.
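The parent-check workaround described here might look like the sketch below. It assumes each returned node exposes a parentNode property (the inspected output in another issue on this page shows one); the helper name onlyChildrenOf is hypothetical.

```javascript
// Hypothetical helper: emulate a "parent > child" selector by running a
// descendant query first and keeping only nodes whose parentNode is `parent`.
function onlyChildrenOf(parent, nodes) {
  return nodes.filter((node) => node.parentNode === parent);
}
```

For the markup above, this would be used as onlyChildrenOf(dl, dl.querySelectorAll('dt')), where dl is the result of querySelector('dl').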

Issue 2

HTMLParser(contentAbove).querySelectorAll('dl dt, dl dd');

This returns A, C, B, D, but a browser's querySelectorAll would return A, B, C, D.
I am not able to easily code around this one. I need the results in the correct order.
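One possible way to recover document order is to sort the results by their source position. The sketch below assumes a library version in which elements expose the range property described in the README (start and end indexes into the source); older versions may not provide it. The helper name inDocumentOrder is hypothetical.

```javascript
// Hypothetical helper: return a copy of `nodes` sorted by where each node
// starts in the source text. Assumes every node has a `range` of the form
// [start, end], as documented for HTMLElement.
function inDocumentOrder(nodes) {
  return [...nodes].sort((a, b) => a.range[0] - b.range[0]);
}
```

The input array is copied before sorting, so the original query result is left untouched.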

Uppercase CLASS (with capitals) is not detected

parser.parse('<div CLASS="a"></div>').querySelectorAll('.a')
The code above returns [ ].

querySelector not working with Attributes

Hey there!
I can't find a way to select a specific element by a specific attribute. The following works on normal HTML pages, but with node-html-parser it returns null:

dom.querySelector("meta[property='og:site_name']")

missing node tag

This is what the HTML looks like:

The tree levels are: html > body > table > ...
After parsing the HTML string, the DOM tree looks like this:

The 'body' level is lost, and I don't know why. Could you help me with this?

Here is the HTML string:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>鄂尔多斯恩格贝生态示范区</title>
<link href="../../images/gl.css" type="text/css" rel="stylesheet">
</head>

<body onload="scrollTo(document.body.scrollWidth/5.5,0)">
<script> 
function submitA(){   		
		var channelid=document.getElementById("channelid");	
		channelid.value="214851";		
        frmSearch.action="http://www.ordos.gov.cn/was40/search?channelid=214851";  		
        frmSearch.submit();   
  }
  </script>
<table width="1006" border="0" cellspacing="0" cellpadding="0" align="center">
  <tbody><tr>
    <td id="top" align="center" style="height:270px;">
         
		 	<div class="TRS_Editor"><embed width="1014" height="282" src="../../fzlm/top_falsh/201901/W020190119606990532110.swf" type="application/x-shockwave-flash" scale="ShowAll" play="true" loop="true" menu="true" wmode="Transparent" quality="1" mediatype="flash" oldsrc="W020190119606990532110.swf"></div>
		 
	</td>
  </tr>
</tbody></table>
<table width="1006" border="0" cellspacing="0" cellpadding="0" id="memu" align="center" \="">
  <tbody><tr>
    
        <td class="td1">
	         <a href="../../">首页</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	<td class="td3">
		<a href="../">信息公开</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	<td class="td3">
		<a href="../../zjegb/">走进恩格贝</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	<td class="td3">
		<a href="../../tzegb/">投资恩格贝</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	
	<td class="td3">
		<a href="../../lyegb/">旅游恩格贝</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	<td class="td3">
		<a href="../../stjj/scy/">沙产业</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>
	<td class="td3">
		<a href="../../stjj/xny/">新能源</a>
	</td>
	<td><img src="../../images/menu_line.jpg"></td>

  </tr>
</tbody></table>
<table width="1006" border="0" cellspacing="0" cellpadding="0" align="center">
  <tbody><tr>
    <td><img src="../../images/s_bj.jpg"></td>
  </tr>
</tbody></table>
<table width="1006" border="0" cellspacing="0" cellpadding="0" align="center" bgcolor="#FFFFFF">
  <tbody><tr>
        <td width="98" class="weather" style="padding-left:18px;">天气预报：</td>
	<td width="373" align="left">
          
<iframe allowtransparency="true" frameborder="0" width="317" height="28" scrolling="no" src="http://tianqi.2345.com/plugin/widget/index.htm?s=3&amp;z=1&amp;t=1&amp;v=0&amp;d=1&amp;bd=0&amp;k=&amp;f=&amp;q=1&amp;e=0&amp;a=1&amp;c=60976&amp;w=317&amp;h=28&amp;align=left"></iframe>
		
	</td>
      <td width="84">站内搜索：</td>
	<form name="frmSearch" action="http://was.ordos.gov.cn/was40/search" target="_blank" onsubmit="javascript:return setSearchword(this);"></form>
    <td width="300" valign="middle"><input type="text" name="searchword" id="input_t">
          <input type="hidden" name="channelid" id="channelid"></td>
    <td width="54" valign="middle"><img onclick="submitA();" src="../../images/ss_btn.jpg" style="cursor:hand;"></td>
	
        <td width="70"><img src="../../images/gs_btn.jpg" onclick="javascript:window.open('http://was.ordos.gov.cn/was40/searchtemplet/egb_gj.jsp','_blank');" style="cursor:hand;"></td>
       <!--<td width="159"><a href="http://mail.ordos.gov.cn/"><img src="../../images/email.jpg" /></a></td>-->
  </tr>
</tbody></table>
<table width="1006" border="0" height="875" cellspacing="0" cellpadding="0" class="mag1" align="center">
  <tbody><tr>
    <td align="center" bgcolor="#e1f1ff" width="253" valign="top">
    <table width="220" border="0" cellspacing="0" cellpadding="0">
	  <tbody><tr>
		<td><img src="../../images/xx_tit.jpg"></td>
	  </tr>
	  <tr>
		<td style="background:url(../../images/g_lbj.jpg) repeat-y;" align="center">
		<table width="155" border="0" cellspacing="0" cellpadding="0" class="gl_left">
                    <tbody><tr>
		        <td width="19"><img src="../../images/gl_ico.jpg"></td>
			<td width="136"><a href="../xxgkgd/" target="_self">信息公开规定</a></td>
		    </tr>
		    <tr>
			<td width="19"><img src="../../images/gl_ico.jpg"></td>
			<td width="136"><a href="../xxgkzd/" target="_self">信息公开制度</a></td>
		    </tr>
		    <tr>
			<td width="19"><img src="../../images/gl_ico.jpg"></td>
			<td width="136"><a href="../xxgkzn/" target="_self">信息公开指南</a></td>
		    </tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../xxgkml/" target="_self">信息公开目录</a></td>
			 </tr>
			  <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../ysqgklm/" target="_self">依申请公开</a></td>
			 </tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="./" target="_self">信息公开年报</a></td>
			 </tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../xxgkscx/" target="_self">信息公开查询</a></td>
			 </tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../rsxx/" target="_self">人事信息</a></td>
			</tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../cwgk/" target="_self">财务公开</a></td>
			  </tr>
	
			<tr>
			        <td width="19"><img src="../../images/gl_ico.jpg"></td>
			        <td width="136"><a href="../zfcg/" target="_self">政府采购</a></td>
			 </tr>
			 <tr>
				<td width="19"><img src="../../images/gl_ico.jpg"></td>
				<td width="136"><a href="../yjgl/" target="_self">应急管理</a></td>
			</tr>

		</tbody></table>

		</td>
	  </tr>
	  <tr>
		<td><img src="../../images/g_ldb.jpg"></td>
	  </tr>
	</tbody></table>

	</td>
    <td bgcolor="#FFFFFF" valign="top">
	<table width="752" border="0" cellspacing="0" cellpadding="0" style="margin-bottom:5px;">
	  <tbody><tr>
		<td><img src="../../images/xx_tiao.jpg"></td>
	  </tr>
	</tbody></table>
     <table width="752" height="761" border="0" cellpadding="0" cellspacing="0" class="bor">
	  <tbody><tr>
		<td class="pos" height="27">您现在的位置是：
			<a href="../../" title="首页" class="CurrChnlCls">首页</a>&nbsp;&gt;&gt;&nbsp;<a href="../" title="信息公开" class="CurrChnlCls">信息公开</a>&nbsp;&gt;&gt;&nbsp;<a href="./" title="信息公开年报" class="CurrChnlCls">信息公开年报</a>
		</td>
	  </tr>
	  <tr>
		<td height="678" valign="top">
		<table width="682" border="0" cellspacing="0" cellpadding="0" class="gl_list">
			
			  <tbody><tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./202009/t20200903_2748784.html">鄂尔多斯市2019年政府信息公开工作年度报告</a>
				</td>
				<td>
					2020-09-03
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./202005/t20200506_2633371.html">恩格贝生态示范区召开2020年度工作会暨党风廉政建设会议</a>
				</td>
				<td>
					2020-05-06
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./202003/t20200311_2593126.html">鄂尔多斯市人民政府办公室2019年政府信息公开工作年度报告</a>
				</td>
				<td>
					2020-03-11
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201906/t20190614_2389850.html">恩格贝生态示范区召开目标责任制考核表彰大会</a>
				</td>
				<td>
					2019-06-14
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201905/t20190530_2380968.html">鄂尔多斯市2018年政府信息公开年度报告</a>
				</td>
				<td>
					2019-05-30
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201905/t20190530_2380967.html">内蒙古自治区2018年政府信息公开工作年度报告</a>
				</td>
				<td>
					2019-05-30
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201811/t20181102_2290635.html">恩格贝生态示范区召开2018年度工作会议暨党风廉政建设会议</a>
				</td>
				<td>
					2018-03-26
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201811/t20181102_2290607.html">恩格贝生态示范区：栽下梧桐引凤来</a>
				</td>
				<td>
					2018-01-05
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201805/t20180525_2166460.html">一图读懂示范区2017年重点工作</a>
				</td>
				<td>
					2017-03-22
				</td>
			  </tr>
			  
		  	
			  <tr valign="top" style="padding-bottom:10px;">
				<td width="25" style="padding-top:10px;"><img src="../../images/ico2.jpg"></td>
				<td width="550">
					<a href="./201802/t20180205_2082017.html">2016，恩格贝，精彩继续！</a>
				</td>
				<td>
					2017-01-07
				</td>
			  </tr>
			  
		  	
		</tbody></table>

		</td>
	  </tr>
	  <tr>
		<td height="572" class="page">
                     <script language="JavaScript" type="text/javascript">
var currentPage = 0;//所在页从0开始
var prevPage = currentPage-1//上一页
var nextPage = currentPage+1//下一页
var countPage = 2//共多少页
//共计多少页 
document.write("共计"+"&nbsp;<font style='color:#FF8008'>"+"2"+"&nbsp;</font>"+"页");
// 设置首页
document.write("&nbsp;<a href=\"index."+"html\">首页</a>&nbsp;|");

//设置上一页代码
if(countPage>1&&currentPage!=0&&currentPage!=1)
document.write("<a href=\"index"+"_" + prevPage + "."+"html\"><span class=greyfont>上一页</span></a>&nbsp;");
else if(countPage>1&&currentPage!=0&&currentPage==1)
document.write("<a href=\"index.html\">&nbsp;<span class=greyfont>上一页</span></a>&nbsp;");
else
document.write("&nbsp;上一页&nbsp;");

//循环
var num = 6;
if(currentPage<=3)
{
for(var i=0 ; i<=6 && (i<countPage); i++){
if(currentPage==i)
document.write("&nbsp;<font style='color:#FF8008'>"+(i+1)+"</font>&nbsp;|");
else if(i==0)
document.write("&nbsp;<a href=\"index.html\">1</a>&nbsp;|");
else
if(i>0) document.write("&nbsp;<a href=\"index"+"_" + i + "."+"html\">"+(i+1)+"</a>&nbsp;|");
//alert(i)
}
}
else
{
if(currentPage>(countPage-3))
{

for(var i=(countPage-6) ; i<countPage; i++){
if(currentPage==i)
document.write("&nbsp;<font style='color:#FF8008'>"+(i+1)+"</font>&nbsp;|");
else if(i==0)
document.write("&nbsp;<a href=\"index.html\">1</a>&nbsp;|");
else
if(i>0) document.write("&nbsp;<a href=\"index"+"_" + i + "."+"html\">"+(i+1)+"</a>&nbsp;|");
//alert(i)
}

}else
{
for(var i=(currentPage-3) ; i<=(currentPage+3) &&(i<countPage); i++){
if(currentPage==i)
document.write("&nbsp;<font style='color:#FF8008'>"+(i+1)+"</font>&nbsp;|");
else if(i==0)
document.write("&nbsp;<a href=\"index.html\">1</a>&nbsp;|");
else
if(i>0) document.write("&nbsp;<a href=\"index"+"_" + i + "."+"html\">"+(i+1)+"</a>&nbsp;|");
//alert(i)
}
}
}
//设置下一页代码 
if(countPage>1&&currentPage!=(countPage-1))
document.write("&nbsp;<a href=\"index"+"_" + nextPage + "."+"html\">&nbsp;<span class=greyfont>下一页</span>&nbsp;</a>&nbsp;");

else
document.write("&nbsp;下一页&nbsp;");
// 设置尾页
if(countPage!=1)
document.write("&nbsp;<a href=\"index"+"_" + (countPage-1)  + "."+"html\">尾页</a>&nbsp;");
else
document.write("&nbsp;尾页&nbsp;");
 //跳转页数脚本开始
 document.write(" 转到 <input type='text' id='itemNum' size='5' style='height:15px; width:33px' />&nbsp;<input type='button' style='background-color:#333333; border-color:#006633; height:20px; width:28px; color:#FFFFFF; font-weight:bold' value='GO' onClick='goto();' />");
 var itemNum=document.getElementById("itemNum").value=currentPage+1;
 function goto(){
 var itemNum=document.getElementById("itemNum").value;
 if(itemNum>1&&itemNum<=countPage){
   window.navigate("index_"+(itemNum-1)+".html");
}
else if(itemNum==1){
   window.navigate("index.html");
}
else{
alert("输入的数字不在页数范围内！");
}
}
 //跳转页数脚本结束
</script>共计&nbsp;<font style="color:#FF8008">2&nbsp;</font>页&nbsp;<a href="index.html">首页</a>&nbsp;|&nbsp;上一页&nbsp;&nbsp;<font style="color:#FF8008">1</font>&nbsp;|&nbsp;<a href="index_1.html">2</a>&nbsp;|&nbsp;<a href="index_1.html">&nbsp;<span class="greyfont">下一页</span>&nbsp;</a>&nbsp;&nbsp;<a href="index_1.html">尾页</a>&nbsp; 转到 <input type="text" id="itemNum" size="5" style="height:15px; width:33px">&nbsp;<input type="button" style="background-color:#333333; border-color:#006633; height:20px; width:28px; color:#FFFFFF; font-weight:bold" value="GO" onclick="goto();">
                </td>
	  </tr>
	</tbody></table>

	</td>
  </tr>
</tbody></table>
<table width="1006" border="0" cellspacing="0" cellpadding="0" height="136" id="down" align="center">
  <tbody><tr>
    <td align="center"><table width="684" height="91" border="0" cellpadding="0" cellspacing="0" class="about" style="margin-top:18px;">
  <!--<tr>
    <td width="609" align="center"><a href="../../dbdh/gywm/" >关于我们</a> | <a href="../../dbdh/wyjy/" >网友建议</a> | <a href="../../dbdh/xxbz/">信息保障</a> | <a href="../../dbdh/lxwm/" >联系我们</a> |
      </td>
  </tr>--><!--访问量<script src="http://s17.cnzz.com/stat.php?id=2896628&web_id=2896628&show=pic" language="JavaScript"></script>-->
  <tbody><tr>
    <td align="center">主办：鄂尔多斯市恩格贝生态示范区管委会　<a href="../../wzdt/" target="_blank">网站地图</a> <script src="http://s17.cnzz.com/stat.php?id=2896628&amp;web_id=2896628&amp;show=pic" language="JavaScript"></script><script src="http://c.cnzz.com/core.php?web_id=2896628&amp;show=pic&amp;t=z" charset="utf-8" type="text/javascript"></script><a href="https://www.cnzz.com/stat/website.php?web_id=2896628" target="_blank" title="站长统计"><img border="0" hspace="0" vspace="0" src="http://icon.cnzz.com/img/pic.gif"></a>   |  <a href="../../dbdh/lxwm/">联系我们</a> </td>
  </tr>
  <tr>
    <td align="center">违法和不良信息举报电话：0477-2258659(工作时间)&nbsp;&nbsp;邮箱：[email protected] <br>
      　
      建议使用：1024×768分辩率 真彩32位浏览 <font color="#000000">网站标识码：1506000140</font></td>
  </tr>
  <tr>
    <td align="center">　中文域名：鄂尔多斯市恩格贝生态示范区管理委员会.政务&nbsp;<a href="http://www.beian.miit.gov.cn/" target="_blank"> 蒙ICP备13001412号-2</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>
        <img src="../../images/W020190117350825039697.png" style="float:center;"><a target="_blank" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=15062102000168" style="color: #000000;text-decoration: none;">&nbsp;&nbsp;&nbsp;蒙公网安备 15062102000168号</a> </td>
  </tr>
  <tr>
    <td align="center"><script type="text/javascript">document.write(unescape("%3Cspan id='_ideConac' %3E%3C/span%3E%3Cscript src='http://dcs.conac.cn/js/07/136/0000/40624487/CA071360000406244870001.js' type='text/javascript'%3E%3C/script%3E"));</script><span id="_ideConac"><a href="//bszs.conac.cn/sitename?method=show&amp;id=05C2ED68A8F661C9E053012819ACF5E5" target="_blank"><img id="imgConac" vspace="0" hspace="0" border="0" src="//dcs.conac.cn/image/blue.png" data-bd-imgshare-binded="1"></a></span><script src="http://dcs.conac.cn/js/07/136/0000/40624487/CA071360000406244870001.js" type="text/javascript"></script><span id="_ideConac"></span>
    </td>
    <td width="96"> <script id="_jiucuo_" sitecode="1506000140" src="http://pucha.kaipuyun.cn/exposure/jiucuo.js"></script> </td>
  </tr>
</tbody></table></td>
  </tr>
</tbody></table>

</body></html>

element.getAttribute return undefined

When I try to read an attribute, I get undefined.

function extractData(body) {
	const html = HTMLParser.parse(body);
	const date = html.querySelector('cv-stats-virus');
	console.log(date.getAttribute(':charts-data'));
}
// undefined

Is there a way to append just a Text Node

I see there is an ability to append an HTMLNode to children Nodes but is there a way to do that with TextNodes as well?

set_content strips out innerHTML of pre tags

Hello!

I came across a small issue when using the library where if you run set_content on an element with some content that includes a <pre> tag, then the contents of the <pre> tag are stripped out of the inserted content. I've included a small example to demonstrate the problem. If I am missing something with the library that allows you to include this content, could you please let me know?

const { parse } = require("node-html-parser");

const html = `<html>
<head>
</head>
<body>
</body>
</html>`;

const htmlTree = parse(html, { pre: true });

// 1
console.log(htmlTree.toString());

htmlTree
  .querySelector("body")
  .set_content(`<pre>this    is some    preformatted    text</pre>`);

// 2
console.log(htmlTree.toString());

// 1

<html>
<head>

</head>
<body>
</body>
</html>

// 2

<html>
<head>

</head>
<body><pre></pre></body>
</html>

Thanks for the help and for working on such a great library!

Option to parse ONLY HTML tags

Hi! First of all, I would like to thank you for this package.
It does work well except for one issue.

Let's say we have C code like this:

#include <stdio.h>

After I parse it with this package, the .text property returns #include without <stdio.h>.
I suspect that is because the package treats <stdio.h> as an HTML tag, but it's not.
It's the syntax for importing libraries in C.

Could we have an option where the plugin would strictly parse HTML tags (span, div etc.) without parsing "custom" tags?

Thank you.

How to extract href?

Title
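Using only the methods documented in the README (querySelectorAll and getAttribute), collecting href values could be sketched as follows. The helper name extractHrefs is hypothetical, and root is assumed to be the value returned by parse().

```javascript
// Hypothetical helper: collect the href of every <a> element under `root`.
// getAttribute() is documented to return undefined when the attribute is not
// set, so anchors without an href are skipped.
function extractHrefs(root) {
  return root
    .querySelectorAll('a')
    .map((anchor) => anchor.getAttribute('href'))
    .filter((href) => href !== undefined);
}
```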

Quotes around attributes are stripped

After using setAttribute to update the href of a link the double quotes are stripped from the attributes when converting it back to a string using toString.

Example:

Original: <a style="width:100%;" href="https://google.com" class="button--action button-text-wrap" target="_blank" rel="noopener noreferrer">Button</a>

New: <a style=width:100%; href=http://localhost:8080/metrics/link?redirect=https://google.com&subject=TEST%20-%20TEST%20-%20test%20link&name=Button class=ods-button--action button-text-wrap target=_blank rel=noopener noreferrer>Button</a>

Notice the double quotes around the attributes are now gone.

Why is the "innerText" attribute renamed to "rawText"?

In the documentation, it's noted that ELEMENT.rawText behaves just like ELEMENT.innerText. Why not use the same attribute??

https://www.npmjs.com/package/node-html-parser#htmlelementtext

Then scripts that run in the browser (for testing) will also run in this environment...

Anyway, thanks for a great library!

Creating new Nodes

There's no documented way to make new Nodes to append into the tree. The problem is that appendChild only takes nodes and not raw strings. I was able to make new Nodes by importing the HTMLElement class and using that to create elements.

Is this intentionally left out or not an intended use case? It's already built in to the code, just needs an easier interface. set_content is able to handle strings and nodes, why not allow appendChild to do the same?

Typescript return-type for HtmlElement.getAttribute() is wrong

In html.d.ts:

    /**
     * Get an attribute
     * @return {string} value of the attribute
     */
    getAttribute(key: string): string;
    /**

If the attribute is not found this method returns undefined so the type should be string|undefined

Typescript types seem to be incorrect

Trying to follow the little example in the ReadMe, it seems like your type definitions are off, as it certainly won't compile. The following code...

const root = parse('<ul id="list"><li>Hello World</li></ul>');
console.log(root.firstChild.structure);

...results in the following error...

index.tsx:31:33 - error TS2339: Property 'firstChild' does not exist on type '(TextNode & { valid: boolean; }) | (HTMLElement & { valid: boolean; })'.
  Property 'firstChild' does not exist on type 'TextNode & { valid: boolean; }'.

31             console.log(content.firstChild.structure);

Using Node-HTML-Parser v1.2.17

Can I get an element's parents?

rawAttributes behaviour change?

On v1.2.0, rawAttributes returns values wrapped in double quotes (").
Also, what is the reason for no longer encoding entities in setAttribute?

Breaking changes when moving to 1.1.20

index.ts no longer exports TextNode, CommentNode, or NodeType, so something like this is no longer possible:

if (node instanceof TextNode && !node.isWhitespace) { ... }

Option to keep location info

I want to use this parser to create a Rollup plugin, and I need the location info of some tags to generate correct source maps for the browser.

A greater-than sign inside an attribute value is treated as the end of the tag

<template v-if="list.length>0"> <div> 123 </div> </template>

setAttribute without value

It is not possible to add attributes without values, such as 'disabled' for HTML buttons. A workaround like setAttribute("disabled", "") does not work either, since it injects " disabled='' " into the HTML, which does not disable buttons.

named import does not work

The first usage-example does not work for me.
I created a new file in an empty directory, pasted your code and tried to run it.

$> npm init
....
$> npm install node-html-parser
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN [email protected] No description
npm WARN [email protected] No repository field.

+ [email protected]
added 2 packages from 3 contributors and audited 2 packages in 0.409s
found 0 vulnerabilities

$> vim test.mjs
... (pasted the first usage-example) ...
$> node test.mjs
import { parse } from 'node-html-parser';
         ^^^^^
SyntaxError: The requested module 'node-html-parser' is expected to be of type CommonJS, which does not support named exports. CommonJS modules can be imported by importing the default export.
For example:
import pkg from 'node-html-parser';
const { parse } = pkg;
    at ModuleJob._instantiate (internal/modules/esm/module_job.js:97:21)
    at async ModuleJob.run (internal/modules/esm/module_job.js:135:5)
    at async Loader.import (internal/modules/esm/loader.js:178:24)

$> node --version
v14.4.0

Did I do something wrong?
Thank you!

Escaped HTML is unescaped by parse()

When the input contains escaped HTML sequences, the parse() function interprets them as unescaped HTML and includes them unescaped in the output.

This causes problems when the page contains text that should stay escaped but is instead interpreted by the browser as an HTML tag.

For example:

SOURCE:

<html>
<body>
<textarea id="source">
&lt;p&gt;
This content should be enclosed within an escaped p tag&lt;br /&gt;
&lt;/p&gt;
</textarea>
</body>
</html>

PARSED INPUT:

<html>
<body>
<textarea id="source">
<p>
This content should be enclosed within an escaped p tag<br />
</p>
</textarea>
</body>
</html>

Script tags have no text/inner html

I tried to use this module to extract scripts from an HTML response, but I could not.

Reproduction:

const html_parser = require('node-html-parser')

html_parser.parse('<script>var test = \'\';</script>')
    .querySelectorAll('script')[0].text

Grouping selectors (comma) behave differently in elem.querySelector

<article>
<h4>title</h4>
</article>

Suppose elem is the article. elem.querySelector('h3, h4') returns null, but elem.querySelectorAll('h3, h4') returns an array containing that h4. In a native browser, elem.querySelector('h3, h4') returns the h4.

So I guess node-html-parser should do the same and simply return the first element of that array?
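
The suggestion above can be sketched as a small helper that defines the single-result query in terms of the multi-result one. This is only an illustration: querySelectorFirst is a hypothetical name, the stub object stands in for a parsed element, and the sketch assumes querySelectorAll returns matches in document order. It is not the library's implementation.

```javascript
// Hypothetical helper: derive a single-match query from querySelectorAll.
// Assumes querySelectorAll returns matches in document order.
function querySelectorFirst(elem, selector) {
  const matches = elem.querySelectorAll(selector);
  return matches.length > 0 ? matches[0] : null;
}

// Stub standing in for a parsed <article> whose querySelectorAll
// already handles comma-separated tag selectors.
const h4 = { tagName: 'h4' };
const article = {
  querySelectorAll(selector) {
    const wanted = selector.split(',').map((s) => s.trim());
    return [h4].filter((node) => wanted.includes(node.tagName));
  },
};

console.log(querySelectorFirst(article, 'h3, h4') === h4); // true
console.log(querySelectorFirst(article, 'h3'));            // null
```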

Memory leak

I parsed many different pages and collected all the links from them into an array. Over time, the process failed with a 'heap out of memory' error. A memory dump showed that the library returns sliced strings in some cases: the small strings I store (obtained via getAttribute) stay linked to the large source strings, which then cannot be freed. I recommend documenting this so that users can avoid the problem, or adding a function that returns a deep copy.
The following code reproduces the issue with any HTML file that contains links:

const fs = require('fs');
const HTMLParser = require('node-html-parser');

(async () => {
  const array = [];
  setInterval(() => {
    fs.readFile('tmp.html', (err, buf) => {
      let lines = buf.toString();
      const root = HTMLParser.parse(lines);
      for (const elem of root.querySelectorAll('a')) {
        array.push(elem.getAttribute('href'));
      }
    })
  }, 2000);
})();
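
One possible workaround, sketched under the assumption that the retained memory really comes from engine-level sliced strings, is to copy each small string before storing it. detachString is a hypothetical helper name, and round-tripping through a Buffer is just one way to ask for a fresh allocation; whether it helps depends on the Node.js/V8 version in use.

```javascript
// Hypothetical helper: return a copy of `value` that should not share storage
// with the (possibly very large) string it was cut from.
// Assumption: encoding to bytes and decoding again allocates a new string.
function detachString(value) {
  if (typeof value !== 'string') return value; // pass through undefined/null
  return Buffer.from(value, 'utf8').toString('utf8');
}

// Simulated case: a short href cut out of a large page source.
const page = '<a href="/some/long/path/to/a/page.html">x</a>' + ' '.repeat(1 << 20);
const href = page.slice(9, 39);

const stored = detachString(href);
console.log(stored === href); // true: same characters, separate copy
```

In the reproduction above, this would mean pushing detachString(elem.getAttribute('href')) into the array instead of the raw attribute value.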

There's no declaration file for TypeScript

Is that on purpose?
Having proper types can be very very helpful.

Versions

node-html-parser: 1.2.16
Node: 14.0.0
npm: 16.14.4
VSCode: 1.44.2
typescript: 3.7.5

Breaking change upgrading to 1.2.x

Dependabot has been trying to upgrade me from 1.1.20 to 1.2.x, but each attempt has failed type checking on this code:

      const html = doc as HTMLElement;
      const title = html.querySelector('title');
      const content = html.querySelector('ufe-content');

      const forms = content.querySelectorAll('form');

The error I receive is:

    TypeScript diagnostics (customize using `[jest-config].globals.ts-jest.diagnostics` option):
    src/server/content.ts:54:29 - error TS2339: Property 'querySelectorAll' does not exist on type 'Node'.

    54       const forms = content.querySelectorAll('form');

In version 1.1.20, html.querySelector('ufe-content') returns the HTMLElement type. But with version 1.2.x, the error indicates that TypeScript thinks querySelector returns a Node instead.

Help is greatly appreciated!

build using "tsc" command failed with error: "Accessors are only available when targeting ECMAScript 5 and higher"

Error when parsing inline script elements

import { parse } from 'node-html-parser';
const node = parse('<script>console.log("Hi");</script>')

produces:

attributes: Object
childNodes: [HTMLElement]
classNames: []
firstChild: HTMLElement
innerHTML: "<script></script>"
lastChild: HTMLElement
nodeType: 1
outerHTML: "<script></script>"
parentNode: null
rawAttributes: Object
rawAttrs: ""
rawText: ""
structure: "null↵  script"
structuredText: ""
tagName: null
text: ""
valid: true

and node.toString() returns <script></script>.
Its content is lost.

node.tagName is suddenly uppercase in 1.2.21

In version 1.2.20 node.tagName comes back lowercase:

> ht = require('node-html-parser')
> d = ht.parse('<div></div>')
> d.childNodes[0].tagName
'div'

In version 1.2.21 node.tagName comes back uppercase:

> ht = require('node-html-parser')
> d = ht.parse('<div></div>')
> d.childNodes[0].tagName
'DIV'

Patch versions past v1 must not introduce breaking changes like this! It means any tool/project/framework which:

Uses node-html-parser at v 1.2.x (e.g. "node-html-parser": "^1.2.8")
Relies on the case of tag names (e.g. to differentiate between <standard-html-tags> and <CustomComponentClassTags/> like React)

Is now broken for all installs after 25th August.

We need a 1.2.22 release which fixes this bug ASAP.

insertAdjacentHTML not working when having flat HTML

With for example this HTML:

<body>
lorem ipsum
<img src="..." alt="..." />
lorem ipsum
<h2 style="text-align: center;">more text</h2>
<h5 style="text-align: center;">more text</h2>
<h5 style="text-align: center;">more text</h2>
</body>

and this code

    var root = HTMLParser.parse(text);
    var ele = root.querySelector('h2');
    ele.insertAdjacentHTML("afterend", "<div></div>")
    console.log(root.toString())

The div will be added like this <div></div></body>, however it should be added after the h2.

When I don't add a parent tag, insertAdjacentHTML raises an error, probably because of this: a72c0da#diff-3ba29b891fcc8c2d131ae8f408ae8177R522-R523

Error:

node-html-parser/dist/nodes/html.js:581
                _this.parentNode.appendChild(n);
                                 ^

TypeError: Cannot read property 'appendChild' of null
    at node-html-parser/dist/nodes/html.js:581:34
    at Array.forEach (<anonymous>)
    at HTMLElement.insertAdjacentHTML (node-html-parser/dist/nodes/html.js:580:26)
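
The expected "afterend" behavior can be sketched on a plain childNodes array. This is a minimal illustration with stub objects, not the library's code: insertAfter is a hypothetical helper, and it reports a missing parent explicitly instead of failing on a null dereference.

```javascript
// Hypothetical helper: insert newNodes right after refNode among its siblings.
// Plain objects with childNodes/parentNode stand in for parsed nodes.
function insertAfter(refNode, newNodes) {
  const parent = refNode.parentNode;
  if (!parent) {
    throw new Error('Reference node has no parent; cannot insert a sibling.');
  }
  const index = parent.childNodes.indexOf(refNode);
  parent.childNodes.splice(index + 1, 0, ...newNodes);
  for (const node of newNodes) node.parentNode = parent;
}

const body = { tagName: 'body', childNodes: [], parentNode: null };
const h2 = { tagName: 'h2', parentNode: body };
const h5 = { tagName: 'h5', parentNode: body };
body.childNodes.push(h2, h5);

insertAfter(h2, [{ tagName: 'div', parentNode: null }]);
console.log(body.childNodes.map((n) => n.tagName)); // [ 'h2', 'div', 'h5' ]
```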

Errors with TypeScript 4.x

The breaking change “Properties Overriding Accessors (and vice versa) is an Error” in the TypeScript 4 release causes a few errors in the package.

Export NodeType again

In cba17a7 the export of NodeType got removed (by accident?). I cannot upgrade to the latest version because of this.

It would be nice to have access to the enum and to be able to import it.

removeChild(node: Node) does not actually remove the child node.

HTMLParser.HTMLElement.removeChild(node: Node) does not actually remove the child node.

Method description:
Remove Child element from childNodes array

Expected Result: The child node will be removed from the childNodes array

Actual Result: The childNodes array is not affected
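
For reference, the expected behavior can be written down as a short sketch over a plain childNodes array. removeChildNode is a hypothetical name and the objects are stubs; this only illustrates the expected result described above, not the library's method.

```javascript
// Hypothetical helper: remove `child` from parent.childNodes in place.
function removeChildNode(parent, child) {
  const index = parent.childNodes.indexOf(child);
  if (index === -1) return false; // not a child: leave the array unchanged
  parent.childNodes.splice(index, 1);
  child.parentNode = null;
  return true;
}

const list = { tagName: 'ul', childNodes: [] };
const first = { tagName: 'li', parentNode: list };
const second = { tagName: 'li', parentNode: list };
list.childNodes.push(first, second);

console.log(removeChildNode(list, first)); // true
console.log(list.childNodes.length);       // 1
```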

Text at the root level is ignored when parsing

There seems to be a serious parsing bug.

Example code:

parse(`Катя, спасибо большое. Все наборы очень хороши. Самое то что нужно))) Прям хоть все скупай  <img src="/new_style/tiny_mce/plugins/emotions/img/big-1/big_smiles_162.gif" alt="big_smiles_162.gif" /><div></div>
`).toString()

Expected result:

Катя, спасибо большое. Все наборы очень хороши. Самое то что нужно))) Прям хоть все скупай  <img src="/new_style/tiny_mce/plugins/emotions/img/big-1/big_smiles_162.gif" alt="big_smiles_162.gif" /><div></div>

Actual result:

<img src="/new_style/tiny_mce/plugins/emotions/img/big-1/big_smiles_162.gif" alt="big_smiles_162.gif"  /><div></div>

<style> tag is not parsed correctly

<style>
  .my-class {
    font-size: 90%;
  }
</style>

toString() does not include the content of the style tag.

1.2.21 has a breaking change

The value of tagName changed from lowercase to uppercase. This is a breaking change, so the new version should have bumped the major version number.

Is this faster than parse5?

Losing content when using lowerCaseTagName option as true

I'm using axios to fetch some information from a website, and the HTML came back with all tags in uppercase, like <TITLE>. I had a lot of trouble getting elements with querySelector, so I tried setting the lowerCaseTagName parsing option to true.

But for some reason this removes a lot of the markup, such as script tags and body tags. It also removes the head tag but preserves its content.

I solved the problem by converting the axios response to lowercase before passing it to parse(). With that, querySelector worked well.

The problematic HTML is linked below, in case it helps.
https://pastebin.com/raw/H6Vzwpe9

parser fails to keep script element child nodes

Failing repro:

import { parse } from 'node-html-parser';

// Single quotes outside so the double-quoted attributes need no escaping.
const page = '<!DOCTYPE html><html lang="en"><head></head><body><script id="storeFinder" type="application/json">{"key":true}</script></body></html>';
const root = parse(page);
// Query the parsed root; there is no global `document` in Node.
const storeFinder = root.querySelector("#storeFinder");
console.log("store finder exists", storeFinder !== null);
console.log("no child nodes are kept", storeFinder.childNodes.length === 0);

Expectation: the parser keeps the child nodes of script tags as text nodes, like how the parser keeps text nodes of divs

CSS selector :nth-child() returns null (not implemented?)

<tbody>
   <tr class="odd">
      <td>13/10/2020</td>
      <td>
         Cell2
      </td>
      <td>
         <a href="/mjrcs-32432">Cell3</a>
      </td>
      <td>
         <a target="_blank" href="/5z9LX1.pdf" ><img alt="PDF File" src="/mjrcs-resa/images/v2/icone_pdf.gif"></a>
      </td>
      <td>
         <a target="winzip" href="/33WzKc.zip"><img alt="ZIP File" src="/mjrcs-resa/images/v2/icone_pdf_archive.gif" height="16" ></a>
      </td>
      <td>
         <a target="_blank" href="/3QhZGq.xml"><img alt="XML file"  src="/mjrcs-resa/images/v2/icone_xml.gif" height="16" ></a>
      </td>
      <td>
      </td>
   </tr>

Expected:
document.querySelector('table tbody tr td:nth-child(6) a').href

/3QhZGq.xml

Result:
document.querySelector('table tbody tr td:nth-child(6) a').href

null


HTML5 escapes are returned in .text

In the interface I see:
/**
* Get unescaped text value of current node and its children.
* @return {string} text content
*/
get text(): string;

How do I get HTML5 character references unescaped? It did not work in my test. For example, I expected

The king&#39;s hat is on fire!

to become

The king's hat is on fire!
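
As a stopgap, numeric character references can be decoded after reading .text. The sketch below is a self-contained illustration, not part of the library: decodeNumericEntities is a hypothetical helper that handles only decimal and hexadecimal references and leaves named entities such as &amp; untouched.

```javascript
// Hypothetical helper: decode decimal (&#39;) and hexadecimal (&#x27;)
// numeric character references. Named entities are left as-is.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, code) => {
    const isHex = code[0] === 'x' || code[0] === 'X';
    const codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    // Out-of-range values are not valid code points; keep the original text.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('The king&#39;s hat is on fire!'));
// The king's hat is on fire!
```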

HTMLElement toString() does not include key attributes

The HTMLElement class toString() function outputs valid HTML for the element, however this output does not include the id or class attributes.

Example:

const parse = require('node-html-parser').parse; const HTMLElement = require('node-html-parser').HTMLElement;
undefined
> 
> var el = new HTMLElement('div', {'id':'new_container', 'class':'container'});
undefined
> el
HTMLElement {
  childNodes: [],
  tagName: 'div',
  rawAttrs: '',
  parentNode: null,
  classNames: [ 'container' ],
  nodeType: 1,
  id: 'new_container'
}
> el.id
'new_container'
> el.classNames
[ 'container' ]
> el.classNames.toString()
'container'
> el.toString()
'<div></div>'

This is due to the toString() function including an output for the rawAttrs property of the object, but not the id or class which are considered keyAttrs.

Affected Code:

In node-html-parser/src/nodes/html.ts:

public toString() {
		const tag = this.tagName;
		if (tag) {
			const is_un_closed = /^meta$/i.test(tag);
			const is_self_closed = /^(img|br|hr|area|base|input|doctype|link)$/i.test(tag);
			const attrs = this.rawAttrs ? ' ' + this.rawAttrs : '';
			if (is_un_closed) {
				return `<${tag}${attrs}>`;
			} else if (is_self_closed) {
				return `<${tag}${attrs} />`;
			} else {
				return `<${tag}${attrs}>${this.innerHTML}</${tag}>`;
			}
		} else {
			return this.innerHTML;
		}
	}

This can be remedied by including the id and classNames properties in the constructed string output. Example code included below:

HTMLElement.prototype.toString = function () {
        var tag = this.tagName;
        if (tag) {
            var is_un_closed = /^meta$/i.test(tag);
            var is_self_closed = /^(img|br|hr|area|base|input|doctype|link)$/i.test(tag);
            var attrs = (this.id ? ' id=\"' + this.id + '\"': '') + (this.classNames.length > 0 ? ' class=\"' + this.classNames.join(' ') + '\"' : '') + (this.rawAttrs ? ' ' + this.rawAttrs : '');
            if (is_un_closed) {
                return "<" + tag + attrs + ">";
            }
            else if (is_self_closed) {
                return "<" + tag + attrs + " />";
            }
            else {
                return "<" + tag + attrs + ">" + this.innerHTML + "</" + tag + ">";
            }
        }
        else {
            return this.innerHTML;
        }
    };

This results in the following behaviour:

> const parse = require('node-html-parser').parse; const HTMLElement = require('node-html-parser').HTMLElement;
undefined
> el = new HTMLElement('div', {id: 'new_container', class:'container container-new'});
HTMLElement {
  childNodes: [],
  tagName: 'div',
  rawAttrs: '',
  parentNode: null,
  classNames: [ 'container', 'container-new' ],
  nodeType: 1,
  id: 'new_container'
}
> el.toString()
'<div id="new_container" class="container container-new"></div>'

How do I select the middle td?

Here's my .structure

'tr
  td
    div
      span
        #text
      p
        #text
      p
        #text
  td
    #text
  td
    span
      #text'

Trying to get the second td. I tried all I could think of... no go. Is this even implemented?
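
One selector-free approach is to walk the row's childNodes directly. The sketch below uses stub objects shaped like the node dumps elsewhere on this page (nodeType 1 for elements, a tagName property); nthChildByTag is a hypothetical helper, and it counts only children with the given tag, which is closer to :nth-of-type than to :nth-child.

```javascript
// Hypothetical helper: return the n-th (1-based) child element with the given tag.
// Assumes element nodes have nodeType 1 and a tagName property.
function nthChildByTag(parent, tagName, n) {
  const wanted = tagName.toLowerCase();
  const matches = parent.childNodes.filter(
    (node) => node.nodeType === 1 && String(node.tagName).toLowerCase() === wanted
  );
  return matches[n - 1] || null;
}

// Stub row with three cells and a whitespace text node in between.
const tr = {
  tagName: 'tr',
  nodeType: 1,
  childNodes: [
    { tagName: 'td', nodeType: 1, id: 'first' },
    { nodeType: 3, rawText: '\n' },
    { tagName: 'td', nodeType: 1, id: 'middle' },
    { tagName: 'td', nodeType: 1, id: 'last' },
  ],
};

console.log(nthChildByTag(tr, 'td', 2).id); // middle
```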

Minify

Is there an option to get a minified string version of the output?

nextElement

Good evening!

Thank you for the great work!

Finding the next element does not work :(

Example:
html.querySelector('.pages .num span + a'); // "tag + tag" should select the next element
Or, after finding one element (element1) and wanting the element that follows it:
element1.nextElementSibling is null
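
Until such a property is available, the next element can be found by scanning the parent's childNodes. This is a sketch with stub objects, assuming nodes expose parentNode, childNodes, and nodeType (1 for elements); nextElementSiblingOf is a hypothetical helper name.

```javascript
// Hypothetical helper: return the next sibling that is an element, or null.
function nextElementSiblingOf(node) {
  const parent = node.parentNode;
  if (!parent) return null;
  const siblings = parent.childNodes;
  const index = siblings.indexOf(node);
  if (index === -1) return null;
  for (let i = index + 1; i < siblings.length; i++) {
    if (siblings[i].nodeType === 1) return siblings[i]; // skip text/comment nodes
  }
  return null;
}

const pager = { tagName: 'div', nodeType: 1, childNodes: [] };
const span = { tagName: 'span', nodeType: 1, parentNode: pager };
const gap = { nodeType: 3, rawText: ' ', parentNode: pager };
const link = { tagName: 'a', nodeType: 1, parentNode: pager };
pager.childNodes.push(span, gap, link);

console.log(nextElementSiblingOf(span) === link); // true
console.log(nextElementSiblingOf(link));          // null
```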

Quotes in HTML attributes escaped which breaks HTML

Hi!

I wanted to report an issue:

JSON values of HTML attributes are rewritten to an escaped value which breaks the HTML:

<div data-json='{
    "json": "value"
}'></div>

Result of .toString():

<div data-json="{\"json\":\"value\"}"></div>

Edit

Since the goal of the HTML parser is speed, it may be best to replace JSON.stringify for HTML attributes with a simple string based value verification and leave the original value, even if it would be a mere space or empty string, intact. It could save 50,000+ JSON.stringify calls for some HTML documents.

For some attributes or Javascript functionality it does matter if the attribute contains ="". Stripping it would cost parsing resources while it seems to provide no other advantage than HTML compression, which does not seem to be a goal of the HTML parser.

The following example may provide a hint for a solution:

// Update rawString
const quoteRegex = /"/g; // re-use

this.rawAttrs = Object.keys(attrs).map(function(name) {
    var val = attrs[name];
    if (val === undefined) { // not a string
        return name;
    } else {
        return name + '="' + val.replace(quoteRegex, '&#34;') + '"';
    }
}).join(' ');
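
The snippet above can be tried in isolation with a standalone function. In this sketch, serializeAttributes is a hypothetical name and the attrs object is made up for illustration; it follows the same idea of emitting a bare name for missing values and replacing double quotes with &#34; instead of calling JSON.stringify.

```javascript
const quoteRegex = /"/g; // shared across calls

// Hypothetical standalone version of the suggestion above.
function serializeAttributes(attrs) {
  return Object.keys(attrs).map((name) => {
    const val = attrs[name];
    if (val === undefined || val === null) {
      return name; // bare attribute, e.g. disabled
    }
    return name + '="' + String(val).replace(quoteRegex, '&#34;') + '"';
  }).join(' ');
}

console.log(serializeAttributes({ 'data-json': '{"json": "value"}', disabled: undefined }));
// data-json="{&#34;json&#34;: &#34;value&#34;}" disabled
```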

Too many results for querySelectorAll with a tag selector

Below is the code that returns too many results for the table tag.

const request = require('request');
const HTMLParser = require('node-html-parser');

request('http://nagiosadmin:[email protected]/nagios/cgi-bin/status.cgi?host=all&limit=0', function (error, response, body) {
  const root = HTMLParser.parse(body);
  const table = root.querySelectorAll('table');
  console.log(table);
});