|
|
- # Fast HTML Parser [![NPM version](https://badge.fury.io/js/node-html-parser.png)](http://badge.fury.io/js/node-html-parser) [![Build Status](https://travis-ci.org/taoqf/node-html-parser.svg?branch=master)](https://travis-ci.org/taoqf/node-html-parser)
-
- Fast HTML Parser is a _very fast_ HTML parser. Which will generate a simplified
- DOM tree, with basic element query support.
-
- Per the design, it intends to parse massive HTML files in lowest price, thus the
- performance is the top priority. For this reason, some malformatted HTML may not
- be able to parse correctly, but most usual errors are covered (eg. HTML4 style
- no closing `<li>`, `<td>` etc).
-
- ## Install
-
- ```shell
- npm install --save node-html-parser
- ```
-
- ## Performance
-
- Faster than htmlparser2!
-
- ```shell
- node-html-parser:1.94548 ms/file ± 2.15709
- libxmljs :5.28893 ms/file ± 3.69863
- htmlparser :24.9625 ms/file ± 168.380
- htmlparser2 :3.34011 ms/file ± 4.76959
- parse5 :13.9589 ms/file ± 9.84068
- high5 :6.98078 ms/file ± 4.47575
- ```
-
- Tested with [htmlparser-benchmark](https://github.com/AndreasMadsen/htmlparser-benchmark).
-
- ## Usage
-
- ```ts
- import { parse } from 'node-html-parser';
-
- const root = parse('<ul id="list"><li>Hello World</li></ul>');
-
- console.log(root.firstChild.structure);
- // ul#list
- // li
- // #text
-
- console.log(root.querySelector('#list'));
- // { tagName: 'ul',
- // rawAttrs: 'id="list"',
- // childNodes:
- // [ { tagName: 'li',
- // rawAttrs: '',
- // childNodes: [Object],
- // classNames: [] } ],
- // id: 'list',
- // classNames: [] }
- console.log(root.toString());
- // <ul id="list"><li>Hello World</li></ul>
- root.set_content('<li>Hello World</li>');
- root.toString(); // <li>Hello World</li>
- ```
-
- ```js
- var HTMLParser = require('node-html-parser');
-
- var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');
- ```
-
- ## HTMLElement Methods
-
- ### parse(data[, options])
-
- Parse given data, and return root of the generated DOM.
-
- - **data**, data to parse
- - **options**, parse options
-
- ```js
- {
- lowerCaseTagName: false, // convert tag name to lower case (hurt performance heavily)
- script: false, // retrieve content in <script> (hurt performance slightly)
- style: false, // retrieve content in <style> (hurt performance slightly)
- pre: false, // retrieve content in <pre> (hurt performance slightly)
- comment: false // retrieve comments (hurt performance slightly)
- }
- ```
-
- ### HTMLElement#trimRight()
-
- Trim element from right (in block) after seeing pattern in a TextNode.
-
- ### HTMLElement#removeWhitespace()
-
- Remove whitespaces in this sub tree.
-
- ### HTMLElement#querySelectorAll(selector)
-
- Query CSS selector to find matching nodes.
-
- Note: only `tagName`, `#id`, `.class` selectors supported. And not behave the
- same as standard `querySelectorAll()` as it will _stop_ searching sub tree after
- find a match.
-
- ### HTMLElement#querySelector(selector)
-
- Query CSS Selector to find matching node.
-
- ### HTMLElement#appendChild(node)
-
- Append a child node to childNodes
-
- ### HTMLElement#insertAdjacentHTML(where, html)
-
- parses the specified text as HTML and inserts the resulting nodes into the DOM tree at a specified position.
-
- ### HTMLElement#setAttribute(key: string, value: string)
-
- Set `value` to `key` attribute.
-
- ### HTMLElement#removeAttribute(key: string)
-
- Remove `key` attribute.
-
- ### HTMLElement#getAttribute(key: string)
-
- Get `key` attribute.
-
- ### HTMLElement#exchangeChild(oldNode: Node, newNode: Node)
-
- Exchanges given child with new child.
-
- ### HTMLElement#removeChild(node: Node)
-
- Remove child node.
-
- ### HTMLElement#toString()
-
- Same as [outerHTML](#htmlelementouterhtml)
-
- ### HTMLElement#set_content(content: string | Node | Node[])
-
- Set content. **Notice**: Do not set content of the **root** node.
-
-
- ## HTMLElement Properties
-
- ### HTMLElement#text
-
- Get unescaped text value of current node and its children. Like `innerText`.
- (slow for the first time)
-
- ### HTMLElement#rawText
-
- Get escpaed (as-it) text value of current node and its children. May have
- `&` in it. (fast)
-
- ### HTMLElement#structuredText
-
- Get structured Text
-
- ### HTMLElement#structure
-
- Get DOM structure
-
- ### HTMLElement#firstChild
-
- Get first child node
-
- ### HTMLElement#lastChild
-
- Get last child node
-
- ### HTMLElement#innerHTML
-
- Get innerHTML.
-
- ### HTMLElement#outerHTML
-
- Get outerHTML.
-
|