# Fast HTML Parser [![NPM version](https://badge.fury.io/js/node-html-parser.png)](http://badge.fury.io/js/node-html-parser) [![Build Status](https://travis-ci.org/taoqf/node-html-parser.svg?branch=master)](https://travis-ci.org/taoqf/node-html-parser)

Fast HTML Parser is a _very fast_ HTML parser that generates a simplified
DOM tree with basic element query support.

By design, it aims to parse massive HTML files at the lowest possible cost, so
performance is the top priority. For this reason, some malformed HTML may not
parse correctly, but the most common errors are covered (e.g. HTML4-style
unclosed `<li>`, `<td>`, etc.).
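
To illustrate the kind of error recovery mentioned above, here is a minimal sketch of how a parser can tolerate HTML4-style unclosed `<li>` tags. It is not this library's implementation: the token format and the `buildTree` helper are hypothetical and exist only for illustration.

```js
// Hypothetical sketch: tolerate unclosed <li> by closing the open <li>
// implicitly when a sibling <li> starts. Not the library's actual code.
function buildTree(tokens) {
  const root = { tagName: null, childNodes: [] };
  const stack = [root];
  for (const token of tokens) {
    if (token.type === 'open') {
      // A new <li> while an <li> is still open means the previous one
      // was never closed, so close it first.
      if (token.tagName === 'li' && stack[stack.length - 1].tagName === 'li') {
        stack.pop();
      }
      const node = { tagName: token.tagName, childNodes: [] };
      stack[stack.length - 1].childNodes.push(node);
      stack.push(node);
    } else if (token.type === 'close') {
      // Pop back to the matching element; any unclosed descendants
      // are closed along the way.
      const index = stack.map((n) => n.tagName).lastIndexOf(token.tagName);
      if (index > 0) stack.length = index;
    } else {
      // Text token: attach to the element that is currently open.
      stack[stack.length - 1].childNodes.push({ text: token.text });
    }
  }
  return root;
}

// Token stream for: <ul><li>a<li>b</ul>
const tree = buildTree([
  { type: 'open', tagName: 'ul' },
  { type: 'open', tagName: 'li' },
  { type: 'text', text: 'a' },
  { type: 'open', tagName: 'li' },
  { type: 'text', text: 'b' },
  { type: 'close', tagName: 'ul' },
]);
const ul = tree.childNodes[0];
console.log(ul.childNodes.map((n) => n.tagName)); // [ 'li', 'li' ]
```

Both items end up as siblings under `<ul>` even though neither `<li>` was closed explicitly.
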

## Install

```shell
npm install --save node-html-parser
```

## Performance

In the benchmark below, node-html-parser is faster than htmlparser2 and the other parsers tested:

```shell
node-html-parser: 1.94548 ms/file ± 2.15709
libxmljs        : 5.28893 ms/file ± 3.69863
htmlparser      : 24.9625 ms/file ± 168.380
htmlparser2     : 3.34011 ms/file ± 4.76959
parse5          : 13.9589 ms/file ± 9.84068
high5           : 6.98078 ms/file ± 4.47575
```

Tested with [htmlparser-benchmark](https://github.com/AndreasMadsen/htmlparser-benchmark).

## Usage

```ts
import { parse } from 'node-html-parser';

const root = parse('<ul id="list"><li>Hello World</li></ul>');

console.log(root.firstChild.structure);
// ul#list
//   li
//     #text

console.log(root.querySelector('#list'));
// { tagName: 'ul',
//   rawAttrs: 'id="list"',
//   childNodes:
//    [ { tagName: 'li',
//        rawAttrs: '',
//        childNodes: [Object],
//        classNames: [] } ],
//   id: 'list',
//   classNames: [] }

console.log(root.toString());
// <ul id="list"><li>Hello World</li></ul>

root.set_content('<li>Hello World</li>');
root.toString(); // <li>Hello World</li>
```

```js
var HTMLParser = require('node-html-parser');

var root = HTMLParser.parse('<ul id="list"><li>Hello World</li></ul>');
```

## HTMLElement Methods

### parse(data[, options])

Parse the given data and return the root of the generated DOM.

- **data**: the data to parse
- **options**: parse options

  ```js
  {
    lowerCaseTagName: false, // convert tag names to lower case (hurts performance heavily)
    script: false,           // retrieve content in <script> (hurts performance slightly)
    style: false,            // retrieve content in <style> (hurts performance slightly)
    pre: false,              // retrieve content in <pre> (hurts performance slightly)
    comment: false           // retrieve comments (hurts performance slightly)
  }
  ```

### HTMLElement#trimRight()

Trim the element from the right (in block) after a pattern is seen in a TextNode.

### HTMLElement#removeWhitespace()

Remove whitespace in this subtree.
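
As a rough illustration of what removing whitespace from a subtree can mean, the sketch below works on plain objects rather than on the library's node classes. The node shapes (`{ text }` for text, `{ tagName, childNodes }` for elements), the `stripWhitespace` name, and the choice to trim surviving text nodes are all assumptions made for this example.

```js
// Hypothetical sketch on plain objects, not the library's implementation.
// Text nodes are { text }, elements are { tagName, childNodes }.
function stripWhitespace(node) {
  const kept = [];
  for (const child of node.childNodes) {
    if ('text' in child) {
      const trimmed = child.text.trim();
      if (trimmed === '') continue;  // drop whitespace-only text nodes
      kept.push({ text: trimmed }); // assumption: remaining text is trimmed
    } else {
      kept.push(stripWhitespace(child)); // recurse into child elements
    }
  }
  node.childNodes = kept;
  return node;
}

// Roughly: <ul>\n  <li>  Hello World </li>\n</ul>
const list = {
  tagName: 'ul',
  childNodes: [
    { text: '\n  ' },
    { tagName: 'li', childNodes: [{ text: '  Hello World ' }] },
    { text: '\n' },
  ],
};

console.log(JSON.stringify(stripWhitespace(list)));
// {"tagName":"ul","childNodes":[{"tagName":"li","childNodes":[{"text":"Hello World"}]}]}
```
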

### HTMLElement#querySelectorAll(selector)

Query a CSS selector to find matching nodes.

Note: only `tagName`, `#id`, and `.class` selectors are supported. It also does
not behave the same as the standard `querySelectorAll()`, as it will _stop_
searching a subtree after finding a match.
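
The difference from the standard behaviour is easiest to see on a nested match. The sketch below mimics the "stop at the first match in a subtree" rule on a hand-built tree of plain objects; `findStopOnMatch` is a hypothetical helper that handles only a bare tag name, not the library's selector engine.

```js
// Hypothetical sketch of the "stop searching a subtree after a match"
// rule described above. Only a bare tag name is supported.
function findStopOnMatch(node, tagName, results = []) {
  for (const child of node.childNodes || []) {
    if (child.tagName === tagName) {
      results.push(child); // matched: do not descend into this subtree
    } else {
      findStopOnMatch(child, tagName, results);
    }
  }
  return results;
}

// <div id="outer"><section><div id="inner"></div></section></div>
// <div id="second"></div>
const doc = {
  childNodes: [
    {
      tagName: 'div',
      id: 'outer',
      childNodes: [
        {
          tagName: 'section',
          childNodes: [{ tagName: 'div', id: 'inner', childNodes: [] }],
        },
      ],
    },
    { tagName: 'div', id: 'second', childNodes: [] },
  ],
};

const ids = findStopOnMatch(doc, 'div').map((n) => n.id);
console.log(ids); // [ 'outer', 'second' ]
```

A standard `querySelectorAll('div')` would also return the `inner` element; under the rule sketched here it is skipped because its ancestor `outer` already matched.
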

### HTMLElement#querySelector(selector)

Query a CSS selector to find the first matching node.

### HTMLElement#appendChild(node)

Append a child node to `childNodes`.

### HTMLElement#insertAdjacentHTML(where, html)

Parse the specified text as HTML and insert the resulting nodes into the DOM tree at the specified position.

### HTMLElement#setAttribute(key: string, value: string)

Set the `key` attribute to `value`.

### HTMLElement#removeAttribute(key: string)

Remove the `key` attribute.

### HTMLElement#getAttribute(key: string)

Get the value of the `key` attribute.

### HTMLElement#exchangeChild(oldNode: Node, newNode: Node)

Exchange the given child with a new child.

### HTMLElement#removeChild(node: Node)

Remove a child node.

### HTMLElement#toString()

Same as [outerHTML](#htmlelementouterhtml).

### HTMLElement#set_content(content: string | Node | Node[])

Set content. **Notice**: Do not set the content of the **root** node.

## HTMLElement Properties

### HTMLElement#text

Get the unescaped text value of the current node and its children, like `innerText`.
(Slow the first time.)

### HTMLElement#rawText

Get the escaped (as-is) text value of the current node and its children. May contain
`&amp;`. (Fast.)
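
The sketch below illustrates the distinction between the two properties on a plain string. `unescapeBasic` and `TextLike` are hypothetical names that decode only a handful of entities; the caching in the getter is an assumption about why `text` would be slow only the first time, not a description of the library's internals.

```js
// Hypothetical helper: decodes only a few common entities in one pass.
const ENTITIES = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };

function unescapeBasic(raw) {
  return raw.replace(/&(?:amp|lt|gt|quot|#39);/g, (match) => ENTITIES[match]);
}

// Hypothetical stand-in for a node: rawText is stored as-is, text is
// decoded lazily and cached (assumed reason for "slow the first time").
class TextLike {
  constructor(rawText) {
    this.rawText = rawText;
    this.cachedText = undefined;
  }

  get text() {
    if (this.cachedText === undefined) {
      this.cachedText = unescapeBasic(this.rawText);
    }
    return this.cachedText;
  }
}

const sample = new TextLike('Tom &amp; Jerry &lt;3');
console.log(sample.rawText); // Tom &amp; Jerry &lt;3
console.log(sample.text);    // Tom & Jerry <3
```
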

### HTMLElement#structuredText

Get the structured text.

### HTMLElement#structure

Get the DOM structure.

### HTMLElement#firstChild

Get the first child node.

### HTMLElement#lastChild

Get the last child node.

### HTMLElement#innerHTML

Get the innerHTML.

### HTMLElement#outerHTML

Get the outerHTML.