# saxes

A sax-style non-validating parser for XML.

Saxes is a fork of [sax](https://github.com/isaacs/sax-js) 1.2.4. All mentions
of sax in this project's documentation are references to sax 1.2.4.

Designed with [node](http://nodejs.org/) in mind, but should work fine in the
browser or other CommonJS implementations.

Saxes does not support Node versions older than 10.

## Notable Differences from Sax

* Saxes aims to be much stricter than sax with regard to XML
  well-formedness. Sax, even in its so-called "strict mode", is not strict. It
  silently accepts structures that are not well-formed XML. Projects that need
  better compliance with well-formedness constraints cannot use sax as-is.

  Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes
  will report well-formedness errors in all these cases but it won't try to
  extract data from malformed documents like sax does.

* Saxes is much faster than sax, mostly because of a substantial redesign
  of the internal parsing logic. The speed improvement is not merely due to
  removing features that were supported by sax. That helped a bit, but saxes
  adds some expensive checks in its aim for conformance with the XML
  specification. Redesigning the parsing logic is what accounts for most of the
  performance improvement.

* Saxes does not aim to support antiquated platforms. We will not pollute the
  source or the default build with support for antiquated platforms. If you want
  support for IE 11, you are welcome to produce a PR that adds a *new build*
  transpiled to ES5.

* Saxes handles errors differently from sax: it provides a default `onerror`
  handler which throws. You can replace it with your own handler if you want. If
  your handler does nothing, there is no `resume` method to call.

* There's no `Stream` API. A revamped API may be introduced later. (It is still
  a "streaming parser" in the general sense that you write a character stream to
  it.)

* Saxes does not have facilities for limiting the size of the data chunks
  passed to event handlers. See the FAQ entry for more details.

## Conformance

Saxes supports:

* [XML 1.0 fifth edition](https://www.w3.org/TR/2008/REC-xml-20081126/)
* [XML 1.1 second edition](https://www.w3.org/TR/2006/REC-xml11-20060816/)
* [Namespaces in XML 1.0 (Third Edition)](https://www.w3.org/TR/2009/REC-xml-names-20091208/)
* [Namespaces in XML 1.1 (Second Edition)](https://www.w3.org/TR/2006/REC-xml-names11-20060816/)

## Limitations

This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs
encountered. However, this parser does not thoroughly parse the contents of
DTDs. So most malformedness errors caused by errors **in DTDs** cannot be
reported.

## Regarding `<!DOCTYPE` and `<!ENTITY`

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything with
that. If you want to listen to the `doctype` event, and then fetch the
doctypes, and read the entities and add them to `parser.ENTITIES`, then be my
guest.

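
The do-it-yourself approach described above could look roughly like the sketch below. Both helper names (`extractInternalEntities`, `registerEntities`) are hypothetical and not part of saxes. The regular expression is deliberately naive: it only recognizes internal general entities with a quoted literal value, skips parameter and external entities, and does not expand references inside the values. It also assumes that the `doctype` handler receives the doctype contents as a string.

```javascript
// Naive sketch: collect internal general entities such as
//   <!ENTITY copy "(c)">
// from the text of a DOCTYPE. Parameter entities (<!ENTITY % ...>) and
// external entities (SYSTEM/PUBLIC) are deliberately ignored.
function extractInternalEntities(doctype) {
  const pattern = /<!ENTITY\s+([^\s%"'<>]+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  const entities = {};
  let match;
  while ((match = pattern.exec(doctype)) !== null) {
    // Group 2 holds a double-quoted value, group 3 a single-quoted one.
    entities[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return entities;
}

// Copy the collected entities onto the parser's entity table.
// `parser` is assumed to expose an `ENTITIES` object, as described above.
function registerEntities(parser, entities) {
  Object.assign(parser.ENTITIES, entities);
  return parser;
}
```

With a real parser, the two helpers would presumably be wired together with something like `parser.on("doctype", (doctype) => registerEntities(parser, extractInternalEntities(doctype)));`.
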
## Documentation

The source code contains JSDoc comments. Use them. What follows is a brief
summary of what is available. The final authority is the source code.

**PAY CLOSE ATTENTION TO WHAT IS PUBLIC AND WHAT IS PRIVATE.**

The move to TypeScript makes it so that everything is now formally private,
protected, or public.

If you use anything not public, that's at your own peril.

If there's a mistake in the documentation, raise an issue. If you just assume,
you may assume incorrectly.

## Summary Usage Information

### Example

```javascript
// This path works from a checkout of the repository. When saxes is installed
// from npm, use require("saxes") instead.
var saxes = require("./lib/saxes"),
    parser = new saxes.SaxesParser();

parser.on("error", function (e) {
  // an error happened.
});
parser.on("text", function (t) {
  // got some text. t is the string of text.
});
parser.on("opentag", function (node) {
  // opened a tag. node has "name" and "attributes"
});
parser.on("end", function () {
  // parser stream is done, and ready to have more stuff written to it.
});

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
```

### Constructor Arguments

Settings supported:

* `xmlns` - Boolean. If `true`, then namespaces are supported. Default
  is `false`.

* `position` - Boolean. If `false`, then don't track line/col/position. Unset is
  treated as `true`. Default is unset. Currently, setting this to `false` only
  results in a cosmetic change: the errors reported do not contain position
  information. sax-js would literally turn off the position-computing logic if
  this flag was set to false. The notion was that it would optimize
  execution. In saxes at least it turns out that continually testing this flag
  causes a cost that offsets the benefits of turning off this logic.

* `fileName` - String. Set a file name for error reporting. This is useful only
  when tracking positions. You may leave it unset.

* `fragment` - Boolean. If `true`, parse the XML as an XML fragment. Default is
  `false`.

* `additionalNamespaces` - A plain object whose key, value pairs define
  namespaces known before parsing the XML file. It is not legal to pass
  bindings for the namespaces `"xml"` or `"xmlns"`.

* `defaultXMLVersion` - The default version of the XML specification to use if
  the document contains no XML declaration. If the document does contain an XML
  declaration, then this setting is ignored. Must be `"1.0"` or `"1.1"`. The
  default is `"1.0"`.

* `forceXMLVersion` - Boolean. A flag indicating whether to force the XML
  version used for parsing to the value of ``defaultXMLVersion``. When this flag
  is ``true``, ``defaultXMLVersion`` must be specified. If unspecified, the
  default value of this flag is ``false``.

  Example: suppose you are parsing a document that has an XML declaration
  specifying XML version 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` without setting
  ``forceXMLVersion`` then the XML declaration will override the value of
  ``defaultXMLVersion`` and the document will be parsed according to XML 1.1.

  If you set ``defaultXMLVersion`` to ``"1.0"`` and set ``forceXMLVersion`` to
  ``true``, then the XML declaration will be ignored and the document will be
  parsed according to XML 1.0.

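
The interaction between `defaultXMLVersion` and `forceXMLVersion` can be summed up in a few lines. The function below is a hypothetical model of the rules stated above, not code taken from saxes; `declaredVersion` stands for the version found in the XML declaration, or `undefined` when the document has none.

```javascript
// Hypothetical model of the version-selection rules described above.
// It is NOT part of saxes; it only restates the documented behavior.
function effectiveXMLVersion(options, declaredVersion) {
  const { defaultXMLVersion, forceXMLVersion = false } = options;
  if (defaultXMLVersion !== undefined &&
      defaultXMLVersion !== "1.0" && defaultXMLVersion !== "1.1") {
    throw new Error('defaultXMLVersion must be "1.0" or "1.1"');
  }
  if (forceXMLVersion) {
    // Forcing a version requires that one be specified.
    if (defaultXMLVersion === undefined) {
      throw new Error("forceXMLVersion requires defaultXMLVersion");
    }
    return defaultXMLVersion;
  }
  // Without forcing, a version declared in the document wins.
  if (declaredVersion !== undefined) {
    return declaredVersion;
  }
  return defaultXMLVersion !== undefined ? defaultXMLVersion : "1.0";
}
```

For the document in the example above, which declares version 1.1, `effectiveXMLVersion({ defaultXMLVersion: "1.0" }, "1.1")` yields `"1.1"`, while adding `forceXMLVersion: true` yields `"1.0"`.
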
### Methods

`write` - Write a chunk of the document, as a string, to the parser. You don't
have to pass the whole document in one `write` call. You can read your source
chunk by chunk and call `write` with each chunk.

`close` - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the `end` event.

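
As a minimal sketch of chunk-by-chunk feeding, the hypothetical helper below (not part of saxes) writes every chunk from an iterable and then closes the parser. It relies only on the `write` and `close` methods described above, so any object that provides them will do.

```javascript
// Hypothetical helper: feed an iterable of string chunks to a parser,
// then close it. `parser` only needs `write` and `close` methods.
function writeAll(parser, chunks) {
  for (const chunk of chunks) {
    parser.write(chunk);
  }
  parser.close();
  return parser;
}
```

With a real parser this might be called as `writeAll(new saxes.SaxesParser(), ["<doc>", "some text", "</doc>"])`, after the event handlers have been registered.
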
### Properties

The parser has the following properties:

`line`, `column`, `columnIndex`, `position` - Indications of the position in the
XML document where the parser currently is looking. The `columnIndex` property
counts columns as if indexing into a JavaScript string, whereas the `column`
property counts Unicode characters.

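
The difference between the two ways of counting only shows up when a line contains characters outside the Basic Multilingual Plane. The hypothetical function below (not part of saxes) illustrates the two counting schemes on a plain string. It says nothing about the exact offsets saxes reports, only about how the two counts diverge.

```javascript
// Illustration only: count the same text in UTF-16 code units (how
// JavaScript string indexes work) and in Unicode code points.
function countColumns(textBeforeCursor) {
  return {
    codeUnits: textBeforeCursor.length,
    codePoints: Array.from(textBeforeCursor).length,
  };
}

// U+1F4A9 takes two code units but is a single code point.
const counts = countColumns("a\u{1F4A9}b");
console.log(counts.codeUnits, counts.codePoints);
```
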
`closed` - Boolean indicating whether or not the parser can be written to. If
it's `true`, then wait for the `ready` event to write again.

`opt` - Any options passed into the constructor.

`xmlDecl` - The XML declaration for this document. It contains the fields
`version`, `encoding` and `standalone`. They are all `undefined` before
encountering the XML declaration. If they are undefined after the XML
declaration, the corresponding value was not set by the declaration. There is no
event associated with the XML declaration. In a well-formed document, the XML
declaration may be preceded only by an optional BOM. So by the time any event
generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!

### Error Handling

The parser continues to parse even upon encountering errors, and does its best
to continue reporting errors. You should heed all errors reported. After an
error, however, saxes may interpret your document incorrectly. For instance
``<foo a=bc="d"/>`` is invalid XML. Did you mean to have ``<foo a="bc=d"/>`` or
``<foo a="b" c="d"/>`` or some other variation? For the sake of continuing to
provide errors, saxes will continue parsing the document, but the structure it
reports may be incorrect. It is only after the errors are fixed in the document
that saxes can provide a reliable interpretation of the document.

That leaves you with two rules of thumb when using saxes:

* Pay attention to the errors that saxes reports. The default `onerror` handler
  throws, so by default, you cannot miss errors.

* **ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER
  THAN `onerror`.** As explained above, when saxes runs into a well-formedness
  problem, it makes a guess in order to continue reporting more errors. The guess
  may be wrong.

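
One way to apply both rules of thumb is to collect every error and to discard whatever the other handlers produced as soon as one error has been seen. The sketch below assumes a parser exposing `on`, `write` and `close` as in the example near the top of this document; the function name `parseOrReject` is hypothetical.

```javascript
// Hypothetical wrapper: gather text events, but report them only if the
// whole document parsed without a single error.
function parseOrReject(parser, xml) {
  const errors = [];
  const texts = [];
  // Replacing the default handler keeps the parser going after an error,
  // so that all errors can be reported together.
  parser.on("error", (err) => {
    errors.push(err);
  });
  parser.on("text", (text) => {
    // After the first error, the structure reported is only a guess.
    if (errors.length === 0) {
      texts.push(text);
    }
  });
  parser.write(xml).close();
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return { ok: true, texts };
}
```

Note that text collected before the first error is thrown away too: the caller gets either a clean result or a list of errors, never a mix.
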
### Events

To listen to an event, override `on<eventname>`. The list of supported events
is also in the exported `EVENTS` array.

See the JSDoc comments in the source code for a description of each supported
event.

### Parsing XML Fragments

The XML specification does not define any method by which to parse XML
fragments. However, there are usage scenarios in which it is desirable to parse
fragments. In order to allow this, saxes provides three initialization options.

If you pass the option `fragment: true` to the parser constructor, the parser
will expect an XML fragment. It essentially starts with a parsing state
equivalent to the one it would be in if `parser.write("<foo>")` had been called
right after initialization. In other words, it expects content which is
acceptable inside an element. This also turns off well-formedness checks that
are inappropriate when parsing a fragment.

The option `additionalNamespaces` allows you to define additional prefix-to-URI
bindings known before parsing starts. You would use this over `resolvePrefix` if
you have at the ready a series of namespace bindings to use.

The option `resolvePrefix` allows you to pass a function which saxes will use if
it is unable to resolve a namespace prefix by itself. You would use this over
`additionalNamespaces` in a context where getting a complete list of defined
namespaces is onerous.

Note that you can use `additionalNamespaces` and `resolvePrefix` together if you
want. `additionalNamespaces` applies before `resolvePrefix`.

The options `additionalNamespaces` and `resolvePrefix` are really meant to be
used for parsing fragments. However, saxes won't prevent you from using them
with `fragment: false`. Note that if you do this, your document may parse
without errors and yet be malformed because the document can refer to namespaces
which are not defined *in* the document.

Of course, `additionalNamespaces` and `resolvePrefix` are used only if `xmlns`
is `true`. If you are parsing a fragment that does not use namespaces, there's
no point in setting these options.

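
The sketch below puts the three fragment-related options together in one options object. The namespace bindings are arbitrary examples, and the `resolvePrefix` callback is assumed to take a prefix and return a URI, or `undefined` when the prefix is unknown. The `lookupPrefix` function is a hypothetical toy model of the precedence described above (`additionalNamespaces` first, then `resolvePrefix`). It is not saxes code and it ignores bindings declared in the fragment itself.

```javascript
// Bindings known up front.
const additionalNamespaces = { svg: "http://www.w3.org/2000/svg" };

// Bindings that are looked up on demand. A Map stands in for whatever
// costly lookup a real application would perform.
const lazyBindings = new Map([["xlink", "http://www.w3.org/1999/xlink"]]);
function resolvePrefix(prefix) {
  return lazyBindings.get(prefix); // undefined when the prefix is unknown
}

// Options object meant to be passed to the parser constructor.
const fragmentOptions = {
  xmlns: true, // the two namespace options are used only when this is true
  fragment: true,
  additionalNamespaces,
  resolvePrefix,
};

// Toy model of the documented precedence; NOT saxes' implementation.
function lookupPrefix(options, prefix) {
  const known = options.additionalNamespaces || {};
  if (Object.prototype.hasOwnProperty.call(known, prefix)) {
    return known[prefix];
  }
  return options.resolvePrefix ? options.resolvePrefix(prefix) : undefined;
}
```
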
### Performance Tips

* saxes works faster on files that use newlines (``\u000A``) as end of line
  markers than files that use other end of line markers (like ``\r`` or
  ``\r\n``). The XML specification requires that conformant applications behave
  as if all characters that are to be treated as end of line characters are
  converted to ``\u000A`` prior to parsing. The optimal code path for saxes is a
  file in which all end of line characters are already ``\u000A``.

* Don't split Unicode strings you feed to saxes across surrogates. When you
  naively split a string in JavaScript, you run the risk of splitting a Unicode
  character into two surrogates. For example, in the following code ``a`` and
  ``b`` each contain half of a single Unicode character:
  ``const a = "\u{1F4A9}"[0]; const b = "\u{1F4A9}"[1]``. If you feed such split
  surrogates to versions of saxes prior to 4, you'd get errors. Saxes version 4
  and over are able to detect when a chunk of data ends with a surrogate and
  carry over the surrogate to the next chunk. However, this operation entails
  slicing and concatenating strings. If you can feed your data in a way that
  does not split surrogates, you should do it. (Obviously, feeding all the data
  at once with a single write is fastest.)

* Don't set event handlers you don't need. Saxes has always aimed to avoid doing
  work that will just be tossed away, and future improvements should do this
  more aggressively. One way saxes knows whether or not some data is needed is
  by checking whether a handler has been set for a specific event.

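
If the data has to be cut into pieces, a splitter that never ends a chunk on the first half of a surrogate pair avoids the carry-over cost mentioned in the second tip. The generator below is a minimal sketch of one way to do it; `splitOnCodePoints` is a hypothetical helper, not part of saxes.

```javascript
// Hypothetical helper: yield pieces of `text` of at most `maxLength` UTF-16
// code units, without separating a high surrogate from its low surrogate.
function* splitOnCodePoints(text, maxLength) {
  if (!(maxLength >= 2)) {
    throw new RangeError("maxLength must be at least 2");
  }
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxLength, text.length);
    const last = text.charCodeAt(end - 1);
    // A high surrogate (0xD800-0xDBFF) at the end of a chunk means its other
    // half would land in the next chunk, so end this chunk one unit earlier.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) {
      end -= 1;
    }
    yield text.slice(start, end);
    start = end;
  }
}
```

Each piece can then be passed to `write` in turn. Chunks may come out one code unit shorter than `maxLength`, which is the price of keeping surrogate pairs whole.
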
## FAQ

Q. Why has saxes dropped support for limiting the size of data chunks passed to
event handlers?

A. With sax you could set ``MAX_BUFFER_LENGTH`` to cause the parser to limit the
size of data chunks passed to event handlers. So if you ran into a span of text
above the limit, multiple ``text`` events with smaller data chunks were fired
instead of a single event with a large chunk.

However, that functionality had some problematic characteristics. It had an
arbitrary default value. It was library-wide so all parsers created from a
single instance of the ``sax`` library shared it. This could potentially cause
conflicts among libraries running in the same VM but using sax for different
purposes.

These issues could have been easily fixed, but there were larger issues. The
buffer limit arbitrarily applied to some events but not others. It would split
``text``, ``cdata`` and ``script`` events. However, if a ``comment``,
``doctype``, ``attribute`` or ``processing instruction`` were more than the
limit, the parser would generate an error and you were left picking up the
pieces.

It was not intuitive to use. You'd think setting the limit to 1K would prevent
chunks bigger than 1K from being passed to event handlers. But that was not the
case. A comment in the source code told you that you might go over the limit if
you passed large chunks to ``write``. So if you want a 1K limit, don't pass 64K
chunks to ``write``. Fair enough. You know what limit you want so you can
control the size of the data you pass to ``write``. So you limit the chunks to
``write`` to 1K at a time. Even if you do this, your event handlers may get data
chunks that are 2K in size. Suppose on the previous ``write`` the parser has
just finished processing an open tag, so it is ready for text. Your ``write``
passes 1K of text. You are not above the limit yet, so no event is generated
yet. The next ``write`` passes another 1K of text. It so happens that sax checks
buffer limits only once per ``write``, after the chunk of data has been
processed. Now you've hit the limit and you get a ``text`` event with 2K of
data. So even if you limit your ``write`` calls to the buffer limit you've set,
you may still get events with chunks at twice the buffer size limit you've
specified.

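
The 2K scenario can be reproduced with a toy model. The function below is not sax code: it is a hypothetical simulation of the behavior described above, in which text accumulates in a buffer and the limit is checked only once per `write`, after the whole chunk has been processed.

```javascript
// Toy model of the behavior described above; NOT sax's actual code.
// Returns the text chunks that would be handed to a `text` handler.
function simulateBufferCheck(chunks, limit) {
  const events = [];
  let buffer = "";
  for (const chunk of chunks) {
    // The whole chunk is processed first...
    buffer += chunk;
    // ...and only then is the limit checked, once per write.
    if (buffer.length > limit) {
      events.push(buffer);
      buffer = "";
    }
  }
  // Whatever is left is emitted when the text ends.
  if (buffer.length > 0) {
    events.push(buffer);
  }
  return events;
}

const limit = 1024;
const events = simulateBufferCheck(["a".repeat(limit), "b".repeat(limit)], limit);
console.log(events.map((text) => text.length));
```

In this model, two writes of exactly 1K each produce a single event carrying 2K of text, twice the configured limit.
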
We may consider reinstating an equivalent functionality, provided that it
addresses the issues above and does not cause a huge performance drop for
use-case scenarios that don't need it.