HTML 5

Draft Recommendation — 7 July 2008

9.4 Serializing HTML fragments

The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element or Document, referred to as the node, and either returns a string or raises an exception.

This algorithm serializes the children of the node being serialized, not the node itself.

  1. Let s be a string, and initialise it to the empty string.

  2. For each child node of the node, in tree order, run the following steps:

    1. Let current node be the child node being processed.

    2. Append the appropriate string from the following list to s:

      If current node is an Element

      Append a U+003C LESS-THAN SIGN (<) character, followed by the element's tag name. (For nodes created by the HTML parser, Document.createElement(), or Document.renameNode(), the tag name will be lowercase.)

      For each attribute that the element has, append a U+0020 SPACE character, the attribute's name (which, for attributes set by the HTML parser or by Element.setAttributeNode() or Element.setAttribute(), will be lowercase), a U+003D EQUALS SIGN (=) character, a U+0022 QUOTATION MARK (") character, the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK (") character.

      While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.

      Append a U+003E GREATER-THAN SIGN (>) character.

      If current node is an area, base, basefont, bgsound, br, col, embed, frame, hr, img, input, link, meta, param, spacer, or wbr element, then continue on to the next child node at this point.

      If current node is a pre, textarea, or listing element, append a U+000A LINE FEED (LF) character.

      Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN (<) character, a U+002F SOLIDUS (/) character, the element's tag name again, and finally a U+003E GREATER-THAN SIGN (>) character.

      If current node is a Text or CDATASection node

      If one of the ancestors of current node is a style, script, xmp, iframe, noembed, noframes, noscript, or plaintext element, then append the value of current node's data DOM attribute literally.

      Otherwise, append the value of current node's data DOM attribute, escaped as described below.

      If current node is a Comment

      Append the literal string <!-- (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data DOM attribute, followed by the literal string --> (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).

      If current node is a ProcessingInstruction

      Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target DOM attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data DOM attribute, followed by a single U+003E GREATER-THAN SIGN (>) character.

      If current node is a DocumentType

      Append the literal string <!DOCTYPE (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name DOM attribute, followed by the literal string > (U+003E GREATER-THAN SIGN).

      Other node types (e.g. Attr) cannot occur as children of elements. If, despite this, they somehow do occur, this algorithm must raise an INVALID_STATE_ERR exception.

  3. The result of the algorithm is the string s.

Escaping a string (for the purposes of the algorithm above) consists of replacing any occurrences of the "&" character by the string "&amp;", any occurrences of the "<" character by the string "&lt;", any occurrences of the ">" character by the string "&gt;", any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;", and, if the algorithm was invoked in the attribute mode, any occurrences of the U+0022 QUOTATION MARK (") character by the string "&quot;".
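The serialization steps and the escaping rule above can be sketched in Python as follows. This is an illustrative sketch only, not a normative implementation: it uses the standard library's xml.dom.minidom as a stand-in DOM, and the names escape, serialize_fragment, has_literal_text_ancestor, and the three constant sets are hypothetical. It assumes tag names are already lowercase, and relies on minidom returning attributes in insertion order to satisfy the stable-order requirement.

```python
from xml.dom import InvalidStateErr, Node, minidom

# Elements whose start tag is never followed by content or an end tag.
VOID_ELEMENTS = {
    "area", "base", "basefont", "bgsound", "br", "col", "embed", "frame",
    "hr", "img", "input", "link", "meta", "param", "spacer", "wbr",
}
# Elements whose descendant text is appended literally, without escaping.
LITERAL_TEXT_ELEMENTS = {
    "style", "script", "xmp", "iframe", "noembed", "noframes", "noscript",
    "plaintext",
}
# Elements that get a line feed appended after their start tag.
LEADING_NEWLINE_ELEMENTS = {"pre", "textarea", "listing"}


def escape(text, attribute_mode=False):
    """Escape a string; '&' is replaced first so later entities survive."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    return text


def has_literal_text_ancestor(node):
    """Return True if any ancestor element is in LITERAL_TEXT_ELEMENTS."""
    parent = node.parentNode
    while parent is not None:
        if (parent.nodeType == Node.ELEMENT_NODE
                and parent.tagName in LITERAL_TEXT_ELEMENTS):
            return True
        parent = parent.parentNode
    return False


def serialize_fragment(node):
    """Serialize the children of node (not node itself) to a string."""
    s = ""
    for current in node.childNodes:
        kind = current.nodeType
        if kind == Node.ELEMENT_NODE:
            s += "<" + current.tagName
            for name, value in current.attributes.items():
                s += " " + name + '="' + escape(value, True) + '"'
            s += ">"
            if current.tagName in VOID_ELEMENTS:
                continue
            if current.tagName in LEADING_NEWLINE_ELEMENTS:
                s += "\n"
            s += serialize_fragment(current) + "</" + current.tagName + ">"
        elif kind in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            if has_literal_text_ancestor(current):
                s += current.data
            else:
                s += escape(current.data)
        elif kind == Node.COMMENT_NODE:
            s += "<!--" + current.data + "-->"
        elif kind == Node.PROCESSING_INSTRUCTION_NODE:
            s += "<?" + current.target + " " + current.data + ">"
        elif kind == Node.DOCUMENT_TYPE_NODE:
            s += "<!DOCTYPE " + current.name + ">"
        else:
            raise InvalidStateErr("unexpected child node type")
    return s


# Usage: build a small tree and serialize the children of <body>.
doc = minidom.Document()
body = doc.createElement("body")
div = doc.createElement("div")
div.setAttribute("title", 'a "b" & c')
div.appendChild(doc.createTextNode("1 < 2"))
div.appendChild(doc.createElement("br"))
div.appendChild(doc.createComment(" note "))
script = doc.createElement("script")
script.appendChild(doc.createTextNode("if (a < b) {}"))
div.appendChild(script)
body.appendChild(div)
result = serialize_fragment(body)
```

In this sketch the text inside the script element is emitted unescaped, while the same "<" character in the div's own text node becomes "&lt;".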

Entity reference nodes are assumed to be expanded by the user agent, and are therefore not covered in the algorithm above.

It is possible that the output of this algorithm, if parsed with an HTML parser, will not reproduce the original tree structure. For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. Further examples include a script element that contains a text node with the text string "</script>", and a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).
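The comment case can be illustrated with a small Python sketch. It does not run a real HTML parser: reparse_comment is a hypothetical, deliberately simplified stand-in that models only the rule that a comment ends at the first "-->", and serialize_comment is a hypothetical helper that mirrors the Comment branch of the serialization algorithm.

```python
def serialize_comment(data):
    """Serialize a Comment node's data as the algorithm does: unescaped."""
    return "<!--" + data + "-->"


def reparse_comment(markup):
    """Simplified stand-in for a parser: end the comment at the first '-->'.

    Returns the recovered comment data and whatever markup follows it.
    """
    opener, closer = "<!--", "-->"
    if not markup.startswith(opener):
        raise ValueError("markup does not start with a comment")
    end = markup.index(closer, len(opener))
    return markup[len(opener):end], markup[end + len(closer):]


# A comment whose data contains "-->" does not survive the round trip.
markup = serialize_comment("kept --> <b>leaked</b>")
comment_data, trailing_markup = reparse_comment(markup)
```

Here comment_data holds only the part of the data before the embedded "-->", and trailing_markup holds the remainder, which a parser would go on to treat as ordinary markup.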

9.5 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm takes two inputs: a DOM Element, referred to as the context element, which gives the context for the parser, and a string to parse, referred to as the input. It returns a list of zero or more nodes.

Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

  1. Create a new Document node, and mark it as being an HTML document.

  2. Create a new HTML parser, and associate it with the just created Document node.

  3. Set the HTML parser's tokenisation stage's content model flag according to the context element, as follows:

    If it is a title or textarea element
      Set the content model flag to RCDATA.

    If it is a style, script, xmp, iframe, noembed, or noframes element
      Set the content model flag to CDATA.

    If it is a noscript element
      If the scripting flag is enabled, set the content model flag to CDATA. Otherwise, set the content model flag to PCDATA.

    If it is a plaintext element
      Set the content model flag to PLAINTEXT.

    Otherwise
      Set the content model flag to PCDATA.

  4. Let root be a new html element with no attributes.

  5. Append the element root to the Document node created above.

  6. Set up the parser's stack of open elements so that it contains just the single element root.

  7. Reset the parser's insertion mode appropriately.

    The parser will reference the context element as part of that algorithm.

  8. Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), or, if there is no such form element, to null.

  9. Place into the input stream for the HTML parser just created the input.

  10. Start the parser and let it run until it has consumed all the characters just inserted into the input stream.

  11. Return all the child nodes of root, preserving the document order.
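Two of the steps above depend only on the context element and can be sketched in isolation: choosing the content model flag (step 3) and locating the form element pointer (step 8). The Python sketch below is illustrative, not a parser. It uses xml.dom.minidom as a stand-in DOM, represents the flag as a plain string, takes the scripting flag as a boolean parameter, and assumes lowercase tag names. The function names content_model_flag and nearest_form are hypothetical.

```python
from xml.dom import Node, minidom

CDATA_CONTEXTS = {"style", "script", "xmp", "iframe", "noembed", "noframes"}


def content_model_flag(context, scripting_enabled):
    """Pick the tokeniser's content model flag for a context element."""
    tag = context.tagName
    if tag in ("title", "textarea"):
        return "RCDATA"
    if tag in CDATA_CONTEXTS:
        return "CDATA"
    if tag == "noscript":
        return "CDATA" if scripting_enabled else "PCDATA"
    if tag == "plaintext":
        return "PLAINTEXT"
    return "PCDATA"


def nearest_form(context):
    """Walk up from the context element (inclusive) to the nearest form.

    Returns None when neither the element nor any ancestor is a form.
    """
    node = context
    while node is not None:
        if node.nodeType == Node.ELEMENT_NODE and node.tagName == "form":
            return node
        node = node.parentNode
    return None


# Usage: a textarea nested inside a form, and a detached noscript element.
doc = minidom.Document()
form = doc.createElement("form")
wrapper = doc.createElement("div")
textarea = doc.createElement("textarea")
form.appendChild(wrapper)
wrapper.appendChild(textarea)
noscript = doc.createElement("noscript")

textarea_flag = content_model_flag(textarea, scripting_enabled=True)
textarea_form = nearest_form(textarea)
noscript_flag = content_model_flag(noscript, scripting_enabled=False)
noscript_form = nearest_form(noscript)
```

The remaining steps (creating the Document and root html element, resetting the insertion mode, and running the tokeniser and tree construction) need a full HTML parser and are not modelled here.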