The HTML Sanitizer runs on rich content submitted to a Telligent Community Server site. It ensures that user-submitted content is secure and well-formed, which prevents JavaScript-, plugin-, and HTML-based attacks (such as cross-site scripting and content spoofing). Specifically, the HTML Sanitizer performs the following verifications and adjustments:
- Ensures that only allowed HTML elements, attributes, and style attributes/values are used. Any invalid markup is removed.
- Ensures that only allowed URLs are referenced. Invalid URLs are removed.
- Ensures that HTML is well-formed. Improper nesting and formatting are corrected.
- Adjusts markup based on configured rules - for example, adjusting markup generated from Word/Outlook.
The HTML Sanitizer's behavior is defined within the <HtmlSanitization> node of the communityserver.config file.
Configuring allowed HTML/CSS
The allowed HTML tags and attributes as well as allowed CSS style attributes are defined in the <AllowedHtml> node under the <HtmlSanitization> node in the communityserver.config file in the following format:
<HtmlSanitization>
  <AllowedHtml>
    <UrlProtocols>
      <Protocol name="..." />
    </UrlProtocols>
    <Attributes>
      <Attribute name="..." />
    </Attributes>
    <Tags>
      <Tag name="...">
        <Attribute name="..." />
      </Tag>
      <Tag name="..." />
    </Tags>
    <Style>
      <Attribute name="..." />
      <Attribute name="...">
        <Value>...</Value>
      </Attribute>
    </Style>
  </AllowedHtml>
</HtmlSanitization>
URL protocols
The <UrlProtocols> node identifies the URL protocols (such as HTTP, FTP, and MAILTO) that the HTML Sanitizer allows in markup. Any URL that is not a local URL and does not use one of these protocols is removed. Each supported protocol should be identified via the name attribute of a new <Protocol /> node in the <UrlProtocols> node.
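As an illustration, a site that accepts only web and e-mail links might use a list like the following. The element name follows the format shown earlier. The specific protocol names and their casing are illustrative assumptions, not the shipped defaults.

```xml
<UrlProtocols>
  <!-- Illustrative values only; list the protocols your site should accept -->
  <Protocol name="http" />
  <Protocol name="https" />
  <Protocol name="mailto" />
</UrlProtocols>
```

Under this sketch, a link that uses any other protocol (for example, FTP) would be removed, while local URLs would be left in place.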
Attributes
The <Attributes> node identifies allowed HTML attributes for all allowed HTML tags. Specifying attributes in this node prevents the need to duplicate allowed attributes in the <Tags> node. Each globally allowed HTML attribute should be identified via the name attribute of a new <Attribute /> node within the <Attributes> node.
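A minimal sketch of a global attribute list is shown below. The attribute names are assumptions chosen for illustration and may differ from the defaults in your communityserver.config file.

```xml
<Attributes>
  <!-- Hypothetical choices: allowed on every tag listed in <Tags> -->
  <Attribute name="class" />
  <Attribute name="title" />
</Attributes>
```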
Tags
The <Tags> node identifies the HTML tags allowed in content processed by the HTML Sanitizer. Any tag that is not allowed is removed. Each allowed tag should be identified via the name attribute on a new <Tag /> node in the <Tags> node. <Tag /> nodes can optionally identify allowed attributes that are enabled only for that tag. (Note that attributes identified in the <Attributes> node are enabled for all tags.)
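The following sketch shows how tag-specific attributes might be combined with tags that rely only on the global attribute list. The tag and attribute names are illustrative assumptions, not the shipped configuration.

```xml
<Tags>
  <!-- href is allowed only on <a>; globally allowed attributes still apply -->
  <Tag name="a">
    <Attribute name="href" />
  </Tag>
  <!-- src and alt are allowed only on <img> -->
  <Tag name="img">
    <Attribute name="src" />
    <Attribute name="alt" />
  </Tag>
  <!-- No tag-specific attributes: only globally allowed attributes apply -->
  <Tag name="p" />
</Tags>
```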
Style
The <Style> node identifies allowed CSS style attributes. Only enabled style attributes are allowed in the style attribute of allowed HTML tags (assuming such a style attribute is allowed for the tag). Each allowed style attribute should be identified via the name attribute on a new <Attribute /> node in the <Style> node.
If you only want to allow specific values for an individual style attribute, a <Value> node can be added for each allowed value in the associated <Attribute> node. When an attribute itself is allowed but its value is not allowed, the HTML Sanitizer removes the attribute and its value.
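For example, the sketch below allows any value for one style attribute and restricts another to a fixed set of values. The attribute names and values are assumptions chosen for illustration.

```xml
<Style>
  <!-- Any value is accepted for color -->
  <Attribute name="color" />
  <!-- Only the listed values are accepted for text-align -->
  <Attribute name="text-align">
    <Value>left</Value>
    <Value>center</Value>
    <Value>right</Value>
  </Attribute>
</Style>
```

With a configuration like this, a style of "text-align: justify" would have the text-align attribute and its value removed, because "justify" is not among the listed values.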
Configuring HTML nesting and formatting
The HTML Sanitizer's formatting correction logic is configured in the <SelfContainedHtml>, <CloseBeforeNextHtml>, and <RemoveContentsWithTags> nodes under the <HtmlSanitization> node in the communityserver.config file in the following format:
<HtmlSanitization>
  <SelfContainedHtml>
    <Tag name="..." />
  </SelfContainedHtml>
  <CloseBeforeNextHtml>
    <Tag name="...">
      <ParentTag name="..." />
    </Tag>
    <Tag name="..." />
  </CloseBeforeNextHtml>
  <RemoveContentsWithTags>
    <Tag name="..." />
  </RemoveContentsWithTags>
</HtmlSanitization>
Self-contained HTML
The <SelfContainedHtml> node identifies tags that should always be self-contained (that is, they should never have separate opening and closing tags). Each self-contained tag should be identified via the name attribute on a new <Tag> node in the <SelfContainedHtml> node.
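A minimal sketch follows, using tags that never have closing tags in HTML. The exact list is an assumption and may not match the defaults.

```xml
<SelfContainedHtml>
  <!-- Illustrative list of tags that never wrap content -->
  <Tag name="br" />
  <Tag name="hr" />
  <Tag name="img" />
</SelfContainedHtml>
```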
Close before next HTML
The <CloseBeforeNextHtml> node identifies tags that should always be closed before the next occurrence of the same tag. For example, an <li> tag should be closed before the next opening <li> tag, so "li" should be identified in the <CloseBeforeNextHtml> node. Additionally, for tags that require a parent, the valid parent tags should be defined so that the HTML Sanitizer can ensure the tag is properly nested (that is, within the context of an allowed parent tag) and can recognize when the tag is validly nested within a new parent tag. (For example, <ul><li><ul><li></li></ul></li></ul> is valid even though an <li> opening tag exists in an open <li> tag, because the inner <li> tag is within a valid parent, <ul>.)
Each tag that needs to be closed before the next occurrence of the same tag should be defined via the name attribute of a new <Tag> node in the <CloseBeforeNextHtml> node. Additionally, if the tag must exist in a valid parent tag, each parent tag should be defined via the name attribute of a <ParentTag> node within the corresponding <Tag> node.
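The <li> case described above could be sketched as follows. The choice of <ol> as a second parent and the <p> entry are assumptions added for illustration.

```xml
<CloseBeforeNextHtml>
  <!-- An open <li> is closed before the next <li>, unless the new <li>
       sits inside a new <ul> or <ol> parent -->
  <Tag name="li">
    <ParentTag name="ul" />
    <ParentTag name="ol" />
  </Tag>
  <!-- No required parent: an open <p> is closed before the next <p> -->
  <Tag name="p" />
</CloseBeforeNextHtml>
```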
Remove contents with tags
By default, when the HTML Sanitizer removes a tag, it keeps the tag's contents. To also remove the contents (that is, everything between the removed opening and closing tags), include the tag in the <RemoveContentsWithTags> node. Each tag should be identified via the name attribute of a new <Tag> node in the <RemoveContentsWithTags> node.
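As a sketch, tags whose contents are not meant to be displayed as text are natural candidates for this node. The tag names below are illustrative assumptions.

```xml
<RemoveContentsWithTags>
  <!-- When these tags are removed, everything between their opening and
       closing tags is removed as well -->
  <Tag name="script" />
  <Tag name="style" />
</RemoveContentsWithTags>
```

Without entries like these, removing a disallowed <script> tag would leave its script text behind as visible content.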
Configuring HTML replacements
To automate common HTML cleanup tasks, the HTML Sanitizer supports processing simple replacement rules. As an example, HTML replacement rules are used by default to adjust MsoNormal and MsoListParagraph style rule usage in markup generated from Outlook/Word and to ensure that all <img /> tags in user-entered content have alt attributes. HTML replacements are defined within the <HtmlReplacements> node of the <HtmlSanitization> node in the communityserver.config file in the following format:
<HtmlSanitization>
  <HtmlReplacements>
    <HtmlReplacement>
      <Match tag="..." />
      <Replace tag="..." />
    </HtmlReplacement>
    <HtmlReplacement>
      <Match tag="...">
        <Attribute name="..." contains="..." />
      </Match>
      <Replace tag="...">
        <Attribute name="..." replace="..." with="..." />
      </Replace>
    </HtmlReplacement>
  </HtmlReplacements>
</HtmlSanitization>
Each replacement consists of a tag to match, optional attributes to match, an optional replacement tag, and optional attribute value replacements. The format above shows both the simplest form (tag only) and the full form (tag and attributes). For example, to adjust <p class="MsoNormal">...</p> to <div>...</div>, the following rule could be defined in the <HtmlReplacements> node:
<HtmlReplacement>
  <Match tag="p">
    <Attribute name="class" contains="MsoNormal" />
  </Match>
  <Replace tag="div">
    <Attribute name="class" replace="MsoNormal" with="" />
  </Replace>
</HtmlReplacement>
This rule matches <p> tags containing "MsoNormal" in their class attribute and adjusts the tag name to "div", removing "MsoNormal" from the class attribute.
Notes on HTML replacements
- The HTML Sanitizer automatically removes any empty attributes after rules are processed.
- To add content to an attribute or add an attribute that might not already exist, set the replace attribute on the <Attribute> node to an empty value.
- If multiple <Attribute> nodes in a single <Replace> node affect the same attribute name, only the last <Attribute> node with that name will be processed.
- The replacement tag is optional. If the tag attribute is not defined on the <Replace> node, the tag name will not be adjusted.
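To illustrate the notes above, the following sketch omits the tag attribute on the <Replace> node so that only the class attribute is adjusted. The rule is a hypothetical example built from the documented format, not a copy of the default MsoListParagraph rule.

```xml
<HtmlReplacement>
  <Match tag="p">
    <Attribute name="class" contains="MsoListParagraph" />
  </Match>
  <!-- No tag attribute: the tag name stays "p" -->
  <Replace>
    <!-- Strips "MsoListParagraph" from the class value -->
    <Attribute name="class" replace="MsoListParagraph" with="" />
  </Replace>
</HtmlReplacement>
```

Under this sketch, <p class="MsoListParagraph">...</p> would be expected to become <p>...</p>: the tag name is unchanged, and the class attribute, once emptied, is removed automatically.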