4.10. Filter HTML/URIs That May Be Re-presented

One special case where cross-site malicious content must be prevented is a web application designed to accept HTML or XHTML from one user and then send it on to other users (see Section 6.13 for more information on cross-site malicious content). The following subsections discuss filtering this specific kind of input, since handling it is such a common requirement.

4.10.1. Remove or Forbid Some HTML Data

It's safest to remove all possible (X)HTML tags so they cannot affect anything, and this is relatively easy to do. As noted above, you should already be identifying the list of legal characters, and rejecting or removing those characters that aren't in the list. In this filter, simply don't include the following characters in the list of legal characters: ``<'', ``>'', and ``&'' (and if they're used in attributes, the double-quote character ``"''). If browsers only operated according to the HTML specifications, the ``>'' wouldn't need to be removed, but in practice it must be removed. This is because some browsers assume that the author of the page really meant to put in an opening "<" and ``helpfully'' insert one - attackers can exploit this behavior and use the ">" to create an undesired "<".

Usually the character set for transmitting HTML is ISO-8859-1 (even when sending international text), so the filter should also omit most control characters (linefeed and tab are usually okay) and characters with their high-order bit set.

One problem with this approach is that it can really surprise users, especially those entering international text if all international text is quietly removed. If the invalid characters are quietly removed without warning, that data will be irrevocably lost and cannot be reconstructed later. One alternative is forbidding such characters and sending error messages back to users who attempt to use them. This at least warns users, but doesn't give them the functionality they were looking for. Other alternatives are encoding this data or validating this data, which are discussed next.
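
The two options above (quietly removing characters versus forbidding them) can be sketched as follows. This is only an illustration: the whitelist chosen here (printable ASCII plus linefeed and tab, minus the HTML-critical characters) and the function names `strip_illegal` and `reject_illegal` are assumptions, not part of the original text.

```python
import string

# Assumed whitelist: printable ASCII plus linefeed and tab, minus the
# characters that are special in HTML. Adjust this for your application.
HTML_SPECIAL = set('<>&"')
LEGAL_CHARS = (
    set(string.ascii_letters + string.digits + string.punctuation + " \n\t")
    - HTML_SPECIAL
)


def strip_illegal(text: str) -> str:
    """Quietly drop every character that is not in the whitelist."""
    return "".join(ch for ch in text if ch in LEGAL_CHARS)


def reject_illegal(text: str) -> str:
    """Return text unchanged, or raise ValueError naming the bad characters."""
    bad = sorted({ch for ch in text if ch not in LEGAL_CHARS})
    if bad:
        raise ValueError(f"illegal characters in input: {bad!r}")
    return text
```

Note that `strip_illegal` loses data silently (control characters and characters with the high-order bit set simply vanish), while `reject_illegal` lets the application send an error message back to the user.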

4.10.2. Encoding HTML Data

An alternative that is nearly as safe is to transform the critical characters so they won't have their usual meaning in HTML. This can be done by translating all "<" into "&lt;", ">" into "&gt;", and "&" into "&amp;". Arbitrary international characters can be encoded in Latin-1 using the format "&#value;" - do not forget the ending semicolon. Encoding the international characters means you must know what the input encoding was, of course.

One possible danger here is that if these encodings are accidentally interpreted twice, they will become a vulnerability. However, this approach at least permits later users to see the "intent" of the input.
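
A minimal sketch of this encoding, using Python's standard `html` module, might look like the following. It assumes the input has already been decoded into a Python string (so the input encoding is known); the function name `encode_for_html` is hypothetical.

```python
import html


def encode_for_html(text: str) -> str:
    """Encode HTML-critical characters and all non-ASCII characters."""
    # html.escape turns & < > into entities; quote=True also encodes " and '.
    escaped = html.escape(text, quote=True)
    # Remaining non-ASCII characters become numeric references ("&#value;").
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# One reading of the double-interpretation danger: text that was encoded
# twice looks harmless after one decoding pass, but a second pass
# recreates live markup.
once = html.unescape("&amp;lt;script&amp;gt;")
twice = html.unescape(once)
```

Here `once` is still the harmless text "&lt;script&gt;", while `twice` is an actual "<script>" tag, which is why each value should be decoded exactly as many times as it was encoded.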

4.10.3. Validating HTML Data

Some applications, to work at all, must accept HTML from third parties and send them on to their users. Beware - you are treading dangerous ground at this point; be sure that you really want to do this. Even the idea of accepting HTML from arbitrary places is controversial among some security practitioners, because it is extremely difficult to get it right.

However, if your application must accept HTML, and you believe that it's worth the risk, at least identify a list of ``safe'' HTML commands and only permit those commands.

Here is a minimal set of safe HTML tags that might be useful for applications (such as guestbooks) that support short comments: <p> (paragraph), <b> (bold), <i> (italics), <em> (emphasis), <strong> (strong emphasis), <pre> (preformatted text), <br> (forced line break - note it doesn't require a closing tag), as well as all their ending tags.

Not only do you need to ensure that only a small set of ``safe'' HTML commands are accepted, you also need to ensure that they are properly nested and closed. In XML, this is termed ``well-formed'' data. A few exceptions could be made if you're accepting standard HTML (e.g., supporting an implied </p> where not provided before a <p> would be fine), but trying to accept HTML in its full generality is not needed for most applications. Indeed, if you're trying to stick to XHTML (instead of HTML), then well-formedness is a requirement. Also, HTML tags are case-insensitive; tags can be upper case, lower case, or a mixture. However, if you intend to accept XHTML then you need to require all tags to be in lower case (XML is case-sensitive; XHTML uses XML and requires the tags to be in lower case).

Here are a few random tips about doing this. Usually you should design whatever surrounds the HTML text and the set of permitted tags so that the contributed text cannot be misinterpreted as text from the ``main'' site (to prevent forgeries). Don't accept any attributes unless you've checked the attribute type and its value; there are many attributes that support things such as Javascript that can cause trouble for your users. You'll notice that in the above list I didn't include any attributes at all, which is certainly the safest course. You should probably give a warning message if an unsafe tag is used, but if that's not practical, encoding the critical characters (e.g., "<" becomes "&lt;") prevents data loss while simultaneously keeping the users safe.
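
The checks described in this section (a small list of safe tags, no attributes, proper nesting and closing) can be sketched as below. This is an illustration under stated assumptions rather than a complete validator: the name `is_safe_fragment` is hypothetical, tag names are compared case-insensitively as in HTML (an XHTML checker would have to insist on lower case), an implied </p> is not supported, a closing </br> is rejected, and character references such as "&amp;" in the text are not checked at all.

```python
import re

SAFE_TAGS = {"p", "b", "i", "em", "strong", "pre", "br"}
VOID_TAGS = {"br"}  # tags that take no closing tag
# Matches only bare tags: "<name>" or "</name>", with no attributes at all.
BARE_TAG = re.compile(r"<(/?)([A-Za-z]+)>")


def is_safe_fragment(fragment: str) -> bool:
    """Return True if fragment uses only safe, attribute-free, nested tags."""
    open_tags = []
    pos = 0
    for match in BARE_TAG.finditer(fragment):
        # Any "<" or ">" outside a recognised bare tag is rejected outright.
        if re.search(r"[<>]", fragment[pos:match.start()]):
            return False
        closing = bool(match.group(1))
        name = match.group(2).lower()
        if name not in SAFE_TAGS:
            return False
        if name in VOID_TAGS:
            if closing:
                return False
        elif closing:
            # A closing tag must match the most recently opened tag.
            if not open_tags or open_tags.pop() != name:
                return False
        else:
            open_tags.append(name)
        pos = match.end()
    if re.search(r"[<>]", fragment[pos:]):
        return False
    return not open_tags  # every opened tag must have been closed
```

Because a tag carrying any attribute does not match the bare-tag pattern, its "<" is treated as a stray character and the whole fragment is rejected.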

4.10.4. Validating Hypertext Links (URIs/URLs)

Careful readers will notice that I did not include the hypertext link tag <a> as a safe tag in HTML. Clearly, you could add <a href="safe URI"> (hypertext link) to the safe list (not permitting any other attributes unless you've checked their contents). If your application requires it, then do so. However, permitting third parties to create links is much less safe, because defining a ``safe URI''[1] turns out to be very difficult. Many browsers accept all sorts of URIs which may be dangerous to the user. This section discusses how to validate URIs from third parties for re-presenting to others, including URIs incorporated into HTML.

First, let's look briefly at URI syntax (as defined by various specifications). URIs can be either ``absolute'' or ``relative''. The syntax of an absolute URI looks like this:
scheme://authority[path][?query][#fragment]
A URI starts with a scheme name (such as ``http''), the characters ``://'', the authority (such as ``www.dwheeler.com''), a path (which looks like a directory or file name), a question mark followed by a query, and a hash (``#'') followed by a fragment identifier. The square brackets surround optional portions - e.g., many URIs don't actually include the query or fragment. Some schemes may not permit some of the data (e.g., paths, queries, or fragments), and many schemes have additional requirements unique to them. Many schemes permit the ``authority'' field to identify optional usernames, passwords, and ports, using this syntax for the ``authority'' section:
 [username[:password]@]host[:portnumber]
The ``host'' can either be a name (``www.dwheeler.com'') or an IPv4 numeric address (127.0.0.1). A ``relative'' URI references one object relative to the ``current'' one, and its syntax looks a lot like a filename:
path[?query][#fragment]
There are a limited number of characters permitted in most of the URI, so to get around this problem, other 8-bit characters may be ``URL encoded'' as %hh (where hh is the hexadecimal value of the 8-bit character). For more detailed information on valid URIs, see IETF RFC 2396 and its related specifications.
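
To make the component names concrete, here is a short illustration using Python's standard `urllib.parse` module. It only shows how a URI splits into the parts named above; it is not a validator, and the example URIs (including the user name and password) are made up for this sketch.

```python
from urllib.parse import quote, unquote, urlsplit

# An absolute URI with every optional part present.
parts = urlsplit(
    "http://user:secret@www.dwheeler.com:8080/docs/index.html?q=1#top"
)
absolute = {
    "scheme": parts.scheme,      # "http"
    "username": parts.username,  # "user"
    "password": parts.password,  # "secret"
    "host": parts.hostname,      # "www.dwheeler.com"
    "port": parts.port,          # 8080
    "path": parts.path,          # "/docs/index.html"
    "query": parts.query,        # "q=1"
    "fragment": parts.fragment,  # "top"
}

# A relative URI has no scheme and no authority.
relative = urlsplit("docs/page.html?x=1#sec")

# URL encoding: an 8-bit character becomes %hh.
encoded = quote("caf\u00e9", encoding="latin-1")  # "caf%E9"
decoded = unquote("%41")                          # "A"
```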

Now that we've looked at the syntax of URIs, let's examine the risks of each part:

Of course, there is a trade-off with simplicity as well. Simple patterns are easier to understand, but they aren't very refined (so they tend to be too permissive or too restrictive, even more than a refined pattern). Complex patterns can be more exact, but they are more likely to have errors, require more performance to use, and can be hard to implement in some circumstances.

Here's my suggestion for a ``simple mostly safe'' URI pattern which is very simple and can be implemented ``by hand'' or through a regular expression; permit the following pattern:
(http|ftp|https)://[-A-Za-z0-9._/]+

This pattern doesn't permit many potentially dangerous capabilities such as queries, fragments, ports, or relative URIs, and it only permits a few schemes. It prevents the use of the ``%'' character, which is used in URL escapes and can be used to specify characters that the server may not be prepared to handle. Since it doesn't permit either ``:'' or URL escapes, it doesn't permit specifying port numbers, and even using it to redirect to a more dangerous URI would be difficult (due to the lack of the escape character). It also prevents the use of a number of other characters; again, many poorly-designed web applications can't handle a number of ``unexpected'' characters.
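
As a sketch, the ``simple mostly safe'' pattern can be applied with Python's `re` module as shown below. The names `SIMPLE_MOSTLY_SAFE` and `is_simple_mostly_safe` are hypothetical; the pattern itself is copied from the text above.

```python
import re

SIMPLE_MOSTLY_SAFE = re.compile(r"(http|ftp|https)://[-A-Za-z0-9._/]+")


def is_simple_mostly_safe(uri: str) -> bool:
    """True only if the entire string matches the pattern."""
    return SIMPLE_MOSTLY_SAFE.fullmatch(uri) is not None
```

With this check, "http://www.dwheeler.com/secure-programs/" is accepted, while URIs containing a port ("http://example.com:8080/"), a query ("http://example.com/a?b=c"), a URL escape ("http://example.com/%2e%2e"), another scheme ("javascript:alert(1)"), or no scheme at all ("index.html") are rejected.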

Even this ``mostly safe'' URI permits a number of questionable URIs, such as subdirectories (via ``/'') and attempts to move up directories (via ``..''); illegal queries of this kind should be caught by the server. It permits some illegal host identifiers (e.g., ``20.20''), though I know of no case where this would be a security weakness. Some web applications treat subdirectories as query data (or worse, as command data); this is hard to prevent in general since finding ``all poorly designed web applications'' is hopeless. You could prevent the use of all paths, but this would make it impossible to reference most Internet information. The pattern also allows references to local server information (through patterns such as "http:///", "http://localhost/", and "http://127.0.0.1") and access to servers on an internal network; here you'll have to depend on the servers correctly interpreting the resulting HTTP GET request as solely a request for information and not a request for an action, as recommended in Section 4.11. Since query forms aren't permitted by this pattern, in many environments this should be sufficient.

Unfortunately, the ``mostly safe'' pattern also prevents a number of quite legitimate and useful URIs. For example, many web sites use the ``?'' character to identify specific documents (e.g., articles on a news site). The ``#'' character is useful for specifying specific sections of a document, and permitting relative URIs can be handy in a discussion. Various permitted characters and URL escapes aren't included in the ``mostly safe'' pattern. For example, without permitting URL escapes, it's difficult to access many non-English pages. If you truly need such functionality, then you can use less safe patterns, realizing that you're exposing your users to higher risk while giving your users greater functionality.

One pattern that permits queries, but at least limits the protocols and ports used is the following, which I'll call the ``simple somewhat safe pattern'':
 (http|ftp|https)://[-A-Za-z0-9._]+(\/([A-Za-z0-9\-\_\.\!\~\*\'\(\)\%\?]+))*/?
This pattern actually isn't very smart, since it permits illegal escapes, multiple queries, queries in ftp, and so on. It does have the advantage of being relatively simple.
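
The following sketch applies the ``simple somewhat safe pattern'' with Python's `re` module and exercises the weaknesses just listed. The names `SIMPLE_SOMEWHAT_SAFE` and `is_simple_somewhat_safe` are hypothetical; the pattern is copied from the text above.

```python
import re

SIMPLE_SOMEWHAT_SAFE = re.compile(
    r"(http|ftp|https)://[-A-Za-z0-9._]+"
    r"(\/([A-Za-z0-9\-\_\.\!\~\*\'\(\)\%\?]+))*/?"
)


def is_simple_somewhat_safe(uri: str) -> bool:
    """True only if the entire string matches the pattern."""
    return SIMPLE_SOMEWHAT_SAFE.fullmatch(uri) is not None


checks = {
    "http://example.com/article?42": is_simple_somewhat_safe(
        "http://example.com/article?42"),      # simple query: accepted
    "http://example.com/a%zz": is_simple_somewhat_safe(
        "http://example.com/a%zz"),            # illegal escape: accepted
    "http://example.com/a?b?c": is_simple_somewhat_safe(
        "http://example.com/a?b?c"),           # multiple queries: accepted
    "ftp://example.com/file?1": is_simple_somewhat_safe(
        "ftp://example.com/file?1"),           # query in ftp: accepted
    "http://example.com:8080/": is_simple_somewhat_safe(
        "http://example.com:8080/"),           # port: rejected
    "http://example.com/news?id=42": is_simple_somewhat_safe(
        "http://example.com/news?id=42"),      # "=" not in the class: rejected
}
```

One thing this exercise brings out is that the character class, as written here, contains ``?'' but neither ``='' nor ``&'', so queries of the form name=value are rejected even though bare queries such as "?42" are accepted.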

Creating a ``somewhat safe'' pattern that really limits URIs to legal values is quite difficult. Here's my current attempt to do so, which I call the ``sophisticated somewhat safe pattern'', expressed in a form where whitespace is ignored and comments are introduced with "#":
 (
 (
  # Handle http, https, and relative URIs:
  ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
    ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
   (\?(                                                              # query:
       (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
        ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+
        (\&([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
         ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*)
       |
       (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+  # isindex
       )
   ))?
   (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )|
 # Handle ftp:
 (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
  (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )
 )

Even the sophisticated pattern shown above doesn't forbid all illegal URIs. For example, again, "20.20" isn't a legal domain name, but it's allowed by the pattern; however, to my knowledge this shouldn't cause any security problems. The sophisticated pattern forbids URL escapes that represent control characters (e.g., %00 through %1F) - the smallest permitted escape value is %20 (ASCII space). Forbidding control characters prevents some trouble, but it's also limiting; change "2-9" to "0-9" everywhere if you need to support sending all control characters to arbitrary web applications. This pattern does permit all other URL escape values in paths, which is useful for international characters but could cause trouble for a few systems which can't handle it. The pattern at least prevents spaces, linefeeds, double-quotes, and other dangerous characters from being in the URI, which prevents other kinds of attacks when incorporating the URI into a generated document. Note that the pattern permits ``+'' in many places, since in practice the plus is often used to replace the space character in queries and fragments.
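
The ``whitespace is ignored, comments start with #'' form used above corresponds to the `re.VERBOSE` flag in Python, so the pattern can be pasted in unchanged. The sketch below does exactly that; it assumes Python's regular expression dialect treats every construct in the pattern the same way as the dialect it was written for, and the names `SOPHISTICATED_SOMEWHAT_SAFE` and `is_somewhat_safe` are hypothetical.

```python
import re

SOPHISTICATED_SOMEWHAT_SAFE = re.compile(r"""
 (
 (
  # Handle http, https, and relative URIs:
  ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
    ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
   (\?(                                                              # query:
       (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
        ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+
        (\&([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
         ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*)
       |
       (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+  # isindex
       )
   ))?
   (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )|
 # Handle ftp:
 (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
  (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )
 )
""", re.VERBOSE)


def is_somewhat_safe(uri: str) -> bool:
    """True only if the entire string matches the pattern."""
    return SOPHISTICATED_SOMEWHAT_SAFE.fullmatch(uri) is not None
```

With this check, "http://example.com/a?x=1&y=2#top" and "http://example.com/%7Euser/" are accepted, while a control-character escape ("http://example.com/%0A"), a port ("http://example.com:8080/"), a query on an ftp URI, and "javascript:alert(1)" are all rejected.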

Unfortunately, as noted above, there are attacks which can work through any technique that permits query data, and there don't seem to be really good defenses for them once you permit queries. So, you could strip out the ability to use query data from the pattern above, but permit the other forms, producing a ``sophisticated mostly safe'' pattern:
 (
 (
  # Handle http, https, and relative URIs:
  ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
    ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
   (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )|
 # Handle ftp:
 (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
  ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
  (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )
 )
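
Because the same sub-expressions recur many times in this pattern, it can be easier to review when assembled from named pieces. The following is a restructured sketch that is intended to accept the same strings as the ``sophisticated mostly safe'' pattern; the piece names (`ESC`, `PCHAR`, `FCHAR`, `LABEL`, `HOST`, `PATH`, `FRAGMENT`) and the function name are hypothetical, and the equivalence should be verified before relying on it.

```python
import re

ESC = r"%[2-9A-Fa-f][0-9A-Fa-f]"                   # URL escape, %20 and up
PCHAR = rf"(?:[A-Za-z0-9\-_.!~*'()]|{ESC})"        # path character
FCHAR = rf"(?:[A-Za-z0-9\-_.!~*'()+]|{ESC})"       # fragment character
LABEL = r"[A-Za-z0-9][A-Za-z0-9\-]*"               # one host name label
HOST = rf"{LABEL}(?:\.{LABEL})*\.?"
PATH = rf"(?:/{PCHAR}+)*/?"
FRAGMENT = rf"(?:\#{FCHAR}+)?"

SOPHISTICATED_MOSTLY_SAFE = re.compile(
    # http, https, and relative URIs (no query part):
    rf"(?:(?:https?://{HOST}|{PCHAR}+)?{PATH}{FRAGMENT}"
    # ftp:
    rf"|ftp://{HOST}{PATH}{FRAGMENT})"
)


def is_sophisticated_mostly_safe(uri: str) -> bool:
    """True only if the entire string matches the pattern."""
    return SOPHISTICATED_MOSTLY_SAFE.fullmatch(uri) is not None
```

Dropping the query piece is what separates this pattern from the ``somewhat safe'' one: "http://example.com/a?x=1" is rejected here, while fragments and relative URIs such as "../other/page.html" are still accepted.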

As far as I can tell, as long as these patterns are only used to check hypertext anchors selected by the user (the "<a>" tag) this approach also prevents the insertion of ``web bugs''. Web bugs are simply text that allows someone other than the originating web server of the main page to track information such as who read the content and when they read it - see Section 7.6 for more information. This isn't true if you use the <img> (image) tag with the same checking rules - the image tag is loaded immediately, permitting someone to add a ``web bug''. Once again, this presumes that you're not permitting any attributes; many attributes can be quite dangerous and pierce the security you're trying to provide.

Please note that all of these patterns require that the entire URI match the pattern. An unfortunate fact of these patterns is that they limit the allowable patterns in a way that forbids many useful ones (e.g., they prevent the use of new URI schemes). Also, none of them can prevent the very real problem that some web sites perform more than queries when presented with a query - and some of these web sites are internal to an organization. As a result, no URI can really be safe until there are no web sites that accept GET queries as an action (see Section 4.11). For more information about legal URLs/URIs, see IETF RFC 2396; domain name syntax is further discussed in IETF RFC 1034.
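
The whole-URI requirement is easy to get wrong in practice, because many regular expression functions succeed on a matching prefix. The sketch below shows the difference in Python, using the ``simple mostly safe'' pattern and a made-up hostile input.

```python
import re

PATTERN = re.compile(r"(http|ftp|https)://[-A-Za-z0-9._/]+")
hostile = 'http://example.com/"><script>alert(1)</script>'

# match() anchors only at the start, so a safe-looking prefix is enough.
prefix_only = PATTERN.match(hostile)
# fullmatch() succeeds only if the pattern consumes the entire input.
whole_string = PATTERN.fullmatch(hostile)

# In Python, "$" also matches just before a trailing newline, so appending
# "$" to the pattern is not a substitute for fullmatch().
dollar_anchor = re.match(PATTERN.pattern + "$", "http://example.com\n")
```

Here `prefix_only` is a successful match covering just "http://example.com/", `whole_string` is `None`, and `dollar_anchor` is a successful match even though the input ends in a linefeed.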

4.10.5. Other HTML tags

You might even consider supporting more HTML tags. Obvious next choices are the list-oriented tags, such as <ol> (ordered list), <ul> (unordered list), and <li> (list item). However, after a certain point you're really permitting full publishing (in which case you need to trust the provider or perform more serious checking than will be described here). Even more importantly, every new functionality you add creates an opportunity for error (and exploit).

One example would be permitting the <img> (image) tag with the same URI pattern. It turns out this is substantially less safe, because this permits third parties to insert ``web bugs'' into the document, identifying who read the document and when. See Section 7.6 for more information on web bugs.

4.10.6. Related Issues

Web applications should also explicitly specify the character set (usually ISO-8859-1), and not permit other characters, if data from untrusted users is being used. See Section 8.5 for more information.

Since filtering this kind of input is easy to get wrong, other alternatives have been discussed as well. One option is to ask users to use a different language, much simpler than HTML, that you've designed - and you give that language very limited functionality. Another approach is parsing the HTML into some internal ``safe'' format, and then translating that safe format back to HTML.
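
As an illustration of the first alternative, here is a sketch of a deliberately tiny markup language translated to HTML. The language itself is entirely hypothetical (blank lines separate paragraphs, *text* is bold, _text_ is italic), as is the function name `render_simple_markup`. The key design choice is that all input is HTML-encoded first, so the only tags in the output are the ones the translator itself emits.

```python
import html
import re


def render_simple_markup(source: str) -> str:
    """Translate the hypothetical mini-language to HTML."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", source.strip()):
        # Neutralise every HTML-critical character before adding any tags.
        text = html.escape(block, quote=True)
        # *bold*: applied first, while the text contains no tags at all.
        text = re.sub(r"\*([^*\n]+)\*", r"<b>\1</b>", text)
        # _italic_: may not span "<" or ">", so it cannot cross a <b> tag
        # boundary and produce badly nested output.
        text = re.sub(r"_([^_<>\n]+)_", r"<i>\1</i>", text)
        paragraphs.append(f"<p>{text}</p>")
    return "\n".join(paragraphs)
```

For example, the input "Hello *world*" followed by a blank line and "<script>alert(1)</script>" yields one paragraph containing a bold word and a second paragraph in which the script tag appears only as encoded text.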

Filtering can be done during input, output, or both. The CERT recommends filtering data during the output process, just before it is rendered as part of the dynamic page. This is because, if it is done correctly, this approach ensures that all dynamic content is filtered. The CERT believes that filtering on the input side is less effective because dynamic content can be entered into a web site's database(s) via methods other than HTTP, and in this case, the web server may never see the data as part of the input process. Unless the filtering is implemented in all places where dynamic data is entered, the data elements may still remain tainted.

However, I don't agree with CERT on this point for all cases. The problem is that it's just as easy to forget to filter all the output as the input, and allowing ``tainted'' input into your system is a disaster waiting to happen anyway. A secure program has to filter its inputs anyway, so it's sometimes better to include all of these checks as part of the input filtering (so that maintainers can see what the rules really are). And finally, in some secure programs there are many different program locations that may output a value, but only a very few ways and locations where data can be input into it; in such cases filtering on input may be a better idea.

Notes

[1]

Technically, a hypertext link can be any ``uniform resource identifier'' (URI). The term "Uniform Resource Locator" (URL) refers to the subset of URIs that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. Many people use the term ``URL'' as synonymous with ``URI'', since URLs are the most common kind of URI. For example, the encoding used in URIs is actually called ``URL encoding''.