The Problem with Markup Languages

Chris Shiflett has a post today, “Allowing HTML and Preventing XSS.” The problem is how to allow users to format their contributed content without introducing security vulnerabilities. The answer is usually some sort of markup language or filtering and sanitization of HTML.

BBCode was designed for this purpose. There is no actual standard, but the core syntax seems fairly uniform. It’s good for those used to forums, where it seems to be the norm.

HTML markup is nice because it is a standard, even if varying subsets are supported. Learning a little HTML isn’t going to hurt anyone, at least for the next 20 years or so. The problem is that HTML was never intended to be hand edited. The syntax is not the most inviting, and different HTML-like markup languages handle whitespace differently than the HTML standard.

Wiki markup syntaxes were designed to be human friendly. The main problem I have with wiki syntax is that there is no standard. It seems like every wiki has a different way to formulate a link, for example. I guess there is some progress with Wiki Creole, but I still have a bad taste in my mouth.

The other problem I have with wiki markup is that I find it to be non-deterministic. When I edit any given wiki and try to use more than basic formatting, I never know what I am going to get. Most of the markup processing engines for these wikis are impenetrable morasses of regular expressions. It can be hard to gauge how one replacement rule interacts with the next. Are you really sure they are secure?

Speaking of impenetrable morasses of regular expressions, have you ever looked at WordPress’s input path? I’m sure everyone with a WordPress blog who likes to blog about PHP code knows that it is a code eater. I’ve been particularly disappointed with WordPress in this area. Most of the “code formatting” plugins still have problems protecting code from WordPress’s heavy hand.

But the WordPress preg_replace gauntlet doesn’t just mangle code. I have a post which has been sitting in draft mode for several weeks because I can’t figure out how to give it the proper markup. WordPress is somehow taking my perfectly balanced input markup and producing “unbalanced” output markup. I haven’t yet tracked down the problem to either submit a fix or to do a good bug report. Frankly, I’m not looking forward to trudging through all those regular expressions.

In Chris’ post, he takes the regular expression approach. Folks in the comments have pointed out a few problems with his approach, including the problem of interleaved tags. If you can’t tell by now, I am not a fan of the regular expression gauntlet approach to markup languages. I prefer a defined syntax and a traditional computer science style parser (which may use regular expressions).
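To make that preference concrete, here is a minimal sketch in Python of what I mean by a defined syntax plus a parser. It is not Chris’ code and not any real library’s API: the tiny BBCode-like subset, the ALLOWED whitelist, and the render function are all assumptions made up for illustration. The regular expression only finds tokens; a stack decides what is open and what gets closed, so interleaved input such as [b]bold [i]both[/b] italic[/i] still produces balanced output.

```python
import html
import re

# Hypothetical whitelist (an assumption for this sketch):
# BBCode-style tag name -> HTML element name.
ALLOWED = {"b": "strong", "i": "em", "u": "u"}

# The regular expression is used only as a tokenizer. It makes no
# decisions about nesting or balance.
TOKEN = re.compile(r"\[(/?)([a-z]+)\]")


def render(source):
    """Render a tiny BBCode-like subset to balanced, escaped HTML."""
    out = []
    stack = []  # names of the tags that are currently open
    pos = 0
    for match in TOKEN.finditer(source):
        closing = match.group(1) == "/"
        name = match.group(2)
        # Everything between tokens is plain text and is always escaped.
        out.append(html.escape(source[pos:match.start()]))
        pos = match.end()
        if name not in ALLOWED:
            # Unknown tag: show it literally instead of guessing.
            out.append(html.escape(match.group(0)))
        elif not closing:
            stack.append(name)
            out.append("<%s>" % ALLOWED[name])
        elif name in stack:
            # Close everything opened after `name`, then `name` itself,
            # so interleaved tags cannot unbalance the output.
            while True:
                top = stack.pop()
                out.append("</%s>" % ALLOWED[top])
                if top == name:
                    break
        else:
            # Closing tag with nothing to close: keep it as literal text.
            out.append(html.escape(match.group(0)))
    out.append(html.escape(source[pos:]))
    # Close anything the user left open.
    while stack:
        out.append("</%s>" % ALLOWED[stack.pop()])
    return "".join(out)
```

With this sketch the stray [/i] in the interleaved example comes out as literal text rather than as a dangling tag, and raw HTML such as a script element is escaped instead of passed through. A real implementation would need links, attributes, and a decision about whether to reopen tags after a forced close, but the output stays predictable because every rule lives in one place.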

The other must-have is a preview option. With so much variation in markup languages, not having a preview leaves the user to play Russian roulette with their submitted content. I’ve talked about that before in the usability of input filtering. This is another area where WordPress leaves the user high and dry.

The complex input path in WordPress combined with its reliance on global variables seems to leave it unable to do an in-page preview. The admin area preview is an IFRAME so that it launches a separate request. The various live preview plugins are JavaScript based and don’t work when JavaScript is disabled. They also don’t pass the input through the same input path that WordPress uses, so they are not a true preview.
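What I would call a true preview is easy to state in code, even if it is hard to retrofit. The following is only a sketch under assumptions: render_comment, preview, and publish are hypothetical names, and the filter itself (escape everything, wrap blank-line-separated blocks in paragraphs) is a stand-in for whatever the real input path does. The point is that preview and publish call the very same function.

```python
import html


def render_comment(raw):
    """The single input path (a stand-in filter for this sketch):
    escape everything, then wrap blank-line-separated blocks in <p>."""
    blocks = [block.strip() for block in raw.split("\n\n")]
    return "".join("<p>%s</p>" % html.escape(b) for b in blocks if b)


def preview(raw):
    # The preview shows exactly what render_comment produces.
    return render_comment(raw)


def publish(raw, store):
    # Publishing stores the output of the same function, so the
    # preview and the published result cannot drift apart.
    rendered = render_comment(raw)
    store.append(rendered)
    return rendered
```

Because the function takes its input as an argument and returns a string, with no global state involved, it can be called from a preview request and from the save path alike.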

I don’t mean for this to be a WordPress rant; on the whole, I like WordPress. Rather, I just wanted to point out how hard it can be to do good input filtering that is safe, reliable, deterministic, and usable.
