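Easily escaping dangerous tags with BeautifulSoup

Note that this has side effects: parsing and re-serializing the markup also closes any unclosed tags and converts tag names to lowercase.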

# Python 2 with BeautifulSoup 3 (the `BeautifulSoup` package).
from BeautifulSoup import BeautifulSoup
import cgi

# Tags whose markup is escaped instead of being passed through.
dangerous_tags = [
    "script", "applet", "object", "embed", "img",
    "form", "input", "select", "textarea", "button",
]

def escape(xmlstr):
    """Escape dangerous tags.
    >>> escape(u'<div>snip <script>alert("<b>BAD</b>")</script> snip</div>')
    u'<div>snip &lt;script&gt;alert("&lt;b&gt;BAD&lt;/b&gt;")&lt;/script&gt; snip</div>'
    """
    xml = BeautifulSoup(xmlstr)
    # Replace each dangerous element with its own source text, HTML-escaped.
    for node in xml.findAll(dangerous_tags):
        node.replaceWith(cgi.escape(unicode(node)))
    return unicode(xml)
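
The function above targets Python 2 and BeautifulSoup 3. For reference, here is a rough sketch of the same idea on Python 3 with Beautiful Soup 4 (the `bs4` package). The names `DANGEROUS_TAGS` and `escape_dangerous_tags` are mine, not from the original, and the choice of the `html.parser` backend is an assumption. `cgi.escape` no longer exists in current Python, and as far as I can tell it is not needed here: bs4's default output formatter escapes `&`, `<`, and `>` in text nodes when the tree is serialized, so calling `html.escape` as well would double-escape.

```python
from bs4 import BeautifulSoup

# Same tag list as the original snippet (the constant name is hypothetical).
DANGEROUS_TAGS = [
    "script", "applet", "object", "embed", "img",
    "form", "input", "select", "textarea", "button",
]


def escape_dangerous_tags(markup):
    """Return markup with dangerous elements turned into escaped text."""
    # Assumption: the stdlib-backed "html.parser" backend is good enough here.
    soup = BeautifulSoup(markup, "html.parser")
    # find_all() returns a fully built list, so replacing nodes while
    # looping over it is safe.
    for node in soup.find_all(DANGEROUS_TAGS):
        # Swap the element for a plain string holding its own source text.
        # The string is entity-escaped when the tree is serialized below.
        node.replace_with(str(node))
    return str(soup)


if __name__ == "__main__":
    print(escape_dangerous_tags(
        '<div>snip <script>alert("<b>BAD</b>")</script> snip</div>'
    ))
```

The side effects mentioned at the top still apply: the input is parsed and re-serialized, so unclosed tags come back closed and tag names come back in lowercase.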

For more details on the Python module BeautifulSoup, see the Beautiful Soup documentation.