Implementing IXmlWriter Part 6: Escaping Attribute Content

This is part 6 of my Implementing IXmlWriter post series.

Last time’s IXmlWriter has a serious bug: it doesn’t escape attribute values, so it can produce malformed XML.

Consider the following test case:

StringXmlWriter xmlWriter;
xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteAttributeString("att", "\"");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();

The previous version of IXmlWriter will generate the XML string <root><element att="""/></root>, which is not well-formed and will be rejected by any conforming XML parser. The rules for XML attribute escaping are given by Section 2.3 of the XML 1.0 spec, specifically the AttValue production:

AttValue ::= '"' ([^<&"] | Reference)* '"'
          |  "'" ([^<&'] | Reference)* "'"

This Backus-Naur form-like construct says that attribute values can be enclosed in either single or double quotes, and that the characters <, &, and the respective quotation character cannot appear literally between these quotes. However, we can insert escaped versions of these characters as references. That includes <: the well-formedness constraint "No < in Attribute Values" (thanks dbt) rules out a literal <, and a < in the replacement text of an entity used in the value, but the predefined reference &lt; is still allowed. As we always encase attribute values in double quotes, we only need to worry about escaping the " character and not the ' character. Let’s construct a test case:

StringXmlWriter xmlWriter;
xmlWriter.WriteStartElement("root");
  xmlWriter.WriteStartElement("element");
    xmlWriter.WriteAttributeString("att", "\"&");
  xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();

std::string strXML = xmlWriter.GetXmlString();
// strXML should be <root><element att="&quot;&amp;"/></root>
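
To make the AttValue production more concrete, here is a minimal sketch of a checker for the text between the double quotes. It is not part of IXmlWriter: the names IsDoubleQuotedAttValueBody and CheckAttValueExamples are made up for illustration, and the handling of references is deliberately simplified (the real Name and CharRef productions are stricter).

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true if `body` could appear between double
// quotes under a simplified reading of AttValue.
// Simplification (an assumption of this sketch): a reference is '&', then one
// or more characters from [A-Za-z0-9#], then ';'.
bool IsDoubleQuotedAttValueBody(const std::string& body)
{
    for (std::string::size_type i = 0; i < body.size(); ++i) {
        const char c = body[i];
        if (c == '<' || c == '"') {
            return false;  // excluded by [^<&"]
        }
        if (c == '&') {
            // An ampersand must start a reference: scan to the closing ';'.
            std::string::size_type j = i + 1;
            while (j < body.size() &&
                   (std::isalnum(static_cast<unsigned char>(body[j])) ||
                    body[j] == '#')) {
                ++j;
            }
            if (j == i + 1 || j == body.size() || body[j] != ';') {
                return false;  // bare '&' or unterminated reference
            }
            i = j;  // continue after the ';'
        }
    }
    return true;
}

// The two attribute values from the test cases above, plus one with a
// single quote, which needs no escaping inside double quotes.
void CheckAttValueExamples()
{
    assert(!IsDoubleQuotedAttValueBody("\""));          // raw quote: rejected
    assert(IsDoubleQuotedAttValueBody("&quot;&amp;"));  // escaped: accepted
    assert(IsDoubleQuotedAttValueBody("it's"));
}
```

A check like this is only useful as a test aid. The writer itself should never need it, because it escapes values before emitting them.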

Note that we are now required to perform escaping (albeit with different characters) in two separate functions: WriteString() and WriteAttributeString(). This is a prime candidate for refactoring—we can separate the escaping code into its own function, and we can make such large changes with confidence because we have a test suite to verify that changed code is correct. Here’s the new code:

#include <map>
#include <stack>
#include <string>

typedef std::map<char, std::string> translations_t;

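// Returns a copy of value in which every character that has an entry in
// translations is replaced by its mapped string; all other characters are
// copied unchanged.
std::string TranslateString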
    (
    const std::string& value,
    const translations_t& translations
    )
{
    std::string str;
    for (std::string::const_iterator stringIter = value.begin();
         stringIter != value.end();
         ++stringIter) {
        translations_t::const_iterator mapIter = translations.find(*stringIter);
        if (mapIter != translations.end()) {
            str += mapIter->second;
        } else {
            str += *stringIter;
        }
    }

    return str;
}

class StringXmlWriter
{
private:
    std::stack<std::string> m_openedElements;
    std::string m_xmlStr;
    bool m_unclosedStartElement;
    // Translations used in character data
    translations_t m_charDataTranslations;
    // Translations used in attribute values
    translations_t m_attributeTranslations;

public:
    StringXmlWriter() : m_unclosedStartElement(false)
    {
        m_charDataTranslations['&'] = "&amp;";
        m_charDataTranslations['<'] = "&lt;";
        m_charDataTranslations['>'] = "&gt;";
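        m_attributeTranslations['&'] = "&amp;";
        // A raw '<' is never allowed in an attribute value, but &lt; is.
        m_attributeTranslations['<'] = "&lt;";
        m_attributeTranslations['"'] = "&quot;";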
    }

    void WriteStartElement(const std::string& localName)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_openedElements.push(localName);
        m_xmlStr += '<';
        m_xmlStr += localName;
        m_unclosedStartElement = true;
    }

    void WriteEndElement()
    {
        if (m_unclosedStartElement) {
            m_xmlStr += "/>";
            m_unclosedStartElement = false;
        } else {
            std::string lastOpenedElement = m_openedElements.top();
            m_xmlStr += "</";
            m_xmlStr += lastOpenedElement;
            m_xmlStr += '>';
        }
        m_openedElements.pop();
    }

    void WriteString(const std::string& value)
    {
        if (m_unclosedStartElement) {
            m_xmlStr += '>';
            m_unclosedStartElement = false;
        }

        m_xmlStr += TranslateString(value, m_charDataTranslations);
    }

    void WriteElementString(const std::string& localName,
                            const std::string& value)
    {
        WriteStartElement(localName);
        WriteString(value);
        WriteEndElement();
    }

    void WriteAttributeString(const std::string& localName,
                              const std::string& value)
    {
        m_xmlStr += ' ';
        m_xmlStr += localName;
        m_xmlStr += "=\"";
        m_xmlStr += TranslateString(value, m_attributeTranslations);
        m_xmlStr += '"';
    }

    std::string GetXmlString() const
    {
        return m_xmlStr;
    }
};
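
As a quick illustration of the table-driven approach, the sketch below runs one input through two different tables. It repeats the TranslateString() algorithm so that it compiles on its own. The helper names MakeCharDataTable, MakeAttributeTable and CheckTranslations are hypothetical, and the tables contain only the entries this particular input exercises.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<char, std::string> translations_t;

// Same algorithm as TranslateString() above, repeated so the sketch is
// self-contained.
std::string TranslateString(const std::string& value,
                            const translations_t& translations)
{
    std::string str;
    for (std::string::const_iterator it = value.begin();
         it != value.end();
         ++it) {
        translations_t::const_iterator found = translations.find(*it);
        if (found != translations.end()) {
            str += found->second;
        } else {
            str += *it;
        }
    }
    return str;
}

// Hypothetical helper: character data entries needed for this example only.
translations_t MakeCharDataTable()
{
    translations_t table;
    table['&'] = "&amp;";
    table['>'] = "&gt;";
    return table;
}

// Hypothetical helper: attribute value entries needed for this example only.
translations_t MakeAttributeTable()
{
    translations_t table;
    table['&'] = "&amp;";
    table['"'] = "&quot;";
    return table;
}

void CheckTranslations()
{
    const std::string input = "Tom & \"Jerry\" > all";
    // In character data the quotes pass through, but '>' is escaped.
    assert(TranslateString(input, MakeCharDataTable()) ==
           "Tom &amp; \"Jerry\" &gt; all");
    // In a double-quoted attribute the quotes are escaped; '>' passes through.
    assert(TranslateString(input, MakeAttributeTable()) ==
           "Tom &amp; &quot;Jerry&quot; > all");
}
```

Each input character is looked up exactly once and the output is never rescanned, so the & inside a freshly inserted &amp; is not escaped a second time.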

A raw < character can never appear in an attribute value; it has to be written as &lt; or as a numeric character reference. WriteAttributeString() must therefore never let one through unescaped. The same constraint applies to entities you declare yourself: if they are used in attribute values, their replacement text must not contain a <. I will revisit this when I get to error handling in a future post. In the meantime, be aware of this constraint when you design your XML schemas!