A Profile of XML

An Exploration of the eXtensible Markup Language

The Mexican poet Octavio Paz once commented: 'the differences between the spoken or written language and the other ones — plastic or musical — are very profound, but not to such an extent that they make us forget that essentially they are all language: expressive systems which possess significant power.'

Nowhere more than in computing do expressive systems abound, our most recent bouncing baby being the eXtensible Markup Language or XML. In fact Junior is now a strapping toddler of two years of age and given his explosive growth rate one wonders just how big he will be come adulthood. However, the computing industry is also guilty of putting much old wine in many new bottles. There are, therefore, other crucial questions, for example what kind of child is XML and what does it actually want to be when it does grow up? This article examines XML, its origins, syntax and its relationship to software systems.

This article appeared originally in Issue 4 (Feb 2000) of Ratio Group's journal Objective View. Ratio was a software training company, based in Ealing in London in the UK, that folded many years ago and which is unrelated to any current enterprise called 'Ratio'.

Some Background

Throughout this decade the growth of the Internet and the success of the Web have created an increasing demand for a more powerful version of HTML that can serve as a universal data-interchange standard. For a variety of reasons it was not possible to extend HTML itself, not least because it is a presentation-oriented schema. The search therefore began for an alternative.

Standard Generalised Markup Language (SGML) was originally considered but was rejected because of various inherent problems, such as the challenge of developing suitable parser technologies. It was therefore apparent that a new language was needed, and thus XML (a subset of SGML) was born. In fact, the specification was developed very quickly, over a period of only eighteen months. This was mostly because of the high demand for a universal format, but it was also an attempt to prevent the standard from becoming clogged with extra 'goodies', most of which appear on only a few people's wish lists.

Note however that although XML can be used for transmitting information across a network (and has a very big future in terms of the Internet) it is not, by definition, an Internet issue. In fact, it can just as easily be used in a non-networked environment. When an application receives some data this can come from a secondary-storage medium such as disk or tape as easily as from a network link. Developing this principle, the information does not even have to come from 'outside' the machine at all. Applications can in fact use XML to communicate between themselves at run-time using operating-system services such as pipes and shared-memory arenas.

Given that XML is not fundamentally an Internet issue, it is therefore not a 'web language'. It is possible to present XML data in a browser by means of stylesheets, but this issue is peripheral to the core standard. XML is therefore not a direct replacement for, or extension of, HTML, although the future of HTML is now a moot point.

What's the Deal?

The pivotal issue is that XML is a universal file-format and therein lies the source of the commotion. The principle of divide and conquer has proved time and again to facilitate systems development because it allows us to deal with problems in manageable chunks. In fact, the softer the links between its components the more robust a system will be as a whole. Indeed, the 'decoupling' theme is the prime mover in the current trend towards component-based development.

XML therefore enables us to soften the links between system components, thereby allowing us to protect and capitalise upon our development investments. XML can do this in two ways. Spatially speaking, an application can communicate with another physically separate application without advance knowledge of that application (hence the fuss about XML and the Net). Temporally speaking, an application can write data without prior knowledge of the future applications that may read it.

XML and the Application

This decoupling is accomplished by sending descriptions of the information being communicated along with the information itself. In other words: the data stream also contains metadata or 'mark-up', and the price paid for this approach is that applications must incorporate some form of parser to separate mark-up from data.

The simplest form of parser is the event-driven variety, whereby an appropriate procedure is called for each type of mark-up construct encountered (effected in C/C++ by means of pointers to functions). The second class of parser is the tree-based form, where an internal data-structure known as a 'document tree' (essentially an object hierarchy) is generated during analysis. Once the tree is built, the application is free to navigate the structure, reading and updating it as it chooses. If need be, the tree can then be written out again, as XML, to a file or network link.
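
As a minimal sketch of the event-driven style, the following Python fragment uses the standard library's binding to Expat. The handler names (on_start, on_end, on_text) and the sample markup are assumptions made for illustration; each handler simply records what the parser reports.

```python
import xml.parsers.expat

events = []  # records each callback in the order the parser fires it


def on_start(name, attrs):
    # Called for every start tag; attrs is a dict of the tag's attributes.
    events.append(("start", name, attrs))


def on_end(name):
    # Called for every end tag.
    events.append(("end", name))


def on_text(data):
    # Called for character data; whitespace-only runs are skipped here.
    if data.strip():
        events.append(("text", data.strip()))


parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver adjacent character data in one call
parser.StartElementHandler = on_start
parser.EndElementHandler = on_end
parser.CharacterDataHandler = on_text

# The second argument marks this as the final chunk of input.
parser.Parse('<Tag><NestedTag Kind="demo">Data</NestedTag></Tag>', True)

for event in events:
    print(event)
```

Note that nothing is retained by the parser itself: if the application wants a picture of the whole document, it must build one from the events it receives.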

Note that a number of third-party parser technologies are available. Examples are Microsoft's COM-based component, Vivid Creation's library and James Clark's Expat.

Let us now explore some XML markup and see how the syntax and grammar operate.

Some Syntax

A complete XML script is called a document, and although this is treated as a single logical object, a document can be composed physically of many separate files or 'entities'. An XML document must therefore comprise at least one top-level file called the 'document entity', and this must contain a single top-level 'element'. The general form for elements is shown below.


 <Tag>

    Data

 </Tag>
      

Here <Tag> tells the application that some marked-up data follows and the </Tag> indicates the end of that data. Note that <Tag> will normally be some useful label for the enclosed data. This syntax operates such that elements can be nested within each other, as the next listing shows.


 <Tag>

    <NestedTag>
    Data
    </NestedTag>

    <NestedTag>
    Data
    </NestedTag>

 </Tag>
      

This allows complex composite-data to be represented. The second aspect to element syntax is that the start tag can carry 'attributes'. For example:

   <Tag Attribute1 = "Data" Attribute2 = "Data">

In essence, this is no different to the first general example we saw, in that an element is stated as containing various data, and this is one of the more confusing aspects of XML. In a grammar such as C, block nesting is the only containment model available and this makes for a considerably simpler syntax.
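
To make the overlap concrete, here is a small sketch in Python using the standard library's ElementTree module. It parses the same two items of data, once as attributes and once as nested elements, and shows that an application can recover an identical dictionary from either form. The nested-element spelling is an assumption made for the comparison.

```python
import xml.etree.ElementTree as ET

as_attributes = ET.fromstring('<Tag Attribute1="Data" Attribute2="Data"/>')
as_elements = ET.fromstring(
    "<Tag><Attribute1>Data</Attribute1><Attribute2>Data</Attribute2></Tag>"
)

# Attribute form: the values live in the start tag's attribute dictionary.
from_attributes = dict(as_attributes.attrib)

# Nested form: the same values are the text content of child elements.
from_elements = {child.tag: child.text for child in as_elements}

print(from_attributes == from_elements)
```

The choice between the two forms is therefore largely one of convention, which is exactly why it confuses newcomers.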

In addition, 'empty' element tags are possible which consist of a start tag but no corresponding end tag. These take the following general form:

   <Tag/>

Note the trailing slash indicating that this tag stands alone. Empty-element tags can possess attributes. Apart from acting as a 'flag' at some point in a document, this is the only way in which they are useful, because they have no other means of 'carrying' data.

Let's now see an example of the above general forms in action. The piece of XML below describes a Compact Disc. The attributes in the outermost start-tag state that it is a music CD and detail the name of the album and the band that recorded it. Our hypothetical CD also contains three tracks, and the empty-element tags representing these possess attributes describing the title, running time and author.


 <CompactDisc Type        = "Music"
              Title       = "Shifting Images"
              Band        = "Stampede">

    <Track    Title       = "Huckleberry Finn"
              RunningTime = "4.33"
              Copyright   = "G Shelter"/>

    <Track    Title       = "Lemon Luminance"
              RunningTime = "4.19"
              Copyright   = "A Livingboy"/>

    <Track    Title       = "The Seeing Room"
              RunningTime = "10.56"
              Copyright   = "N Kiwit"/>

 </CompactDisc>
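
Returning to the tree-based parser described earlier, the following Python sketch loads the Compact Disc document into a tree, navigates it, updates it and writes it out again. It uses the standard library's ElementTree module; changing the Type attribute to "Audio" is an arbitrary edit chosen only to show an update.

```python
import xml.etree.ElementTree as ET

CD_XML = """
<CompactDisc Type="Music" Title="Shifting Images" Band="Stampede">
    <Track Title="Huckleberry Finn" RunningTime="4.33" Copyright="G Shelter"/>
    <Track Title="Lemon Luminance" RunningTime="4.19" Copyright="A Livingboy"/>
    <Track Title="The Seeing Room" RunningTime="10.56" Copyright="N Kiwit"/>
</CompactDisc>
"""

# Parsing builds the whole document tree and returns its root element.
disc = ET.fromstring(CD_XML.strip())

# Navigate: read an attribute of the root and of each child element.
band = disc.get("Band")
titles = [track.get("Title") for track in disc.findall("Track")]

# Update the tree in place, then serialise it back to XML text.
disc.set("Type", "Audio")
updated_xml = ET.tostring(disc, encoding="unicode")

print(band, titles)
print(updated_xml)
```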
      

Constraining Markup

The above XML script is an example of what is called a 'well formed' document. This means that it obeys the syntax rules of XML itself: for example, start and end tags are properly nested. However, there are no constraints placed on the content of this document. The CompactDisc element could easily contain an element describing the price of bread, which would of course be inappropriate.
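
A short Python sketch shows what well-formedness does and does not catch. The helper name is_well_formed and the three sample strings are assumptions for illustration; the check itself relies only on the standard library parser rejecting malformed input.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<CompactDisc><Track/></CompactDisc>"
bad_nesting = "<CompactDisc><Track></CompactDisc></Track>"  # tags overlap
off_topic = "<CompactDisc><PriceOfBread/></CompactDisc>"    # odd but legal

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(off_topic))
```

The overlapping tags are rejected, but the price of bread passes without complaint, because well-formedness says nothing about what a document is allowed to contain.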

To assert the required structure and content of marked-up content one must include a Document Type Definition or DTD, which is supplied by means of a document type declaration. During processing a validating parser checks the document's element content and structure against the constraints stated in the DTD. If the elements conform to the DTD's type declarations then the document is said to be valid.
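
The following Python sketch shows roughly what a DTD for a cut-down Compact Disc document might look like, held in the document itself. The declarations are an assumption about how one could constrain the example, not a published format. Note also that the parser in Python's standard library is non-validating, so the function violates_content_model is only a hand-written stand-in for the single rule that a validating parser would enforce from the first ELEMENT declaration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE CompactDisc [
  <!ELEMENT CompactDisc (Track+)>
  <!ATTLIST CompactDisc Title CDATA #REQUIRED>
  <!ELEMENT Track EMPTY>
  <!ATTLIST Track Title CDATA #REQUIRED>
]>
<CompactDisc Title="Shifting Images">
    <Track Title="Huckleberry Finn"/>
    <PriceOfBread/>
</CompactDisc>"""

# A non-validating parser accepts this: the document is well formed.
root = ET.fromstring(DOCUMENT)


def violates_content_model(element):
    # Stand-in for "(Track+)": one or more children, all of them Track.
    children = [child.tag for child in element]
    return not children or any(tag != "Track" for tag in children)


print(root.tag, violates_content_model(root))
```

A validating parser would report the stray PriceOfBread element itself, which is precisely the service that the DTD buys us.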

In addition to a DTD, documents can contain an 'XML Declaration' in the document entity, and a 'Text Declaration' in external entities. These can be used to state what version of XML is being used (although only version 1.0 currently exists), as well as the kind of character-encoding scheme used etc.

Note that DTDs provide the substrate for the XML vertical-market data formats that are becoming increasingly available. Here an organization can define and publish a data interchange standard for a given domain, thus making a lingua franca available to all interested parties. Examples include ChemML for defining documents that describe chemical and molecular information, DocBook for paper publications and MathML for mathematical equations and data. There is also a slew of business-oriented data formats such as FinXML, OFX, BizTalk and cXML.

Further Markup

XML goes much further than the above example, however. It is possible to embed comments within entities (files) just as one would in a traditional programming language. There are some differences, in that comment text can be made available to the application that is parsing the document, and there are certain restrictions on the placing of comments within an entity.
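
As a small sketch of a parser handing comment text to the application, the Python fragment below registers a comment handler with the standard library's Expat binding. The sample markup is an assumption for illustration.

```python
import xml.parsers.expat

comments = []  # comment text is collected here as the parser reports it

parser = xml.parsers.expat.ParserCreate()
# The handler receives the text found between the comment delimiters.
parser.CommentHandler = comments.append
parser.Parse("<Tag><!-- checked by hand -->Data</Tag>", True)

print(comments)
```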

It is also possible to declare named entities within the DTD, which can then be referenced in marked-up content rather than spelt out explicitly. This facility is very similar to the #define preprocessor-directive in C/C++. Note that these declared character sequences can be contained within entities that are separate to the main document-entity, and in this case the 'entity reference' mechanism operates very much like the #include preprocessor-directive in C/C++. Further to this, it is possible for the DTD to be held in part or whole in a separate file, just as one would keep C/C++ type-declarations and prototypes in separate headers.
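
The #define analogy can be sketched in a few lines of Python. Here an entity named band (a hypothetical name, as is the Credit element) is declared once in the DTD and referenced twice in the content; the standard library parser expands both references before the application sees the data.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE CompactDisc [
  <!ENTITY band "Stampede">
]>
<CompactDisc Band="&band;">
    <Credit>Performed by &band;</Credit>
</CompactDisc>"""

root = ET.fromstring(DOCUMENT)

# Both references have been replaced by the declared text.
band_attribute = root.get("Band")
credit_text = root.find("Credit").text

print(band_attribute)
print(credit_text)
```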

XML also supports the concept of conditional sections, which operate in much the same way as 'commenting out' a section of code in a program source file. Note however that this and the entity-reference mechanism explained above do not enjoy the flexibilities we are used to in C/C++.

Of course there are times when we wish to use mark-up characters such as <, > and & in their literal form rather than as signals to the parser. To cater for this XML allows these characters to be escaped, either by means of a numeric 'character reference' or by one of the built-in entity references such as &lt; and &amp;. In addition, XML defines the concept of CDATA sections. These are areas of content in which the parser does not look for markup, passing the text through as plain character data and thereby allowing markup characters to be used literally without recourse to character references. Note however that a CDATA section cannot carry truly raw binary data, because XML restricts the characters that may appear in a document and the sequence ]]> always ends the section. To serialise data such as images or sound an application must therefore first encode it as text, for example in Base64.
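
The two ways of carrying literal markup characters can be compared with a short Python sketch. The Code element and its content are assumptions for illustration; the escape function comes from the standard library's xml.sax.saxutils module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Literal <, > and & written with entity and character references...
escaped = ET.fromstring("<Code>if (a &lt; b &amp;&amp; b &#62; c)</Code>")

# ...and the same text carried unescaped inside a CDATA section.
cdata = ET.fromstring("<Code><![CDATA[if (a < b && b > c)]]></Code>")

# The application receives identical character data either way.
print(escaped.text == cdata.text)

# When writing XML, escape() performs the substitution for us.
print(escape("a < b && b > c"))
```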

Implications

There are however a number of implications in using a markup solution to data interchange. Firstly, because XML documents are human readable they can also be created by hand using a simple text editor. This is in marked contrast to the use of proprietary file formats, which often yield machine-readable data only.

However, although proprietary formats are generally not human readable, they can be considerably faster to read and write, and can therefore mean faster applications. This is because the 'meaning' in the data is implicit in its sequential byte ordering, as opposed to being stated explicitly by bulky metadata. Similarly the 'understanding' of the data is integrated into the application code itself, obviating the need for a separate parsing stage between the information and the application's internal data structures.
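
A rough Python sketch illustrates the point about implicit meaning. It stores the three running times from the Compact Disc example in a made-up binary layout of two bytes per track, and then as marked-up text. Reading "4.33" as four minutes and thirty-three seconds is an assumption, as is the layout itself.

```python
import struct

# Running times from the CD example, read as (minutes, seconds).
tracks = [(4, 33), (4, 19), (10, 56)]

# Proprietary style: meaning is implied purely by byte position.
binary = b"".join(struct.pack("<BB", m, s) for m, s in tracks)

# Markup style: every value travels with a description of itself.
xml_text = "".join(f'<Track RunningTime="{m}.{s:02d}"/>' for m, s in tracks)

# Reading the binary form back needs no parser, only the agreed layout.
first_track = struct.unpack("<BB", binary[:2])

print(len(binary), len(xml_text), first_track)
```

The binary form is far smaller, but it is meaningless to any application that has not been told the layout in advance.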

In addition, marked-up data means larger files and longer data streams. These take up more disk space and take longer to transmit across a network connection. An XML application will also often be larger and slower because of the parsing technologies involved. Indeed, this is the classic time and space tradeoff we must accept every time we choose to decouple system components still further: performance gets slower and software gets bigger.

Finally, there are the human-resource costs to consider in a move to XML. Programmers need to understand the syntax to be able to work with DTDs, and also need experience in working with parser interfaces in order to produce shippable code.

As pointed out earlier, however, the true significance of XML is that disparate applications can talk to each other. Using XML an application can write and read data to and from other unknown applications. Moreover, should one really need to carry proprietary binary data, it is quite possible to mix the proprietary and universal file-format approaches by embedding that data, encoded as text (in Base64, for example), within an XML document.

Further to this, XML text can be compressed, which ameliorates the bandwidth consideration significantly. Note that various schemes have been suggested for making XML more terse, usually at the risk of errors, but none of these has been much more effective than using compression techniques. In essence, XML's design favours readability and robustness over space considerations.
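
The effect of compression is easy to try with Python's standard zlib module. The document below is synthetic (two hundred identical placeholder tracks), so it is far more repetitive than real data and will flatter the result; it is a sketch of the principle rather than a benchmark.

```python
import zlib

# A deliberately repetitive placeholder document.
track = '<Track Title="Untitled" RunningTime="0.00" Copyright="Unknown"/>\n'
document = ("<CompactDisc>\n" + track * 200 + "</CompactDisc>\n").encode("utf-8")

compressed = zlib.compress(document)

# Compression is lossless: the original markup comes back byte for byte.
restored = zlib.decompress(compressed)

print(len(document), len(compressed), restored == document)
```

Repeated tag and attribute names are exactly the kind of redundancy that general-purpose compressors remove well, which is why terser syntaxes gain comparatively little.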

Conclusion

In many ways the advent of XML can therefore be seen as part of a trend towards greater unification, which has a parallel in the rise of UML, and this can only be for the general good.

However, as a formal grammar or 'expressive system', XML simply does not measure up. It is fraught with restrictions and duplications, and generally lacks the flexibility that we enjoy in languages such as C++. For example, there are two entity-reference mechanisms: DTDs use 'parameter entities' whereas content markup uses 'general entities'. There are also restrictions in entity reference syntax regarding external entities. In addition, one can use either attributes or element nesting to the same effect, and this causes considerable confusion among XML neophytes.

Furthermore, comments can be used to signal sections of a document that must be ignored. Yet CDATA sections do the very same thing, while conditional sections can also be employed to the same ends but only in external entities. Why not have a single mechanism for 'blindfolding' the parser? This would make markup semantics easier to understand and parsers would thus be easier to write, more reliable and faster in operation. To signal the difference between a comment and a run of binary data one could use some form of 'comment header'.

Currently, a number of peripheral standards are under development such as XLL (XML Linking Language) [these days called XLink], XPath, XPointer, XSL (Extensible Stylesheet Language) and XQL (XML Query Language). Amongst these are XML Schemas, which are a proposed alternative to Document Type Definitions (DTDs). One reason that these are being promoted is that DTD syntax is inextensible. That is to say that additional markup properties cannot be expressed beyond what the lexicon allows. But wait a minute, are we not talking about the eXtensible Markup Language...?

To be a complete heretic, one can argue that a close cousin of C, not a subset of SGML, should have been developed. Learning XML would then parallel the transition from C++ to Java, where the syntax is so similar that an experienced C++ programmer can be up to speed with Java in a matter of days. Instead of having to digest a set of new (and somewhat arcane) semantics, programmers would be able to read and write XML syntax within hours. Moreover, the grammar in which applications were coded (whether C++ or Java) would be near identical to the grammar written and read by those applications. Now that would be real unification [See the endnotes for more on this].

Muddying the waters still further, an initiative has recently been launched to develop a cut down version of XML called Simple Markup Language or SML. Proponents point out that most developers are using only a subset of the full specification, and to standardise on that would therefore result in a language that was easier to learn and implement. The proposed SML specification would be the same as XML but with no attributes, processing instructions, DTDs, CDATA sections, and so on.

In conclusion, it would seem that Junior is a curate's egg and very much in a state of flux. Moreover, we have been here before with Java and have seen discrepancies in virtual-machine implementations and proprietary extensions to the specification. XML is already showing similar blemishes with, for example, discrepancies between parsers. An overarching solution can only work if the standard is adhered to by all and to the letter.

The reality, however, is that given its potential and the industry's response, individuals and organizations alike cannot afford to ignore XML. Indeed, it would appear that, despite the confused picture, companies are already investigating and implementing the XML approach because they fear being left behind. It may have its shortcomings, but given the backing of the heavyweights and its potential for the Internet, XML (in whatever form it grows into) is not going to go away.

Endnote

Curiously, it now seems that my putatively heretical comment on how a subset of C syntax should have been developed, instead of XML's tag syntax, was not nearly so heretical after all. JSON, or JavaScript Object Notation, has now arisen, which uses just such a syntax, and is therefore very lightweight semantically, as well as being considerably less verbose (and, although its syntax is drawn from JavaScript, it is not tied to that language). This brings the advantages of being quicker to transfer between systems and quicker to parse, with lighter storage demands, all of which matter with large data-sets.
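
For comparison, here is a Python sketch that converts a one-track version of the Compact Disc document into JSON using only the standard library. The mapping is an assumption, since there is no single agreed way to turn XML into JSON: attributes become object members, and the tracks are gathered under a hypothetical "Tracks" key.

```python
import json
import xml.etree.ElementTree as ET

CD_XML = (
    '<CompactDisc Type="Music" Title="Shifting Images" Band="Stampede">'
    '<Track Title="Huckleberry Finn" RunningTime="4.33" Copyright="G Shelter"/>'
    "</CompactDisc>"
)

disc = ET.fromstring(CD_XML)

# Attributes map to object members; child elements to a list of objects.
as_dict = dict(disc.attrib)
as_dict["Tracks"] = [dict(track.attrib) for track in disc.findall("Track")]

CD_JSON = json.dumps(as_dict, indent=2)

# Reading JSON back yields ordinary dictionaries and lists directly.
round_trip = json.loads(CD_JSON)

print(CD_JSON)
```

The braces, brackets and quoted strings will look immediately familiar to anyone from the C family of languages, which is the point being made here.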

Given that it is a syntactic subset of the C 'family' of languages, JSON finds favour with developers who have that kind of background (like this one), just as suggested above (although others find it a little terse and therefore harder). A related point is that Cascading Style Sheets also use a C-like syntax (although why they didn't permit double-slash comments is beyond me), and this also makes CSS very easy to learn, and light on resources to boot. This lends further weight to the argument that XML's tag-based syntax should have been avoided all along.

Given these points, time will tell as to whether or not XML is only a transitional step to the rationalisation and unification of our data-interchange standards.

Richard Vaughan October 2005

Copyright © Richard Vaughan 2000 — 2005