Introduction to XML - Overview

Original Author(s): Markus Gylling

What is markup?

Some quotes from the web:

Markup refers to the sequence of characters or other symbols that you insert at certain places in a text or word processing file to indicate how the file should look when it is printed or displayed [...]
Markup encodes a description of [a] document's storage layout and logical structure.

Taxonomies of markup

Coombs, Renear, and DeRose: Markup systems and the future of scholarly text processing. CACM, November 1987

Presentational Markup

Examples are MS Word and other word processing systems.

Embedding of codes in the text expressing font, size, color etc.

Is inflexible and offers poor longevity and reusability, but can be used to produce nice-looking pages.

Always used with an authoring system that hides the markup from the user. WYSIWYG is a false claim!

Procedural Markup

Examples are Unix tools such as troff, TeX and PostScript.

Primitives remain presentational, but they are embedded in a procedural framework, allowing macros and subroutines, and the notion of the current graphics state can be made concrete.

Procedural markup can be authored directly by humans; much Physics and Math research is published in hand-constructed TeX or LaTeX.

Descriptive Markup

Examples: XML (and its predecessor SGML).

The idea is that the markup doesn't tell you what to do with a piece of text, it tells you what it is, describes it. Another term could be "labeling."

All XML does is provide a nice flexible internationalized way to label the elements of a data structure and ship them around with the labels attached.
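
The "labeling" idea can be illustrated with a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The record, its field names, and the `article` root tag are assumptions chosen for illustration; the point is only that each value travels with a label saying what it is, not how to display it.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are illustrative only.
record = {
    "title": "Markup systems and the future of scholarly text processing",
    "journal": "CACM",
    "year": "1987",
}

def label(fields, root_tag="article"):
    """Wrap each field of a flat dict in an element named after its key."""
    root = ET.Element(root_tag)
    for key, value in fields.items():
        child = ET.SubElement(root, key)
        child.text = value
    return root

# Serialize the labeled structure so it can be shipped around as text.
xml_text = ET.tostring(label(record), encoding="unicode")
print(xml_text)

# The receiver recovers the values by label, with no knowledge of layout.
received = ET.fromstring(xml_text)
print(received.find("year").text)
```

Nothing in the serialized text says anything about fonts or page position; a receiver decides what to do with a `title` or a `year`.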

Descriptive markup was born in the world of publishing technology and has many advantages for serious large-scale publishing.

Antecedent: SGML

SGML, the Standard Generalized Markup Language, deals with the structural (descriptive) markup of electronic documents.

It was made an international standard by ISO in October 1986.

SGML soon became very popular, thanks in particular to its acceptance in the editing world, by large multinational companies, and by governmental organizations, and, more recently, to the emergence of HTML (HyperText Markup Language), the source language of structured documents on the World Wide Web.

W3C: fixing the web

  • A need for meaningful (descriptive) markup
    • HTML was originally intended to provide a simple way to mark up any type of document to reflect its structure (title, major headings, minor headings, lists, and so on) as well as some stylistic aspects (bold, italics, and so forth). Adding to this the hypertext linking capability HTML offered, as well as browser support for a long list of MIME types, it isn't hard to understand the phenomenal rate at which the Web developed.

      However, businesses and scientists also have the need to exchange data. A new language is needed to express the hierarchical relationship of data values, such as that which is represented by database records and object hierarchies. HTML reflects structure and presentation, but conveys nothing about the meaning of the marked-up document.

  • HTML standards changed too slowly
  • Search engines return far too many (unrelated) hits
    • The problem is that search engines typically can only index frequency of words, document titles, and, in some cases, meta tags that describe the contents of a page. What is needed is a way to mark up the significant portions of a document and to convey the semantics of documents so search engines can ignore all of the noise and focus instead on the signal.
  • A need to specify collections of related pages
    • There has to be a better way to express the interrelationship of a set of pages so they can be processed as a group. There is a need to be able to attach metadata to Web pages to express interrelationships.
  • One-way linking is somewhat limited
    • Although the Web's current one-way hypertext link capability has proven extremely useful, did you know far more flexible schemes have existed for many years in the publishing industry? Since 1992, Hypermedia/Time-based Structuring Language (HyTime) and the Text Encoding Initiative (TEI) have enabled publishers to express complex link relationships, such as links with multiple targets, multi-directional links, and automatically updated link databases. We need a richer linking language for the Web.

W3C: XML design principles

Draft DD-1996-0001 - Design Principles for XML

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness is of minimal importance.

Why is XML such an important development?

XML allows the flexible development of user-defined document types.

It provides a...

  • robust
  • flexible
  • non-proprietary
  • persistent
  • verifiable
  • cross-platform
  • cross-language

...file format for the storage and transmission of text and data both on and off the Web; and it removes the more complex options of SGML, making it easier to program for.

XML removes two constraints which were holding back Web and Electronic Information developments:

  • dependence on a single, inflexible document type (HTML) which was being much abused for tasks it was never designed for;
  • the complexity of full SGML, whose syntax allows many powerful but hard-to-program options.

Who is responsible?

XML is a project of the World Wide Web Consortium (W3C), and the development of the specification is being supervised by their XML Working Group. A Special Interest Group of co-opted contributors and experts from various fields contributed comments and reviews by email.

XML is a public format: it is not a proprietary development of any company. The v1.0 specification was accepted by the W3C as a Recommendation on February 10, 1998.

Why do we need it? Why not just use Word or Notes? Or HTML?

Some typical replies off the web:

  • Information on a network which connects many different types of devices has to be usable on all of them.
  • Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands.
  • It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort.
  • Proprietary data formats, no matter how well documented or publicized, are simply not an option: their control still resides in private hands and they can be changed or withdrawn arbitrarily without notice.

XML does nothing

XML is a meta-language, used to create new languages.

XML is but a set of rules defining a common outer syntax for the markup.

XML is extensible

The markup used in HTML documents and the structure of HTML documents are predefined. The author of HTML documents can only use tags that are defined in the HTML standard. XML allows the author to define his own elements and his own document structure.

Industry adoption

Basic architecture of an XML document

 Azuma Miki
 3-2-11 Nishi-Shinjuku
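
The parts of a small XML document (declaration, root element, nested child elements, an attribute, and text content) can be sketched in Python as follows. All element and attribute names (`address-book`, `entry`, `id`, `name`, `street`) are hypothetical; only the two text values are taken from the sample above.

```python
import xml.etree.ElementTree as ET

# Element and attribute names are hypothetical; only the two text
# values come from the sample in the text.
document = """<?xml version="1.0"?>
<address-book>
  <entry id="e1">
    <name>Azuma Miki</name>
    <street>3-2-11 Nishi-Shinjuku</street>
  </entry>
</address-book>"""

root = ET.fromstring(document)

def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, trimmed text."""
    text = (element.text or "").strip()
    lines = ["  " * depth + f"{element.tag} {element.attrib} {text!r}"]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# Print the element tree: one root, with children nested inside it.
print("\n".join(outline(root)))
```

The outline makes the tree shape visible: a single root element contains everything else, and every other element sits inside exactly one parent.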

Wellformedness and "Malformedness"

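If every element in the document that is opened (<elementName>) is also closed (</elementName>) at its own nesting level, the document satisfies the central rule of wellformedness. (The XML 1.0 specification adds a few further rules, such as a single root element and quoted attribute values.) Wellformedness is a fundamental requirement.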

Else, it is malformed. Malformedness is a fundamental error.

Example of malformedness:

   <first-name>This is the text

The example above is malformed XML because the first-name element was not closed.

Another example of malformedness:

   <first-name>This is the text</last-name>

The example above is malformed XML because the first-name element was not closed, and because there was a closing tag for an element last-name which was not open.

Another example of malformedness:

   <paragraph><sentence>This is a sentence inside a paragraph.</paragraph></sentence>

The example above is malformed XML because the paragraph element was closed before the sentence element was closed. Elements must be closed at the same nesting level as they were opened.
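
A wellformedness check can be sketched with Python's standard `xml.etree.ElementTree` parser, which raises `ParseError` on malformed input. The sample strings are illustrative reconstructions of the three cases described above (an unclosed element, a mismatched closing tag, and crossed nesting), plus one wellformed counterpart; the helper name `is_wellformed` is an assumption.

```python
import xml.etree.ElementTree as ET

def is_wellformed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Illustrative samples mirroring the cases described in the text.
samples = {
    "unclosed": "<first-name>This is the text",
    "mismatched": "<first-name>This is the text</last-name>",
    "crossed": "<paragraph><sentence>This is a sentence inside a paragraph."
               "</paragraph></sentence>",
    "wellformed": "<paragraph><sentence>This is a sentence inside a paragraph."
                  "</sentence></paragraph>",
}

results = {name: is_wellformed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'wellformed' if ok else 'malformed'}")
```

A conforming parser stops at the first wellformedness error, which is why the check is a simple yes or no rather than a list of repairs.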

Limiting the grammar: schemas and validity

XML provides mechanisms to impose constraints on a document's storage layout and logical structure. One of these mechanisms is the schema.

There are several XML schema languages. The original and most broadly supported schema language (also defined by the XML 1.0 specification) is the Document Type Definition (DTD).

The DTD defines a collection of element and attribute names that are allowed in the document. It also defines the relationship between these names (which element is allowed as a child of which, which attribute is allowed on which element, etcetera). This collection is sometimes referred to as an XML grammar, and sometimes an XML language.

Example: The XHTML 1.0 DTD defines that the element name for paragraph is "p". In the document it should read as follows:

  <p>This is a paragraph</p>
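
What a grammar constrains (which names exist, which element may be a child of which, which attribute may appear on which element) can be sketched in Python. This is a toy stand-in, not a DTD validator: the `GRAMMAR` table and the `violations` helper are hypothetical, the rules are far simpler than the real XHTML 1.0 DTD, and the Python standard library does not itself validate against DTDs.

```python
import xml.etree.ElementTree as ET

# Toy grammar: for each element name, the child elements and attributes
# it may carry. Hypothetical, and much simpler than the XHTML 1.0 DTD.
GRAMMAR = {
    "body": {"children": {"p"}, "attributes": set()},
    "p": {"children": {"em"}, "attributes": {"class"}},
    "em": {"children": set(), "attributes": set()},
}

def violations(element):
    """Return messages for names or placements the toy grammar forbids."""
    rule = GRAMMAR.get(element.tag)
    if rule is None:
        return [f"unknown element <{element.tag}>"]
    problems = [
        f"attribute '{name}' not allowed on <{element.tag}>"
        for name in element.attrib
        if name not in rule["attributes"]
    ]
    for child in element:
        if child.tag not in rule["children"]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child))
    return problems

# Both documents are wellformed; only the first obeys the toy grammar.
valid = ET.fromstring("<body><p class='intro'>This is a <em>paragraph</em></p></body>")
invalid = ET.fromstring("<body><p><body/></p><para/></body>")

print(violations(valid))
print(violations(invalid))
```

The second document shows the difference between the two notions: it parses without error, so it is wellformed, yet it is not valid against the grammar.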

XML and Accessibility: the future

"The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect."

Tim Berners-Lee, W3C Director and inventor of the World Wide Web

Emerging XML-based languages with a promise of enhanced accessibility, such as:

  • SVG
  • XHTML 2

XML and Accessibility: multimodal interaction

Multimodal Content

The Dream

  • Adapting the Web to allow multiple modes of interaction:
    • GUI, Speech, Vision, Pen, Gestures, Haptic interfaces, ...
  • Augmenting human to computer and human to human interaction
    • Communication services involving multiple devices and multiple people
  • Anywhere, Any device, Any time
    • Services that adapt to the device, user preferences and environmental conditions
  • Accessible to all

The Multimodal Interaction Activity is extending the Web user interface to allow multiple modes of interaction, offering users the choice of using their voice, or an input device such as a key pad, keyboard, mouse, stylus or other input device. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. The Working Group is developing markup specifications for synchronization across multiple modalities and devices with a wide range of capabilities.


XML and Accessibility: requirements

For XML to enhance information accessibility, the following is required:

  • Content and Information that is well structured
  • Content Structures that are semantically meaningful
  • Multimodal interaction with the content


First transition to a full XML fileset with DAISY 2.02 (2001)

This recommendation uses XHTML 1.0, an XML reformulation of HTML:

  • Adds a significant simplification for playing devices
  • Adds future-proofing
  • But does not add grammars authored for the particular purpose

First release of DAISY 3 (Z39.86-2002) (2002)

Introduces specific grammars for:

  • The Navigation Control Center (NCX)
  • The Full Text of the publication (DTBOOK)
  • Device-interchangeable bookmarks
  • and more...

These grammars focus on rigidity of structure and on a simple but semantically rich grammar for print books.

A structurally and semantically correct XML source document can be used to create many different output formats, such as:

  • A DAISY 2.02 Talking Book
  • A DAISY 3 Talking Book
  • Braille Print
  • Dynamic Braille
  • Large-print
  • E-text (ascii, xhtml, ...)
  • for different output formats
  • for new editions of the same book

The DTBOOK DTD is purposely designed for the single source master concept.

Using XSLT and other automated or semi-automated transform processes, the source document can be prepared for the output destination, if any preparation is needed at all (DTBOOK is in itself a good e-text format, for example).

Strict XHTML 1.0 documents can also be upgraded to DTBOOK at a later stage using XSLT.
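
The single source master idea can be sketched in Python: one structured source, several outputs derived from it. This is not XSLT (the Python standard library has no XSLT processor); it is a hand-written stand-in for the same kind of transform. The source fragment is a simplified, hypothetical structure, not a complete or valid DTBOOK document, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical source fragment; not a valid DTBOOK document.
SOURCE = """<book>
  <level1>
    <h1>Chapter One</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </level1>
</book>"""

def to_plain_text(root):
    """Render headings in upper case and paragraphs as plain lines."""
    lines = []
    for element in root.iter():
        if element.tag == "h1":
            lines.append(element.text.upper())
        elif element.tag == "p":
            lines.append(element.text)
    return "\n".join(lines)

def to_xhtml_body(root):
    """Copy headings and paragraphs into a flat XHTML-style <body>."""
    body = ET.Element("body")
    for element in root.iter():
        if element.tag in ("h1", "p"):
            ET.SubElement(body, element.tag).text = element.text
    return ET.tostring(body, encoding="unicode")

source = ET.fromstring(SOURCE)
plain = to_plain_text(source)      # e-text output
xhtml = to_xhtml_body(source)      # XHTML-style output
print(plain)
print(xhtml)
```

Both outputs come from the same source, so a correction made once in the source reaches every output format when the transforms are rerun.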

  • Manual typing
  • Scanning with automated conversion into text (ascii, rtf)
  • Access to publishers files (in a variety of formats)

This page was last edited by PVerma on Wednesday, August 25, 2010 18:57
Text is available under the terms of the DAISY Consortium Intellectual Property Policy, Licensing, and Working Group Process.