Introduction to the XML syntax

Original Author(s): Markus Gylling

XML Markup

XML markup describes a document's storage layout and logical structure. Markup is the method by which the structure and semantics of the information in a document are conveyed.

This section introduces the basics of markup and some common terminology that is central to all XML-related technologies.

Create a new text (.txt) file and save it with your name as the filename.

D:daisysourcesmyfirstxmlmiki.txt


Write the name, street address, city, country, and two telephone numbers below in the text document, one per line.


 Miki Azuma
 3-2-11 Nishi-Shinjuku
 Shinjuku
 Japan
 03-5909-8220
 03-5909-8289

The Element

XML's basic unit of data and markup is called the element.

The element name usually describes the kind of data or text that is contained within the element.

Example of an element:


  <paragraph>This is a paragraph</paragraph>

Another example of an element:


  <name>Keun-Hae Youk</name>

Apart from some restrictions on the characters that can be used, XML itself imposes no restrictions on the element name. The language author decides which name is appropriate for the particular data or structure.

In so-called document-centric XML, the element names mostly describe the structural and/or semantic role of the enclosed text in its context.

Add element tags to each line in your text document. Choose names that appropriately describe the content of each element.


 <name>Miki Azuma</name>
 <street>3-2-11 Nishi-Shinjuku</street>
 <city>Shinjuku</city>
 <country>Japan</country>
 <telephone>03-5909-8220</telephone>
 <telephone>03-5909-8289</telephone>

Note that XML element names cannot contain spaces.

Open tag and Close tag

The XML element has an open tag and a closing tag. These are delimited by the less than ("<") and the greater than (">") characters.


  <name>Miki</name>

In the example above, <name> is the open tag of the element, and </name> is the closing tag of the element.

Note that the only difference is the "slash" character ("/") in the closing tag.

The open tag and the closing tag of an element always have the same name. The example below is an error.


  <name-begin>Miki</name-end>

Also note that element names are case sensitive. The example below is an error, because the open tag uses lower case, and the closing tag uses uppercase.


  <name>Miki</NAME>
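
The tag-matching rules above can be tried out with any XML parser. The following is a minimal sketch that assumes Python and its standard-library xml.etree.ElementTree module, neither of which is part of the original tutorial; the helper name parses is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses("<name>Miki</name>"))            # open and closing tag match
print(parses("<name-begin>Miki</name-end>"))  # different names: rejected
print(parses("<name>Miki</NAME>"))            # same letters, different case: rejected
```

Only the first call is accepted; the parser rejects the other two for the reasons given above.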

Root element

All XML documents form a tree structure.

All XML Documents must have exactly one root element. Example:


  <root>
   <paragraph>text</paragraph>
   <paragraph>text</paragraph>
  </root>

A commonly used synonym for "root" is "document element".
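
As a rough illustration of the single-root rule, the sketch below parses one document that has a root element and one that does not. It assumes Python's standard-library xml.etree.ElementTree parser; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

one_root = "<root><paragraph>text</paragraph><paragraph>text</paragraph></root>"
two_roots = "<paragraph>text</paragraph><paragraph>text</paragraph>"

root = ET.fromstring(one_root)
print(root.tag)   # name of the root (document) element
print(len(root))  # number of child elements directly below the root

try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError as err:
    # The second top-level element is not allowed.
    two_roots_ok = False
    print("rejected:", err)
```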

Add a root element to your document. Remember to use an appropriate semantic to describe the element content. In this case, since this is the root element, the semantic should describe the content of the whole document.


 <address>
   <name>Miki Azuma</name>
   <street>3-2-11 Nishi-Shinjuku</street>
   <city>Shinjuku</city>
   <country>Japan</country>
   <telephone>03-5909-8220</telephone>
   <telephone>03-5909-8289</telephone>
 </address>

The Attribute

All XML elements may have attributes that contain additional information about the data/text.

Example, a content attribute:


  <paragraph content="introduction">...</paragraph>

In this example, the attribute name is "content". The attribute value is "introduction".

There must always be an equals sign (=) between the attribute name and the attribute value.

The attribute value must be enclosed in double or single quotes. The following two attributes are regarded as identical.


  <paragraph content="introduction"></paragraph>
  <paragraph content='introduction'></paragraph>

An attribute is always contained within the open tag of the element.

Attributes on closing tags are forbidden.
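
The attribute rules above can be checked with a parser. This is a small sketch assuming Python's standard-library xml.etree.ElementTree; the helper parses is a hypothetical name used only here.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


double = ET.fromstring('<paragraph content="introduction"></paragraph>')
single = ET.fromstring("<paragraph content='introduction'></paragraph>")

print(double.attrib)                   # attributes are exposed as a dictionary
print(double.attrib == single.attrib)  # the quote style makes no difference

# An unquoted value and an attribute on a closing tag are both rejected.
print(parses("<paragraph content=introduction></paragraph>"))
print(parses('<paragraph>text</paragraph content="introduction">'))
```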

There is a problem in the semantics used in the document so far. The problem is that there is no way to distinguish between the two telephone numbers. Which one is the home number, and which one is the office number?

Although this problem could have been solved by giving different element names to the two telephone numbers (phone-office and phone-home, for example), let's solve it by adding attributes instead.

Add a type attribute to the telephone elements. Give the attributes values that appropriately describe the semantics.


   [...]
   <telephone type="office">03-5909-8220</telephone>
   <telephone type="home">03-5909-8289</telephone>
   [...]
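
One benefit of the type attribute is that a program can now tell the two numbers apart. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree and a shortened copy of the address document that contains only the telephone elements:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the address document; the other child elements are left out.
address = ET.fromstring(
    "<address>"
    '<telephone type="office">03-5909-8220</telephone>'
    '<telephone type="home">03-5909-8289</telephone>'
    "</address>"
)

# Map each type attribute value to the text of its telephone element.
phones = {tel.get("type"): tel.text for tel in address.findall("telephone")}

print(phones["home"])
print(phones["office"])
```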

Child, Parent and Sibling: "nesting"

Elements often form a parent-child relationship. Example:


  <parent>
    <child>Text</child>
  </parent>

This relationship is often referred to as elements being nested within each other. In the example above, child is nested within parent.

Of course, a parent of one child may at the same time be the child of another parent. Example:


  <root>
   <paragraph>
    <sentence>This is a sentence</sentence>
    <sentence>This is another sentence</sentence>
   </paragraph>
  </root>


The above is an example of three-level nesting (root-paragraph-sentence).

<paragraph> is the parent of <sentence>, and at the same time <paragraph> is a child of <root>.

Elements that occur at the same nesting level (such as the two <sentence> elements above), form a sibling relationship.
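
The parent, child and sibling relationships can also be explored in code. The sketch below assumes Python's standard-library xml.etree.ElementTree, whose elements do not record their parent, so a lookup table (here called parent_of, a name chosen for this example) is built first.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><paragraph>"
    "<sentence>This is a sentence</sentence>"
    "<sentence>This is another sentence</sentence>"
    "</paragraph></root>"
)

# Walk the whole tree once and remember the parent of every child element.
parent_of = {child: parent for parent in root.iter() for child in parent}

paragraph = root.find("paragraph")
sentences = paragraph.findall("sentence")

print(parent_of[paragraph].tag)        # the parent of paragraph
print([kid.tag for kid in paragraph])  # the children of paragraph
# Siblings are elements that share the same parent.
print(parent_of[sentences[0]] is parent_of[sentences[1]])
```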

Refine the semantics of the name element by adding two child elements to it. The first child element should be first-name and the second should be last-name.


  [...]
  <name>
    <first-name>Miki</first-name>
    <last-name>Azuma</last-name>
  </name>
  [...]

The name element now has two children: first-name and last-name.

The first-name and last-name elements are siblings of each other. They have a common parent: the name element.

The Text Node

Some elements have a text node, others do not.

In the example below, "paragraph" has a text node, but "document" does not.


  <document>
    <paragraph>This is the text node</paragraph>
  </document>
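
To see the difference between an element with a text node and one without, here is a small sketch that assumes Python's standard-library xml.etree.ElementTree. The sample is written on one line on purpose.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    "<document><paragraph>This is the text node</paragraph></document>"
)
paragraph = document.find("paragraph")

print(paragraph.text)  # the text held directly by paragraph
print(document.text)   # None: document holds only a child element
```

If the sample were indented as in the example above, this particular parser would report the line breaks and spaces between the tags as whitespace-only text on document.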

The Empty element

Elements that do not have text nodes, nor other elements as children, can be expressed as empty elements.

The syntax for empty elements is slightly different:


  <elementName />

Note that this element does not have an opening and a closing tag! They are merged into one.

Empty elements often have attributes that contain additional information. Example:


  <image file="/myImages/image.png" />

In this example, the attribute "file" contains a pointer to an image.
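
The empty-element syntax is shorthand for an open tag immediately followed by a closing tag. A minimal sketch of that equivalence, assuming Python's standard-library xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<image file="/myImages/image.png" />')
long_form = ET.fromstring('<image file="/myImages/image.png"></image>')

print(short_form.text, len(short_form))  # no text node and no child elements
print(short_form.get("file"))            # the attribute is still available

# Both spellings produce the same element as far as the parser is concerned.
same = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and short_form.text == long_form.text
)
print(same)
```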

Add an empty element named added as the first child of the root element. Give this element a date attribute with today's date as its value.


  <address>
    <added date="2003-08-01" />
    <name>
    [...]

Does the date attribute value follow the international standard for dates?

The international standard for dates (ISO 8601) uses the date format yyyy-mm-dd. Make sure the date value you added complies with this standard.

Add a scheme attribute to the added element that makes it explicitly clear that the date format used is ISO 8601.


  <address>
    <added date="2003-08-01" scheme="iso 8601" />
    [...]
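
XML itself does not check what an attribute value looks like, so a date in the wrong format is still accepted by the parser. The sketch below shows one possible way to test the yyyy-mm-dd form in a program. It assumes Python and its standard library; the function name is_iso_date is invented for this example.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date


def is_iso_date(value):
    """Return True if value is a real calendar date written as yyyy-mm-dd."""
    match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for dates such as 2003-02-30
        return True
    except ValueError:
        return False


added = ET.fromstring('<added date="2003-08-01" scheme="iso 8601" />')

print(is_iso_date(added.get("date")))  # the value used in the exercise
print(is_iso_date("01/08/2003"))       # wrong separators and order
print(is_iso_date("2003-02-30"))       # right shape, but not a real date
```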

Wellformedness and "Malformedness"

If every element that is opened (<elementName>) is also closed (</elementName>) at its own nesting level, and the other syntax rules described above are followed (a single root element, matching tag names, quoted attribute values), the document is wellformed. Wellformedness is a fundamental requirement.

Otherwise, it is malformed. Malformedness is a fundamental error.

Example of malformedness:


  <name>
    <first-name>Miki
    <last-name>Azuma</last-name>
  </name>

The example above is malformed XML because the first-name element was not closed.

Another example of malformedness:


  <name>
   <first-name>This is the text</last-name>
  </name>

The example above is malformed XML because the first-name element was not closed, and because there was a closing tag for an element last-name which was not open.

Another example of malformedness:


  <document>
   <paragraph>
     <sentence>
       This is a sentence inside a paragraph.
     </paragraph>
   </sentence>
  </document>

The example above is malformed XML because the paragraph element was closed before the sentence element was closed. Elements must be closed at the same nesting level as they were opened.
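
The three malformed examples can be fed to a parser to see how it reacts. This is a sketch only, assuming Python's standard-library xml.etree.ElementTree; the function name well_formedness_error and the sample labels are made up for this illustration, and the samples are written on one line to keep the listing short.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text is wellformed, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)


samples = {
    "unclosed first-name": "<name><first-name>Miki<last-name>Azuma</last-name></name>",
    "wrong closing tag": "<name><first-name>This is the text</last-name></name>",
    "crossed nesting": "<document><paragraph><sentence>text</paragraph></sentence></document>",
    "corrected": "<name><first-name>Miki</first-name><last-name>Azuma</last-name></name>",
}

for label, text in samples.items():
    print(label, "->", well_formedness_error(text) or "wellformed")
```

The exact wording of the error messages depends on the parser, but the first three samples are rejected and only the corrected one is accepted.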

Make sure your document is well formed. In your text editor, choose "save as" and save the document with the file extension ".xml".

Then go to the folder where the document is saved. Open the document in Internet Explorer.

If your document is well formed, the XML Document tree will be displayed in Internet Explorer as a collapsible tree. If it is not well formed, an error message will be shown.

Schemas: DTD

XML provides mechanisms to impose constraints on the document's storage layout and logical structure. One of these mechanisms is the schema.

There are several XML schema languages. The original and most broadly supported schema language (also defined by the XML 1.0 specification) is the Document Type Definition (DTD).

The DTD defines a collection of element and attribute names that are allowed in the document. It also defines the relationship between these names (which element is allowed as a child of which, which attribute is allowed on which element, and so on). This collection is sometimes referred to as an XML grammar, and sometimes as an XML language.

Example: The XHTML 1.0 DTD defines that the element name for paragraph is "p". In the document it should read as follows:


  <p>This is a paragraph</p>

The DTD also defines for which elements text nodes are allowed, and which elements are empty.

What is the reason to want to impose grammatical constraints on an XML language?

Validity

If all elements, attributes, etc, use names and syntax as defined in the DTD, the document is valid. If this is not the case, it is invalid. Example:


  <para>This is a paragraph using
    the element name "para".</para>

"para" is unknown to the XHTML DTD. The above example is wellformed XML but invalid XHTML. The element name used must be "p":


  <p>This is a paragraph using
    the element name "p" as defined by the XHTML DTD</p>
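
The difference between wellformed and valid can be imitated in a few lines of code. The sketch below is not a DTD validator: Python's standard-library xml.etree.ElementTree, assumed here, only checks wellformedness. The set KNOWN_NAMES is a hypothetical, hand-picked handful of XHTML element names that stands in for the real DTD, and only element names are checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: a few XHTML element names, for illustration only.
KNOWN_NAMES = {"html", "head", "title", "body", "p"}


def unknown_names(xml_text):
    """Parse wellformed xml_text and return the element names not in KNOWN_NAMES."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()} - KNOWN_NAMES)


# Both samples are wellformed, but only the second uses known names throughout.
print(unknown_names("<body><para>This is a paragraph</para></body>"))
print(unknown_names("<body><p>This is a paragraph</p></body>"))
```

A real validating parser would also check nesting, attributes and allowed content against the DTD.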

Wellformedness and Validity

Wellformedness refers to the syntactical correctness of the document.

Validity refers to the grammatical correctness of the document.

A document that is not wellformed can never be valid.

A document that is wellformed can be invalid.

The Prolog: XML and DOCTYPE declarations

The prolog resides above the root element open tag.

The prolog contains the XML and DOCTYPE declarations.

The DOCTYPE declaration is required to tell if the document is valid, since the DOCTYPE declaration tells which DTD the document is associated with.

The XML declaration is required only if the character encoding is something other than the Unicode encodings UTF-8 or UTF-16, but it is recommended to always include the XML declaration.

The XML declaration is not an element. When present, it must be the very first thing in the file:


  <?xml version="1.0"?>

The DOCTYPE declaration always occurs after the XML declaration, and before the root element open tag:


  <?xml version="1.0"?>
  <!DOCTYPE ... >
  <root>
    ...
  </root>
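
A parser can confirm that the prolog is accepted and that nothing may come before the XML declaration. This is a rough sketch assuming Python's standard-library xml.etree.ElementTree. The DOCTYPE declaration is reduced to the placeholder `<!DOCTYPE root>`, which names no DTD and is an assumption made only to keep the sample parseable.

```python
import xml.etree.ElementTree as ET

# Placeholder prolog: the DOCTYPE declaration here does not point to any DTD.
with_prolog = '<?xml version="1.0"?>\n<!DOCTYPE root>\n<root>...</root>'

root = ET.fromstring(with_prolog)
print(root.tag, root.text)  # the prolog is read, then the root element follows

# Anything placed before the XML declaration, even one space, is an error.
try:
    ET.fromstring(" " + with_prolog)
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False
print(leading_space_ok)
```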

Terms you should understand now

element
open tag
close tag
root element/document element
empty element
attribute
text node
child
parent
sibling
wellformedness
DTD
validity
prolog
structure
semantic

This page was last edited by PVerma on Tuesday, July 13, 2010 00:28
Text is available under the terms of the DAISY Consortium Intellectual Property Policy, Licensing, and Working Group Process.