Extensible Markup Language (XML)

  • The Extensible Markup Language (XML) was developed in 1996 by the World Wide Web Consortium’s (W3C’s) XML Working Group. XML is a widely supported open technology (i.e., nonproprietary technology) for describing data that has become the standard format for data exchanged between applications over the Internet
  • XML Basics:

    • XML permits document authors to create markup (i.e., a text-based notation for describing data) for virtually any type of information. This enables document authors to create entirely new markup languages for describing any type of data, such as mathematical formulas, software-configuration instructions, chemical molecular structures, music, news, recipes and financial reports. XML describes data in a way that both human beings and computers can understand
    • The following example is a simple XML document that describes information for a baseball player. We focus on the player element and its child elements to introduce basic XML syntax:

<?xml version = "1.0"?>
<!-- Fig. 19.1: player.xml -->
<!-- Baseball player structured with XML -->
<player>
   <firstName>John</firstName>
   <lastName>Doe</lastName>
   <battingAverage>0.375</battingAverage>
</player>
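To make the role of a parser concrete, here is a minimal sketch that reads the player document above with Python's standard-library `xml.etree.ElementTree` module. The helper name `read_player` and the choice of a dictionary as the result are assumptions made for illustration; they are not part of the original example.

```python
import xml.etree.ElementTree as ET

# The player.xml document from above, embedded as a string for convenience.
PLAYER_XML = """<?xml version = "1.0"?>
<!-- Fig. 19.1: player.xml -->
<!-- Baseball player structured with XML -->
<player>
   <firstName>John</firstName>
   <lastName>Doe</lastName>
   <battingAverage>0.375</battingAverage>
</player>"""


def read_player(xml_text):
    """Parse the player document and return its data as a plain dict.

    read_player is a hypothetical helper used only for this illustration.
    """
    root = ET.fromstring(xml_text)  # root is the single <player> element
    return {
        "root": root.tag,
        "firstName": root.findtext("firstName"),
        "lastName": root.findtext("lastName"),
        "battingAverage": float(root.findtext("battingAverage")),
    }


player = read_player(PLAYER_XML)
print(player)
```

The start tag and end tag of each element delimit its content, and the parser hands that content to the program as ordinary strings that the program can then convert as needed.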

  • XML documents delimit elements with start tags and end tags
  • Every XML document must have exactly one root element that contains all the other elements
  • Some XML-based markup languages include XHTML (Extensible HyperText Markup Language, designed as HTML’s replacement for marking up Web content), MathML (for mathematics), VoiceXML™ (for speech), CML (Chemical Markup Language, for chemistry) and XBRL (Extensible Business Reporting Language, for financial data exchange). These markup languages are called XML vocabularies and provide a means for describing particular types of data in standardized, structured ways
  • Massive amounts of data are currently stored on the Internet in a variety of formats (e.g., databases, Web pages, text files). Based on current trends, it is likely that much of this data, especially that which is passed between systems, will soon take the form of XML. Organizations see XML as the future of data encoding. Information technology groups are planning ways to integrate XML into their systems. Industry groups are developing custom XML vocabularies for most major industries that will allow computer-based business applications to communicate in common languages. For example, Web services allow Web-based applications to exchange data seamlessly through standard protocols based on XML
  • XML documents are highly portable. Viewing or modifying an XML document (a text file that typically ends with the .xml filename extension) does not require special software, although many software tools exist, and new ones are frequently released, that make it more convenient to develop XML-based applications. Any text editor that supports ASCII/Unicode characters can open XML documents for viewing and editing. Also, most Web browsers can display XML documents in a formatted manner that makes it easier to see the XML’s structure. One important characteristic of XML is that it is both human readable and machine readable.
  • Processing an XML document requires software called an XML parser (or XML processor). A parser makes the document’s data available to applications. While reading the contents of an XML document, a parser checks that the document follows the syntax rules specified by the W3C’s XML Recommendation (www.w3.org/XML). XML syntax requires a single root element, a start tag and end tag for each element, and properly nested tags (i.e., the end tag for a nested element must appear before the end tag of the enclosing element). Furthermore, XML is case sensitive, so the proper capitalization must be used in elements. A document that conforms to this syntax is a well-formed XML document, and is syntactically correct. If an XML parser can process an XML document successfully, that XML document is well formed. Parsers can provide access to XML-encoded data in well-formed documents only.
  • An XML document can optionally reference a Document Type Definition (DTD) or a schema that defines the proper structure of the XML document. When an XML document references a DTD or a schema, some parsers (called validating parsers) can read the DTD/schema and check that the XML document follows the structure defined by the DTD/schema. If the XML document conforms to the DTD/schema (i.e., the document has the appropriate structure), the XML document is valid. For example, if the preceding player.xml document referenced a DTD that specifies that a player element must have firstName, lastName and battingAverage elements, then omitting the lastName element would cause player.xml to be invalid. However, the XML document would still be well formed, because it follows proper XML syntax (i.e., it has one root element, and each element has a start tag and an end tag). By definition, a valid XML document is well formed. Parsers that cannot check for document conformity against DTDs/schemas are nonvalidating parsers; they determine only whether an XML document is well formed, not whether it is valid
  • XML documents contain only data, not formatting instructions, so applications that process XML documents must decide how to manipulate or display each document’s data. For example, a PDA (personal digital assistant) may render an XML document differently than a wireless phone or a desktop computer. You can use Extensible Stylesheet Language (XSL) to specify rendering instructions for different platforms
  • XML-processing programs can also search, sort and manipulate XML data using technologies such as XSL. Some other XML-related technologies are XPath (XML Path Language, a language for accessing parts of an XML document), XSL-FO (XSL Formatting Objects, an XML vocabulary used to describe document formatting) and XSLT (XSL Transformations, a language for transforming XML documents into other documents).
  • In summary, four technologies related to XML are XSL, XPath, XSL-FO and XSLT
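The well-formedness rules described above (a single root element, a matching end tag for each start tag, proper nesting and case sensitivity) can be checked with any nonvalidating parser. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are assumptions made for illustration. Note that this parser does not validate against a DTD or schema, so it reports only whether a document is well formed, not whether it is valid.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a nonvalidating parser accepts xml_text.

    is_well_formed is a hypothetical helper for this illustration.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, one per syntax rule.
samples = {
    "single root, properly nested": "<player><firstName>John</firstName></player>",
    "two root elements": "<player/><player/>",
    "improper nesting": "<a><b></a></b>",
    "mismatched case": "<player></Player>",
    "missing end tag": "<player><firstName>John</player>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

A player document that omits lastName would still pass this check, which is exactly the distinction between a well-formed document and a valid one.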
  • Structuring Data

    • <?xml version = "1.0"?>: this is the XML declaration
    • Documents should include the XML declaration to identify the version of XML used. A document that lacks an XML declaration might be assumed to conform to the latest version of XML; when it does not, errors could result
    • Placing whitespace characters before the XML declaration is an error
    • XML comments begin with <!-- and end with -->, e.g., <!-- commented text -->
    • XML is case sensitive. Using different cases for the start tag and end tag names for the same element is a syntax error
    • The lines that precede the root element are called the XML prolog. In an XML prolog, the XML declaration must appear before the comments and any other markup
    • Element names may contain letters, digits, underscores, hyphens and periods. However, they must begin with either a letter or an underscore, and they should not begin with "xml" in any combination of uppercase and lowercase letters (e.g., XML, Xml, xMl), as this is reserved for use in the XML standards
    • XML element names should be meaningful to humans and should not use abbreviations
    • Any element that contains other elements is a container element (or parent element), and elements nested inside a container element are child elements (or children) of that container element
    • Parsers often store XML data as tree structures to facilitate efficient manipulation
    • <!DOCTYPE letter SYSTEM "letter.dtd">: this is a document type declaration that references a DTD
    • DTD specifies the elements and parent-child relationships between elements permitted in an XML document
    • The previous DTD reference contains three items: the name of the root element that the DTD specifies (letter); the keyword SYSTEM (which denotes an external DTD, i.e., a DTD declared in a separate file, as opposed to a DTD declared locally in the same file); and the DTD’s name and location (i.e., letter.dtd in the current directory). DTD document filenames typically end with the .dtd extension
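To illustrate how a parser stores XML data as a tree of parent and child elements, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The sample letter document, its salutation and name elements, and the helper `outline` are assumptions made for illustration; only the letter and contact element names come from the notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical letter document; its content is invented for this sketch.
LETTER_XML = """<letter>
   <contact type="sender">
      <name>Jane Doe</name>
   </contact>
   <salutation>Dear Sir:</salutation>
</letter>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for element and all of its descendants."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating an Element yields its child elements
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(LETTER_XML)
for depth, tag in outline(root):
    print("   " * depth + tag)
```

Here letter is the root (and a container element), contact is both a child of letter and the parent of name, and the indentation of the printed outline mirrors the nesting in the document.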
  • XML Namespaces:

    • XML allows document authors to create custom elements. This extensibility can result in naming collisions among elements in an XML document that each have the same name
    • An XML namespace is a collection of element and attribute names. Like C# namespaces, XML namespaces provide a means for document authors to unambiguously refer to elements with the same name (i.e., prevent collisions)
    • For example: <school:subject>Math</school:subject> and <medical:subject>Cardiology</medical:subject>
    • Both school and medical are namespace prefixes
    • Attempting to create a namespace prefix named xml in any mixture of uppercase and lowercase letters is a syntax error; the xml namespace prefix is reserved for internal use by XML itself
    • Each namespace prefix is bound to a series of characters called a Uniform Resource Identifier (URI) that uniquely identifies the namespace. Document authors create their own namespace prefixes and URIs. A URI is a way of identifying a resource, typically on the Internet. Two popular types of URI are Uniform Resource Name (URN) and Uniform Resource Locator (URL).
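The binding between a prefix and its URI can be seen in how a namespace-aware parser reports element names. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the directory root element and the two `urn:example:` URIs are assumptions invented for illustration, since the notes do not show the URIs bound to school and medical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the root element name and both URIs are assumptions.
DOC = """<directory xmlns:school="urn:example:school"
           xmlns:medical="urn:example:medical">
   <school:subject>Math</school:subject>
   <medical:subject>Cardiology</medical:subject>
</directory>"""

root = ET.fromstring(DOC)

# ElementTree reports each tag as {namespace-URI}local-name, so the two
# subject elements no longer collide.
tags = [child.tag for child in root]
print(tags)

# When searching, any prefix may be used as long as it maps to the same URI.
ns = {"s": "urn:example:school", "m": "urn:example:medical"}
school_subject = root.findtext("s:subject", namespaces=ns)
medical_subject = root.findtext("m:subject", namespaces=ns)
print(school_subject, medical_subject)
```

The prefixes used in the lookup (s and m) differ from those in the document (school and medical), which shows that the URI, not the prefix, is what identifies the namespace.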
  • Document Type Definitions (DTDs):

    • Document Type Definitions (DTDs) are one of two main types of documents you can use to specify XML document structure
    • Many organizations and individuals are creating DTDs and schemas for a broad range of applications. These collections, called repositories, are available free for download from the Web (e.g., www.xml.org, www.oasis-open.org).
    • A DTD expresses the set of rules for document structure using an EBNF (Extended Backus-Naur Form) grammar. [Note: EBNF grammars are commonly used to define programming languages.]
    • An ELEMENT element type declaration defines the rules for an element (e.g., element letter)
    • The plus sign (+) occurrence indicator specifies that the DTD allows one or more occurrences of an element. Other occurrence indicators include the asterisk (*), which indicates an optional element that can occur zero or more times, and the question mark (?), which indicates an optional element that can occur at most once
    • An ATTLIST attribute-list declaration defines an element's attributes. For example, the following declares an attribute named type for the contact element:

      • <!ATTLIST contact type CDATA #IMPLIED>
      • Keyword #IMPLIED specifies that if the parser finds a contact element without a type attribute, the parser can choose an arbitrary value for the attribute or can ignore the attribute
    • Keyword #REQUIRED specifies that the attribute must be present in the element, and keyword #FIXED specifies that the attribute (if present) must have the given fixed value
    • Keyword CDATA specifies that attribute type contains character data (i.e., a string)
    • DTD syntax does not provide a mechanism for describing an element’s (or attribute’s) data type. For example, a DTD cannot specify that a particular element or attribute can contain only integer data
    • Keyword #PCDATA specifies that an element (e.g., name) may contain parsed character data (i.e., data that is processed by an XML parser). Elements with parsed character data cannot contain markup characters, such as less than (<), greater than (>) or ampersand (&).
    • The document author should replace any markup character in a #PCDATA element with the character’s corresponding character entity reference. For example, the character entity reference &lt; should be used in place of the less-than symbol (<), and the character entity reference &gt; should be used in place of the greater-than symbol (>). A document author who wishes to use a literal ampersand should use the entity reference &amp; instead; parsed character data can contain ampersands (&) only for inserting entities
    • Keyword EMPTY specifies that the element does not contain any data between its start and end tags. Empty elements commonly describe data via attributes
    • A well-formed document is syntactically correct (i.e., each start tag has a corresponding end tag, the document contains only one root element, etc.), and a valid document contains the proper elements with the proper attributes in the proper sequence. An XML document cannot be valid unless it is well formed
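The replacement of markup characters with entity references, described above for parsed character data, can be sketched in Python with the standard-library function `xml.sax.saxutils.escape`. The note element, the sample text and the helper `parses` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(xml_text):
    """Return True if the parser accepts xml_text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = "if a < b && b > c"  # hypothetical text containing markup characters

# escape() replaces &, < and > with &amp;, &lt; and &gt; respectively.
escaped = escape(raw)
print(escaped)

# Placing the raw text directly inside an element is rejected by the parser,
# while the escaped text is accepted and read back as the original string.
raw_ok = parses(f"<note>{raw}</note>")
parsed_text = ET.fromstring(f"<note>{escaped}</note>").text
print(raw_ok, parsed_text == raw)
```

The parser converts the entity references back to the characters they stand for, so the application sees the original text.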
  • W3C XML Schema Documents:

    • Many developers in the XML community believe that DTDs are not flexible enough to meet today’s programming needs. For example, DTDs lack a way of indicating what specific type of data (e.g., numeric, text) an element can contain and DTDs are not themselves XML documents. These and other limitations have led to the development of schemas
    • A DTD describes an XML document’s structure, not the content of its elements
    • An XML document that conforms to a schema document is schema valid, and one that does not conform is schema invalid. Schemas are XML documents and therefore must themselves be valid
    • By convention, schemas use the .xsd extension
    • The validator knows which XML document to test because the target document has the same name as the schema reference
    • Each element declaration in a schema (an element named element) specifies an actual element that can be used to mark up data
    • Two categories of data type exist in XML Schema: simple types and complex types. Simple and complex types differ only in that simple types cannot contain attributes or child elements and complex types can
    • A user-defined type that contains attributes or child elements must be defined as a complex type
    • Every simple type defines a restriction on an XML Schema-defined type or a restriction on a user-defined type. Restrictions limit the possible values that an element can hold
    • Complex types are divided into two groups: those with simple content and those with complex content

| XML Schema data type | Description | Ranges or structures | Examples |
| --- | --- | --- | --- |
| string | A character string. | | "hello" |
| boolean | True or false. | true, false | true |
| decimal | A decimal numeral. | i * (10^n), where i is an integer and n is an integer that is less than or equal to zero. | 5, -12, -45.78 |
| float | A floating-point number. | m * (2^e), where m is an integer whose absolute value is less than 2^24 and e is an integer in the range -149 to 104. Plus three additional numbers: positive infinity, negative infinity and not-a-number (NaN). | 0, 12, -109.375, NaN |
| double | A floating-point number. | m * (2^e), where m is an integer whose absolute value is less than 2^53 and e is an integer in the range -1075 to 970. Plus three additional numbers: positive infinity, negative infinity and not-a-number (NaN). | 0, 12, -109.375, NaN |
| long | A whole number. | -9223372036854775808 to 9223372036854775807, inclusive | 1234567890, -1234567890 |
| int | A whole number. | -2147483648 to 2147483647, inclusive | 1234567890, -1234567890 |
| short | A whole number. | -32768 to 32767, inclusive | 12, -345 |
| date | A date consisting of a year, month and day. | yyyy-mm-dd with an optional time zone, where yyyy is four digits long and mm and dd are two digits long. | 2005-05-10 |
| time | A time consisting of hours, minutes and seconds. | hh:mm:ss with an optional time zone, where hh, mm and ss are two digits long. | 16:30:25-05:00 |
    • Both simple content and complex content can contain attributes, but only complex content can contain child elements. Complex types with simple content must extend or restrict some other existing type. Complex types with complex content do not have this limitation.
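As a rough illustration of how a type restricts the values an element can hold, the sketch below checks text against the inclusive ranges listed for short, int and long in the data type table above. It is plain Python, not a schema validator: the helper `fits_integer_type` is hypothetical, and the lexical check (an optional sign followed by digits) is a simplifying assumption about how these values are written.

```python
import re

# Inclusive ranges copied from the data type table above.
INTEGER_RANGES = {
    "short": (-32768, 32767),
    "int": (-2147483648, 2147483647),
    "long": (-9223372036854775808, 9223372036854775807),
}

# Simplifying assumption: an optional sign followed by one or more digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def fits_integer_type(type_name, text):
    """Return True if text is a whole number within type_name's range."""
    if not _INTEGER_LEXICAL.fullmatch(text):
        return False
    low, high = INTEGER_RANGES[type_name]
    return low <= int(text) <= high


for type_name, text in [("short", "32767"), ("short", "32768"),
                        ("int", "1234567890"), ("short", "12.5")]:
    print(type_name, text, fits_integer_type(type_name, text))
```

A DTD has no way to express this kind of constraint, which is one of the limitations that schemas were developed to address.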
  • Extensible Stylesheet Language and XSL Transformations:

    • Extensible Stylesheet Language (XSL) documents specify how programs are to render XML document data. XSL is a group of three technologies: XSL-FO (XSL Formatting Objects), XPath (XML Path Language) and XSLT (XSL Transformations). XSL-FO is a vocabulary for specifying formatting, and XPath is a string-based language of expressions used by XML and many of its related technologies for effectively and efficiently locating structures and data (such as specific elements and attributes) in XML documents.
    • The third portion of XSL is XSL Transformations (XSLT), a technology for transforming XML documents into other documents, i.e., transforming the structure of the XML document data to another structure. XSLT provides elements that define rules for transforming one XML document to produce a different XML document. This is useful when you want to use data in multiple applications or on multiple platforms, each of which may be designed to work with documents written in a particular vocabulary. For example, XSLT allows you to convert a simple XML document to an XHTML (Extensible HyperText Markup Language) document that presents the XML document’s data (or a subset of the data) formatted for display in a Web browser
    • Transforming an XML document using XSLT involves two tree structures: the source tree (i.e., the XML document to be transformed) and the result tree (i.e., the XML document to be created). XPath is used to locate parts of the source tree document that match templates defined in an XSL style sheet. When a match occurs (i.e., a node matches a template), the matching template executes and adds its result to the result tree. When there are no more matches, XSLT has transformed the source tree into the result tree. XSLT does not analyze every node of the source tree; it selectively navigates the source tree using XPath’s select and match attributes. For XSLT to function, the source tree must be properly structured. Schemas, DTDs and validating parsers can validate document structure before using XPath and XSLT
    • A processing instruction is embedded in an XML document and provides application-specific information to whichever XML processor the application uses. For example, a processing instruction can specify the location of an XSLT document with which to transform the XML document.
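The idea of selecting nodes from a source tree and building a result tree can be sketched without an XSLT processor. The example below is not XSLT: it is a hand-written Python analogue that uses the limited XPath subset supported by the standard-library `xml.etree.ElementTree` to locate nodes, then builds a small table as the result tree. The roster document, the second player and the helper `to_table` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the roster element and Jane Roe are invented.
SOURCE = """<roster>
   <player><firstName>John</firstName><lastName>Doe</lastName></player>
   <player><firstName>Jane</firstName><lastName>Roe</lastName></player>
</roster>"""


def to_table(source_root):
    """Build a result tree (<table>) from the <player> nodes of the source tree."""
    table = ET.Element("table")
    for player in source_root.findall("./player"):  # path expression selects nodes
        row = ET.SubElement(table, "tr")
        for field in ("firstName", "lastName"):
            cell = ET.SubElement(row, "td")
            cell.text = player.findtext(field)
    return table


source_root = ET.fromstring(SOURCE)

# Serialize the result tree; the source tree itself is left unchanged.
result = ET.tostring(to_table(source_root), encoding="unicode")
print(result)

# A predicate narrows the selection to players whose lastName is "Doe".
does = source_root.findall("./player[lastName='Doe']")
print([p.findtext("firstName") for p in does])
```

In real XSLT the selection and the output structure would be declared in templates inside a style sheet rather than written as loops, but the flow from source tree to result tree is the same.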