Web Developer: Are XML Databases Necessary?
XML databases may help development efforts in some situations — but an RDBMS may still be the best choice.
By Michael S. Dougherty
Quarter 1, 2003

Resources

DB2 XML Extender
DB2 Developer Domain Xperanto Demo
IBM DeveloperWorks XML Zone
XML: DB Initiative for XML Databases
XML and Databases

XML has hit the big time as a method of data exchange in Web-based technology. And when a technology becomes as ubiquitous as XML is today, the number of possible variations, combinations, and permutations increases exponentially.

XML databases are one of the products that have sprung up in XML's wake. Sometimes called "native" XML databases, they store information in an XML format. Sound useful? They are, under the right circumstances.

But what about your company's RDBMS? Most likely (and certainly in the case of DB2), your RDBMS can store XML data in some form. So, do you need both?

Let's discuss some of the factors you might want to consider before deciding one way or another. (This column assumes a basic understanding of XML concepts. If you need more information about XML itself, try some of the resources listed above.)

XML DATABASES

Native XML databases store XML objects and classes and have flourished in situations where managing message flow is key (such as in Web services and document and content management). Some of the most widely used XML databases are Apache Xindice, Software AG's Tamino XML Server, X-Hive Corp.'s X-Hive/DB, Excelon Corp.'s XIS, and the open source eXist.

The characteristics of native XML databases expose how fundamentally they differ from a standard RDBMS that has no additional XML extensions. XML databases:

  • Define a logical model for an XML document (vs. the data in the document) and store or retrieve documents according to that model. At minimum, the model will contain XML elements, attributes, PCDATA, and document order. These models are implemented via the Document Object Model (DOM) and events in the Simple API for XML (SAX).
  • Have an XML document as the fundamental unit of logical storage. (In a relational database, the fundamental unit of logical storage is a row in a table.)
  • Need no particular underlying physical storage model (such as relational, hierarchical, or object oriented). XML databases can even use a proprietary storage format (such as indexed or compressed files). RDBMSs follow a clear relational model between data objects.
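
To make the logical model in the list above more concrete, here is a minimal Python sketch that walks a small document with SAX and records elements, attributes, and PCDATA in document order. The sample document and the `ModelRecorder` class are assumptions made for illustration; they don't come from any particular XML database product.

```python
import xml.sax


class ModelRecorder(xml.sax.ContentHandler):
    """Records the parts of the logical model in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Elements and their attributes are part of the model.
        self.events.append(("element", name, dict(attrs.items())))

    def characters(self, content):
        # PCDATA arrives as character events; skip whitespace-only runs.
        if content.strip():
            self.events.append(("pcdata", content.strip()))


# Hypothetical document, used only for illustration.
SAMPLE = (
    '<Customer CustNumber="2112">'
    "<CustName>ACME Corporation</CustName>"
    "<City>New York</City>"
    "</Customer>"
)

recorder = ModelRecorder()
xml.sax.parseString(SAMPLE.encode("utf-8"), recorder)
for event in recorder.events:
    print(event)
```

The order of the recorded events is the document order that a native XML database preserves and that a set of relational rows, by itself, does not.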

RDBMSs define direct relationships between data elements, rows, and columns; therefore, they don't easily map to varying XML structures, because each new permutation or combination of XML adds more relationships to the schema. Supporting heterogeneity (that is, working with diverse data) in a relational schema can make it difficult to maintain and to expand for new XML structures. And those limitations can defeat the purpose of XML's flexibility regarding design.

But XML databases don't replace current database models. Those models are the most efficient when it comes to static or preset data structures. Instead, XML databases are simply tools that provide robust storage and manipulation of free-form XML documents. They're ideally suited for use in corporate information portals, membership databases, product catalogs, publishing management systems, patient information tracking, and B2B document exchange — all circumstances in which the data structure will likely change and expand as the system grows.

DATA AND DOCUMENTS

Native XML databases have two main architectures: text-based and model-based. Text-based architectures focus on a model that is data-centric; model-based XML databases use a model that's document-centric. Purchase entries, phone book information, and data that's used more with applications (not directly by end users) are examples of the data-centric approach. A model based on documents might be more interesting to users (for example, patient information records at a hospital). Although these examples are simplified, the general rule holds true.

With a text-based XML database, the content is stored as text, either in a file system, as a binary large object (BLOB) in a relational database, or in a proprietary text format. Many XML-enabled databases also provide character large objects (CLOBs) for storing character data. Text-based architectures perform best when content is retrieved in a predefined hierarchy; varied hierarchies will result in slower response times. IBM's DB2 XML Extender fits the data-centric model as a text-based XML database.

A model-based XML database builds an internal object model from the document itself and stores this model. The model can vary based on the type of database. Again, the best performance occurs when the data is accessed in the way the model was designed for. A primarily document-based system (such as one that stores medical publications) would be a typical use for model-based XML databases. Often, the distinguishing characteristic of a document-driven model is that each object (a medical publication in our example) may have a different set of attributes specific to it, with a unique structure that may include multivalued attributes (more than one entry) for a single property.
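
As a rough illustration of objects with differing attribute sets and multivalued properties, the following Python sketch builds a simple in-memory model from two publication records. The element names and the `to_model` helper are hypothetical; a real model-based database would use its own internal representation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical publication records; element names are illustrative only.
DOCS = """
<Publications>
  <Publication>
    <Title>Cardiology Review</Title>
    <Author>A. Smith</Author>
    <Author>B. Jones</Author>
  </Publication>
  <Publication>
    <Title>Case Notes</Title>
    <Keyword>triage</Keyword>
  </Publication>
</Publications>
"""


def to_model(publication):
    """Build a property -> list-of-values model, so any property may repeat."""
    model = defaultdict(list)
    for child in publication:
        model[child.tag].append(child.text)
    return dict(model)


models = [to_model(p) for p in ET.fromstring(DOCS)]
for m in models:
    print(m)
```

The first record carries two `Author` values and no `Keyword`; the second has a `Keyword` and no `Author`. A fixed relational table would need nullable columns or extra join tables to hold both.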

XML IN RDBMSs

XML databases lag behind RDBMSs in query functionality and in content updates, which currently depend more on manual, user-driven document management than on automatic management. Another major limitation is that native XML databases, with some recent exceptions, don't allow the data to be returned in a non-XML format.

Today, nearly all major RDBMSs support XML in one or both of these possible forms:

  • XML columns, which store the entire XML file in one referenced field.
  • XML collections, which decompose the XML data into database rows and columns.
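
The difference between the two forms can be sketched in a few lines of Python, using SQLite from the standard library as a stand-in for an RDBMS. This is not the DB2 XML Extender API; the table names, column names, and the order document are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document and table names, for illustration only.
ORDER_XML = (
    '<SalesOrder SONumber="12345">'
    '<Item ItemNumber="1"><Price>19.95</Price><Quantity>1000</Quantity></Item>'
    '<Item ItemNumber="2"><Price>15.99</Price><Quantity>5</Quantity></Item>'
    "</SalesOrder>"
)

conn = sqlite3.connect(":memory:")

# "XML column" style: the whole document sits in one text field.
conn.execute("CREATE TABLE order_docs (so_number TEXT PRIMARY KEY, doc TEXT)")
conn.execute("INSERT INTO order_docs VALUES (?, ?)", ("12345", ORDER_XML))

# "XML collection" style: the document is decomposed into rows and columns.
conn.execute(
    "CREATE TABLE order_items (so_number TEXT, item_number INTEGER, "
    "price REAL, quantity INTEGER)"
)
root = ET.fromstring(ORDER_XML)
for item in root.findall("Item"):
    conn.execute(
        "INSERT INTO order_items VALUES (?, ?, ?, ?)",
        (
            root.get("SONumber"),
            int(item.get("ItemNumber")),
            float(item.findtext("Price")),
            int(item.findtext("Quantity")),
        ),
    )

# The column form returns the document unchanged ...
stored_doc = conn.execute(
    "SELECT doc FROM order_docs WHERE so_number = ?", ("12345",)
).fetchone()[0]

# ... while the collection form can be queried with ordinary SQL.
total_quantity = conn.execute(
    "SELECT SUM(quantity) FROM order_items WHERE so_number = ?", ("12345",)
).fetchone()[0]

print(stored_doc == ORDER_XML, total_quantity)
```

The trade-off shows up in the last two queries: the column form keeps the document intact but opaque to SQL, while the collection form is fully queryable but no longer holds the original document.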

For DB2 Universal Database (UDB) v.7.2 and the new v.8.1, IBM offers the DB2 XML Extender, which provides both types of storage and makes stored XML available as new data types or as character data and external files. This approach makes DB2 an XML-enabled database.

The disadvantage of the XML database is the difficulty of holding "documents" with mixed data, recursive content models, and a complex mix of elements and attributes. This difficulty, which colors most modern-day data integration efforts, is something IBM is tackling head-on with initiatives such as the Xperanto project.

Xperanto is an IBM effort to combine mixed structure, content, and information exchange standards. It's also part of IBM's work to make DB2 a "bilingual" database that provides native support for XML and enables queries in both SQL and the new XQuery language for XML. To learn more about Xperanto, please see the project demo on the DB2 Developer Domain (see Resources). Related Xperanto components and functions are planned for early 2003.

WHO NEEDS THEM?

If both kinds of databases are options for XML data, how do you decide? First, take a look at the kind of XML document that's most typical in your environment.

XML documents can be data- or document-centric. Data-centric XML documents are used for data transport. Because structure isn't as important for these documents, a standard RDBMS will work perfectly fine. Document-centric XML documents are used for document and content management on Internet sites. Because of the focus on structure and variance of content, an XML database is usually required.

Document-centric XML documents require XML databases because of the high demand for XML fragments such as abstracts, procedures, chapters, glossary data, and so on. Document-centric XML may also include document metadata such as authors, revision dates, and document numbers.
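
Fragment retrieval of this kind can be sketched with Python's standard ElementTree module, which supports a limited XPath subset. The manual document, its element names, and the `get_fragment` helper are all hypothetical; a native XML database would expose its own query interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical manual; element names are assumptions for illustration.
MANUAL = """
<Manual>
  <Meta><Author>J. Doe</Author><Revised>2003-01-15</Revised></Meta>
  <Abstract>How to maintain the widget.</Abstract>
  <Chapter id="1"><Title>Setup</Title><Para>Unpack the widget.</Para></Chapter>
  <Chapter id="2"><Title>Care</Title><Para>Oil it monthly.</Para></Chapter>
</Manual>
"""


def get_fragment(doc_text, path):
    """Return the first element matching `path`, serialized as XML text."""
    element = ET.fromstring(doc_text).find(path)
    if element is None:
        return None
    element.tail = None  # drop the whitespace that followed the element
    return ET.tostring(element, encoding="unicode")


chapter_two = get_fragment(MANUAL, "Chapter[@id='2']")
print(chapter_two)
```

The request is phrased in terms of the document's structure (a chapter, an abstract), not in terms of rows and columns, which is why fragment-heavy workloads tend to favor a store that understands that structure.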

When the focus is on the data itself, an RDBMS is the most efficient way to store and access XML. When the structure of the data is also important in the retrieval process, using a native XML database becomes more direct and maintainable.

Listing 1 shows a sales order document in a data-centric XML format.

Listing 1: A data-centric XML sales order.

<SalesOrder SONumber="12345">
  <Customer CustNumber="2112">
    <CustName>ACME Corporation</CustName>
    <Street>123 42nd St.</Street>
    <City>New York</City>
    <State>NY</State>
    <PostCode>10017</PostCode>
  </Customer>
  <OrderDate>10152002</OrderDate>
  <Item ItemNumber="1">
    <Part PartNumber="123">
      <Description>
        <p><b>Widget</b><br />
        Stainless steel, one-piece construction,
        five-year warranty.</p>
      </Description>
      <Price>19.95</Price>
    </Part>
    <Quantity>1000</Quantity>
  </Item>
  <Item ItemNumber="2">
    <Part PartNumber="456">
      <Description>
        <p><b>Can Opener</b><br />
        Aluminum, one-year guarantee.</p>
      </Description>
      <Price>15.99</Price>
    </Part>
    <Quantity>5</Quantity>
  </Item>
</SalesOrder>


An XML-enabled RDBMS would be the most appropriate storage method for an application built using this type of XML information because it focuses on data structure and not a user's request for the meaning of heterogeneous data.
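
Because every item in the sales order has the same shape, it flattens naturally into rows. The Python sketch below works on an abbreviated copy of the Listing 1 order (descriptions and address omitted); the `item_rows` helper and the row layout are assumptions for illustration, not a prescribed mapping.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Abbreviated copy of the sales order in Listing 1 (descriptions omitted).
SALES_ORDER = """
<SalesOrder SONumber="12345">
  <Customer CustNumber="2112"><CustName>ACME Corporation</CustName></Customer>
  <Item ItemNumber="1">
    <Part PartNumber="123"><Price>19.95</Price></Part>
    <Quantity>1000</Quantity>
  </Item>
  <Item ItemNumber="2">
    <Part PartNumber="456"><Price>15.99</Price></Part>
    <Quantity>5</Quantity>
  </Item>
</SalesOrder>
"""


def item_rows(order_text):
    """Flatten each Item into (order, item, part, price, quantity)."""
    root = ET.fromstring(order_text)
    rows = []
    for item in root.findall("Item"):
        part = item.find("Part")
        rows.append((
            root.get("SONumber"),
            int(item.get("ItemNumber")),
            part.get("PartNumber"),
            Decimal(part.findtext("Price")),
            int(item.findtext("Quantity")),
        ))
    return rows


rows = item_rows(SALES_ORDER)
# Decimal avoids binary floating-point rounding in the currency arithmetic.
order_total = sum(price * qty for _, _, _, price, qty in rows)
print(rows, order_total)
```

Each tuple corresponds to one row of a hypothetical order-items table, and the total is the kind of aggregate an RDBMS computes efficiently once the data is stored that way.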

Many B2C and B2B Web sites use XML databases, including eBay and Amazon.com. However, Amazon.com has documents that aren't just data-centric, but also irregular and format-rich. For example, consider a page on Amazon.com that displays information about a book. Although it's mainly text, the structure is highly varied, with a distinct preface, introduction, table of contents, index, glossary, and so on that are often specific to the featured book. This information could be stored in the format shown in Listing 2.

Listing 2: An example of XML with a varied structure.

<Product>
  <Book>
    <Intro>
      The <Title>DB2 Developer's Guide</Title> from
      <Publisher>SAMS</Publisher> by <Author>Craig Mullins</Author> is
      <Summary>a bundled purchase with
      <Title>DB2 High Performance and Tuning</Title>,
      for a discounted price</Summary>
    </Intro>
    <Description>
      <Para>Buy this book with DB2 High Performance and Tuning today!
      <b>Buy Together Today!</b></Para>
      <Para>You can:</Para>
      <List>
        <Item><Link URL="Order.html">Order both books now!</Link></Item>
        <Item><Link URL="DB2HighPerf.htm">Read more about DB2
        High Performance and Tuning</Link></Item>
        <Item><Link URL="DB2search.zip">Review other related DB2
        products</Link></Item>
      </List>
      <Para>The DB2 Developer's Guide costs <b>just $49.99</b> and,
      if you order now, comes with <b>a CD-ROM of the book with
      samples</b> as a bonus gift.</Para>
    </Description>
  </Book>
</Product>


Sometimes it's difficult to choose between using a data- or document-centric system. The general rule is that if the structure is consistent (or homogeneous), use a standard RDBMS. If the structure is irregular (or heterogeneous), use a native XML database or a document or content management system. If both regular and irregular data are present, choose the system that stands to benefit the most. If the size and extent of the system warrant it, an RDBMS and an XML database can be used together. In fact, the combination of RDBMS and XML databases can have a positive impact on convenience, ease of development, and performance.
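
One rough way to apply this rule of thumb is to check a sample document for mixed content, where text and child elements are interleaved inside the same element. The Python sketch below treats mixed content as a proxy for irregular, document-centric structure. That proxy is an assumption of this example, not a definitive test, and the `has_mixed_content` and `suggest_store` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(root):
    """True if any element holds both child elements and non-whitespace text."""
    for element in root.iter():
        children = list(element)
        if not children:
            continue
        # Text may sit before the first child or after any child (its tail).
        texts = [element.text] + [child.tail for child in children]
        if any(t and t.strip() for t in texts):
            return True
    return False


def suggest_store(xml_text):
    """Apply the rule of thumb: regular -> RDBMS, irregular -> native XML."""
    if has_mixed_content(ET.fromstring(xml_text)):
        return "native XML database"
    return "RDBMS"


# Hypothetical fragments modeled on the two listings.
DATA_CENTRIC = "<Item><Price>19.95</Price><Quantity>1000</Quantity></Item>"
DOC_CENTRIC = (
    "<Intro>The <Title>DB2 Developer's Guide</Title> from "
    "<Publisher>SAMS</Publisher></Intro>"
)

print(suggest_store(DATA_CENTRIC))
print(suggest_store(DOC_CENTRIC))
```

A real decision would also weigh how often the structure changes, how the data is queried, and what the existing infrastructure already supports.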

However, that brings up another point in favor of RDBMSs: budget constraints. We're in a climate of "Use what you have, buy only what you need," and most companies already have RDBMSs in place. An RDBMS with XML capability capitalizes on that existing investment stream by extending existing application environments.

FINDING THEIR PLACE

Many people don't realize how common XML databases are today, because they're integrated as an engine on top of existing databases with popular content management and document management systems such as Documentum, Vignette, and Stellent. Most document management systems have both fat-client (local executable) and thin-client (browser-based) interfaces. Nevertheless, XML databases are just emerging as a second-generation product, and plenty of improvements are needed in the third and future generations.

Native XML databases aren't a panacea, and they're definitely not going to replace existing database systems. They're simply tools for XML developers that can yield significant benefits in the right circumstances. If you need document-centric storage, then an XML database is certainly worth a look. It might just prove to be the right tool for the job.


Michael S. Dougherty has been consulting for 10 years in systems-level programming, application development, and project management. He recently partnered with D&L Consulting. You can reach him at mikjza@kingwoodcable.com.

