XML Best Practices

Article
12/03/2008

SQL Server 2005 provides extensive support for XML data processing. XML values can be stored natively in an xml data type column that can be typed according to a collection of XML schemas, or left untyped. You can index the XML column. Additionally, fine-grained data manipulation is supported by using XQuery and XML DML, an extension of XQuery for data modification.

SQL Server 2000 and SQLXML Web releases provide powerful XML data management capabilities. These features focus on mapping between relational and XML data. XML views of relational data can be defined by using annotated XSD (AXSD) to provide an XML-centric approach that supports bulk load of data, query, and update capabilities on XML data. Transact-SQL extensions provide a SQL-centric approach for mapping relational query results to XML by using FOR XML, and for generating relational views from XML by using OPENXML. This support has been extended in SQL Server 2005. Together with the newly added native XML support, SQL Server 2005 provides a powerful platform for developing rich applications for semi-structured and unstructured data management.

This topic provides guidelines for XML data modeling and use in SQL Server 2005. It is divided into the following sections:

Data modeling
XML data can be stored in multiple ways in SQL Server 2005: natively, by using the xml data type, or shredded into tables. This section provides guidelines for making the appropriate choices for modeling your XML data. It also covers indexing XML data, property promotion, and typing of XML instances.
Use
This section discusses use-related topics, such as loading XML data into the server and type inference in query compilation. This section also explains and differentiates closely related features and suggests appropriate use of these features. These are illustrated with examples.

Data Modeling

This section outlines the reasons why you should use XML in SQL Server 2005. The section also provides guidelines for choosing between native XML storage and XML view technology, and gives data modeling suggestions.

Relational or XML Data Model

If your data is highly structured with a known schema, the relational model is likely to work best for data storage. SQL Server provides the functionality and tools you may need. On the other hand, if the data is semi-structured or unstructured, or its structure is unknown, you have to give consideration to modeling such data.

XML is a good choice if you want a platform-independent model in order to ensure portability of the data by using structural and semantic markup. Additionally, it is an appropriate option if some of the following properties are satisfied:

Your data is sparse or you do not know the structure of the data, or the structure of your data may change significantly in the future.
Your data represents containment hierarchy, instead of references among entities, and may be recursive.
Order is inherent in your data.
You want to query into the data or update parts of it, based on its structure.

If none of these conditions is met, you should use the relational data model. For example, if your data is in XML format but your application just uses the database to store and retrieve the data, an [n]varchar(max) column is all you require. Storing the data in an XML column has additional benefits. These include having the engine verify that the data is well formed or valid, and support for fine-grained queries and updates on the XML data.

Reasons for Storing XML Data in SQL Server 2005

Following are some of the reasons to use native XML features in SQL Server 2005 instead of managing your XML data in the file system:

You want to share, query, and modify your XML data in an efficient and transacted way. Fine-grained data access is important to your application. For example, you may want to extract some of the sections within an XML document, or you may want to insert a new section without replacing your whole document.
You have relational data and XML data and you want interoperability between both relational and XML data within your application.
You need language support for query and data modification for cross-domain applications.
You want the server to guarantee that the data is well formed and also optionally validate your data according to XML schemas.
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer.
You want SOAP, ADO.NET, and OLE DB access to XML data.
You want to use administrative functionality of the database server for managing your XML data. For example, this would be backup, recovery, and replication.

If none of these conditions is satisfied, it may be better to store your data as a non-XML, large object type, such as [n]varchar(max) or varbinary(max).

XML Storage Options

The storage options for XML in SQL Server 2005 include the following:

Native storage as xml data type
The data is stored in an internal representation that preserves the XML content of the data. This includes containment hierarchy, document order, and element and attribute values. Specifically, the InfoSet content of the XML data is preserved. For more information about InfoSet, visit http://www.w3.org/TR/xml-infoset. The InfoSet content may not be an identical copy of the text XML, because the following information is not retained: insignificant white spaces, order of attributes, namespace prefixes, and XML declaration.
For typed xml data type, an xml data type bound to XML schemas, the post-schema validation InfoSet (PSVI) adds type information to the InfoSet and is encoded in the internal representation. This improves parsing speed significantly. For more information, see the W3C XML Schema specifications at http://www.w3.org/TR/xmlschema-1 and http://www.w3.org/TR/xmlschema-2.
Mapping between XML and relational storage
By using an annotated schema (AXSD), the XML is decomposed into columns in one or more tables. This preserves fidelity of the data at the relational level. As a result, the hierarchical structure is preserved although order among elements is ignored. The schema cannot be recursive.
Large object storage, [n]varchar(max) and varbinary(max)
An identical copy of the data is stored. This is useful for special-purpose applications such as legal documents. Most applications do not require an exact copy and are satisfied with the XML content (InfoSet fidelity).

Generally, you may have to use a combination of these approaches. For example, you may want to store your XML data in an xml data type column and promote properties from it into relational columns. Or, you may want to use mapping technology to store nonrecursive parts in non-XML columns and only the recursive parts in xml data type columns.

Choice of XML Technology

The choice of XML technology, native XML versus XML view, generally depends upon the following factors:

Storage options
Your XML data may be more appropriate for large object storage (for example, a product manual), or more amenable to storage in relational columns (for example, a line item converted to XML). Each storage option preserves document fidelity to a different extent.
Query capabilities
You may find one storage option more appropriate than another, based on the nature of your queries and on the extent to which you query your XML data. Fine-grained query of your XML data, for example, predicate evaluation on XML nodes, is supported to varying degrees in the two storage options.
Indexing XML data
You may want to index the XML data to speed up XML query performance. Indexing options vary with the storage options; you have to make the appropriate choice to optimize your workload.
Data modification capabilities
Some workloads involve fine-grained modification of XML data, for example, adding a new section within a document. Other workloads, such as Web content, do not. Data modification language support may be important for your application.
Schema support
Your XML data may be described by a schema that may or may not be an XML schema document. The support for schema-bound XML depends upon the XML technology.

Different choices also have different performance characteristics.

Native XML Storage

You can store your XML data in an xml data type column at the server. This is an appropriate choice if the following applies:

You want a straightforward way to store your XML data at the server and, at the same time, preserve document order and document structure.
You may or may not have a schema for your XML data.
You want to query and modify your XML data.
You want to index the XML data for faster query processing.
Your application needs system catalog views to administer your XML data and XML schemas.

Native XML storage is useful when you have XML documents that have a range of structures, or you have XML documents that conform to different or complex schemas that are too hard to map to relational structures.

Example: Modeling XML Data Using the xml Data Type

Consider a product manual in XML format that is made up of a separate chapter for each topic and that has multiple sections within each chapter. A section can contain subsections. As a result, <section> is a recursive element. Product manuals contain a large amount of mixed content, diagrams, and technical material; the data is semi-structured. Users may want to perform a contextual search for topics of interest such as searching for the section on "clustered index" within the chapter on "indexing", and query technical quantities.

An appropriate storage model for your XML documents is an xml data type column. This preserves the InfoSet content of your XML data. Indexing the XML column benefits query performance.

Example: Retaining Exact Copies of XML Data

For illustration, assume that government regulations require you to retain exact textual copies of your XML documents. For example, these could include signed documents, legal documents, or stock transaction orders. You may want to store your documents in an [n]varchar(max) column.

For querying, convert the data to xml data type at run time and execute XQuery on it. The run-time conversion may be costly, especially when the document is large. If you query frequently, you can redundantly store the documents in an xml data type column and index it while you return exact document copies from the [n]varchar(max) column.

The XML column may be a computed column, based on the [n]varchar(max) column. However, you cannot create an XML index on a computed XML column, nor can an XML index be built on [n]varchar(max) or varbinary(max) columns.
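
The two query styles described above could be sketched as follows. The table LegalDocs, its columns, and the <order> element with a status attribute are hypothetical names chosen for illustration; the sketch assumes that the stored text is well-formed XML that converts cleanly to the xml data type.

-- Hypothetical table: exact text in docText, redundant queryable copy in docXml
CREATE TABLE LegalDocs (
   docId   INT PRIMARY KEY,
   docText NVARCHAR(MAX) NOT NULL,   -- exact textual copy, returned as is
   docXml  XML NOT NULL              -- regular (not computed) column, so it can be XML-indexed
)

-- Occasional query: convert at run time (may be costly for large documents)
SELECT docId
FROM   LegalDocs
WHERE  CAST(docText AS XML).exist('/order[@status="signed"]') = 1

-- Frequent query: index the redundant xml column and query it instead,
-- while still returning the exact copy from docText
CREATE PRIMARY XML INDEX idx_LegalDocs_docXml ON LegalDocs (docXml)

SELECT docId, docText
FROM   LegalDocs
WHERE  docXml.exist('/order[@status="signed"]') = 1

In this sketch the application is responsible for keeping docText and docXml in step when a document is inserted or replaced.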

XML View Technology

By defining a mapping between your XML schemas and the tables in a database, you create an "XML view" of your persistent data. XML bulk load can be used to populate the underlying tables by using the XML view. You can query the XML view by using XPath version 1.0; the query is translated to SQL queries on the tables. Similarly, updates are also propagated to those tables.

This technology is useful in the following situations:

You want to have an XML-centric programming model using XML views over your existing relational data.
You have a schema (XSD, XDR) for your XML data that an external partner may have provided.
Order is not important in your data, or your query table data is not recursive, or the maximal recursion depth is known in advance.
You want to query and modify the data through the XML view by using XPath version 1.0.
You want to bulk load XML data and decompose them into the underlying tables by using the XML view.

Examples include relational data exposed as XML for data exchange and Web services, and XML data with fixed schema. For more information, see the MSDN Online Library.

Example: Modeling Data Using an Annotated XML Schema (AXSD)

For illustration, assume that you have existing relational data, such as customers, orders, and line items, that you want to handle as XML. Define an XML view by using AXSD over the relational data. The XML view allows you to bulk load XML data into your tables and query and update the relational data by using the XML view. This model is useful if you have to exchange data that contains XML markup with other applications while your SQL applications work uninterrupted.

Hybrid Model

Frequently, a combination of relational and xml data type columns is appropriate for data modeling. Some of the values from your XML data can be stored in relational columns, and the rest, or the whole XML value, can be stored in an XML column. This may yield better performance because you have more control over the indexes created on the relational columns and over locking characteristics.

The values to store in relational columns depend on your workload. For example, if you retrieve all the XML values based on the path expression, /Customer/@CustId, promoting the value of the CustId attribute into a relational column and indexing it may yield faster query performance. On the other hand, if your XML data is extensively and nonredundantly decomposed into relational columns, the re-assembly cost may be significant.

For highly structured XML data, for example, when the content of a table has been converted into XML, you can map all values to relational columns, and possibly use XML view technology.

Data Modeling Using xml Data Type

This section discusses data modeling topics for native XML storage. These include indexing XML data, property promotion, and typed xml data type.

Same or Different Table

An xml data type column can be created in a table that contains other relational columns, or in a separate table with a foreign key relationship to a main table.

Create an xml data type column in the same table when one of the following conditions is true:

Your application performs data retrieval on the XML column and does not require an XML index on the XML column.
You want to build an XML index on the xml data type column and the primary key of the main table is the same as its clustering key. For more information, see Indexing an xml Data Type Column.

Create the xml data type column in a separate table if the following conditions are true:

You want to build an XML index on the xml data type column, but the primary key of the main table is different from its clustering key, or the main table does not have a primary key, or the main table is a heap (no clustering key). This may be true if the main table already exists.
You do not want table scans to slow down because of the presence of the XML column in the table. The XML column uses space whether it is stored in-row or out-of-row.
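
As a rough sketch of the separate-table layout, assume a main table whose primary key is not its clustering key. The names Orders, OrderDocs, and xDoc are hypothetical and do not come from this topic.

-- Hypothetical main table: the primary key differs from the clustering key
CREATE TABLE Orders (
   OrderID   INT NOT NULL PRIMARY KEY NONCLUSTERED,
   OrderDate DATETIME NOT NULL
)
CREATE CLUSTERED INDEX idx_Orders_OrderDate ON Orders (OrderDate)

-- Separate table for the XML, clustered on its own primary key
-- and tied back to the main table by a foreign key
CREATE TABLE OrderDocs (
   OrderID INT NOT NULL PRIMARY KEY CLUSTERED
           REFERENCES Orders (OrderID),
   xDoc    XML NOT NULL
)

-- The XML index can now be built because OrderDocs is clustered on its primary key
CREATE PRIMARY XML INDEX idx_OrderDocs_xDoc ON OrderDocs (xDoc)

Scans of Orders do not touch the XML values in this layout; the cost is a join on OrderID whenever both the relational columns and the XML are required.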

Granularity of XML Data

The granularity of the XML data stored in an XML column is very important for locking and, to a lesser degree, it is also important for updates. SQL Server uses the same locking mechanism for both XML and non-XML data. Therefore, row-level locking causes all XML instances in the row to be locked. When the granularity is large, locking large XML instances for updates causes throughput to decline in a multiuser scenario. On the other hand, severe decomposition loses object encapsulation and increases reassembly cost.

A balance between data modeling requirements and locking and update characteristics is important for good design. However, in SQL Server 2005, the size of actual stored XML instances is not as critical.

For example, updates to an XML instance are performed by using new support for partial binary large object (BLOB) and partial index updates in which the existing stored XML instance is compared to its updated version. Partial binary large object (BLOB) update performs a differential comparison between the two XML instances and updates only the differences. Partial index updates modify only those rows that must be changed in the XML index.

Untyped, Typed, and Constrained xml Data Type

The SQL Server 2005 xml data type implements the ISO SQL-2003 standard xml data type. Therefore, it can store well-formed XML version 1.0 documents and also so-called XML content fragments with text nodes and an arbitrary number of top-level elements in an untyped XML column. The system checks that the data is well formed, does not require the column to be bound to XML schemas, and rejects data that is not well formed in the extended sense. This is true also of untyped XML variables and parameters.

If you have XML schemas describing your XML data, you can associate the schemas with the XML column to yield typed XML. The XML schemas are used to validate the data, perform more precise type checks during compilation of query and data modification statements than untyped XML, and optimize storage and query processing.

Use untyped xml data type in the following situations:

You do not have a schema for your XML data.
You have schemas, but you do not want the server to validate the data. This is sometimes the case when an application performs client-side validation before storing the data at the server, or temporarily stores XML data that is invalid according to the schema, or uses schema components unsupported at the server (for example, key/keyref).

Use typed xml data type in the following situations:

You have schemas for your XML data and you want the server to validate your XML data according to the XML schemas.
You want to take advantage of storage and query optimizations based on type information.
You want to take better advantage of type information during compilation of your queries.

Typed XML columns, parameters, and variables can store XML documents or content. However, you have to specify with a flag whether you are storing a document or content at the time of declaration. Additionally, you have to provide the collection of XML schemas. Specify DOCUMENT if each XML instance has exactly one top-level element. Otherwise, use CONTENT. The query compiler uses the DOCUMENT flag in type checks during query compilation to infer singleton top-level elements.
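
The declaration options just described could look like the following sketch. The collection name BookSchemas, the table TypedBooks, and the deliberately small schema are assumptions made for illustration; the schema covers only a title, a price, and an ISBN attribute, not the full sample document used elsewhere in this topic.

-- Hypothetical schema collection with a single global element, <book>
CREATE XML SCHEMA COLLECTION BookSchemas AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="book">
      <xsd:complexType>
         <xsd:sequence>
            <xsd:element name="title" type="xsd:string"/>
            <xsd:element name="price" type="xsd:decimal"/>
         </xsd:sequence>
         <xsd:attribute name="ISBN" type="xsd:string"/>
      </xsd:complexType>
   </xsd:element>
</xsd:schema>'
GO

-- DOCUMENT: each instance must have exactly one top-level element
CREATE TABLE TypedBooks (
   pk   INT PRIMARY KEY,
   xCol XML (DOCUMENT BookSchemas)
)

INSERT INTO TypedBooks (pk, xCol)
VALUES (1, N'<book ISBN="0-7356-1588-2"><title>Writing Secure Code</title><price>39.99</price></book>')

-- CONTENT: a variable that may hold several top-level <book> elements
DECLARE @books XML (CONTENT BookSchemas)
SET @books = N'<book ISBN="1"><title>A</title><price>1.00</price></book>
               <book ISBN="2"><title>B</title><price>2.00</price></book>'

Both declarations validate against the same collection; only the DOCUMENT column additionally restricts each instance to a single top-level element.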

Besides typing an XML column, you can use relational (column or row) constraints on typed or untyped xml data type columns. Use constraints in the following situations:

Your business rules cannot be expressed in XML schemas. For example, the delivery address of a flower shop must be within 50 miles of its business location. This can be written as a constraint on the XML column. The constraint may involve xml data type methods.
Your constraint involves other XML or non-XML columns in the table. An example is the enforcement of the ID of a Customer (/Customer/@CustId) found in an XML instance to match the value in a relational CustomerID column.
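
The second situation could be sketched as follows. This version wraps the xml data type method in a user-defined function and calls the function from the CHECK constraint, which is one way to write it; the names udf_get_CustId, Customers, and CK_Customers_CustId are hypothetical.

-- Wrapper function: returns the CustId attribute of the top-level <Customer> element
CREATE FUNCTION udf_get_CustId (@xData XML)
RETURNS INT
AS
BEGIN
   DECLARE @CustId INT
   SELECT @CustId = @xData.value('(/Customer/@CustId)[1]', 'INT')
   RETURN @CustId
END
GO

-- The constraint ties the value inside the XML to the relational column.
-- A missing CustId attribute yields NULL, which a CHECK constraint does not reject.
CREATE TABLE Customers (
   CustomerID INT PRIMARY KEY,
   xCol       XML NOT NULL,
   CONSTRAINT CK_Customers_CustId
      CHECK (dbo.udf_get_CustId(xCol) = CustomerID)
)

If a missing attribute should also be rejected, the function result can be compared by using an expression that treats NULL as a failure.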

Document Type Definition (DTD)

The xml data type columns, variables, and parameters can be typed by using XML schema, but not by using DTD. However, inline DTD can be used for both untyped and typed XML to supply default values and to replace entity references with their expanded form.

You can convert DTDs to XML schema documents by using third-party tools, and load the XML schemas into the database.
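
As a small illustration of inline DTD use, the following sketch converts a string that carries an internal DTD subset. It relies on style 2 of CONVERT, which enables limited internal DTD subset processing; the document content and the entity and attribute names are invented for the example.

DECLARE @x XML

-- Style 2 asks CONVERT to process a limited internal DTD subset:
-- attribute defaults are applied and internal entity references are expanded
SET @x = CONVERT(XML,
N'<!DOCTYPE doc [
   <!ENTITY vendor "Contoso">
   <!ATTLIST doc status CDATA "draft">
]>
<doc>&vendor;</doc>', 2)

SELECT @x

The DTD itself is not stored with the value and is not used for validation; only its effect on default values and entity references is carried into the stored XML.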

Indexing an xml Data Type Column

XML indexes can be created on xml data type columns. An XML index indexes all tags, values, and paths over the XML instances in the column and benefits query performance. Your application may benefit from an XML index in the following situations:

Queries on XML columns are common in your workload. XML index maintenance cost during data modification must be considered.
Your XML values are relatively large and the retrieved parts are relatively small. Building the index avoids parsing the whole data at run time and benefits index lookups for efficient query processing.

The first index on an XML column is the primary XML index. Based on it, three types of secondary XML indexes can be created on the XML column to speed up common classes of queries, as described in the following sections.

Primary XML Index

This indexes all tags, values, and paths within the XML instances in an XML column. The base table, the table in which the XML column occurs, must have a clustered index on the primary key of the table. The primary key is used to correlate index rows with the rows in the base table. Full XML instances are retrieved from the XML columns, for example, SELECT *. Queries use the primary XML index, and return scalar values or XML subtrees by using the index itself.

Example: Creating a Primary XML Index

Table T (pk INT PRIMARY KEY, xCol XML) with an untyped XML column is used in most of the examples. These can be extended to typed XML in a straightforward way. (For more information about how to use typed XML, see xml Data Type.) For simplicity, queries are described for XML data instances as shown in the following:

<book genre="security" publicationdate="2002" ISBN="0-7356-1588-2">
   <title>Writing Secure Code</title>
   <author>
      <first-name>Michael</first-name>
      <last-name>Howard</last-name>
   </author>
   <author>
      <first-name>David</first-name>
      <last-name>LeBlanc</last-name>
   </author>
   <price>39.99</price>
</book>
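
To try the examples, the table and one sample row could be set up as in the following sketch. The table definition is the one given above; the key value 1 is arbitrary.

CREATE TABLE T (pk INT PRIMARY KEY, xCol XML)

-- Load the sample <book> instance into the untyped XML column
INSERT INTO T (pk, xCol)
VALUES (1, N'<book genre="security" publicationdate="2002" ISBN="0-7356-1588-2">
   <title>Writing Secure Code</title>
   <author>
      <first-name>Michael</first-name>
      <last-name>Howard</last-name>
   </author>
   <author>
      <first-name>David</first-name>
      <last-name>LeBlanc</last-name>
   </author>
   <price>39.99</price>
</book>')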

The following statement creates an XML index, called idx_xCol, on the XML column xCol of table T:

CREATE PRIMARY XML INDEX idx_xCol on T (xCol)

Secondary XML Indexes

After the primary XML index has been created, you may want to create secondary XML indexes to speed up different classes of queries within your workload. Three types of secondary XML indexes, PATH, PROPERTY and VALUE, benefit path-based queries, custom property management scenarios, and value-based queries, respectively. The PATH index builds a B+-tree on the (path, value) pair of each XML node in document order over all XML instances in the column. The PROPERTY index creates a B+-tree clustered on the (PK, path, value) pair within each XML instance, where PK is the primary key of the base table. Finally, the VALUE index creates a B+-tree on the (value, path) pair of each node in document order across all XML instances in the XML column.

Following are some guidelines for creating one or more of these indexes:

If your workload uses path expressions significantly on XML columns, the PATH secondary XML index is likely to speed up your workload. The most common case is the use of the exist() method on XML columns in the WHERE clause of Transact-SQL.
If your workload retrieves multiple values from individual XML instances by using path expressions, clustering paths within each XML instance in the PROPERTY index may be helpful. This scenario typically occurs in a property bag scenario when properties of an object are fetched and its primary key value is known.
If your workload involves querying for values within XML instances without knowing the element or attribute names that contain those values, you may want to create the VALUE index. This typically occurs with descendant axes lookups, such as //author[last-name="Howard"], where <author> elements can occur at any level of the hierarchy. It also occurs in wildcard queries, such as /book [@* = "novel"], where the query looks for <book> elements that have some attribute having the value "novel".

Example: Path-based Lookup

For illustration, assume that the following query is common in your workload:

SELECT pk, xCol
FROM   T
WHERE  xCol.exist ('/book/@genre[.="novel"]') = 1

The path expression /book/@genre and the value "novel" correspond to the key fields of the PATH index. As a result, a secondary XML index of type PATH is helpful for this workload:

CREATE XML INDEX idx_xCol_Path on T (xCol)
   USING XML INDEX idx_xCol FOR PATH

Example: Fetching the Properties of an Object

Consider the following query that retrieves the properties genre, title, and ISBN of a book from each row in table T:

SELECT xCol.value ('(/book/@genre)[1]', 'varchar(50)'),
    xCol.value ('(/book/title/text())[1]', 'varchar(50)'),
    xCol.value ('(/book/@ISBN)[1]', 'varchar(50)')
FROM    T

The property index is useful in this case and is created as follows:

CREATE XML INDEX idx_xCol_Property on T (xCol)
   USING XML INDEX idx_xCol FOR PROPERTY

Example: Value-based Query

In the following query, the descendant-or-self axis (//) specifies a partial path so that the lookup based on the value of ISBN benefits from the use of the VALUE index:

SELECT xCol
FROM    T
WHERE    xCol.exist ('//book/@ISBN[. = "0-7356-1588-2"]') = 1

The VALUE index is created as follows:

CREATE XML INDEX idx_xCol_Value on T (xCol)
   USING XML INDEX idx_xCol FOR VALUE

Full-Text Index on an XML Column

You can create a full-text index on XML columns that indexes the content of the XML values, but ignores the XML markup. Attribute values are not full-text indexed, because they are considered part of the markup, and element tags are used as token boundaries. When possible, you can combine full-text search with XML index in the following way:

First, filter the XML values of interest by using SQL full-text search.
Next, query those XML values by using the XML index on the XML column.
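
Creating the full-text index itself could be sketched as follows. A full-text index needs the name of a unique key index, so the sketch assumes that table T was created with a named primary key constraint, PK_T, which the definition used in this topic does not do. The catalog name ftCatalogBooks is also hypothetical.

-- Assumes T was created with a named primary key, for example:
--    CREATE TABLE T (pk INT CONSTRAINT PK_T PRIMARY KEY, xCol XML)
CREATE FULLTEXT CATALOG ftCatalogBooks

-- Full-text index on the XML column, keyed on the primary key index
CREATE FULLTEXT INDEX ON T (xCol)
   KEY INDEX PK_T
   ON ftCatalogBooks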

Example: Combining Full-text Search with XML Querying

After the full-text index has been created on the XML column, the following query checks that an XML value contains the word "custom" in the title of a book:

SELECT * 
FROM   T 
WHERE  CONTAINS(xCol,'custom') 
AND    xCol.exist('/book/title/text()[contains(.,"custom")]') =1

The contains() method uses the full-text index to subset the XML values that contain the word "custom" anywhere in the document. The exist() clause ensures that the word "custom" occurs in the title of a book.

Full-text contains() and XQuery contains() have different semantics. The latter is a substring match and the former is a token match that uses stemming. Therefore, if the search is for the string "run" in the title, the matches will include "run", "runs", and "running", because both the full-text contains() and the XQuery contains() are satisfied. However, the query does not match the word "customizable" in the title, because the full-text contains() fails even though the XQuery contains() is satisfied. Generally, for pure substring match, the full-text contains() clause should be removed.

Additionally, full-text search uses word stemming, but XQuery contains() is a literal match. This difference is illustrated in the next example.

Example: Full-text Search on XML Values Using Stemming

The XQuery contains() check that was performed in the previous example generally cannot be eliminated. Consider this query:

SELECT * 
FROM   T 
WHERE  CONTAINS(xCol,'run')

The word "ran" in the document matches the search condition because of stemming. Additionally, the search context is not checked by using XQuery.

When XML is decomposed into relational columns by using AXSD that are full-text indexed, XPath queries that occur over the XML view do not perform full-text search on the underlying tables.

Property Promotion

If queries are made principally on a small number of element and attribute values, you may want to promote those quantities into relational columns. This is helpful when queries are issued on a small part of the XML data while the whole XML instance is retrieved. Creating an XML index on the XML column is not required. Instead, the promoted column can be indexed. Queries must be written to use the promoted column. That is, the query optimizer does not retarget queries on the XML column to the promoted column.

The promoted column can be a computed column in the same table or it can be a separate, user-maintained column in a table. This is sufficient when singleton values are promoted from each XML instance. However, for multi-valued properties, you have to create a separate table for the property, as described in the following section.

Computed Column Based on the xml Data Type

A computed column can be created by using a user-defined function that invokes xml data type methods. The type of the computed column can be any SQL type, including XML. This is illustrated in the following example.

Example: Computed Column Based on the xml Data Type Method

Create the user-defined function for a book ISBN number:

CREATE FUNCTION udf_get_book_ISBN (@xData xml)
RETURNS varchar(20)
BEGIN
   DECLARE @ISBN   varchar(20)
   SELECT @ISBN = @xData.value('/book[1]/@ISBN', 'varchar(20)')
   RETURN @ISBN 
END

Add a computed column to the table for the ISBN:

ALTER TABLE      T
ADD   ISBN AS dbo.udf_get_book_ISBN(xCol)

The computed column can be indexed in the usual way.

Example: Queries on a Computed Column Based on xml Data Type Methods

To obtain the <book> whose ISBN is 0-7356-1588-2:

SELECT xCol
FROM   T
WHERE  xCol.exist('/book/@ISBN[. = "0-7356-1588-2"]') = 1

The query on the XML column can be rewritten to use the computed column as follows:

SELECT xCol
FROM   T
WHERE  ISBN = '0-7356-1588-2'

You can create a user-defined function to return the xml data type and a computed column by using the user-defined function. However, you cannot create an XML index on the computed, XML column.

Creating Property Tables

You may want to promote some of the multivalued properties from your XML data into one or more tables, create indexes on those tables, and retarget your queries to use them. A typical scenario is one in which a small number of properties covers most of your query workload. You can do the following:

Create one or more tables to hold the multivalued properties. You may find it convenient to store one property per table and duplicate the primary key of the base table in the property tables for back joining with the base table.
If you want to maintain the relative order of the properties, you have to introduce a separate column for the relative order.
Create triggers on the XML column to maintain the property tables. Within the triggers, do one of the following:
- Use xml data type methods, such as nodes() and value(), to insert and delete rows of the property tables.
- Create streaming table-valued functions in the common language runtime (CLR) to insert and delete rows of the property tables.
Write queries for SQL access to the property tables and for XML access to the XML column in the base table, with joins between the tables by using their primary key.

Example: Create a Property Table

For illustration, assume that you want to promote the first name of the authors. Books have one or more authors, so that first name is a multivalued property. Each first name is stored in a separate row of a property table. The primary key of the base table is duplicated in the property table for back join.

create table tblPropAuthor (propPK int, propAuthor varchar(max))

Example: Create a User-defined Function to Generate a Rowset from an XML Instance

The following table-valued function, udf_XML2Table, accepts a primary key value and an XML instance. It retrieves the first name of all authors of the <book> elements and returns a rowset of primary key, first name pairs.

create function udf_XML2Table (@pk int, @xCol xml)
returns @ret_Table table (propPK int, propAuthor varchar(max))
with schemabinding
as
begin
      insert into @ret_Table 
      select @pk, nref.value('.', 'varchar(max)')
      from   @xCol.nodes('/book/author/first-name') R(nref)
      return
end

Example: Create Triggers to Populate a Property Table

The insert trigger inserts rows into the property table:

create trigger trg_docs_INS on T for insert
as
      declare @wantedXML xml
      declare @FK int
      select @wantedXML = xCol from inserted
      select @FK = PK from inserted

   insert into tblPropAuthor
   select * from dbo.udf_XML2Table(@FK, @wantedXML)

The delete trigger deletes the rows from the property table based on the primary key value of the deleted rows:

create trigger trg_docs_DEL on T for delete
as
   declare @FK int
   select @FK = PK from deleted
   delete tblPropAuthor where propPK = @FK

The update trigger deletes the existing rows in the property table corresponding to the updated XML instance and inserts new rows into the property table:

create trigger trg_docs_UPD
on T
for update
as
if update(xCol) or update(pk)
begin
      declare @FK int
      declare @wantedXML xml
      select @FK = PK from deleted
      delete tblPropAuthor where propPK = @FK

   select @wantedXML = xCol from inserted
   select @FK = pk from inserted

   insert into tblPropAuthor 
      select * from dbo.udf_XML2Table(@FK, @wantedXML)
end

Example: Find XML Instances Whose Authors Have the First Name of "David"

The query can be formed on the XML column. Alternatively, it can search the property table for first name "David" and perform a back join with the base table to return the XML instance. For example:

SELECT xCol 
FROM     T JOIN tblPropAuthor ON T.pk = tblPropAuthor.propPK
WHERE    tblPropAuthor.propAuthor = 'David'

Example: Solution Using the CLR Streaming Table-valued Function

This solution is made up of the following steps:

- Define a CLR class, TestSqlReaderBase, that implements ISqlReader and generates a streaming, table-valued output by applying a path expression to an XML instance.
- Create an assembly and a Transact-SQL user-defined function to start the CLR class.
- Define the insert, update, and delete triggers by using the user-defined function to maintain the property table.

To do this, you first create the streaming CLR function. The xml data type is exposed as a managed class SqlXml in ADO.NET and supports the CreateReader() method that returns an XmlReader.

Note

The example code in this section uses XPathDocument and XPathNavigator. These force you to load all the XML documents into memory. If you are using similar code in your application to process several large XML documents, this code is not scalable. Instead, keep memory allocations small and use streaming interfaces whenever possible. For more information about performance, see Architecture of CLR Integration.

public class c_streaming_xml_tvf {
   public static ISqlReader streaming_xml_tvf
         (SqlXml xmlDoc, string pathExpression) {
      return (new TestSqlReaderBase (xmlDoc, pathExpression));
   }
}

// Class that implements ISqlReader
public class TestSqlReaderBase : ISqlReader {
   XPathNodeIterator m_iterator;
   public SqlChars FirstName;
   // Metadata for current resultset
   private SqlMetaData[] m_rgSqlMetaData;

   public TestSqlReaderBase (SqlXml xmlDoc, string pathExpression) {   
      // Variables for XPath navigation
      XPathDocument xDoc;
      XPathNavigator xNav;
      XPathExpression xPath;
   
      // Set sql metadata
      m_rgSqlMetaData = new SqlMetaData[1];
      m_rgSqlMetaData[0] = new SqlMetaData ("FirstName",
            SqlDbType.NVarChar, 50);
   
      //Set up the Navigator
      if (!xmlDoc.IsNull)
          xDoc = new XPathDocument (xmlDoc.CreateReader());
      else
          xDoc = new XPathDocument ();
      xNav = xDoc.CreateNavigator();
      xPath = xNav.Compile (pathExpression);
      m_iterator = xNav.Select(xPath);
   }
   public bool Read() {
      bool moreRows = true;
      if (moreRows = m_iterator.MoveNext())
         FirstName = new SqlChars (m_iterator.Current.Value);
      return moreRows;
   }
}

Next, create an assembly and a Transact-SQL user-defined function, SQL_streaming_xml_tvf (not shown), that corresponds to the CLR function, streaming_xml_tvf. The user-defined function is used to define the table-valued function, CLR_udf_XML2Table, for rowset generation:

create function CLR_udf_XML2Table (@pk int, @xCol xml)
returns @ret_Table table (FK int, FirstName varchar(max))
with schemabinding
as
begin
      -- schemabinding requires two-part names for referenced objects
      insert into @ret_Table 
      select @pk, FirstName 
      from   dbo.SQL_streaming_xml_tvf (@xCol, '/book/author/first-name')
      return
end

Finally, define triggers as shown in "Example: Create Triggers to Populate a Property Table", but replace udf_XML2Table with the CLR_udf_XML2Table function. The insert trigger is shown in the following example:

create trigger CLR_trg_docs_INS on T for insert
as
   declare @wantedXML xml
   declare @FK int
   select @wantedXML = xCol from inserted
   select @FK = PK from inserted

   insert into tblPropAuthor
      select *
   from    dbo.CLR_udf_XML2Table(@FK, @wantedXML)

The delete trigger is identical to the non-CLR version. The update trigger is also the same, except that it calls CLR_udf_XML2Table() instead of udf_XML2Table().

XML Schema Collections

An XML schema collection is a metadata entity that is scoped by a relational schema. It contains one or more XML schemas that may be related, such as through <xs:import>, or that may be unrelated. Individual XML schemas within an XML schema collection are identified by using their target namespace.

An XML schema collection is created by using CREATE XML SCHEMA COLLECTION (Transact-SQL) syntax and providing one or more XML schemas. More XML schema components can be added to an existing XML schema, and more schemas can be added to an XML schema collection by using ALTER XML SCHEMA COLLECTION (Transact-SQL) syntax. XML schema collections can be secured like any SQL object by using the security model in SQL Server 2005.

Multi-Typed Column

An XML schema collection C types an XML column, xCol, according to multiple XML schemas. Additionally, the DOCUMENT and CONTENT flags specify whether XML trees or fragments, respectively, can be stored in column xCol.

For DOCUMENT, each XML instance specifies the target namespace of its top-level element and is typed and validated according to that namespace. For CONTENT, on the other hand, each top-level element can specify any one of the target namespaces in C. The XML instance is validated and typed according to all the target namespaces that occur in the instance.

Schema Evolution

An XML schema collection is used to type XML columns, variables, and parameters. It provides a mechanism for XML schema evolution. For illustration, assume that you add an XML schema with target namespace BOOK-V1 to an XML schema collection C. An XML column, xCol, typed by using C can store XML data that conforms to the BOOK-V1 schema.

Next assume that an application wants to extend the XML schema with new schema components, such as complex type definitions and top-level element declarations. These new schema components can be added to the BOOK-V1 schema and do not require revalidation of the existing XML data in column xCol.

Assume that the application later wants to provide a new version of the XML schema and it selects the target namespace BOOK-V2. This XML schema can be added to C. The XML column can then store instances of both BOOK-V1 and BOOK-V2, and you can run queries and data modification on XML instances that conform to either namespace.
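
The evolution steps described in this section can be sketched in Transact-SQL as follows. This is a minimal illustration, not a complete schema design. The names dbo.BookCollection and dbo.TBooks are hypothetical stand-ins for the collection C and the table that holds xCol, and the namespaces urn:BOOK-V1 and urn:BOOK-V2 are assumed URI forms of BOOK-V1 and BOOK-V2. GO is the batch separator used by client tools.

```sql
-- Step 1: create the collection with the first schema version.
CREATE XML SCHEMA COLLECTION dbo.BookCollection AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:BOOK-V1"
            elementFormDefault="qualified">
   <xs:element name="book">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="title" type="xs:string"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>'
GO

-- Step 2: type an XML column by using the collection.
CREATE TABLE dbo.TBooks (pk int PRIMARY KEY,
                         xCol xml (DOCUMENT dbo.BookCollection))
GO

-- Step 3: add a new top-level element to the existing urn:BOOK-V1 schema.
ALTER XML SCHEMA COLLECTION dbo.BookCollection ADD
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:BOOK-V1">
   <xs:element name="magazine" type="xs:string"/>
</xs:schema>'
GO

-- Step 4: add a second schema version under a new target namespace.
ALTER XML SCHEMA COLLECTION dbo.BookCollection ADD
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:BOOK-V2"
            elementFormDefault="qualified">
   <xs:element name="book">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="price" type="xs:decimal"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>'
GO

-- The same column now accepts instances of either version.
INSERT INTO dbo.TBooks VALUES (1, N'<book xmlns="urn:BOOK-V1"><title>First</title></book>')
INSERT INTO dbo.TBooks VALUES (2, N'<book xmlns="urn:BOOK-V2"><title>Second</title><price>9.99</price></book>')
```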

Loading XML Data

Transferring XML Data from SQL Server 2000 to SQL Server 2005

You can transfer XML data to SQL Server 2005 in several ways. For example:

- If you have your data in an [n]text or image column in a SQL Server 2000 database, you can import the table into a SQL Server 2005 database by using SQL Server 2005 Integration Services (SSIS). Change the column type to XML by using the ALTER TABLE statement.
- You can bulk copy your data from SQL Server 2000 by using bcp out, and then bulk insert the data into the SQL Server 2005 database by using bcp in.
- If you have data in relational columns in a SQL Server 2000 database, create a new table with an [n]text column and, optionally, a primary key column for a row identifier. Use client-side programming to retrieve the XML that is generated at the server with FOR XML and write it into the [n]text column. Then, use the previously mentioned techniques to transfer data to a SQL Server 2005 database. You can choose to write the XML into an XML column in the SQL Server 2005 database directly.

Example: Changing Column Type to XML

Assume that you want to change the type of an [n]text or image column, XYZ in table R, to untyped XML. The following statement performs this type change:

ALTER TABLE R ALTER COLUMN XYZ XML

The target can be typed XML, if it is required, by specifying an XML schema collection.
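
As a sketch of the typed case, the statement below assumes that an XML schema collection named dbo.myCollection already exists in the database. The name is borrowed from the catalog view examples later in this topic and is only illustrative. Existing values in the column must be valid according to the collection for the change to succeed.

```sql
-- Assumes dbo.myCollection was created earlier with CREATE XML SCHEMA COLLECTION.
ALTER TABLE R ALTER COLUMN XYZ XML (dbo.myCollection)
```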

Bulk Loading XML Data

You can bulk load XML data into the server by using the bulk loading capabilities of SQL Server, such as bcp. OPENROWSET allows you to load data into an XML column from files. The following example illustrates this point.

Example: Loading XML from Files

This example shows how to insert a row in table T. The value of the XML column is loaded from file C:\MyFile\xmlfile.xml as CLOB, and the integer column is supplied the value 10.

INSERT INTO T
SELECT 10, xCol
FROM    (SELECT *    
    FROM OPENROWSET (BULK 'C:\MyFile\xmlfile.xml', SINGLE_CLOB) 
 AS xCol) AS R(xCol)
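
SINGLE_CLOB reads the file as character data in the code page of the database, which can conflict with an encoding declared inside the XML file. A variant that reads the file as bytes and lets the XML parser apply the file's own encoding declaration is sketched below. It assumes the same table T(pk, xCol) and file path as the example above. BulkColumn is the column name that OPENROWSET(BULK ...) returns for the SINGLE_* options.

```sql
-- Read the file as varbinary(max) and convert it to xml explicitly.
INSERT INTO T (pk, xCol)
SELECT 10, CAST(R.BulkColumn AS xml)
FROM   OPENROWSET (BULK 'C:\MyFile\xmlfile.xml', SINGLE_BLOB) AS R
```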

Text Encoding

SQL Server 2005 stores XML data in Unicode (UTF-16). XML data retrieved from the server comes out in UTF-16 encoding. If you want a different encoding, you have to perform the required conversion on the retrieved data. Sometimes, the XML data may be in a different encoding. If it is, you have to use care during data loading. For example:

- If your text XML is in Unicode (UCS-2, UTF-16), you can assign it to an XML column, variable, or parameter without any problems.
- If the encoding is not Unicode and is implicit, because of the source code page, the string code page in the database should be the same as or compatible with the code points that you want to load. If required, use COLLATE. If no such server code page exists, you have to add an explicit XML declaration with the correct encoding.
- To use an explicit encoding, use either the varbinary() type, which has no interaction with code pages, or use a string type of the appropriate code page. Then, assign the data to an XML column, variable, or parameter.

Example: Explicitly Specifying an Encoding

Assume that you have an XML document, vcdoc, stored as varchar(max) that does not have an explicit XML declaration. The following statement adds an XML declaration with the encoding "iso8859-1", concatenates the XML document, casts the result to varbinary(max) so that the byte representation is preserved, and then finally casts it to XML. This enables the XML processor to parse the data according to the specified encoding "iso8859-1" and generate the corresponding UTF-16 representation for string values.

SELECT CAST( 
CAST (('<?xml version="1.0" encoding="iso8859-1"?>'+ vcdoc) AS VARBINARY (MAX)) 
 AS XML)
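
A self-contained variant of the statement above is sketched here with a variable in place of vcdoc. It assumes that the database default collation uses code page 1252, so that CHAR(233) produces the single byte 0xE9, which ISO-8859-1 decodes as a lowercase e with an acute accent. The variable name @vcdoc and the sample content are hypothetical, and the sketch uses the registered encoding name "iso-8859-1".

```sql
DECLARE @vcdoc varchar(max)
-- Sample document without an XML declaration; contains one non-ASCII byte.
SET @vcdoc = '<doc>caf' + CHAR(233) + '</doc>'

-- Prepend the declaration, keep the bytes unchanged through varbinary(max),
-- and let the XML parser decode them according to the declared encoding.
SELECT CAST(
          CAST(('<?xml version="1.0" encoding="iso-8859-1"?>' + @vcdoc) AS varbinary(max))
       AS xml) AS ParsedXml
```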

XQuery and Type Inference

XQuery embedded in Transact-SQL is the language that is supported for querying the xml data type. The language is in development by the World Wide Web Consortium (W3C), with Microsoft and all major database vendors participating. It includes XPath version 2.0 as the navigation language. Language constructs for data modification are also available on the xml data type. For more information about the XQuery constructs, functions, and operators supported in SQL Server, see XQuery Functions Against the xml Data Type.

Error Model

Compilation errors are returned from syntactically incorrect XQuery expressions and XML DML statements. The compilation phase checks static type correctness of XQuery expressions and DML statements, and uses XML schemas for type inferences for typed XML. It raises static type errors if an expression could fail at run time because of a type safety violation. Examples of static errors are the addition of a string to an integer and querying for a nonexistent node for typed data.

As a deviation from the W3C standard, XQuery run-time errors are converted into empty sequences. These sequences may propagate as empty XML or NULL to the query result, depending upon the invocation context.

Explicit casting to the correct type allows users to work around static errors, although run-time cast errors will be transformed to empty sequences.

The following sections describe type checking in more detail.

Singleton Checks

Location steps, function parameters, and operators that require singletons will return an error if the compiler cannot determine whether a singleton is guaranteed at run time. This problem occurs frequently with untyped data. For example, the lookup of an attribute requires a singleton parent element. An ordinal that selects a single parent node is sufficient. The evaluation of a nodes()-value() combination to extract attribute values may not require the ordinal specification. This is shown in the next example.

Example: Known Singleton

In this example, the nodes() method generates a separate row for each <book> element. The value() method that is evaluated on a <book> node extracts the value of @genre and, being an attribute, is a singleton.

SELECT nref.value('@genre', 'varchar(max)') Genre
FROM   T CROSS APPLY xCol.nodes('//book') AS R(nref)

XML schema is used for type checking of typed XML. If a node is specified as a singleton in the XML schema, the compiler uses that information and no error occurs. Otherwise, an ordinal that selects a single node is required. In particular, the use of the descendant-or-self axis (//), such as in /book//title, loses singleton cardinality inference for the <title> element, even if the XML schema specifies it to be so. Therefore, you should rewrite it as (/book//title)[1].

It is important to remain aware of the difference between //first-name[1] and (//first-name)[1] for type checking. The former returns a sequence of <first-name> nodes in which each node is the leftmost <first-name> node among its siblings. The latter returns the first singleton <first-name> node in document order in the XML instance.
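
The difference can be observed with a small untyped instance. The sketch below uses a hypothetical variable @x and sample data. The first expression returns the <first-name> elements for Ann and Cal, because each is the first <first-name> child of its own <author> parent. The second expression returns only the element for Ann, the first one in document order.

```sql
DECLARE @x xml
SET @x = N'<book>
              <author><first-name>Ann</first-name><first-name>Bea</first-name></author>
              <author><first-name>Cal</first-name></author>
           </book>'

SELECT @x.query('//first-name[1]')   AS FirstAmongSiblings,  -- Ann and Cal
       @x.query('(//first-name)[1]') AS FirstInDocument      -- Ann only
```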

Example: Using value()

The following query on an untyped XML column results in a static compilation error. This is because value() expects a singleton node as the first argument and the compiler cannot determine whether only one <last-name> node will occur at run time:

SELECT xCol.value('//author/last-name', 'nvarchar(50)') LastName
FROM   T

Following is a solution that you could consider:

SELECT xCol.value('//author/last-name[1]', 'nvarchar(50)') LastName
FROM   T

However, this solution does not solve the error, because multiple <author> nodes may occur in each XML instance. The following rewrite works:

SELECT xCol.value('(//author/last-name/text())[1]', 'nvarchar(50)') LastName
FROM   T

This query returns the value of the first <last-name> element in each XML instance.

Parent Axis

If the type of a node cannot be determined, it becomes anyType. This is not implicitly cast to any other type. This occurs most notably during navigation by using the parent axis, for example, xCol.query('/book/@genre/../price'). The parent node type is determined to be anyType. An element may also be defined as anyType in an XML schema. In both cases, the loss of more precise type information frequently leads to static type errors and requires explicit casting of atomic values to their specific type.

Data(), text(), and string() Accessors

XQuery has a function fn:data() to extract scalar, typed values from nodes, a node test text() to return text nodes, and the function fn:string() that returns the string value of a node. Their use can be confusing. Following are the guidelines for using them correctly in SQL Server 2005. The XML instance <age>12</age> is used for the purpose of illustration.

- Untyped XML: The path expression /age/text() returns the text node "12". The function fn:data(/age) returns the string value "12" and so does fn:string(/age).
- Typed XML: The expression /age/text() returns a static error for any simple typed <age> element. On the other hand, fn:data(/age) returns integer 12. The fn:string(/age) yields the string "12".
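
For the untyped case, the three accessors can be compared side by side as sketched below. The variable @x is hypothetical. The argument of string() is written as (/age)[1] here on the assumption that the compiler requires a guaranteed singleton for this function on untyped data, as described in the "Singleton Checks" section.

```sql
DECLARE @x xml
SET @x = N'<age>12</age>'

-- All three columns contain the text 12 for this untyped instance.
SELECT @x.query('/age/text()')       AS TextNode,
       @x.query('data(/age)')        AS DataValue,
       @x.query('string((/age)[1])') AS StringValue
```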

Functions and Operators Over Union Types

Union types require careful handling because of type checking. Two of the problems are illustrated in the following examples.

Example: Function over Union Type

Consider an element definition for <r> of a union type:

<xs:element name="r">
<xs:simpleType>
   <xs:union memberTypes="xs:int xs:float xs:double"/>
</xs:simpleType>
</xs:element>

Within XQuery context, the "average" function fn:avg (//r) returns a static error, because the XQuery compiler cannot add values of different types (xs:int, xs:float or xs:double) for the <r> elements in the argument of fn:avg(). To solve this, rewrite the function invocation as fn:avg(for $r in //r return $r cast as xs:double ?).

Example: Operator over Union Type

The addition operation ('+') requires precise types of the operands. As a result, the expression (//r)[1] + 1 returns a static error when <r> has the previously described union type definition. One solution is to rewrite it as (//r)[1] cast as xs:int? +1, where the "?" indicates 0 or 1 occurrences. SQL Server 2005 requires "cast as" with "?", because any cast can yield the empty sequence as a result of run-time errors.

Value(), Nodes(), and OpenXML()

You can use multiple value() methods on xml data type in a SELECT clause to generate a rowset of extracted values. The nodes() method yields an internal reference for each selected node that can be used for additional query. The combination of the nodes() and value() methods can be more efficient in generating the rowset when it has several columns and, perhaps, when the path expressions used in its generation are complex.

The nodes() method yields instances of a special xml data type, each of which has its context set to a different selected node. This kind of XML instance supports query(), value(), nodes(), and exist() methods and can be used in count(*) aggregations. All other uses cause an error.

Example: Using nodes()

Assume that you want to extract the first and last names of authors, and the first name is not "David". Additionally, you want to extract this information as a rowset that contains two columns, FirstName and LastName. By using nodes() and value() methods, you can accomplish this as shown in the following:

SELECT nref.value('(first-name/text())[1]', 'nvarchar(50)') FirstName,
       nref.value('(last-name/text())[1]', 'nvarchar(50)') LastName
FROM   T CROSS APPLY xCol.nodes('//author') AS R(nref)
WHERE  nref.exist('first-name[. != "David"]') = 1

In this example, nodes('//author') yields a rowset of references to <author> elements for each XML instance. The first and last names of authors are obtained by evaluating value() methods relative to those references.

SQL Server 2000 provides the capability for generating a rowset from an XML instance by using OpenXml(). You can specify the relational schema for the rowset and how values inside the XML instance map to columns in the rowset.

Example: Using OpenXml() on the xml Data Type

The query can be rewritten from the previous example by using OpenXml() as shown in the following. This is done by creating a cursor that reads each XML instance into an XML variable and then applies OpenXML to it:

DECLARE name_cursor CURSOR
FOR
   SELECT xCol 
   FROM   T
OPEN name_cursor
DECLARE @xmlVal XML
DECLARE @idoc int
FETCH NEXT FROM name_cursor INTO @xmlVal

WHILE (@@FETCH_STATUS = 0)
BEGIN
   EXEC sp_xml_preparedocument @idoc OUTPUT, @xmlVal
   SELECT   *
   FROM   OPENXML (@idoc, '//author')
          WITH (FirstName  varchar(50) 'first-name',
                LastName   varchar(50) 'last-name') R
   WHERE  R.FirstName != 'David'

   EXEC sp_xml_removedocument @idoc
   FETCH NEXT FROM name_cursor INTO @xmlVal
END
CLOSE name_cursor
DEALLOCATE name_cursor

OpenXml() creates an in-memory representation and uses work tables instead of the query processor. It relies on the XPath version 1.0 processor of MSXML version 3.0 instead of the XQuery engine. The work tables are not shared among multiple calls to OpenXml(), even on the same XML instance. This limits its scalability. OpenXml() allows you to access an edge table format for the XML data when the WITH clause is not specified. Also, it allows you to use the remaining XML value in a separate, "overflow" column.

The combination of the nodes() and value() methods uses XML indexes effectively. As a result, this combination can scale better than OpenXml().

Using FOR XML to Generate XML from Rowsets

You can generate an xml data type instance from a rowset by using FOR XML with the new TYPE directive.

The result can be assigned to an xml data type column, variable, or parameter. Also, FOR XML can be nested to generate any hierarchical structure. This makes nested FOR XML much more convenient to write than FOR XML EXPLICIT, but it may not perform as well for deep hierarchies. FOR XML also introduces a new PATH mode. This new mode specifies the path in the XML tree where a column's value appears.
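
A nested FOR XML query that combines the TYPE directive with PATH mode might look like the following sketch. The table variables @Authors and @Books, their columns, and the sample rows are hypothetical and exist only to make the example self-contained. The inner query returns an xml value for each author, which the outer query embeds under the <author> element, and the final result is assigned to an xml variable.

```sql
DECLARE @Authors TABLE (AuthorId int, FirstName nvarchar(50))
DECLARE @Books   TABLE (AuthorId int, Title nvarchar(100))

INSERT INTO @Authors VALUES (1, N'David')
INSERT INTO @Authors VALUES (2, N'Maria')
INSERT INTO @Books   VALUES (1, N'Book A')
INSERT INTO @Books   VALUES (1, N'Book B')
INSERT INTO @Books   VALUES (2, N'Book C')

DECLARE @result xml
SET @result = (
   SELECT A.FirstName AS [@first-name],          -- attribute of <author>
          (SELECT B.Title AS [title]             -- <book><title>...</title></book>
           FROM   @Books AS B
           WHERE  B.AuthorId = A.AuthorId
           FOR XML PATH('book'), TYPE)           -- TYPE keeps the result as xml
   FROM   @Authors AS A
   FOR XML PATH('author'), ROOT('authors'), TYPE)

SELECT @result AS AuthorsXml
```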

The new FOR XML TYPE directive can be used to define read-only XML views over relational data with SQL syntax. The view can be queried with SQL statements and embedded XQuery, as shown in the following example. You can also refer to these SQL views in stored procedures.

Example: SQL View Returning Generated xml Data Type

The following SQL view definition creates an XML view over a relational column, pk, and book authors retrieved from an XML column:

CREATE VIEW V (xmlVal) AS
SELECT pk, xCol.query('/book/author')
FROM   T
FOR XML AUTO, TYPE

The V view contains a single row with a single column, xmlVal, of XML type. It can be queried like a regular xml data type instance. For example, the following query returns the author whose first name is "David":

SELECT xmlVal.query('//author[first-name = "David"]')
FROM   V

SQL view definitions are somewhat similar to XML views that are created by using annotated schemas. However, there are important differences. The SQL view definition is read-only and must be manipulated with embedded XQuery. The XML views are created by using annotated schema. Additionally, the SQL view materializes the XML result before applying the XQuery expression, while the XPath queries on XML views evaluate SQL queries on the underlying tables.

Adding Business Logic

Your business logic can be added to XML data in several ways:

- You can write row or column constraints to enforce domain-specific constraints during insertion and modification of XML data.
- You can write a trigger on the XML column that fires when you insert or update values in the column. The trigger can contain domain-specific validation rules or populate property tables.
- You can write SQLCLR functions in managed code to which you pass XML values, and use XML processing capabilities provided by the System.Xml namespace. An example is to apply XSL transformation to XML data. Alternatively, you can deserialize the XML into one or more managed classes and operate on them by using managed code.
- You can write Transact-SQL stored procedures and functions that start the processing on the XML column for your business needs.

Example: Applying XSL Transformation

Consider a CLR function TransformXml() that accepts an xml data type instance and an XSL transformation stored in a file, applies the transformation to the XML data, and then returns the transformed XML in the result. Following is a skeleton function that is written in C#:

public static SqlXml TransformXml (SqlXml XmlData, string xslPath) {
   // Load XSL transformation
   XslCompiledTransform xform = new XslCompiledTransform();
   XPathDocument xslDoc = new XPathDocument (xslPath);
   xform.Load(xslDoc);

   // Load XML data 
   XPathDocument xDoc = new XPathDocument (XmlData.CreateReader());

   // Return the transformed value
   MemoryStream xsltResult = new MemoryStream();
   xform.Transform(xDoc, null, xsltResult);
   SqlXml retSqlXml = new SqlXml(xsltResult);
   return (retSqlXml);
}

After the assembly is registered and a user-defined Transact-SQL function, SqlXslTransform(), is created that corresponds to TransformXml(), the function can be invoked from Transact-SQL as shown in the following query. A scalar user-defined function must be called with at least a two-part name, so the schema name is included:

SELECT dbo.SqlXslTransform (xCol, 'C:\MyFile\xsltransform.xsl')
FROM   T
WHERE  xCol.exist('/book/title/text()[contains(.,"custom")]') = 1

The query result contains a rowset of the transformed XML.
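
The registration step mentioned above is not shown in this topic. One way it might look is sketched below. The assembly name XmlTransformAssembly, the DLL path, and the class name XmlFunctions are all hypothetical, because the skeleton C# function does not show its enclosing class. The sketch also assumes that CLR integration is enabled on the server and that the database is permitted to load assemblies with EXTERNAL_ACCESS.

```sql
-- Hypothetical assembly name and path. EXTERNAL_ACCESS is requested
-- because the function reads the XSL file from the file system.
CREATE ASSEMBLY XmlTransformAssembly
FROM 'C:\MyFile\XmlTransform.dll'
WITH PERMISSION_SET = EXTERNAL_ACCESS
GO

-- Bind a Transact-SQL function to the static method TransformXml
-- in the hypothetical class XmlFunctions.
CREATE FUNCTION dbo.SqlXslTransform (@xmlData xml, @xslPath nvarchar(260))
RETURNS xml
AS EXTERNAL NAME XmlTransformAssembly.XmlFunctions.TransformXml
GO
```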

SQLCLR expands the possibilities for decomposing XML data into tables or property promotion, and querying XML data by using managed classes in the System.Xml namespace. For more information, see SQL Server Books Online and the .NET Framework SDK documentation.

Cross-Domain Queries

When your data resides in a combination of relational and xml data type columns, you may want to write queries that combine relational and XML data processing. For example, you can convert the data in relational and XML columns into an xml data type instance by using FOR XML and query it by using XQuery. Conversely, you can generate a rowset from XML values and query it by using Transact-SQL.

A more convenient and efficient way of writing cross-domain queries is to use the value of a SQL variable or column within XQuery or XML DML expressions:

- You can use sql:variable() to use the value of a SQL variable in your XQuery or XML DML expression.
- You can use sql:column() to use the values from a relational column in your XQuery or XML DML expression.

These two approaches enable applications to parameterize queries, as shown in the next example. However, XML and user-defined types are not permitted in sql:variable() and sql:column().

Example: Cross-domain Query Using sql:variable()

The following query is a modified version of the one shown in "Example: Queries on a Computed Column Based on xml Data Type Methods". In the following version, the ISBN is passed in by using a SQL variable, @isbn. By replacing the constant with sql:variable(), the query can be used to search for any ISBN and not just 0-7356-1588-2.

DECLARE @isbn varchar(20)
SET     @isbn = '0-7356-1588-2'
SELECT  xCol
FROM    T
WHERE   xCol.exist ('/book/@ISBN[. = sql:variable("@isbn")]') = 1

sql:column() can be used in a similar manner and provides additional benefits. Indexes over the column may be used for efficiency, as decided by the cost-based query optimizer. Also, the computed column may store a promoted property.
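
A comparable query with sql:column() is sketched below. The table variable @Wanted and its isbn column are hypothetical and stand in for any relational source of ISBN values. For each pairing of a row in T with a row in @Wanted, the XQuery predicate compares the @ISBN attribute with the relational column value.

```sql
DECLARE @Wanted TABLE (isbn varchar(20))
INSERT INTO @Wanted VALUES ('0-7356-1588-2')

-- Return the XML instances whose ISBN appears in the relational list.
SELECT T.pk, T.xCol
FROM   T CROSS JOIN @Wanted AS W
WHERE  T.xCol.exist('/book/@ISBN[. = sql:column("W.isbn")]') = 1
```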

Catalog Views for Native XML Support

Catalog views exist to provide metadata information about XML use. Some of these are discussed in the following section.

XML Indexes

XML index entries appear in the catalog view, sys.indexes, with the index "type" 3. The name column contains the name of the XML index.

XML indexes are also recorded in the catalog view, sys.xml_indexes. This contains all the columns of sys.indexes and some specific ones that are useful for XML indexes. The value NULL in the column, secondary_type, indicates a primary XML index; the values 'P', 'R' and 'V' stand for PATH, PROPERTY, and VALUE secondary XML indexes, respectively.

The space use of XML indexes can be found by using the table-valued function sys.dm_db_index_physical_stats. It provides information, such as the number of disk pages occupied, average row size in bytes, and number of records, for all index types, including XML indexes. This information is available for each database partition. XML indexes use the same partitioning scheme and partitioning function as the base table.
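
The catalog view and the function described in this section can be queried as sketched below. The sketch assumes that the base table is dbo.T, which is an assumption about the schema of the table T used throughout this topic. The DETAILED mode is requested because the record count and average record size are not reported in the default LIMITED mode.

```sql
-- List the XML indexes on the table and classify them.
SELECT name,
       index_id,
       using_xml_index_id,              -- NULL for a primary XML index
       CASE secondary_type
            WHEN 'P' THEN 'PATH'
            WHEN 'R' THEN 'PROPERTY'
            WHEN 'V' THEN 'VALUE'
            ELSE 'PRIMARY'
       END AS xml_index_kind
FROM   sys.xml_indexes
WHERE  object_id = OBJECT_ID(N'dbo.T')

-- Report space use per index and partition, XML indexes included.
SELECT index_id,
       index_type_desc,
       partition_number,
       page_count,
       avg_record_size_in_bytes,
       record_count
FROM   sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID(N'dbo.T'), NULL, NULL, 'DETAILED')
```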

Retrieving XML Schema Collections

XML schema collections are enumerated in the catalog view, sys.xml_schema_collections. The XML schema collection "sys" is defined by the system. It contains the predefined namespaces that can be used in all user-defined XML schema collections without having to load them explicitly. This list contains the namespaces for xml, xs, xsi, fn, and xdt. Two other catalog views are sys.xml_schema_namespaces, which enumerates all namespaces within each XML schema collection, and sys.xml_schema_components, which enumerates all XML schema components within each XML schema.

The built-in function XML_SCHEMA_NAMESPACE(schemaName, XmlSchemaCollectionName, namespace-uri) yields an xml data type instance. This instance contains XML schema fragments for schemas that are contained in an XML schema collection, except the predefined XML schemas.

You can enumerate the contents of an XML schema collection in the following ways:

- Write Transact-SQL queries on the appropriate catalog views for XML schema collections.
- Use the built-in function XML_SCHEMA_NAMESPACE(). You can apply xml data type methods on the output of this function. However, you cannot modify the underlying XML schemas.

These are illustrated in the following examples.

Example: Enumerate the XML Namespaces in an XML Schema Collection

Use the following query for the XML schema collection "myCollection":

SELECT XSN.name
FROM    sys.xml_schema_collections XSC JOIN sys.xml_schema_namespaces XSN
    ON (XSC.xml_collection_id = XSN.xml_collection_id)
WHERE    XSC.name = 'myCollection'

Example: Enumerate the Contents of an XML Schema Collection

The following statement enumerates the contents of the XML schema collection "myCollection" within the relational schema, dbo.

SELECT XML_SCHEMA_NAMESPACE (N'dbo', N'myCollection')

Individual XML schemas within the collection can be obtained as xml data type instances by specifying the target namespace as the third argument to XML_SCHEMA_NAMESPACE(). This is shown in the following example.

Example: Output a Specified Schema from an XML Schema Collection

The following statement outputs the XML schema with the target namespace "https://www.microsoft.com/books" from the XML schema collection "myCollection" within the relational schema, dbo.

SELECT XML_SCHEMA_NAMESPACE (N'dbo', N'myCollection', 
N'https://www.microsoft.com/books')

Querying XML Schemas

You can query XML schemas that you have loaded into XML schema collections in the following ways:

- Write Transact-SQL queries on catalog views for XML schema namespaces.
- Create a table that contains an xml data type column to store your XML schemas and also load them into the XML type system. You can query the XML column by using the xml data type methods. Also, you can build an XML index on this column. However, with this approach, the application must maintain consistency between the XML schemas stored in the XML column and the XML type system. For example, if you drop the XML schema namespace from the XML type system, you also have to drop it from the table in order to preserve consistency.
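
The first approach can be sketched as a join across the catalog views for collections, namespaces, and components. The collection name myCollection is the illustrative name used in the earlier examples. The query lists each named schema component together with its kind and the namespace it belongs to.

```sql
-- Enumerate the schema components of one collection, grouped by namespace.
SELECT XSN.name AS namespace_uri,
       XC.name  AS component_name,
       XC.kind_desc,
       XC.symbol_space_desc
FROM   sys.xml_schema_collections AS XSC
       JOIN sys.xml_schema_namespaces AS XSN
         ON XSN.xml_collection_id = XSC.xml_collection_id
       JOIN sys.xml_schema_components AS XC
         ON XC.xml_collection_id = XSN.xml_collection_id
        AND XC.xml_namespace_id  = XSN.xml_namespace_id
WHERE  XSC.name = N'myCollection'
  AND  XC.name IS NOT NULL            -- skip anonymous components
ORDER BY XSN.name, XC.name
```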