Code Generation in the .NET Framework Using XML Schema

09/04/2007

Daniel Cazzulino

May 2004

Applies to:
Microsoft® Visual Studio® .NET

Requirements: This download requires Microsoft .NET Framework 1.1 to be installed.

Summary: Learn the difference between typed datasets and classes generated by the xsd.exe tool, as well as how to extend this code generation process by reusing the infrastructure classes supporting it, while remaining compatible with the XmlSerializer. (23 printed pages)

Download the code samples.

Introduction
The Fundamental Classes
Extending XSD Processing
Inside XmlSerializer
Customizing with CodeDom
Mapping Tips
Conclusion

Introduction

Automatic code generation, be it data access layers, business entity classes, or even user interfaces, can greatly increase developer productivity. This process of generation can be based on a number of inputs, such as a database, an arbitrary XML file, a UML diagram, etc. Microsoft® Visual Studio® .NET comes with built-in support for code generation from W3C XML Schema files (XSD) in two forms: typed datasets, and custom classes for use with XmlSerializer.

XSD files describe the content that an XML document is allowed to contain in order to be considered valid according to the schema. The need to handle the data, which will ultimately be serialized as XML for consumption, in a type-safe manner led to various approaches to converting the XSD into classes. Recall that XSD was NOT created as a means to describe objects and their relationships. A better format already exists for that, UML, and it's widely used to model applications and to generate code from those models. There are, therefore, some (expected) mismatches between .NET and its object-oriented programming (OOP) concepts, and those of XSD. Keep this in mind when you walk the path of XSD -> classes mapping.

That said, the CLR type system can be seen as a subset of XSD: XSD supports features that can't be mapped to regular OO concepts. If you're using XSD only to model your classes, as opposed to modeling your documents, you will therefore probably find almost no conflicts.

During the rest of this article, we will discuss the approach of typed datasets, how the custom classes generated by the xsd.exe tool make for a better solution, and how to extend and customize the output of XSD -> Classes generation.

A basic understanding of CodeDom is required in order to get the most out of this article.

What's Wrong with Typed Datasets?

Typed datasets are increasingly being used to represent business entities; that is, to act as a transport for entity data between layers of an application, or even as the output of WebServices. Typed datasets, as opposed to the "normal" Dataset, are appealing because you also gain typed access to tables, rows, columns, etc. They don't come without costs and/or limitations, however:

Overhead of implementation: datasets include a host of features that may not be required for your entities, such as change tracking, SQL-like queries, data views, a lot of events, etc.
Performance: serialization to/from XML is not the fastest around. XmlSerializer outperforms it easily.
Interoperability: it may be problematic with non-.NET clients of WebServices returning typed datasets.
XML structures: many hierarchical (and perfectly valid) documents and their schemas can't be flattened to a table model.

Get more information about typed datasets.

Unless the additional features of datasets in general are relevant to you, therefore, using typed datasets for data passing may not be the best choice. Fortunately, there's another option of which you can take advantage.

XmlSerializer and Custom Classes

The XmlSerializer improves the story for XML data handling. By means of serialization attributes, the XmlSerializer is able to rehydrate objects from their XML representation, as well as serializing back to XML form. Additionally, it is able to do so in a highly efficient way, because it generates a dynamically compiled XmlReader-based (therefore streaming) class specially targeted at (de)serializing the concrete type. That is, it's really, really fast.

Read more about XML serialization attributes.
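As a rough illustration of what serialization attributes do, the following sketch round-trips a small class through the XmlSerializer. The Publisher class, its members, and the element and attribute names are hypothetical and not taken from the article download.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical entity; the XML names are chosen only for illustration.
[XmlRoot("publisher", Namespace = "")]
public class Publisher
{
    [XmlAttribute("id")]
    public string Id;

    [XmlElement("name")]
    public string Name;
}

public static class SerializationDemo
{
    // Serializes a Publisher to its XML representation.
    public static string ToXml(Publisher p)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Publisher));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, p);
            return writer.ToString();
        }
    }

    // Rehydrates a Publisher from its XML representation.
    public static Publisher FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Publisher));
        using (StringReader reader = new StringReader(xml))
        {
            return (Publisher)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        Publisher p = new Publisher();
        p.Id = "0736";
        p.Name = "New Moon Books";

        string xml = ToXml(p);
        Console.WriteLine(xml);

        Publisher copy = FromXml(xml);
        Console.WriteLine(copy.Id + ": " + copy.Name);
    }
}
```

The attributes, not the member names, decide the XML names: Id is written as an id attribute and Name as a name child element.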

Of course, guessing which attributes to use in order to comply with an XSD is no fun at all. To solve this problem, the .NET SDK comes with a utility that does the hard work for you: xsd.exe. It's a command line application that is able to generate both typed datasets and custom classes from an XSD file. The custom classes are generated with the corresponding XML serialization attributes, so that, upon serialization, full fidelity to the schema is honored.

Read Don Box's intro to XSD and the CLR mapping and attributes.

So far, so good. We have an efficient and performant way to convert XML to or from objects, and we have a tool to generate the classes for us. The problem is, we sometimes want something slightly different than what's being generated. For example, xsd.exe-generated classes can't be data-bound to a Windows Forms grid because it looks for properties, instead of public fields, to show. We may want to add our own custom attributes here and there, change arrays to typed collections, and so on. Of course, we should do this while remaining compatible with the XSD upon serialization.

Customizing the XSD will obviously change the shape of generated classes. If you're just looking to turn PascalCase into the de-facto XML standard for using camelCase, I'd suggest you think twice. Upcoming products from MS suggest they are moving into PascalCase for XML representations, to make them more .NET friendly.

If you need further customizations like the ones mentioned above, what's your option? It's almost common belief that xsd.exe is not extensible and there's no way to customize it. This isn't accurate, because the .NET XML team actually made available to us the very classes used by that tool. You'll have to get your hands dirty with CodeDom in order to take advantage of them, but the level of customization is only limited by your needs!

You can read about the CodeDom in the following articles:

Generating and Compiling Source Code Dynamically in Multiple Languages

Generate .NET Code in Any Language Using CodeDOM

The Fundamental Classes

One way of generating code from XSD would be to simply iterate the Schema Object Model (SOM) and write code directly from that, in a sort of monolithic way. This is the approach taken by numerous code generators that were created to overcome the limitations of the xsd.exe tool. This involves substantial effort and code, however, because we should consider XSD->CLR type mappings, XSD type inheritance, XML serialization attributes, and so on. Mastering the SOM is not a trivial task, either. Instead of doing it all ourselves, wouldn't it be great if we could just add to or modify the built-in code generation of the xsd.exe tool?

Like I said above, and contrary to common belief, the very classes xsd.exe uses to generate output are public and available in the System.Xml.Serialization namespace, even if the xsd.exe tool certainly doesn't allow any kind of customizations. It's true they are for the most part undocumented, but I'll show you how they can be used in this section. Don't let the following statement in the MSDN help intimidate you: "The [TheTopSecretClassName] type supports the Microsoft® .NET Framework infrastructure and is not intended to be used directly from your code". I'll use them without resorting to hacks or any reflection code.

A much better approach than the fairly usual "StringBuilder.Append" code generation is to take advantage of the classes in the System.CodeDom namespace, and that's exactly what the built-in code generation classes (simply codegen from now on) do. The CodeDom contains classes that allow us to represent almost any programming construct in a language-independent way, in a so-called AST (abstract syntax tree). At a later time, another class, the code generator, can interpret it and generate the raw code you expect, such as Microsoft® Visual C# or Microsoft® Visual Basic®.NET. This is the way most code generation happens in the .NET Framework.
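To make the idea of a language-independent tree more concrete, here is a minimal sketch that builds one tiny CodeDom tree and renders it in both C# and Visual Basic. It assumes .NET Framework 2.0 or later, where CodeDomProvider exposes GenerateCodeFromNamespace directly; the Demo namespace and Publisher class are placeholders.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;
using Microsoft.VisualBasic;

public static class CodeDomDemo
{
    // Builds a language-neutral tree equivalent to:
    // namespace Demo { public class Publisher { public string pub_id; } }
    public static CodeNamespace BuildTree()
    {
        CodeNamespace ns = new CodeNamespace("Demo");
        CodeTypeDeclaration type = new CodeTypeDeclaration("Publisher");
        type.IsClass = true;

        CodeMemberField field = new CodeMemberField(typeof(string), "pub_id");
        field.Attributes = MemberAttributes.Public;
        type.Members.Add(field);

        ns.Types.Add(type);
        return ns;
    }

    // Renders the tree as source code in the provider's language.
    public static string Generate(CodeDomProvider provider, CodeNamespace ns)
    {
        using (StringWriter writer = new StringWriter())
        {
            provider.GenerateCodeFromNamespace(ns, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }

    public static void Main()
    {
        CodeNamespace ns = BuildTree();
        Console.WriteLine(Generate(new CSharpCodeProvider(), ns));
        Console.WriteLine(Generate(new VBCodeProvider(), ns));
    }
}
```

The same tree is passed to both providers; only the provider decides the syntax of the output.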

The codegen approach not only takes advantage of this, but also decouples the schema analysis from the actual CodeDom generation by means of a mapping process. This mapping must be performed for each schema element for which we want to generate code. Basically, it constructs a new object that represents the results of the analysis, such as the element's structure, the type name that will be generated for it, its members, their CLR types, etc.

In order to use these classes we follow a basic workflow summarized as follows:

Load the schema (one, in principle).
Derive a series of mappings for each of the top-level XSD elements.
Export those mappings to a System.CodeDom.CodeNamespace.

Four classes are involved in this process, all of them defined in the System.Xml.Serialization namespace:

Figure 1. Classes used to get a CodeDom tree

Getting a CodeDom tree using these classes can be achieved as follows:

using System.CodeDom;
using System.IO;
using System.Xml.Schema;
using System.Xml.Serialization;

namespace XsdGenerator
{
  public sealed class Processor
  {
    public static CodeNamespace Process( string xsdFile, 
       string targetNamespace )
    {
      // Load the XmlSchema and its collection.
      XmlSchema xsd;
      using ( FileStream fs = new FileStream( xsdFile, FileMode.Open ) )
      {
        xsd = XmlSchema.Read( fs, null );
        xsd.Compile( null );
      }
      XmlSchemas schemas = new XmlSchemas();
      schemas.Add( xsd );
      // Create the importer for these schemas.
      XmlSchemaImporter importer = new XmlSchemaImporter( schemas );
      // System.CodeDom namespace for the XmlCodeExporter to put classes in.
      CodeNamespace ns = new CodeNamespace( targetNamespace );
      XmlCodeExporter exporter = new XmlCodeExporter( ns );
      // Iterate schema top-level elements and export code for each.
      foreach ( XmlSchemaElement element in xsd.Elements.Values )
      {
        // Import the mapping first.
        XmlTypeMapping mapping = importer.ImportTypeMapping( 
          element.QualifiedName );
        // Export the code finally.
        exporter.ExportTypeMapping( mapping );
      }
      return ns;
    }
  }
}

The code is pretty straightforward, although you may want to add exception management code here and there. One thing to notice is that the XmlSchemaImporter imports a type by using its qualified name, which is then located in the appropriate XmlSchema. All of the global elements in the schema must therefore be passed to it, which are iterated using the XmlSchema.Elements collection. This collection, as well as the XmlSchemaElement.QualifiedName, is a member of the so-called Post Schema Compilation Infoset (PSCI, see MSDN help), which gets filled after schema compilation. Compilation has the effect of filling and organizing the schema information after resolving references, schema types, inheritance, inclusions, etc. It's functionally similar to the Post-Schema-Validation Infoset (PSVI, see Dare Obasanjo's MSDN article and the XSD specification).

You may have noticed one side effect (a drawback actually) of the way the XmlSchemaImporter works: you can only retrieve (import) the mapping for globally defined elements. Any other elements defined locally anywhere in the schema will not be accessible through this mechanism. This has some consequences I will discuss later, which can limit the customizations you can apply or affect our schema design.
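The following sketch illustrates the global-versus-local distinction with a hypothetical schema that declares one global element (catalog) containing one local element (publisher). It assumes .NET Framework 2.0 or later and compiles through XmlSchemaSet, which superseded the XmlSchema.Compile() call used in the listing above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class GlobalElementsDemo
{
    // Hypothetical schema: "catalog" is global, "publisher" is declared locally.
    public const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "  <xs:element name='catalog'>" +
        "    <xs:complexType><xs:sequence>" +
        "      <xs:element name='publisher' type='xs:string' maxOccurs='unbounded'/>" +
        "    </xs:sequence></xs:complexType>" +
        "  </xs:element>" +
        "</xs:schema>";

    // Reads the schema and compiles it so that the post-compilation
    // properties (such as Elements) are populated.
    public static XmlSchema LoadAndCompile()
    {
        XmlSchema schema;
        using (StringReader reader = new StringReader(Xsd))
        {
            schema = XmlSchema.Read(reader, null);
        }
        XmlSchemaSet set = new XmlSchemaSet();
        set.Add(schema);
        set.Compile();
        return schema;
    }

    public static void Main()
    {
        XmlSchema schema = LoadAndCompile();
        // Only globally declared elements appear in this collection.
        foreach (XmlSchemaElement element in schema.Elements.Values)
        {
            Console.WriteLine(element.QualifiedName);
        }
    }
}
```

Only catalog is reachable through XmlSchema.Elements, so only catalog could be handed to the importer; the locally declared publisher element never shows up in that collection.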

The XmlCodeExporter class populates the CodeNamespace passed in to its constructor with the type definitions that correspond to the imported mappings, building what's known as a CodeDom tree. The resulting CodeDom from the method above is exactly what's generated internally by the xsd.exe tool. With this tree at hand, it's possible to either directly compile it to an assembly, or generate source code.

If I want to get rid of the xsd.exe tool, I can easily build a console application that uses this class. For that purpose, I need to generate a source code file from the CodeDom tree I received. I do this by creating a CodeDomProvider appropriate to a target language selected by the user:

static void Main( string[] args )
{
  if ( args.Length != 4 )
  {
    Console.WriteLine(
      "Usage: XsdGenerator xsdfile namespace outputfile [cs|vb]" );
    return;
  }
  // Get the namespace for the schema.
  CodeNamespace ns = Processor.Process( args[0], args[1] );
  // Create the appropriate generator for the language.
  CodeDomProvider provider;
  if ( args[3] == "cs" )
    provider = new Microsoft.CSharp.CSharpCodeProvider();
  else if ( args[3] == "vb" )
    provider = new Microsoft.VisualBasic.VBCodeProvider();
  else
    throw new ArgumentException( "Invalid language: " + args[3] );
  // Write the code to the output file.
  using ( StreamWriter sw = new StreamWriter( args[2], false ) )
  {
    provider.CreateGenerator().GenerateCodeFromNamespace(
      ns, sw, new CodeGeneratorOptions() );
  }
  Console.WriteLine( "Finished" );
  Console.Read();
}

I can further customize the generated code formatting and other options by using the properties of the CodeGeneratorOptions instance received by the generator. Look at the MSDN Documentation for the available options.
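As a small sketch of those options, the code below sets three CodeGeneratorOptions properties before generating C# source. It assumes .NET Framework 2.0 or later (where the provider can generate code directly), and the values chosen are only examples.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class OptionsDemo
{
    // Generates C# source for the namespace with customized formatting.
    public static string Generate(CodeNamespace ns)
    {
        CodeGeneratorOptions options = new CodeGeneratorOptions();
        options.BracingStyle = "C";               // opening braces on their own line
        options.IndentString = "  ";              // indent with two spaces
        options.BlankLinesBetweenMembers = false; // no blank lines between members

        using (StringWriter writer = new StringWriter())
        {
            new CSharpCodeProvider().GenerateCodeFromNamespace(ns, writer, options);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Placeholder tree; in the tool this would come from Processor.Process().
        CodeNamespace ns = new CodeNamespace("Demo");
        CodeTypeDeclaration type = new CodeTypeDeclaration("Publisher");
        type.Members.Add(new CodeMemberField(typeof(string), "pub_id"));
        ns.Types.Add(type);

        Console.WriteLine(Generate(ns));
    }
}
```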

After compiling this console application, I can generate code that is exactly the same as the one from the xsd.exe tool. Having this functionality at hand frees me completely from relying on that tool, and I no longer need to know whether it's installed or where it's located, to start a new process for it, etc. However, running it from the command-line over and over again, each time I modify a schema, is far from ideal. Microsoft® Visual Studio®.NET allows developers to take advantage of design-time code generation through so-called custom tools. One example is the typed dataset, where (although you don't need to specify it) a custom tool processes the dataset XSD file each time you save it, automatically generating the appropriate "code behind" class.

Building custom tools is out of the scope of this article, but you can read more about turning the code I wrote so far into the one at this weblog post. The code for the tool is included in this article download, and you can use it simply by assigning the "XsdCodeGen" custom tool name to an XSD file properties. Registration is explained in the accompanying readme file.

Even if I go and get the easier-to-use custom tool, replacing the xsd.exe tool with another one that does exactly the same doesn't look quite compelling, does it? After all, the reason we started all this was to change exactly that behavior! So, let's move on to customizing it from this baseline.

Extending XSD Processing

In order to customize the processing, I need to pass information onto the tool so that it knows what to change or process. There are two primary options here:

Add (potentially many) attributes to the XSD root <xs:schema> element that would be understood by my processor to apply customizations, similar to the typed dataset approach.

More on this here.
Use built-in XSD extensibility through schema annotations to allow arbitrary customizations. It simply adds types, to a sort of code generation pipeline, to execute after the basic generation has taken place.

The first approach is initially appealing because of its simplicity. I can just add an attribute and modify the processor accordingly to check for it:

Schema:

<xs:schema elementFormDefault="qualified" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:code="https://weblogs.asp.net/cazzu"
  code:fieldsToProperties="true">

Code:

XmlSchema xsd;
// Load the XmlSchema.
...
foreach (XmlAttribute attr in xsd.UnhandledAttributes)
{
  if (attr.NamespaceURI == "https://weblogs.asp.net/cazzu")
  {
    switch (attr.LocalName)
    {
      case "fieldsToProperties":
        if (bool.Parse(attr.Value)) ConvertFieldsToProperties(ns);
        break;
      ...
    }
  }
}

This is the approach you will generally see in other xsd->classes generators (you can find plenty of them at the Code Generation Network). Unfortunately, this approach leads to long switch statements, endless attributes, and ultimately to unmaintainable code and lack of extensibility.

The second approach is more robust, as it considers extensibility right from the beginning. XSD provides such extension facility through the <xs:annotation> element, which can be a child of almost any item in the schema. I will take advantage of it and its <xs:appinfo> child element in order to allow the developer to specify which (arbitrary) extensions to run, and in which order. Such an extended schema would look like the following:

<xs:schema elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <Code xmlns="https://weblogs.asp.net/cazzu">
        <Extension 
Type="XsdGenerator.Extensions.FieldsToPropertiesExtension, 
XsdGenerator.CustomTool" />
      </Code>
    </xs:appinfo>
  </xs:annotation>

Of course, each extension will need to implement a common interface, so that the custom tool can easily execute each one:

public interface ICodeExtension
{
  void Process( System.CodeDom.CodeNamespace code, 
                System.Xml.Schema.XmlSchema schema );
}

By providing such extensibility up-front, it's easy to move forward as new customization needs arise. Even the most basic ones can be implemented as extensions right from the beginning.

Extensible Code Generation Tool

I'll modify the Processor class to allow for this new feature, and will simply retrieve each <Extension> element from the schema. There's one caveat here, though: unlike the Post Schema Compilation Infoset properties exposed for elements, attributes, types, and so on, there's no typed property for annotations at the schema-level. That is, there's no XmlSchema.Annotations property. Iteration of the XmlSchema.Items generic pre-compilation property is therefore required to look for the annotations. What's more, after detecting an XmlSchemaAnnotation item, iteration of its own Items generic collection is required again, because there can be <xs:appinfo> as well as <xs:documentation> children, which also lack a typed property. When the content of the appinfo is finally reached, through the XmlSchemaAppInfo.Markup property, an array of XmlNode objects is all we get. You can imagine how the story goes; iterate the nodes, iterate its children again, and so on. This makes for pretty ugly code.
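For comparison, this is roughly what the nested iteration described above might look like when written against the SOM. It is a sketch, not the code from the download: the AnnotationWalker class and the extension type name in the sample schema are hypothetical, and List<string> assumes .NET Framework 2.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class AnnotationWalker
{
    public const string ExtensionNamespace = "https://weblogs.asp.net/cazzu";

    // Sample schema with a single, made-up extension type.
    public const string SampleXsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:annotation><xs:appinfo>" +
        "<Code xmlns='" + ExtensionNamespace + "'>" +
        "<Extension Type='My.Hypothetical.Extension, MyAssembly'/>" +
        "</Code></xs:appinfo></xs:annotation></xs:schema>";

    // Returns the Type attribute of every <Extension> in schema-level appinfo.
    public static List<string> GetExtensionTypes(XmlSchema schema)
    {
        List<string> result = new List<string>();
        foreach (XmlSchemaObject item in schema.Items)
        {
            XmlSchemaAnnotation annotation = item as XmlSchemaAnnotation;
            if (annotation == null) continue;

            foreach (XmlSchemaObject child in annotation.Items)
            {
                // Skips <xs:documentation> children.
                XmlSchemaAppInfo appInfo = child as XmlSchemaAppInfo;
                if (appInfo == null || appInfo.Markup == null) continue;

                foreach (XmlNode node in appInfo.Markup)
                {
                    XmlElement code = node as XmlElement;
                    if (code == null || code.LocalName != "Code" ||
                        code.NamespaceURI != ExtensionNamespace) continue;

                    foreach (XmlNode extension in code.ChildNodes)
                    {
                        XmlElement el = extension as XmlElement;
                        if (el != null && el.LocalName == "Extension" &&
                            el.NamespaceURI == ExtensionNamespace)
                        {
                            result.Add(el.GetAttribute("Type"));
                        }
                    }
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        XmlSchema schema = XmlSchema.Read(new StringReader(SampleXsd), null);
        foreach (string typeName in GetExtensionTypes(schema))
        {
            Console.WriteLine(typeName);
        }
    }
}
```

Four nested loops and several type checks are needed just to reach one attribute value.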

Luckily, an XSD file is nothing more than an XML file; XPath can therefore be used to query it.

For faster execution, I will keep a static compiled expression for the XPath in the Processor class, initialized in its static constructor:

public sealed class Processor
{
  public const string ExtensionNamespace = "https://weblogs.asp.net/cazzu";
  private static XPathExpression Extensions;
  static Processor() 
  {
    XPathNavigator nav = new XmlDocument().CreateNavigator();
    // Select all extension types.
    Extensions = nav.Compile
    ("/xs:schema/xs:annotation/xs:appinfo/kzu:Code/kzu:Extension/@Type");
    
    // Create and set namespace resolution context.
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nav.NameTable);
    nsmgr.AddNamespace("xs", XmlSchema.Namespace);
    nsmgr.AddNamespace("kzu", ExtensionNamespace);
    Extensions.SetContext(nsmgr);
  }

Note: More information on the benefits, details, and advanced applications of XPath precompilation and execution can be found at Performant XML (I): Dynamic XPath expressions compilation and Performant XML (II): XPath execution tips.

The Process() method needs to perform this query, and execute each ICodeExtension type it finds, just before returning the CodeNamespace to the caller:

XPathNavigator nav;
using ( FileStream fs = new FileStream( xsdFile, FileMode.Open ) )
{ nav = new XPathDocument( fs ).CreateNavigator(); }
XPathNodeIterator it = nav.Select( Extensions );
while ( it.MoveNext() )
{
  Type t = Type.GetType( it.Current.Value, true );
  // Is the type an ICodeExtension?
  Type iface = t.GetInterface( typeof( ICodeExtension ).Name );
  if (iface == null)
    throw new ArgumentException( "Invalid extension type '" + 
       it.Current.Value + "'." );
  ICodeExtension ext = ( ICodeExtension ) Activator.CreateInstance( t );
  // Run it!
  ext.Process( ns, xsd );
}
return ns;

I'm using Type.GetInterface() to test for interface implementation instead of Type.IsAssignableFrom(), because it seems to have less overhead, as it jumps to unmanaged code quickly. The effect is the same, although GetInterface() returns a Type (or null if the interface is not found), whereas IsAssignableFrom() returns a Boolean.
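The two checks can be compared side by side in a small sketch. The ICodeExtension declared here is a simplified stand-in without the Process method, and the two test classes are hypothetical.

```csharp
using System;

// Simplified stand-in for the article's interface (no Process method).
public interface ICodeExtension { }

public class GoodExtension : ICodeExtension { }
public class NotAnExtension { }

public static class InterfaceCheckDemo
{
    // Looks the interface up by name; a null result means "not implemented".
    public static bool ImplementsByName(Type t)
    {
        return t.GetInterface(typeof(ICodeExtension).Name) != null;
    }

    // Compares Type objects directly and returns a Boolean.
    public static bool ImplementsByAssignability(Type t)
    {
        return typeof(ICodeExtension).IsAssignableFrom(t);
    }

    public static void Main()
    {
        Type[] candidates = new Type[] { typeof(GoodExtension), typeof(NotAnExtension) };
        foreach (Type t in candidates)
        {
            Console.WriteLine(t.Name + ": " + ImplementsByName(t) + " / " +
                ImplementsByAssignability(t));
        }
    }
}
```

Because IsAssignableFrom() works on the Type object rather than on a name string, it cannot be confused by an unrelated interface that happens to share the same simple name, which may matter more than the small difference in overhead.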

Inside XmlSerializer

Having the CodeDom at hand brings a lot of power and flexibility to the developer looking for customizations, but it comes with greater responsibility, too. There's a risk of modifying the code in such a way that it will cease to serialize in a compatible way with the schema, or that the XmlSerializer functionality is broken altogether, with exceptions being thrown for unexpected nodes and attributes, failure to retrieve values, etc.

It's therefore absolutely required to know the internals of the XmlSerializer before playing with the generated code, and a way to know what's going on under the hood is certainly needed.

When an object is about to be XML-serialized, a temporary assembly is created by reflecting the type you pass to the XmlSerializer constructor (that's why you have to do so). Wait! Don't panic because of the "reflecting" word! This is done only once per type, and an extremely efficient pair of Reader and Writer classes are created to handle serialization and deserialization during the life of the AppDomain.

These classes inherit the public XmlSerializationReader and XmlSerializationWriter classes in the System.Xml.Serialization namespace. They are also [TheTopSecretClassName]. If you want to take a look at these dynamically generated classes, all you need to do is to add the following setting to the application configuration file (web.config for a web application):

<system.diagnostics>
  <switches>
    <add name="XmlSerialization.Compilation" value="4"/>
  </switches>
</system.diagnostics>

Now the serializer won't delete the temporary files generated in the process. For a web application, the files will be located in C:\Documents and Settings\[YourMachineName]\ASPNET\Local Settings\Temp; otherwise they will be located in your current user Local Settings\Temp folder.

You will see code that is exactly what you would have to do if you wanted to efficiently load XML in .NET: use nested while and if as you read, use XmlReader methods to move down the stream, etc. All the ugly code is there to make it really fast.

Problems in these generated classes can also be diagnosed by using Chris Sells' XmlSerializerPreCompiler tool.

We may look at this code in order to analyze the effect of a change in the serializer-generated classes.

Customizing with CodeDom

A number of customizations will be immediately appealing, as they are regular concerns with regards to xsd.exe tool-generated classes.

Turn Fields into Properties

One of the issues about which most developers complain is that the xsd.exe tool generates classes with public fields, instead of properties backed with private fields. The XmlSerializer-generated classes read and write values from the class' instance by using the regular [object].[member] notation. Of course, from the compilation and source code point of view, there's no difference if [member] is a field or a property.

So with the help of CodeDom, it's possible to change the default classes for the XSD. Thanks to the extensibility built into the custom codegen tool, all that is needed is to implement a new ICodeExtension. The extension will process each type in the CodeDom tree, whether it's a class or a struct:

public class FieldsToPropertiesExtension : ICodeExtension
{
  #region ICodeExtension Members
  public void Process( System.CodeDom.CodeNamespace code, 
                       System.Xml.Schema.XmlSchema schema )
  {
    foreach ( CodeTypeDeclaration type in code.Types )
    {
      if ( type.IsClass || type.IsStruct )
      {
         // Turn fields to props

Now I need to iterate each member of the type (can be fields, properties, methods, etc.) and process only the CodeMemberField ones. I can't simply perform a foreach over the type.Members collection, however, because for each field, I will need to add a property to the same collection. This would result in an exception, as the underlying enumerator used by the foreach construct would become invalid. I therefore need to copy the current members into an array, and iterate the array instead:

CodeTypeMember[] members = new CodeTypeMember[type.Members.Count];
type.Members.CopyTo( members, 0 );
foreach ( CodeTypeMember member in members )
{
  // Process fields only.
  if ( member is CodeMemberField )
  {
    // Create property

Next, I create the new property:

CodeMemberProperty prop = new CodeMemberProperty();
prop.Name = member.Name;
prop.Attributes = member.Attributes;
prop.Type = ( ( CodeMemberField )member ).Type;
// Copy attributes from field to the property.
prop.CustomAttributes.AddRange( member.CustomAttributes );
member.CustomAttributes.Clear();
// Copy comments from field to the property.
prop.Comments.AddRange( member.Comments );
member.Comments.Clear();
// Modify the field.
member.Attributes = MemberAttributes.Private;
Char[] letters = member.Name.ToCharArray();
letters[0] = Char.ToLower( letters[0] );
member.Name = String.Concat( "_", new string( letters ) );

Note that I copy over to the new property the field name, its member attributes, and type. I move the comments and the custom attributes (the XmlSerialization attributes) out of the field and into the property (AddRange() and Clear()). Finally, I make the field private and turn its first letter to lower case, prepending the "_" character, which is a fairly common naming convention for property backing fields.

The most important piece in a property is still missing: its get and set accessors implementations. As they are simply a pass-through to the field value, they're pretty straightforward:

prop.HasGet = true;
prop.HasSet = true;
// Add get/set statements pointing to field. Generates:
// return this._fieldname;
prop.GetStatements.Add( 
  new CodeMethodReturnStatement( 
    new CodeFieldReferenceExpression( 
      new CodeThisReferenceExpression(), member.Name ) ) );
// Generates:
// this._fieldname = value;
prop.SetStatements.Add(
  new CodeAssignStatement(
    new CodeFieldReferenceExpression( 
      new CodeThisReferenceExpression(), member.Name ), 
    new CodeArgumentReferenceExpression( "value" ) ) );

Finally, we just add the new property to the type:

  type.Members.Add( prop );
}
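Assembled into one piece, the fragments above might look like the following self-contained sketch. It is not the code from the download: the conversion is written as a static helper rather than an ICodeExtension, it runs against a hand-built CodeDom type instead of one produced by the XmlCodeExporter, and it assumes .NET Framework 2.0 or later for the direct GenerateCodeFromType call.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class FieldsToProperties
{
    // Replaces each field with a property backed by a private field.
    public static void Convert(CodeTypeDeclaration type)
    {
        // Snapshot the members first, because the loop adds to type.Members.
        CodeTypeMember[] members = new CodeTypeMember[type.Members.Count];
        type.Members.CopyTo(members, 0);

        foreach (CodeTypeMember member in members)
        {
            CodeMemberField field = member as CodeMemberField;
            if (field == null) continue;

            CodeMemberProperty prop = new CodeMemberProperty();
            prop.Name = field.Name;
            prop.Attributes = field.Attributes;
            prop.Type = field.Type;

            // Move serialization attributes and comments to the property.
            prop.CustomAttributes.AddRange(field.CustomAttributes);
            field.CustomAttributes.Clear();
            prop.Comments.AddRange(field.Comments);
            field.Comments.Clear();

            // Make the field private and rename it: pub_id becomes _pub_id.
            field.Attributes = MemberAttributes.Private;
            field.Name = "_" + char.ToLower(field.Name[0]) + field.Name.Substring(1);

            CodeFieldReferenceExpression fieldRef = new CodeFieldReferenceExpression(
                new CodeThisReferenceExpression(), field.Name);
            prop.HasGet = true;
            prop.HasSet = true;
            prop.GetStatements.Add(new CodeMethodReturnStatement(fieldRef));
            prop.SetStatements.Add(new CodeAssignStatement(
                fieldRef, new CodeArgumentReferenceExpression("value")));

            type.Members.Add(prop);
        }
    }

    // Builds a one-field class, converts it, and returns the C# source.
    public static string Demo()
    {
        CodeTypeDeclaration type = new CodeTypeDeclaration("Publisher");
        CodeMemberField field = new CodeMemberField(typeof(string), "pub_id");
        // Final keeps the resulting property from being emitted as virtual.
        field.Attributes = MemberAttributes.Public | MemberAttributes.Final;
        type.Members.Add(field);

        Convert(type);

        using (StringWriter writer = new StringWriter())
        {
            new CSharpCodeProvider().GenerateCodeFromType(
                type, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(Demo());
    }
}
```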

Now, a previous schema generating the following through the tool:

/// <remarks/>
[System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
public class Publisher
{
  /// <remarks/>
  public string pub_id;

After adding the corresponding extension to the schema:

<xs:schema elementFormDefault="qualified"  
xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <Code xmlns="https://weblogs.asp.net/cazzu">
        <Extension 
Type="XsdGenerator.Extensions.FieldsToPropertiesExtension,
       XsdGenerator.CustomTool" /> 
      </Code>
    </xs:appinfo>
  </xs:annotation>
  ...

Will now generate:

/// <remarks/>
[System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
public class Publisher
{
  private string _pub_id;
  /// <remarks/>
  public string pub_id
  {
    get
    {
      return this._pub_id;
    }
    set
    {
      this._pub_id = value;
    }
  }

Use Collections Instead of Arrays

In order for any decent read and write (having get+set properties) object model to be programmer-friendly, its multi-valued properties should be collection-based, not array-based. Doing so makes it easier to modify the values and manipulate the object graph. The usual approach involves deriving a new typed collection class from CollectionBase.

Before committing to change the CodeDom, the XmlSerializer's support for collections must be checked. Deep inside the classes that analyze and reflect the type to be serialized, there's an internal class named TypeScope. TypeScope is responsible for ensuring that serialization code can be generated. It contains an interesting method, ImportTypeDesc, which performs most checks and builds information for supported types. Here's where we find the special support for IXmlSerializable (it checks for security attributes in its members), Arrays (must have rank equal to 1), Enums, XmlNode, XmlAttribute and XmlElement, etc.

Specifically for collections, the import method checks for types implementing ICollection, which must satisfy the following rules:

Must have an Add method, which is not defined by the interface as it's usually created for the specialized type the collection will hold.
Must not implement IDictionary.
Must have a default member (i.e. an indexer) with a single parameter of type System.Int32 (C# int). Such a member will be searched for in all the type hierarchy.
Must not have any security attributes in Add, Count and indexer.

With this information verified, the generated specialized XmlSerializationWriter-derived class uses the Count property to iterate while writing the XML output for our type, instead of Length for array-based properties:

MyAssembly.MyCollection a = (MyAssembly.MyCollection)o.@CollectionProperty;
if (a != null) {
    for (int ia = 0; ia < a.Count; ia++) {
        Write10_MyCollectionItem(@"MyCollectionItem", 
          @"https://weblogs.asp.net/cazzu/", 
          ((MyAssembly.MyCollectionItem)a[ia]), false, false);
    }
}

Note that the indexed access to a collection and to an array is the same, given the previous check on the indexer, so there are no changes there.

The corresponding XmlSerializationReader-derived class uses the typed Add method to fill the collection:

MyAssembly.MyCollection a_2 = (MyAssembly.MyCollection)o.@CollectionProperty;
...
while (Reader.NodeType != System.Xml.XmlNodeType.EndElement) 
{
  if (Reader.NodeType == System.Xml.XmlNodeType.Element) 
  {
    if (((object) Reader.LocalName == (object)id8_MyCollectionItem && 
       (object) Reader.NamespaceURI == (object)id9_httpweblogsaspnetcazzu)) 
    {
      if ((object)(a_2) == null) 
        Reader.Skip(); 
      else 
        a_2.Add(Read10_MyCollectionItem(false, true));
    }
    ...

The read method shown above returns the appropriate type the collection expects:

MyAssembly.MyCollectionItem Read10_MyCollectionItem(bool isNullable, 
  bool checkType)

Now that XmlSerializer support for and proper handling of collection-based properties has been checked, it's safe to change all arrays to the corresponding strongly typed collections.
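A quick way to confirm this behavior is to round-trip a class that exposes a typed collection. The sketch below uses hypothetical Publisher and Catalog classes and a hand-written collection that follows the rules listed earlier (a typed Add method, an int indexer, no IDictionary).

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.Serialization;

public class Publisher
{
    public string pub_id;
}

// Typed collection derived from CollectionBase.
public class PublisherCollection : CollectionBase
{
    public int Add(Publisher value)
    {
        return InnerList.Add(value);
    }

    public Publisher this[int idx]
    {
        get { return (Publisher)InnerList[idx]; }
        set { InnerList[idx] = value; }
    }
}

// Hypothetical container exposing the collection instead of a Publisher[] array.
public class Catalog
{
    public PublisherCollection Publishers = new PublisherCollection();
}

public static class CollectionRoundTrip
{
    // Serializes the catalog to XML and reads it back into a new instance.
    public static Catalog RoundTrip(Catalog source)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Catalog));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, source);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return (Catalog)serializer.Deserialize(reader);
            }
        }
    }

    public static void Main()
    {
        Catalog catalog = new Catalog();
        Publisher first = new Publisher();
        first.pub_id = "0736";
        catalog.Publishers.Add(first);
        Publisher second = new Publisher();
        second.pub_id = "0877";
        catalog.Publishers.Add(second);

        Catalog copy = RoundTrip(catalog);
        foreach (Publisher p in copy.Publishers)
        {
            Console.WriteLine(p.pub_id);
        }
    }
}
```

The serializer reads items through Count and the indexer, and fills the deserialized instance through the typed Add method, so no array is involved at any point.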

This new extension can be designed to be run before or after the previous one. The difference is significant, as iteration would change from fields to the new properties, respectively. In order to make this extension independent of the previous one, I will code it to work against the fields. Note, however, that if it's configured to run AFTER the FieldsToPropertiesExtension, the code will be incorrect.

Let's first analyze the method that will build the custom collection. The collection should look like the following:

public class PublisherCollection : CollectionBase
{
  public int Add(Publisher value)
  {
    return base.InnerList.Add(value);
  }
  public Publisher this[int idx]
  {
    get { return (Publisher) base.InnerList[idx]; }
    set { base.InnerList[idx] = value; }
  }
}

The code to build this typed collection is:

public CodeTypeDeclaration GetCollection( CodeTypeReference forType )
{
  CodeTypeDeclaration col = new CodeTypeDeclaration( 
    forType.BaseType + "Collection" );
  col.BaseTypes.Add(typeof(CollectionBase));
  col.Attributes = MemberAttributes.Final | MemberAttributes.Public;
  // Add method
  CodeMemberMethod add = new CodeMemberMethod();
  add.Attributes = MemberAttributes.Final | MemberAttributes.Public;
  add.Name = "Add";
  add.ReturnType = new CodeTypeReference(typeof(int));
  add.Parameters.Add( new CodeParameterDeclarationExpression (
    forType, "value" ) );
  // Generates: return base.InnerList.Add(value);
  add.Statements.Add( new CodeMethodReturnStatement (
    new CodeMethodInvokeExpression( 
      new CodePropertyReferenceExpression( 
        new CodeBaseReferenceExpression(), "InnerList"), 
      "Add", 
      new CodeExpression[] 
        { new CodeArgumentReferenceExpression( "value" ) } 
      )
    )
  );
  // Add to type.
  col.Members.Add(add);
  // Indexer property ('this')
  CodeMemberProperty indexer = new CodeMemberProperty();
  indexer.Attributes = MemberAttributes.Final | MemberAttributes.Public;
  indexer.Name = "Item";
  indexer.Type = forType;
  indexer.Parameters.Add( new CodeParameterDeclarationExpression (
    typeof( int ), "idx" ) );
  indexer.HasGet = true;
  indexer.HasSet = true;
  // Generates: return (theType) base.InnerList[idx];
  indexer.GetStatements.Add( 
    new CodeMethodReturnStatement (
      new CodeCastExpression( 
        forType, 
        new CodeIndexerExpression( 
          new CodePropertyReferenceExpression( 
            new CodeBaseReferenceExpression(), 
            "InnerList"), 
          new CodeExpression[] 
            { new CodeArgumentReferenceExpression( "idx" ) } ) 
        )
      )
    );
  // Generates: base.InnerList[idx] = value;
  indexer.SetStatements.Add( 
    new CodeAssignStatement( 
      new CodeIndexerExpression( 
        new CodePropertyReferenceExpression( 
          new CodeBaseReferenceExpression(), 
          "InnerList"), 
        new CodeExpression[] 
          { new CodeArgumentReferenceExpression("idx") }), 
      new CodeArgumentReferenceExpression( "value" )
    )
  );
  // Add to type.
  col.Members.Add(indexer);
  return col;
}

At this point, here is a useful tip for programming against the CodeDom. See those seemingly endless Statements.Add lines? Of course, we could have broken them into separate lines, each one creating a temporary variable to hold the object and pass it to the next one. That would make them even more endless! Once you get used to it, the following tip is a good way to break those lines into their pieces:

To generate CodeDom nested statements, contiguous property/indexer/method accesses will usually be constructed from right to left.

In practice: to generate the following:

base.InnerList[idx]

You start from the indexer expression [idx], continue with the property access InnerList, and finish with the object reference base. This makes for the following CodeDom nested statement:

CodeExpression st = new CodeIndexerExpression( 
  new CodePropertyReferenceExpression( 
    new CodeBaseReferenceExpression(), 
    "InnerList"
  ), 
  new CodeExpression[] 
    { new CodeArgumentReferenceExpression( "idx" ) } 
);

Note that I create the statement from right to left, finally completing the appropriate constructor parameters. It's usually a good idea to manually indent and split the lines in such a way that it's easier to see where each object constructor ends, and which are its parameters.
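One way to check a nested expression like this is to render it back to source and compare it with what you intended. The sketch below assumes the CodeDomProvider.GenerateCodeFromExpression method, which is available in later versions of the .NET Framework than the one this article targets; the RightToLeftSketch class and its Render method are hypothetical names used only for this illustration.

```csharp
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class RightToLeftSketch
{
    // Builds base.InnerList[idx] and renders it as C# source text.
    public static string Render()
    {
        // Outermost constructor = rightmost access ([idx]);
        // innermost constructor = leftmost part (base).
        CodeExpression st = new CodeIndexerExpression(
            new CodePropertyReferenceExpression(
                new CodeBaseReferenceExpression(),
                "InnerList"
            ),
            new CodeExpression[]
                { new CodeArgumentReferenceExpression("idx") }
        );

        using (StringWriter writer = new StringWriter())
        {
            // Assumption: GenerateCodeFromExpression exists on the provider
            // (it does not in .NET Framework 1.1).
            new CSharpCodeProvider().GenerateCodeFromExpression(
                st, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }
}
```

Rendering small fragments this way while building a larger graph makes it easier to spot a misplaced constructor argument before the whole type is generated.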

Finally, the ICodeExtension.Process method implementation involves iterating types and their fields looking for array-based ones:

public class ArraysToCollectionsExtension : ICodeExtension
{
  public void Process( CodeNamespace code, XmlSchema schema )
  {
    // Copy as we will be adding types.
    CodeTypeDeclaration[] types = 
      new CodeTypeDeclaration[code.Types.Count];
    code.Types.CopyTo( types, 0 );
    foreach ( CodeTypeDeclaration type in types )
    {
      if ( type.IsClass || type.IsStruct )
      {
        foreach ( CodeTypeMember member in type.Members )
        {
          // Process fields only.
          if ( member is CodeMemberField && 
            ( ( CodeMemberField )member ).Type.ArrayElementType != null )
          {
            CodeMemberField field = ( CodeMemberField ) member;
            CodeTypeDeclaration col = GetCollection( 
              field.Type.ArrayElementType );
            // Change field type to collection.
            field.Type = new CodeTypeReference( col.Name );
            code.Types.Add( col );
          }
        }
      }
    }
  }
}

Just as I did before, I copied the collection I need to modify; in this case, the CodeNamespace.Types.

Further customizations could include adding [Serializable] to generated classes, adding DAL methods (e.g. LoadById, FindByKey, Save, Delete, etc.), generating members that are ignored for serialization and are used by your code (by applying XmlIgnoreAttribute), omitting generation of classes that belong to external imported schemas, etc.
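As an illustration of the first of these ideas, the following sketch adds the Serializable attribute to every class or struct in a CodeNamespace. The SerializableSketch class and its AddSerializable method are hypothetical names; in a real extension the loop body would presumably live inside an ICodeExtension.Process implementation like the ones shown earlier.

```csharp
using System.CodeDom;

public static class SerializableSketch
{
    // Marks every generated class or struct with [Serializable].
    public static void AddSerializable(CodeNamespace code)
    {
        // No copy of code.Types is needed here, because the loop only
        // changes each type's attributes and never adds or removes types.
        foreach (CodeTypeDeclaration type in code.Types)
        {
            if (type.IsClass || type.IsStruct)
            {
                type.CustomAttributes.Add(
                    new CodeAttributeDeclaration(
                        "System.SerializableAttribute"));
            }
        }
    }
}
```

Because the attribute only affects runtime (binary) serialization, it should not interfere with the XML serialization attributes already emitted for each type.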

Mapping Tips

If you are going to dig deeper into the code generation tool itself, or want to customize the schema processing even further, you may be interested in the following advanced issues with the codegen classes. If you're only going to develop extensions and manipulate the CodeDom, they will not add much value, and you can safely skip this section.

I have processed elements by retrieving their XmlTypeMapping; I haven't used any of its properties, though you may need them if you have to locate the CodeTypeDeclaration corresponding to an element. XmlTypeMapping properties and a brief description of their meaning can be found in the MSDN documentation. This class, however, is used in a number of scenarios, such as the SoapReflectionImporter mapping imports shown in the documentation. As for the XmlSchemaImporter I'm using, I've found that XmlTypeMapping.TypeFullName and XmlTypeMapping.TypeName behave incorrectly with one particular schema element design: if the element contains a single unbounded child inside a sequence, both properties incorrectly report the type of the child property.

So, for the following schema element:

<xs:element name="pubs">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="publishers" type="Publisher" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

Instead of having the "pubs" value, which is the type that will be generated, both XmlTypeMapping.TypeFullName and XmlTypeMapping.TypeName have the value "Publisher[]", which is the type of its only property. Should the sequence have more than one element, everything would work as expected. Note that this (apparent) bug applies whether the element's type is a named global type or not, or if the element itself is a reference or not.

Besides the type mapping, the XmlSchemaImporter can also retrieve the mappings that will apply to its members (fields). This is very useful, as the XSD/CLR type mapping—including XSD custom derived types—is resolved and you can be sure that it's the appropriate one used by the XmlSerializer. You can get the member mappings as follows:

XmlMembersMapping mmap = importer.ImportMembersMapping( 
  element.QualifiedName );
int count = mmap.Count;
for (int i = 0; i < count; i++)
{
  XmlMemberMapping map = mmap[i];
  //You have now: 
  //  map.ElementName
  //  map.MemberName
  //  map.TypeFullName
  //  map.TypeName
}

XmlMemberMapping.TypeFullName holds the namespace-qualified CLR type, whereas XmlMemberMapping.TypeName has the XSD type name. For example, for a member of XSD type "xs:positiveInteger", the former will be "System.String" and the latter "positiveInteger". If you didn't have access to this member mapping retrieval, you would have to know all the XSD-to-CLR type conversion rules used by the XmlSerializer. Note that these rules are not necessarily the same as those used for XSD validation and the DOM PSVI.

There's one important caveat (again, apparently a bug) to member importing. You can't reuse the XmlSchemaImporter or you will get an InvalidCastException thrown by the importing code, somewhere at XmlMembersMapping construction time. This can be solved by using a new instance of the importer each time.

With this information at hand, you can completely change the class appearance—for example, renaming properties to make the first letter upper case—without putting the serialization infrastructure at risk.

I said when I discussed the basis of the codegen classes that you can only retrieve (import) the mapping for globally defined elements; if you create your own custom attributes to modify the resulting classes, you will only be able to retrieve and analyze them for top-level elements, because you will only have the mappings for those. For example, let's say you add a code:className attribute, used by some extension to change the generated class name:

<xs:schema xmlns:code="https://weblogs.asp.net/cazzu" ...>
  <xs:element name="pubs" code:className="PublishersData">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="publishers" code:className="Publishers">
          <xs:complexType>

You will be able to retrieve the mappings for the pubs element, but not for the publishers child element. It won't be safe to process it, therefore, as the codegen classes may change in the future. Without the mapping at hand, you can't simply assume that the corresponding CodeTypeDeclaration will have the same name as the element (in order to locate it and change it). You can take the risk as an acceptable one, of course.

Conclusion

Reusing the built-in code generation features created for the XmlSerializer ensures that slight changes to the generated code do not break XML serialization. Direct manipulation of its output through CodeDom is also a plus for flexibility. I showed how a flexible processing of XML schemas can allow arbitrary extensions to alter the output, and developed some useful examples.

With this solid base, you can move to more advanced scenarios: external (imported/included) XSD schemas and their relation to code generation, manipulation of code output to reuse type definitions (both XSD and the corresponding generated .NET types) in application or company-wide repositories, etc.

I hope this is the kick start to your use of novel approaches to XSD storage and management, and the corresponding code generation and reuse.