Introducing Enterprise Metadata Management

06/22/2010

Hi there, my name is Pat Miller, and I am the development lead for the Enterprise Metadata / Taxonomy features in SharePoint 2010. I've been working on the ECM team and its forebears for the better part of 11 years now, first with NCompass Labs, which was acquired by Microsoft in 2001, then on the Content Management Server team, then with the CMS team as part of MOSS 2007. This is the first of many blog posts on the Enterprise Metadata Management (EMM) system in the 2010 release. This will be the overview of the system, and future posts will drill into specific areas like event receivers, field editing and search refinements.

First, some background. At one point during the development of Content Management Server 2002, we spent some time with the folks that run the Microsoft.com set of websites. One of the things they were very keen on was this taxonomy system that they had built. It seemed fairly useful, and we considered implementing something like it, but didn't have the time, and there was a general concern that no one would actually do the work of tagging data. During the development of MOSS 2007, we were spending most of our time rewriting our feature set to run on top of SharePoint, and once again, taxonomy fell off the list of things we were willing to tackle (and still, people would consistently say that people just don't tag).

Around this time, people started tagging things in their own world. The rise of digital cameras and MP3 players brought a huge amount of data that, for the most part, had to be marked up with metadata in order to be searchable. Some metadata was added to the files automatically (things like date, size, camera model, etc.), but specific user information wasn't there. You quickly learned that if you categorized the images (either through folder location or tags) you could navigate your way through tens of thousands of files (images, music, etc.) the way that works for you personally, rather than relying on default information like the date the picture was taken. People became more familiar with the concept of navigating their content via metadata - "Let's listen to all my Pearl Jam albums, I feel like listening to Electronica, find me photos of Dad". It's only a small step from that to wanting to impose some sort of hierarchy - find me photos of my whole family, my extended family, I want to listen to all classical music, or perhaps just from the Baroque period. Tagging all that data really unlocked a lot of potential.

Perhaps the landscape had changed...

We decided to run with it in the 2010 release. There were a few main tenets that we tried to let guide us:

1. No one (well, almost no one) applies metadata for the sheer joy of it. It's always for a purpose.
2. #1 means that the system has to exist for the end user's benefit. What can you do once you have this rich metadata applied?
3. For #2 to be realized, the metadata has to be present, which means that applying consistent metadata needs to be as easy and ubiquitous as possible.

To that end, we set out to enable a bunch of new user scenarios for SharePoint 2010.

We started out the release with a blank sheet of paper and some very knowledgeable people in the information management space. We also found that most people started twitching uncontrollably when the word "ontology" was mentioned. 'Tagging' was fine, 'metadata' was OK, at 'taxonomy' they started looking for an exit. Telling people that a taxonomy was just a hierarchy calmed them down, but the whole ontology thing was too much of a stretch. It also complicated things considerably, and we could still get a huge amount of value out of a taxonomy, so this was our starting point.

Some features were very obvious - filtering list views based on hierarchy inclusion, search refinement, etc. Some were a small step from this - if you have a consistent vocabulary across an enterprise, you can start to do some interesting things. You can match areas of expertise to specific content or workflows. You can start to relate content in totally different systems based on something with more context than a simple string. What if you could relate your analytics content to your taxonomy system and get a real-time view of what topics people are viewing instead of simply guessing based on their position in a URL namespace? How about overlaying your security model with your metadata so that certain people had rights to view content based on the metadata applied to it? How about we get down to business and focus our resources and ship a compelling collection of features.

To that end, we came up with the following components in the system:

The taxonomy repository itself, which we call the Term Store. Some companies have very strict, top-down taxonomies, so some term stores might have very few people allowed to edit them. We also have to support multiple term stores.

The taxonomy system needs to be able to support a complex enterprise. A simple flat list of strings isn't going to be sufficient. To that end, we support the following concepts and behaviors:

Terms - A term is the central object in the taxonomy system. It's the concept itself. It's very hard to come up with a name for a concept and have it be sufficiently descriptive and not too vague. Term is what we came up with.
Labels - Terms have to be known by a bunch of different names. When someone types "check" it should be the same thing as someone that types "cheque". "USA" and "United States" and "United States of America" are all referring to the same term. We call these names labels.
Default Label - It's a whole lot easier if one label is the default. You can find it through any of its synonyms, but we'll display the default label in most circumstances.
Termset - A collection of related terms in a hierarchy is a termset. Things like "locations" and "products".
Term Reuse - This is a key point of the system. If you have two termsets, "Capital Cities" and "Locations", the term "London" and all of its synonyms, etc. should be the same in both. We don't allow a term to have two parents in the same termset, but it can have two parents in different termsets.
Homographs - A homograph is a word that is spelt the same, but has a different meaning. You should be able to have a hierarchy that has "Paris" existing in both France and Texas. To keep things a bit more sane for the user, we don't allow homographs to have the same parent.
Multiple language support - A given term has a bunch of meaning associated with it. The translations belong to the term in the same way that synonyms do. If a term doesn't have a translation, we use the default language.
Groups - Groups in the taxonomy system are simply collections of termsets that share a common security assignment. Termsets and terms aren't ACL'd, groups are.
Deprecated terms - if a term shouldn't be used any more, it can be deprecated. This doesn't remove it from the system, you just can't apply it to new content moving forward.
Terms that are unavailable for tagging - this is slightly different from deprecated terms. A deprecated term is deprecated in all occurrences in the taxonomy and isn't shown to the user when tagging. Unavailable terms are only unavailable in a specific termset, and are still displayed when browsing the hierarchy at tagging time. The purpose of this is to allow things to be hierarchical without allowing people to tag with the wrong term. For example, in the Capital Cities termset, you might have continents in it so that people can find a particular city, but they would be marked as unavailable for tagging (with respect to Capital Cities) because they should not be selectable at tagging time.
Merging terms - at times, you might get multiple terms in the system that really are the same thing. They might be in the same termset, or they might be in different termsets. When you merge them, you get a single term with all of the properties, and this new term will be reused in all termsets in which the original terms existed.
Open Termsets - There are times when a highly managed taxonomy makes sense. You shouldn't be able to add random countries to the list of known countries. However, you probably don't want to give taxonomy editing permissions to everyone that is creating a new codeword. Open termsets allow content editors to add new terms to a hierarchy at content authoring time. It's a bit of a meeting point between bottom up folksonomies and top down taxonomies.
Keywords - The degenerate case of a folksonomy is simply a flat list of strings. They have no extra semantic meaning. This is the enterprise keywords termset. Terms here don't have a hierarchy, definitions, synonyms or translations. However, it is possible to move a keyword into a managed termset and add this additional data.
Local termsets - The taxonomy field type gives you all sorts of useful features, but you probably don't want "places to order food from" to wind up in your enterprise taxonomy. Local termsets are only visible within a single site collection.
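
To make the relationship between terms, labels, termsets and reuse a little more concrete, here is a minimal sketch in Python. It is only an illustration of the concepts above, not the SharePoint object model: the class names Term and TermSet, their methods, and the sample data are all hypothetical, and things like translations, groups and deprecation are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical model for illustration only; this is not the SharePoint API.
@dataclass(eq=False)  # eq=False keeps identity-based hashing, so terms can be dict keys
class Term:
    """A concept, known by a default label plus any number of synonyms."""
    default_label: str
    synonyms: set = field(default_factory=set)

    def labels(self):
        return {self.default_label} | self.synonyms


class TermSet:
    """A hierarchy of terms. A term has at most one parent per termset."""

    def __init__(self, name):
        self.name = name
        self.parent_of = {}  # Term -> parent Term, or None for a root term

    def add(self, term, parent: Optional[Term] = None):
        if term in self.parent_of:
            raise ValueError("a term cannot have two parents in one termset")
        # Homograph rule: two siblings may not share the same default label.
        for other, other_parent in self.parent_of.items():
            if other_parent is parent and other.default_label == term.default_label:
                raise ValueError("homographs cannot share a parent")
        self.parent_of[term] = parent

    def find(self, text):
        """Return every term in this termset that has a label matching text."""
        wanted = text.casefold()
        return [t for t in self.parent_of
                if wanted in {label.casefold() for label in t.labels()}]


# Sample data (made up for this sketch).
locations = TermSet("Locations")
capitals = TermSet("Capital Cities")

france, texas, london = Term("France"), Term("Texas"), Term("London")
usa = Term("United States", {"USA", "United States of America"})
paris_fr, paris_tx = Term("Paris"), Term("Paris")  # homographs: two distinct terms

for root in (france, texas, london, usa):
    locations.add(root)
locations.add(paris_fr, parent=france)
locations.add(paris_tx, parent=texas)

# Term reuse: the very same Term objects appear in a second termset.
capitals.add(london)
capitals.add(paris_fr)
```

In this sketch, locations.find("usa") resolves a synonym to the one "United States" term, locations.find("Paris") returns both homographs, and the "London" found in Capital Cities is the same object as the one in Locations, so a change to its labels shows up in both places.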

OK, that's a nice set of features in the taxonomy system. What do we want to do with all those terms and termsets?

The next set of features involve integrating the taxonomy system with SharePoint. The primary place this happens is in the new managed metadata field type. Think of it as a choice field that went to the gym. It's much more powerful. The metadata field type is a normal field that can be applied to any content type (list or document library). However it has a few nice things associated with it:

Termset binding - You can specify what termset a field should be bound to. You can have lots of fields bound to the same termset. When you update the termset, all of the bound fields use the changes immediately.
Path or node display - You can choose to display the default label of the term by itself "Paris" or its path "Europe > France > Paris".
Multi-lingual rendering - If a given term has been translated to a given language, when your UI is set to that language, the term translations are displayed.
Content type syndication - This isn't a taxonomy feature per se, but it's part of the enterprise metadata feature set. We allow a term store to have a site collection defined as its "hub". On that hub you can publish content types, and these content types will be pushed out to all consuming site collections. This means that in addition to having a consistent vocabulary across your enterprise, you can have a consistent set of content types using all that goodness.
Rich editing - when you are applying a term to an item, you can search across the entire termset (including synonyms) or view the tree itself. It makes it possible to choose from thousands of choices, which would normally break lookup and choice fields.
Editing support in the rich client applications - the document information panel in the Office client applications allows for applying terms.
Offline editing in the rich client applications - when you edit in the rich client applications, a copy of the bound termsets is cached locally. You can tag on the plane.
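
The path / node display and the translation fallback are easy to picture with a small sketch. The Python below is a hypothetical model, not how the field is implemented: the Term class, the label_for and render functions, the DEFAULT_LANGUAGE constant and the sample translations are all made up for illustration.

```python
from typing import Optional

DEFAULT_LANGUAGE = "en"  # assumed default language for this sketch


class Term:
    """Hypothetical term: one default label per language, plus a parent."""

    def __init__(self, labels: dict, parent: Optional["Term"] = None):
        self.labels = labels  # language code -> default label in that language
        self.parent = parent


def label_for(term, language):
    # Fall back to the default language when no translation exists.
    return term.labels.get(language, term.labels[DEFAULT_LANGUAGE])


def render(term, language=DEFAULT_LANGUAGE, show_path=False):
    """Render just the node label, or the full path from the root."""
    if not show_path:
        return label_for(term, language)
    parts = []
    node = term
    while node is not None:  # walk up to the root, collecting labels
        parts.append(label_for(node, language))
        node = node.parent
    return " > ".join(reversed(parts))


europe = Term({"en": "Europe", "de": "Europa"})
france = Term({"en": "France", "de": "Frankreich"}, parent=europe)
paris = Term({"en": "Paris"}, parent=france)  # no German translation

print(render(paris))                        # Paris
print(render(paris, show_path=True))        # Europe > France > Paris
print(render(paris, "de", show_path=True))  # Europa > Frankreich > Paris
```

Note that the last line mixes translated and untranslated labels: "Paris" has no German label in the sample data, so it falls back to the default language while its ancestors are shown in German.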

Once data is in SharePoint, other SharePoint features can deliver additional goodness:

Better listview filtering - not only can you filter in the normal "everything with value X" but you can also do inclusive filtering, displaying everything tagged with X or a child of X.
Better metadata navigation behavior - The metadata navigation feature allows you to navigate through libraries using hierarchies other than the folder hierarchy. The termset is one of the allowed hierarchy types, meaning that you can browse your libraries along multiple axes. You can now free your data from the tyranny of the URL or folder namespace.
Routing and policy - The document routing feature can direct your content based on the metadata applied to it. Taxonomy fields can even be used to create folder hierarchies at the routing destination. Retention policies can be driven off of taxonomy fields as well.
File open / save - Can't remember exactly where your document is stored in a large library? You can use the taxonomy field to filter the open dialog display.
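
Inclusive filtering ("X or a child of X") boils down to expanding a term into itself plus all of its descendants before matching. Here is a rough sketch of that idea in Python. It makes no claim about how the list view does it internally: the function names and sample data are hypothetical, and terms are identified by plain strings only to keep the example short (the real system can't do that, since homographs share a label).

```python
def descendants(children, root):
    """Return root plus every term below it in the hierarchy."""
    found = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def filter_items(items, children, term, inclusive=False):
    """Exact filtering matches only term; inclusive also matches its descendants."""
    wanted = descendants(children, term) if inclusive else {term}
    return [name for name, tag in items if tag in wanted]


# Made-up hierarchy (parent -> children) and tagged documents.
CHILDREN = {"Europe": ["France", "Germany"], "France": ["Paris"]}
ITEMS = [
    ("overview.docx", "Europe"),
    ("louvre.docx", "Paris"),
    ("berlin.docx", "Germany"),
    ("austin.docx", "Texas"),
]

print(filter_items(ITEMS, CHILDREN, "Europe"))
# ['overview.docx']
print(filter_items(ITEMS, CHILDREN, "Europe", inclusive=True))
# ['overview.docx', 'louvre.docx', 'berlin.docx']
```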

Now that we have all that nice consistent metadata on our content, we can do a few more things:

Content by query Web Part enhancements - You can configure the CBQ to filter based on taxonomy fields, including descendant inclusion.
Automatic search refinement - The search system is aware of all taxonomy fields, and if a result set has a sufficient amount of data with the same taxonomy fields, a search refinement will appear, allowing users to filter their data.
Power user profile and social tagging - it doesn't make much sense to have a corporate taxonomy and then do your social tagging using just string matching. All of the social properties are actually sourced from the taxonomy system, meaning that you won't get people asking you where a good place to stay in Paris, France is when you are an expert on Paris, Texas.

And since we know that we can't possibly implement every feature that everyone would want, everything is accessible through our API. In future blog posts, we'll go over how to use this API to deliver some compelling features.

Hopefully this is a nice introduction to the work we did around taxonomies and enterprise metadata. We had a lot of fun coming up with the design and implementation, and hope that it resonates with you.

Thanks for reading.

Pat.Miller at Microsoft.com
