3.1.1 Abstract Data Model

This protocol supports crawling sites that conform to a hierarchical pattern of content organization. This section specifies the hierarchical pattern in detail.

The elements of this hierarchy include Web pages, list items inside of various lists, and annotated documents inside of document libraries. The elements should conform to the following organization:

  • A root site, also called a site collection, consists of subsites or sites that are sometimes called "Webs." Both terms denote the same conceptual entity, but "site" emphasizes the unit of administrative management and user community, while "Web" emphasizes the unit of content organization. Each site corresponds to one Web called the "top-level Web site."

  • A site can contain other subsites.

  • Sites contain both predefined and custom lists. Each list has a schema that prescribes the fields in every list item.

  • Lists contain list items. Each list item has the same fields (shown as "Columns") prescribed by the list schema. Optionally, list items can have any number of attachments. Every list also has predefined fields, for example, ID and Last Modified Time. Those fields obtain their values automatically and have predefined meanings.

  • Document libraries are a special form of list. They contain documents annotated with fields prescribed by the document library schema.

  • Lists can contain folders and folders can contain subfolders. Folders help end users to visually organize contained material.
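
The organization described in the list above can be sketched as a small set of types. The following Python sketch is illustrative only: the names Site, SiteList, Folder, ListItem, add_item, and walk are hypothetical and are not defined by this protocol, and the way ID values are assigned (a simple counter) is an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Iterator

# Predefined fields named in the text; their values are filled in automatically.
PREDEFINED_FIELDS = ("ID", "Last Modified Time")


@dataclass
class ListItem:
    fields: dict[str, Any]                                # values keyed by field name
    attachments: list[str] = field(default_factory=list)  # optional, any number


@dataclass
class Folder:
    name: str
    items: list[ListItem] = field(default_factory=list)
    subfolders: list[Folder] = field(default_factory=list)


@dataclass
class SiteList:
    title: str
    schema: tuple[str, ...]            # custom fields prescribed for every item
    is_document_library: bool = False  # a document library is a special form of list
    root: Folder = field(default_factory=lambda: Folder(name=""))
    _next_id: int = 1                  # assumption: IDs come from a simple counter

    def add_item(self, values: dict[str, Any], folder: Folder | None = None) -> ListItem:
        # Every item must carry exactly the fields the list schema prescribes.
        if set(values) != set(self.schema):
            raise ValueError("item fields must match the list schema")
        fields = dict(values)
        fields["ID"] = self._next_id
        fields["Last Modified Time"] = datetime.now(timezone.utc)
        self._next_id += 1
        item = ListItem(fields)
        target = folder if folder is not None else self.root
        target.items.append(item)
        return item


@dataclass
class Site:
    name: str
    lists: list[SiteList] = field(default_factory=list)
    subsites: list[Site] = field(default_factory=list)


def walk(site: Site, depth: int = 0) -> Iterator[tuple[int, str, str]]:
    """Yield (depth, kind, name) for a site, its lists, and its subsites."""
    yield depth, "site", site.name
    for lst in site.lists:
        kind = "document library" if lst.is_document_library else "list"
        yield depth + 1, kind, lst.title
    for sub in site.subsites:
        yield from walk(sub, depth + 1)
```

In this sketch, a crawler-like consumer would call walk on the top-level site and descend into each list's root folder to reach the list items; the traversal order shown is a design choice for the example, not a requirement of the protocol.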

All elements of a site are identifiable either by a URL or by their identifiers, which are typically GUIDs. Identifiers might not be human-readable, but they are immutable for the whole lifecycle of an element. In contrast, the URL might change over time. For example, renaming a list or document changes its URL.
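
The difference between the two kinds of identification can be illustrated with a minimal index keyed by identifier. This is a sketch under stated assumptions: the ElementIndex class, its method names, and the example paths are hypothetical, and the identifiers are generated locally with uuid4 only to stand in for GUIDs that a real site would assign.

```python
from __future__ import annotations

import uuid


class ElementIndex:
    """Tracks elements by immutable identifier while their URLs may change."""

    def __init__(self) -> None:
        self._url_by_id: dict[uuid.UUID, str] = {}
        self._id_by_url: dict[str, uuid.UUID] = {}

    def add(self, url: str) -> uuid.UUID:
        # Assumption: a fresh random GUID stands in for the server-assigned one.
        element_id = uuid.uuid4()
        self._url_by_id[element_id] = url
        self._id_by_url[url] = element_id
        return element_id

    def rename(self, element_id: uuid.UUID, new_url: str) -> None:
        # The identifier stays the same; only the URL mapping is replaced.
        old_url = self._url_by_id[element_id]
        del self._id_by_url[old_url]
        self._url_by_id[element_id] = new_url
        self._id_by_url[new_url] = element_id

    def url_of(self, element_id: uuid.UUID) -> str:
        return self._url_by_id[element_id]

    def resolve_url(self, url: str) -> uuid.UUID | None:
        # Returns None for URLs that no longer (or never did) name an element.
        return self._id_by_url.get(url)


index = ElementIndex()
doc_id = index.add("/sites/team/Docs/draft.docx")       # hypothetical path
index.rename(doc_id, "/sites/team/Docs/final.docx")     # URL changes, GUID does not
```

After the rename, a lookup by the old URL finds nothing, while a lookup by identifier still reaches the same element. A consumer that stores references across crawls would therefore be better served by keying on the identifier than on the URL.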