A bit about WinInet's Index.dat
Since a recent digg article and its underlying Wikipedia entry seem a little confused about index.dat, I’d like to give some more detail about what it is and what we have changed with it in IE7/Vista’s version of WinInet. As Jeffdav explained a while back, the index.dat file is a store for web-related things: the URL content cache, cookies, RSS feeds, and visited links. Each of these collections, called a container, has its own index.dat file that lives in the user profile.
First, let’s talk about these containers in a bit more detail:
On most machines the biggest and most important container is the URL content cache index.dat. It lives (on Vista) at \Users\<user>\AppData\Local\Microsoft\Windows\Temporary Internet Files\Content.IE5\index.dat. Cacheable content that we fetch from the web, such as pages and images, is placed into this cache until it expires. The rules for whether content is cacheable and when entries expire from the cache are complex enough to warrant their own blog post, but the common reasons that content doesn’t go in the cache are that the server tells us not to via response headers, or that the user tells us not to save any SSL resources to disk via the “Do not save encrypted pages to disk” option in Internet Options->Advanced->Security. Each cache entry holds the URL and a file name, which lets us quickly find previously retrieved URLs and serve that content out of the content container. If a user just deletes all the files in the directory, the index.dat file will still contain all the URLs and paths until we notice that a cache entry is missing its file and delete that entry from index.dat.
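The lazy cleanup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WinInet's code or file format: the `ContentIndex` class, its methods, and the example URL and file name are all made up for the sketch, and a plain dictionary stands in for index.dat.

```python
import os
import tempfile


class ContentIndex:
    """Toy stand-in for the content container: maps a URL to a cached file path."""

    def __init__(self):
        self._entries = {}

    def add(self, url, path):
        self._entries[url] = path

    def urls(self):
        return list(self._entries)

    def lookup(self, url):
        """Return the cached file path, or None on a cache miss."""
        path = self._entries.get(url)
        if path is None:
            return None
        if not os.path.exists(path):
            # The backing file was deleted behind our back, so the entry
            # is stale. Drop it now that we have noticed.
            del self._entries[url]
            return None
        return path


def demo():
    url = "http://example.com/logo.png"  # hypothetical URL
    index = ContentIndex()
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "logo[1].png")
        with open(path, "wb") as f:
            f.write(b"fake image bytes")
        index.add(url, path)

        hit = index.lookup(url) == path
        os.remove(path)                    # user deletes the file by hand
        still_listed = url in index.urls() # the index has not noticed yet
        miss = index.lookup(url) is None   # lookup notices and purges
        purged = url not in index.urls()
    return hit, still_listed, miss, purged
```

Calling `demo()` walks through the whole sequence: a hit, a manual file deletion that leaves the URL in the index, and a later lookup that discovers the missing file and removes the entry.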
The visited container is a listing of the URLs that you click on when web browsing, which is how IE can do URL auto-completion and show the links that you have visited in a different color. This container is located on my Vista box in \Users\<user>\AppData\Local\Microsoft\Windows\History\History.IE5\index.dat. Visited only needs to know about each URL once, since you have either visited the site or you haven’t.
The history containers are a set of containers for the different date ranges that IE displays, like today, yesterday, last week, etc. These containers are in \Users\<user>\AppData\Local\Microsoft\Windows\History\History.IE5\MShist01<date><date>\index.dat. Again, the top-level links that you visit are stored in these containers. When the date shifts, IE does the bookkeeping, often by merging these buckets.
The cookie container maps the cookie URLs to individual cookie files. It is stored in \Users\<user>\AppData\Roaming\Microsoft\Windows\Cookies\index.dat. The index.dat contains the associated URL, the path to the cookie data, and other cookie metadata. You might notice that, unlike the other containers, this container is under a path called Roaming. This has to do with a domain feature that copies your preferences from machine to machine on a domain. Cookies are one of those types of settings.
You might have also seen that, starting in Vista, almost all the containers have a Low\ directory with another index.dat. That is because these are specially marked directories that IE in protected mode can access. We completely partition IE’s data between protected mode and normal mode. By design, normal mode only accesses the normal cache, cookies, etc., and by design and OS protection, protected mode only accesses the Low\ versions. The “how” of this partitioning is talked about on MSDN.
It’s important to note that pretty much all modern web browsers have to keep these types of data stores. Firefox (184.108.40.206 at least) uses different file formats for each of its index.dat equivalents, but they are there. The equivalent of the cache container index.dat is in Users\<user>\AppData\Local\Mozilla\Firefox\Profiles\59kuzm1n.default\Cache, in the _CACHE_* files. The other containers are in the Roaming version of the directory, over in Users\<user>\AppData\Roaming\Mozilla\Firefox\Profiles\59kuzm1n.default\. The history and visited containers are probably combined into one container, history.dat, and the cookies container is cookies.txt.
There is one thing pretty special about WinInet and hence the index.dat files: they are OS components that many applications use, including Explorer. That means that they were highly optimized for sharing data between processes. Each application’s copy of WinInet opens the file with read and write sharing, but not delete sharing. As long as any program is using WinInet, the index.dat file can’t be deleted. If you could delete it, the applications actively using the file would probably crash or start corrupting data in memory. This also means that many applications leave their own footprints in the different containers. For example, when Windows Media Player downloads an mp3 from a URL on the web, that file can end up in WinInet’s content cache.
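For readers who haven’t used Win32 share modes, here is a rough Python sketch of what “sharing read and write, but not delete” looks like when opening a file with the documented CreateFileW API. It is not WinInet’s actual code; the helper name `open_shared_no_delete` and the choice of access flags are assumptions for illustration, and the function only works on Windows.

```python
import ctypes
import sys
from ctypes import wintypes

# Win32 constants as defined in the Windows SDK headers.
GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
FILE_SHARE_DELETE = 0x00000004
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x00000080

# Other openers may read and write the file, but FILE_SHARE_DELETE is
# deliberately left out, so delete requests fail while the handle is open.
SHARE_MODE = FILE_SHARE_READ | FILE_SHARE_WRITE


def open_shared_no_delete(path):
    """Open an existing file for read/write without granting delete sharing."""
    if sys.platform != "win32":
        raise OSError("this sketch relies on the Win32 CreateFileW API")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.argtypes = [
        wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
        wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
    ]
    kernel32.CreateFileW.restype = wintypes.HANDLE
    handle = kernel32.CreateFileW(
        path,
        GENERIC_READ | GENERIC_WRITE,
        SHARE_MODE,
        None,                   # default security attributes
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        None,                   # no template file
    )
    # CreateFileW reports failure by returning INVALID_HANDLE_VALUE (-1).
    if handle is None or handle == wintypes.HANDLE(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())
    return handle
```

While a handle opened this way stays open, an attempt by another process to delete the file should fail with a sharing violation, which matches the behavior described above for index.dat.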
So what’s new in IE7? Well, the first thing is that IE made the interface for clearing up these files much simpler with “Cover My Tracks”. In support of this idea, WinInet made a bunch of improvements. The first improvement was in entry deletion. Those of you who remember the FAT file system on DOS might find the concepts behind this problem familiar. In DOS, when you delete a file, the file’s data is still around and special tools can undelete it unless some new file has already been written over it. The way we used to delete entries in the index.dat file was pretty similar: the old URL data was marked free but was still there, at least until it was overwritten by a new entry. In IE7 we now zero out the entry. Another problem was that some applications (cough Outlook Express cough) would write temporary files, like attachments, into the cache file directory to allow other applications to open them. If the index.dat file didn’t know about the file, we wouldn’t clean it up. Now when you use the “Delete Files…” button we delete everything in the directory, regardless of whether it’s in index.dat or not. There is one more feature in this area that I should mention even though it is not new. When we attempt to delete an entry from the cache but can’t delete the actual storage file, we still remove the entry from index.dat and put the file on a list of things to periodically try to clean up.
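The difference between the old mark-as-free deletion and the IE7 zeroing behavior can be shown with a toy in-memory index. This is a sketch under made-up assumptions, not the real index.dat layout: the 32-byte record size, the one-byte status flag, and the `ToyIndex` class are all hypothetical.

```python
RECORD_SIZE = 32      # hypothetical fixed record size, status byte included
FREE, IN_USE = 0, 1   # hypothetical status flag values


class ToyIndex:
    """Fixed-size records in one buffer: [status byte][URL padded with NULs]."""

    def __init__(self, slots):
        self.data = bytearray(slots * RECORD_SIZE)

    def write(self, slot, url):
        payload = url.encode("ascii")
        if len(payload) > RECORD_SIZE - 1:
            raise ValueError("URL too long for this toy record")
        start = slot * RECORD_SIZE
        self.data[start] = IN_USE
        self.data[start + 1:start + RECORD_SIZE] = payload.ljust(RECORD_SIZE - 1, b"\0")

    def delete_mark_free(self, slot):
        # Old style: flip the status flag only. The URL bytes stay in the
        # buffer until some later write happens to reuse the slot.
        self.data[slot * RECORD_SIZE] = FREE

    def delete_zeroed(self, slot):
        # IE7 style as described above: overwrite the whole record.
        start = slot * RECORD_SIZE
        self.data[start:start + RECORD_SIZE] = bytes(RECORD_SIZE)


index = ToyIndex(slots=2)
index.write(0, "http://example.com/a")
index.write(1, "http://example.com/b")
index.delete_mark_free(0)
index.delete_zeroed(1)

leftover_old = b"example.com/a" in bytes(index.data)  # still recoverable
leftover_new = b"example.com/b" in bytes(index.data)  # wiped
```

After both deletions, the first URL can still be found by scanning the raw bytes, while the second cannot, which is the whole point of zeroing the entry.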
-- Ari Pernick