Understanding the Excel .xls Binary File Format

Summary: Learn about the MS-XLS binary file format that is used in previously released Microsoft Excel products. Included in this article are the basic structures and key concepts for interacting with this file format programmatically.

上次修改时间: 2011年6月23日

适用范围: Excel | Excel 2010 | Office 2007 | Office 2010 | SharePoint Server 2010 | VBA

本文内容
Overview of the MS-XLS Binary File Format
Conclusion
Additional Resources

Applies to:   Microsoft Office Excel 2003 | Microsoft Excel 2002 | Microsoft Excel 2000 | Microsoft Excel 97

Published:   February 2011 | Provided by:   Microsoft Corporation

Contents

  • Overview of the MS-XLS Binary File Format

  • Key Components of the MS-XLS File Format

  • Extracting Data from Excel Files

  • Conclusion

  • Additional Resources

This article describes the structures and some procedures for working with MS-XLS files. It is the part of a series of articles that introduce the binary file formats used by Microsoft Office products. These articles are designed to be used in conjunction with the Office File Format Documents available on MSDN.

Overview of the MS-XLS Binary File Format

The MS-XLS binary file format is used by Microsoft Office Excel 2003, Microsoft Excel 2002, Microsoft Excel 2000, and Microsoft Excel 97. The format is organized into streams and substreams. Each spreadsheet worksheet is stored in its own substream. All of the data is contained in records that have headers, which give the record type and length. Cell records, which contain the actual cell data as well as formulas and cell properties, reside in the cell table. String values are not stored in the cell record, but in a shared strings table, which the cell record references. Row records contain property information for row and cell locations. Only cells that contain data or individual formatting are stored in the substream.

Microsoft Office Excel 2007 uses the MS-XLSB binary file format. This format is similar to MS-XLS but is not explicitly discussed in this article.

备注

The recommended way to perform most programming tasks in Microsoft Excel is to use the Excel Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Excel. This article series deals only with advanced scenarios, such as where Microsoft Excel is not installed.

Key Components of the MS-XLS File Format

The MS-XLS file format contains streams, substreams, and records. All the records in an MS-XLS document start with a 2-byte unsigned integer to specify Record Type (rt), and another for Count of Bytes (cb). Records may be read or skipped by reading these values, then either reading or skipping the number of bytes specified by cb, depending on the record type specified by rt.

A record cannot exceed 8224 bytes. If the data the record applies to is larger than that, the rest is stored in one or more continue records.

备注

The record descriptions in the [MS-XLS]: Excel Binary File Format (.xls) Structure Specification do not include mention of the Record Type (rt) and Count of Bytes (cb) values that make up the first four bytes of the record. For more information, see section 2.1.4 of the MS-XLS specification.

These are the main streams, substreams, and records in an MS-XLS format file. Specific byte locations within a record are counted from the end of the cb field.

  • Workbook stream

    The Workbook stream is the primary stream in an .xls file. It contains multiple substreams, each of which starts with a Beginning of File (BOF) record and ends with an End of File (EOF) record. The first stream is always the Globals substream, and the rest are sheet substreams. These include worksheets, macro sheets, chart sheets, dialog sheets, and VBA module sheets.

  • Globals substream

    The Globals substream specifies global properties and data in a workbook. It also includes a BoundSheet8 record for each substream in the Workbook stream.

    • BoundSheet8 record

      A BoundSheet8 record gives information about a sheet substream. This includes name, location, type, and visibility. The first 4 bytes of the record, the lbPlyPos FilePointer, specifies the position in the Workbook stream where the sheet substream starts.

  • Worksheet substream

    The Worksheet substream specifies a sheet in a workbook.

    • Cell Table

      The cell table is the part of a sheet stream where cells are stored. It contains a series of row blocks, each of which has a capacity of 32 rows of cells, and are filled sequentially. Each row block starts with a series of Row records, followed by the cells that go in the rows, and ends with a DBCell record, which gives the starting offset of the first cell of each row in the block.

    • Row record

      A Row record defines a row in a sheet. This is a complex structure, but only the first 6 bytes are needed for basic content retrieval. These give the row index and the columns of the first cells and last cells that contain data or unique formatting in the row.

    • Cells

      All of the cells in a row block are stored after the last row in the block. There are seven kinds of records that represent actual cells in a worksheet. Most cell records begin with a 6-byte Cell structure. The first 2 of those bytes specify the row, the next 2 bytes specify the column, and the last 2 bytes specify an XF record in the Globals substream that contains formatting information.

      The following records represent the different kinds of cells. Unless specified otherwise, the first 6 bytes are taken up by the cell structure, and the remaining bytes contain the value.

      • Blank cell record

        A Blank cell record specifies a blank cell that has no formula or value. This record type is used only for cells that contain individual formatting; otherwise, blank cells are stored in MulBlank records or not at all.

      • RK cell record

        An RK cell record contains a 32-bit number. Excel automatically converts numbers that can be represented in 32 bits or less to this format for storage as a way to reduce file size. Instead of a 6-byte cell structure, the first 2 bytes specify the row and the second 2 bytes specify the column. The remaining 6 bytes define the number in an RkRec structure for disk and memory optimization.

      • BoolErr cell record

        A BoolErr cell record contains a 2-byte Bes structure that may be either a Boolean value or an error code.

      • Number cell record

        A Number cell record contains a 64-bit floating-point number.

      • LabelSst cell record

        A LabelSst cell record contains a 4-byte integer that specifies a string in the Shared Strings Table (SST). Specifically, the integer corresponds to the array index in the RGB field of the SST.

      • Formula cell record

        A Formula cell record contains both the formula and the resulting data. The value displayed in the cell is defined in a FormulaValue structure in the 8 bytes that follow the cell structure. The next 6 bytes can be ignored, and the rest of the record is a CellParsedFormula structure that contains the formula itself.

      • MulBlank record

        A MulBlank record specifies a series of blank cells in a row.

        The first 2 bytes give the row, and the next 2 bytes give the column that the series of blanks starts at. Next, a variable length array of cell structures follows to store formatting information, and the last 2 bytes show what column the series of blanks ends on.

      • MulRK record

        A MulRk record is like a MulBlank record, but instead of blank cells, a MulRk record consists of RK data in RkRec structures.

    • Shared String Table (SST)

      The Shared String Table (SST) contains all of the string values in the workbook. These values are referenced in the worksheet by LabelSst cell records. The first 8 bytes of the SST give the number of references to strings in the workbook and the number of unique string values in the SST. The rest is an array of XLUnicodeRichExtendedString structures that contain the strings themselves as arrays of characters. Bit 16 of this structure specifies whether the characters are 1 byte or 2 bytes each. You can extend both the SST structure and the XLUnicodeRichExtendedString structure by using Continue records if the number or length of strings exceed their bounds.

Extracting Data from Excel Files

All MS-XLS format file content lives in the sheet substreams. Although you could load every sheet substream indiscriminately, you gain more control and efficiency by using the BoundSheet8 records to locate just the sheets you want to read. Parsing of formulas and formatting information is beyond the scope of this article.

The following procedure shows how to access all of the data from a worksheet.

备注

Specific byte locations within a record are counted from the end of the cb field.

To read the content of an Excel worksheet

  1. Create an internal data structure to hold the worksheet content.

  2. Define objects to represent each of the eight cell record types in memory.

  3. Open the Workbook stream and scan for the first instance of a BOF record. This is the beginning of the Globals substream.

  4. Read the Globals substream, loading the BoundSheet8 records and the SST into memory. For more details, see Globals.

  5. From the BoundSheet8 record that corresponds to the substream you want to open, read the first 4 bytes, which contains the lbPlyPos FilePointer.

  6. Go to the offset in the stream specified by the lbPlyPos FilePointer. This is the BOF record for the worksheet.

  7. Read the next record in the substream, which is the Index record, and load the array of pointers that starts at byte 16 of the Index record. Each pointer points to the stream position of a DBCell record.

  8. For each pointer in the array:

    • Read the corresponding DBCell record.

    • Go to the offset specified by the bytes 5–6 of the DBCell record and read into memory all of the cell records, starting at that point and ending with the last byte before the DBCell.

    • Copy the cell records to the objects that you defined in your internal data structure by record type.

    • Parse the cell data.

Conclusion

This is just a sampling of the MS-XLS file format. By using the tools that are provided in this article, simple data recovery should be within your reach. With additional exploration, you can start to recover formulas, formatting information, and other metadata. And eventually, even save operations become possible.

Additional Resources

For more information, see the following resources: