CeLog corruption

Posted by: Sue Loh

CeLog file corruption is a squirrely problem I haven't tracked down a reason for yet. I have never had a device that produced corruption for me, not in any reproducible manner (if ever, I can't remember hitting corruption myself but my memory is not good). But I look at log files that other people have gathered, and once in a while someone sends me a file that is corrupt. I have no way of knowing whether it's a problem in the logging engine, or in the flushing thread, or in the file system. If you happen to hit it regularly, let's talk.

What's corruption? There are three possible answers. A chunk of data may have been omitted. A chunk of garbage data may have been added. Or a portion of data may just be overwritten with garbage. Whichever way, the corruption is preceded by a series of valid CeLog events, and also eventually followed by another series of valid events. So the goal is to recognize the invalid data, and to recover as much of the surrounding valid data as possible.

The CeLog log file format dictates that every event starts with a CEL_HEADER value. This value contains the event ID (type), event size, and a flag saying whether the event has a timestamp. If the file is corrupt, then log viewing tools like Readlog and Remote Kernel Tracker will be walking along from header to header, and jump from a real event header into data that is not a header. It will read this value as if it was a header, though. As a result, the tools will get a garbage size, ID and timestamp flag. Typically what happens at this point is that the viewer tools jump a distance based on the garbage size, read another garbage header, jump another distance, etc. until by some chance they happen to land on a valid header. Through all of this jumping the tools may cross over quite a bit of valid data, because all data seems invalid if you don't actually read it on properly aligned boundaries. Even a single bit of header corruption can send the tools jumping off into the wrong places in the data.

On each invalid header that the tools read, different things may happen:

  • If the event ID in the garbage header happens to match a real event ID, then the tools will try to print the event as if it was of that type. Otherwise if the tools see an ID they don't understand, they don't do much with the event. They don't know how to process it. Readlog prints the event ID and size, while Remote Kernel Tracker just doesn't display the event.
  • If the timestamp flag in an invalid header happens to be TRUE, then the tools will print a garbage timestamp value. Otherwise they will print "--:--:--.---.---" to indicate that there was no timestamp.
  • Remote Kernel Tracker imposes a maximum size on the events it can display. If the size from the garbage header is larger than that maximum size, then Kernel Tracker will decide that the log is corrupt and refuse to display any more data.

If you look at the region of corruption in the output from Readlog, you'd see something like this:

 --:--:--.---.--- : SystemTLB, Count=17
 0:02:25.690.241 : ThreadSwitch,      hThread=0xA3E7BAAA
--:--:--.---.--- : SystemTLB, Count=10
 0:02:25.690.316 : ThreadSwitch,      hThread=0xA3DD20A2
 0:02:25.690.374 : SemaphoreRelease, hSem=0xC3DE5352
 0:02:25.690.392 : WaitForMultipleObjects, Timeout=66650111ms, WaitAll=0, NumObjects=1
 Object 0: 0x03F85880
--:--:--.---.--- : ID=1014, Len=27932
--:--:--.---.--- : ID=3456, Len=42897
--:--:--.---.--- : ID=3459, Len=21443
 0:02:26.566.560 : ThreadSwitch,      hThread=0x83E88DD2
 0:02:26.566.563 : Sleep, Timeout=0ms
 0:02:26.566.568 : ThreadSwitch,      hThread=0x83E88DD2
 0:02:26.566.570 : Sleep, Timeout=0ms
 0:02:26.566.574 : ThreadSwitch,      hThread=0x83E88DD2
 0:02:26.566.577 : Sleep, Timeout=0ms

From the Readlog output above, we can see that the corruption left Readlog jumping 3 times across invalid headers before it landed on a valid header again. All 3 invalid headers that Readlog jumped over happened to have the timestamp flag turned off. Note that you can tell approximately how much data was lost due to the corruption, because the timestamps jump approximately 900ms from before the corruption to after. Also the last event before the corruption, the WaitForMultipleObjects, may also be partially corrupted. The wait-timeout value (6650111ms) looks suspicious. The reason I don't think it is completely corrupted is because the timestamp looks real. The timestamp follows the header in the log file format, so the header is probably okay. The corruption probably in the CEL_WAIT_MULTI structure that follows the timestamp.

In the 3 invalid header jumps Readlog made, it jumped over about 90KB of data. That's a fair amount of data, even though the 3 little "ID=" events make it look small. Most of that data was probably valid. Usually I just write off the skipped data as an unfortunate casualty. However if you're desperate to see it, you have some chance of loading the file into your favorite hex editor, hacking around until you find the position of the corrupt data, looking to see where you think the valid data starts, and hand-editing the corrupt event header to make the tools jump to the next valid event header. It's not an easy task.

But now that I've started thinking about it I can't resist posting a little info. (Hey, today is a day off, why not waste time on random stuff. Yeah I know I need to GET A LIFE instead of blogging work stuff on my day off...) Below is a screen shot where I've opened up the file in a hex editor, and searched for "3F8FFFF" which is the hex representation of the suspicious timeout value 6650111. I put that value in yellow, and circled several complete events. Look at the doc for CEL_HEADER to understand the headers on the circled events. The 8 at the front means that a timestamp follows the header, and the size (the low WORD of the header) is the number of bytes that follow the timestamp.

The corrupt data lies between the red circled event, and the green one. (Oops sorry red/green colorblind people.) This corruption looks to have started on a 512-byte boundary, though actually I trimmed this log with "readlog -copy -range ..." before opening it in the hex editor so that may not really be true. Anyway I went and changed the header 0x8009000C on the corrupt WaitForMultipleObjects block to 0x800900D4 to add in the size of the chunk of corruption. Now tools step over all of the corruption to a valid header, and give me most of the ~90KB of otherwise lost data. I am not pasting in the fixed Readlog output, though, because it's not very interesting.

Back to the land of REALITY. To find a region of corruption like this in the Readlog output, I search for "ID=" or sometimes ": ID=". Of course that only works if you're not logging your own events with IDs that Readlog doesn't understand, because those would look very similar. If you use the summary view ("readlog -s") you would also be able to look at the list of events types to see if any unexpected event IDs appeared.

 Event counts:
   EventSet                       1
    WaitForMultipleObjects         2
    Sleep                          4
    SemaphoreRelease               1
    ThreadSwitch                   7
    ThreadCreate                 206
    ProcessCreate                 18
    SystemTLB                     78
    ModuleLoad                   184
  ID=1014                        1
    ID=3456                        1
    ID=3459                        1
 Sync end marker                2

Of course, the corruption might leave the tool thinking that it found an event with a valid type ID. Those wouldn't show up as "ID=" but as a real event type. I have seen it happen.