A case study on a OutOfMemoryException crash investigation

Recently, I encountered an OOM issue in our load runs. It took me a while to repro it and created a full dump. But once we got the dump, it was pretty straightforward to find the "leak".

1. Is this a native memory leak or .NET memory leak?

0:072> !address -summary

-------------------- Usage SUMMARY --------------------------
    TotSize (      KB)   Pct(Tots) Pct(Busy)   Usage
   390e8000 (  934816) : 44.58%    65.56%    : RegionUsageIsVAD
   28f79000 (  671204) : 32.01%    00.00%    : RegionUsageFree
    9c2f000 (  159932) : 07.63%    11.22%    : RegionUsageImage
    30cd000 (   49972) : 02.38%    03.50%    : RegionUsageStack
      49000 (     292) : 00.01%    00.02%    : RegionUsageTeb
   11246000 (  280856) : 13.39%    19.70%    : RegionUsageHeap
          0 (       0) : 00.00%    00.00%    : RegionUsagePageHeap
       1000 (       4) : 00.00%    00.00%    : RegionUsagePeb
       1000 (       4) : 00.00%    00.00%    : RegionUsageProcessParametrs
       2000 (       8) : 00.00%    00.00%    : RegionUsageEnvironmentBlock
       Tot: 7fff0000 (2097088 KB) Busy: 57077000 (1425884 KB)

-------------------- Type SUMMARY --------------------------
    TotSize (      KB)   Pct(Tots)  Usage
   28f79000 (  671204) : 32.01%   : <free>
    9eb7000 (  162524) : 07.75%   : MEM_IMAGE
    1c9c000 (   29296) : 01.40%   : MEM_MAPPED
   4b524000 ( 1234064) : 58.85%   : MEM_PRIVATE

-------------------- State SUMMARY --------------------------
    TotSize (      KB)   Pct(Tots)  Usage
   407c5000 ( 1056532) : 50.38%   : MEM_COMMIT
   28f79000 (  671204) : 32.01%   : MEM_FREE
   168b2000 (  369352) : 17.61%   : MEM_RESERVE

RegionUsageIsVAD is big and VirtualAlloc goes here (e.g., .NET heap).

0:072> !EEHeap -gc
Number of GC Heaps: 4
------------------------------
...
------------------------------
GC Heap Size:            Size: 0x2611db9c (638704540) bytes.

Ok, .NET memory usage is pretty high at 630 MB.

2. Which .NET objects are we leaking

 0:072> !dumpheap -stat
------------------------------
Heap 0
total 0 objects
------------------------------
Heap 1
total 0 objects
------------------------------
Heap 2
total 0 objects
------------------------------
Heap 3
total 0 objects
------------------------------
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
7aef7c48        1           12 System.Collections.Generic.SortedList`2+KeyList[[System.String, mscorlib],[System.Object, mscorlib]]
...

79b57ee8    84541    117150884 System.Object[]
1d6af048 15013412    420375536 Microsoft.VisualStudio.TestTools.Common.TimerResult

Ok, there are 15 millions of Microsoft.VisualStudio.TestTools.Common.TimerResult objects. !gcroot confirmed they are rooted.

3. Where are we leaking it?  

Use "!dumpheap -mt" and "!do" to examine a few of them:

0:072> !dumpheap -mt 1d6af048
------------------------------
Heap 0
 Address       MT     Size
009aa8e8 1d6af048       28    
009aabf8 1d6af048       28    

...

0133f0e8 1d6af048       28    
0133f104 1d6af048       28

 072> !dumpobj 0133f0e8
Name:        Microsoft.VisualStudio.TestTools.Common.TimerResult
MethodTable: 1d6af048
EEClass:     1e969b88
Size:        28(0x1c) bytes
File:        D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
79b9d470  40003d2        4        System.String  0 instance 08aa3eb4 m_name
79ba5af0  40003d3        8      System.DateTime  1 instance 0133f0f0 m_startTime
79ba4a94  40003d4       10      System.TimeSpan  1 instance 0133f0f8 m_duration
0:072> !dumpobj -nofields 08aa3eb4
Name:        System.String
MethodTable: 79b9d470
EEClass:     798d5ec0
Size:        58(0x3a) bytes
File:        D:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      _TWI_GetRandomWorkItem

Almost all TimerResult have the same name " _TWI_GetRandomWorkItem". And this is enough for us to track down the "leak". It turns out that in our test code, we have a poorly written algorithm to get a random sequence of items from a list. And it might leads to a "infinite" loop and keeps calling an method that calls TestContext.BeginTimer/TestContext.EndTimer. So each call will create a new Timer Result object. This will eventually cause the app to crash due to OutOfMemoryException.

We can also drill down further on these 15 million objects. Are they stored in a few collections? Yes, 4 big object[] are holding about 11 million of them.

0:022> !dumpheap -mt 79b57ee8 -min 1000000
------------------------------
Heap 0
 Address       MT     Size
11863098 79b57ee8 16777232    
total 0 objects
------------------------------
Heap 1
 Address       MT     Size
129d3cf8 79b57ee8 30433544    
58fc0038 79b57ee8 33554448    
total 0 objects
------------------------------
Heap 2
 Address       MT     Size
54420038 79b57ee8 33554448    
total 0 objects
------------------------------
Heap 3
 Address       MT     Size
total 0 objects
------------------------------
total 0 objects
Statistics:
      MT    Count    TotalSize Class Name
79b57ee8        4    114319672 System.Object[]
Total 4 objects

Spot check some entries in these huge object[] and verified that they are the same TimerResult objects with the same name _TWI_GetRandomWorkItem:

0:022> !dumparray -start 10000 -length 2 -details 11863098
Name:        Microsoft.VisualStudio.TestTools.Common.TimerResult[]
MethodTable: 79b57ee8
EEClass:     798d69a8
Size:        16777232(0x1000010) bytes
Array:       Rank 1, Number of elements 4194304, Type CLASS
Element Methodtable: 1d6af048
[10000] 0c7fab28
    Name:        Microsoft.VisualStudio.TestTools.Common.TimerResult
    MethodTable: 1d6af048
    EEClass:     1e969b88
    Size:        28(0x1c) bytes
    File:        D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
    Fields:
              MT    Field   Offset                 Type VT     Attr    Value Name
        79b9d470  40003d2        4            System.String      0     instance     08aa3eb4     m_name
        79ba5af0  40003d3        8          System.DateTime      1     instance     0c7fab30     m_startTime
        79ba4a94  40003d4       10          System.TimeSpan      1     instance     0c7fab38     m_duration
[10001] 0c7fab44
    Name:        Microsoft.VisualStudio.TestTools.Common.TimerResult
    MethodTable: 1d6af048
    EEClass:     1e969b88
    Size:        28(0x1c) bytes
    File:        D:\WINDOWS\assembly\GAC_MSIL\Microsoft.VisualStudio.QualityTools.Common\10.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.QualityTools.Common.dll
    Fields:
              MT    Field   Offset                 Type VT     Attr    Value Name
        79b9d470  40003d2        4            System.String      0     instance     08aa3eb4     m_name
        79ba5af0  40003d3        8          System.DateTime      1     instance     0c7fab4c     m_startTime
        79ba4a94  40003d4       10          System.TimeSpan      1     instance     0c7fab54     m_duration

0:022> !do -nofields 08aa3eb4    
Name:        System.String
MethodTable: 79b9d470
EEClass:     798d5ec0
Size:        58(0x3a) bytes
File:        D:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
String:      _TWI_GetRandomWorkItem

Anyway, we fixed the issue by switching to a better algorithm of picking a random sequence of items from a collection.

Have a good day.

Jiange