Living the PFE life
First of all, welcome back if you already know me. If you don’t know me: hi, I’m Robin. I’ve been a SharePoint developer/consultant for the last six years, and to bring you up to speed, I have an old blog from my previous employer that I would like to invite you to take a look at: http://community.zevenseas.com/blogs/robin
What is PFE? It stands for Premier Field Engineer and in short, the main purpose is to provide reactive and proactive support to Microsoft customers. For more info check out https://careers.microsoft.com/
So, how is my life any different from working as a SharePoint consultant/developer? Well, for starters, the duration of the engagements I have with customers is brought down to a maximum of 4 to 5 days, and most of the time it’s just 1 or 2 days. So that’s quite different from the 3-6 month engagements I was used to. And actually this is one of the things I really do like: the variety amongst the customers, the traveling and the experience of dealing with different cultures at such a fast pace.
In this post I wanted to share my experiences with the infamous (https://www.youtube.com/watch?v=GIGtHhAfe8w) critsits. What does a critsit actually mean? Well, it stands for “Critical Situation” and it means that whenever you’re on-call, you will get an IM/e-mail/phone call telling you to go to the customer as quickly as possible to solve their problem. For me it means getting in contact with the internal travel desk to arrange a flight (if I need to fly) and a hotel, and then just go. Does this mean that the critsit begins when I arrive on-site? No! Way before we even get called, there is another group of engineers working on the case. They are called the Escalation Engineers, and these guys are the real heroes, since they are figuring everything out behind the scenes. They give us, the PFEs, an action plan for what to do when we arrive on-site and tell us how to solve the problem if we are lost. Also, the data we gather as PFEs is sent to these guys to work on behind the scenes. So, to give you some examples, here are two experiences where I couldn’t have solved the problem without them.
Critsit #1 : “Search crawler hangs”
For this one, I was sent to a customer who was facing problems with their crawler. They had some dashboards that showed search-driven information, so it was mission critical for them to solve the hanging crawler because the dashboards were showing outdated information. So, first things first, I took a look at their Search Administration page and indeed, the crawler had been running for 145 hours and nothing was happening. Then I did what almost everyone would do in that situation: check the Event Viewer and the ULS logs.. and they showed NOTHING related to the search service.. Aargh#1
So, I tried restarting the search service to see what would happen. Well, as you would have guessed, once it was running again, the crawler returned to the exact same state it was in previously.. Aargh#2
At this point I had a call with the Escalation Engineer about what to do next, and he pointed me in the direction of SQL, to determine whether there might be a deadlock between two processes. And guess what, there was a deadlock! Now I will not go into much detail here, but it turned out that a stored procedure was in an infinite loop (remember, you should never ever touch the database! ;).
To fix this, we modified this particular stored procedure to be able to break out of the loop. Then we restarted the search service, did a reset of the index and stopped the search service again. After that we restored the stored procedure, started the search service and did a full crawl. Everything was working again and the crawl completed within 1 hour.
And everyone was happy.. Now, did it feel dirty? YES! Did it feel unnatural? YES! Why? Well, I was doing something that is on the list of forbidden things to do. Even though we work for Microsoft, we were still very hesitant to do this, and it was a last-resort action we had to perform..
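For readers wondering what "determining if there is a deadlock" boils down to conceptually, here is a minimal sketch in Python. It is not what we ran at the customer and it does not talk to SQL Server at all: it simply assumes you already have a "who is blocked by whom" mapping (the session ids below are made up) and looks for a cycle in it. The function name find_deadlock is hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of session ids that wait on each other in a cycle, or None.

    waits_for maps a blocked session id to the session id that blocks it.
    This is an illustrative assumption, not output from any real tool.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of blockers until it ends or repeats.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain came back to a session we already visited: a cycle.
            return seen[seen.index(current):]
    return None


# Made-up example: session 51 waits on 52, and 52 waits on 51.
deadlocked = find_deadlock({51: 52, 52: 51})
# Made-up example: a plain blocking chain without a cycle.
just_blocked = find_deadlock({51: 52, 52: 53})
print(deadlocked, just_blocked)
```

A plain blocking chain ends at a session that is not waiting on anyone, while a deadlock loops back on itself, which is why following the chain is enough for this toy case.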
Critsit #2 : “w3wp process crashes randomly”
When I read the problem description, I immediately thought of custom code that had some ‘dispose’ problems. So, I asked for the custom code, to look at it in my hotel room. And, unfortunately for me, the code was perfectly fine, meaning that something else was crashing the w3wp process. The question of course was.. what was it? After excluding some scenarios, I checked the ULS logs, the IIS logs and the Event Viewer to find out if there was a correlation between these three. According to the ULS logs, ‘nothing’ was wrong at the moment of crashing. The next suspect was the amount of memory being consumed, but while the process was peaking at 1.2 GB of memory, it crashed at 600 MB. So that couldn’t be the problem.
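The log correlation described above is basically "show me everything each log recorded around the moment of the crash". As a minimal sketch of that idea in Python, assuming the entries from each log have already been parsed into (timestamp, message) pairs: the function names, the 30-second window and all the sample entries are made up for illustration, and parsing the real ULS, IIS or Event Viewer formats is left out entirely.

```python
from datetime import datetime, timedelta


def entries_near(crash_time, entries, window_seconds=30):
    """Keep the (timestamp, message) pairs within +/- window_seconds of the crash."""
    window = timedelta(seconds=window_seconds)
    return [(ts, msg) for ts, msg in entries if abs(ts - crash_time) <= window]


def correlate(crash_time, sources, window_seconds=30):
    """Apply the same time window to every log source, keyed by source name."""
    return {
        name: entries_near(crash_time, entries, window_seconds)
        for name, entries in sources.items()
    }


# Made-up sample data, only to show the shape of the input.
crash_time = datetime(2000, 1, 1, 10, 15, 0)
sample_sources = {
    "uls": [
        (datetime(2000, 1, 1, 10, 14, 50), "made-up ULS entry"),
        (datetime(2000, 1, 1, 9, 0, 0), "made-up older ULS entry"),
    ],
    "iis": [(datetime(2000, 1, 1, 10, 15, 5), "made-up IIS entry")],
    "eventlog": [],
}

nearby = correlate(crash_time, sample_sources)
for name, hits in nearby.items():
    print(name, len(hits))
```

If every source comes back empty or uninteresting around the crash time, as it did here for the ULS logs, that is a hint that the cause sits below the level these logs can see.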
Yet again it was time to call the Escalation Engineer to help out. Now, I’m not going into a lot of detail again, but it turned out that something was corrupting the memory heap, and the garbage collector was detecting this and therefore (gracefully) crashed the w3wp process. How did we manage to get so far? Well, there is a tool called windbg (check out Download and Install Debugging Tools for Windows), and you attach this debugger to a process, in our case the ‘faulty’ w3wp process, and then set a breakpoint at the point where it crashed all the time. Then, when it hits the breakpoint, we create a dump of the memory, and then some magic happens here at Microsoft, and it turned out that another thread was doing this, this thread being the garbage collector thread.
Now to give you some background, this customer was running SharePoint 2007 with a custom implementation of EBS (the predecessor of RBS), which has some known issues, but none of them were applicable here. So, we couldn’t work this out, meaning that the customer has to upgrade to the latest CU and then the Product Group is going to take a look at it.. Aargh#3 Conclusion: problem not solved yet, will keep you posted on the outcome!
Now for me, these two critsits were awesome. I mean, I had only heard of the tool windbg, I did not know what a corrupt memory heap was, and now I know! Although at times it felt like being a puppet in a way, I did learn a lot from this. And that’s the thing I wanted to get across: even after being a SharePoint developer for six years, I’m still learning SharePoint! :)