SP2010: Search Query Load Balancing Explained (part 2)

In part 1, I described the load balancing that occurs among Query Components (i.e. SSA-level load balancing). In this post, I'm expanding the focus to the load balancing involved when there are multiple instances of the Search Query & Site Settings service, or SQ&SS (i.e. Farm-level load balancing), and from this, beginning to explain why the "Internal Server Error Exception Occurred" failure is intentionally ambiguous.

I'll then wrap up this series in a third installment that builds upon this foundation to describe troubleshooting tactics for Query failures. As noted before, this series of posts will focus on SharePoint 2010, but I'm already working on a similar post that will focus on SharePoint 2013.

Search Query & Site Settings (SQ&SS) Revisited

Let's start with a quick review of the load balancing in SharePoint 2010 within Search. In SharePoint 2010, the SQ&SS acts as the Query Processor for the SSA by:

  • Load balancing queries to the appropriate mirror for each Index Partition
  • Issuing Property Store queries
  • Merging/sorting results, security trimming
  • And removing [near] duplicates.

When a user issues a query (e.g. https://loadBalancedURL/results.aspx?k=foo), the WFE (more specifically, the SP Web Application) has no idea how to process a Search query itself. However, the Web App does know how to talk to an SSA's WCF Service EndPoint (implemented by the SQ&SS) defined in the [default] Service Connection (aka “Service App Proxy”). The SQ&SS processes the query and then returns the result set as XML to the Search Web Parts, where the results are rendered.

 

Primer on WCF EndPoints and SharePoint Service Applications

When it comes to explaining the SharePoint Service Application architecture (an essential aspect of understanding Farm-level load balancing), no one has explained it better than Spencer Harbar in his post:

"When you start a service machine instance for which there is an associated Service Application, an IIS Virtual Application will be created within the SharePoint Web Services IIS Web site. This will include the Service Application Endpoint (a WCF or ASMX). Each service application must expose a service application endpoint. The service application endpoint is only created on the machine(s) hosting the service machine instance. "

In short, the WCF EndPoint is the point of interaction between a client (in this case, the WFE) and the application being consumed (in this case, the SSA).

Reference: WCF Endpoints: Addresses, Bindings, and Contracts

"All communication with a Windows Communication Foundation (WCF) service occurs through the endpoints of the service. Endpoints provide clients access to the functionality offered by a WCF service. Each endpoint consists of four properties: an address that indicates where the endpoint can be found, a binding that specifies how a client can communicate with the endpoint, a contract that identifies the operations available, and a set of behaviors that specify local implementation details of the endpoint."

As an overly simplistic analogy for WCF, think of a company [that represents an application]. If you [the client] wanted to talk to a live person, you would have to call 555-555-1234 [the "address"] on the telephone [the binding] and navigate through the automated message tree [the contract]. Further, this company may have multiple phone numbers (555-555-4321, 555-555-6789), but each phone number gets you to the same company (i.e. multiple endpoint addresses for the same application).

Similarly, a SharePoint Service Application can have multiple WCF EndPoints. The WCF EndPoint for Search is structured as https://[someServerName]:32843/-ssa-guid-/SearchService.svc (where someServerName is the server hosting this service EndPoint), and one is created on each SharePoint server where the SQ&SS is started. For example, in my farm where the SQ&SS is started on two servers, the two EndPoints are https://initech:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc and https://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc.
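To make the address structure concrete, here is a minimal Python sketch that builds the EndPoint addresses from a server name and the SSA GUID, and pulls the server name back out of an address. The function names (search_endpoint, endpoint_host) are my own, purely for illustration; they are not part of any SharePoint API.

```python
from urllib.parse import urlsplit

# Port used by the example EndPoints in this post.
SEARCH_SERVICE_PORT = 32843


def search_endpoint(server: str, ssa_guid: str) -> str:
    """Build the SearchService.svc address for one server running the SQ&SS."""
    return f"https://{server}:{SEARCH_SERVICE_PORT}/{ssa_guid}/SearchService.svc"


def endpoint_host(endpoint: str) -> str:
    """Return the server name portion of an EndPoint address."""
    return urlsplit(endpoint).hostname


# The two servers and the SSA GUID from the example farm above.
ssa_guid = "066239ec05e347a88106bde8749f8cc9"
endpoints = [search_endpoint(s, ssa_guid) for s in ("initech", "swingline")]

for address in endpoints:
    print(endpoint_host(address), "->", address)
```

The point is simply that every EndPoint for a given SSA shares the same GUID path, and only the server name differs.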

  • Note: In SharePoint 2010, the SQ&SS can run on ANY server in the farm (at a minimum, it just needs to be started on at least one server in the farm using either PowerShell or the "Services on Server" page in Central Admin).
    • In SharePoint 2013, the SQ&SS still provides the SearchService.svc WCF EndPoint, but it is essentially just a shim that sits in front of the Query Processing Component (QPC). That said, the SQ&SS should ONLY be started on servers that have a QPC.

At the SharePoint Farm level, the Application Load Balancing Service Application (the “Farm Topology” service) keeps track of the state of each Service Application's WCF EndPoint(s) and helps the WFEs load balance across the EndPoints behind each Service Connection (aka “Service App Proxy”). For all SharePoint Service Applications, a Web Application Service Connection is essentially a reference to the particular Service Application's WCF EndPoint(s).

Reference: "How I Learned to Stop Worrying and Love the SharePoint Topology Service" provides a great deep dive into the SharePoint Topology load balancing.

 

Multiple SQ&SS in the Farm

Let's now extend our original diagram to illustrate multiple SQ&SS components. To keep it simple, I'm intentionally showing just one Query Component in order to focus specifically on the SharePoint load balancing. See part 1 for how an SQ&SS load balances at the SSA level when there are multiple Index Partitions and/or multiple Mirrors per Partition.

In this example, the WFE would send the query to one instance of the SQ&SS via WCF SOAP requests. If a user then submitted a second query, the WFE would round-robin this second query to the next instance of the SQ&SS.
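The round-robin behavior described above can be sketched roughly as follows. This is only an illustration of the rotation, not how SharePoint implements it: the class name RoundRobinEndpoints is hypothetical, and the sketch deliberately ignores the EndPoint state tracking that the Topology service performs.

```python
from itertools import cycle


class RoundRobinEndpoints:
    """Hand out EndPoint addresses in rotation, one per query."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = cycle(endpoints)

    def next_endpoint(self):
        """Return the EndPoint that the next query should be sent to."""
        return next(self._cycle)


# The two EndPoints from the example farm.
balancer = RoundRobinEndpoints([
    "https://initech:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc",
    "https://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc",
])

for query_number in range(1, 4):
    print(query_number, balancer.next_endpoint())
```

With two EndPoints, the first and third queries land on the same server and the second lands on the other, which is exactly why a failure on a single SQ&SS server can look "sporadic" from the user's perspective.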

To follow this in ULS (Hint: This is the BEST way to start troubleshooting query failures):

  1. Start at the WFE server that fielded the request (such as https://web01/results.aspx?k=foo)
    • Note: This is a common point of confusion because it seems logical to start looking for Search logging on one of the Search servers. However, remember that the Search web parts are hosted on the WFE, and thus the WFE is going to be the client reaching out to the Search Service Application. Because the SSA may have multiple WCF EndPoints, we need to start at the WFE to determine which Search WCF EndPoint got called.
  2. Look for the ULS event with "WcfSendRequest" such as the one below to see the request to the SSA

12/03/2013 09:06:58.21 w3wp.exe (0x2958) 0x2AB4
SharePoint Foundation Topology e5mc Medium
WcfSendRequest:
  RemoteAddress:
    'https://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc'
  Channel:
    'Microsoft.Office.Server.Search.Administration.ISearchServiceApplication'
  Action: 'https://tempuri.org/ISearchQueryServiceApplication/Execute'
  MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5'
f41cb190-945e-458e-b924-77ec2fd066d4

The ULS entry above (to reiterate, from the WFE server) provides a few key pieces of data:

  • First, the "WcfSendRequest" furthers the point that the WFE is the client requesting some action (in this case, "execute" a query) from the SSA. In other words, the WFE is "sending" a request.
  • Second, the "RemoteAddress" helps us see that the request was sent to the "swingline" server.
  • Third, it provides the Correlation Id for this request (the GUID on the last line of the entry).
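If you want to pull those pieces out programmatically, a small Python sketch along these lines may help. It assumes the entry is laid out as it is shown in this post (a raw ULS log file is laid out differently, so the patterns would need adjusting), and the function name parse_wcf_entry is my own invention.

```python
import re
from urllib.parse import urlsplit

# The WcfSendRequest entry from the WFE, as shown above.
SAMPLE = """12/03/2013 09:06:58.21 w3wp.exe (0x2958) 0x2AB4
SharePoint Foundation Topology e5mc Medium
WcfSendRequest:
  RemoteAddress:
    'https://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc'
  Channel:
    'Microsoft.Office.Server.Search.Administration.ISearchServiceApplication'
  Action: 'https://tempuri.org/ISearchQueryServiceApplication/Execute'
  MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5'
f41cb190-945e-458e-b924-77ec2fd066d4
"""

GUID = r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"


def parse_wcf_entry(text):
    """Extract the request kind, address, host, MessageId and Correlation Id."""
    kind = re.search(r"\b(WcfSendRequest|WcfReceiveRequest)\b", text)
    address = re.search(r"(?:RemoteAddress|LocalAddress):\s*'([^']+)'", text)
    message_id = re.search(r"MessageId:\s*'([^']+)'", text)
    # The Correlation Id is the GUID that sits alone on its own line.
    correlation = re.search(rf"^\s*({GUID})\s*$", text, re.MULTILINE)
    if not (kind and address and correlation):
        raise ValueError("not a recognizable WcfSendRequest/WcfReceiveRequest entry")
    return {
        "kind": kind.group(1),
        "address": address.group(1),
        "host": urlsplit(address.group(1)).hostname,
        "message_id": message_id.group(1) if message_id else None,
        "correlation_id": correlation.group(1),
    }


entry = parse_wcf_entry(SAMPLE)
print(entry["kind"], entry["host"], entry["correlation_id"])
```

The "host" value tells you which server's ULS to open next, and the "correlation_id" value is what you filter on once you get there.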

From here, go to the ULS on the "swingline" server and filter by the same Correlation Id (in this case "f41cb190-945e-458e-b924-77ec2fd066d4") to see the corresponding "WcfReceiveRequest", which is the acknowledgement in ULS that the request has been received at the specified "LocalAddress":

10/03/2013 09:06:58.48 w3wp.exe (0x2A8C) 0x130C
SharePoint Foundation Topology e5mb Medium 
WcfReceiveRequest:
  LocalAddress:
    'https://swingline:32843/066239ec05e347a88106bde8749f8cc9/SearchService.svc'
  Channel: 'System.ServiceModel.Channels.ServiceChannel'
  Action: 'https://tempuri.org/ISearchQueryServiceApplication/Execute'
  MessageId: 'urn:uuid:eb7296b7-ef26-4f7d-87f7-dceb6251d2f5' 
f41cb190-945e-458e-b924-77ec2fd066d4

Troubleshooting: the Basics

Although I'm going to defer troubleshooting strategies to a subsequent blog post, I will note that after the WcfReceiveRequest you should find all of the related query processing activity (e.g. Tripoli connections to the various Query Component(s), queries to the Property Store, merging/sorting results, security trimming, and removing [near] duplicates), all of which maintains the same Correlation Id that we found on the WFE. In other words, this Correlation Id spans from server to server for all events related to this query. (Russ Maxwell's post does a great job walking through a query step-by-step.)

Also, if you can't find the corresponding WcfReceiveRequest on the SQ&SS server (in this case, "swingline"), then the request from the WFE to the SQ&SS server was most likely lost in-flight (proxy issues, network failures, and/or a missing SharePoint Web Services IIS Web site are the most common causes that I've seen).

In other words, if the request were to fail within SharePoint per se, then you would see the WcfReceiveRequest followed by some failure. However, if the "WcfReceiveRequest" is missing, then SharePoint (more specifically, the SQ&SS) never received the request.
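That decision can be written down as a tiny triage function. This is a sketch only: it assumes you have already collected the relevant ULS entries from the WFE and from the SQ&SS server into lists of dictionaries with "kind" and "correlation_id" keys (a shape I made up for illustration), and the returned labels are likewise my own.

```python
def triage_query(correlation_id, wfe_entries, sqss_entries):
    """Classify where to look next for a query with the given Correlation Id."""

    def has(entries, kind):
        return any(
            e["kind"] == kind and e["correlation_id"] == correlation_id
            for e in entries
        )

    if not has(wfe_entries, "WcfSendRequest"):
        # Nothing was sent from this WFE; check the other WFEs behind the NLB.
        return "no-send-found"
    if not has(sqss_entries, "WcfReceiveRequest"):
        # Sent by the WFE but never acknowledged by the SQ&SS.
        return "lost-in-flight"
    # The SQ&SS got the request; any failure is further downstream.
    return "received-by-sqss"


cid = "f41cb190-945e-458e-b924-77ec2fd066d4"
wfe = [{"kind": "WcfSendRequest", "correlation_id": cid}]
sqss = [{"kind": "WcfReceiveRequest", "correlation_id": cid}]

print(triage_query(cid, wfe, sqss))
print(triage_query(cid, wfe, []))
```

For the example in this post, both entries exist, so the next place to look would be the rest of the events on "swingline" with that same Correlation Id.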

And finally, if you happen to receive the completely generic "Internal Server Error Exception Occurred" with a query, this simply means that some error occurred when the WFE sent the request to the SSA, whether in-flight or somewhere further downstream within SharePoint. I've seen countless blog posts, forum threads, and so on suggesting fixes for this error. However, this error is completely ambiguous and not intended to indicate what error occurred... only that some error occurred.

That said, with the error being ambiguous by nature, there is no single fix when this message occurs. If you receive this message, use the corresponding Correlation Id to find the request on the applicable WFE and step through the flow of events from the WFE to the SQ&SS to see what actual error occurred.

 

In Summary - Three Layers of Load Balancing

As I noted at the start of part 1, seemingly “sporadic” query problems are often just straightforward failures being masked by the three levels of load balancing involved with a SharePoint 2010 Search Query. My goal here has been to help unravel all the moving pieces.

When a user submits a query (e.g. to the Enterprise Search Center), the request is typically first load balanced by an NLB to a particular WFE. The WFE then sends the request to the SSA's WCF Service EndPoint (implemented by the SQ&SS), but it is the SharePoint Farm topology load balancing that determines which WCF EndPoint the WFE will contact, and subsequent requests will be round-robin'd (this isn't a word, but I'm going with it) to the next SQ&SS EndPoint.

Once an SQ&SS receives the request from the WFE, the SQ&SS needs to reach out to one Query Component for each Index Partition (part 1 goes deeper into load balancing at the SSA level). If an Index Partition has multiple Query Components (as in the earlier example, where Partition1 => QC1a,QC1b and Partition2 => QC2a,QC2b), then the SQ&SS will round-robin its requests to one QC mirror from each Partition. More specifically, in the first query, the SQ&SS may reach out to QC1a and QC2b. Then, in a second query, the SQ&SS cycles to QC1b and QC2a. In the third query, the SQ&SS again returns to QC1a and QC2b.
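The per-Partition rotation can be illustrated with a short Python sketch. The helper name mirror_picker is hypothetical, and the starting mirror for each Partition is chosen here only so that the output reproduces the sequence described above; the actual starting point inside the SQ&SS is not something this post specifies.

```python
from itertools import cycle


def mirror_picker(partitions):
    """Return a function that picks one Query Component per Partition per query.

    `partitions` maps a Partition name to its mirrors, listed in the order
    in which they should be used.
    """
    for name, mirrors in partitions.items():
        if not mirrors:
            raise ValueError(f"{name} has no Query Components")
    cycles = {name: cycle(mirrors) for name, mirrors in partitions.items()}

    def pick():
        # Advance every Partition's rotation by one step.
        return {name: next(rotation) for name, rotation in cycles.items()}

    return pick


# Assumed starting order: Partition1 begins at QC1a, Partition2 at QC2b.
pick = mirror_picker({
    "Partition1": ["QC1a", "QC1b"],
    "Partition2": ["QC2b", "QC2a"],
})

for query_number in range(1, 4):
    print(query_number, pick())
```

Each query touches exactly one mirror per Partition, so a single unhealthy mirror only affects the queries whose rotation happens to land on it.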

In upcoming posts, I plan to dive deeper into troubleshooting Query failures as well as write a related post that focuses on SharePoint 2013.