Configuring load balancing for Windows #Azure #HPC Burst Scenarios
Whilst testing high performance Excel Workbook Offloading, our project team found that the substantial amount of Azure muscle in our Azure subscription wasn't being fully used. Since this might be a situation you encounter, I thought I'd share it with you.
With some help from the engineering team we were pointed at the Configuration Files for HPC 2008 R2. In our scenario Excel should be running on each core distributed across the Azure service (as described in the Node Template). What we found, however, was that the number of EXCEL.EXE processes did not correspond to the number of HPC Macro partitions being executed. It was close, but not the 1:1 we expected.
It's here that you should consider looking at the broker settings defined in the service configuration file.
Changing the broker load balancing attributes improved/fixed the core allocation per partition. With the default settings you will typically find that there is some "oversubscription" (e.g. 2-3 times).
Specifically: set serviceRequestPrefetchCount="0" on the loadBalancing element in the ExcelService_1.0.xml config file. The element looks like this, shown here with a prefetch count of 1:
<loadBalancing messageResendLimit="3" serviceRequestPrefetchCount="1" serviceOperationTimeout="86400000" endpointNotFoundRetryPeriod="300000"/>
According to the SOA Service Configuration Files in Windows HPC Server 2008 R2 documentation, if the prefetch count is set to 1, two requests are sent to each "service instance".
A service instance is an instance of the SOA service host process. In Excel workbook offloading you will typically use core allocation, which gives one service instance per core. With node allocation there would be one per node.
Setting the prefetch count to 0 is only an optimization if the number of requests is close to the number of service instances.
Normal SOA jobs tend to have many requests per service instance, which is why the default prefetch count is higher than 0.
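To make the arithmetic concrete, here is a rough Python sketch of why a prefetch count of 1 can leave cores idle when the number of partitions is close to the number of cores. It is only a simplified model, not the broker's actual algorithm: I am assuming each service instance is handed prefetch count + 1 requests up front until the requests run out. The function name busy_instances is made up for this illustration.

```python
import math


def busy_instances(requests: int, instances: int, prefetch_count: int) -> int:
    """Estimate how many service instances receive work.

    Simplified model (an assumption, not the real broker logic): each
    instance is given prefetch_count + 1 requests before the next
    instance gets any.
    """
    per_instance = prefetch_count + 1
    return min(instances, math.ceil(requests / per_instance))


# 8 partitions on 8 cores with a prefetch count of 1: only half the cores get work.
print(busy_instances(requests=8, instances=8, prefetch_count=1))
# Same job with a prefetch count of 0: one partition per core.
print(busy_instances(requests=8, instances=8, prefetch_count=0))
# Many requests per instance: the prefetch count no longer leaves cores idle.
print(busy_instances(requests=1000, instances=8, prefetch_count=1))
```

Under this model the prefetch count only matters when requests are scarce relative to service instances, which matches what we saw with the Excel partitions.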
After changing this value to 0 we saw one EXCEL.EXE process per partition (and therefore per core!).
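If you would rather script the change than edit the file by hand, something along these lines should work. This is a sketch using Python's standard xml.etree.ElementTree on an in-memory fragment. The configuration wrapper element is just a stand-in I added so the fragment parses on its own, and set_prefetch_count is a hypothetical helper. The real ExcelService_1.0.xml contains other sections that are not shown here.

```python
import xml.etree.ElementTree as ET

# Stand-in fragment: only the loadBalancing element comes from the real file;
# the <configuration> wrapper is an assumption for this example.
FRAGMENT = (
    "<configuration>"
    '<loadBalancing messageResendLimit="3" serviceRequestPrefetchCount="1" '
    'serviceOperationTimeout="86400000" endpointNotFoundRetryPeriod="300000"/>'
    "</configuration>"
)


def set_prefetch_count(xml_text: str, value: int) -> str:
    """Return xml_text with serviceRequestPrefetchCount set on every loadBalancing element."""
    root = ET.fromstring(xml_text)
    for element in root.iter("loadBalancing"):
        element.set("serviceRequestPrefetchCount", str(value))
    return ET.tostring(root, encoding="unicode")


updated = set_prefetch_count(FRAGMENT, 0)
print(updated)
```

Keep a backup of the original config file before writing anything back, since ElementTree does not preserve comments or the original formatting.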
In summary, it's worth bearing in mind these service configuration settings, such as serviceRequestPrefetchCount and messageResendLimit.