question

AshifShaikh-9988 asked · ErlandSommarskog commented

Different execution plan on different replicas in SQL Server 2016 Always On

Hello,

I have a strange issue wherein I am getting a different execution plan on different secondary replicas in a SQL Server 2016 SP2 Always On availability group.

Architecture overview:
An 8-node cluster, with the primary and 2 secondary replicas in a physical data center and 5 secondary replicas in AWS.

The problematic query works perfectly fine on the 3 replicas in the physical data center but doesn't perform well on any of the secondary replicas in AWS: there we get a non-clustered index scan plus an Index Spool on a particular table, while the execution plan on the physical nodes shows an index seek for the same table.

Why am I getting a different execution plan and performance problems on the nodes in AWS?

FYI, the table in question is a large table which had 200 million rows; we deleted around 150 million rows from it. We did an index reorganize, a stats update and everything, however the issue remains the same.
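One way to see whether the statistics actually reflect those deletions is to look at when each statistics object was last updated and how many rows it sampled. A minimal sketch; `dbo.BigTable` is a placeholder for the table in question:

```sql
-- dbo.BigTable is a hypothetical name standing in for the affected table.
-- last_updated, rows_sampled and modification_counter show whether the
-- stats are stale and what sampling rate the last update used.
SELECT s.name            AS stats_name,
       sp.last_updated,
       sp.rows,                 -- rows in the table at the last update
       sp.rows_sampled,         -- rows the update actually sampled
       sp.modification_counter  -- modifications since the last update
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.BigTable');
```

A large gap between `rows` and `rows_sampled` would mean the update ran with the default sampling rate rather than a full scan.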

I have checked the following article, but my query is working perfectly fine on the replicas in the physical data center; the problem is only with the replicas in AWS:
https://docs.microsoft.com/en-us/troubleshoot/sql/availability-groups/query-performance-degradation

One strange thing: I am getting huge worktable logical reads.
Table 'Worktable'. Scan count 92, logical reads 14109650, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



Also, everything was working perfectly fine before the deletion, issue started with the deletion.

sql-server-general · sql-server-transact-sql

ErlandSommarskog answered · DanGuzman commented

Optimization is always a bit of a gamble. Keep in mind that the optimisation is performed individually on each machine, and even though the statistics are the same on all nodes, the outcome does not have to be the same.

Off the top of my head, I can think of two reasons:
1. Available memory. The amount of memory available to SQL Server can affect the plan choice. I recall a story where someone had a query that ran fine when "max server memory" was set to 352 GB. But when they increased the memory to 356 GB, the result was a disaster. SQL Server now thought there was enough memory for a certain sort or hash operation, but there was a misestimate and there was serious disk spilling.
2. A timing issue, so that you get an optimisation timeout before the optimiser arrives at the good plan.

So this is what you should check:
1. The setting for "max server memory".
2. Are there any sort or hash spills in the execution plans? (Look for yellow warning triangles on these operators.)
3. Right-click the left-most operator and select Properties. Check for Optimization Level and reason for early termination of Optimization.
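A sketch of how the first and third checks could be scripted instead of clicking through SSMS; the `LIKE` filter is a placeholder you would replace with a distinctive fragment of the problem query:

```sql
-- Check 1: the "max server memory" setting on each replica.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)';

-- Check 3: the same information the Properties dialog shows lives in the
-- cached plan XML as StatementOptmLevel / StatementOptmEarlyAbortReason.
SELECT st.text, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE st.text LIKE '%some distinctive query text%';  -- placeholder filter
```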


  1. Are there any sort or hash spills in the execution plans? (Look for yellow warning triangles on these operators.)

Yes, I see a sort operator which takes 71% of the total cost, which could be due to the index scan; however, there is no such sort operator or index scan on the replicas in the physical data center.

  1. Right-click the left-most operator and select Properties. Check for Optimization Level and reason for early termination of Optimization.

Optimization level is FULL, no reason for early termination though.

  1. The setting for "max server memory".

Max server memory is set to 350 GB out of 380 GB on the server.

Also, everything was working perfectly fine before the deletion, issue started with the deletion.

One strange thing: I am getting huge worktable logical reads.
Table 'Worktable'. Scan count 92, logical reads 14109650, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Can you upload the execution plan XML of the primary and secondary plan to Paste The Plan and add the links here?



Yes, I see a sort operator which takes 71% of the total cost, which could be due to the index scan

Note that 71% is only an estimate. The actual cost can be higher or lower. Seems to be higher in this case.

Max server memory is set to 350 GB out of 380 GB on the server.

On the replicas in AWS? What about the on-prem machines?

One strange thing: I am getting huge worktable logical reads

Table 'Worktable'. Scan count 92, logical reads 14109650, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

That certainly sounds like a spill. It could also be a spool operator.

Also, everything was working perfectly fine before the deletion, issue started with the deletion.

Yes, when you deleted lots of data, there was a shakeup of statistics and presumptions for the optimiser's work.


Updating the statistics with FULL SCAN resolved the issue @DanGuzman @ErlandSommarskog

However, any hints on what could be the reason why the query was performing well on the secondary replicas in the physical data center but not on the secondary replicas in AWS?

Thanks in advance..

ErlandSommarskog answered · ErlandSommarskog commented

Updating the statistics with FULL SCAN resolved the issue

Yes, you did a reorganize, which does not touch the statistics. You ran UPDATE STATISTICS separately, but I guess that was with the default sampling rate. Had you performed an index rebuild instead, you would have gotten fullscan statistics included. (Provided that you had not run a separate updatestats operation afterwards, that is!)
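The distinction can be sketched like this; `dbo.BigTable` is a placeholder for the affected table:

```sql
-- dbo.BigTable is a hypothetical name standing in for the affected table.

-- Reorganize: compacts the index but does NOT touch statistics.
ALTER INDEX ALL ON dbo.BigTable REORGANIZE;

-- Rebuild: refreshes the index statistics with the equivalent of a full scan
-- (a later default-sampled UPDATE STATISTICS would overwrite that again).
ALTER INDEX ALL ON dbo.BigTable REBUILD;

-- Explicit full-scan statistics update, which is what resolved the issue here.
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
```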

However, any hints on what could be the reason why the query was performing well on the secondary replicas in the physical data center but not on the secondary replicas in AWS??

So this is what I answered in my initial post. Since you have destroyed all evidence, I don't think we can say much more. The machines have different amounts of memory, and as I discussed, that could result in different plans.


So this is what I answered in my initial post. Since you have destroyed all evidence, I don't think we can say much more. The machines have different amounts of memory, and as I discussed, that could result in different plans.



If the problem happens again, you could first try DBCC FREEPROCCACHE for the problem query or plan handle to see if you get a better plan. A side effect of updating stats is that the query was recompiled on all nodes. We can't really say whether that fixed a parameter-sniffing problem or whether the updated stats were needed for a better plan.
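A sketch of that targeted eviction; the `LIKE` filter and the plan handle value are placeholders:

```sql
-- Find the plan handle for the problem query (placeholder text filter).
SELECT cp.plan_handle, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%some distinctive query text%';

-- Evict only that plan, rather than flushing the entire plan cache.
-- Paste the plan_handle value from the query above.
DBCC FREEPROCCACHE (0x05000500...);
```

Evicting a single plan forces a recompile of just that query on its next execution, which keeps the blast radius small on a busy server.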


I saw that the statistics sampling on the remote nodes was different from the physical nodes; maybe this could be one of the reasons. But I am not sure why it just updated the statistics on the physical nodes and not on the remote nodes. I have never seen such issues before; however, this happened only after millions of rows were updated during a short period of time.

cc @ErlandSommarskog @DanGuzman


Now that you say it... I said above that statistics are replicated from the primary to the secondary. But SQL Server can also create temporary statistics on the read-only secondary that are saved in tempdb. And it can perform an auto-update of existing statistics on the secondary, if SQL Server deems them to be stale; again, saved in tempdb. So maybe something like this happened here.
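Those tempdb-resident statistics are flagged in the catalog, so you can check for them on a readable secondary. A sketch; `dbo.BigTable` is a placeholder for the affected table:

```sql
-- Run on the readable secondary. Statistics that SQL Server created or
-- refreshed locally live in tempdb and are flagged is_temporary = 1
-- (their names also carry a _readonly_database_statistic suffix).
SELECT s.name, s.is_temporary, sp.last_updated, sp.rows_sampled
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.BigTable')
ORDER BY s.is_temporary DESC;
```

If the AWS replicas show temporary statistics with a low `rows_sampled`, that would explain why they optimised the query differently from the on-prem replicas.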

Since I don't know much about this, I turned to Google which pointed me to this article from Rajendra Gupta, that I found quite informative: https://www.sqlshack.com/sql-server-statistics-in-always-on-availability-groups/
