IRL #6: Availability Group Performance

Problem:

My client came to me with their biggest company-wide issue. They are a retailer with many independent sales reps. These reps tend to enter a majority of their orders into the system during the last 4 hours of each month, and enter just enough to hit the next commission level. The data-entry part is fine, but the commission reports get way behind on showing the orders just entered.

Background:

2 node SQL 2016 Availability Group. Automatic Failover, Synchronous commit. One readable secondary. The reports traffic has to be pointed to the Primary if it gets more than 10 minutes out of sync.

Investigation:

I’ve only done the basics of Availability Group work at this point – Install, add/remove databases, etc. so I knew I needed to brush up on the various phases of the data movement process, especially since I was under the impression that if data was entered successfully on the Primary replica, it HAD to be entered and visible on the Secondary replica. This impression did not mesh with what the client was telling me. Also, we were only 5 days away from EOM (end-of-month).

The best link I found to learn about these phases was from Microsoft’s CSS SQL Escalation Services team:

Troubleshooting data movement latency between synchronous-commit AlwaysOn Availability Groups

From here, I surmised that the issue was in this process:

“logBlock receive->LogCache->LogFlush->HardenToLDF->AckPrimary”

This is followed directly by the “redo” phase, which makes the data visible to requesting connections.

Alongside this, I was running sp_whoisactive to try to catch any waits or queries that might need some tuning or investigation. I found this one when the system would back up a bit: DIRTY_PAGE_TABLE_LOCK

If you are not aware already, the best resource on the web (In my opinion) for wait types is SQLSkills.com. THIS article for the above wait type gave me this golden nugget:

“This wait type can become prevalent on a replica where there is a lot of read activity on the readable secondary and there is a heavy update workload on the primary replica.”

This matched my scenario perfectly. The wait is specific to parallel redo threads, which are the default in SQL 2016+

Paul links over to this MS Tiger Team post for a further discussion of Parallel vs. Serial redo. Go read it. If you already know this, share with a co-worker.

Findings:

Checkpoint (attempted as anecdotal possible solution by Paul) may give relief
Trace Flag 3459 disables parallel redo without a SQL restart (version/CU specific)
To re-enable parallel redo, disable the trace flag and restart SQL.

Actions Taken:

Checkpoint gave some relief
Enabled TF 3459 globally on the Secondary replica
Continued to monitor Secondary replica using:

Select
	datediff(ms,last_redone_time,last_hardened_time) as [MilliSeconds behind]
	,Cast((datediff(ss,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Seconds behind]
	,Cast((datediff(mi,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Minutes behind]
	,redo_queue_size as [Redo Queue size in KB]
	,redo_rate as [Redo Rate KB/second]
	, getdate() as [datetime]
From sys.dm_hadr_database_replica_states
Where 1=1
	and is_primary_replica = 0

Results:

Trace Flag 3459 completely fixed the issue, and the reporting side of the EOM process worked flawlessly for the first time in many months. As I write this, we have gone through 2 separate EOM periods without issues.

Conclusion:

Understanding the phases of the AG process, or any other offloading of reporting processes is critical to troubleshooting
“Synchronous commit” does not mean “Immediately readable”. It means the transaction is hardened to the log file on all synchronous replicas. Many presenters/bloggers gloss over this distinction.
Know your troubleshooting tools (sp_whoisactive, AG Dashboard) and where to go to learn (MS, SQL Skills, Ozar, etc)

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

IRL #6: Availability Group Performance

Problem:

Like this:

Related

Leave a Comment

Recent Posts

SQL Server Auto-settings: The Good, the Bad, and the Ugly

SQL Server Errorlog: Not Just for Errors!

SQL Server Editions: What’s the Difference and Why It Matters

Sign up for our Newsletter

IRL #6: Availability Group Performance

Problem:

Share this:

Like this:

Related

Leave a Comment

Recent Posts

SQL Server Auto-settings: The Good, the Bad, and the Ugly

SQL Server Errorlog: Not Just for Errors!

SQL Server Editions: What’s the Difference and Why It Matters

Sign up for our Newsletter