My client came to me with their biggest company-wide issue. They are a retailer with many independent sales reps. These reps tend to enter a majority of their orders into the system during the last 4 hours of each month, and enter just enough to hit the next commission level. The data-entry part is fine, but the commission reports get way behind on showing the orders just entered.
2 node SQL 2016 Availability Group. Automatic Failover, Synchronous commit. One readable secondary. The reports traffic has to be pointed to the Primary if it gets more than 10 minutes out of sync.
I’ve only done the basics of Availability Group work at this point – Install, add/remove databases, etc. so I knew I needed to brush up on the various phases of the data movement process, especially since I was under the impression that if data was entered successfully on the Primary replica, it HAD to be entered and visible on the Secondary replica. This impression did not mesh with what the client was telling me. Also, we were only 5 days away from EOM (end-of-month).
The best link I found to learn about these phases was from Microsoft’s CSS SQL Escalation Services team:
From here, I surmised that the issue was in this process:
This is followed directly by the “redo” phase, which makes the data visible to requesting connections.
Alongside this, I was running sp_whoisactive to try to catch any waits or queries that might need some tuning or investigation. I found this one when the system would back up a bit: DIRTY_PAGE_TABLE_LOCK
If you are not aware already, the best resource on the web (In my opinion) for wait types is SQLSkills.com. THIS article for the above wait type gave me this golden nugget:
“This wait type can become prevalent on a replica where there is a lot of read activity on the readable secondary and there is a heavy update workload on the primary replica.”
This matched my scenario perfectly. The wait is specific to parallel redo threads, which are the default in SQL 2016+
Paul links over to this MS Tiger Team post for a further discussion of Parallel vs. Serial redo. Go read it. If you already know this, share with a co-worker.
- Checkpoint (attempted as anecdotal possible solution by Paul) may give relief
- Trace Flag 3459 disables parallel redo without a SQL restart (version/CU specific)
- To re-enable parallel redo, disable the trace flag and restart SQL.
- Checkpoint gave some relief
- Enabled TF 3459 globally on the Secondary replica
- Continued to monitor Secondary replica using:
Select datediff(ms,last_redone_time,last_hardened_time) as [MilliSeconds behind] ,Cast((datediff(ss,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Seconds behind] ,Cast((datediff(mi,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Minutes behind] ,redo_queue_size as [Redo Queue size in KB] ,redo_rate as [Redo Rate KB/second] , getdate() as [datetime] From sys.dm_hadr_database_replica_states Where 1=1 and is_primary_replica = 0
Trace Flag 3459 completely fixed the issue, and the reporting side of the EOM process worked flawlessly for the first time in many months. As I write this, we have gone through 2 separate EOM periods without issues.
- Understanding the phases of the AG process, or any other offloading of reporting processes is critical to troubleshooting
- “Synchronous commit” does not mean “Immediately readable”. It means the transaction is hardened to the log file on all synchronous replicas. Many presenters/bloggers gloss over this distinction.
- Know your troubleshooting tools (sp_whoisactive, AG Dashboard) and where to go to learn (MS, SQL Skills, Ozar, etc)
Other posts in the IRL – In Real Life series: