Performance Tuning

How old are those stats?

September 30, 2021 by Kevin3NF 2 Comments

SQL Server maintains a variety of stats about all sort of performance items.

Index usage (or missing indexes)
Query performance
Corrupt pages
Disk IO performance
Way more than I care to list here

I was trying to troubleshoot some TerriBad * tempdb write performance…almost 3000ms per write, on a server that we recently migrated to.

Our data center vendor of course said the storage was perfect and that we should try troubleshooting SQL Server. (Duh?)

These stats are stored in the sys.dm_io_virtual_file_stats DMF, and I found this gem under the ‘sample_ms’ column description in the result set:

“The data type is int for SQL Server 2008 through SQL Server 2014 (12.x). In these versions, the value will reset to 0 after approximately 25 days of continuous database engine uptime.”

Now, this is a SQL 2017 box, but this tells me that these stats are going to reset at some point.

When I re-ran my IO stalls query…the writes for all 8 tempdb data files are at 7ms. The sample_ms is currently 3,097,351,985 which is about 36 days if I’ve done my math right. Server uptime is approaching 90 days.

The takeaway for you, dear reader, is to know where to find the stats but also to know what timeframe they represent.

* Thanks to Brent Ozar (b|t) for this highly amusing term 🙂

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

Virtual Log Files

November 24, 2020 by Kevin3NF Leave a Comment

Today’s post is a guest article from a friend of Dallas DBAs, writer, and fantastic DBA Jules Behrens (B|L)

One common performance issue that is not well known that should still be on your radar as a DBA is a high number of VLFs.

Virtual Log Files are the files SQL Server uses to do the actual work in a SQL log file (MyDatabase_log.LDF). It allocates new VLFs every time the log file grows. Perhaps you’ve already spotted the problem – if the log file is set to grow by a tiny increment, then if your the file ever grows very large, you may end up with thousands of tiny little VLFs, and this can slow down your performance at the database level. Think of it like a room (the log file) filled with boxes (the VLFs). If you just have a few boxes, it is more efficient to figure out where something (a piece of data in the log file) is, than if you have thousands of tiny boxes. (Analogy courtesy of @SQLDork)

It is especially evident there is an issue with VLFs when SQL Server takes a long time to recover from a restart. Other symptoms may be slowness with autogrowth, log shipping, replication, and general transactional slowness. Anything that touches the log file, in other words.

The best solution is prevention – set your log file to be big enough to handle its transaction load to begin with, and set it to have a sensible growth rate in proportion to its size, and you’ll never see this come up. But sometimes we inherit issues where best practices were not followed, and a high number of VLFs is certainly something to check when doing a health assessment on an unfamiliar environment.

Find your VLF counts

The built-in DMV sys.dm_db_log_info is specifically for finding information about the log file, and command DBCC LOGINFO (deprecated) will return a lot of useful information about VLFs as well. There is an excellent script for pulling the count of VLFs that uses DBCC LOGINFO from Kev Riley, on Microsoft Tech Net:

https://gallery.technet.microsoft.com/scriptcenter/SQL-Script-to-list-VLF-e6315249

There is also a great script by Steve Rezhener on SQLSolutionsGroup.com that utilizes the view:

https://sqlsolutionsgroup.com/capture-sql-server-vlf-information-using-a-dmv/

Either one of these will tell you what you ultimately need to know – if your VLFs are an issue. How many VLFs are too many? There isn’t an industry standard, but for the sake of a starting point, let’s say a tiny log file has 500 VLFs. That is high. A 5GB log file with 200 VLFs, on the other hand, is perfectly acceptable. You’ll likely know a VLF problem when you find it; you’ll run a count on the VLFs and it will return something atrocious like 20,000. (ed – someone at Microsoft support told me about one with 1,000,000 VLFs)

Resolution

If the database is in Simple recovery model and doesn’t see much traffic, this is easy enough to fix. Manually shrink the log file as small as it will go, verify the autogrow is appropriate, and grow it back to its normal size. If the database is in Full recovery model and is in high use, it’s a little more complex. Follow these steps (you may have to do it more than once):

Take a transaction log backup .
Issue a CHECKPOINT manually.
Check the empty space in the transaction log to make sure you have room to shrink it.
Shrink the log file as small as it will go.
Grow the file back to its normal size.
Lather, Rinse, Repeat as needed

Now check your VLF counts again, and make sure you are down to a nice low number. Done!

If you need help, contact us

Thanks for reading!

Kevin3NF

IRL #6: Availability Group Performance

October 1, 2019 by Kevin3NF Leave a Comment

Problem:

My client came to me with their biggest company-wide issue. They are a retailer with many independent sales reps. These reps tend to enter a majority of their orders into the system during the last 4 hours of each month, and enter just enough to hit the next commission level. The data-entry part is fine, but the commission reports get way behind on showing the orders just entered.

Background:

2 node SQL 2016 Availability Group. Automatic Failover, Synchronous commit. One readable secondary. The reports traffic has to be pointed to the Primary if it gets more than 10 minutes out of sync.

Investigation:

I’ve only done the basics of Availability Group work at this point – Install, add/remove databases, etc. so I knew I needed to brush up on the various phases of the data movement process, especially since I was under the impression that if data was entered successfully on the Primary replica, it HAD to be entered and visible on the Secondary replica. This impression did not mesh with what the client was telling me. Also, we were only 5 days away from EOM (end-of-month).

The best link I found to learn about these phases was from Microsoft’s CSS SQL Escalation Services team:

Troubleshooting data movement latency between synchronous-commit AlwaysOn Availability Groups

From here, I surmised that the issue was in this process:

“logBlock receive->LogCache->LogFlush->HardenToLDF->AckPrimary”

This is followed directly by the “redo” phase, which makes the data visible to requesting connections.

Alongside this, I was running sp_whoisactive to try to catch any waits or queries that might need some tuning or investigation. I found this one when the system would back up a bit: DIRTY_PAGE_TABLE_LOCK

If you are not aware already, the best resource on the web (In my opinion) for wait types is SQLSkills.com. THIS article for the above wait type gave me this golden nugget:

“This wait type can become prevalent on a replica where there is a lot of read activity on the readable secondary and there is a heavy update workload on the primary replica.”

This matched my scenario perfectly. The wait is specific to parallel redo threads, which are the default in SQL 2016+

Paul links over to this MS Tiger Team post for a further discussion of Parallel vs. Serial redo. Go read it. If you already know this, share with a co-worker.

Findings:

Checkpoint (attempted as anecdotal possible solution by Paul) may give relief
Trace Flag 3459 disables parallel redo without a SQL restart (version/CU specific)
To re-enable parallel redo, disable the trace flag and restart SQL.

Actions Taken:

Checkpoint gave some relief
Enabled TF 3459 globally on the Secondary replica
Continued to monitor Secondary replica using:

Select
	datediff(ms,last_redone_time,last_hardened_time) as [MilliSeconds behind]
	,Cast((datediff(ss,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Seconds behind]
	,Cast((datediff(mi,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Minutes behind]
	,redo_queue_size as [Redo Queue size in KB]
	,redo_rate as [Redo Rate KB/second]
	, getdate() as [datetime]
From sys.dm_hadr_database_replica_states
Where 1=1
	and is_primary_replica = 0

Results:

Trace Flag 3459 completely fixed the issue, and the reporting side of the EOM process worked flawlessly for the first time in many months. As I write this, we have gone through 2 separate EOM periods without issues.

Conclusion:

Understanding the phases of the AG process, or any other offloading of reporting processes is critical to troubleshooting
“Synchronous commit” does not mean “Immediately readable”. It means the transaction is hardened to the log file on all synchronous replicas. Many presenters/bloggers gloss over this distinction.
Know your troubleshooting tools (sp_whoisactive, AG Dashboard) and where to go to learn (MS, SQL Skills, Ozar, etc)

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

T-SQL Tuesday #118: Fantasy SQL Feature

September 10, 2019 by Kevin3NF Leave a Comment

T-SQL Tuesday is a monthly blog party for the SQL Server community. It is the brainchild of Adam Machanic (b|t) and this month’s edition is hosted by Kevin Chant (b|t), SQL dude and fellow cyclist! Kevin wants to know what our “Fantasy SQL Feature” is.

I asked my LinkedIn connections a very similar question in my “Question of the Week” there. Hit the link to see a lot of different responses.

One of the recurring themes and possibly the most common was:

Load Balanced Writes. Unless I’ve completely missed something, there is nothing in SQL Server natively that will allow you to write to different copies of the database (leaving out Merge and Bi-Directional Transactional Replication, because they suck and don’t scale).

Don’t get me wrong…I love read-only replicas in an Availability Group, replication for reporting, or even delayed Log Shipping. They are wonderful for taking the read traffic off an OLTP box, but if you are dying under heavy load in a very optimized setup…oh well.

Bonus Fantasy: Give me a button in the Log Shipping Setup GUI (Database Properties>>Log Shipping) that just says “Re-initialize”. Sometimes LS just falls apart and its easier to tear it down and start over. Even easier is to just back up the db, restore over the Secondary db and go. But, for the small business using LS for poor-man’s DR and no DBA on staff…give me a button. Ideally that button calls sp_ReInit_Log_Shipping @DBName = ‘MyDB’. But someone will have to write that first. I am not that someone. This may already be in the DBATools.io set of toys, but I haven’t looked, and again…small customers need a button in a GUI.

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

IRL #4: Instance Tuning

July 22, 2019 by Kevin3NF Leave a Comment

IRL #4 – tuning a server instance in very small increments

Problem:

Main production server is behaving “sluggishly” and additional load makes it worse. Additionally, deadlocks are a problem, at 880 per day on average. Client wants to implement changes in small, tightly controlled steps. Dev, QA, then Prod.

Background:

SQL Server 2012 (RTM) Enterprise
8 cores, 58 GB RAM (32 allocated to SQL Server)
1 TempDB file (default setting)
Cost Threshold for Parallelism – 5 (default setting)
CXPACKET waits – 57.2%
LATCH_EX waits – 37.0%

Investigation:

Multiple queries that run thousands of times per hour or day have a cost at or near 5. Many in the 15-20 range.
Write stalls on tempdb data file avg 87ms, resides on same drive as primary database. 4.3M writes/day avg.

Recommended Steps:

Upgrade to current SP and CU
Move tempdb to dedicated drive
Increase Cost Threshold for Parallelism to 25

Results:

Query Performance (click to enlarge):

TempDB:

Before: avg Read stalls 2ms, avg Write Stalls 87ms
After: avg Read stalls 1ms, avg Write Stalls 275ms
- Investigating drive config (HDD/SSD, RAID level)

Other:

Deadlocks dropped from 880/day to 380/day – 56.8% improvement without touching any code
CXPACKET waits dropped to 45.5% – 20% improvement
LATCH_EX waits dropped to 24.1% – 35% improvement

Conclusions:

Our methodology of working from Instance to Database to Index to Query continues to prove successful.
Verify all drive info before moving files around (tempdb, log files, etc.)

Other posts in the IRL – In Real Life series:

https://dallasdbas.com/category/sql/irl/

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

Video: Getting Started with SQL Server Execution Plans

January 17, 2019 by Kevin3NF Leave a Comment

I put this video together in about ten minutes of T-SQL code prep, and one-take of the video. It is not meant to be an all-inclusive lesson…merely an introduction to how to read SQL Server Execution Plans (Right to Left, Top to Bottom, etc.) and some basic things to look at. Senior performance tuning DBAs…you probably don’t want to watch this 🙂

Comments and likes are appreciated if you think it has value…

Thanks for watching!

Kevin3NF

Follow @Dallas_DBAs