SQL Server Time Bombs

Common Reasons for Emergency SQL calls

If you are a production DBA (or Accidental prod DBA) you’ve gotten that frantic call in the middle of the night. Or maybe during little Suzy’s soccer game? Something broke, alerts are firing, nobody can order your widgets and the help desk tech has no idea where to start.

I’ve been on call since 2006. These are the most common things I’ve been called for at the worst times:

Tempdb growth

When tempdb grows out of control, it can easily fill a drive and stop the SQL Server in its tracks. Every database (potentially) uses the same tempdb to do its sorting, aggregating, maintenance and spilling of large transactions. That is the purpose of it.

Nothing in the system will stop a really bad query from filling up the tempdb data files to consume the entire drive. Once it fails the system should be ok/online, but you still have a full drive firing alerts at the sysadmins.

How to prevent: cap the tempdb files to maximum size that will keep alerts from firing for disk space. Also, set up alerts for errors 1105 (data file is full) and 9002 (log file is full). These go to the DBA.

You cannot stop bad queries if you allow users to run Ad Hoc, but you can limit the blast radius.

Disk Space

I don’t have enough toes to count the number of times I’ve been hit with an on-call because disk space was causing issues in SQL Server. Most of the time I search for all files, sort by size descending and look at the biggest files. VERY frequently someone dumped a backup file there, or created a new DB on the wrong drive.

How to prevent: Log the output of exec master..xp_fixeddrives and keep an eye on it. Set permissions for just admins and SQL Service accounts to limit Joe’s ability to break things.

Backup failures

Tend to these quickly, as your RPO might be in jeopardy if backups are missing. Easily 80% of the backup failures I see are related to disk space (see above). 19% are network related. 1% is everything else.

How to Prevent: see Disk Space above. For T-log backups I set the job step to retry once before failing. Weird network blips shouldn’t cause someone to wake me up.

Patching
This is probably the hardest to deal with, but it comes up often. If you are not mostly current on Cumulative Updates and Security Patches, expect this call. Especially during a monthly Windows patch cycle – which doesn’t include SQL Server patches.

How to Prevent: see Disk Space above. For T-log backups I set the job step to retry once before failing. Weird network blips shouldn’t cause someone to wake me up.

The Bottom Line:
An ounce of prevention prevents your phone from ringing. Pay close attention to the setup and maintenance of your SQL Server to stop these issues before they start!

Leave a Comment

Sign up for our Newsletter

%d bloggers like this: