IRL #8: SQL Server Down

Problem:

My PocketDBA® client called in with a non-responsive SQL Server. Application errors, timeout, alerts firing. SQL Server box is up, and the services are running.

Background:

This is a stand-alone bare metal server located at a hosting facility with some local storage as well as NAS storage.  SQL Server 2016 Standard Edition, 8 cores, 128GB RAM.  The primary database is nearing 1TB and is has multiple filegroups (Active, Archive, Indexes, etc.).   This database and apps that touch it ARE the company.

Investigation:

  • Attempts to Login failed – SSMS, Dedicated Admin Connection (taken), SQLCMD
  • Errorlog accessed directly from the file system showed:
    • Log Backup failures
    • Multiple Slow I/O requests: “SQL Server has encountered 112 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Company_LOG\Company_log.ldf] in database id 42”
    • Sp_whoisactive revealed a long running LOG backup reading from the H drive, writing to the N drive.
    • Windows Resource Monitor revealed Disk H Active Time (%) at 100, Disk Queue length – 0.
  • During this process the database went into suspect mode as it lost its connection to the .LDF completely. Verified the file was still in the expected location.
    • Cannot bring offline or online via any method (due to the above log backup)

 

Process:

  • KILL the Log Backup and disable the job. No change after 30 minutes for the KILL to complete, no additional traffic on this database.
  • Contact the hosting firm to open a critical support ticket to check the NAS.
  • Host confirmed cable settings, O/S settings, network settings and restarted the NAS.

 

Results:

  • After the NAS reboot, everything came up and appeared normal.
  • Brought the database online with no issues and a quick run through the Recovery process

 

Conclusions:

  • Always verify:
    • Services are running
    • Drives are available
    • Connections can be made
    • Sp_whoisactive – run it early in the process
  • Often times, SQL Server is a victim of something else.
  • NAS vs SAN…largely a budget issue but has performance implications
  • Make sure you have the phone numbers handy for anyone supporting your systems and an SLA in place.
  • High Availability is a really good idea. This was a 3 hour outage.

 

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

 

Leave a Comment

Sign up for our Newsletter

%d bloggers like this: