My PocketDBA® client called in with a non-responsive SQL Server. Application errors, timeout, alerts firing. SQL Server box is up, and the services are running.
This is a stand-alone bare metal server located at a hosting facility with some local storage as well as NAS storage. SQL Server 2016 Standard Edition, 8 cores, 128GB RAM. The primary database is nearing 1TB and is has multiple filegroups (Active, Archive, Indexes, etc.). This database and apps that touch it ARE the company.
- Attempts to Login failed – SSMS, Dedicated Admin Connection (taken), SQLCMD
- Errorlog accessed directly from the file system showed:
- Log Backup failures
- Multiple Slow I/O requests: “SQL Server has encountered 112 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Company_LOG\Company_log.ldf] in database id 42”
- Sp_whoisactive revealed a long running LOG backup reading from the H drive, writing to the N drive.
- Windows Resource Monitor revealed Disk H Active Time (%) at 100, Disk Queue length – 0.
- During this process the database went into suspect mode as it lost its connection to the .LDF completely. Verified the file was still in the expected location.
- Cannot bring offline or online via any method (due to the above log backup)
- KILL the Log Backup and disable the job. No change after 30 minutes for the KILL to complete, no additional traffic on this database.
- Contact the hosting firm to open a critical support ticket to check the NAS.
- Host confirmed cable settings, O/S settings, network settings and restarted the NAS.
- After the NAS reboot, everything came up and appeared normal.
- Brought the database online with no issues and a quick run through the Recovery process
- Always verify:
- Services are running
- Drives are available
- Connections can be made
- Sp_whoisactive – run it early in the process
- Often times, SQL Server is a victim of something else.
- NAS vs SAN…largely a budget issue but has performance implications
- Make sure you have the phone numbers handy for anyone supporting your systems and an SLA in place.
- High Availability is a really good idea. This was a 3 hour outage.
Other posts in the IRL – In Real Life series: