IRL

IRL #8: SQL Server Down

September 6, 2023 by Kevin3NF Leave a Comment

Problem:

My PocketDBA® client called in with a non-responsive SQL Server. Application errors, timeout, alerts firing. SQL Server box is up, and the services are running.

Background:

This is a stand-alone bare metal server located at a hosting facility with some local storage as well as NAS storage. SQL Server 2016 Standard Edition, 8 cores, 128GB RAM. The primary database is nearing 1TB and is has multiple filegroups (Active, Archive, Indexes, etc.). This database and apps that touch it ARE the company.

Investigation:

Attempts to Login failed – SSMS, Dedicated Admin Connection (taken), SQLCMD
Errorlog accessed directly from the file system showed:
- Log Backup failures
- Multiple Slow I/O requests: “SQL Server has encountered 112 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\Company_LOG\Company_log.ldf] in database id 42”
- Sp_whoisactive revealed a long running LOG backup reading from the H drive, writing to the N drive.
- Windows Resource Monitor revealed Disk H Active Time (%) at 100, Disk Queue length – 0.
During this process the database went into suspect mode as it lost its connection to the .LDF completely. Verified the file was still in the expected location.
- Cannot bring offline or online via any method (due to the above log backup)

Process:

KILL the Log Backup and disable the job. No change after 30 minutes for the KILL to complete, no additional traffic on this database.
Contact the hosting firm to open a critical support ticket to check the NAS.
Host confirmed cable settings, O/S settings, network settings and restarted the NAS.

Results:

After the NAS reboot, everything came up and appeared normal.
Brought the database online with no issues and a quick run through the Recovery process

Conclusions:

Always verify:
- Services are running
- Drives are available
- Connections can be made
- Sp_whoisactive – run it early in the process
Often times, SQL Server is a victim of something else.
NAS vs SAN…largely a budget issue but has performance implications
Make sure you have the phone numbers handy for anyone supporting your systems and an SLA in place.
High Availability is a really good idea. This was a 3 hour outage.

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

IRL #7: Test Restores and CheckDB

April 23, 2020 by Kevin3NF 2 Comments

My Pluralsight course for new SQL Server DBAs

Problem:

My client asked me for a way to automate test restores and CheckDB either ad hoc or on a schedule, but not for certain databases that are close to or over 1TB in size.

Background:

My client has a number of SQL Server instances across multiple versions from 2008R2 up to 2016, Standard and Enterprise edition. We recently standardized all of their maintenance (backups, indexes, stats and checkdb) across the board using the Maintenance Solution from Ola Hallengren.

Investigation:

Digging around a bit revealed several choices:

Roll my own T-SQL ($$$)
Use a 3rd party product ($$)
Use DBATools.io (Free)

I was given the OK to spend up to 4 hours for a Proof of Concept on the DBATools option. DBATools is a set of commands that are all built using PowerShell. You don’t need to know PowerShell to use them.

Process:

I have a lab with SQL Server 2016, 2017 and 2019 installed (all Developer edition). Each instance has a few test databases, including WideWorldImporters, StackOverflow and some other local ones used just to test specific ideas.

I am using PSVersion 5.1.x, with the DBATools module installed.

The core command I am using (out of well over 550 available) is Test-DbaLastBackup. I need to start with getting one instance to restore its databases into another, run CheckDB and then drop the DB.

Test-DbaLastBackup -Destination kbh-precision\SQL2019 -SqlInstance  KBH-Precision\SQL2017 `
                   -IgnoreLogBackup -ExcludeDatabase master,model,msdb, StackOverflow

This allowed me to do the minimum, skipping System databases and StackOverflow (due to its size) and work from just the most recent Full backup plus any Differentials. I only have one database that met the criteria, and I have intentionally corrupted it using DBCC WRITEPAGE. DONT DO THIS UNLESS YOU KNOW WHAT YOU ARE DOING!!!

Results:

SourceServer : KBH-Precision\SQL2017
TestServer : kbh-precision\SQL2019
Database : XE_demo
FileExists : True
Size : 3.20 MB
RestoreResult : Success
DbccResult : Table error: Object ID 565577053, index ID 0, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data), page (1:560). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed.
Values are 133129 and -4.
Object ID 565577053, index ID 0, partition ID 72057594041270272, alloc unit ID 72057594046906368 (type In-row data): Page (1:560) could not be processed. See other errors for details.
DBCC results for 'RndNums'.
There are 0 rows in 0 pages for object "RndNums".
CHECKDB found 0 allocation errors and 2 consistency errors in table 'RndNums' (object ID 565577053).
DBCC results for 'sys.plan_persist_query_hints'.
There are 0 rows in 0 pages for object "sys.plan_persist_query_hints".
DBCC results for 'sys.plan_persist_query_template_parameterization'.
There are 0 rows in 0 pages for object "sys.plan_persist_query_template_parameterization".
DBCC results for 'sys.plan_persist_wait_stats'.
There are 0 rows in 0 pages for object "sys.plan_persist_wait_stats".
DBCC results for 'sys.persistent_version_store'.
There are 0 rows in 0 pages for object "sys.persistent_version_store".
DBCC results for 'sys.persistent_version_store_long_term'.
There are 0 rows in 0 pages for object "sys.persistent_version_store_long_term".
DBCC results for 'sys.wpr_bucket_table'.
There are 0 rows in 0 pages for object "sys.wpr_bucket_table".
DBCC results for 'sys.queue_messages_1977058079'.
There are 0 rows in 0 pages for object "sys.queue_messages_1977058079".
DBCC results for 'sys.queue_messages_2009058193'.
There are 0 rows in 0 pages for object "sys.queue_messages_2009058193".
DBCC results for 'sys.queue_messages_2041058307'.
There are 0 rows in 0 pages for object "sys.queue_messages_2041058307".
DBCC results for 'sys.filestream_tombstone_2073058421'.
There are 0 rows in 0 pages for object "sys.filestream_tombstone_2073058421".
DBCC results for 'sys.syscommittab'.
There are 0 rows in 0 pages for object "sys.syscommittab".
DBCC results for 'sys.filetable_updates_2105058535'.
There are 0 rows in 0 pages for object "sys.filetable_updates_2105058535".
CHECKDB found 0 allocation errors and 2 consistency errors in database 'dbatools-testrestore-XE_demo'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (dbatools-testrestore-XE_demo).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
RestoreStart : 2020-04-22 07:36:09.391
RestoreEnd : 2020-04-22 07:36:10.706
RestoreElapsed : 00:00:01
DbccStart : 2020-04-22 07:36:10.772
DbccEnd : 2020-04-22 07:36:12.323
DbccElapsed : 00:00:01
BackupDates : {04/16/2020 10:19:19}
BackupFiles : D:\Backups\KBH-Precision_SQL2017\XE_Demo_20200416.bak

PS C:\WINDOWS\system32>

From there I tested various methods to get the output into something I can look at or, alert on, etc. Not knowing how deep my client wanted to go at the time, I settled on this:

#Import-Module dbatools

Test-DbaLastBackup -Destination kbh-precision\SQL2019 -SqlInstance  KBH-Precision\SQL2017, KBH-Precision\SQL2016 `
                   -IgnoreLogBackup -ExcludeDatabase master,model,msdb, StackOverflow | `
                    Select-Object -Property SourceServer,Database,RestoreResult,DBCCResult | `
                    Write-SQLTableData -ServerInstance "kbh-precision\SQL2019" -DatabaseName "DBA" -SchemaName "dbo" -Tablename "Test_DBALastBackup" -Force

The last step was to get this into a SQL Agent job (or jobs, depending on the flexibility needed).

I assumed a SQL Agent PowerShell job step would be the best, but my friend Andy Levy(b|t) suggested using this approach.

Results:

With these 4 lines of code and less than 3 hours (ignoring the permissions issues I had), I was able to create a flexible process that can be changed by a simple edit of the .ps1 file.

No doubt there are better ways to implement all of this, error handling, etc. but the ideas got my client what they needed.

“That’s exactly what we wanted”

“How well does it work on Availability Groups?” (Don’t know yet, will test that soon).

Conclusion:

DBATools.io has a ton of useful, free, opensource commands…look there before you roll your own T-SQL. Someone else may have already solved your problem.
You don’t have to know PowerShell to use these tools, any more than you have to know how fuel injectors work to drive a car.

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

The video version:

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

IRL #6: Availability Group Performance

October 1, 2019 by Kevin3NF Leave a Comment

Problem:

My client came to me with their biggest company-wide issue. They are a retailer with many independent sales reps. These reps tend to enter a majority of their orders into the system during the last 4 hours of each month, and enter just enough to hit the next commission level. The data-entry part is fine, but the commission reports get way behind on showing the orders just entered.

Background:

2 node SQL 2016 Availability Group. Automatic Failover, Synchronous commit. One readable secondary. The reports traffic has to be pointed to the Primary if it gets more than 10 minutes out of sync.

Investigation:

I’ve only done the basics of Availability Group work at this point – Install, add/remove databases, etc. so I knew I needed to brush up on the various phases of the data movement process, especially since I was under the impression that if data was entered successfully on the Primary replica, it HAD to be entered and visible on the Secondary replica. This impression did not mesh with what the client was telling me. Also, we were only 5 days away from EOM (end-of-month).

The best link I found to learn about these phases was from Microsoft’s CSS SQL Escalation Services team:

Troubleshooting data movement latency between synchronous-commit AlwaysOn Availability Groups

From here, I surmised that the issue was in this process:

“logBlock receive->LogCache->LogFlush->HardenToLDF->AckPrimary”

This is followed directly by the “redo” phase, which makes the data visible to requesting connections.

Alongside this, I was running sp_whoisactive to try to catch any waits or queries that might need some tuning or investigation. I found this one when the system would back up a bit: DIRTY_PAGE_TABLE_LOCK

If you are not aware already, the best resource on the web (In my opinion) for wait types is SQLSkills.com. THIS article for the above wait type gave me this golden nugget:

“This wait type can become prevalent on a replica where there is a lot of read activity on the readable secondary and there is a heavy update workload on the primary replica.”

This matched my scenario perfectly. The wait is specific to parallel redo threads, which are the default in SQL 2016+

Paul links over to this MS Tiger Team post for a further discussion of Parallel vs. Serial redo. Go read it. If you already know this, share with a co-worker.

Findings:

Checkpoint (attempted as anecdotal possible solution by Paul) may give relief
Trace Flag 3459 disables parallel redo without a SQL restart (version/CU specific)
To re-enable parallel redo, disable the trace flag and restart SQL.

Actions Taken:

Checkpoint gave some relief
Enabled TF 3459 globally on the Secondary replica
Continued to monitor Secondary replica using:

Select
	datediff(ms,last_redone_time,last_hardened_time) as [MilliSeconds behind]
	,Cast((datediff(ss,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Seconds behind]
	,Cast((datediff(mi,last_redone_time,last_hardened_time)) as decimal(18,2)) as [Minutes behind]
	,redo_queue_size as [Redo Queue size in KB]
	,redo_rate as [Redo Rate KB/second]
	, getdate() as [datetime]
From sys.dm_hadr_database_replica_states
Where 1=1
	and is_primary_replica = 0

Results:

Trace Flag 3459 completely fixed the issue, and the reporting side of the EOM process worked flawlessly for the first time in many months. As I write this, we have gone through 2 separate EOM periods without issues.

Conclusion:

Understanding the phases of the AG process, or any other offloading of reporting processes is critical to troubleshooting
“Synchronous commit” does not mean “Immediately readable”. It means the transaction is hardened to the log file on all synchronous replicas. Many presenters/bloggers gloss over this distinction.
Know your troubleshooting tools (sp_whoisactive, AG Dashboard) and where to go to learn (MS, SQL Skills, Ozar, etc)

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

IRL #5: Massive T-Log File

August 12, 2019 by Kevin3NF Leave a Comment

Problem:

I had a customer call in this week after struggling to resolve an issue for a couple of days, Their Transaction Log file had grown to 400+GB and was filling the drive to the point they had to keep adding space to the drive to keep operations online. This was for a core internal system. There are no staff DBAs here, but plenty of very sharp sysadmins, developers, network folks, etc.

Note: I have written about the most common cause of this issue here, but this was not the resolution in this case. The embedded video is far and away number one in hits of all my SQL videos, showing just how common log file growth issues are.

Background:

Current SQL Server version
Data file around 90GB
Log backups enabled, occurring and verified (250MB or so every 30 minutes)
Replication not in play.

Investigation:

Run DBCC OpenTran to see if some process was left hanging
Check master.sys.databases.Log_reuse_wait_Desc for the database in question. See the “Wrapping Up” section for a proper query.
Review Disk Usage report for the database – verify if the log file is full or just big and empty

Findings:

Found an open transaction, sitting there since 2.5 days ago.
sys.databases showed ‘Active_Transaction’ for our database, confirming Finding 1. Other possible values for log_reuse_wait_desc here
Disk usage report showed the T-log was 99% used

Actions Taken:

Received permission from application team to kill the SPID identified in DBCC Opentran
Sys.Databases description changed to Log_Backup as expected
Disk usage report now shows 99% unused.
Fairly lengthy cycle of running Checkpoint, Log Backups and Shrink File attempts got us back to a 2GB .LDF file. The log file Autogrow was 10MB….we had 32K+ VLFs once the root problem was resolved by killing the spid
Changed the Autogrow to 512MB in case this happens again to reduce the VLF count

Results:

Application performance immediately improved
Reclaimed 400GB of disk space
Educated my client on the basics of log file maintenance, how they are written to, etc.
Everyone slept well that night – no on call wakeups
Total time: 1 hour

Conclusion:

Most of the time this issue comes from a lack of log backups…but not always. Be ready with the tools for deeper research
I was fully prepared to have them install sp_whoisactive if necessary for deeper digging
It pays to have a DBA in your Pocket if you are not going to (or don’t need to) staff one

Other posts in the IRL – In Real Life series:
https://dallasdbas.com/category/sql/irl/

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

IRL #4: Instance Tuning

July 22, 2019 by Kevin3NF Leave a Comment

IRL #4 – tuning a server instance in very small increments

Problem:

Main production server is behaving “sluggishly” and additional load makes it worse. Additionally, deadlocks are a problem, at 880 per day on average. Client wants to implement changes in small, tightly controlled steps. Dev, QA, then Prod.

Background:

SQL Server 2012 (RTM) Enterprise
8 cores, 58 GB RAM (32 allocated to SQL Server)
1 TempDB file (default setting)
Cost Threshold for Parallelism – 5 (default setting)
CXPACKET waits – 57.2%
LATCH_EX waits – 37.0%

Investigation:

Multiple queries that run thousands of times per hour or day have a cost at or near 5. Many in the 15-20 range.
Write stalls on tempdb data file avg 87ms, resides on same drive as primary database. 4.3M writes/day avg.

Recommended Steps:

Upgrade to current SP and CU
Move tempdb to dedicated drive
Increase Cost Threshold for Parallelism to 25

Results:

Query Performance (click to enlarge):

TempDB:

Before: avg Read stalls 2ms, avg Write Stalls 87ms
After: avg Read stalls 1ms, avg Write Stalls 275ms
- Investigating drive config (HDD/SSD, RAID level)

Other:

Deadlocks dropped from 880/day to 380/day – 56.8% improvement without touching any code
CXPACKET waits dropped to 45.5% – 20% improvement
LATCH_EX waits dropped to 24.1% – 35% improvement

Conclusions:

Our methodology of working from Instance to Database to Index to Query continues to prove successful.
Verify all drive info before moving files around (tempdb, log files, etc.)

Other posts in the IRL – In Real Life series:

https://dallasdbas.com/category/sql/irl/

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs

IRL #3: SQL Server HealthCheck

June 5, 2019 by Kevin3NF Leave a Comment

IRL #3 – Slow Server

Problem:

I was recently approached by a firm outside the US for some help with his SQL Server which was “running slow.”

Background:

They don’t have a DBA, and I didn’t support international customers at the time so I told him I would donate 30 minutes of my time out of pure curiosity.

Action Taken:

I sent my HealthCheck tool along with instructions to run it once the server has been up at least 7 days.

Results:

After reviewing the gathered info for 30 minutes, these are the results I sent back as recommendations to look into:

Backups and Data on the same drive
Other databases never backed up
No CheckDB since 2011, if ever
Max Server Memory: 24,000 MB of 32GB installed
- Memory used –3.4GB
Windows Server 2008
Auto-Shrink enabled on prod databases
No alerts when bad things happen
No Operators
SQL Agent offline
Balanced Power Plan
Priority Boost enabled
TempDB on C drive, only 1 data file
10 years of backup history
PLE – 292 seconds
32-bit SQL Server 2014 (unsupported build)
I didn’t even get back any index information, as those queries timed out.

Solution:

Immediate:
- Upgrade to 64-bit O/S (assuming 32 bit here) and SQL Server
- Patch to supported build
- Turn off Priority boost
- Set up proper alerts and maintenance
- High performance power plan
Next
- Address everything else

Lessons:

Start at the outside and work your way in. The SQL version can be the biggest issue. Start at Hardware and O/S, then Instance settings, then indexing, then queries (in most cases).

All of this took less than 30 minutes to find. I can do similar (but much more thorough) work for you.

Other posts in the IRL – In Real Life series:

https://dallasdbas.com/category/sql/irl/

Thanks for reading!

Kevin3NF

Follow @Dallas_DBAs