
DallasDBAs.com

SQL Server Database Consulting


“SQL Server is Slow” Part 3 of 4

October 1, 2025 by Kevin3NF

In Parts 1 and 2 of this series, we gathered information and did the triage, just as a professional in almost any industry would.

At this point you’ve:

  • Defined what “slow” means and built a timeline (Part 1).
  • Checked things outside SQL Server like network, storage, and VM noise (Part 2).

Now it’s time to open the hood on SQL Server itself.


Step 1: Check Active Sessions

Run a quick session check (sp_whoisactive is a favorite) and ask:

  • Who’s running queries right now?
  • What queries have been running the longest? Is that normal?
  • Any blocking chains?
  • Are any queries hogging resources?

At this stage, you’re only identifying potential offenders. In Part 4, we’ll dig into queries and indexes more deeply.
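If sp_whoisactive isn’t installed, a rough DMV-based sketch can answer the same questions. This is a starting point, not a replacement for the real tool:

```sql
-- Currently running requests, their duration, and any blockers (sketch).
SELECT r.session_id,
       r.blocking_session_id,                       -- non-zero = blocked by that session
       r.wait_type,
       r.total_elapsed_time / 1000 AS elapsed_sec,  -- how long it has been running
       r.cpu_time,
       r.logical_reads,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID                        -- exclude this session
ORDER BY r.total_elapsed_time DESC;
```

Sort by elapsed time first; the long runners and anything with a non-zero blocking_session_id are your potential offenders.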

 

Step 2: Look at Wait Stats

Wait stats tell you what SQL Server has really been waiting for (everything in SQL Server is waiting for something else):

  • PAGEIOLATCH: slow storage reads.
  • LCK_M_X: blocking/locking.
  • CXPACKET/CXCONSUMER: parallelism info.
  • THREADPOOL: worker thread exhaustion.
  • RESOURCE_SEMAPHORE: memory grant waits.
  • ASYNC_NETWORK_IO: probably a client-side problem rather than a SQL-side one.
    • The most comprehensive list of wait types and explanations comes from my friends at SQLskills.

This isn’t about solving yet – it’s about categorizing where SQL feels the pain.
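A quick way to categorize is to rank waits since the last restart. This is a minimal sketch; the SQLskills script mentioned above filters far more benign wait types than the handful excluded here:

```sql
-- Top waits since the last service restart (or stats clear),
-- filtering out a few common idle/background waits.
SELECT TOP (10)
       wait_type,
       wait_time_ms / 1000.0 AS wait_sec,
       waiting_tasks_count,
       wait_time_ms * 100.0 / SUM(wait_time_ms) OVER () AS pct_of_total
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK',
                        N'CHECKPOINT_QUEUE', N'XE_TIMER_EVENT',
                        N'REQUEST_FOR_DEADLOCK_SEARCH', N'DIRTY_PAGE_POLL',
                        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
ORDER BY wait_time_ms DESC;
```

Remember these numbers are cumulative since startup, so a wait that dominated last month may swamp whatever is hurting you today.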

 

Step 3: Review Agent Jobs & Error Logs

SQL may already be waving red flags:

  • Overlapping or stuck Agent jobs. A long-running purge job or index rebuild can cause all sorts of issues during the day.
  • Failed backups or CHECKDB runs. A failed CHECKDB could mean corruption. Read this
  • Errors or memory dumps tied to patching or system instability. Look in the same folder as your ERRORLOG.
    • Can’t find that folder? Watch this
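A couple of quick checks along these lines might look like the sketch below. Note that xp_readerrorlog’s two search parameters are ANDed together; here only the first is used:

```sql
-- Scan the current SQL error log (0 = current log, 1 = SQL log vs. Agent log)
-- for error and dump messages.
EXEC sys.xp_readerrorlog 0, 1, N'error', NULL;
EXEC sys.xp_readerrorlog 0, 1, N'dump',  NULL;

-- Recent Agent job failures:
SELECT j.name, h.run_date, h.run_time, h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
WHERE h.run_status = 0          -- 0 = failed
  AND h.step_id = 0             -- step 0 = overall job outcome row
ORDER BY h.run_date DESC, h.run_time DESC;
```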

 

Step 4: Don’t Forget the “Gotchas”

Other less obvious issues can cause system-wide drag:

  • High VLF count (often from failed or missing log backups).
  • Database compatibility or config changes – check SSMS reports like:
    • Server level: Configuration Changes History
    • Database level: All Blocking Transactions, Index Usage Statistics
  • Recent patching issues (especially if tied to errors or dump files).

These aren’t everyday culprits, but when they show up, they can cripple performance.
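For the VLF check, newer builds (SQL Server 2016 SP2 / 2017 and later) expose the count directly; on older versions you’d fall back to DBCC LOGINFO per database:

```sql
-- VLF count per online database, highest first (sketch).
SELECT d.name,
       ls.total_vlf_count
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_stats(d.database_id) AS ls
WHERE d.state = 0               -- ONLINE only
ORDER BY ls.total_vlf_count DESC;
```

There’s no single magic threshold, but counts in the thousands usually mean the log has been growing in small increments and deserves attention.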

 

Step 5: Compare Against Your Baseline

Today’s “slow” may be tomorrow’s “normal.”

  • Track batch requests/sec, CPU utilization, wait stats, I/O latency, and log file size/VLF count.
  • Without this baseline, every slowdown feels like a brand-new mystery.

If you don’t already have a baseline, NOW is the time to start, while the server is healthy.

  • Collect I/O stats and wait stats regularly.
  • Run sp_Blitz for a full health snapshot (a free tool from Brent Ozar).
  • Capture DMV performance counters (sys.dm_os_performance_counters) on a schedule.

A baseline doesn’t need to be fancy; it just needs to exist so you know what “normal” looks like before things go sideways.
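A bare-bones baseline capture could be as simple as logging a few counters to a table on an Agent job schedule. The DBAdmin database name below is a hypothetical placeholder:

```sql
-- Minimal baseline capture sketch. Schedule via an Agent job (e.g., every 5 min).
IF OBJECT_ID(N'DBAdmin.dbo.PerfBaseline') IS NULL
    CREATE TABLE DBAdmin.dbo.PerfBaseline (
        capture_time  datetime2     NOT NULL DEFAULT SYSDATETIME(),
        counter_name  nvarchar(128) NOT NULL,
        cntr_value    bigint        NOT NULL
    );

INSERT INTO DBAdmin.dbo.PerfBaseline (counter_name, cntr_value)
SELECT RTRIM(counter_name), cntr_value      -- counter_name is padded; trim it
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',      -- cumulative: diff between rows
                       N'Page life expectancy',
                       N'User Connections');
```

Note that counters like Batch Requests/sec are cumulative, so you compute the delta between two captures to get a meaningful rate.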


The Bottom Line

Part 3 is about categorizing slowness inside SQL Server: sessions, waits, jobs, error logs, and configuration gotchas. Don’t jump straight into query rewrites yet. You’re still isolating the nature of the slowdown. Having a consistent process for this reduces panic and anxiety.

In Part 4, we’ll cover what to do when the culprit is truly inside SQL Server: queries, indexes, and design choices.

 

Thanks for reading!

Filed Under: Beginner, Performance Tuning, SQL, Troubleshooting Tagged With: syndicated

“SQL Server Is Slow” – Part 2 of 4

September 24, 2025 by Kevin3NF

The 10-Minute Outside-In Triage

Don’t Blame SQL First

It’s 9:05 AM and your helpdesk lights up: “The SQL Server is down. Nothing works.”

By 9:07, everyone is staring at you.

The trap: you open SSMS and start digging for blocking queries. But what if the database isn’t the problem at all?

I’ve seen teams lose hours chasing SQL when the real culprit was a Windows power plan, an antivirus update, or a snapshot on the storage array.

That’s why step two in our framework is a 10-minute outside-in triage. You prove or eliminate the usual suspects before you go digging inside SQL Server.

The Outside-In Checklist

When an incident is happening right now, you don’t need deep analysis. You need to rule out the obvious and collect proof.

Step 1: Blast Radius (1 min)

  • Are other apps/servers slow too?
  • Other users, or just one?
  • Cloud? Check provider status pages and throttling alerts.
    If everything is slow, SQL may not be the root cause.

 

Step 2: Host & OS Health (3 – 4 min)

  • CPU: Is the server pegged, not just sqlservr.exe?
    • I have a story here…coined the term “three finger salute” from it (CTRL-ALT-DELETE). Ask me if we ever meet in person!
  • Hypervisor: Look for VM “ready time” or “steal time.”
  • Power plan: Make sure Windows is on High Performance.
  • Memory: Check paging, commit vs installed.
  • Antivirus: Any active scans?
  • Windows Update: Patches or Defender scans running?

 

Step 3: Storage Sanity (3 – 4 min)

  • Latency: Are read/write times spiking?
  • Queue depth: Backlog on MDF, LDF, or tempdb volumes?
  • Snapshots/backup agents: Running now?
  • Cloud disks: Bursting credits exhausted?
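The latency and queue questions can be answered from inside SQL Server in seconds. A sketch using the file-stats DMV (numbers are cumulative since startup, so a healthy average can hide a current spike):

```sql
-- Per-file average I/O latency since startup (sketch).
-- Sustained reads over ~20 ms or writes over ~10 ms are worth a closer look.
SELECT DB_NAME(fs.database_id) AS database_name,
       mf.physical_name,
       fs.num_of_reads,
       CASE WHEN fs.num_of_reads = 0 THEN 0
            ELSE fs.io_stall_read_ms / fs.num_of_reads END  AS avg_read_ms,
       CASE WHEN fs.num_of_writes = 0 THEN 0
            ELSE fs.io_stall_write_ms / fs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
JOIN sys.master_files AS mf
  ON mf.database_id = fs.database_id AND mf.file_id = fs.file_id
ORDER BY fs.io_stall DESC;
```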

 

Step 4: Network Checks (1 – 2 min)

  • Is RDP sluggish? File copies slow?
  • Any new firewall/VPN/SSL changes?

 

Step 5: Quick SQL Health Pulse (1 – 2 min)

(You’re not tuning queries here, you’re checking basic health)

  • Blocking chains: is one session holding everyone else hostage? sp_whoisactive is great for this!
  • Current waits: PAGEIOLATCH (storage), WRITELOG (log), ASYNC_NETWORK_IO (client/network). Use the SQLskills “waits now” script.
  • Active jobs: backups, CHECKDB, index maintenance colliding with business hours.
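For the current-waits pulse, a one-liner against the live waiting-tasks DMV (a snapshot of right now, unlike the cumulative wait stats) might look like:

```sql
-- What is waiting at this moment, longest first (sketch).
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id      -- non-NULL = part of a blocking chain
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.session_id > 50           -- rough filter for user sessions
ORDER BY wt.wait_duration_ms DESC;
```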

 

The Bottom Line

The point of a 10-minute triage isn’t to fix everything. It’s to prove or eliminate SQL Server as the culprit.

When you walk into a war room and say, “Here’s proof it’s the storage snapshot, not SQL,” you stop the blame game cold.

At Dallas DBAs, we use this exact checklist when clients call in a panic. It’s fast, repeatable, and it keeps you from chasing the wrong rabbit.


First Month Free for New Pocket DBA® Clients

https://pocket-dba.com/


Thanks for reading!

Filed Under: SQL, Troubleshooting Tagged With: syndicated

“SQL Server Is Slow” part 1 of 4

September 17, 2025 by Kevin3NF

How should you respond when you get the dreaded Email/Slack/Text/DriveBy from someone yelling at you that SQL Server is slow?

 

Stop. Don’t Open SSMS Yet.

You’ve heard it before: “The server is slow.”

What does that actually mean?

If you jump straight into SQL Server looking for blocking, bad queries, or a missing index, you’re working with bad input. And bad input = wasted time.

The real work starts before you ever connect to SQL Server.

 

Intake & Scoping Comes First

Your first job is to turn “slow” into something measurable and time-bound. That means gathering evidence, building a timeline, and classifying the type of incident.

 

Translate “Slow” Into Scenarios

“Slow” can mean very different things depending on who you ask:

  • Login delays – Applications take 30 seconds to connect.
  • Application lag – a report that usually runs in 2 seconds now takes 30.
  • Broad latency – all users complain at once.
  • Maintenance overruns – ETL jobs not finishing in their normal window.
  • Timeouts or errors – not just lag, but outright failures.

Each one leads you in a different direction.

 

The Fast Intake Questionnaire

Ask these every single time:

  1. What exactly was slow? (screen/report/job/login)
  2. When did it start? (date/time, not “earlier today”)
  3. How widespread? (one user, team, region, everyone)
  4. Repeatable? (every time, sometimes, once)
  5. What changed? (deploys, patches, config, new data)
  6. Is it happening now? (yes/no)
  7. Evidence (screenshots, error messages, job names with timestamps)

 

Classify the Incident

Once you have real answers, put it in a bucket:

  • Happening now – live triage needed.
  • Recurring/pattern – happens on a schedule.
  • Sporadic – random spikes.
  • One-time – tied to a patch, deploy, or outage.

 

Build the Timeline

Anchor events with real timestamps.

  • “Around lunch” = useless.
  • “9:12 AM CST, FinanceApp login timeout” = actionable.

Capture start, stop, duration, who was affected, and what else was happening on the system at that time.

 

The (first) Bottom Line:

Action without information is useless. Any other service provider (ER doctor, plumber, electrician, auto mechanic) will ALWAYS triage your issue before starting an investigation or a fix. You should too.

 

SQL Tidbit:

When gathering timelines, align your clocks. SQL Server may log in UTC while users report in Central or Eastern time. A “9 AM slowdown” can easily show up in SQL logs at 2 PM. Get this wrong and you’ll chase the wrong evidence.
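On SQL Server 2016 and later, AT TIME ZONE makes the conversion explicit when you’re lining up log entries against user reports. A quick sketch:

```sql
-- Convert a UTC log timestamp to Central time (SQL Server 2016+).
DECLARE @log_time_utc datetime2 = '2025-09-17 14:00:00';

SELECT @log_time_utc AT TIME ZONE 'UTC'
                     AT TIME ZONE 'Central Standard Time' AS log_time_central;
-- Mid-September is CDT (UTC-5), so 14:00 UTC lines up with the 9 AM complaint.
```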

Thanks for reading!

Filed Under: Emergency & Troubleshooting, Performance Tuning, SQL, Troubleshooting Tagged With: syndicated

SQL Server Availability Groups

September 10, 2025 by Kevin3NF

Flexibility and Scale at the Database Level

When SQL Server 2012 introduced Availability Groups (AGs), they changed the HA/DR game. Unlike Failover Cluster Instances (FCIs), which protect the whole instance, AGs protect individual databases and let you scale reads, offload backups, and stretch across data centers.

They’re not perfect, but they’ve become the go-to HA/DR option for many SQL Server environments.

What Are Availability Groups?

An Availability Group is a set of one or more user databases that fail over together between SQL Server instances.

  • The Primary replica handles all writes.
  • Secondary replicas keep synchronized copies of the databases.
  • Clients connect through a Listener, a virtual network name similar to an FCI’s Virtual Name, but tied to the Group rather than the instance.

 

Core Benefits of AGs

  • Database-level protection: Choose which databases to include, rather than the entire instance.
  • Readable secondaries: Offload reporting, ETL, or read workloads.
  • Backup offloading: Run backups on secondary replicas to reduce load on the primary.
  • Geographic flexibility: Replicas can live across subnets or even in different data centers (Distributed AG).
  • Multiple copies of data: Protects against both server failure and storage corruption.
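Once an AG is up, a quick replica health check against the standard HADR DMVs might look like this sketch:

```sql
-- AG replica roles and sync health at a glance (sketch).
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       rs.role_desc,                     -- PRIMARY / SECONDARY
       rs.synchronization_health_desc    -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar ON ar.group_id  = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS rs
  ON rs.replica_id = ar.replica_id;
```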

 

How AGs Differ from FCIs

  • Storage:
    • AG: multiple independent copies of the data (local to each replica).
    • FCI: single shared storage, single point of failure.
  • Scope:
    • AG: database-level protection only.
    • FCI: instance-level protection (includes system DBs, jobs, logins, etc.).
  • Connectivity:
    • AG: Listener provides a single name for apps, routes them to the current primary.
    • FCI: Virtual Name tied to the whole instance.

 

Clusterless / Read-Scale AGs

Not every AG is about high availability.

  • Read-scale availability groups (also called clusterless AGs) provide read-only secondaries without requiring Windows Server Failover Clustering (WSFC).
  • They’re useful for:
    • Scaling out reporting and analytics workloads.
    • Offloading read traffic without the complexity of cluster configuration.
  • Caveats:
    • No automatic failover. If the primary fails, you must manually fail over.
    • Purely for offloading reporting, backups (optional), and CHECKDB work.

 

Version & Edition Requirements

  • Enterprise Edition: Full AG support, multiple replicas, readable secondaries, async replicas across regions.
  • Standard Edition: Basic AGs (since SQL Server 2016 SP1), limited to two replicas and one database per AG, with no readable secondary.
    • Minimally useful for one or two databases; more than that can cause thread starvation.
  • Windows / Linux:
    • AGs require WSFC (Windows) or Pacemaker (Linux).
    • Read-scale AGs don’t require clustering.

 

Downsides of AGs

  • Instance-level objects not included: Logins, SQL Agent jobs, and linked servers must be synced separately.
  • Licensing cost: Enterprise Edition required for full features.
  • Operational complexity: Requires careful planning for quorum, replicas, failover modes, and monitoring.
  • Storage overhead: Each replica requires its own storage and capacity.

 

When to Choose AGs

  • Pick AGs if:
    • You need readable secondaries for offloading.
    • You need database-level granularity for HA/DR.
    • You want cross-datacenter disaster recovery.
  • Pick FCIs if:
    • You need instance-level protection.
    • You have reliable shared storage.
    • You want simpler management without replica sync overhead.

The Bottom Line:

Availability Groups offer flexibility that FCIs never could: readable secondaries, offloaded backups, and multi-datacenter options. But that power comes with complexity, higher licensing costs, and the need to manage instance-level objects separately.

If you need read scale or cross-site DR, AGs are often the right choice. If you just need your instance to stay alive, FCIs might still be enough.

 

Are you ready for a REAL database disaster?

Take the quiz to find out:

SQL Server Recovery Assessment | 2-Minute Quiz

 

SQL Tidbit:

AGs come with a pretty handy, auto-refreshing dashboard in SSMS, complete with customizable columns for the metrics most important to you!

Filed Under: HADR Tagged With: syndicated

SQL Server Failover Cluster Instances

September 3, 2025 by Kevin3NF

Old Reliable Still Matters

If you’ve been around SQL Server for a while, you’ve heard of Failover Cluster Instances (FCIs). They’ve been part of SQL’s high availability toolbox since long before Availability Groups showed up, and they’re still relevant today, especially if you want protection at the instance level, not just the database level.

Let’s break down what they are, how they differ from AGs, and where they make sense in your environment.

What Is an FCI?

A Failover Cluster Instance is a single SQL Server instance installed across multiple Windows Server nodes. At any given time, one node is active, while the others are standing by. If the active node fails, Windows Server Failover Clustering (WSFC) brings the SQL instance online on another node.

The big requirement? Shared storage. All nodes point to the same database files, whether that’s on a SAN, SMB file share, or Storage Spaces Direct.

How FCIs Differ from Availability Groups

  • Storage:
    • FCI: one copy of the data on shared storage.
    • AG: multiple synchronized copies on separate storage.
  • Scope:
    • FCI: protects the whole instance (system databases, SQL Agent jobs, logins, SSIS packages, everything).
    • AG: protects only selected user databases.
  • Client Connections:
    • FCI: uses a Virtual Network Name (VNN). Apps connect to a single name, and after failover, that name points to the new active node.
    • AG: uses an AG Listener, which looks and behaves similarly, but is tied to a group of databases instead of the entire instance.
  • Use Cases:
    • FCI: protects against server or instance failure.
    • AG: adds database-level resilience, readable secondaries, and cross-datacenter flexibility (Distributed AG)

 

FCI Version & Edition Requirements

  • Windows Server: Requires WSFC.
  • SQL Server Editions:
    • Standard: limited to 2 nodes.
    • Enterprise: supports multiple nodes, multi-subnet clusters, and advanced options.
  • Storage: Must be shared, unless you’re on newer Windows with Storage Spaces Direct.

 

Benefits of FCIs

  • Instance-level protection: Everything tied to SQL Server moves during failover.
  • Automatic failover: Fast recovery from node crashes, no manual intervention required.
  • No data divergence: One data copy = no sync overhead.
  • Connection simplicity: The Virtual Network Name hides node-level details from applications.

 

Downsides & Risks

  • Shared storage dependency: If the SAN or file share fails, the cluster fails.
  • No readable secondaries: Unlike AGs, you can’t offload reporting or backups.
  • Longer failover times: The whole instance must restart on the new node.
  • Complex setup: Requires WSFC expertise, quorum planning, and strong storage design.

 

When to Choose FCI vs AG

  • Pick FCI if:
    • You want full-instance protection, including jobs, logins, and system databases.
    • You already have a reliable SAN or shared storage.
    • You need a single connection name (VNN) that hides failovers from applications.
  • Pick AG if:
    • You want multiple data copies for extra resilience.
    • You need readable secondaries for reporting or backups.
    • You want DR across data centers without shared storage.
    • You have personnel to maintain the underlying WSFC components
      • Clusterless / Read-Scale AGs are an option

 

The Bottom Line

Failover Cluster Instances aren’t flashy, but they’re still solid. They protect more than just databases, they’re available in Standard Edition, and for many shops they’re the right balance of reliability and simplicity.

Just remember: your cluster is only as good as your storage.

 


New Pocket DBA® clients get the first month FREE!

https://pocket-dba.com/

Book a call, and mention “Newsletter”


SQL Tidbit

When connecting to an FCI, always use the Virtual Network Name (VNN) — never the node name. This way, your applications don’t care which node is currently active. Failover happens, your apps keep talking, and you don’t get that 3 a.m. call.
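After a failover, you can confirm which physical node you landed on from any connection through the VNN, using documented SERVERPROPERTY values:

```sql
-- Which node is active, and is this instance clustered? (sketch)
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node,
       SERVERPROPERTY('IsClustered')                 AS is_clustered;  -- 1 = FCI
```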

 

Thanks for reading!

 

Filed Under: HADR, SQL Tagged With: syndicated

SQL Server Log Shipping: The Tried and True of DR

August 27, 2025 by Kevin3NF

It’s not glamorous, but it works

In a world where shiny new HA/DR features get all the press, there’s one SQL Server technology that just keeps doing its job.

Log Shipping has been around since SQL Server 2000. It doesn’t make headlines, it doesn’t have fancy dashboards, and it’s not going to win you any architecture awards. But for the right environment and use case, it’s rock solid and can save your bacon (and maybe your job) in a disaster.

Backup, copy, restore. With automation.

What is Log Shipping?

At its core, log shipping is an automated process that:

  1. Backs up the transaction log from your primary database.
  2. Copies that backup to one or more secondary servers.
  3. Restores the log onto the secondary.

Do that on a schedule (15 minutes is the default) and your secondary database stays nearly up to date with production.

Requirements:

  • Database in FULL recovery model
  • SQL Agent running (Log Shipping is powered by jobs)
  • Shared/network path for moving the log backups

 

SQL Server Versions & Editions

  • Works in Standard, Enterprise, and Developer Editions — not in Express.
  • Available since SQL Server 2000, still here in 2022.
  • Works in on-premises or cloud/colo VM deployments (Azure SQL VM, AWS EC2, etc.).

 

Why It’s Disaster Recovery, Not High Availability

HA means automatic failover with near-zero downtime. That’s not log shipping.

With log shipping:

  • Failover is manual – you bring the secondary online when needed, database by database.
  • RPO (data loss) is whatever gap exists between the last backed up log and the failure.
  • RTO (downtime) is how long it takes to restore the last backups and bring the database online.

It’s DR because you can be up in minutes to hours, but not instantly.

Setting Up Log Shipping (The Basics)

You can walk through the wizard in SSMS, and this is really the easiest way, in my opinion:

  1. Primary server: Configure the log backup job.
  2. Secondary server: Configure the copy and restore jobs.
  3. Schedule jobs: As a general rule, I back up frequently, copy frequently (if using the copy job at all), and build in a delay on the restore side. This gives you a window to catch an “oops” moment, rather than restoring the oops to the secondary server.
  4. Initial restore: Seed the secondary with a full backup (WITH NORECOVERY).
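The seed step might look like this sketch, with hypothetical database and share names:

```sql
-- On the primary: take the full backup to a share both servers can reach.
BACKUP DATABASE MyDB
    TO DISK = N'\\fileshare\LogShip\MyDB_full.bak' WITH INIT;

-- On the secondary: restore it, leaving the database in RESTORING state
-- so the shipped log backups can be applied.
RESTORE DATABASE MyDB
    FROM DISK = N'\\fileshare\LogShip\MyDB_full.bak'
    WITH NORECOVERY;
```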

 

Monitoring & Maintenance

Log Shipping has some built-in monitoring functionality:

  1. Instance-level report: right-click the server in Object Explorer >> Reports >> Standard Reports >> Transaction Log Shipping Status.
  2. Alert jobs are created for thresholds not being met. You may have to adjust these for databases that can go long periods with no activity, to avoid false alerts.
  3. Set up alerts for failures on the backup, copy, and restore jobs. (If you back up to and restore from the same file location, you don’t need the copy job.)

 

Failover & Failback

Failover (manual):

  1. Stop the log shipping jobs.
  2. Restore any remaining logs.
  3. Bring the secondary online (RESTORE DATABASE myDatabase WITH RECOVERY).
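The failover steps above, sketched in T-SQL with hypothetical database and path names:

```sql
-- Apply any remaining log backups, still without recovering:
RESTORE LOG MyDB
    FROM DISK = N'\\fileshare\LogShip\MyDB_tail.trn'
    WITH NORECOVERY;

-- Then bring the secondary online for users:
RESTORE DATABASE MyDB WITH RECOVERY;
```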

Failback:

  • Rebuild the original setup from scratch. In a two-node setup, this leaves copies of all the jobs on each server, so make sure to disable the jobs that don’t make sense, such as LS_Restore on the primary.

 

When to Use Log Shipping

  • Low budget DR solution
  • Simple read-only reporting copy (standby mode)
  • Decent for setting up a reporting server when near real-time data isn’t a requirement.

 

The Bottom Line:

Log Shipping isn’t shiny, but it’s dependable. If you can live with a few minutes of potential data loss and a manual failover, it’s a cost-effective way to add resilience to your SQL Server environment.


New Pocket DBA® clients get the first month FREE!

https://pocket-dba.com/

Book a call, and mention “Newsletter”


Thanks for reading!

— Kevin

Filed Under: HADR, SQL Tagged With: syndicated
