
DallasDBAs.com

SQL Server Database Consulting


SQL Server Availability Groups

September 10, 2025 by Kevin3NF

Flexibility and Scale at the Database Level

When SQL Server 2012 introduced Availability Groups (AGs), they changed the HA/DR game. Unlike Failover Cluster Instances (FCIs), which protect the whole instance, AGs protect individual databases and let you scale reads, offload backups, and stretch across data centers.

They’re not perfect, but they’ve become the go-to HA/DR option for many SQL Server environments.

What Are Availability Groups?

An Availability Group is a set of one or more user databases that fail over together between SQL Server instances.

  • The Primary replica handles all writes.
  • Secondary replicas keep synchronized copies of the databases.
  • Clients connect through a Listener, a virtual network name similar to an FCI’s Virtual Name, but tied to the Group rather than the instance.
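If you want to verify this from the server’s side, the AG DMVs will show each replica’s current role and health. A minimal sketch (assumes an existing AG):

-- Show each replica's current role and synchronization health
SELECT
    ag.name AS ag_name,
    ar.replica_server_name,
    ars.role_desc,                     -- PRIMARY or SECONDARY
    ars.synchronization_health_desc    -- HEALTHY, PARTIALLY_HEALTHY, NOT_HEALTHY
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar
    ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars
    ON ar.replica_id = ars.replica_id;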

 

Core Benefits of AGs

  • Database-level protection: Choose which databases to include, rather than the entire instance.
  • Readable secondaries: Offload reporting, ETL, or read workloads.
  • Backup offloading: Run backups on secondary replicas to reduce load on the primary (see the sketch after this list).
  • Geographic flexibility: Replicas can live across subnets or even in different data centers (Distributed AG).
  • Multiple copies of data: Protects against both server failure and storage corruption.
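To make the backup-offloading point concrete: on a readable secondary, full backups must be COPY_ONLY, while log backups run normally and preserve the shared log chain. A minimal sketch; the database name and paths are placeholders:

-- On a secondary replica: full backups must be COPY_ONLY
BACKUP DATABASE [YourDatabase]
TO DISK = N'\\BackupShare\YourDatabase_CopyOnly.bak'
WITH COPY_ONLY, COMPRESSION, CHECKSUM;

-- Log backups on a secondary are regular log backups
BACKUP LOG [YourDatabase]
TO DISK = N'\\BackupShare\YourDatabase.trn'
WITH COMPRESSION, CHECKSUM;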

 

How AGs Differ from FCIs

  • Storage:
    • AG: multiple independent copies of the data (local to each replica).
    • FCI: single shared storage, a single point of failure.
  • Scope:
    • AG: database-level protection only.
    • FCI: instance-level protection (includes system DBs, jobs, logins, etc.).
  • Connectivity:
    • AG: Listener provides a single name for apps and routes them to the current primary.
    • FCI: Virtual Name tied to the whole instance.

 

Clusterless / Read-Scale AGs

Not every AG is about high availability.

  • Read-scale availability groups (also called clusterless AGs) provide read-only secondaries without requiring Windows Server Failover Clustering (WSFC).
  • They’re useful for:
    • Scaling out reporting and analytics workloads.
    • Offloading read traffic without the complexity of cluster configuration.
  • Caveats:
    • No automatic failover.

If the primary fails, you must manually fail over.

    • Purely for offloading reporting, backups (optional), and CHECKDB work.
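For reference, a read-scale AG is created with CLUSTER_TYPE = NONE (SQL Server 2017+). A minimal sketch; the AG name, server names, endpoint URLs, and database name are all placeholders:

CREATE AVAILABILITY GROUP [ReadScaleAG]
WITH (CLUSTER_TYPE = NONE)    -- no WSFC or Pacemaker underneath
FOR DATABASE [YourDatabase]
REPLICA ON
    N'SQLPROD01' WITH (
        ENDPOINT_URL = N'TCP://SQLPROD01.yourdomain.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,    -- no automatic failover in this mode
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)),
    N'SQLREAD01' WITH (
        ENDPOINT_URL = N'TCP://SQLREAD01.yourdomain.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));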

 

Version & Edition Requirements

  • Enterprise Edition: Full AG support, multiple replicas, readable secondaries, async replicas across regions.
  • Standard Edition: Basic AGs (since SQL 2016 SP1), limited to 2 replicas, 1 database per AG, no readable secondaries.
    • Practical for one or two databases; more than that can cause worker thread starvation.
  • Windows / Linux:
    • AGs require WSFC (Windows) or Pacemaker (Linux).
    • Read-scale AGs don’t require clustering.

 

Downsides of AGs

  • Instance-level objects not included: Logins, SQL Agent jobs, and linked servers must be synced separately.
  • Licensing cost: Enterprise Edition required for full features.
  • Operational complexity: Requires careful planning for quorum, replicas, failover modes, and monitoring.
  • Storage overhead: Each replica requires its own storage and capacity.

 

When to Choose AGs

  • Pick AGs if:
    • You need readable secondaries for offloading.
    • You need database-level granularity for HA/DR.
    • You want cross-datacenter disaster recovery.
  • Pick FCIs if:
    • You need instance-level protection.
    • You have reliable shared storage.
    • You want simpler management without replica sync overhead.

The Bottom Line:

Availability Groups offer flexibility that FCIs never could: readable secondaries, offloaded backups, and multi-datacenter options. But that power comes with complexity, higher licensing costs, and the need to manage instance-level objects separately.

If you need read scale or cross-site DR, AGs are often the right choice. If you just need your instance to stay alive, FCIs might still be enough.

 

Are you ready for a REAL database disaster?

Take the quiz to find out:

SQL Server Recovery Assessment | 2-Minute Quiz

 

SQL Tidbit:

AGs come with a pretty handy, auto-refreshing dashboard in SSMS, complete with customizable columns for the metrics most important to you!


SQL Server Failover Cluster Instances

September 3, 2025 by Kevin3NF

Old Reliable Still Matters

If you’ve been around SQL Server for a while, you’ve heard of Failover Cluster Instances (FCIs). They’ve been part of SQL’s high availability toolbox since long before Availability Groups showed up, and they’re still relevant today, especially if you want protection at the instance level, not just the database level.

Let’s break down what they are, how they differ from AGs, and where they make sense in your environment.

What Is an FCI?

A Failover Cluster Instance is a single SQL Server instance installed across multiple Windows Server nodes. At any given time, one node is active, while the others are standing by. If the active node fails, Windows Server Failover Clustering (WSFC) brings the SQL instance online on another node.

The big requirement? Shared storage. All nodes point to the same database files, whether that’s on a SAN, SMB file share, or Storage Spaces Direct.
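If you inherit a server and aren’t sure whether it’s an FCI, SQL Server can tell you. A quick sketch:

-- Is this instance clustered, and which node owns it right now?
SELECT SERVERPROPERTY('IsClustered') AS is_clustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node;

-- List the cluster nodes (returns no rows on a standalone server)
SELECT NodeName, status_description, is_current_owner
FROM sys.dm_os_cluster_nodes;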

How FCIs Differ from Availability Groups

  • Storage:
    • FCI: one copy of the data on shared storage.
    • AG: multiple synchronized copies on separate storage.
  • Scope:
    • FCI: protects the whole instance (system databases, SQL Agent jobs, logins, SSIS packages, everything).
    • AG: protects only selected user databases.
  • Client Connections:
    • FCI: uses a Virtual Network Name (VNN). Apps connect to a single name, and after failover, that name points to the new active node.
    • AG: uses an AG Listener, which looks and behaves similarly, but is tied to a group of databases instead of the entire instance.
  • Use Cases:
    • FCI: protects against server or instance failure.
    • AG: adds database-level resilience, readable secondaries, and cross-datacenter flexibility (Distributed AG).

 

FCI Version & Edition Requirements

  • Windows Server: Requires WSFC.
  • SQL Server Editions:
    • Standard: limited to 2 nodes.
    • Enterprise: supports multiple nodes, multi-subnet clusters, and advanced options.
  • Storage: Must be shared, unless you’re on newer Windows with Storage Spaces Direct.

 

Benefits of FCIs

  • Instance-level protection: Everything tied to SQL Server moves during failover.
  • Automatic failover: Fast recovery from node crashes, no manual intervention required.
  • No data divergence: One data copy = no sync overhead.
  • Connection simplicity: The Virtual Network Name hides node-level details from applications.

 

Downsides & Risks

  • Shared storage dependency: If the SAN or file share fails, the cluster fails.
  • No readable secondaries: Unlike AGs, you can’t offload reporting or backups.
  • Longer failover times: The whole instance must restart on the new node.
  • Complex setup: Requires WSFC expertise, quorum planning, and strong storage design.

 

When to Choose FCI vs AG

  • Pick FCI if:
    • You want full-instance protection, including jobs, logins, and system databases.
    • You already have a reliable SAN or shared storage.
    • You need a single connection name (VNN) that hides failovers from applications.
  • Pick AG if:
    • You want multiple data copies for extra resilience.
    • You need readable secondaries for reporting or backups.
    • You want DR across data centers without shared storage.
    • You have personnel to maintain the underlying WSFC components.
      • Clusterless / Read-Scale AGs are an option

 

The Bottom Line

Failover Cluster Instances aren’t flashy, but they’re still solid. They protect more than just databases, they’re available in Standard Edition, and for many shops they’re the right balance of reliability and simplicity.

Just remember: your cluster is only as good as your storage.

 


New Pocket DBA® clients get the first month FREE!

https://pocket-dba.com/

Book a call, and mention “Newsletter”


SQL Tidbit

When connecting to an FCI, always use the Virtual Network Name (VNN) — never the node name. This way, your applications don’t care which node is currently active. Failover happens, your apps keep talking, and you don’t get that 3 a.m. call.

 

Thanks for reading!

 


SQL Server Log Shipping: The Tried and True of DR

August 27, 2025 by Kevin3NF

It’s not glamorous, but it works

In a world where shiny new HA/DR features get all the press, there’s one SQL Server technology that just keeps doing its job.

Log Shipping has been around since SQL Server 2000. It doesn’t make headlines, it doesn’t have fancy dashboards, and it’s not going to win you any architecture awards. But for the right environment and use case, it’s rock solid and can save your bacon in a disaster.

Backup, copy, restore. With automation.

What is Log Shipping?

At its core, log shipping is an automated process that:

  1. Backs up the transaction log from your primary database.
  2. Copies that backup to one or more secondary servers.
  3. Restores the log onto the secondary.

Do that on a schedule (15 minutes is the default) and your secondary database stays nearly up to date with production.

Requirements:

  • Database in FULL recovery model
  • SQL Agent running (Log Shipping is powered by jobs)
  • Shared/network path for moving the log backups
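A quick pre-flight check before you start the wizard; the database name is a placeholder:

-- Log shipping requires the FULL recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'YourDatabase';

-- Switch if needed, then take a full backup to start the log chain
ALTER DATABASE [YourDatabase] SET RECOVERY FULL;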

 

SQL Server Versions & Editions

  • Works in Standard, Enterprise, and Developer Editions — not in Express.
  • Available since SQL Server 2000, still here in 2022.
  • Works in on-premises or cloud/colo VM deployments (Azure SQL VM, AWS EC2, etc.).

 

Why It’s Disaster Recovery, Not High Availability

HA means automatic failover with near-zero downtime. That’s not log shipping.

With log shipping:

  • Failover is manual – you bring the secondary online when needed, database by database.
  • RPO (data loss) is whatever gap exists between the last backed up log and the failure.
  • RTO (downtime) is how long it takes to restore the last backups and bring the database online.

It’s DR because you can be up in minutes to hours, but not instantly.

Setting Up Log Shipping (The Basics)

You can walk through the wizard in SSMS, and this is really the easiest way, in my opinion:

  1. Primary server: Configure the log backup job.
  2. Secondary server: Configure the copy and restore jobs.
  3. Schedule jobs: As a general rule, I back up frequently, copy frequently (if using the copy job at all), and build in a delay on the restore side. This helps give you a window to catch an “oops” moment, rather than restoring the oops to the secondary server.
  4. Initial restore: Seed the secondary with a full backup (WITH NORECOVERY); a sketch follows this list.
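If you’d rather seed the secondary yourself than let the wizard do it, the restore looks roughly like this. Run it on the secondary; the paths and logical file names are placeholders (check yours with RESTORE FILELISTONLY):

-- NORECOVERY leaves the database ready to accept log restores
RESTORE DATABASE [YourDatabase]
FROM DISK = N'\\LogShipShare\YourDatabase_Full.bak'
WITH NORECOVERY,
     MOVE N'YourDatabase_Data' TO N'D:\Data\YourDatabase.mdf',
     MOVE N'YourDatabase_Log' TO N'L:\Logs\YourDatabase.ldf',
     STATS = 10;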

 

Monitoring & Maintenance

Log Shipping has some built-in monitoring functionality:

  1. Instance-level report: right-click the server in Object Explorer, then Reports>>Standard Reports>>Transaction Log Shipping Status.
  2. Alert jobs fire when backup/restore thresholds aren’t met. You may need to adjust these for databases that sit idle for long periods, to avoid false alerts.
  3. Set up alerts for failures of the backup, copy, and restore jobs. If you back up to and restore from the same file location, you don’t need the copy job.

 

Failover & Failback

Failover (manual):

  1. Stop the log shipping jobs.
  2. Restore any remaining logs.
  3. Bring the secondary online (RESTORE DATABASE myDatabase WITH RECOVERY).
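In T-SQL, the tail of that failover looks roughly like this; names and paths are placeholders:

-- Apply any remaining log backups, still WITH NORECOVERY
RESTORE LOG [myDatabase]
FROM DISK = N'\\LogShipShare\myDatabase_Last.trn'
WITH NORECOVERY;

-- Bring it online; no further restores are possible after this
RESTORE DATABASE [myDatabase] WITH RECOVERY;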

Failback:

  • Rebuild the original setup from scratch. This will leave you with copies of all of the jobs on each server in a 2-node setup. Make sure to disable the jobs that don’t make sense, such as LS_restore on the Primary.

 

When to Use Log Shipping

  • Low budget DR solution
  • Simple read-only reporting copy (standby mode)
  • Decent for setting up a reporting server when near real-time data isn’t a requirement.

 

The Bottom line:

Log Shipping isn’t shiny, but it’s dependable. If you can live with a few minutes of potential data loss and a manual failover, it’s a cost-effective way to add resilience to your SQL Server environment.




Thanks for reading!

— Kevin


SQL Server Backups: The Basics

August 20, 2025 by Kevin3NF

If you’re responsible for a SQL Server instance, you need working, consistent backups. Not just a .bak file here and there, but a plan that runs automatically and covers the full recovery cycle.

Here’s how to get that in place, even if you’re not a DBA.

Understand What Each Backup Type Does:

You don’t need them all every time, but you do need to know what they’re for:

  • Full Backup
    A complete copy of the entire database at that moment in time. It’s your foundation.
  • Differential Backup
    Captures only what changed since the last full backup. These can speed up recovery time and reduce storage needs. Not really necessary if your databases are small.
  • Transaction Log Backup
    Captures everything written to the transaction log since the last log backup. Needed for point-in-time recovery.

    • If your database is in Full or Bulk-Logged recovery model and you’re not doing log backups, your log file will grow endlessly, potentially filling the drive it is on.
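For reference, here’s what the three types look like in T-SQL. Names and paths are placeholders; maintenance plans or Ola’s scripts (below) generate these for you:

-- Full: the foundation
BACKUP DATABASE [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_Full.bak'
WITH COMPRESSION, CHECKSUM;

-- Differential: changes since the last full
BACKUP DATABASE [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;

-- Log: everything since the last log backup (Full recovery model)
BACKUP LOG [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_Log.trn'
WITH COMPRESSION, CHECKSUM;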

 

Set a Backup Schedule That Works

For production databases, this is my minimum recommended setup:

  • Full backups once per day
  • Log backups every 5 to 15 minutes
  • Optional differentials every few hours for large databases

For dev/test databases:

  • Full backups daily or weekly are usually fine
  • You can skip log backups unless you’re testing recovery processes
    • If you are going to skip, set the databases to SIMPLE Recovery
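That switch is a one-liner; the database name is a placeholder:

-- In SIMPLE recovery the log truncates on checkpoint; no log backups needed
ALTER DATABASE [YourDevDatabase] SET RECOVERY SIMPLE;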

 

Automate the Backups

Use SQL Server Agent to schedule the jobs. Here are two options:

  • Maintenance Plans (basic, GUI-driven)
    • Good for smaller environments or shops without scripting experience
    • Be careful: default plans may not have the best options for your situation
    • Included in SQL Server, supported by Microsoft.
  • Ola Hallengren’s Maintenance Solution (highly recommended)
    • Free, open-source, script-based
    • Handles full/diff/log backup rotation, cleanup, logging, and more
      • Optionally does corruption checking and index/stats maintenance
    • Use SQL Agent to schedule the process via the jobs the script created
    • Free, with FAQ/email/community support, but not supported by Microsoft

 

Store Backups Somewhere Safe

Don’t store them on the same drive as the database files. If the drive dies, the data and backups may both be lost.

Better options:

  • Separate disk or volume
  • Network share
  • Azure Blob Storage (or S3-compatible storage on SQL Server 2022+), via the Backup to URL option

 

Monitor It

Make sure the backup jobs are:

  • Running successfully
  • Completing on time
  • Not overwriting too soon or growing endlessly

Use SQL Agent alerts, third-party tools, or scripts to monitor backup age and job success.
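One simple way to spot-check backup age is msdb’s history tables. A sketch; compare the results to your own RPO:

-- Most recent backup of each type, per database
SELECT d.name AS database_name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
       MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END) AS last_diff,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b
    ON b.database_name = d.name
WHERE d.name <> N'tempdb'
GROUP BY d.name
ORDER BY d.name;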

The Bottom Line:

Understanding what backups are and how they work is KEY to protecting your company’s most valuable asset. If backups are your responsibility and you don’t know how they work, a database failure without a backup could be a career-limiting move.




Thanks for reading!

— Kevin


DBCC CHECKDB: Just Because It’s Quiet Doesn’t Mean It’s Safe

August 13, 2025 by Kevin3NF

Corruption isn’t a “maybe someday” problem. Here’s what you need to do now.

Stop. Don’t panic.

You just ran DBCC CHECKDB for the first time in a while (or maybe ever) and saw something you didn’t expect: the word corruption.

Take a breath.

Don’t detach the database.
Don’t run REPAIR_ALLOW_DATA_LOSS.
Don’t reboot the server or start restoring things just yet.

There’s a lot of bad advice floating around from old blogs, well-meaning forum posts, and even some popular current LinkedIn threads. Some of it might’ve been okay 15 years ago. Some of it is dangerous.

Let’s dig in.

What Corruption Really Means

When SQL Server says there’s corruption, it’s not talking about “bad data” like wrong numbers or missing values. It means it found damaged internal structures. The kind that can cause queries to fail or even make your database unusable.

This could be:

  • Broken data or index pages
  • Allocation inconsistencies (GAM, SGAM, PFS pages)
  • Corrupt system metadata
  • Problems in the transaction log.

This isn’t a performance problem.
It’s a data integrity problem. If left untreated, it can get worse.

How Does Corruption Happen?

Even if your server is well-configured, corruption can still creep in. Common causes include:

  • Failing disks or controllers (especially SANs and older SSDs)
  • Disk subsystems lying about successful writes
  • Power outages or hard shutdowns.
  • Sometimes SQL Server itself has bugs that cause corruption – especially in RTM versions.
  • Snapshot or backup software interfering at the file level
  • Antivirus software scanning .mdf, .ldf, or .ndf files directly

Some of these things leave no obvious signs. This is why running CHECKDB regularly is so important.

What DBCC CHECKDB Actually Does

When you run DBCC CHECKDB, SQL Server performs a deep consistency check of your database:

  • Every table, every index, every system structure
  • Logical and physical page consistency
  • Allocation integrity

If possible, SQL uses a snapshot to avoid locking the database.

What it doesn’t do:

  • Fix anything (unless you tell it to)
  • Prevent corruption
  • Run automatically (unless you set it up)
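The command itself is simple; the work is in scheduling it and reading the output. A typical run:

-- NO_INFOMSGS hides informational noise; ALL_ERRORMSGS reports every error found
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;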

How Often Should You Run It?

Ideally: once per week, at minimum.

  • Schedule it in a SQL Agent job, off-hours.
  • Save the job output to file or table so you don’t miss warnings.
  • Set up an email alert for failures of this job (as well as corruption alerts for errors 823-825)
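Here’s a sketch of one such alert using msdb.dbo.sp_add_alert; repeat for 823 and 825, and pair each alert with an operator notification in your environment:

EXEC msdb.dbo.sp_add_alert
     @name = N'Error 824 - suspect I/O',
     @message_id = 824,
     @severity = 0,                        -- must be 0 when @message_id is used
     @include_event_description_in = 1;    -- 1 = include details in email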

If CHECKDB takes too long or hits your performance too hard, you can offload the work.

Offload CHECKDB with Test-DbaLastBackup

If you’re taking backups regularly (you are, right?), you can use Test-DbaLastBackup from the dbatools.io PowerShell module to verify database consistency (and restorability) without touching production.

This command:

  • Restores your most recent backup to another SQL instance
  • Runs DBCC CHECKDB against the restored copy
  • Confirms both restorable state and internal consistency

 

Test-DbaLastBackup -SqlInstance "TestRestoreSQL" -Destination "TestRestoreSQL" -Database "YourDatabase"

It’s a great way to validate backups and run CHECKDB in a lower-impact environment.
Not a replacement for CHECKDB in production, but a powerful supplement when time or resources are tight.

  • Consider running CHECKDB on a secondary replica if you’re using Availability Groups.
  • If CHECKDB fails due to size or takes too long, it’s even more important to find time and a strategy that works.

What to Do If You Find Corruption

  1. Read the output carefully.
    It tells you which object is affected and how.
  2. Run CHECKDB again to confirm.
    Temporary issues can happen, especially on shared storage.
  3. Do not detach the database.
    Doing so loses the ability to investigate further.
  4. Check your backups.
    Can you restore from before the corruption appeared? This is the first thing Microsoft will tell you when you call support.
  5. If you are really lucky, the corruption might be in a non-clustered index, and dropping/recreating that index may solve it for now.
  6. Still stuck?
    Read this from Brent Ozar: DBCC CHECKDB Reports Corruption? Here’s What to Do

About REPAIR_ALLOW_DATA_LOSS

That command does exactly what it says: it removes damaged pages and objects to make the database consistent again—even if that means losing real data.

Use it only when:

  • You have no usable backup
  • You’ve consulted with your team and accepted the risk (get that in writing from your manager/CTO)
  • You’ve tried every other recovery option

If you’re not 100% sure what it’s going to delete (if anything), you’re not ready to run it. This is the sort of thing that can get you fired. So is not having backups.

How to Check When DBCC CHECKDB Was Last Run

This script gives you the last successful run for each database:

SELECT 
    name AS DatabaseName,
    DATABASEPROPERTYEX(name, 'LastGoodCheckDbTime') AS LastCheckDBSuccess
FROM 
    sys.databases
WHERE 
    state_desc = 'ONLINE'
ORDER BY 
    LastCheckDBSuccess DESC;

If the date shows 1900-01-01 (or is NULL), DBCC CHECKDB has never completed successfully on that database.

The Bottom Line

Corruption doesn’t announce itself with a trumpet. You only know it’s there if you go looking.

CHECKDB gives you an early warning. It’s not glamorous, but it’s essential, especially in environments without a dedicated DBA watching for signs of trouble.

If you’re not running it, you’re flying blind.

If you don’t know what to do when it finds something, now’s the time to prepare.

Don’t panic. But don’t ignore it either.

 

Thanks for reading!

— Kevin


SQL Server I/O Bottlenecks: It’s Not Always the Disk’s Fault

August 6, 2025 by Kevin3NF

“SQL Server is slow.”

We’ve all heard it. But that doesn’t always mean SQL Server is the problem. And “slow” means nothing without context and the ability to verify it.

More often than you’d think, poor performance is rooted in the one thing most sysadmins don’t touch until it’s on fire: the disk subsystem.

Why I/O Bottlenecks Fly Under the Radar

Many IT teams blame queries, blocking, or missing indexes when performance tanks, and sometimes they’re right. But if you’re seeing symptoms like long wait times, timeouts, or sluggish backups, there’s a good chance the underlying storage is at fault. I’ve rarely seen a storage admin agree with this at the onset of the problem, so you need to do the work up front.

Unless you look for I/O issues, you might never find them.

Common Causes of SQL Server I/O Bottlenecks

  • Slow or oversubscribed storage
    Spinning disks, congested SANs, or underpowered SSDs can’t keep up with demand.
  • Outdated or faulty drivers
    We’ve seen HBA or RAID controller driver issues that looked like database bugs.
  • Auto-growths triggered during business hours
    Small filegrowth settings lead to frequent stalls. Instant File Initialization helps this. If you cannot use IFI, manually grow your data files off-hours.
  • Bad indexing or bloated tables
    Too much data read, written, and maintained.
  • Unused indexes
    Every insert, update, or delete has to update them, whether they’re used or not. This one is a killer. My script is based on one my friend Pinal Dave wrote many years ago.
  • Data, log, and tempDB all sharing a volume
    A recipe for write contention and checkpoint stalls. The more separation you can do, the better. If everything is going through one controller, this might not help, especially in a VMware virtual controller configuration.
  • VM storage contention or thin provisioning
    Your VM’s dedicated storage might not be as dedicated as you think. Check with your admin to see if VMs have moved around and you are now in a “noisy neighbor” situation.

 

What Do “Good” Disk Numbers Look Like?

If you’re not sure what “normal” looks like for your disks, common rough guidance is under 10 ms per read on data files and under 5 ms per write on transaction logs; sustained latency over 20 ms deserves investigation.

You can get your numbers using:

  • sys.dm_io_virtual_file_stats (a sketch follows this list)
  • Performance Monitor (Avg. Disk sec/Read, Disk Queue Length)
  • Disk benchmarking tools like CrystalDiskMark (local test environments)
  • The Resource Monitor>>Disk tab is a quick and easy way to see visually where the disks are spending their time, if you are on the server.
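Here’s a sketch of the sys.dm_io_virtual_file_stats approach, computing average latency per database file. The numbers are cumulative since the last SQL Server restart, so check them over time, not just once:

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_ms,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_read_ms DESC;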

 

Fixes and Workarounds

  • Identify and reduce high physical reads
    These indicate SQL Server is constantly pulling data from disk, which could be caused by poor indexing, insufficient memory, or queries reading too much data. sp_BlitzCache from Brent Ozar can help with this; use @SortOrder = 'reads' or 'avg reads'. sp_WhoIsActive can help if the issue is ongoing.
  • Tune queries with high reads
    Even if a query runs from memory, it can churn the buffer pool and evict useful pages, leading to other queries hitting disk more often.
  • Set reasonable autogrowth sizes
    Growing in 1MB chunks? That’s going to hurt. Aim for larger, consistent growth settings, especially for TempDB and transaction logs (a one-line example follows this list).
  • Move files to better storage
    Separate data, logs, TempDB, and backups if possible. SSDs or NVMe where it counts.
  • Clean up unused indexes
    If they’re not used for reads, they’re just extra write overhead. Especially your audit and logging tables that rarely get queried.
  • Keep your drivers and firmware current
    Storage vendors quietly fix performance bugs all the time.
  • Monitor your VM host’s disk utilization
    Especially in shared environments. Noisy neighbors can take you down.
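For the autogrowth item above, the fix is one ALTER per file; the database and logical file names are placeholders (find yours in sys.master_files):

-- Fixed-size growth beats 1MB or percentage growth for busy files
ALTER DATABASE [YourDatabase]
MODIFY FILE (NAME = N'YourDatabase_log', FILEGROWTH = 512MB);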

 

The Bottom Line:

SQL Server does a lot of things right, but it can’t make slow storage go faster. Verify the storage is the likely culprit before you go yell at the storage admin.

Before you throw more CPU or memory at a problem, take a closer look at your I/O path. You might just find the real bottleneck isn’t SQL Server at all.

Thanks for reading!

— Kevin

 

