Trainer and Consultant extraordinaire Ginger Grant stops by to talk Machine Learning, Databricks, Certifications, Norwegian pastries and proper chocolate frosting.
Image credit to Jeff (t)
Back in June of 2019, I published this YouTube video covering the highlights of the various SQL Server High Availability and Disaster Recovery options. But I never wrote the companion post like I usually do…so here we go!
First, some definitions:
- High Availability (HA): typically means that the database will be back online in seconds or minutes, not hours or days.
- Disaster Recovery (DR): the ability to bring the database server and databases online after something really bad happens.
- HADR: an umbrella term covering any feature that provides HA, DR or both.
- RPO (Recovery Point Objective): the point in time we can recover to, or put another way, how much data loss we can handle.
- RTO (Recovery Time Objective): how long it takes to get everything running again.
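One way to make RPO concrete is to ask how long ago each database last had a transaction log backup, since that gap is roughly the data you would lose right now. The query below is a minimal sketch against the backup history in msdb. It assumes the history has not been purged and that it runs on the server that takes the backups.

```sql
-- Minutes since the last transaction log backup, per database.
-- Databases in SIMPLE recovery cannot take log backups, so they are skipped.
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_log_backup,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE()) AS minutes_since_log_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'L'                 -- 'L' = transaction log backup
WHERE d.recovery_model_desc <> 'SIMPLE'
GROUP BY d.name
ORDER BY d.name;
```

A NULL in last_log_backup means no log backup is on record for that database, which is worth investigating before anything else.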
SQL Server comes with a number of features that cover different scenarios, in different ways. Some are only in the Enterprise edition.
I am going to cover these at a very high level so you can quickly get the idea in your head, and refer back to this post as needed later. We will start with the most basic.
Backup and Restore
It sounds basic, but backing up the databases is the simplest, most cost-effective piece of a DR solution. BACKUP DATABASE is available in all editions of SQL Server. Express edition makes backups harder to automate because it has no SQL Agent. There are workarounds for that, such as Windows Task Scheduler and the Ola Hallengren scripts.
Backups by themselves are great, but in the event of an actual disaster, the ability to RESTORE them is the critical part. You should be backing up and test restoring your databases regularly.
This is database level only. Back up all user and system databases.
Backup and Restore is DR, not HA.
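As a rough illustration of "back up, then prove you can restore", here is a minimal sketch. The database name MyDB, the D:\Backups and D:\RestoreTest folders, and the logical file names MyDB and MyDB_log are all assumptions for the example. Check the real logical names with RESTORE FILELISTONLY before using anything like this.

```sql
-- Assumed names: database MyDB, folders D:\Backups and D:\RestoreTest,
-- logical file names MyDB / MyDB_log. Adjust all of these to your environment.

-- 1. Take a full backup and have SQL Server checksum the pages as it writes.
BACKUP DATABASE [MyDB]
TO DISK = N'D:\Backups\MyDB_full.bak'
WITH CHECKSUM, INIT;

-- 2. Quick sanity check: the backup file is readable and its checksums match.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\MyDB_full.bak'
WITH CHECKSUM;

-- 3. List the logical file names needed for the MOVE clauses below.
RESTORE FILELISTONLY
FROM DISK = N'D:\Backups\MyDB_full.bak';

-- 4. The real test: restore under a different name, then check consistency.
RESTORE DATABASE [MyDB_RestoreTest]
FROM DISK = N'D:\Backups\MyDB_full.bak'
WITH MOVE N'MyDB'     TO N'D:\RestoreTest\MyDB_RestoreTest.mdf',
     MOVE N'MyDB_log' TO N'D:\RestoreTest\MyDB_RestoreTest_log.ldf',
     CHECKSUM, RECOVERY;

DBCC CHECKDB (N'MyDB_RestoreTest') WITH NO_INFOMSGS;
```

RESTORE VERIFYONLY is only a sanity check on the file. The restore plus DBCC CHECKDB in step 4 is what actually shows the backup is usable.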
Log Shipping (LS)
Transaction Log Shipping at its core is just a Backup-Copy-Restore process with some bells and whistles added: a GUI for setup, alert jobs to let you know when it gets behind, etc. The basic concept is that a T-log backup is taken on a schedule (every 15 minutes, for example), then copied to another server, then restored there. In LS, you can set up the Secondary server to be read-only between restores, and you can have multiple Secondaries.
This is database level only, so Logins, jobs, etc. are not copied.
You need to be familiar with File shares, UNC paths, permissions settings, etc. to set LS up.
Log Shipping is DR only, not HA, unless you write a bunch of scripts to detect an issue, catch it all up and repoint your applications to the new Primary.
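Stripped of the GUI and alert jobs, one Backup-Copy-Restore cycle looks roughly like the sketch below. The share \\FileServer\LogShip, the E:\LogShip folder and the file names are made up for illustration. It also assumes the Secondary copy of MyDB was first restored from a full backup WITH NORECOVERY or STANDBY.

```sql
-- On the Primary: back up the transaction log to a share the Secondary can read.
BACKUP LOG [MyDB]
TO DISK = N'\\FileServer\LogShip\MyDB_20190601_1200.trn'
WITH CHECKSUM;

-- (Copy step: a scheduled job copies the file to the Secondary's local folder.)

-- On the Secondary: restore the log but stay ready for the next one.
-- STANDBY leaves the database read-only between restores;
-- use NORECOVERY instead if nobody needs to read it.
RESTORE LOG [MyDB]
FROM DISK = N'E:\LogShip\MyDB_20190601_1200.trn'
WITH STANDBY = N'E:\LogShip\MyDB_undo.tuf';

-- Only when failing over for real: bring the Secondary online.
-- RESTORE DATABASE [MyDB] WITH RECOVERY;
```

Once the database is recovered it can no longer accept further log restores, which is why that last statement is left commented out.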
Database Mirroring (DBM)
Database mirroring is a deprecated feature, but it still exists in some recent versions of SQL Server. Check the Microsoft documentation for your version. DBM is available in Standard and Enterprise editions. As the name implies, the Principal copy of the database is mirrored to the Mirror copy on a secondary server.
DBM works at the transaction level, unlike Log Shipping which uses T-log backups. DBM can run in either Synchronous or Asynchronous mode, with either Automatic or Manual failover. Not all combinations of these two exist: automatic failover, for example, requires synchronous mode plus a witness server.
This is Database level, so jobs, logins, etc. do not participate.
Since Automatic failover is available, I classify DBM as HA, and DR at the Database level.
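If you inherit a server and are not sure whether mirroring is in play or how it is configured, a query along these lines against the sys.database_mirroring catalog view gives a quick picture. It is a minimal sketch, and the comments describe the values I would expect to see.

```sql
SELECT DB_NAME(m.database_id) AS database_name,
       m.mirroring_role_desc,          -- PRINCIPAL or MIRROR
       m.mirroring_state_desc,         -- e.g. SYNCHRONIZED, SYNCHRONIZING, DISCONNECTED
       m.mirroring_safety_level_desc,  -- FULL = synchronous, OFF = asynchronous
       m.mirroring_partner_name,
       m.mirroring_witness_name        -- the witness, if one is configured
FROM sys.database_mirroring AS m
WHERE m.mirroring_guid IS NOT NULL;    -- only databases that are actually mirrored
```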
Always On—a marketing term from MS that includes 2 features:
SQL Server Failover Cluster Instance (FCI)
This is a traditional Windows Cluster (WSFC) that has been around for ages, with one or more SQL Server Instances installed into it.
Failover is handled by the Windows Cluster service and is usually very quick, often just a few seconds. Exceptions exist, such as 10,000 databases all starting up at the same time.
An over-simplified explanation—2 or more servers (nodes) are connected to a SAN. The databases exist on the SAN, so in a failover situation, they don’t move.
If the SQL Instance on Server A becomes unreachable, the cluster stops the SQL Service there and starts it on Server B.
The SAN is a single point of failure, unless it is being replicated via some non-SQL Server technology. For this reason, a SQL FCI (in my opinion) is not a full HADR solution…but definitely IS HA.
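To see whether an instance is an FCI and which node currently hosts it, something like the following should work. It is a small sketch, the second query needs VIEW SERVER STATE permission, and it returns no rows on a non-clustered instance.

```sql
-- 1 if this instance is a Failover Cluster Instance, 0 otherwise,
-- plus the physical server it is running on right now.
SELECT SERVERPROPERTY('IsClustered') AS is_clustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node;

-- Nodes that can host this instance and which one owns it at the moment.
SELECT NodeName, status_description, is_current_owner
FROM sys.dm_os_cluster_nodes;
```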
SQL Server Availability Groups (AGs)
AGs are an Enterprise-only feature as of this writing.
AGs are essentially a much-improved version of Database Mirroring. Transaction level data movement from Primary to Secondary for one or more databases in a Group. Multiple groups are allowed. Synchronous or Asynchronous. Manual or Automatic failover. Readable secondaries to offload reporting queries.
Standard Edition has a “Basic Availability Group” which has lots of limitations, chief among these being one database per group.
When set up correctly, AGs are both HA and DR for the user databases, with no single point of failure. There are script options to keep the jobs, logins, etc. in sync between the Primary and all Secondary replicas.
A Windows Cluster is required, but a SAN is not. AGs work off local storage, not shared storage, since there is a copy of each database on each server participating.
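The Synchronous/Asynchronous and Manual/Automatic settings are per replica, and the catalog views show them directly. Here is a minimal sketch, best run on the Primary, since a Secondary only reports state for itself.

```sql
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,                             -- PRIMARY or SECONDARY
       ar.availability_mode_desc,                 -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
       ar.failover_mode_desc,                     -- AUTOMATIC or MANUAL
       ar.secondary_role_allow_connections_desc,  -- NO, READ_ONLY or ALL
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
     ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
     ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;
```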
The licensing of AGs has changed a lot, so I won’t get into it here…but you probably already know that Enterprise Edition licenses are VERY expensive. Plan accordingly with your license vendor.
Replication—a very Special Snowflake:
SQL SERVER REPLICATION IS NEITHER HA NOR DR.
Not everything in a SQL Server user database CAN be replicated, such as users, or (for transactional replication) tables with no Primary Key. New objects are not automatically sent from Publisher to Subscriber. System databases are not replicated.
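Since transactional replication requires a Primary Key on every published table, it can be useful to list the tables that lack one before planning a publication. This is a small sketch, to be run in the user database in question.

```sql
-- User tables in the current database that have no primary key.
SELECT s.name AS schema_name,
       t.name AS table_name
FROM sys.tables AS t
JOIN sys.schemas AS s
     ON s.schema_id = t.schema_id
WHERE OBJECTPROPERTY(t.object_id, 'TableHasPrimaryKey') = 0
ORDER BY s.name, t.name;
```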
Replication IS a great option to send a subset of your data to another server for reporting, or for filtering by region or salesperson.
Replication is for Distributed Data Processing. Backup and Restore beats Replication when you have to rebuild an environment, every time.
I hope this gives you a better understanding of the high-level concepts behind the various HA/DR options available to you. There are “gotchas” and details that are impossible to cover in this post, but all of these are extensively documented by Microsoft and the SQL Server Community at large.
Thanks for reading!
SQL Server may skip 1000 numbers on an Identity column if the server crashes. Here’s why:
Too long, didn’t watch version:
SQL Server caches 1000 identity values at a time to boost insert performance. After a crash and recovery, the unused values in that cache are gone.
SQL 2016 and earlier – use instance-wide trace flag 272 to turn off this behavior (performance might suffer).
SQL 2017 and later – it's now a database-scoped configuration item:
USE MyDB;
GO
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;
GO
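To confirm which setting is actually in effect, you can check the scoped configuration on SQL 2017 and later, or the trace flag status on older versions. It is a minimal sketch, and MyDB is just the example database name used above.

```sql
USE MyDB;
GO
-- SQL 2017 and later: value = 1 means identity caching is ON (the default), 0 means OFF.
SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = N'IDENTITY_CACHE';
GO
-- SQL 2016 and earlier: shows whether trace flag 272 is enabled globally.
DBCC TRACESTATUS (272, -1);
GO
```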
Video shows a walk-through of before and after each fix, plus a “Two guys walk into a bar” joke when I disappeared to troubleshoot a broken demo…
Thanks for reading and/or watching!