
DallasDBAs.com

SQL Server Database Consulting



SQL Server Alerts

October 15, 2025 by Kevin3NF

Don’t Let Trouble Sneak Up on You

 

Most SQL Servers run quietly. Until they don’t. By the time someone notices an application outage or a failed backup, you’re already behind. That’s why SQL Server’s built-in alerts exist – they give you an early warning before small problems become major outages.

There are a bunch of great 3rd party tools and community scripts available, but not every firm is going to make that investment or allow open-source code on their servers.

SQL Server Alerts are Microsoft-supported, built into the product, and rely on Database Mail for notifications. Configure them once, and you’ll have a safety net that runs 24/7. But like smoke detectors, too many false alarms and you’ll start ignoring them.

 

Step 1: Create an Operator

An operator is just the person (or distribution list) that gets notified.

In SSMS:

  • SQL Server Agent >> Operators >> (right-click) New Operator
  • Fill in a name and email address (use a group if possible). There is a 100-character limit on the email address.

 

T-SQL Example:

USE msdb;

EXEC msdb.dbo.sp_add_operator
    @name = N'DBA On Call',
    @enabled = 1,
    @email_address = N'[email protected]';

 

 

Step 2: Define the Alert

Alerts can fire on:

  • Specific errors (e.g., error 823 = disk I/O issue)
  • Severity levels (e.g., all severity 20+ errors)
  • Performance conditions or WMI Events

In SSMS:
SQL Server Agent >> Alerts  >> (right-click) New Alert  >> choose type and scope.

 

T-SQL Example:

USE msdb;

EXEC msdb.dbo.sp_add_alert
    @name = N'Error 823 Alert',
    @message_id = 823,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 300, -- 5 minutes
    @include_event_description_in = 1,
    @notification_message = N'Disk I/O error (823) detected!';

 

Step 3: Tie It Together

Link the alert to the operator so someone actually gets notified.

In SSMS:
Open the alert >> Response >> “Notify Operators”

 

T-SQL Example:

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 823 Alert',
    @operator_name = N'DBA On Call',
    @notification_method = 1; -- Email

 

 

Step 4: Enable the Mail Profile

Emails won’t get sent without this often-overlooked step.

In SSMS:

SQL Server Agent >> (right-click) >> Properties >> Alert System >> Check “Enable Mail Profile” and pick a profile from the drop-down. This requires Database Mail to be configured and working.
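If you want to confirm Database Mail itself works before trusting the alerts, a quick sketch like this helps (the profile name and recipient are placeholders — use your own):

```sql
-- Send a test message through Database Mail (profile/recipient are placeholders)
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'YourMailProfile',
    @recipients   = N'[email protected]',
    @subject      = N'Database Mail test',
    @body         = N'If you can read this, Database Mail is working.';

-- Check the outcome of recent send attempts
SELECT TOP 10 sent_status, sent_date, subject
FROM msdb.dbo.sysmail_allitems
ORDER BY mailitem_id DESC;
```

If the status sits at "unsent" or "failed", fix Database Mail first — no alert configuration will save you.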

 

Step 5: Cut the Noise

Not every warning deserves an email at 3 a.m. Start with the essentials:

  • Possible Corruption (823, 824, 825)
  • Critical Job failures (Agent jobs)
  • Severity 19+ errors (fatal errors, serious resource issues)
    • Severity 20 may give false positives if you are using vulnerability testing software
  • HADR role changes for unexpected AG failovers
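As a sketch, the severity-level alerts can be created in one loop rather than one at a time (the alert naming and the 'DBA On Call' operator are examples; drop or adjust severity 20 per the note above):

```sql
-- Create one alert per severity 19-25 and notify the operator (sketch)
DECLARE @sev int = 19;
WHILE @sev <= 25
BEGIN
    DECLARE @alert sysname = N'Severity ' + CAST(@sev AS nvarchar(2)) + N' Alert';

    EXEC msdb.dbo.sp_add_alert
        @name = @alert,
        @severity = @sev,
        @enabled = 1,
        @delay_between_responses = 300,   -- throttle repeats to every 5 minutes
        @include_event_description_in = 1;

    EXEC msdb.dbo.sp_add_notification
        @alert_name = @alert,
        @operator_name = N'DBA On Call',
        @notification_method = 1;         -- email

    SET @sev += 1;
END;
```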

 

Then, test and adjust. If the alerts are noisy, you won’t trust them when it matters.

 

The Bottom Line

Setting up alerts in SQL Server is one of the easiest wins for DBAs. They’re built in, supported by Microsoft, and once tied to Database Mail and operators, they can catch serious issues before your phone rings. Just be selective: too much noise and the real signals get lost, or the emails get “ruled” into a folder rather than acted on.

 


Free Disaster Readiness Quiz

I’ll trade you an email address for an honest assessment

DR Quiz – Are you ready?


 

Thanks for reading!

 

— Kevin

Filed Under: SQL, Troubleshooting Tagged With: syndicated

“SQL Server Is Slow” Part 4 of 4

October 7, 2025 by Kevin3NF

Parts 1, 2 and 3 got you to the (SQL) engine room. Now we use community-trusted tools to find what’s going on, fix it safely, and hopefully keep it from coming back.

This post will lean heavily on the First Responder Kit from Brent Ozar, sp_WhoIsActive from Adam Machanic, and others. They took what Microsoft provides and made it better — hundreds or thousands of hours of work to make FREE tools for you to use.

This is the most complex blog I’ve ever written. Your experiences may differ, and my code samples might have some bugs. Test first.

The Fork in the Road

After ruling out all the previous items from parts 1, 2, and 3, you’ll probably land in one of two branches:

  • Branch A: Obvious & Fixable Now – A misconfiguration, a runaway query, or an ugly blocking chain.
  • Branch B: Systemic & Chronic – Indexing issues, bad query plan, stats/CE/database compat changes, or a new workload exposing weaknesses.

 

First Pulse (rank evidence before touching anything)

Run these back-to-back to see “right now” and “what changed”:

-- 60s wait/CPU/IO snapshot with expert details
EXEC sp_BlitzFirst @Seconds = 60, @ExpertMode = 1;

-- Active requests + blockers + plan handles (save to table if you'll compare)
EXEC sp_WhoIsActive
     @get_plans = 1,
     @get_additional_info = 1,
     @find_block_leaders = 1;

 

If the server feels resource constrained (CPU/memory/tempdb/log), confirm with Erik Darling’s:

EXEC sp_PressureDetector;

 

Check Query Store (if enabled) for regressed queries over the last few hours. This is database level. Database Z that nobody thinks about could be sitting in the corner making life miserable for Database A.
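One rough sketch for that check, run in the database you suspect (the two-hour window and TOP 20 are arbitrary — tune to your incident timeline):

```sql
-- Heaviest Query Store queries over the last 2 hours, by total duration
SELECT TOP 20
       q.query_id,
       p.plan_id,
       SUM(rs.count_executions)                   AS executions,
       SUM(rs.avg_duration * rs.count_executions) AS total_duration_us,
       qt.query_sql_text
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_runtime_stats_interval i
     ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN sys.query_store_plan p        ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q       ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt ON qt.query_text_id = q.query_text_id
WHERE i.start_time >= DATEADD(HOUR, -2, SYSUTCDATETIME())
GROUP BY q.query_id, p.plan_id, qt.query_sql_text
ORDER BY total_duration_us DESC;
```

Query Store intervals are stored in UTC, so adjust the window if you’re anchoring to a local-time complaint.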

 

Branch A — Obvious & Fixable Now (surgical)

 

A1) Confirm the bad actor (copy/paste or screenshot what you find)

  • sp_WhoIsActive: highest CPU/reads/duration, blocking leaders, tempdb usage.
    • You may get lucky and see an obvious long running query blocking everything, or hogging CPU
  • sp_BlitzFirst (0s mode) for a quick wait profile on demand:
    • EXEC sp_BlitzFirst @Seconds = 0, @ExpertMode = 1;

 

A2) Get permission & document the plan

  • Who’s affected, what’s the risk, what’s the rollback.

 

A3) Take the action (one change at a time)

  • Kill a true runaway query(after sign-off):
    • KILL <session_id>;
  • Fix low-hanging config already validated in earlier parts: MAXDOP, Cost Threshold, auto-close/auto-shrink OFF, compatibility/CE sanity.
  • Remove a bad plan:

-- Remove the specific plan from the cache (sample handle)
DBCC FREEPROCCACHE (0x060006001ECA270EC0215D05000000000000000000000000);

  • Update some stats:

UPDATE STATISTICS dbo.TableName WITH FULLSCAN;  -- or sampled if huge

    • This will purge some plans from the cache, but new plans will have great stats on at least one table.

A4) Re-measure immediately

EXEC sp_BlitzFirst @Seconds = 30, @ExpertMode = 1;

 

Goal: waits normalize, blockers drop, and users confirm latency relief.

Branch B — Systemic & Chronic (rank → fix → re-measure)

 

B1) Rank the biggest fish

EXEC sp_BlitzFirst  @Seconds = 0, @ExpertMode = 1;   -- overall pain profile
EXEC sp_BlitzCache  @SortOrder = 'cpu';   -- try 'reads', 'avg cpu', 'memory grant', 'avg duration', etc.
EXEC sp_PressureDetector;                -- resource pressure confirmation

Use Query Store timelines to spot regressions and plan churn.

 

B2) Indexing reality check

EXEC sp_BlitzIndex @DatabaseName = 'YourDB', @Mode = 0;  -- database-wide health

Cross-check with native DMVs when you need specifics:

Top impact missing indexes (advisory; validate!)

EXEC sp_BlitzIndex @Mode = 3, @GetAllDatabases = 1, @BringThePain = 1;  -- missing index requests, all databases

Unused/rarely used indexes (drop/consolidate after monitoring)

-- Unused Index Script
-- Original Author: Pinal Dave
-- Edit the last Where clause to suit your situation

Create Table #Unused(
                [Database] varchar(1000),
                [Table] varchar (500),
                [IndexName] varchar(500),
                [IndexID] bigint,
                [UserSeek] bigint,
                [UserScans] bigint,
                [UserLookups] bigint,
                [UserUpdates] bigint,
                [TableRows] bigint,
                [Impact] bigint,
                [DropStatement] varchar(1500)
                )

exec sp_MSforeachdb

'use [?]
Insert #Unused

SELECT
                Db_Name(DB_ID()) as [Database]
                ,o.name AS [Table]
                , i.name AS IndexName
                , i.index_id AS IndexID
                , dm_ius.user_seeks AS UserSeek
                , dm_ius.user_scans AS UserScans
                , dm_ius.user_lookups AS UserLookups
                , dm_ius.user_updates AS UserUpdates
                , p.TableRows
                , dm_ius.user_updates * p.TableRows
                , ''DROP INDEX '' + QUOTENAME(i.name)
                + '' on '' + QUOTENAME(Db_Name(DB_ID())) + ''.''
                + QUOTENAME(s.name) + ''.''
                + QUOTENAME(OBJECT_NAME(dm_ius.OBJECT_ID)) AS ''drop statement''
FROM sys.dm_db_index_usage_stats dm_ius
                INNER JOIN sys.indexes i ON i.index_id = dm_ius.index_id
                                AND dm_ius.OBJECT_ID = i.OBJECT_ID
                INNER JOIN sys.objects o ON dm_ius.OBJECT_ID = o.OBJECT_ID
                INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
                INNER JOIN (SELECT SUM(p.rows) TableRows, p.index_id, p.OBJECT_ID
                                                                FROM sys.partitions p
                                                                GROUP BY p.index_id, p.OBJECT_ID) p
                                ON p.index_id = dm_ius.index_id AND dm_ius.OBJECT_ID = p.OBJECT_ID
WHERE
                OBJECTPROPERTY(dm_ius.OBJECT_ID,''IsUserTable'') = 1
                AND dm_ius.database_id = DB_ID()
                AND i.type_desc = ''nonclustered''
                AND i.is_primary_key = 0
                --AND i.is_unique_constraint = 0
                --AND o.name in (''CloverSummary'')
ORDER BY
                (dm_ius.user_seeks + dm_ius.user_scans + dm_ius.user_lookups) ASC
--GO
'
Select *
from #unused
Where 1=1
                --and [IndexName] like '%_DDBA'
                --and [IndexName] IN ('')
                --and [database] Not in ('MSDB','tempdb')
                --and [database] in ('StackOverflow')
                --and UserSeek + UserScans + UserLookups < 1
                --and [Table] in ('')
Order By [Database] asc, UserSeek + userscans + UserLookups, impact desc

Drop table #Unused

 

B3) Bad Queries (identify → inspect plan → fix → handoff)

  • When is a query “bad”? Probably if it ranks in your top 10 by total CPU/reads in the target window, has high avg cost and high executions, triggers spills/memory grants, blocks, or just regressed in Query Store.
    • Using the SortOrder parameter, also check for 'memory grants' and 'avg duration'
  • Find them fast:
EXEC sp_BlitzCache @SortOrder = 'cpu';           -- or 'reads', 'avg cpu', 'avg duration', etc.

EXEC sp_WhoIsActive @get_plans = 1
  • Classic face-palm: Key Lookup from “one more column”
    • What happens: a nonclustered index is used for a seek but doesn’t cover a new column added to SELECT, so the engine performs one lookup per row. This shows up in the Query Plan as a ‘Key Lookup’ or ‘RID Lookup’ operator.
    • Why it hurts: on hot paths, that lookup loop can hammer CPU/IO and tank a busy server.
    • Fix choices: INCLUDE the missing column(s) on the used index, reshape keys to cover, or trim the SELECT if it isn’t needed.

 

Quick hunt for plans with lookups:

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP 50
       DB_NAME(qp.dbid) AS dbname,
       (qs.total_logical_reads*1.0/NULLIF(qs.execution_count,0)) AS avg_reads,
       qs.execution_count, qt.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE qp.query_plan.exist('//RelOp/IndexScan[@Lookup="1"]') = 1  -- key/RID lookups
ORDER BY avg_reads DESC;

 

Typical covering fix (example):

CREATE NONCLUSTERED INDEX IX_Sales_OrderDate_CustomerID
  ON dbo.Sales (OrderDate, CustomerID)
  INCLUDE (TotalDue);   -- add the “new SELECT column” here
  • Other common suspects: implicit conversions on join/filter columns, parameter sniffing, spills from oversized sorts/hashes, scalar UDFs/multi-stmt TVFs, RBAR triggers.
  • Developer handoff package (keep it short and useful):
    • Evidence: normalized query text + sample parameters, actual plan (.sqlplan or XML), metrics window (total/avg CPU/reads/duration/executions), warnings (spills, lookups, implicit conversions).
    • Hypothesis & options: e.g., new column caused Key Lookup on PK_MyTable_ID → options: covering INCLUDE, index reshape/filtered index, query/ORM change, or plan stability tactic.
    • Safety: size/impact estimate, rollback (drop index/unforce plan), and success criteria (Query Store deltas, BlitzFirst/Cache snapshots).

 

B4) Plan stability

  • Verify statistics are being updated “often enough” and at the right sample size
  • Look for plan cache bloat and churn
  • Parameter sniffing fixes: targeted OPTION (RECOMPILE), OPTIMIZE FOR cases, or Query Store forced plan (monitor; keep rollback).
  • Memory grants/spills: better indexes (narrow keys + good includes), stats refresh, and watch row-goal operators.
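Hedged sketches of those plan-stability tactics (the query_id/plan_id values are hypothetical — pull real ones from Query Store first; dbo.Sales and @CustomerID are illustrative):

```sql
-- Force a known-good Query Store plan (verify the IDs before forcing)
EXEC sp_query_store_force_plan   @query_id = 42, @plan_id = 17;
EXEC sp_query_store_unforce_plan @query_id = 42, @plan_id = 17;   -- the rollback

-- Per-statement parameter sniffing fixes
DECLARE @CustomerID int = 123;
SELECT OrderDate, TotalDue
FROM dbo.Sales
WHERE CustomerID = @CustomerID
OPTION (RECOMPILE);   -- or: OPTION (OPTIMIZE FOR (@CustomerID UNKNOWN));
```

Forced plans are a tourniquet, not a cure — monitor them and keep the unforce command in your rollback notes.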

 

B5) Stats/CE/compat sanity

  • Ensure AUTO_UPDATE_STATISTICS (and consider ASYNC) fit the workload.
  • Recent compat level or CE changes? Compare before/after in Query Store.

 

B6) Parallelism & CPU policy

  • Validate MAXDOP and Cost Threshold against core count + workload mix.
  • Use BlitzCache to spot skewed exchanges or thrashy parallel plans.
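For reference, both settings are plain sp_configure changes (the values below are illustrative, not recommendations — set them per your core count and workload mix):

```sql
-- Values are examples only; validate against your hardware first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```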

 

B7) Tempdb & log health

  • sp_PressureDetector will flag contention/pressure; confirm with file IO stats:
SELECT DB_NAME(vfs.database_id) AS dbname, mf.type_desc, mf.physical_name,
       vfs.num_of_reads, vfs.num_of_writes,
       vfs.io_stall_read_ms, vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL,NULL) vfs
JOIN sys.master_files mf ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY (vfs.io_stall_read_ms + vfs.io_stall_write_ms) DESC;
  • Right-size VLFs (all active databases), ensure multiple equally sized tempdb data files, and watch version store if RCSI/SI is on.
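A quick sketch for the VLF check across all databases (sys.dm_db_log_info needs SQL Server 2016 SP2 or later):

```sql
-- VLF count per database; very high counts suggest the log grew in tiny chunks
SELECT d.name,
       COUNT(*) AS vlf_count
FROM sys.databases d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
GROUP BY d.name
ORDER BY vlf_count DESC;
```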

 

B8) New workload exposure

  • Correlate Query Store/BlitzFirst windows with deploys, ORM queries, reporting jobs, ETL shifts, or seasonal peaks. Fix that pattern first (indexing, parameterization, caching, schedule changes).

 

Change Control & Safety Net

  • One change at a time → measure → document → keep/rollback.
  • Save “before/after” artifacts: sp_WhoIsActive snapshots (to table), BlitzFirst output, BlitzCache exports, Query Store screenshots/DMVs.
  • Always include who/when/why and expected KPI movement.
  • Assume someone will want a post-mortem. I wrote one this morning for a client outage.

 

What “Good” Looks Like

  • Waits shift away from bottleneck classes to benign background.
  • Top 10 statements show reduced CPU/reads; fewer/shorter blocking chains.
  • Tempdb/log growth stabilize; fewer off-hours alerts.
  • Users say “fast enough,” matched to your baseline and SLAs.

 

Keep It from Coming Back

  • Maintain baselines (perf counters, waits, file IO, Query Store top queries).
  • Align index & stats maintenance to your workload.
  • Add deploy gates for schema/index/compat changes with pre/post metrics.
  • Keep lightweight Extended Event sessions for spills, long-running, parameter-sensitive queries.
  • Review Query Store regressions and any forced plan safety periodically.

 

The Bottom Line:

Slow SQL isn’t mysterious; it’s measurable. Rank the pain, fix the biggest offender or the pattern behind it, and prove the result with before/after metrics. Keep notes as you go, and be methodical.

Above all…be calm. Everyone else can panic while you be the hero.


First Month of Pocket DBA® Free!

Pocket DBA


— Thanks for reading!

Filed Under: Indexing Strategies, Performance Tuning, SQL, Troubleshooting Tagged With: syndicated

“SQL Server is Slow” Part 3 of 4

October 1, 2025 by Kevin3NF

In parts 1 and 2 of this series, we gathered info and did the triage, just like anyone in almost any industry does.

At this point you’ve:

  • Defined what “slow” means and built a timeline (Part 1).
  • Checked things outside SQL Server like network, storage, and VM noise (Part 2).

Now it’s time to open the hood on SQL Server itself.


Step 1: Check Active Sessions

Run a quick session check (sp_WhoIsActive is a favorite):

  • Who’s running queries right now?
  • What queries have been running the longest? Is that normal?
  • Any blocking chains?
  • Are any queries hogging resources?
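A sketch of the kind of call that answers those questions in one pass:

```sql
-- Snapshot of active sessions, plans, and blocking leaders
EXEC sp_WhoIsActive
     @get_plans = 1,            -- pull the live query plans
     @find_block_leaders = 1,   -- surface the head of each blocking chain
     @get_additional_info = 1;  -- extra context (tempdb, settings, etc.)
```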

At this stage, you’re only identifying potential offenders. Next issue, we’ll dig into queries and indexes more deeply.

 

Step 2: Look at Wait Stats

Wait stats tell you what SQL Server has really been waiting for (everything in SQL Server is waiting for something else):

  • PAGEIOLATCH: slow storage reads.
  • LCK_M_X: blocking/locking.
  • CXPACKET/CXCONSUMER: parallelism info.
  • THREADPOOL: CPU threads
  • RESOURCE_SEMAPHORE: memory
  • ASYNC_NETWORK_IO: Probably more of a client side problem than SQL Side
    • The most comprehensive list of wait types and explanations from my friends at SQL Skills

This isn’t about solving yet – it’s about categorizing where SQL feels the pain.
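A minimal wait-stats snapshot looks like this — note the benign-wait filter here is a tiny subset; the SQLSkills script linked above filters far more thoroughly:

```sql
-- Top waits accumulated since the last restart (or stats clear)
SELECT TOP 10
       wait_type,
       wait_time_ms / 1000.0        AS wait_s,
       signal_wait_time_ms / 1000.0 AS signal_s,   -- high = CPU pressure
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE N'SLEEP%'
  AND wait_type NOT IN (N'BROKER_TASK_STOP', N'LAZYWRITER_SLEEP',
                        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```

Remember these numbers are cumulative — for "what is it waiting on right now," diff two snapshots a minute apart.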

 

Step 3: Review Agent Jobs & Error Logs

SQL may already be waving red flags:

  • Overlapping or stuck Agent jobs. A long running purge job or index rebuild can cause all sorts of issues during the day.
  • Failed backups or CHECKDB runs. A failed CHECKDB could mean corruption. Read this
  • Errors or memory dumps tied to patching or system instability. Look in the same folder as your ERRORLOG location
    • Can’t find that folder? Watch this

 

Step 4: Don’t Forget the “Gotchas”

Other less obvious issues can cause system-wide drag:

  • High VLF count (often from failed or missing log backups).
  • Database compatibility or config changes – check SSMS reports like:
    • Server level: Configuration Changes History
    • Database level: All Blocking Transactions, Index Usage Statistics
  • Recent patching issues (especially if tied to errors or dump files).

These aren’t everyday culprits, but when they show up, they can cripple performance.

 

Step 5: Compare Against Your Baseline

Today’s “slow” may be tomorrow’s “normal.”

  • Track batch requests/sec, CPU Utilization, wait stats, I/O latency, and log file size/VLF count.
  • Without this baseline, every slowdown feels like a brand-new mystery.

If you don’t already have a baseline, NOW is the time to start, while the server is healthy.

  • Collect I/O stats and wait stats regularly.
  • Run sp_Blitz for a full health snapshot (free tool from Brent Ozar)
  • Capture DMV performance counters (sys.dm_os_performance_counters) on a schedule.

A baseline doesn’t need to be fancy, it just needs to exist so you know what “normal” looks like before things go sideways.
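As a starting sketch, a scheduled job can dump a few counters into a table (the counter list is a minimal example; note many of these are cumulative, so trend the deltas, not the raw values):

```sql
-- Snapshot a few key counters; schedule this into a baseline table
SELECT SYSDATETIME()        AS sampled_at,
       RTRIM(object_name)   AS object_name,
       RTRIM(counter_name)  AS counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',
                       N'Page life expectancy',
                       N'Buffer cache hit ratio');
```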


The Bottom Line

Part 3 is about categorizing slowness inside SQL Server: sessions, waits, jobs, error logs, and configuration gotchas. Don’t jump straight into query rewrites yet. You’re still isolating the nature of the slowdown. Having a consistent process for this reduces panic and anxiety.

In Part 4, we’ll cover what to do when the culprit is truly inside SQL Server: look at queries, indexes, and design choices.

 

Thanks for reading!

Filed Under: Beginner, Performance Tuning, SQL, Troubleshooting Tagged With: syndicated

“SQL Server Is Slow” – Part 2 of 4

September 24, 2025 by Kevin3NF

The 10-Minute Outside-In Triage

Don’t Blame SQL First

It’s 9:05 AM and your helpdesk lights up: “The SQL Server is down. Nothing works.”

By 9:07, everyone is staring at you.

The trap: you open SSMS and start digging for blocking queries. But what if the database isn’t the problem at all?

I’ve seen teams lose hours chasing SQL when the real culprit was a Windows power plan, an antivirus update, or a snapshot on the storage array.

That’s why step two in our framework is a 10-minute outside-in triage. You prove or eliminate the usual suspects before you go digging inside SQL Server.

The Outside-In Checklist

When an incident is happening right now, you don’t need deep analysis. You need to rule out the obvious and collect proof.

Step 1: Blast Radius (1 min)

  • Are other apps/servers slow too?
  • Other users, or just one?
  • Cloud? Check provider status pages and throttling alerts.
    If everything is slow, SQL may not be the root cause.

 

Step 2: Host & OS Health (3 – 4 min)

  • CPU: Is the server pegged, not just sqlservr.exe?
    • I have a story here…coined the term “three finger salute” from it (CTRL-ALT-DELETE). Ask me if we ever meet in person!
  • Hypervisor: Look for VM “ready time” or “steal time.”
  • Power plan: Make sure Windows is on High Performance.
  • Memory: Check paging, commit vs installed.
  • Antivirus: Any active scans?
  • Windows Update: Patches or Defender scans running?

 

Step 3: Storage Sanity (3 – 4 min)

  • Latency: Are read/write times spiking?
  • Queue depth: Backlog on MDF, LDF, or tempdb volumes?
  • Snapshots/backup agents: Running now?
  • Cloud disks: Bursting credits exhausted?
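A quick latency check from inside SQL Server (thresholds vary by workload, but sustained averages above 20-30 ms usually warrant a storage conversation):

```sql
-- Average read/write latency (ms) per database file since startup
SELECT DB_NAME(vfs.database_id) AS dbname,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
     ON mf.database_id = vfs.database_id
    AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;
```

These are averages since startup — during a live incident, take two snapshots a minute apart and compare.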

 

Step 4: Network Checks (1 – 2 min)

  • Is RDP sluggish? File copies slow?
  • Any new firewall/VPN/SSL changes?

 

Step 5: Quick SQL Health Pulse (1 – 2 min)

(You’re not tuning queries here, you’re checking basic health)

  • Blocking chains: is one session holding everyone else hostage? sp_WhoIsActive is great for this!
  • Current waits: PAGEIOLATCH (storage), WRITELOG (log), ASYNC_NETWORK_IO (client/network). Use SQLSkills Waits Now script
  • Active jobs: backups, CHECKDB, index maintenance colliding with business hours.

 

The Bottom Line

The point of a 10-minute triage isn’t to fix everything. It’s to prove or eliminate SQL Server as the culprit.

When you walk into a war room and say, “Here’s proof it’s the storage snapshot, not SQL,” you stop the blame game cold.

At Dallas DBAs, we use this exact checklist when clients call in a panic. It’s fast, repeatable, and it keeps you from chasing the wrong rabbit.


First Month Free for New Pocket DBA® Clients

https://pocket-dba.com/


Thanks for reading!

Filed Under: SQL, Troubleshooting Tagged With: syndicated

“SQL Server Is Slow” part 1 of 4

September 17, 2025 by Kevin3NF

How should you respond when you get the dreaded Email/Slack/Text/DriveBy from someone yelling at you that SQL Server is slow?

 

Stop. Don’t Open SSMS Yet.

You’ve heard it before: “The server is slow.”

What does that actually mean?

If you jump straight into SQL Server looking for blocking, bad queries, or a missing index, you’re working with bad input. And bad input = wasted time.

The real work starts before you ever connect to SQL Server.

 

Intake & Scoping Comes First

Your first job is to turn “slow” into something measurable and time-bound. That means gathering evidence, building a timeline, and classifying the type of incident.

 

Translate “Slow” Into Scenarios

“Slow” can mean very different things depending on who you ask:

  • Login delays – Applications take 30 seconds to connect.
  • Application lag – a report that usually runs in 2 seconds now takes 30.
  • Broad latency – all users complain at once.
  • Maintenance overruns – ETL jobs not finishing in their normal window.
  • Timeouts or errors – not just lag, but outright failures.

Each one leads you in a different direction.

 

The Fast Intake Questionnaire

Ask these every single time:

  1. What exactly was slow? (screen/report/job/login)
  2. When did it start? (date/time, not “earlier today”)
  3. How widespread? (one user, team, region, everyone)
  4. Repeatable? (every time, sometimes, once)
  5. What changed? (deploys, patches, config, new data)
  6. Is it happening now? (yes/no)
  7. Evidence (screenshots, error messages, job names with timestamps)

 

Classify the Incident

Once you have real answers, put it in a bucket:

  • Happening now – live triage needed.
  • Recurring/pattern – happens on a schedule.
  • Sporadic – random spikes.
  • One-time – tied to a patch, deploy, or outage.

 

Build the Timeline

Anchor events with real timestamps.

  • “Around lunch” = useless.
  • “9:12 AM CST, FinanceApp login timeout” = actionable.

Capture start, stop, duration, who was affected, and what else was happening on the system at that time.

 

The (first) Bottom Line:

Action without information is useless. Any other service provider (ER doctor, plumber, electrician, auto mechanic) will ALWAYS triage your issues before starting an investigation or fix. You should too.

 

SQL Tidbit:

When gathering timelines, align your clocks. SQL Server may log in UTC while users report in Central or Eastern time. A “9 AM slowdown” can easily show up in SQL logs at 2 PM. Get this wrong and you’ll chase the wrong evidence.
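A quick way to see how far the server’s local clock sits from UTC before you start lining up timestamps:

```sql
-- Compare server local time to UTC; the offset tells you how to translate logs
SELECT SYSDATETIME()       AS server_local_time,
       SYSUTCDATETIME()    AS utc_time,
       SYSDATETIMEOFFSET() AS local_time_with_offset;
```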

Thanks for reading!

Filed Under: Emergency & Troubleshooting, Performance Tuning, SQL, Troubleshooting Tagged With: syndicated

DBCC CHECKDB: Just Because It’s Quiet Doesn’t Mean It’s Safe

August 13, 2025 by Kevin3NF

Corruption isn’t a “maybe someday” problem. Here’s what you need to do now.

Stop. Don’t panic.

You just ran DBCC CHECKDB for the first time in a while (or maybe ever) and saw something you didn’t expect: the word corruption.

Take a breath.

Don’t detach the database.
Don’t run REPAIR_ALLOW_DATA_LOSS.
Don’t reboot the server or start restoring things just yet.

There’s a lot of bad advice floating around from old blogs, well-meaning forum posts, and even some popular current LinkedIn threads. Some of it might’ve been okay 15 years ago. Some of it is dangerous.

Let’s dig in.

What Corruption Really Means

When SQL Server says there’s corruption, it’s not talking about “bad data” like wrong numbers or missing values. It means it found internal structures that are damaged: the kind of damage that can cause queries to fail or even make your database unusable.

This could be:

  • Broken data or index pages
  • Allocation inconsistencies (GAM, SGAM, PFS pages)
  • Corrupt system metadata
  • Problems in the transaction log.

This isn’t a performance problem.
It’s a data integrity problem. If left untreated, it can get worse.

How Does Corruption Happen?

Even if your server is well-configured, corruption can still creep in. Common causes include:

  • Failing disks or controllers (especially SANs and older SSDs)
  • Disk subsystems lying about successful writes
  • Power outages or hard shutdowns.
  • Sometimes SQL Server itself has bugs that cause corruption – especially in RTM versions.
  • Snapshot or backup software interfering at the file level
  • Antivirus software scanning .mdf, .ldf, or .ndf files directly

Some of these things leave no obvious signs. This is why running CHECKDB regularly is so important.

What DBCC CHECKDB Actually Does

When you run DBCC CHECKDB, SQL Server performs a deep consistency check of your database:

  • Every table, every index, every system structure
  • Logical and physical page consistency
  • Allocation integrity

If possible, SQL uses a snapshot to avoid locking the database.

What it doesn’t do:

  • Fix anything (unless you tell it to)
  • Prevent corruption
  • Run automatically (unless you set it up)

How Often Should You Run It?

Ideally: once per week, at minimum.

  • Schedule it in a SQL Agent job, off-hours.
  • Save the job output to file or table so you don’t miss warnings.
  • Set up an email alert for failures of this job (as well as corruption alerts for errors 823-825)
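The job step itself can be as simple as this (the database name is a placeholder; the options keep the output readable and complete):

```sql
-- Weekly consistency check: report every error, skip informational chatter
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- On very large/busy systems, some teams split the work:
-- DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY;  -- faster, page-level checks only
```

PHYSICAL_ONLY skips the logical checks, so if you use it, still run a full CHECKDB on some cadence.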

If CHECKDB takes too long or hits your performance too hard, you can offload the work.

Offload CHECKDB with Test-DbaLastBackup

If you’re taking backups regularly (you are, right?), you can use Test-DbaLastBackup from the dbatools.io PowerShell module to verify database consistency (and restorability) without touching production.

This command:

  • Restores your most recent backup to another SQL instance
  • Runs DBCC CHECKDB against the restored copy
  • Confirms both restorable state and internal consistency

 

Test-DbaLastBackup -SqlInstance "TestRestoreSQL" -Destination "TestRestoreSQL" -Database "YourDatabase"

It’s a great way to validate backups and run CHECKDB in a lower-impact environment.
Not a replacement for CHECKDB in production, but a powerful supplement when time or resources are tight.

  • Consider running CHECKDB on a secondary replica if you’re using Availability Groups.
  • If CHECKDB fails due to size or takes too long, it’s even more important to find time and a strategy that works.

What to Do If You Find Corruption

  1. Read the output carefully.
    It tells you which object is affected and how.
  2. Run CHECKDB again to confirm.
    Temporary issues can happen, especially on shared storage.
  3. Do not detach the database.
    Doing so loses the ability to investigate further.
  4. Check your backups.
    Can you restore from before the corruption appeared? This is the first thing Microsoft will tell you when you call support.
  5. If you are really lucky, the corruption might be in a nonclustered index, and dropping/recreating that index may solve it for now.
  6. Still stuck?

Read this from Brent Ozar: DBCC CHECKDB Reports Corruption? Here’s What to Do

About REPAIR_ALLOW_DATA_LOSS

That command does exactly what it says: it removes damaged pages and objects to make the database consistent again—even if that means losing real data.

Use it only when:

  • You have no usable backup
  • You’ve consulted with your team and accepted the risk (get that in writing from your manager/CTO)
  • You’ve tried every other recovery option

If you’re not 100% sure what it’s going to delete (if anything), you’re not ready to run it. This is the sort of thing that can get you fired. So is not having backups.

How to Check When DBCC CHECKDB Was Last Run

This script gives you the last successful run for each database:

SELECT 
    name AS DatabaseName,
    DATABASEPROPERTYEX(name, 'LastGoodCheckDbTime') AS LastCheckDBSuccess
FROM 
    sys.databases
WHERE 
    state_desc = 'ONLINE'
ORDER BY 
    LastCheckDBSuccess DESC;

If the date shows 1900-01-01 (the default), CHECKDB has never completed successfully on that database.

The Bottom Line

Corruption doesn’t announce itself with a trumpet. You only know it’s there if you go looking.

CHECKDB gives you an early warning. It’s not glamorous, but it’s essential, especially in environments without a dedicated DBA watching for signs of trouble.

If you’re not running it, you’re flying blind.

If you don’t know what to do when it finds something, now’s the time to prepare.

Don’t panic. But don’t ignore it either.

 

Thanks for reading!

— Kevin

Filed Under: SQL, Troubleshooting Tagged With: syndicated

