Welcome back to the series “Server Review Essentials for Accidental and Junior DBAs.” So far in this series we’ve taken a look at how to set up your work environment, best practices for onboarding new servers and clients, and prepping your servers for daily reviews.
What You’ve Accomplished Thus Far
After all of our work and prep during the last several posts in this series, it’s finally time to get down to the “real” work of a DBA.
Wait, that’s not quite right. While it may seem as if everything covered so far in this series has been pretty basic and production servers have barely been touched, don’t get the idea that the work done so far isn’t of high value. Think about some of the things you’ve already learned about your servers during this preparation process:
- You’ve gathered all of the credentials needed to log into VPNs and individual servers, and verified that those credentials work.
- You’ve set up a stable work environment.
- You’ve built a starting personal script library that will grow and evolve over time.
- You’ve collected “biographical” information about every server you’ll be reviewing, so you have a baseline to see how everything stands before you begin reviews.
- You’ve learned the expected RPO and RTO for all of your databases.
- You’ve built a Server Inventory List that shows the status and use of every server you will be responsible for administering.
- You’ve discovered the current patch level of all the servers within your purview, which will give you vital knowledge as you move forward.
- You’ve learned what maintenance jobs are already active on your servers.
- You’ve learned what the current backups look like and whether or not you’re currently meeting RPO and RTO.
- And you’ve created a DBA Tools database on all of your servers so that you’re prepared for regular server reviews.
Why is all of the above so vital? If I’ve learned one thing in my two years of database administration, it’s that a large number of folks have no clear understanding of what’s going on in their server environments. Their approach to preserving their data is more seat-of-the-pants and reactive than proactive.
As outside consultants, we typically hear from these folks when they call to say, “My server is on fire!” Many of the issues we see on an “emergency” basis could have been prevented with regular reviews.
A Proactive Approach
At Dallas DBAs, we much prefer a proactive approach to database maintenance and performance. When we take on a new group of servers, the first thing we do is run a health check on the boxes and search for any issues that may need immediate remediation or any server settings that can be adjusted for immediate benefit.
While these topics are not covered in this series, we’re often looking for common issues related to things such as Cost Threshold for Parallelism, Instant File Initialization, MAXDOP, max server memory, the current state of backups, TempDB configuration, and more.
This gives us a launchpad to implement all of the prep work mentioned in this series so far, and better guide the people who are depending upon the data to run their business or otherwise get work done.
By taking a proactive approach, we often know about problems days or hours before our clients do. We can see if backups have failed, ensure index maintenance is occurring regularly, and keep an eye on job failures and errors.
We prefer to do a quick daily check of all servers, taking 15 minutes or less, by running a handful of scripts that I’ll begin sharing in the next few posts. These scripts quickly show us the state of the backups, reveal any job failures or errors we need to investigate, and surface alerts we may need to be aware of.
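To make the backup part of that daily check a little more concrete, here is a minimal sketch of the comparison involved: for each database, is the newest backup older than its RPO? This is not one of the scripts coming in later posts. The function name, the database names, and the timestamps are all made up for illustration, and it assumes you have already pulled the last backup time for each database from your servers.

```python
from datetime import datetime, timedelta


def find_stale_backups(last_backups, rpo_minutes, now):
    """Return names of databases whose newest backup is older than their RPO.

    last_backups: dict of database name -> datetime of most recent backup
                  (a missing entry means no backup was found)
    rpo_minutes:  dict of database name -> RPO in minutes
    now:          the time the check is run
    """
    stale = []
    for name, rpo in sorted(rpo_minutes.items()):
        finished = last_backups.get(name)
        # No backup at all, or one older than the RPO window, is a problem.
        if finished is None or now - finished > timedelta(minutes=rpo):
            stale.append(name)
    return stale


# Hypothetical sample data, not taken from a real server.
now = datetime(2024, 1, 15, 8, 0)
last_backups = {
    "Sales": datetime(2024, 1, 15, 7, 50),  # 10 minutes old
    "HR": datetime(2024, 1, 14, 2, 0),      # 30 hours old
}
rpo_minutes = {"Sales": 15, "HR": 1440, "Archive": 1440}

print(find_stale_backups(last_backups, rpo_minutes, now))
# prints ['Archive', 'HR']
```

In this sample, Sales is within its 15-minute window, HR has missed its 24-hour window, and Archive has no backup on record at all, so the last two are the ones to look into first.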
This proactive approach also helps us get to know what is expected from the servers. As we report job failures and errors, we may learn that certain issues are common knowledge and are not necessarily causing the end-users any pain.
We learn how our data users feel about deadlocks. They may not care about them if they don’t “feel” them. A job that runs every 10 minutes and is the victim of a deadlock once or twice a day likely isn’t going to be an issue. Sometimes job failures are simply random “network blips” that happen periodically, so all you can do is make a note, shrug your shoulders, and move on.
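The arithmetic behind that deadlock example is worth a quick look. A job on a 10-minute schedule runs 144 times a day, so two failures is a failure rate of a little over 1%. The sketch below works that out. The 5% cutoff is purely an assumption for illustration, as are the function names; what is actually tolerable is something you learn from the people who use the data.

```python
def runs_per_day(interval_minutes):
    """Number of times a job on a fixed schedule runs in 24 hours."""
    return (24 * 60) // interval_minutes


def failure_rate(failures_per_day, interval_minutes):
    """Fraction of a day's runs that failed."""
    return failures_per_day / runs_per_day(interval_minutes)


def triage(failures_per_day, interval_minutes, tolerated_rate=0.05):
    """Rough triage label; tolerated_rate is an assumed, illustrative cutoff."""
    rate = failure_rate(failures_per_day, interval_minutes)
    return "note and move on" if rate <= tolerated_rate else "investigate"


print(runs_per_day(10))   # prints 144
print(triage(2, 10))      # prints note and move on
print(triage(30, 10))     # prints investigate
```

A number like this is only a starting point for the conversation. A job that fails rarely but feeds a critical report may still deserve attention, while a noisy one that nobody feels may not.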
That’s all for now. Next time we’ll take a deeper dive into investigating your server’s backup history.