I've been a contractor for most of my career, so I have experience with a lot of companies. One of the most common and most frustrating things I encounter is preventable errors. Over the years I've compiled quite a list, a portion of which I turned into my DevOps Holiday Emergency Checklist.
This year I decided I'd highlight my list of top 10 preventable errors as a set of New Year's Resolutions for DevOps.
Hopefully you can say that you've already protected yourself from mistakes like these. However, as you'll see, they are all too easy to make. I've made them more times than I care to admit, and I've seen them made hundreds of times by now. If anything here helps you prevent a preventable error from happening at a critical time, then this article was worth writing.
#1: Don't be a purist. Instead of trying to find the "best" solution or technology, use what you have well. There is no perfect solution, so stop trying to find it. When I was younger and inexperienced, I believed in finding the "best" tool, but I soon came to realise that a purist mentality leads only to inefficiency and incompetence. The most important lesson I've ever learned is that it is not about the technology, it is about how intelligently you use it. While the Internet endlessly debates which is "best" (e.g. Azure vs AWS vs GCP, Gitlab vs Github vs Azure DevOps vs ArgoCD vs Jenkins, Datadog vs Splunk vs Prometheus/Grafana vs ELK), I just use what the client has or what fits the client ecosystem, and I make it work. And if the tool doesn't have something I need, I figure out a way to supplement it and keep moving.
#2: Always make sure you know when your plans are up for renewal and that your payment details are current. This sounds obvious, but I've seen it happen so many times, especially with CI/CD or related DevOps services. There is nothing quite like the sound of no pipelines running, or devs with nothing to do because their latest check-ins are hanging. I've also seen it with cloud plans themselves, as well as with third-party dependencies. While you can argue this is a CFO concern, I disagree: if I'm responsible for the platform, then I'd damn well better know about everything that can break it.
#3: Improve your dashboards. There's an art to it. You need to think in terms of how your operations team will be using them. They need to see, on a single screen, only panes that show them when a failure happens. Then they need other panes to drill down to all the relevant information so they can triage the failure. On a regular basis, check your dashboards for dead panes and stale data. And make sure you update dashboards every time a major release goes out. Your dashboards can go from great to full of holes in one release if you are not on top of it.
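That dead-pane check can itself be automated. Here is a minimal sketch, assuming you can pull each panel's name and last-datapoint timestamp from your dashboard tool's API (Grafana, Datadog, etc. all expose something like this); the panel names and the 24-hour staleness window are illustrative, not prescriptive.

```python
from datetime import datetime, timedelta, timezone

# Flag any panel whose most recent datapoint is older than this.
STALE_AFTER = timedelta(hours=24)

def find_stale_panels(panels, now=None):
    """panels: list of (name, last_datapoint_ts). Returns names of stale panels."""
    now = now or datetime.now(timezone.utc)
    return [name for name, last_seen in panels if now - last_seen > STALE_AFTER]

# Demo with fake timestamps; in practice these come from your dashboard API.
now = datetime.now(timezone.utc)
panels = [
    ("error-rate", now - timedelta(minutes=5)),        # healthy
    ("legacy-queue-depth", now - timedelta(days=30)),  # dead pane
]
print(find_stale_panels(panels))  # -> ['legacy-queue-depth']
```

Run something like this weekly and you catch the panes that a code change quietly killed, instead of discovering them mid-incident.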
#4: Establish clear and reasonable guardrails that support the process. I've done a lot of development in my day, so I get on well with developers. I also know that developers will naturally want the quickest and most efficient way to deliver. But doing that can miss critical steps in the CI/CD flow - approvals, security scans, test deploys, etc. I always develop clear guardrails and procedures that protect the process without unduly inhibiting development. What I often find is that DevOps procedures aren't one-size-fits-all. They need to be developed in coordination with other stakeholders, like developers, testers, engineering managers, release managers and product managers.
#5: Don't store secrets in pipelines or IaC. Let me repeat that: DO NOT store secrets in pipelines or IaC. I've seen this SO many times. It is hard not to do when you are in a hurry. When I set up a DevOps environment, one of my first steps is to set up a KMS or secrets manager. It's dead easy to use for secrets needed by your pipeline and IaC code. And while you are at it, make sure you also set up a code scanner to find secrets that are accidentally committed by devs. There are plenty of good options out there.
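To make the scanning idea concrete, here is a toy sketch of what a secret scanner does. Real tools (gitleaks, trufflehog, detect-secrets) ship far richer rule sets and entropy checks; the two patterns below are illustrative only.

```python
import re

# Two example rules: AWS access key IDs and hardcoded password assignments.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded-password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]+['\"]"),
}

def scan_text(text):
    """Return a list of (rule_name, matched_text) findings."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings

sample = 'db_password = "hunter2"  # oops'
print(scan_text(sample))  # -> [('hardcoded-password', 'password = "hunter2"')]
```

Wire a real scanner into a pre-commit hook and a pipeline stage, so a leaked key is caught before it ever lands in history.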
#6: Document and monitor anything that can expire: secrets, tokens, service principals, deployment credentials, certs, API keys, etc. Our modern IT systems have more dependencies than ever, so it is way too easy to lose track of something about to expire. I've done it, I'll probably do it again, and I'll bet you've done it. What I do is keep an up-to-date list (reviewed with the dev team monthly), put it in a dashboard and configure alerts to fire 45, 30 and 7 days before expiry. That has saved my ass, especially over the holidays.
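The core of that expiry check is just date arithmetic. A minimal sketch, assuming your list is a set of (name, expiry date) pairs; the item names are made up, and the thresholds are the 45/30/7 windows from above:

```python
from datetime import date, timedelta

# Alert windows: start warning at 45 days out, escalate at 30 and 7.
THRESHOLDS = (45, 30, 7)

def expiry_alerts(items, today):
    """items: list of (name, expiry_date). Returns (name, days_left) for every
    item inside the widest alert window, including already-expired ones."""
    alerts = []
    for name, expires in items:
        days_left = (expires - today).days
        if days_left <= max(THRESHOLDS):
            alerts.append((name, days_left))
    return alerts

today = date(2026, 1, 2)
items = [
    ("prod-tls-cert", today + timedelta(days=20)),     # inside the 30-day window
    ("ci-deploy-token", today + timedelta(days=120)),  # fine for now
    ("old-api-key", today - timedelta(days=1)),        # already expired
]
print(expiry_alerts(items, today))  # -> [('prod-tls-cert', 20), ('old-api-key', -1)]
```

Feed the output into whatever alerting channel your team actually watches; the monthly review keeps the input list honest.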
#7: Don't forget to monitor DevOps disk space. We all monitor disk space on production machines, but it is easy to forget the supporting machines: log collectors, external monitoring or dashboard servers, CI runners, build servers, security scanners, etc. I've experienced more outages from those things than from production machines. Often when we set up DevOps-related servers we forget to apply the same standards for logging, monitoring and alerting that we apply to production.
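If those supporting machines aren't wired into your main monitoring yet, even a cron-driven script is better than nothing. A sketch, assuming an 85% threshold (pick your own) and using the standard library's shutil.disk_usage for real paths:

```python
import shutil

THRESHOLD_PCT = 85  # example threshold; tune to your environment

def disk_alerts(usages, threshold=THRESHOLD_PCT):
    """usages: list of (path, used_bytes, total_bytes).
    Returns the paths at or over the threshold percentage."""
    return [
        path for path, used, total in usages
        if total and (used / total) * 100 >= threshold
    ]

def check_paths(paths):
    """Gather real usage via shutil.disk_usage and report offenders."""
    usages = [(p, shutil.disk_usage(p).used, shutil.disk_usage(p).total)
              for p in paths]
    return disk_alerts(usages)

# Pure-function demo with fake numbers (90% and 40% full):
print(disk_alerts([("/var/lib/ci", 90, 100), ("/var/log", 40, 100)]))
```

In real use, check_paths(["/", "/var/lib/ci"]) runs from cron and its output goes to the same alert channel as production disk alerts.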
#8: Review alert rules every month. It is amazing how little time it takes for an alert rule to go stale, i.e., to stop being relevant because of a code change. The same goes for where alerts are supposed to go. So many times I've configured email addresses for alerts only to find out later that no one was getting them because someone else decided to rename an Exchange alias. For one client, I went through this like 5 times: I'd set up and test with the "right" email address, and then it would be changed by someone I didn't even know. Man, that was annoying, and very bad practice besides.
#9: Document all third-party integrations. I've managed systems that had API keys and accounts for dozens of third-party services. I get why we use such services: the idea is that a third party can do a job better and cheaper than we can do it in house. I don't completely agree with that, but as a DevOps guy I don't get a say in it. So I have to track these integrations, integrate their status and incident notifications into my monitoring, and keep up-to-date contact information and escalation procedures to hand.
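The record I keep per integration is small enough to sketch. Everything below is illustrative — the field names, the example service and the contacts are made up; the point is that each integration carries its owner, its vendor status page and its escalation path in one queryable place rather than in someone's head:

```python
from dataclasses import dataclass

@dataclass
class Integration:
    name: str
    owner_contact: str      # who to page internally
    vendor_status_url: str  # vendor status/incident page to watch
    escalation: str         # vendor support / escalation procedure

# Example registry entry (all values hypothetical).
REGISTRY = [
    Integration("payments-gateway", "payments-oncall@example.com",
                "https://status.example-vendor.com", "Sev1: call vendor TAM"),
]

def find(name):
    """Look up an integration by name, or None if it isn't registered."""
    return next((i for i in REGISTRY if i.name == name), None)

print(find("payments-gateway").owner_contact)  # -> payments-oncall@example.com
```

Whether this lives in Python, YAML or a wiki table matters less than that it exists, is reviewed, and is the single source of truth during an incident.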
#10: Communicate with all stakeholders. I do my best DevOps work when I'm engaged, not just with developers and testers, but also with product managers and senior management. I always make sure I know why we are developing a product and who our customers are. Knowing these things means I can tune how we build and ship to support the business and customers as much as possible. When I get the chance, I will even engage with customers so I can get their feedback on how we are shipping and how the product is working.
If you have things to add based on your own experiences, I'd love to hear about them. Drop me a line at welcome@ondemanddevops.com
Too late? Facing an emergency? I can help, see my Emergency Services Page.
Lajos Moczar - 02/01/2026