The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?

18. Do automated processes that generate e-mail only do so when they have something to say?

Do you know the story of "The Boy Who Cried Wolf"? What about: "The cron job that everyone ignored because it blasted everyone twice a day, and nobody noticed when it started to report problems."

My rule is simple:

  • If it needs human action now: Send a page/SMS.
  • If it needs action in 24 hours: Create a ticket.
  • If it is informational: Log to a file.
  • Output nothing if there is no information.

While sending a page and creating a ticket might be done via email, the point is that email is not a good mechanism for this. Either may CC: you, but email is not the primary mechanism.

The worst case situation is a system that sends log messages to everyone on the sysadmin team via email. Your email system is not a good log archive.

True story: A friend in NYC worked at a site where all automated processes sent their output as email to root@the-company-domain. "root" was a mailing list that went to all the sysadmins in the company. It was a constant flood of messages. Sysadmins at this company literally could not read email. No amount of filtering would be enough. As a result, the sysadmins for this company used their personal email accounts for all communication, even stuff that was work-related. What company was this? A major email provider that is no longer in business. (I wonder what other bad decisions helped put them out of business?)

Community Spotlight