The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

10. Do you have a "post-mortem" process?

After a failure, do you write up what happened so you can learn from it, or do you just hope nobody notices and that it will all go away?

A good post-mortem (PM) includes a timeline of what happened, who was affected, what was done to fix it, how the business was affected, and a list of proposed solutions to prevent this problem from happening again. Each proposal should be filed as a bug or ticket so it can be tracked to completion.
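To make the structure concrete, here is a minimal sketch in Python of what such a write-up captures and how its action items get filed. The PostMortem fields and the file_ticket() helper are hypothetical stand-ins for your own template and ticket system, not part of any particular tool.

# Minimal sketch of a post-mortem record; every name here is a hypothetical
# stand-in for your own template and ticket system.
from dataclasses import dataclass
from typing import List

@dataclass
class PostMortem:
    title: str
    timeline: List[str]            # chronological "HH:MM - what happened" entries
    who_was_affected: str
    what_was_done_to_fix_it: str
    business_impact: str
    proposed_solutions: List[str]  # each one becomes a tracked bug/ticket

def file_ticket(summary: str) -> str:
    # Placeholder for your real ticket system's API (RT, JIRA, Bugzilla, ...).
    print("FILED:", summary)
    return "TICKET-0000"

def file_remediation_tickets(pm: PostMortem) -> List[str]:
    # One ticket per proposed solution, so each can be tracked to completion.
    return [file_ticket("[PM: %s] %s" % (pm.title, proposal))
            for proposal in pm.proposed_solutions]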

Doing PMs consistently builds a more stable environment. After each outage, come up with at least one preventative measure. Can your monitoring system detect the situation so you know about it before users do? Can you detect precursors to the problem? Often systems have a way to run a battery of tests on new configurations before they are adopted ("pre-submit scripts" in source code repositories, for example). Are there tests you can add that would have caught the typo that created the outage?
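For example, a pre-submit check can be as small as a script that rejects a change whose configuration files no longer parse. This is a minimal sketch that assumes JSON configuration files in a hypothetical configs/ directory; swap in whatever parser and hook mechanism your repository actually uses.

# Minimal pre-submit sketch: fail the submission if any config file has a
# syntax error. Assumes JSON files in a hypothetical configs/ directory.
import json
import pathlib
import sys

def check_configs(config_dir="configs"):
    # Returns the number of files that fail to parse; 0 means safe to submit.
    failures = 0
    for path in pathlib.Path(config_dir).glob("*.json"):
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError as err:
            print("%s: syntax error: %s" % (path, err))
            failures += 1
    return failures

if __name__ == "__main__":
    # Wire this into the repository's pre-submit/pre-commit hook so the typo
    # that would have caused the outage is rejected before it reaches production.
    sys.exit(1 if check_configs() else 0)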

A post-mortem is not about blaming and shaming. In a good sysadmin culture you are comfortable with putting your name in the "what went wrong" section. You are taking a leadership role by educating people so they don't make the same mistake.

If your management uses PMs to find whom to punish, they don't understand that operations isn't about doing things perfectly; it is about doing things better and better every day. Any manager who fires a person over a non-malicious outage is going to run their company into the ground.

The PM should be published for all to see. You may be embarrassed and concerned that you are "airing your team's dirty laundry," but the truth is that if you do this consistently, your users will respect you more. Transparency breeds trust.

Of course, to really develop that confidence, all those bugs and tickets filed as a result need to actually get worked on.
