The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

26. Are your disaster recovery plans tested periodically?

The last section was a bit of a lie. There aren't 4 reasons to do backups. There are 4 reasons to do restores.

Nobody cares about backups. People only care about restores. If you can figure out how to do restores without needing to do backups first I will lobby the Nobel committee to create a prize for sysadmins just so that you can be the first to receive it.

You don't know if backups are valid until you test them. Faith-based backup systems are not good. Hope sustains us but it is not an IT "strategy".

A full test involves simulating a total failure and doing a 'full restore'.

You won't know the real amount of time a restore takes until you try it. Restores from tape often take 10x longer than doing the backup. If you can do a full backup of your payroll server in 8 hours, then you have to be prepared to not cut paychecks for 80 hours in the event of a restore from scratch. That's more than 3 days.
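
The arithmetic above can be written down as a tiny helper. This is only a rough sketch: the function names and the default 10x slowdown are illustrative assumptions taken from the rule of thumb in the text, not measured values. The only way to get a real number is to time an actual restore.

```python
def estimated_restore_hours(backup_hours: float, slowdown: float = 10.0) -> float:
    """Rough worst-case restore time, given a measured full-backup time.

    `slowdown` is an assumed multiplier (the 10x rule of thumb for tape);
    replace it with a figure from a real, timed restore when you have one.
    """
    if backup_hours < 0 or slowdown <= 0:
        raise ValueError("backup_hours must be >= 0 and slowdown must be > 0")
    return backup_hours * slowdown


def hours_to_days(hours: float) -> float:
    """Convert elapsed hours to days of wall-clock time."""
    return hours / 24.0


restore = estimated_restore_hours(8)  # payroll server: 8-hour full backup
print(f"{restore:.0f} hours, about {hours_to_days(restore):.1f} days")
# prints "80 hours, about 3.3 days"
```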

If you are doing absolutely no tests then a little testing is better than nothing. Write a small script that randomly picks a server, then randomly picks a disk on that server, then randomly picks a file on that disk. The script should then create a ticket asking for that file to be restored (to a scratch location) as it existed 6 weeks ago. Have the script run automatically every week. This has a good chance of finding a server or disk that wasn't added to the backup schedule. Also, if you think doing these restores will be a lot of work for you, here's a secret: it won't use any of your time if your coworkers end up doing the ticket. Generate the ticket with enough random text that they don't know it is a drill.
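
The random-restore drill described above might look something like the following. Treat it as a minimal sketch under stated assumptions: the `INVENTORY` data, the host and file names, and the `build_ticket` helper are all hypothetical. In practice the inventory would come from your machine database, and the ticket would be filed through whatever interface your ticket system provides (here it is just printed). Disguising the ticket so it does not look like a drill is left out.

```python
import datetime
import random

# Hypothetical inventory: server -> disk -> sample of files on that disk.
# In real use this would be pulled from the machine database, not hard-coded.
INVENTORY = {
    "payroll01": {"/dev/sda1": ["/etc/passwd", "/var/payroll/ledger.db"]},
    "web02": {
        "/dev/sda1": ["/etc/hosts"],
        "/dev/sdb1": ["/srv/www/index.html"],
    },
}


def pick_restore_target(inventory, rng):
    """Pick a random server, then a random disk on it, then a random file."""
    server = rng.choice(sorted(inventory))
    disk = rng.choice(sorted(inventory[server]))
    path = rng.choice(sorted(inventory[server][disk]))
    return server, disk, path


def build_ticket(target, today, weeks_back=6):
    """Build the text of a restore request for the file as of `weeks_back` weeks ago."""
    server, disk, path = target
    as_of = today - datetime.timedelta(weeks=weeks_back)
    return {
        "subject": f"Restore request: {path} on {server}",
        "body": (
            f"Please restore {path} (disk {disk}, host {server}) "
            f"as it existed on {as_of.isoformat()} to a scratch location."
        ),
    }


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so each run picks a different target
    ticket = build_ticket(pick_restore_target(INVENTORY, rng), datetime.date.today())
    # Assumption: replace these prints with a call to your ticket system.
    print(ticket["subject"])
    print(ticket["body"])
```

Scheduling a script like this to run once a week (for example from cron) gives the automatic weekly drill. A request that cannot be fulfilled points at a server or disk missing from the backup schedule.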

To take this one step further, plan a "game day" where the disaster recovery plans are really put to the test. Pretend that certain people are dead and make sure the remaining people know how to fail over services. Write scripts that document what tests will be performed. Either actually cause outages (disconnect the power or network cable) or play-act the scene: the "dead" person can proctor the test. "OK, now let's suppose you got paged with this message. Tell me the commands you type and the actions you take." Another method is to permit your CEO to walk into the data center and unplug any cable of his or her choosing.


 