The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

4. Do you have a "policy and procedure" wiki?

Your team needs a wiki. On it you can document all your policies (what should be done) and procedures (how it is done).

Automation is great but before you can automate something you must be able to do it manually. Documenting the manual process for something is a precondition to automation. In the meantime it enables consistent operations across a team and it gains you the ability to delegate. If it is documented, someone other than you can do it.

The table of contents for this wiki should include common, routine tasks. A good place to start is the add/change/delete procedures that anyone on the team should be able to do, and the tasks you dislike doing and would delegate to an assistant if you had one.

Procedure list:

  • When a new employee starts.
  • When an employee leaves the company.
  • When an employee is terminated.
  • When a new machine is installed.
  • When a machine is decommissioned.
  • How to add/delete a person to the VPN service.
  • How to change a disk in the RAID system.
  • How to change the root password on all machines.

There are three categories here: things that you want to be consistent; things that you do infrequently and don't want to spend time re-remembering; and things you do when panicking, when you don't want to have to think on your feet.

These things can all be documented with simple "step-by-step" checklists.
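
As a rough illustration of how little structure such a checklist needs, here is a minimal Python sketch that stores one procedure and renders it as numbered, tickable steps ready to paste into a wiki page. The `Procedure` class and the three steps shown are hypothetical, made up for this example; a real page would list your own site's steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Procedure:
    """A step-by-step checklist for one routine task (hypothetical structure)."""
    title: str
    steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the checklist as plain text with numbered, tickable steps."""
        lines = [self.title, ""]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"  {number}. [ ] {step}")
        return "\n".join(lines)


# Illustrative steps only; replace them with the steps your site actually uses.
new_machine = Procedure(
    title="When a new machine is installed",
    steps=[
        "Record the machine in the inventory",
        "Install the operating system",
        "Add the machine to monitoring",
    ],
)

print(new_machine.render())
```

Keeping each step to a single short line makes it easy for anyone on the team to correct or extend the list later.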

Once documented, anyone on the team can do them. It also creates your training program for new employees. It can also be used to write the job description of that assistant you want the company to hire for you to do all your work.

Even if you aren't on a team, or there are tasks that only you do, documenting has benefits. You have to think less when you do the task. Just like the adage that "we automate because we are lazy", it is also true that "we document because we are impatient".

Many sysadmins dislike writing documentation, but writing a "step-by-step" checklist isn't that bad. Keeping it on a wiki is important: anyone can correct it and anyone can improve it.

For any task you might want to have separate "policy" and "procedure" documents. Policy is what management defines: "All new users will receive a wireless mouse." Procedure is how the task gets done: "The wireless mice are stored in the 3rd bin; charge one and test it with the following steps," etc. Policies are changed only with management approval. Procedures are changed by the technicians, with change notifications sent to the author or other authority.
