The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

2. Are "the 3 empowering policies" defined and published?

There are three public-facing policies you must have if a sysadmin team is going to get any work done. This is as much about serving customers as it is about enabling team efficiency.

If you are a manager who feels your team has bad time-management skills, maybe it is your fault for not having or not enforcing these policies:

  • The acceptable methods for users to request help.
  • The definition of "an emergency".
  • The scope of service: Who, what and where.

One document can explain all three things in less than a page. It should be posted on the department's website or on posters on the wall so that it is clearly communicated. The policy must also be backed up by management: managers must be willing to tell a user "no" when the user asks for an exception. The exception process should not be a speed bump; it should be a solid wall.

How do users get help?

An official protocol for how users request help enables all the benefits of the ticket system mentioned in the previous section. Without it, those benefits evaporate: users go directly to the sysadmins, who, trying to be helpful, become interrupt-driven and ineffective.

A sysadmin must be able to tell users to go away when they are not following the protocol. Without a policy to point to, sysadmins will work on low-priority, squeaky-wheel tasks all day long; or each sysadmin will apply a different policy, making the team look inconsistent; or sysadmins will communicate their frustration in unhealthy ways. Specifically, ways that are unhealthy for the user.

What is an emergency?

An official definition of an emergency enables a sysadmin to set priorities. Without it, everything becomes an emergency and sysadmins become interrupt-driven and ineffective.

The policy is one way management communicates priorities to sysadmins. Without it, sysadmins will guess, be wrong, and be unfairly punished for their incorrect guesses; managers will be confounded by the "disconnect"; and users will see inconsistencies and assume favoritism, neglect, and incompetence.

This policy also sets users' expectations. Those who call everything an emergency can be disabused of that illusion.

Every organization should have a definition of an emergency or a "code red". A newspaper's code red is anything preventing tomorrow's edition from being printed and loaded onto the 4am trucks. A factory's code red is anything stalling the assembly line. A payment service's code red is anything that is stopping the payment pipeline. Educational technology teams know that a class can't simply be rescheduled; therefore, an emergency is anything preventing the proper delivery of a lesson (possibly only if the technology center was warned ahead of time). A university might define a code red as anything preventing grant proposals from being submitted in time.

A "code yellow" is anything that, if left unattended, would lead to a "code red". For example, the payment pipeline might be functioning but the capacity forecasting sub-system is down. It is risky to take on new customers without being able to properly forecast capacity. The last estimate indicated about 2 weeks of spare capacity. Risk of a melt-down increases daily until the code yellow is resolved.

Anything else is "routine". Fancy sites may divide routine requests into high, medium, and low priorities, or into categories such as new service creation, provisioning of existing services, and so on. But if you have none of that, start with defining what constitutes an emergency.
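The three levels above can be written down as a simple decision rule. The following is only a minimal sketch, assuming a hypothetical Ticket record with two yes/no fields; the class, field, and function names are illustrative and do not come from any particular ticket system. Each organization would substitute its own definition of what "blocks the core function".

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CODE_RED = "code red"
    CODE_YELLOW = "code yellow"
    ROUTINE = "routine"


@dataclass
class Ticket:
    # Hypothetical fields, filled in by whoever triages the request.
    summary: str
    blocks_core_function: bool = False   # e.g. the payment pipeline is stopped
    will_block_if_ignored: bool = False  # e.g. capacity forecasting is down


def classify(ticket: Ticket) -> Severity:
    """Apply the code red / code yellow / routine definitions in order."""
    if ticket.blocks_core_function:
        return Severity.CODE_RED
    if ticket.will_block_if_ignored:
        return Severity.CODE_YELLOW
    return Severity.ROUTINE


if __name__ == "__main__":
    forecasting_down = Ticket("capacity forecasting sub-system is down",
                              will_block_if_ignored=True)
    print(classify(forecasting_down).value)
```

The order of the checks matters: a problem that is already stopping the core function is a code red even if it would also get worse when ignored.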

What is supported?

An official definition of what is supported enables sysadmins to say "no". It should define when, where, who, and what is supported. Do you provide support after 5pm? On weekends? Do you provide desk-side visits? Home visits? Do you support anyone off the street or just people in your division? What software and hardware are supported? Is there a support life cycle, or once something is supported, are you fated to support it forever? Are new technologies supported automatically or only after an official request and an official positive reply?
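One way to make the answers to these questions unambiguous is to record them as data that can be checked against a request. The sketch below is an assumption-laden illustration, not a recommended policy: the SupportScope class, its fields, the 9-to-5 hours, and the example values are all hypothetical placeholders for whatever your own scope-of-service document says.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SupportScope:
    # All values are hypothetical; fill them in from your own policy.
    divisions: frozenset       # who is supported
    items: frozenset           # what hardware and software is supported
    locations: frozenset       # where support is provided
    open_hour: int = 9         # when: first supported hour (local time)
    close_hour: int = 17       # when: no support after 5pm in this sketch
    weekends: bool = False     # when: weekend support on or off

    def reasons_to_decline(self, division, item, location, when):
        """Return the list of policy reasons for saying "no" (empty if in scope)."""
        reasons = []
        if division not in self.divisions:
            reasons.append(f"division '{division}' is out of scope")
        if item not in self.items:
            reasons.append(f"'{item}' is not on the supported list")
        if location not in self.locations:
            reasons.append(f"no support at location '{location}'")
        if when.weekday() >= 5 and not self.weekends:
            reasons.append("no weekend support")
        elif not (self.open_hour <= when.hour < self.close_hour):
            reasons.append("outside support hours")
        return reasons


if __name__ == "__main__":
    scope = SupportScope(
        divisions=frozenset({"engineering"}),
        items=frozenset({"standard laptop"}),
        locations=frozenset({"office"}),
    )
    # A Saturday-evening home visit for unsupported hardware.
    for reason in scope.reasons_to_decline(
            "engineering", "gaming video card", "home",
            datetime(2024, 1, 6, 20, 0)):
        print(reason)
```

Returning the reasons, rather than a bare yes or no, gives the sysadmin something concrete from the policy to point to when declining a request.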

Without the ability to say "no", sysadmins will support everything. An eager, helpful sysadmin will spend countless hours trying to get an unsupportable video card to work when it would have been cheaper to give the user a supported card out of your own budget. A sysadmin, assumed lost or dead, will magically reappear having spent the day at a user's house fixing their Internet connection. Alternatively, a curmudgeonly sysadmin will tell people something isn't supported just because the sysadmin is busy.

 