The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

"Ok, but... where do I start?"

That's the question we hear over and over again. There are literally hundreds of "best practices" in system administration. Which are the most important? Where do I start?

The Ops Report Card is a list of 32 fundamental "best practices" or "capabilities" that high-performance sysadmin teams have in place. Use it as a checklist to find where your team needs improvement.

You'll find 32 "yes/no" questions to ask about your team. Each is followed by an essay explaining the issue, why it is important, and resources to help get you started.

Do assessments work?

There is no magic here. It is possible to find a high-performing team that skips some of these best practices, but more likely than not it follows them. It is possible to find a low-performing team that incorporates all of these best practices, but more likely it does not. Adopting any one "best practice" may not improve your team; the problems may be deeper. Problems with communication, maturity, skill, and ability to execute can derail any best practice.

People constantly ask me how they can improve their sysadmin team. It takes only a brief discussion to find fundamental gaps that, when filled, will improve the team's productivity and the quality of service being provided.

These practices are fundamental. They are bedrock. Ignoring one creates a domino effect of other problems. These downstream problems multiply. If you are overworked, maybe the solution isn't to work harder, but to fix the problem that is causing the other problems.

Don't spend all your time mopping the floor if you haven't fixed the leak.

How do I use the Ops Report Card?

Answer the questions and count the number of yes's. That is your score. Most questions are self-explanatory with the exception of #2. Read the essay that follows each question to learn about the subject, why it is important, and resources to help you get started.
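The scoring rule above is just a count of "yes" answers, and it can be sketched in a few lines of Python. This is an illustration only, not an official tool: the names `score` and `score_by_section` are made up for this example, and the per-section breakdown is an added convenience that uses the section boundaries from the question list (A is 1-3, B is 4-10, and so on).

```python
# Minimal sketch of the scoring rule: the score is the number of "yes" answers.
# Function names and the per-section breakdown are assumptions for illustration.

TOTAL_QUESTIONS = 32

# Section boundaries taken from the question list at the top of this page.
SECTIONS = {
    "A. Public Facing Practices": range(1, 4),
    "B. Modern Team Practices": range(4, 11),
    "C. Operational Practices": range(11, 16),
    "D. Automation Practices": range(16, 19),
    "E. Fleet Management Processes": range(19, 23),
    "F. Disaster Preparation Practices": range(23, 28),
    "G. Security Practices": range(28, 33),
}


def score(yes_answers):
    """Return the overall score: how many questions were answered yes."""
    yes = set(yes_answers)
    invalid = sorted(q for q in yes if not 1 <= q <= TOTAL_QUESTIONS)
    if invalid:
        raise ValueError(f"unknown question numbers: {invalid}")
    return len(yes)


def score_by_section(yes_answers):
    """Return {section name: (yes count, questions in section)}."""
    yes = set(yes_answers)
    return {
        name: (sum(1 for q in questions if q in yes), len(questions))
        for name, questions in SECTIONS.items()
    }


if __name__ == "__main__":
    # Hypothetical team that answered yes to questions 1, 3, 5, 6, 13 and 25.
    answers = {1, 3, 5, 6, 13, 25}
    print(f"Score: {score(answers)} / {TOTAL_QUESTIONS}")
    for name, (got, total) in score_by_section(answers).items():
        print(f"  {name}: {got} / {total}")
```

A per-section view can hint at where to start reading the essays, but the score itself is only the total count.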

What should I tell my boss?

There is nothing more frustrating than the knowledge that a better way exists. That said, we find it best to have your boss do the assessment himself or herself.

Do not become fanatical about this list. Do not implement a best practice just because it is on this list. Implement it if it will fix a real problem. Measure the effect and evaluate before moving on.

What's next?

Over the next year we plan to improve this website by adding more resources and information. We also plan to make a PDF available for people who prefer to read it in that format. Please check back often!
 
SREcon14