The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

3. Does the team record monthly metrics?

You need to be data-driven when you make decisions or sway upper levels of management.

The best way to develop your metrics is to create one metric per sentence or clause of your charter.

Example: The charter for a PC deployment team might be: To provide a high-end, standardized, computer to each employee starting their first day, refreshed on a 3-year cycle, at industry leading operational cost. The metrics might be: Number of weeks since the standard configuration was updated, cost of the current standard configuration, number of new employees this month, capital expenditures, operational expenditures. How many days new employees waited for their new machine ("buckets" for prior to arrival, first day, second day, 3rd, 4th 5th and more than 5th). Age of fleet ("buckets" for <1 year old, 1-2 years old, 2-3 years old, older than 3 years). Count of deployed machines that deviated from the standard.

If you don't have a charter, talk with your manager about writing one. Alternatively here's some simple "starter metrics" you can adopt today:

  • How many sysadmins do we have?
  • How many users do we provide service to?
  • How many machines do we manage?
  • How much total disk space? RAM? CPU cores?
  • How many "open" tickets are in our ticket system right now?
  • How many new tickets were created since last month?
  • Who (or what department) opened the most tickets this month?
  • What was the tickets per sysadmin average last month?
  • Pick 4-5 important SLAs and record how close you were to meeting them.
  • How much Internet bandwidth was consumed last month?

Record these on the first of every month. Put them in a spreadsheet. You'll be able to use this at budget time or during presentations when you want to explain what your group does.

That's it. Really. Recording those once a month is the difference between starting a presentation with a simple graph showing the rate growth in the number of machines on your network versus saying, "Hi, I'm Joe and we manage... umm... a lot of machines." At budget time being able to answer these basic questions is the foundation for other questions such as "How much does the average ticket cost?" (last year's budget total / last year's total ticket quantity). If we had 100 users, how much new disk will we need? (total disk space / number of users * 100).

Ultimately collecting these metrics should be automated. Until then, generate email to yourself on the first of each month with a reminder to do it manually.

For More Information

See below links for more information on this topic:

  • P2 - Chapter 22: Service Monitoring
  • P2 - Chapter 5: Services / 5.1.13 Monitoring
  • P2 - Chapter 31: Perception and Visibility / The System Status Web Page
  • P2 - Chapter 33: A Guide for Technical Managers / Sell Your Department to Senior Management
 
Community Spotlight
LISA15