The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

22. Do you have a PC refresh policy?

If you don't have a policy about when PCs will be replaced, they'll never be replaced.

[By "PC" I mean the laptops and desktops that people use, not the servers.]

In the server room there is usually more thought about when each device gets replaced. Your PC environment needs the same kind of repeatable, cyclic process so that it stays fresh. Without one, machines get old and unsupportable, or upgrades become a status symbol and the whole thing turns political. With a good policy, the fleet stays current and costs less to run.

A certain fraction of your fleet should be old; that's just economical. However, extremely old machines are more expensive to maintain than to replace. It is a waste of your time to produce a work-around so that new software works on underpowered machines. It is a waste of your users' time to wait for a slow computer. Keeping seriously old machines is bad time management and bad for productivity.

Companies often get into this situation. Sometimes they "save money" by not upgrading machines, but it doesn't save money to give employees tools that don't work well. Sometimes they just don't realize that computers don't last forever.

If you don't have a policy, here's a simple one you can start with: All computers are on a 3-year depreciation schedule. Every year the budget will include funds to replace one third of all machines. On the first day of each quarter, enough machines will be ordered to replace the oldest 9 percent of the fleet (one twelfth of the fleet, rounded up).
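To make the quarterly step concrete, here is a minimal sketch of how the replacement batch could be picked from an inventory. It assumes each machine record carries a purchase date; the `Machine` type, its field names, and the `quarterly_replacements` function are hypothetical illustrations, not part of any particular inventory tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Machine:
    # Hypothetical inventory record; adapt the fields to your own database.
    hostname: str
    purchased: date


def quarterly_replacements(fleet, percent=9):
    """Return the oldest `percent` of the fleet, rounded up to whole machines."""
    if not fleet:
        return []
    # Integer ceiling division avoids floating-point rounding surprises.
    count = (len(fleet) * percent + 99) // 100
    oldest_first = sorted(fleet, key=lambda m: m.purchased)
    return oldest_first[:count]


# Made-up example fleet: 100 machines bought roughly one month apart.
fleet = [
    Machine(f"pc{i:03d}", date(2012, 1, 1) + timedelta(days=30 * i))
    for i in range(100)
]
batch = quarterly_replacements(fleet)
print(len(batch), "machines to order this quarter, oldest:", batch[0].hostname)
```

Run on the first day of each quarter, a list like this could become the purchase order. Rounding up means four quarterly batches cover slightly more than a third of the fleet per year, which keeps the 3-year cycle from slipping.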

CFOs like this because they like predictability. At one company the CFO was quite excited when I gave her control over which months the upgrades would happen. We agreed that one quarter of the year's upgrades would happen each quarter, and she could pick which month that happened. She could even split it into individual monthly batches.

Instead of coming to the CFO to beg for new desktops now and then, it was a regular, scheduled activity. Less pain for everyone.

ProTip: At some companies servers are on a different depreciation schedule. They are designed to last longer, which argues for a 4-year schedule. On the other hand, a server's cost is amortized over all of its users, so you can also justify a 2-year schedule.
