Characteristics of an Operations Interface

The original email message about this told a long story of debugging a software subsystem running in production, only to conclude that the subsystem was operating perfectly but that the operations staff (UNIX sysadmins) had no easy way to establish this. Doing so entailed looking in log files and hand-decoding highly non-obvious information with the help of an external document and comments in the source code. From an operational standpoint, it was unacceptable that operations staff be required to learn that much about such systems to be able to support them.

Out of that we evolved the concept of an operations interface document, having the following general characteristics. Some of them sound very complicated, but simple systems can have very simple interfaces. There's a sample document for Sendmail.

1. Uniform Terminology
Systems should be designed and documented, as much as possible, to conform to a uniform set of terms that describe features of the operations interface. Synonyms should be avoided. For instance, "fault" specifically describes a condition of the system that is not as it should be; it should not be called "error" or "exception" or another synonym, as these may have other defined meanings. Above all, all systems in the same operational administration should use the same term for a given thing.

2. No Source
The interface must not require operations staff to have access to, or to read or understand, the source files for the product itself, including shell scripts or other interpreted code. The product should be a black box to the extent possible.

3. No Mods
Ops staff should not be required to make any permanent change to any system. If it used to work, and it breaks, they'll put it back. Anything beyond that is engineering or a software release, and must be done through the proper release procedures.

4. Start/Stop
Ops must be able to stop, start, and restart the product exclusively by executing scripts in /etc/init.d with only the "stop", "start", and "restart" arguments, per System V standards. Every piece of the product which can foreseeably need to be restarted separately should have a separate script. For processes that fork permanent children, and try (and sometimes fail) to reap them when the parent is stopped, the stop script must find and kill them. It is also necessary, but not sufficient, to document the identifying names of all relevant children, in case they must be identified and killed manually.

5. Fault Status
Ops staff must be able, by a written procedure, to determine whether the product has detected any kind of internal fault. Steps to clear it, or to escalate based on the type of fault, should be documented.

6. Self-Test
Ops staff must be able, by a written procedure, to trigger a self-test of the product, and see the result. Interpretation of the output should be as obvious as possible, at least as far as go/no-go; further interpretation can be directed by written procedures. This should be a "local", internal unit test, depending as little as possible on outside influences.

7. Verify/Request
Ops staff must be able, by a written procedure, to trigger an end-to-end system test, by making a real request or otherwise proving that the entire system correctly performs its primary function. This will typically happen at the outermost interface of the system or subsystem.

8. Error Messages
Errors and faults must produce meaningful and helpful messages, written to a standardized log (e.g. syslog). They must indicate (1) the time, in GMT if possible; (2) the identity of the process or subsystem making the report, with PID and argv[0] command name; (3) the general nature of the operation that was happening when the error occurred, e.g. "opening connection to subsystem Z"; (4) the system error code as returned by UNIX, if any, via perror(); (5) the specific arguments to a system call or other interface, e.g. full pathnames of files; and most important, (6) a recommendation or clue about what to _do_ about the condition. Messages can occupy more than one line or log file entry, but should be concise yet complete. They may refer to external documentation for more detailed recommendations about what action to take, but must capture all necessary details. In short, error messages are supposed to hand you _solutions_, not problems.

9. Trace/Debug
Ops staff must be able to enable trace output or debug output for a given operation, or for all system operations, even if they are not completely trained in how to interpret it. Ops staff should be trained to capture the relevant output, and take the _first steps_ to interpret it, before calling the engineers. The principle is that ops staff have brains, and can use them beyond the edge of the written procedures, given _some_ information.

10. Crash Dump
Ops staff must be able to cause the system or subsystem to write a crash dump for later analysis by the engineers. Written procedures must say how to cause it, where the dump is written, how to preserve it for analysis, and how to take the _first steps_ to analyze it (see Trace/Debug).

11. Command Line Interface
The system must have a command-line interface; if that interface is anything more elaborate than the Bourne shell, it must be documented. It must be usable from a single shell; no graphics. If a system-specific interface exists, it should be designed to resemble closely something with which the ops staff are already familiar, for instance the shell or a Cisco router. It need not duplicate something else; it obviously can't. But it shouldn't be utterly unlike anything that's ever been seen before.

12. Safety
Any system-specific interface, if used for engineering debugging as well as operations, must have a "safety catch" of some kind, so that ops staff do not unintentionally enter commands that only engineers should. For instance, Cisco routers have "enable". It may or may not be protected by a password or other authentication, provided the system as a whole is adequately protected from unauthorized use.

13. Documentation/Training
Any system put into production must be accompanied by written operations procedures, showing how all the above requirements are met; and by an overall system document or diagram showing how the parts interrelate _from a UNIX perspective_. That is, what processes are created and why; what files and network connections they open; and the general flow of information. Product groups must prepare training materials, and do the initial training of ops staff, for any new product. Product groups must keep the training materials correct and complete with each product change released to production.
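
To make item 8 (Error Messages) a little more concrete, here is a minimal Python sketch of a message that carries all six required fields. The function name format_fault, its parameters, and the layout of the line are illustrative assumptions of ours, not part of any standard or of the original discussion; a real product would choose its own layout and document it.

```python
import errno
import os
import sys
import time


def format_fault(operation, err, call_args, advice, now=None, pid=None, prog=None):
    """Return one log line with the six fields from item 8 (hypothetical layout)."""
    # (1) Time in GMT; `now` is seconds since the epoch, defaulting to the current time.
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    # (2) Identity of the reporter: argv[0] command name and PID.
    prog = os.path.basename(sys.argv[0]) if prog is None else prog
    pid = os.getpid() if pid is None else pid
    # (4) System error code, if any, with the message text perror() would print.
    if err is None:
        code = "no system error code"
    else:
        code = f"{errno.errorcode.get(err, 'unknown')}({err}): {os.strerror(err)}"
    # (5) Specific arguments to the failing call, e.g. full pathnames.
    details = " ".join(f"{key}={value!r}" for key, value in call_args.items())
    # (3) names the operation in progress; (6) says what to do about it.
    return f"{stamp} {prog}[{pid}]: while {operation}: {code}; {details}; action: {advice}"
```

A caller might pass the result to the standardized log, for example syslog.syslog(syslog.LOG_ERR, line) using Python's standard syslog module. Syslog daemons usually add their own timestamp and process tag, so fields (1) and (2) may turn out to be redundant there; the sketch keeps them so the line is still complete if it ends up in a plain file.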