Experience Extravaganza! (Article)

Description :: Information Technology (mostly Software) "Best Practices"

Batch jobs

Batch jobs are generally headless, scheduled programs designed to process large quantities of data in a single execution.

Batch jobs are not interactive; if an ambiguity comes up, nobody will be around to answer your program's question. You must be even more careful than usual in gathering requirements and researching possible failures or data oddities, as your program is essentially "on its own" during its run. A few Unix systems allow a batch job to send requests for data (prompts) to a common terminal, where an operator is expected to notice the message and reply with an answer, but this facility is generally not available.

Scheduled jobs are generally expected to notify someone on failure, but what about success? If you send success messages regularly, the people entrusted with watching the messages will become complacent; they will treat all notifications as spam, hardly noticing the failures and certainly never noticing the gaps. Do not rely on them to notice that a success message did not go out: treat a failure to execute as a failure, and notify support staff only of failures. This leads to the "who's watching the watcher" issue, which cannot be fully resolved. A single monitoring system that alerts you to failures and to missing successes is itself a single point of failure, but it is also a single point of monitoring: it is relatively easy to verify that one watchdog is running, and relatively difficult to make sure every job is properly set up to notify you when it does not execute.
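
As an illustration of alerting on a lack of success, here is a minimal sketch of the check a watchdog could run. The job names, the schedules, and the idea of keeping last-success times in a dictionary are all assumptions made for the example; a real watchdog would read them from wherever your jobs record completion.

```python
from datetime import datetime, timedelta

def find_overdue_jobs(last_success, max_age, now):
    """Return names of jobs whose last success is missing or too old.

    last_success: dict of job name -> datetime of the last successful run
    max_age:      dict of job name -> timedelta a job may go without a success
    """
    overdue = []
    for job, limit in max_age.items():
        finished = last_success.get(job)
        if finished is None or now - finished > limit:
            overdue.append(job)
    return sorted(overdue)

# Hypothetical jobs and schedules, for illustration only.
now = datetime(2024, 1, 2, 6, 0)
last_success = {
    "nightly_import": datetime(2024, 1, 2, 1, 30),  # ran last night
    "weekly_report": datetime(2023, 12, 20, 3, 0),  # silently stopped running
}
max_age = {
    "nightly_import": timedelta(hours=26),
    "weekly_report": timedelta(days=8),
    "hourly_sync": timedelta(hours=2),              # never reported a success
}
overdue = find_overdue_jobs(last_success, max_age, now)
```

Only the jobs in `overdue` would generate a notification; jobs that succeeded on time stay silent.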

Logging is essential, as it's likely no one will notice a failure until much later, when conditions have already changed, tainting the crime scene; if you can capture information relevant for debugging at the point of failure, do so.

Long-running jobs can be a scheduling nightmare; if their runtime varies widely, you should set up your jobs to depend on each other's outcome (end of execution, return code, output text, etc.) rather than defining rigid scheduling rules based on expected or worst-case runtimes.
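
A minimal sketch of outcome-based chaining, assuming the jobs are separate commands and that a non-zero return code means failure. The three stand-in "jobs" below are hypothetical; each is just a tiny Python process that exits with a chosen code.

```python
import subprocess
import sys

def run_chain(commands):
    """Run commands in order; stop at the first non-zero return code.

    Returns a list of (command, returncode) for the commands that were run.
    """
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd)
        results.append((cmd, completed.returncode))
        if completed.returncode != 0:
            break  # later jobs depend on this one, so do not start them
    return results

# Stand-in jobs (hypothetical): extract succeeds, transform fails with code 3.
extract = [sys.executable, "-c", "raise SystemExit(0)"]
transform = [sys.executable, "-c", "raise SystemExit(3)"]
load = [sys.executable, "-c", "raise SystemExit(0)"]

results = run_chain([extract, transform, load])
return_codes = [code for _, code in results]
```

Each job starts as soon as its predecessor ends, however long that took, and `load` is never started because `transform` failed.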

System administrators may wish to kill a process that has been running for an unexpectedly long time. You should ensure that they know what the true expected runtimes are and that your process can survive such a kill gracefully, and you should provide some means of progress monitoring to avoid such cases; if you cannot provide a percentage of completion, at least provide proof that your job is actively doing something and is not hung. An example would be the output of a single '.' character for each item processed, with no indication of the total number of items to be processed. If you're feeling particularly gracious, provide proof that it's not processing the same item over and over again, e.g. by providing the current item's "id". Consider setting up automation around your worst-case runtimes, such as an automated kill if the process executes for far longer than expected, with appropriate logging and notification. If the process cannot die gracefully, or is too difficult to rerun on a regular basis, at least provide automated notification to a system administrator that immediate investigation is recommended.
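
The progress and graceful-kill ideas can be sketched together as follows. This assumes the kill arrives as SIGTERM (SIGKILL cannot be caught) and that stopping between items is safe; `StopFlag`, `install_sigterm_handler`, and `process_items` are names invented for the example. The demo simulates the signal instead of sending a real one.

```python
import signal

class StopFlag:
    """Set by the SIGTERM handler so the loop can stop between items."""
    def __init__(self):
        self.requested = False

    def handle_signal(self, signum, frame):
        self.requested = True

def install_sigterm_handler(stop):
    # Call once at startup, from the main thread.
    signal.signal(signal.SIGTERM, stop.handle_signal)

def process_items(item_ids, handle, stop, report=print):
    """Process items in order, reporting each id; stop cleanly if asked."""
    done = []
    for item_id in item_ids:
        if stop.requested:
            report(f"stop requested after {len(done)} items; "
                   f"next item would be {item_id}")
            break
        handle(item_id)
        done.append(item_id)
        report(f"processed item {item_id}")  # shows the ids are advancing
    return done

# Demo: simulate a kill request arriving while item 2 is being processed.
stop = StopFlag()
progress = []

def handle(item_id):
    if item_id == 2:
        stop.requested = True  # stands in for SIGTERM being delivered

done = process_items([1, 2, 3, 4], handle, stop, report=progress.append)
```

In a real job you would call `install_sigterm_handler(stop)` before the loop; the item in flight finishes, and the final report line tells whoever reruns the job where it stopped.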

Scheduling is a tricky business; 'cron', for example, is not smart about preventing a job from being executed multiple times concurrently, if its runtime exceeds its execution interval. Some other schedulers may provide this once-at-a-time functionality for you, but consider giving your jobs the ability to block concurrent execution, e.g. by taking out an exclusive lock on a shared resource for the duration of the job. Make sure such locks are properly released upon any exit condition, not just success; ensure that a job attempting to take out a lock will fail immediately rather than hang waiting for the resource, which could result in a messy pile-up.
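
A minimal sketch of such a lock, assuming a Unix system (Python's `fcntl` module is not available on Windows) and a lock file path of your choosing. `single_instance` and `AlreadyRunning` are names invented for the example. `LOCK_NB` makes the attempt fail immediately instead of waiting.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

class AlreadyRunning(Exception):
    """Raised when another run of the job already holds the lock."""

@contextmanager
def single_instance(lock_path):
    """Hold an exclusive lock on lock_path; fail at once if it is taken."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(lock_path)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock

# Demo: a second attempt while the lock is held is refused immediately.
lock_path = os.path.join(tempfile.mkdtemp(), "job.lock")
with single_instance(lock_path):
    try:
        with single_instance(lock_path):
            second_attempt = "acquired"
    except AlreadyRunning:
        second_attempt = "refused"

# After the first holder exits, the lock can be taken again.
with single_instance(lock_path):
    reacquired = True
```

Because the lock belongs to an open file descriptor, the operating system drops it when the process ends for any reason, including a crash or a kill, so no stale lock is left behind.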

Large-input jobs must decide if a failure during the process will fail the entire input or only part of the input. If you fail the entire input, consider whether you should skip those items on later execution (mark them as bad, to be ignored). If you fail only parts of your input, make sure a detailed error report is generated, based on which further attempts at processing may be made. The sender should be able to determine how to fix input data and retry, and it should be clear what action was (or was not) taken in response to the bad data.
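
For the partial-failure case, a minimal sketch of collecting a per-item error report instead of aborting. The record format and the `parse_amount` step are hypothetical; the point is that each report entry carries the position, the offending input, the reason, and the action taken.

```python
def process_batch(records, handle):
    """Apply handle to each record; collect failures instead of aborting."""
    succeeded, errors = [], []
    for line_no, record in enumerate(records, start=1):
        try:
            handle(record)
            succeeded.append(line_no)
        except ValueError as exc:
            errors.append({
                "line": line_no,
                "input": record,
                "reason": str(exc),
                "action": "skipped, nothing written",
            })
    return succeeded, errors

def parse_amount(record):
    # Hypothetical per-record step: the amount must be a whole number.
    return int(record)

succeeded, errors = process_batch(["10", "abc", "30"], parse_amount)
```

The `errors` list is what you would send back to the sender: it names line 2, shows the input `abc`, and states that nothing was done with it.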

Large-output jobs may provide the ability to view partial output. If you can, provide a means for restarting the process after the last savepoint so the entire set need not be regenerated.
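
A minimal sketch of restarting after the last savepoint, assuming the output is an append-only text file and the savepoint is a small JSON file holding the index of the next item. All names and the `fail_at` parameter (used only to simulate a crash) are invented for the example.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the index of the next item to write (0 if no savepoint yet)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]

def save_checkpoint(path, next_index):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)  # replace in one step so the file is never half-written

def generate(items, out_path, ckpt_path, fail_at=None):
    start = load_checkpoint(ckpt_path)
    with open(out_path, "a") as out:
        for i in range(start, len(items)):
            if i == fail_at:
                raise RuntimeError(f"simulated crash at item {i}")
            out.write(f"{items[i]}\n")
            out.flush()
            # Caveat: a crash between the write above and this savepoint
            # would repeat one item on restart.
            save_checkpoint(ckpt_path, i + 1)

workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "output.txt")
ckpt_path = os.path.join(workdir, "checkpoint.json")
items = ["a", "b", "c", "d"]

try:
    generate(items, out_path, ckpt_path, fail_at=2)  # first run dies midway
except RuntimeError:
    pass
generate(items, out_path, ckpt_path)                 # rerun resumes at item 2

with open(out_path) as f:
    lines = f.read().split()
```

The partial output is readable at any time, and the rerun writes only the items after the savepoint.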

Batch jobs often work off of a list of actions to be taken; in many cases, order matters. A failure may not simply cancel the current operation; it may also need to prevent later files from being processed until the problem is resolved, to avoid dependency issues between related tasks.
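
A minimal sketch of order-sensitive processing, where the first failure blocks everything after it. The file names and the `perform` step are hypothetical.

```python
def apply_in_order(actions, perform):
    """Run actions in order; after the first failure, block the rest."""
    applied, blocked = [], []
    failed = None
    for name in actions:
        if failed is not None:
            blocked.append(name)  # held until the earlier problem is resolved
            continue
        try:
            perform(name)
            applied.append(name)
        except Exception as exc:
            failed = (name, str(exc))
    return applied, failed, blocked

def perform(name):
    # Hypothetical step: the payments file is malformed.
    if name == "post_payments.csv":
        raise ValueError("unexpected column count")

applied, failed, blocked = apply_in_order(
    ["create_accounts.csv", "post_payments.csv", "close_accounts.csv"],
    perform,
)
```

Here `close_accounts.csv` is not touched, because closing accounts before their payments are posted could leave the data inconsistent.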

Testing in production is not uncommon; testing a batch job can be difficult as a result of the long runtimes, so consider providing an option for a "short run", e.g. 10 records, just enough to verify that the process basically works without performing a full run. This relates to the issue of being able to restart a job mid-process.
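
A minimal sketch of a "short run" option, assuming a command-line job. The `--limit` flag name is an assumption made for the example, not a convention of any particular scheduler.

```python
import argparse
from itertools import islice

def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical batch job")
    parser.add_argument(
        "--limit", type=int, default=None,
        help="process at most N records (short run); default is all records",
    )
    return parser

def select_records(records, limit):
    # islice with a stop of None yields every record.
    return islice(records, limit)

all_records = range(1000)  # stand-in for the real input

short_args = build_parser().parse_args(["--limit", "10"])
short_run = list(select_records(all_records, short_args.limit))

full_args = build_parser().parse_args([])
full_run = list(select_records(all_records, full_args.limit))
```

The short run exercises the same code path as the full run, only on fewer records.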

Logging

Logging is the process of keeping logs of everything attempted, done, or failed, and the context within which this took place.

Do not log only the error; log the request and the context. It's not only a matter of accelerating the investigation; you may not be able to determine what the system settings were at the time an error occurred, preventing you from reproducing the problem. Log what was requested of you (you can think of this in terms of function parameters, but pay particular attention to the outermost function, essentially what the user 'called') and the context (globals, if you will -- things like database connection strings, current date and time, environment variables, software version, etc.). Also log your current progress and your reaction to failure: did you roll back, or did you partially complete the request? If the latter, what did you get done, and what did you not?
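
A minimal sketch of logging the request, the context, the progress, and the reaction to failure, using Python's standard `logging` module. The version string, the chosen environment variables, and the `run_job` and `build_context` helpers are assumptions made for the example; the `ListHandler` exists only so the demo can inspect what was logged.

```python
import json
import logging
import os
import platform
from datetime import datetime, timezone

JOB_VERSION = "1.4.2"  # hypothetical version string

def build_context(argv, env_keys):
    """Gather the request and the surrounding context in one record."""
    return {
        "request": list(argv),
        "time_utc": datetime.now(timezone.utc).isoformat(),
        "version": JOB_VERSION,
        "python": platform.python_version(),
        "host": platform.node(),
        "env": {key: os.environ.get(key) for key in env_keys},
    }

def run_job(argv, items, handle, log):
    context = json.dumps(build_context(argv, ["TZ", "LANG"]))
    log.info("request and context: %s", context)
    done = 0
    try:
        for item in items:
            handle(item)
            done += 1
    except Exception:
        # log.exception also records the traceback.
        log.exception(
            "failed after %d of %d items; completed items were kept, "
            "the rest were not attempted; context: %s",
            done, len(items), context,
        )
        return False
    log.info("completed all %d items", done)
    return True

class ListHandler(logging.Handler):
    """Collects formatted messages in memory for the demo."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

log = logging.getLogger("batch_demo")
log.setLevel(logging.INFO)
log.propagate = False
handler = ListHandler()
log.addHandler(handler)

def handle(item):
    if item == "c":
        raise ValueError("bad record")

ok = run_job(["--date", "2024-01-02"], ["a", "b", "c"], handle, log)
```

The failure line repeats the context on purpose: whoever reads it later may see only that one line.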

Uniquely identify failure points; error messages are nice, but embedding a globally-unique identifier (GUID) will aid research: several similar error messages in various parts of the code can be quickly differentiated.
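
A minimal sketch of tagging failure points, assuming each identifier is generated once (for example with `uuid.uuid4()`) and then pasted into the code as a constant. The two GUIDs, the `JobError` class, and the two functions are arbitrary examples.

```python
# Arbitrary example identifiers, one per failure point.
ERR_DB_CONNECT_STARTUP = "7c9e6679-7425-40de-944b-e07fc1f90ae7"
ERR_DB_CONNECT_RETRY = "0f8fad5b-d9cb-469f-a165-70867728950e"

class JobError(Exception):
    """An error that carries the identifier of the place that raised it."""
    def __init__(self, error_id, message):
        super().__init__(f"{message} [{error_id}]")
        self.error_id = error_id
        self.message = message

def connect_at_startup():
    raise JobError(ERR_DB_CONNECT_STARTUP, "could not connect to database")

def reconnect_after_timeout():
    raise JobError(ERR_DB_CONNECT_RETRY, "could not connect to database")

# Both failure points produce the same wording but different identifiers.
caught = []
for step in (connect_at_startup, reconnect_after_timeout):
    try:
        step()
    except JobError as exc:
        caught.append(exc)
```

Searching the source for the identifier found in a log line leads to exactly one place, even though the message text appears twice.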

Know your audience: programmers have different needs from end-users when it comes to interpreting error messages; consider keeping two versions of your error messages, where the version intended for support staff includes information such as the systems involved, the likely causes, and the usual solutions. Support staff also need access to the kind of information you would log (request, context, response), and your Help Desk may not have access to server log files. Provide a mechanism to give them access to this information when they are contacted by an end-user.
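
A minimal sketch of keeping two versions of each message, assuming a simple in-code catalog keyed by an error code. The code, the wording, and the file path are all invented for the example.

```python
MESSAGES = {
    "IMPORT_FILE_MISSING": {
        "user": (
            "Today's import could not be completed. "
            "Please contact the Help Desk."
        ),
        "support": (
            "Import file not found. "
            "Systems: file transfer server, import job. "
            "Likely causes: upstream export did not run, or transfer failed. "
            "Usual solution: confirm the export ran, re-send the file, "
            "rerun the import."
        ),
    },
}

def render(code, audience, details=None):
    """Return the message for one audience; support also gets the details."""
    text = MESSAGES[code][audience]
    if audience == "support" and details:
        extras = ", ".join(f"{k}={v}" for k, v in sorted(details.items()))
        text = f"{text} Details: {extras}"
    return f"{code}: {text}"

user_text = render("IMPORT_FILE_MISSING", "user")
support_text = render(
    "IMPORT_FILE_MISSING", "support",
    {"path": "/data/in/orders.csv", "job": "nightly_import"},
)
```

Both versions share the same code, so an end-user can read it out to the Help Desk and support staff can look up the longer version.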

[more things to talk about]
Accessibility
Internationalization
Scripting, Automation, Integration
Monitoring
Debugging
Deploying
Testing
User interface
Requirements
Project management
Crisis management
Security (passwords [don't give out root, use sudo], permissions, attacks, ports, encryption, etc.)
Safety (CRC, transactions, 2pc, 2pc when 2pc isn't available, confirmations, undo, backups [verify them!], fail-over, onames, etc.)
Efficiency (load-balancing, ...)
Design
Auditing (logging for the purpose not of debugging, but blaming or discovering)