Consistently handling long-running jobs
The OneCompute Platform can handle long-running jobs, but some effort is required to do so consistently.
Setting a cancel job timeout
By default, a job can run forever, so it is recommended to set an expiry time for the job. If the job has not completed by then, the job scheduler cancels it. This guards the user against accidental overruns when a calculation has stalled. Because a cancelled job may also be a valid calculation that simply took unexpectedly long, the "worker" developer should implement a shutdown timeout as well. It should fire before the cancel and leave enough time to shut the calculation down safely and store the results to disk.
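The shutdown timeout described above can be sketched roughly as follows. This is a minimal illustration in Python, not OneCompute API code: the function `run_with_shutdown_deadline`, its parameters, and the idea of splitting the calculation into steps are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List


def run_with_shutdown_deadline(
    steps: Iterable[Callable[[], float]],
    cancel_timeout_s: float,
    shutdown_margin_s: float,
    save_results: Callable[[List[float]], None],
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run steps until done or until the soft deadline; always save results.

    Returns True if every step ran, False if the worker stopped early.
    All names here are hypothetical; they are not part of OneCompute.
    """
    # Soft deadline: the scheduler's cancel timeout minus the time
    # reserved for storing results.
    deadline = clock() + cancel_timeout_s - shutdown_margin_s
    results: List[float] = []
    completed = True
    for step in steps:
        if clock() >= deadline:
            completed = False  # stop before the scheduler cancels the job
            break
        results.append(step())
    save_results(results)  # persist partial or complete results
    return completed
```

The deadline is only checked between steps, so the shutdown margin in this sketch has to cover the longest single step plus the time needed to store the results.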
Also take into account that the pool may be full, in which case not all work units start at time 0. Therefore, make sure the longest-running work units are at the top of the queue.
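To illustrate why the queue order matters, here is a small sketch under a simplified model: each node runs one work unit at a time, and a free node always takes the next unit from the queue. The functions `order_longest_first` and `makespan`, and the duration estimates, are hypothetical and only serve the example; the actual pool may schedule differently.

```python
import heapq
from typing import Dict, List


def order_longest_first(estimated_hours: Dict[str, float]) -> List[str]:
    """Return work unit names sorted by estimated duration, longest first."""
    return sorted(estimated_hours, key=lambda name: estimated_hours[name], reverse=True)


def makespan(durations: List[float], pool_size: int) -> float:
    """Total wall-clock time when units are started in queue order."""
    free_at = [0.0] * pool_size  # time at which each node becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available node
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


estimates = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 4.0}
as_submitted = makespan(list(estimates.values()), pool_size=2)
ordered = order_longest_first(estimates)
longest_first = makespan([estimates[n] for n in ordered], pool_size=2)
```

With these numbers and two nodes, starting the 4-hour unit last gives a total of 6 hours, while starting it first gives 4 hours. The cancel timeout should be sized for the total, not just for the longest single unit.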
File transfer in long-running jobs
If the standard file transfer mechanism of OneCompute is used, the SAS token created by the platform client is valid for 7 days and is kept by the workerhost. If a job lasts longer than 7 days, the token expires and the workerhost is blocked from uploading the results to storage. Note that the _joblogs files stdout.txt and stderr.txt are only transferred to blob storage on completion if the job property ContainerUri is set.
Creating SAS tokens valid for more than 7 days
The OneCompute platform client can issue a SAS token that is valid for any number of days. Use the days option when defining the file transfer.
Summary
The following recipe helps to handle long-running jobs consistently:
- Implement a timer that fires some time before the cancel job event occurs, leaving enough time to safely shut down a valid computation.
- Set the cancel timeout so that it expires only after the safe shutdown has completed.
- Make sure the SAS token is valid a day longer than the cancel job timeout.
- Set the SAS token as a Job property.
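
The timing relationships in this recipe can be worked out with a small calculation. The sketch below is an assumption-laden illustration in Python: `LongJobPlan` and `plan_long_job` are hypothetical helpers, not part of the OneCompute API, and the resulting values still have to be passed to the platform through its own settings.

```python
import math
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class LongJobPlan:
    shutdown_after: timedelta  # when the worker's own timer fires
    cancel_after: timedelta    # cancel job timeout for the scheduler
    sas_valid_days: int        # validity to request for the SAS token


def plan_long_job(max_runtime: timedelta, shutdown_margin: timedelta) -> LongJobPlan:
    """Derive the three timings from a runtime budget and a shutdown margin."""
    if max_runtime <= timedelta(0) or shutdown_margin <= timedelta(0):
        raise ValueError("max_runtime and shutdown_margin must be positive")
    # The scheduler cancels only after the worker has had time to shut down.
    cancel_after = max_runtime + shutdown_margin
    # Round the cancel timeout up to whole days, then add one day of slack.
    sas_valid_days = math.ceil(cancel_after / timedelta(days=1)) + 1
    return LongJobPlan(max_runtime, cancel_after, sas_valid_days)


plan = plan_long_job(timedelta(days=9), timedelta(hours=6))
```

For a 9-day runtime budget and a 6-hour shutdown margin, this gives a cancel timeout of 9 days 6 hours and a SAS token validity of 11 days.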