Restart and Reschedule

At work, we have been somewhat puzzled by two concepts in HashiCorp Nomad: Restart and reschedule.

The documentation, which has been updated to be more approachable (very much appreciated), summarizes them like this:

Restart configures a task's behavior on task failure.
Reschedule configures a group's rescheduling strategy.

First of all, both stanzas follow a pattern that is common in Nomad job specs. The restart stanza can be specified at the task level and at the group level. The group-level settings apply to all tasks within the group, and task-level settings take precedence over the group-level restart configuration. The same holds for reschedule, with group and task replaced by job and group: it can be specified at the job level and at the group level, and the group-level settings take precedence.
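
To make the layering concrete, here is a minimal sketch of a job spec that sets both stanzas at both levels. The job, group and task names, the Docker images and all of the numbers are made up for illustration; only the stanza and parameter names are taken from the Nomad job specification.

```hcl
job "example" {
  # Job-level reschedule: used by every group that does not set its own.
  reschedule {
    attempts       = 3
    interval       = "1h"
    delay          = "30s"
    delay_function = "constant"
    unlimited      = false
  }

  group "web" {
    # Group-level reschedule: takes precedence over the job-level one
    # for this group only.
    reschedule {
      attempts       = 1
      interval       = "24h"
      delay          = "5s"
      delay_function = "constant"
      unlimited      = false
    }

    # Group-level restart: applies to all tasks in this group.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # Hypothetical task without its own restart stanza:
    # it uses the group-level settings above.
    task "app" {
      driver = "docker"

      config {
        image = "example/app:1.0"
      }
    }

    # Hypothetical task with its own restart stanza:
    # these values take precedence over the group-level ones.
    task "sidecar" {
      driver = "docker"

      config {
        image = "example/sidecar:1.0"
      }

      restart {
        attempts = 5
        interval = "30m"
        delay    = "15s"
        mode     = "fail"
      }
    }
  }
}
```

In this sketch, "app" would be restarted according to the group-level stanza, "sidecar" according to its own, and the "web" group would be rescheduled according to its group-level stanza rather than the job-level one.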

But what do they actually do and specify? To dig into this, we have to consider the life cycle of a task and of a task group.

Initially, a task group is in the running state. Either the task group and its tasks were running already, or the group was scheduled and has now entered the running state for the first time.
Now, something happens and one of the tasks of the task group fails. A common situation could be memory exhaustion, followed by the OOM killer taking down a process. Or, during startup, firewall rules or similar problems might leave a container running on a node without access to the database, which also causes the application to fail.
This is when the restart stanza applies. If a task fails, Nomad will attempt to restart the task in place, on the same node, to keep it running. While doing so, Nomad respects the configured delay between restarts.
If this does not work, the task will eventually use up the configured number of restart attempts within the configured interval. At this point, assuming the restart mode is "fail" (the default), Nomad will switch gears and begin rescheduling the task group. It will stop the remaining tasks of the allocation containing the failed task and set the entire allocation into a failed state.
From here on, Nomad respects the reschedule delays and attempts and will eventually schedule a new allocation for the failed task group.
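
The sequence above maps onto the two stanzas roughly as follows. This is a sketch, not a recommended configuration: the job, group and task names, the image and all values are invented for illustration, and the comments describe my understanding of each parameter.

```hcl
job "example-api" {
  group "api" {
    # Phase 1: restart the failed task in place, on the same node.
    restart {
      attempts = 3      # at most 3 restarts ...
      interval = "10m"  # ... within any 10 minute window
      delay    = "15s"  # wait before each restart
      # "fail": once the attempts are used up, the allocation is marked
      # as failed and the reschedule stanza takes over.
      # "delay" would instead wait for the next interval and keep
      # restarting in place.
      mode = "fail"
    }

    # Phase 2: place a new allocation for the failed task group.
    reschedule {
      delay          = "30s"          # wait before the first reschedule
      delay_function = "exponential"  # 30s, 1m, 2m, 4m, ...
      max_delay      = "10m"          # upper bound for the growing delay
      unlimited      = true           # keep trying; attempts/interval unused
    }

    # Hypothetical task, only here to make the group complete.
    task "server" {
      driver = "docker"

      config {
        image = "example/api:1.0"
      }
    }
  }
}
```

With settings like these, a task that keeps crashing would be restarted in place three times, then its allocation would be failed and a replacement would be scheduled after 30 seconds, with the wait growing on every further failure.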

So, the simple summary: a task fails, the task is restarted in place a few times, and eventually the entire allocation is marked as failed and the task group is rescheduled.

The restart on task failure is capable of working around a number of isolated issues. For example, a Go process might panic and exit. Or a process might allocate a lot of memory and end up OOM killed. In those circumstances, Nomad will restart the task according to the restart stanza, and the service comes back without anyone having to step in. The underlying bug is still there, but this lets Nomad temporarily work around a recurring crash by restarting aggressively.

The reschedule stanza seems to be more about misconfigured or failing nodes. I haven’t had this in Nomad yet, but I’ve had to deal with servers with failing memory, which caused seemingly random segfaults. And there have been misconfigured Nomad clients which had not been allowed through various firewalls yet, so a task kept failing for as long as it sat on a blocked host, until it got rescheduled.

As far as I know, though I’m not 100% sure, the reschedule stanza is also used if an allocation is lost, which occurs if the Nomad servers lose communication with a Nomad client. Again, rescheduling is used to work around a broken Nomad client.