Although containers are bringing a refreshing flexibility when deploying services in production, the management of those containers in such an environment still requires special care in order to keep the application up and running. In this regard, orchestration platforms like Docker, Kubernetes and Nomad have been trying to alleviate this responsibility, facilitating the task of deploying and maintaining the entire application stack in its desired state. This ensures that a service will be always running, tolerating machine failures, network erratic behavior or software updates and downtime.
The purpose of this talk is to explain the mechanisms used in the core Docker Engine orchestration platform (using a framework called swarmkit) to tolerate failures of services and machines, from cluster state replication and leader-election to container re-scheduling logic when a host goes down.