| 1 | +--- |
| 2 | +layout: global |
| 3 | +title: Kubernetes Implementation of the Spark Scheduler Backend |
| 4 | +--- |
| 5 | + |
| 6 | +# Scheduler Backend |
| 7 | + |
| 8 | +The general idea is to run Spark drivers and executors inside Kubernetes [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/). |
| 9 | +A pod is a co-located and co-scheduled group of one or more containers run in a shared context. The main component is `KubernetesClusterSchedulerBackend`,
| 10 | +an implementation of `CoarseGrainedSchedulerBackend`, which manages allocating and destroying executors via the Kubernetes API.
| 11 | +There are also two auxiliary, optional components, `ResourceStagingServer` and `KubernetesExternalShuffleService`, which serve specific purposes described further below.
| 12 | + |
| 13 | +The scheduler backend is invoked in the driver associated with a particular job. The driver may run outside the cluster (client mode) or within (cluster mode). |
| 14 | +The scheduler backend manages [pods](http://kubernetes.io/docs/user-guide/pods/) for each executor. |
| 15 | +The executor code runs within a Kubernetes pod, but remains unmodified and unaware of the orchestration layer.
| 16 | +When a job is running, the scheduler backend configures and creates executor pods with the following properties: |
| 17 | + |
| 18 | +- The pod's container runs a pre-built Docker image containing a Spark distribution (with Kubernetes integration) and |
| 19 | +invokes the Java runtime with the `CoarseGrainedExecutorBackend` main class.
| 20 | +- The scheduler backend specifies environment variables on the executor pod to configure its runtime,
| 21 | +particularly for its JVM options, number of cores, heap size, and the driver's hostname.
| 22 | +- The executor container has [resource limits and requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) |
| 23 | +that are set in accordance with the resources specified in the Spark configuration (`spark.executor.cores` and `spark.executor.memory` in the application's SparkConf).
| 24 | +- The executor pods may also be launched into a particular [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/),
| 25 | +or target a particular subset of nodes in the Kubernetes cluster, based on the Spark configuration supplied. |
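
The properties listed above can be pictured with a small, self-contained sketch. It uses plain Scala case classes rather than a real Kubernetes client, and every type, field, pod name, and environment variable name in it is a hypothetical placeholder, not taken from Spark. Only the `spark.executor.cores` and `spark.executor.memory` keys and the executor main class are real; the fallback values, the simplified command line, and the choice to copy resource strings verbatim into both requests and limits (with no unit conversion) are assumptions made for illustration.

```scala
// Illustrative model only: plain case classes standing in for a pod definition.
// None of these types or environment variable names come from Spark or a Kubernetes client.
final case class Resources(cpu: String, memory: String)

final case class ExecutorPodSpec(
    name: String,
    namespace: String,
    image: String,
    command: Seq[String],
    env: Map[String, String],
    requests: Resources,
    limits: Resources,
    nodeSelector: Map[String, String])

object ExecutorPodSketch {
  /** Builds a pod description from a flat key/value view of the Spark configuration. */
  def build(
      conf: Map[String, String],
      executorId: String,
      driverHost: String,
      namespace: String,
      image: String,
      nodeSelector: Map[String, String]): ExecutorPodSpec = {
    // Fallback values here are assumptions for the sketch.
    val cores = conf.getOrElse("spark.executor.cores", "1")
    val memory = conf.getOrElse("spark.executor.memory", "1g")
    // Values are copied verbatim; no Spark-to-Kubernetes unit conversion is attempted.
    val resources = Resources(cpu = cores, memory = memory)

    ExecutorPodSpec(
      name = s"spark-exec-$executorId", // hypothetical naming scheme
      namespace = namespace,
      image = image,
      // Simplified command: a real launch would also need a classpath and arguments.
      command = Seq("java", "org.apache.spark.executor.CoarseGrainedExecutorBackend"),
      env = Map(
        "EXECUTOR_ID" -> executorId,      // placeholder variable names
        "EXECUTOR_CORES" -> cores,
        "EXECUTOR_MEMORY" -> memory,
        "DRIVER_HOST" -> driverHost),
      requests = resources,
      limits = resources,
      nodeSelector = nodeSelector)
  }
}
```

Keeping the builder a pure function of the configuration makes it easy to check that a given SparkConf produces the expected pod description before anything is sent to the API server.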
| 26 | + |
| 27 | +## Requesting Executors |
| 28 | + |
| 29 | +Spark requests new executors through the `doRequestTotalExecutors(numExecutors: Int)` method.
| 30 | +The scheduler backend keeps track of the total number of executors that Spark core has requested.
| 31 | + |
| 32 | +A separate `kubernetes-pod-allocator` thread handles the creation of new executor pods with appropriate throttling and monitoring.
| 33 | +This indirection is required because the Kubernetes API server accepts requests for new executor pods optimistically, in
| 34 | +anticipation of eventually being able to run them. However, it is undesirable to have a very large number of pods that cannot be
| 35 | +scheduled and stay pending within the cluster. Hence, the `kubernetes-pod-allocator` uses the Kubernetes API to decide whether to
| 36 | +submit new requests for executors, based on whether previous pod creation requests have completed. This gives us control over how
| 37 | +fast a job scales up (which can be configured), and helps prevent Spark jobs from overwhelming the Kubernetes API server with pod creation requests.
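
The throttling described above can be sketched as a small state machine. This is a minimal illustration, not the actual allocator: the class name, the `batchSize` parameter, and the rule "request nothing while any previously created pod is still pending" are assumptions chosen to make the idea concrete. In the sketch, `requestTotal` plays the role of `doRequestTotalExecutors` by only recording the target, while `nextBatch` represents one round of the allocator thread.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: names and the batch-size rule are illustrative assumptions.
final class PodAllocatorSketch(batchSize: Int) {
  private val totalExpected = new AtomicInteger(0) // target recorded from Spark core
  private var created = 0                          // pods requested from the API server so far
  private var running = 0                          // pods observed to have started

  /** Mirrors the role of doRequestTotalExecutors: only records the target. */
  def requestTotal(numExecutors: Int): Unit = totalExpected.set(numExecutors)

  /** Called when monitoring reports that `n` more pods have started running. */
  def markRunning(n: Int): Unit = synchronized {
    running = math.min(created, running + n)
  }

  /** One allocator round: returns how many new pods to request now. */
  def nextBatch(): Int = synchronized {
    val pending = created - running
    val missing = totalExpected.get - created
    // Hold back while earlier requests are still pending; otherwise create at most one batch.
    val toCreate = if (pending > 0) 0 else math.max(0, math.min(batchSize, missing))
    created += toCreate
    toCreate
  }
}
```

With a batch size of 5 and a target of 12 executors, successive rounds would request 5, then 5, then 2 pods, provided each batch is reported as running before the next round. A round that runs while a batch is still pending requests nothing.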
| 38 | + |
| 39 | +## Destroying Executors |
| 40 | + |
| 41 | +Spark requests deletion of executors through the `doKillExecutors(executorIds: Seq[String])`
| 42 | +method.
| 43 | + |
| 44 | +The implementation of `doKillExecutors()` performs the inverse operation. When the executor
| 45 | +allocation manager wants to remove executors from the application, the scheduler backend should find the
| 46 | +pods that are running those executors and tell the API server to stop them.
| 47 | +Note that this code does not have to decide which executors should be
| 48 | +removed. When `doKillExecutors()` is called, the executors to be removed have already been
| 49 | +selected by the `CoarseGrainedSchedulerBackend` and `ExecutorAllocationManager`.
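
A minimal sketch of this lookup-and-delete step follows. It assumes the backend keeps a map from executor ID to pod name; the class name, the `register` method, and the `deletePod` callback (a stand-in for whatever Kubernetes API call actually removes a pod) are all hypothetical. The sketch makes no selection decisions of its own and simply acts on the IDs it is given, ignoring any it does not know.

```scala
import scala.collection.mutable

// Hypothetical sketch: `deletePod` stands in for a call to the Kubernetes API.
final class ExecutorPodRegistry(deletePod: String => Unit) {
  private val podByExecutorId = mutable.Map.empty[String, String]

  /** Records which pod hosts a given executor. */
  def register(executorId: String, podName: String): Unit = synchronized {
    podByExecutorId(executorId) = podName
  }

  /** Deletes the pods of the given executors; returns the pod names that were deleted. */
  def killExecutors(executorIds: Seq[String]): Seq[String] = synchronized {
    // Unknown executor IDs yield None from `remove` and are skipped.
    val pods = executorIds.flatMap(id => podByExecutorId.remove(id))
    pods.foreach(deletePod)
    pods
  }
}
```

Passing the delete operation in as a function keeps the bookkeeping testable without a cluster: a test can supply a callback that records pod names instead of contacting the API server.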