From 116bd821ae7b5ceb80ba35126823fa126422e0c7 Mon Sep 17 00:00:00 2001 From: Toni Finger Date: Wed, 27 Mar 2024 10:15:31 +0100 Subject: [PATCH 1/6] Initial development of decision record on "KubernetesLogging/Monitoring/Tracing" Signed-off-by: Toni Finger --- ...-0219-v1-k8s-monitoring-logging-tracing.md | 38 +++++++++++++++++++ 1 file changed, 38 insertions(+) create mode 100644 Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md new file mode 100644 index 000000000..c1d3e5f33 --- /dev/null +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -0,0 +1,38 @@ +--- +title: Kubernetes Logging/Monitoring/Tracing +type: Decision Record +status: Draft +track: KaaS +--- + +## Motivation + +Either as an administrator or as a customer of a Kubernetes cluster, at some point you will need to debug useful information. +In order to obtain this information, mechanisms SHOULD be available to retrieve this information. +These mechanisms consist of: +* Logging +* Monitoring +* Tracing + +The aim of this decision record is to examine how Kubernetes handles thoes mechanisms. +Derived from this, this decision record provides a suggestion on how a Kubernetes cluster SHOULD be configured in order to provide meaningful and comprehensible information via logging, monitoring and tracing. 
+ + + +## Decision + + + + +[k8s-debug]: https://kubernetes.io/docs/tasks/debug/ +[prometheus-operator]: https://github.com/prometheus-operator/prometheus-operator +[k8s-metrics]: https://github.com/kubernetes/metrics +[system-metrics]: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/ +[system-metrics_metric-lifecycle]: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metric-lifecycle +[kube-state-metrics]: https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/ +[k8s-deprecating-a-metric]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-metric +[k8s-show-hidden-metrics]: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics +[system-traces]: https://kubernetes.io/docs/concepts/cluster-administration/system-traces/ +[system-logs]: https://kubernetes.io/docs/concepts/cluster-administration/system-logs/ +[monitor-node-health]: https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/ +[k8s-logging]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ From 838f32adff1a93b2bacd9e99a082e50377e350f3 Mon Sep 17 00:00:00 2001 From: Toni Finger Date: Fri, 5 Apr 2024 16:44:43 +0200 Subject: [PATCH 2/6] Added more detail regarding "KubernetesLogging/Monitoring/Tracing" Signed-off-by: Toni Finger --- ...-0219-v1-k8s-monitoring-logging-tracing.md | 29 +++++++++++++++++-- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md index c1d3e5f33..0388b1b03 100644 --- a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -5,22 +5,45 @@ status: Draft track: KaaS --- + ## Motivation -Either as an administrator or as a customer of a Kubernetes cluster, at some point you will need to debug useful information. 
+Either as an operator or as an end user of a Kubernetes cluster, at some point you will need information to debug a problem. In order to obtain this information, mechanisms SHOULD be available to retrieve this information. These mechanisms consist of: * Logging * Monitoring * Tracing -The aim of this decision record is to examine how Kubernetes handles thoes mechanisms. +The aim of this decision record is to examine how Kubernetes handles thoes mechanisms. Derived from this, this decision record provides a suggestion on how a Kubernetes cluster SHOULD be configured in order to provide meaningful and comprehensible information via logging, monitoring and tracing. - ## Decision +A Kubernetes cluster MUST provide both monitoring and logging. +In addition, a Kubernetes cluster MAY provide traceability mechanisms, as this is important for time-based troubleshooting. +Therefore, a standardized concept for the setup of the overall mechanisms as well as the resources to be consumed MUST be defined. + +This concept SHALL define monitoring and logging in a federated structure. +Therefore, a monitoring and logging stack MUST be deployed on each k8s cluster. +A central monitoring can then fetch data from the clusters individual monitoring stacks. + + +### Monitoring + +> see: [Metrics For Kubernetes System Components][system-metrics] +> see: [Metrics for Kubernetes Object States][kube-state-metrics] + + +SCS KaaS infrastructure monitoring SHOULD be used as a diagnostic tool to alert operators and end users to system-related issues by analyzing metrics. +Therefore, it includes the collection and visualization of the corresponding metrics. +Optionally, an alerting mechanism COULD also be standardized. +This SHOULD contain a minimal set of important metrics that signal problematic conditions of a cluster in any case. 
+ +> Describe one examples here in more detail + + From 7cde0eb4a10d5026565b3411d0a82c38fbebabd5 Mon Sep 17 00:00:00 2001 From: Toni Finger Date: Mon, 8 Apr 2024 22:01:24 +0200 Subject: [PATCH 3/6] Add proposal to standardize the use of the Kubernetes metrics server Signed-off-by: Toni Finger --- ...-0219-v1-k8s-monitoring-logging-tracing.md | 23 +++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md index 0388b1b03..dd12cfef2 100644 --- a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -44,6 +44,29 @@ This SHOULD contain a minimal set of important metrics that signal problematic c > Describe one examples here in more detail +#### Kubernetes Metric Server + +Kubernetes provides a source for container resource metrics. +The main purpose of this source is to be used for Kubernetes' built-in auto-scaling [kubernetes-metrics-server][kubernetes-metrics-server-repo]. +However, it could also be used as a source of metrics for monitoring. +Therefore, this metrics server MUST also be readily accessible for the mono-monitoring setup. + +Furthermore, end users rely on certain metrics to debug their applications. +More specifically, an end user wants to have access to all metrics defined by Kubernetes itself. +The content of the metrics to be provided by the [kubernetes-metrics-server][kubernetes-metrics-server-repo] is bound to a Kubernetes version and is organized according to the [kubernetes metrics lifecycle][system-metrics_metric-lifecycle]. + +In order for an end user to be sure that these metrics are accessible, a cluster MUST provide the metrics in the respective version. 
+ + + + +### Logging + +> see: [Logging Architecture][k8s-logging] + +### Tracing + +> see: [Traces For Kubernetes System Components][system-traces] From eb64ae2a5a13a9199577f60575fc0ab272d477de Mon Sep 17 00:00:00 2001 From: Toni Finger Date: Tue, 11 Jun 2024 14:17:17 +0200 Subject: [PATCH 4/6] Adding additional information Signed-off-by: Toni Finger --- ...s-0219-v1-k8s-monitoring-logging-tracing.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md index dd12cfef2..85f50b18b 100644 --- a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -41,7 +41,7 @@ Therefore, it includes the collection and visualization of the corresponding met Optionally, an alerting mechanism COULD also be standardized. This SHOULD contain a minimal set of important metrics that signal problematic conditions of a cluster in any case. -> Describe one examples here in more detail +> TODO: Describe one examples here in more detail #### Kubernetes Metric Server @@ -49,7 +49,7 @@ This SHOULD contain a minimal set of important metrics that signal problematic c Kubernetes provides a source for container resource metrics. The main purpose of this source is to be used for Kubernetes' built-in auto-scaling [kubernetes-metrics-server][kubernetes-metrics-server-repo]. However, it could also be used as a source of metrics for monitoring. -Therefore, this metrics server MUST also be readily accessible for the mono-monitoring setup. +Therefore, this metrics server MUST also be readily accessible for the monitoring setup. Furthermore, end users rely on certain metrics to debug their applications. More specifically, an end user wants to have access to all metrics defined by Kubernetes itself. 
@@ -58,10 +58,23 @@ The content of the metrics to be provided by the [kubernetes-metrics-server][kub In order for an end user to be sure that these metrics are accessible, a cluster MUST provide the metrics in the respective version. +#### Prometheus Operator +One of the most commonly used monitoring tools in connection with Kubernetes is Prometheus +Therefore, every k8s cluster CLOUD have a [prometheus-operator][prometheus-operator] deployed to all control plane clusters as an optional default. +The operator SHOULD at least be rolled out to all control plane nodes. + + +#### Security + +Communication between the Prometheus services (exporter, database, federation, etc.) SHOULD be accomplished using "[mutual][mutual-auth] TLS" (mTLS). ### Logging +In Kubernetes clusters, log data is not persistent and is discarded after a container is stopped or destroyed. +This makes it difficult to debug crashed pods of a deployment after they have been destroyed. +Therefore, the SCS stack SHOULD also optionally provide a logging stack that solves this problem by storing the log file in a self-managed database beyond the lifetime of a container. 
+ > see: [Logging Architecture][k8s-logging] ### Tracing @@ -82,3 +95,4 @@ In order for an end user to be sure that these metrics are accessible, a cluster [system-logs]: https://kubernetes.io/docs/concepts/cluster-administration/system-logs/ [monitor-node-health]: https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/ [k8s-logging]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ +[mutual-auth]: https://en.wikipedia.org/wiki/Mutual_authentication From cd53e9df41769d7bf36e269a136e4a4f05ee2fd6 Mon Sep 17 00:00:00 2001 From: tonifinger <129007376+tonifinger@users.noreply.github.com> Date: Mon, 11 Nov 2024 09:36:55 +0100 Subject: [PATCH 5/6] Update Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md Co-authored-by: Michal Gubricky Signed-off-by: tonifinger <129007376+tonifinger@users.noreply.github.com> --- Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md index 85f50b18b..7df45d0c4 100644 --- a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -15,7 +15,7 @@ These mechanisms consist of: * Monitoring * Tracing -The aim of this decision record is to examine how Kubernetes handles thoes mechanisms. +The aim of this decision record is to examine how Kubernetes handles those mechanisms. Derived from this, this decision record provides a suggestion on how a Kubernetes cluster SHOULD be configured in order to provide meaningful and comprehensible information via logging, monitoring and tracing. 
From 25d4a3637c87d579ed723f16b38e67c452031536 Mon Sep 17 00:00:00 2001 From: Toni Finger Date: Mon, 25 Nov 2024 20:48:48 +0100 Subject: [PATCH 6/6] Update "scs-0219-v1-k8s-monitoring-logging-tracing.md" Signed-off-by: Toni Finger --- .../scs-0219-v1-k8s-monitoring-logging-tracing.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md index 7df45d0c4..c6ffe6f16 100644 --- a/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md +++ b/Standards/scs-0219-v1-k8s-monitoring-logging-tracing.md @@ -27,8 +27,7 @@ Therefore, a standardized concept for the setup of the overall mechanisms as wel This concept SHALL define monitoring and logging in a federated structure. Therefore, a monitoring and logging stack MUST be deployed on each k8s cluster. -A central monitoring can then fetch data from the clusters individual monitoring stacks. - +A central monitoring system can then fetch data from the individual clusters' monitoring stacks and visualize the collected metrics in Grafana. ### Monitoring @@ -38,8 +37,9 @@ A central monitoring can then fetch data from the clusters individual monitoring SCS KaaS infrastructure monitoring SHOULD be used as a diagnostic tool to alert operators and end users to system-related issues by analyzing metrics. Therefore, it includes the collection and visualization of the corresponding metrics. -Optionally, an alerting mechanism COULD also be standardized. -This SHOULD contain a minimal set of important metrics that signal problematic conditions of a cluster in any case. + +In addition, an alerting mechanism MUST also be standardized. +This MUST contain a minimal set of important metrics that signal problematic conditions of a cluster in any case. 
> TODO: Describe one examples here in more detail @@ -61,7 +61,7 @@ In order for an end user to be sure that these metrics are accessible, a cluster #### Prometheus Operator One of the most commonly used monitoring tools in connection with Kubernetes is Prometheus -Therefore, every k8s cluster CLOUD have a [prometheus-operator][prometheus-operator] deployed to all control plane clusters as an optional default. +Therefore, every k8s cluster COULD have a [prometheus-operator][prometheus-operator] deployed to all control plane nodes by default. The operator SHOULD at least be rolled out to all control plane nodes. @@ -79,9 +79,6 @@ Therefore, the SCS stack SHOULD also optionally provide a logging stack that sol ### Tracing -> see: [Traces For Kubernetes System Components][system-traces] - - [k8s-debug]: https://kubernetes.io/docs/tasks/debug/ [prometheus-operator]: https://github.com/prometheus-operator/prometheus-operator @@ -96,3 +93,5 @@ Therefore, the SCS stack SHOULD also optionally provide a logging stack that sol [monitor-node-health]: https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/ [k8s-logging]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ [mutual-auth]: https://en.wikipedia.org/wiki/Mutual_authentication +[kubernetes-metrics-server-repo]: https://github.com/kubernetes-sigs/metrics-server?tab=readme-ov-file#kubernetes-metrics-server +