Skip to content

Conversation

alyssacgoins
Copy link
Contributor

@alyssacgoins alyssacgoins commented Jul 28, 2025

Description of your changes:
Implement an option for pod-to-pod TLS protocol at the API server and launcher levels. PR includes an e2e test run with TLS enabled.
Checklist:

@google-oss-prow google-oss-prow bot requested review from gmfrasca and mprahl July 28, 2025 19:13
@alyssacgoins alyssacgoins changed the title feat (backend/frontend): Implement TLS for apiserver HTTP/GRPC [WIP] feat (backend/frontend): Implement TLS for apiserver HTTP/GRPC Jul 29, 2025
@alyssacgoins alyssacgoins force-pushed the carry-cert-commits branch 17 times, most recently from 26ecdbc to ffd271f Compare August 5, 2025 19:29
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Aug 5, 2025
@alyssacgoins alyssacgoins force-pushed the carry-cert-commits branch 6 times, most recently from ac1e748 to c8b071a Compare August 6, 2025 15:33
}
}
// Append mounted custom CA cert to System CA bundle.
customCa, err := os.ReadFile("etc/pki/tls/cert/ca.crt")
Copy link
Collaborator

@mprahl mprahl Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add test coverage to catch this in the future.

Suggested change
customCa, err := os.ReadFile("etc/pki/tls/cert/ca.crt")
customCA, err := os.ReadFile("/etc/pki/tls/cert/ca.crt")

} else {
// If a webhook TLS key/cert are not specified, default to API server's TLS key/cert if specified.
if *tlsCertPath != "" && *tlsCertKeyPath != "" {
if !common.FileExists(*tlsCertPath) || !common.FileExists(*tlsCertPath) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !common.FileExists(*tlsCertPath) || !common.FileExists(*tlsCertPath) {
if !common.FileExists(*tlsCertPath) || !common.FileExists(*tlsCertKeyPath) {


if tlsConfig != nil {
// local client connections via http proxy to grpc should not require tls
tlsConfig.InsecureSkipVerify = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed? Do we not have the CA cert already to trust?

}
return metadata.NewClient(mlmdConfig.Address, mlmdConfig.Port)
tlsConfig := util.GetTLSConfig(*caCertPath)
return metadata.NewClient(mlmdConfig.Address, mlmdConfig.Port, *metadataTLSEnabled, tlsConfig)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove *metadataTLSEnabled as an input argument and just check if tlsConfig is not nil.

spec: spec,
executors: deploy.GetExecutors(),
}
mlPipelineTLSEnabled, err := GetMLPipelineServiceTLSEnabled()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Since every pipeline run will call this, we could just call this inside an init function in backend/src/apiserver/common/config.go and assign it to a variable since the config value will not change during runtime.

}}

func GetMLPipelineServiceTLSEnabled() (bool, error) {
mlPipelineServiceTLSEnabledStr := os.Getenv(MLPipelineTLSEnabledEnvVar)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a viper configuration like the other configurations?

Name: "SSL_CERT_FILE",
Value: common.TLSCertCAPath,
})
sslCertDir := strings.Join(certDirectories, ":")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just set the directory of common.TLSCertCAPath?

// if CA Bundle env vars are specified.
func ConfigureCABundle(tmpl *wfapi.Template) {
caBundleDir := common.CABundleDir
caBundleSecretName := os.Getenv(common.CaBundleSecretName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a viper configuration as well?

Comment on lines 568 to 578
if err != nil {
return "", err
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file not being found is a valid error that we should handle and just continue to the next option.


// If TLS enabled, save system CA (with mounted custom CA appended, if applicable) to temp file for executor access.
if tlsEnabled {
if caBundleTmpPath, err := compileTempCABundle(); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only required when there's a custom CA.

}
sslCertFilePath := os.Getenv("SSL_CERT_FILE")
if sslCertFilePath != "" {
systemCAs = append(systemCAs, sslCertFilePath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sslCertFilePath should be first in the list to check.

)

// GetTLSConfig returns TLS config set with system CA certs as well as custom CA stored at input caCertPath.
func GetTLSConfig(caCertPath string) *tls.Config {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this return an error and let the caller decided if it should be fatal?

if caBundleTmpPath, err := compileTempCABundle(); err != nil {
return nil, err
} else {
os.Setenv("REQUESTS_CA_BUNDLE", caBundleTmpPath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: It'd be nice if the list of environment variables was configurable for the admin. Perhaps a separate GitHub issue.

useProxy = flag.Bool("useProxy", false, "Whether to run the proxy tests")
cacheEnabled = flag.Bool("cacheEnabled", true, "Whether cache is enabled tests")
uploadPipelinesWithKubernetes = flag.Bool("uploadPipelinesWithKubernetes", false, "Whether to use Kubernetes for uploading pipelines or use the REST API")
tlsEnabled = flag.Bool("tlsEnabled", false, "Whether TLS is enabled tests")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tlsEnabled = flag.Bool("tlsEnabled", false, "Whether TLS is enabled tests")
tlsEnabled = flag.Bool("tlsEnabled", false, "Whether TLS is enabled on the API server")

cacheEnabled = flag.Bool("cacheEnabled", true, "Whether cache is enabled tests")
uploadPipelinesWithKubernetes = flag.Bool("uploadPipelinesWithKubernetes", false, "Whether to use Kubernetes for uploading pipelines or use the REST API")
tlsEnabled = flag.Bool("tlsEnabled", false, "Whether TLS is enabled tests")
caCertPath = flag.String("caCertPath", "", "The path to the CA certificate to trust on connections to the ML pipeline API server and metadata server.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
caCertPath = flag.String("caCertPath", "", "The path to the CA certificate to trust on connections to the ML pipeline API server and metadata server.")
caCertPath = flag.String("caCertPath", "", "The path to the CA certificate to trust on connections to the ML pipeline API server and metadata server")

return api_server.NewPipelineUploadClient(clientConfig, isDebugMode, tlsCfg)
}

func GetTLSConfig(caCertPath string) *tls.Config {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this return an error instead?

### API Server
The API server is mounted with a TLS key/cert pair and serves requests over TLS.
### Scheduledworkflow Controller
The Scheduledworkflow Controller is mounted with a TLS key and sends HTTPS requests to the API server.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The Scheduledworkflow Controller is mounted with a TLS key and sends HTTPS requests to the API server.
The Scheduledworkflow Controller is mounted with the TLS cert CA and sends HTTPS requests to the API server.

### Persistence Agent
The Persistence Agent is mounted with the TLS cert CA and sends HTTPS requests to the API server.
### KFP UI
The UI deployment is mounted with the TLS cert CA and sends HTTPS requests to the metadata-envoy deployment.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the API server?

Comment on lines 17 to 18
The driver and launcher pods compiled by the API server are configured with the TLS cert CA in addition to system CAs in
order to send HTTPS requests to the both the API server and metadata-gRPC deployment. No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The driver and launcher pods compiled by the API server are configured with the TLS cert CA in addition to system CAs in
order to send HTTPS requests to the both the API server and metadata-gRPC deployment.
The driver and launcher pods are configured with the TLS cert CA in addition to system CAs in
order to send HTTPS requests to the both the API server and metadata-gRPC deployment.

Comment on lines 27 to 62
command: ["/bin/sh", "/config-writer/generate-config.sh"]
env:
- name: DBCONFIG_USER
valueFrom:
secretKeyRef:
name: mysql-secret
key: username
- name: DBCONFIG_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: password
- name: MYSQL_DATABASE
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: mlmdDb
- name: MYSQL_HOST
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbHost
- name: MYSQL_PORT
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbPort
volumeMounts:
- name: tls-certs
mountPath: /etc/pki/tls/certs
- name: grpc-tls-config
mountPath: /etc/grpc
- name: config-writer
mountPath: /config-writer
- name: mysql-secret
mountPath: /etc/mysql-secret
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a ConfigMap, could we just inline it? This is what GTP-5 spit out for inlining.

Suggested change
command: ["/bin/sh", "/config-writer/generate-config.sh"]
env:
- name: DBCONFIG_USER
valueFrom:
secretKeyRef:
name: mysql-secret
key: username
- name: DBCONFIG_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: password
- name: MYSQL_DATABASE
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: mlmdDb
- name: MYSQL_HOST
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbHost
- name: MYSQL_PORT
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbPort
volumeMounts:
- name: tls-certs
mountPath: /etc/pki/tls/certs
- name: grpc-tls-config
mountPath: /etc/grpc
- name: config-writer
mountPath: /config-writer
- name: mysql-secret
mountPath: /etc/mysql-secret
command: ["/bin/sh", "-ec"]
args:
- |
escape() { sed ':a;N;$!ba;s/\n/\\n/g' "$1"; }
cat <<EOF >/etc/grpc/config.proto
connection_config {
mysql {
host: "${MYSQL_HOST}"
port: ${MYSQL_PORT}
database: "${MYSQL_DATABASE}"
user: "${DBCONFIG_USER}"
password: "${DBCONFIG_PASSWORD}"
}
}
ssl_config {
server_key: "$(escape /etc/pki/tls/certs/tls.key)"
server_cert: "$(escape /etc/pki/tls/certs/tls.crt)"
custom_ca: "$(escape /etc/pki/tls/certs/ca.crt)"
client_verify: false
}
EOF
env:
- name: DBCONFIG_USER
valueFrom:
secretKeyRef:
name: mysql-secret
key: username
- name: DBCONFIG_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: password
- name: MYSQL_DATABASE
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: mlmdDb
- name: MYSQL_HOST
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbHost
- name: MYSQL_PORT
valueFrom:
configMapKeyRef:
name: pipeline-install-config
key: dbPort
volumeMounts:
- name: tls-certs
mountPath: /etc/pki/tls/certs
- name: grpc-tls-config
mountPath: /etc/grpc
- name: mysql-secret
mountPath: /etc/mysql-secret

COPY third_party/metadata_envoy/license.txt /third_party/license.txt

ENTRYPOINT ["/usr/local/bin/envoy", "-c"]
CMD ["/etc/envoy.yaml"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps change it to /etc/envoy/envoy-config.yaml since we always expect a config there?

command:
- "/bin/apiserver"
args:
- "--config=/config"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use JSON patches to append the new args so we don't have to maintain multiple args lists when we add a new argument.

diff --git a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml
index e9d8dd8f4..efee2fdf0 100644
--- a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml
+++ b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml
@@ -11,18 +11,30 @@ resources:
 namespace: kubeflow
 
 patches:
-  - path: patches/ml-pipeline-apiserver-deployment.yaml
-    target:
+  - target:
       kind: Deployment
       name: ml-pipeline
+    patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "--tlsCertPath=/etc/pki/tls/certs/tls.crt"
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "--tlsCertKeyPath=/etc/pki/tls/certs/tls.key"
   - path: patches/ml-pipeline-apiserver-service.yaml
     target:
       kind: Service
       name: ml-pipeline
-  - path: patches/ml-pipeline-scheduledworkflow-deployment.yaml
-    target:
+  - target:
       kind: Deployment
       name: ml-pipeline-scheduledworkflow
+    patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "-mlPipelineServiceTLSEnabled=true"
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "-mlPipelineServiceTLSCert=/etc/pki/tls/certs/tls.crt"
   - path: patches/metadata-envoy-deployment.yaml
     target:
       kind: Deployment
@@ -46,6 +58,19 @@ patches:
     target:
       kind: Deployment
       name: ml-pipeline-persistenceagent
+  - target:
+      kind: Deployment
+      name: ml-pipeline-persistenceagent
+    patch: |-
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "--mlPipelineAPIServerName=ml-pipeline.kubeflow.svc.cluster.local"
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "--mlPipelineAPIServerHttpPort=8888"
+      - op: add
+        path: /spec/template/spec/containers/0/args/-
+        value: "--mlPipelineAPIServerBasePath=/apis/v2beta1"
   - path: patches/ml-pipeline-ui-deployment.yaml
     target:
       kind: Deployment
diff --git a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml
index 48cb02ff5..149d180be 100644
--- a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml
+++ b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml
@@ -7,14 +7,6 @@ spec:
     spec:
       containers:
         - name: ml-pipeline-api-server
-          command:
-            - "/bin/apiserver"
-          args:
-            - "--config=/config"
-            - "--sampleconfig=/config/sample_config.json"
-            - "-logtostderr=true"
-            - "--tlsCertPath=/etc/pki/tls/certs/tls.crt"
-            - "--tlsCertKeyPath=/etc/pki/tls/certs/tls.key"
           env:
             - name: METADATA_TLS_ENABLED
               value: "true"
diff --git a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml
index fccb9d2fe..9561bbaa2 100644
--- a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml
+++ b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml
@@ -9,10 +9,6 @@ spec:
       containers:
         - name: ml-pipeline-persistenceagent
           command: ["/bin/sh", "-c", "persistence_agent --logtostderr=true --namespace=${NAMESPACE} --ttlSecondsAfterWorkflowFinish=${TTL_SECONDS_AFTER_WORKFLOW_FINISH} --numWorker ${NUM_WORKERS} --executionType ${EXECUTIONTYPE} --logLevel=${LOG_LEVEL} -mlPipelineServiceTLSEnabled=true -caCertPath=/etc/pki/tls/certs/ca.crt"]
-          args:
-            - "--mlPipelineAPIServerName=ml-pipeline.kubeflow.svc.cluster.local"
-            - "--mlPipelineAPIServerHttpPort=8888"
-            - "--mlPipelineAPIServerBasePath=/apis/v2beta1"
           env:
             - name: NAMESPACE
               valueFrom:
diff --git a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml
index a79fcbf76..56686e2b5 100644
--- a/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml
+++ b/manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml
@@ -6,15 +6,7 @@ spec:
   template:
     spec:
       containers:
-        - name: ml-pipeline-scheduledworkflow
-          command:
-            - "/bin/controller"
-          args:
-            - "-logtostderr=true"
-            - "-logLevel=info"
-            - "-mlPipelineServiceTLSEnabled=true"
-            - "-mlPipelineServiceTLSCert=/etc/pki/tls/certs/tls.crt"
-            - "-namespace=kubeflow"
+      - name: ml-pipeline-scheduledworkflow
           volumeMounts:
           - name: tls-certs
             mountPath: /etc/pki/tls/certs

@alyssacgoins alyssacgoins force-pushed the carry-cert-commits branch 5 times, most recently from 03236e1 to d2cb472 Compare September 26, 2025 01:46
Update from utilizing config added via Dockerfile, to match TLS config formatting.

Signed-off-by: agoins <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants