Update the lakeflow-pipelines template according to the latest Lakeflow conventions #3558
Conversation
@@ -1,57 +1,96 @@
{
  "welcome_message": "\nWelcome to the template for Lakeflow Declarative Pipelines!",
  "//": "This template is based on an upcoming version of the default-python template, but comes with a SQL variation + certain defaults that are specific to Lakeflow Pipelines.",
💡 note that this databricks_template_schema has several "skipped" questions so it follows the same structure as the existing pipelines-default template
@@ -0,0 +1,17 @@
{{/* The latest LTS DBR version; this should be updated a few months after each LTS.
💡 these changes are already in default-python
@@ -1,33 +0,0 @@
{{- define `pipeline_name` -}}
💡 note that `default-python` never used templates like this (mostly because of the syntactic noise they add and because they don't work in filenames)
notebook_task:
  notebook_path: sample_notebook.ipynb
{{- if $serverless}}
  environment_key: default
💡 note that this fixes an issue in existing templates where notebook tasks didn't get an `environment_key` if there was no Python package. We always need an environment to be able to pick an environment version.
@@ -0,0 +1,93 @@
"""This file configures pytest.
💡 this file is already in `default-python` and is based on #3254.
    return DatabricksSession.builder.getOrCreate()

@pytest.fixture()
def load_fixture(spark: SparkSession):
💡 this function is new and is referenced from `fixtures.gitkeep.tmpl`. It makes it easy to use fixtures in tests:
Add JSON or CSV files here. In tests, use them with `load_fixture()`:

    def test_using_fixture(load_fixture):
        data = load_fixture("my_data.json")
        assert len(data) >= 1
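For reference, here is a minimal sketch of what such a `load_fixture` implementation could look like. This is an assumption rather than the exact template code; only the `fixtures/` directory name comes from `fixtures.gitkeep.tmpl`:

```python
from pathlib import Path

import pytest
from pyspark.sql import SparkSession


@pytest.fixture()
def load_fixture(spark: SparkSession):
    """Return a loader that reads a JSON or CSV file from the fixtures/ directory."""

    def _load(name: str):
        # Hypothetical layout: fixture files live next to this conftest.py.
        path = Path(__file__).parent / "fixtures" / name
        if name.endswith(".json"):
            df = spark.read.format("json").load(str(path))
        else:
            df = spark.read.format("csv").option("header", True).load(str(path))
        # Collect to a list of rows so tests can write len(data), as in the
        # example above; the real template may return the DataFrame instead.
        return df.collect()

    return _load
```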
"[python]": { | ||
"editor.defaultFormatter": "ms-python.black-formatter", | ||
"editor.defaultFormatter": "ms-python.python", |
is this formatter compatible with `black`? We were using black before because that's what's being used in the workspace. Using different formatters can cause thrashing.
It should use PEP8. Cursor and my AI told me that the setting above is deprecated and I need to use what it says here. But I'll double check or just take this minor change out of scope.
I reverted this change and added `black-formatter` to the list of recommended extensions.
trigger:
  # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
  periodic:
    interval: 1
    unit: DAYS

email_notifications:
  on_failure: ${var.notifications}
#email_notifications:
Is the commented code intentional?
Yeah, for `default-python` I used this approach where there are never emails by default (since they were not effective), and there is no special variable for it (since variables come with a cost).
@@ -1,57 +1,96 @@
{
  "welcome_message": "\nWelcome to the template for Lakeflow Declarative Pipelines!",
  "//": "This template is based on an upcoming version of the default-python template, but comes with a SQL variation + certain defaults that are specific to Lakeflow Pipelines.",
Note to self: We have to test the new template schema with the workspace creation UI to see if everything still makes sense and looks OK.
cc @ilyakuz-db
"default": "yes", | ||
"enum": [ | ||
"yes", | ||
"no (advanced: I will customize the schema configuration later in databricks.yml)" |
why don't we ask for the schema here?
In the scenario where users don't select the per-user schema, they need to decide what they want to do instead. Things get complicated:
- Some customers want one catalog per user
- Some customers want some kind of schema-mangling approach
- Some customers might be okay with just a single hardcoded schema for dev and a single hardcoded schema for prod, but this seems so limiting that it seems better to use `default` and ask customers to configure things themselves if they don't like this.
- Some customers may do yet other things.
I can't test this end-to-end yet because of the missing `pyproject.toml`.
{{skip "{{.project_name}}/resources/{{.project_name}}_pipeline/transformations/sample_zones_{{.project_name}}.py"}} | ||
{{skip "{{.project_name}}/resources/{{.project_name}}_pipeline/transformations/sample_trips_{{.project_name}}.py"}} | ||
{{if not $python_package}} | ||
{{skip "{{.project_name}}/pyproject.toml"}} |
There is no `pyproject.toml` in the output, even if you selected `python`.
In the `pipelines-default` pipeline we never had the question asking about the Python package, so there is no `pyproject.toml` in that case. If you already want to try this before we merge things into `default-python`, you can play with `databricks_template_schema.vnext.json`.
What is the `vnext` for? This is currently untested.
This is indeed not used right now; it's just a reviewable reference for what the `template_schema` could look like for `default-python`. See also the `"//"` comment at the top. Compared to the one for `default-pipelines`, it doesn't disable any of the questions and it has minor wording changes.
"properties": { | ||
"lakeflow_only": { |
What is the intent? Use from the UI only?
Yeah, exactly, this is just for retrofitting the `default-python` template to be suitable as a replacement for `lakeflow-pipelines`.
// Pylance settings (VS Code)
// Set typeCheckingMode to "basic" to enable type checking!
"python.analysis.typeCheckingMode": "off",
"python.analysis.extraPaths": ["src", "lib", "resources"],
This implies that `lakeflow_pipelines_etl.transformations` is importable. We should make sure that this is not the case and that it produces squigglies in the editor, because it won't be importable from the real pipeline either.
The "resources"
entry makes it so that all packages under "resources"
can indeed be resolved for imports. This is important for imports across pipeline packags, e.g. the utilities package.
We should make sure that this is not the case and produces squigglies in the editor, because it won't be importable from the real pipeline either.
I don't see any way to do that since IDEs like VS Code assume a single compilation unit in a project. We have more than one. And I don't think it's a good option to simply imports in pipelines.
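To make the behavior concrete: the `extraPaths` entry is roughly the static-analysis counterpart of extending `sys.path`. A rough, self-contained sketch of the analogy (paths and package names here are hypothetical, not template code):

```python
import sys
from pathlib import Path

# Roughly what "python.analysis.extraPaths": ["src", "lib", "resources"]
# tells Pylance: treat these folders as import roots, mirroring how the
# pipeline runtime makes sibling packages importable.
project_root = Path(__file__).parent  # hypothetical project root
for folder in ("src", "lib", "resources"):
    sys.path.insert(0, str(project_root / folder))

# With resources/ on the path, a hypothetical shared package such as
# resources/utilities/ becomes importable from any pipeline package:
#   from utilities.cleaning import clean_zone_names
```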
@@ -0,0 +1,12 @@
from pyspark.sql.functions import col, when
Utilities should only be produced in the workspace (if at all).
I removed these.
This adds a new version of the `default-python` template. To enable the first integration and evaluation of this template, this first pull request replaces the internal `lakeflow-pipelines` template. Replacing `default-python` will be the next milestone.

Changes
- Updated `default-python` to follow the Lakeflow conventions (per-pipeline sources go into `resources/`; shared sources go into `lib/`)
- Replaced the `lakeflow-pipelines` template (this required disabling some of the template questions and components)

Why
Lakeflow comes with new conventions that emphasize modularity. This PR represents an initial milestone towards updating the DABs `default-python` template. `default-sql` will follow in a later milestone.

Tests
Existing tests were updated.