# Update the lakeflow-pipelines template according to the latest Lakeflow conventions #3558
Base: `main`. Changes from all commits: 181627d, a7f1447, a7ff123, ea97b0c, d3acf61, 75bd17f, d708592, fbb1617, e588d1f.
**`.vscode/extensions.json`** (path inferred from the file contents):

```diff
@@ -1,7 +1,7 @@
 {
   "recommendations": [
     "databricks.databricks",
     "ms-python.vscode-pylance",
-    "redhat.vscode-yaml"
+    "redhat.vscode-yaml",
+    "ms-python.black-formatter"
   ]
 }
```
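Since the template now recommends the Black formatter extension, a workspace would typically also select it as the Python default formatter. A minimal sketch of the companion `settings.json` entry, assuming format-on-save is wanted (this file is not part of the diff):

```jsonc
{
  "[python]": {
    // Use the recommended Black extension for Python files
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true
  }
}
```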
A new Python utilities module (path not captured in the diff view):

```diff
@@ -0,0 +1,7 @@
+from databricks.sdk.runtime import spark
+from pyspark.sql import DataFrame
+
+
+def find_all_taxis() -> DataFrame:
+    """Find all taxi data."""
+    return spark.read.table("samples.nyctaxi.trips")
```
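The helper returns a plain `DataFrame`, so a transformation can materialize it as a pipeline table. A minimal usage sketch; the module path `utilities.taxis` is an assumption, not shown in the diff:

```python
from pyspark import pipelines as dp

from utilities.taxis import find_all_taxis  # module path assumed


@dp.table
def all_taxis():
    """Materialize the sample taxi data as a pipeline table."""
    return find_all_taxis()
```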
`.gitignore`:

```diff
@@ -4,5 +4,7 @@ dist/
 __pycache__/
 *.egg-info
 .venv/
 scratch/**
 !scratch/README.md
+**/explorations/**
+**/!explorations/README.md
```
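One thing worth double-checking in the added lines: in gitignore syntax a negation must begin the pattern, so `**/!explorations/README.md` would be read as a literal path rather than an exception. Assuming the intent mirrors the existing `scratch/` pair (ignore everything under `explorations/` except its README), the conventional form is:

```gitignore
**/explorations/**
!**/explorations/README.md
```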
A new `README.md` for the resources folder (path inferred):

```diff
@@ -0,0 +1 @@
+This folder is reserved for Databricks Asset Bundles resource definitions.
```
The pipeline resource definition (path inferred):

```diff
@@ -1,12 +1,15 @@
 # The main pipeline for my_lakeflow_pipelines

 resources:
   pipelines:
-    {{template `pipeline_name` .}}:
-      name: {{template `pipeline_name` .}}
-      serverless: true
-      channel: "PREVIEW"
+    lakeflow_pipelines_etl:
+      name: lakeflow_pipelines_etl
+      ## Catalog is required for serverless compute
+      catalog: ${var.catalog}
+      schema: ${var.schema}
+      serverless: true
+      root_path: "."

       libraries:
         - glob:
             include: transformations/**
```

> **Review comment:** Supporting […] We are planning on using […] Maybe we need to make the CLI aware of the special conventions in this template? @pietern @andrewnester
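The new definition references `${var.catalog}` and `${var.schema}`. A minimal sketch of the variable declarations this presupposes in `databricks.yml` (variable names taken from the diff; the descriptions and defaults are hypothetical):

```yaml
variables:
  catalog:
    description: The catalog the pipeline writes to.  # hypothetical
    default: main
  schema:
    description: The schema the pipeline writes to.  # hypothetical
    default: lakeflow_pipelines
```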
The job resource definition (path inferred):

```diff
@@ -1,19 +1,21 @@
-# The job that triggers my_lakeflow_pipelines_pipeline.
+# The job that triggers lakeflow_pipelines_etl.

 resources:
   jobs:
-    my_lakeflow_pipelines_job:
-      name: my_lakeflow_pipelines_job
+    lakeflow_pipelines_job:
+      name: lakeflow_pipelines_job

       trigger:
         # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
         periodic:
           interval: 1
           unit: DAYS

-      email_notifications:
-        on_failure: ${var.notifications}
+      #email_notifications:
+      #  on_failure:
+      #    - […]

       tasks:
         - task_key: refresh_pipeline
           pipeline_task:
-            pipeline_id: ${resources.pipelines.my_lakeflow_pipelines_pipeline.id}
+            pipeline_id: ${resources.pipelines.lakeflow_pipelines_etl.id}
```

> **Review comment:** Is the commented code intentional?
>
> **Reply:** Yeah, for […]
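For reference, enabling notifications again just means uncommenting the block and supplying a concrete recipient list. A sketch with a hypothetical address:

```yaml
email_notifications:
  on_failure:
    - ops@example.com  # hypothetical address; replace with a real recipient
```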
The sample transformation (path inferred from the function name):

```diff
@@ -1,13 +1,12 @@
-import dlt
+from pyspark import pipelines as dp
 from pyspark.sql.functions import col
-from utilities import utils


 # This file defines a sample transformation.
 # Edit the sample below or add new transformations
 # using "+ Add" in the file browser.


-@dlt.table
+@dp.table
 def sample_trips_my_lakeflow_pipelines():
-    return spark.read.table("samples.nyctaxi.trips").withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
+    return spark.read.table("samples.nyctaxi.trips")
```
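The transformation references a bare `spark`, which is injected at pipeline runtime; for editor resolution one can import it explicitly, mirroring the new utilities module above. A sketch (the explicit import is not part of this diff):

```python
from databricks.sdk.runtime import spark  # explicit import so editors resolve `spark`
from pyspark import pipelines as dp


@dp.table
def sample_trips_my_lakeflow_pipelines():
    return spark.read.table("samples.nyctaxi.trips")
```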
Three files were deleted.
> **Review comment:** This implies that `lakeflow_pipelines_etl.transformations` is importable. We should make sure that this is not the case and produces squigglies in the editor, because it won't be importable from the real pipeline either.
>
> **Reply:** The `"resources"` entry makes it so that all packages under `"resources"` can indeed be resolved for imports. This is important for imports across pipeline packages, e.g. the utilities package. I don't see any way to do that since IDEs like VS Code assume a single compilation unit in a project. We have more than one. And I don't think it's a good option to simply [break] imports in pipelines.
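To make the debate concrete, this is the kind of cross-package import at stake. The `distance_km` call is taken from the code removed in this PR; whether `utilities` resolves in the editor, and whether it should, is exactly what the thread is discussing:

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Resolvable in the editor only if the packages under "resources"
# are on the import path, per the reply above.
from utilities import utils


@dp.table
def trips_with_km():
    # Same logic as the transformation removed in this PR
    return (
        spark.read.table("samples.nyctaxi.trips")
        .withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
    )
```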