@@ -1,12 +1,16 @@

>>> [CLI] bundle init lakeflow-pipelines --config-file ./input.json --output-dir output

Welcome to the template for Lakeflow Declarative Pipelines!

Please answer the below to tailor your project to your preferences.
You can always change your mind and change your configuration in the databricks.yml file later.

Note that [DATABRICKS_URL] is used for initialization
(see https://docs.databricks.com/dev-tools/cli/profiles.html for how to change your profile).

Your new project has been created in the 'my_lakeflow_pipelines' directory!
Your new project has been created in the 'my_lakeflow_pipelines' directory!

Refer to the README.md file for "getting started" instructions!
Please refer to the README.md file for "getting started" instructions.

>>> [CLI] bundle validate -t dev
Name: my_lakeflow_pipelines
@@ -1,7 +1,7 @@
{
"recommendations": [
"databricks.databricks",
"ms-python.vscode-pylance",
"redhat.vscode-yaml"
"redhat.vscode-yaml",
"ms-python.black-formatter"
]
}
@@ -1,19 +1,37 @@
{
"python.analysis.stubPath": ".vscode",
"databricks.python.envFile": "${workspaceFolder}/.env",
"jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
"jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
"python.testing.pytestArgs": [
"."
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"python.analysis.extraPaths": ["resources/my_lakeflow_pipelines_pipeline"],
"files.exclude": {
"**/*.egg-info": true,
"**/__pycache__": true,
".pytest_cache": true,
"dist": true,
},
"files.associations": {
"**/.gitkeep": "markdown"
},

// Pylance settings (VS Code)
// Set typeCheckingMode to "basic" to enable type checking!
"python.analysis.typeCheckingMode": "off",
"python.analysis.extraPaths": ["src", "lib", "resources"],
Contributor

This implies that lakeflow_pipelines_etl.transformations is importable.

We should make sure that this is not the case and produces squigglies in the editor, because it won't be importable from the real pipeline either.

Contributor Author

The "resources" entry makes it so that all packages under "resources" can indeed be resolved for imports. This is important for imports across pipeline packags, e.g. the utilities package.

> We should make sure that this is not the case and produces squigglies in the editor, because it won't be importable from the real pipeline either.

I don't see any way to do that since IDEs like VS Code assume a single compilation unit in a project. We have more than one. And I don't think it's a good option to simply disallow imports in pipelines.
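To make the effect concrete, here is a minimal sketch (not part of this PR; the file and module paths are purely illustrative): with "resources" on extraPaths, Pylance treats the resources/ folder as an import root, so a cross-pipeline import like the one below resolves in the editor, while whether it works at pipeline runtime depends on the pipeline's own sys.path.

```python
# Hypothetical file: resources/another_pipeline/transformations/enriched_trips.py
# With "python.analysis.extraPaths": ["src", "lib", "resources"], the editor
# resolves packages rooted at resources/, so this import shows no squigglies:
from lakeflow_pipelines_etl.utilities import utils  # illustrative module path

# Whether this import also resolves when the pipeline actually runs depends on
# the pipeline's sys.path, which is the concern raised in this thread.
```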

"python.analysis.diagnosticMode": "workspace",
"python.analysis.stubPath": ".vscode",

// Pyright settings (Cursor)
// Set typeCheckingMode to "basic" to enable type checking!
"cursorpyright.analysis.typeCheckingMode": "off",
"cursorpyright.analysis.extraPaths": ["src", "lib", "resources"],
"cursorpyright.analysis.diagnosticMode": "workspace",
"cursorpyright.analysis.stubPath": ".vscode",

// General Python settings
"python.defaultInterpreterPath": "./.venv/bin/python",
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.formatOnSave": true,
@@ -2,38 +2,53 @@

The 'my_lakeflow_pipelines' project was generated by using the Lakeflow Pipelines template.

## Setup
* `lib/`: Python source code for this project.
* `lib/shared`: Shared source code across all jobs/pipelines/etc.
* `resources/lakeflow_pipelines_etl`: Pipeline code and assets for the lakeflow_pipelines_etl pipeline.
* `resources/`: Resource configurations (jobs, pipelines, etc.)

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
## Getting started

2. Authenticate to your Databricks workspace, if you have not done so already:
```
$ databricks auth login
```
Choose how you want to work on this project:

(a) Directly in your Databricks workspace, see
https://docs.databricks.com/dev-tools/bundles/workspace.

3. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
https://docs.databricks.com/dev-tools/vscode-ext.html. Or the PyCharm plugin from
https://www.databricks.com/blog/announcing-pycharm-integration-databricks.
(b) Locally with an IDE like Cursor or VS Code, see
https://docs.databricks.com/vscode-ext.

(c) With command line tools, see https://docs.databricks.com/dev-tools/cli/databricks-cli.html

## Deploying resources
# Using this project with the CLI

1. To deploy a development copy of this project, type:
The Databricks workspace and IDE extensions provide a graphical interface for working
with this project. It's also possible to interact with it directly using the CLI:

1. Authenticate to your Databricks workspace, if you have not done so already:
```
$ databricks configure
```

2. To deploy a development copy of this project, type:
```
$ databricks bundle deploy --target dev
```
(Note that "dev" is the default target, so the `--target` parameter
is optional here.)

2. Similarly, to deploy a production copy, type:
```
$ databricks bundle deploy --target prod
```
This deploys everything that's defined for this project.
For example, the default template would deploy a pipeline called
`[dev yourname] lakeflow_pipelines_etl` to your workspace.
You can find that resource by opening your workspace and clicking on **Jobs & Pipelines**.

3. Use the "summary" comand to review everything that was deployed:
3. Similarly, to deploy a production copy, type:
```
$ databricks bundle summary
$ databricks bundle deploy --target prod
```
Note that the default template includes a job that runs the pipeline every day
(defined in resources/lakeflow_pipelines_etl/lakeflow_pipelines_job.job.yml). The schedule
is paused when deploying in development mode (see
https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).

4. To run a job or pipeline, use the "run" command:
```
@@ -14,8 +14,6 @@ variables:
description: The catalog to use
schema:
description: The schema to use
notifications:
description: The email addresses to use for failure notifications

targets:
dev:
@@ -30,18 +28,15 @@ targets:
variables:
catalog: main
schema: ${workspace.current_user.short_name}
notifications: []

prod:
mode: production
workspace:
host: [DATABRICKS_URL]
# We explicitly deploy to /Workspace/Users/[USERNAME] to make sure we only have a single copy.
root_path: /Workspace/Users/[USERNAME]/.bundle/${bundle.name}/${bundle.target}
variables:
catalog: main
schema: prod
permissions:
- user_name: [USERNAME]
level: CAN_MANAGE
variables:
catalog: main
schema: default
notifications: [[USERNAME]]
@@ -0,0 +1,7 @@
from databricks.sdk.runtime import spark
from pyspark.sql import DataFrame


def find_all_taxis() -> DataFrame:
"""Find all taxi data."""
return spark.read.table("samples.nyctaxi.trips")
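As a usage sketch (hypothetical, not part of this diff; the import path and table name are assumptions), a transformation written in the style used elsewhere in this template could build on this helper:

```python
from pyspark import pipelines as dp

# Assumes find_all_taxis() lives in a shared module that pipeline code can
# import, e.g. somewhere under lib/ (the exact path is illustrative).
from shared.taxis import find_all_taxis


@dp.table
def taxis_over_ten_miles():
    # Materialize only the longer trips from the shared taxi view.
    return find_all_taxis().filter("trip_distance > 10")
```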
@@ -4,5 +4,7 @@ dist/
__pycache__/
*.egg-info
.venv/
scratch/**
!scratch/README.md
**/explorations/**
!**/explorations/README.md
@@ -0,0 +1 @@
This folder is reserved for Databricks Asset Bundles resource definitions.
@@ -1,11 +1,11 @@
# my_lakeflow_pipelines_pipeline
# my_lakeflow_pipelines

This folder defines all source code for the my_lakeflow_pipelines_pipeline pipeline:
This folder defines all source code for the my_lakeflow_pipelines pipeline:

- `explorations`: Ad-hoc notebooks used to explore the data processed by this pipeline.
- `transformations`: All dataset definitions and transformations.
- `utilities` (optional): Utility functions and Python modules used in this pipeline.
- `data_sources` (optional): View definitions describing the source data for this pipeline.
- `explorations/`: Ad-hoc notebooks used to explore the data processed by this pipeline.
- `transformations/`: All dataset definitions and transformations.
- `utilities/` (optional): Utility functions and Python modules used in this pipeline.
- `data_sources/` (optional): View definitions describing the source data for this pipeline.

## Getting Started

@@ -37,7 +37,7 @@
"source": [
"# !!! Before performing any data analysis, make sure to run the pipeline to materialize the sample datasets. The tables referenced in this notebook depend on that step.\n",
"\n",
"display(spark.sql(\"SELECT * FROM main.[USERNAME].my_lakeflow_pipelines\"))"
"display(spark.sql(\"SELECT * FROM main.[USERNAME].sample_trips_my_lakeflow_pipelines\"))"
]
}
],
@@ -1,12 +1,15 @@
# The main pipeline for my_lakeflow_pipelines

resources:
pipelines:
{{template `pipeline_name` .}}:
name: {{template `pipeline_name` .}}
serverless: true
channel: "PREVIEW"
lakeflow_pipelines_etl:
name: lakeflow_pipelines_etl
## Catalog is required for serverless compute
catalog: ${var.catalog}
schema: ${var.schema}
serverless: true
root_path: "."
Contributor @andersrexdb, Sep 19, 2025

Supporting databricks bundle generate with this root path will be tricky. We can't just fetch the root_path + glob as that will ignore lib/.

We are planning on using generate to support importing existing resources into a DAB in the workspace.

Maybe we need to make the CLI aware of the special conventions in this template? @pietern @andrewnester


libraries:
- glob:
include: transformations/**
@@ -1,19 +1,21 @@
# The job that triggers my_lakeflow_pipelines_pipeline.
# The job that triggers lakeflow_pipelines_etl.

resources:
jobs:
my_lakeflow_pipelines_job:
name: my_lakeflow_pipelines_job
lakeflow_pipelines_job:
name: lakeflow_pipelines_job

trigger:
# Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
periodic:
interval: 1
unit: DAYS

email_notifications:
on_failure: ${var.notifications}
#email_notifications:
Contributor

Is the commented code intentional?

Contributor Author

Yeah, for default-python I used this approach where there are never emails by default (since they were not effective), and there is no special variable for it (since variables come with a cost).

# on_failure:
# - [email protected]

tasks:
- task_key: refresh_pipeline
pipeline_task:
pipeline_id: ${resources.pipelines.my_lakeflow_pipelines_pipeline.id}
pipeline_id: ${resources.pipelines.lakeflow_pipelines_etl.id}
@@ -1,13 +1,12 @@
import dlt
from pyspark import pipelines as dp
from pyspark.sql.functions import col
from utilities import utils


# This file defines a sample transformation.
# Edit the sample below or add new transformations
# using "+ Add" in the file browser.


@dlt.table
@dp.table
def sample_trips_my_lakeflow_pipelines():
return spark.read.table("samples.nyctaxi.trips").withColumn("trip_distance_km", utils.distance_km(col("trip_distance")))
return spark.read.table("samples.nyctaxi.trips")
@@ -1,4 +1,4 @@
import dlt
from pyspark import pipelines as dp
from pyspark.sql.functions import col, sum


@@ -7,7 +7,7 @@
# using "+ Add" in the file browser.


@dlt.table
@dp.table
def sample_zones_my_lakeflow_pipelines():
# Read from the "sample_trips" table, then sum all the fares
return spark.read.table("sample_trips_my_lakeflow_pipelines").groupBy(col("pickup_zip")).agg(sum("fare_amount").alias("total_fare"))
return spark.read.table(f"sample_trips_my_lakeflow_pipelines").groupBy(col("pickup_zip")).agg(sum("fare_amount").alias("total_fare"))

This file was deleted.

This file was deleted.

This file was deleted.

10 changes: 7 additions & 3 deletions acceptance/bundle/templates/lakeflow-pipelines/sql/output.txt
@@ -1,12 +1,16 @@

>>> [CLI] bundle init lakeflow-pipelines --config-file ./input.json --output-dir output

Welcome to the template for Lakeflow Declarative Pipelines!

Please answer the below to tailor your project to your preferences.
You can always change your mind and change your configuration in the databricks.yml file later.

Note that [DATABRICKS_URL] is used for initialization
(see https://docs.databricks.com/dev-tools/cli/profiles.html for how to change your profile).

Your new project has been created in the 'my_lakeflow_pipelines' directory!
Your new project has been created in the 'my_lakeflow_pipelines' directory!

Refer to the README.md file for "getting started" instructions!
Please refer to the README.md file for "getting started" instructions.

>>> [CLI] bundle validate -t dev
Name: my_lakeflow_pipelines
@@ -1,7 +1,7 @@
{
"recommendations": [
"databricks.databricks",
"ms-python.vscode-pylance",
"redhat.vscode-yaml"
"redhat.vscode-yaml",
"ms-python.black-formatter"
]
}