Databricks Spark Declarative Pipelines go beyond simplifying pipeline maintenance—they also address data quality, which is paramount for any data application.
Using expectations, you can define data quality checks that are applied to every record flowing through the pipeline. These checks are typically standard conditions, similar to what you would write in SQL WHERE or HAVING clauses.
Additionally, multiple expectations can be defined on the same dataset, allowing you to enforce several data quality rules in a clean and declarative way.
Types of expectations
- Warn – invalid records are kept and passed downstream; the violation is only recorded in the pipeline's data quality metrics (default option)
- Fail – when an invalid record is found, the pipeline update fails
- Drop – drops the invalid records, so only valid records are sent downstream

How to set rules in SDP, and the different options for setting them
Defining rules in a Python dictionary:
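As a minimal sketch (the import alias and rule names below are my own assumptions, not taken from the article's source code), the rules can be kept in a plain dictionary that maps an expectation name to a SQL boolean condition:

```python
# Assumed import alias for Spark Declarative Pipelines; the same decorators
# are also exposed through the dlt module on Databricks.
from pyspark import pipelines as dp

# Hypothetical rules: each key is the expectation name, each value is a
# SQL boolean condition evaluated against every record.
order_rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "order_amount > 0",
}
```

The `*_all` variants of the decorators accept such a dictionary of several rules at once, while `dp.expect`, `dp.expect_or_drop`, and `dp.expect_or_fail` take a single name and condition.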

Python syntax to call in a notebook:

| Decorator | Behavior |
| --- | --- |
| dp.expect / dp.expect_all | Warn: record violations in metrics, keep invalid records |
| dp.expect_or_drop / dp.expect_all_or_drop | Drop invalid records |
| dp.expect_or_fail / dp.expect_all_or_fail | Fail the pipeline, so manual intervention is required |
Let’s try it. In this article, I experiment with the different types of expectations.
Source code link

Demoing warn & drop
Below is the pipeline I developed, where incoming raw data is validated and written into two separate tables.
The order_cleaned table retains only valid (good) records, dropping all invalid ones.
In contrast, order_full keeps all records and only reports how many invalid records were encountered as warnings.
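A minimal sketch of such a pipeline could look like the following, assuming a raw source table called orders_raw and the rule dictionary from above (the table and column names are placeholders, not the article's actual source code):

```python
from pyspark import pipelines as dp  # assumed import alias

order_rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "order_amount > 0",
}

# Drop: only records satisfying every rule are written to order_cleaned.
@dp.table(name="order_cleaned")
@dp.expect_all_or_drop(order_rules)
def order_cleaned():
    return spark.read.table("orders_raw")  # spark is provided by the pipeline runtime

# Warn (default): all records are written to order_full; violations are
# only counted and surfaced in the pipeline's data quality metrics.
@dp.table(name="order_full")
@dp.expect_all(order_rules)
def order_full():
    return spark.read.table("orders_raw")
```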


Demoing Fail due to invalid records
The same implementation, but with a stricter expectation mode that fails the pipeline when violated.
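A minimal sketch of this stricter variant (again with placeholder names and the same assumed import alias) simply swaps the decorator for the `*_or_fail` flavor:

```python
from pyspark import pipelines as dp  # assumed import alias

# Fail: a single violating record aborts the pipeline update.
@dp.table(name="order_validated")
@dp.expect_or_fail("positive_amount", "order_amount > 0")
def order_validated():
    return spark.read.table("orders_raw")  # spark is provided by the pipeline runtime
```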

During execution in SDP, the pipeline failed because the data did not meet the defined expectations. In this scenario, the failure is intentional and acts as a hard guardrail.
Manual intervention is required to understand why the expectation was violated and to perform further analysis and debugging. This represents a use case where no invalid data is ever allowed—invalid records are neither expected nor tolerated, and receiving or producing such data is treated as a critical failure.

I will further explore the expectations capability and how it can make the process more stable and reconciled.
Until then!!!
Conclusion
Spark Declarative Pipelines make data quality a first-class concern by enforcing expectations directly in the pipeline, not as an afterthought. Whether you choose to warn, drop, or fail, SDP gives you clear, declarative control to balance data reliability with operational flexibility.
References
https://learn.microsoft.com/en-us/azure/databricks/ldp/expectations
https://learn.microsoft.com/en-us/azure/databricks/ldp/expectation-patterns