KaniniPro

Databricks

Incremental load (SCD 1 & 2) with Spark declarative pipelines

Published by Arulraj Gopal on December 28, 2025

Incremental load is an efficient approach for moving data into downstream systems by ensuring that only the changes between the previous run and the current run are processed. However, setting this up is not trivial. There are multiple proven strategies—such as batch-based processing using watermarks to track progress, or streaming…
Databricks

Getting started with Databricks SDP

Published by Arulraj Gopal on December 22, 2025

Spark Declarative Pipelines are one of the flagship capabilities of Databricks, enabling data engineers to focus purely on business logic while abstracting away infrastructure concerns such as cluster provisioning and management. In this article, we will explore how to get started with Spark Declarative Pipelines using Databricks. Prerequisite –…
Databricks

Tracking Table and Column Lineage in Databricks Unity Catalog

Published by Arulraj Gopal on December 14, 2025

Data governance is one of the most integral parts of any data project, and data lineage plays a key role in understanding and tracking the true source of data. What is data lineage? Data lineage provides end-to-end visibility of how data moves across systems—from its origin, through every transformation, to…
Databricks

Azure Databricks setup with Unity Catalog

Published by Arulraj Gopal on December 8, 2025

Once an organization decides to adopt Databricks, the next critical responsibility is setting it up correctly and maintaining it effectively. Databricks is not a static platform — it offers multiple features, deployment models, and constantly evolving capabilities. Because of this, teams must understand both Databricks best practices and the specific…
Databricks

Deploying Lakeflow Jobs with Databricks Asset Bundles

Published by Arulraj Gopal on November 30, 2025

Databricks Lakeflow Jobs provide a powerful way to orchestrate notebooks and data processes directly inside Databricks without relying on external orchestration tools like Azure Data Factory, Airflow, or Dagster. A key requirement for modern data engineering is keeping job definitions as code and deploying them consistently across environments. This is…
Databricks

Databricks CLI Explained: The Power of Automation Beyond the UI

Published by Arulraj Gopal on November 24, 2025

Databricks provides a rich user interface that makes it easy to interact with notebooks, jobs, clusters, and data objects. But as your platform grows, teams mature, and automation becomes a requirement, the Databricks Command Line Interface (CLI) becomes an indispensable tool. In this blog, we’ll explore what the Databricks CLI…
Databricks

Key Practices That Make Databricks DE Life Easy

Published by Arulraj Gopal on November 16, 2025

Focusing on performance is important, but that doesn’t mean a data team comes cheap. As requirements grow more complex, you need skilled data engineers, and that naturally increases cost. One of the most effective ways to reduce that cost is to keep your code simple. Databricks gives us several built-in features…
delta-lake, spark

Clustering by Z-order demystified

Published by Arulraj Gopal on November 9, 2025

Clustering is one of the best-known techniques in big data systems, especially in lakehouse architecture. It is a data layout optimization that arranges data on disk so that, when querying, instead of reading all files in the lakehouse, only a limited number of files are read using file metadata stats…
spark

Spark Bucketing Demystified

Published by Arulraj Gopal on November 2, 2025

When working with massive data, shuffling is one of the costliest operations, and engineers make every effort to reduce it as much as possible. This is achieved through two data-skipping pillars: partitioning and clustering. Clustering is preferred for high-cardinality field searches, and lakehouse architectures like Delta Lake, Iceberg, and Hudi…
Databricks, delta-lake, spark

Clustering options in delta-lake

Published by Arulraj Gopal on October 25, 2025

When working with large-scale data in a lakehouse architecture, performance matters — a lot. One of the most effective ways to boost query performance is through a concept called data skipping. In simple terms, data skipping helps the query engine avoid scanning unnecessary data. Instead of reading every record, it…

