Union vs UnionAll in Spark

Unlike in traditional SQL databases, the difference between union and unionAll in Spark's DataFrame API is not very intuitive.

Below is the exercise.

Two DataFrames are created, with some values duplicated between them.

In a traditional database, UNION removes duplicates from the combined datasets (i.e. tables) and returns only unique rows, which makes it the costlier operation, whereas UNION ALL simply combines both datasets and returns all records, including any duplicates.

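Spark's DataFrame API doesn't work that way: both union and unionAll behave like SQL's UNION ALL and keep every duplicate row. This can lead to a cost impact, and to unexpected duplicates in the output, if it is not considered during development.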

So, to remove the duplicates, it is necessary to call distinct() or dropDuplicates() after the union.

Keep learning!!! Happy Engineering!!!

KaniniPro
