Unlike traditional structured query databases, the difference between union and unionAll in Spark is unusual and not very intuitive.
Below is the exercise,
Two dataframes created with some of duplicate values.

Ideally, in any traditional database union removes the duplicates from both the dataset (ie table) and returns only unique values which is considered as costlier, whereas unionAll just combines both the dataset and returns all the records with duplicates if any.
But for some reason, spark doesn’t work in that way, and it can lead to cost impact if the development not considered this.

So, it is necessary to perform distinct or drop duplicates method after union, in order to remove the duplicates.

Keep learning !!! Happy Engineering!!!
Leave a comment