KaniniPro

  • spark

    Demystifying Apache Spark: Jobs, Stages, and Tasks

    Published by

    Arulraj Gopal

    on

    August 24, 2025

    The Spark driver is a JVM process that runs the Spark application's main program and coordinates jobs, stages, and tasks. A Spark executor is a JVM process launched on a worker node that executes tasks. A job can contain multiple stages, and each stage can in turn contain multiple tasks. But how does…

  • spark

    How Spark Saves Time & Cost by Being Lazy

    Published by

    Arulraj Gopal

    on

    August 16, 2025

    Understanding Spark internals is important because it directly impacts how effectively you can utilize Spark for performance, scalability, and cost efficiency. One key aspect to note in Spark is the concept of lazy evaluation. Before diving into the main topic, let’s first take a quick look at what actions and…
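
As a rough illustration of lazy evaluation, the plain-Python sketch below records transformations and runs them only when an action is called. It is a toy model, not Spark's API: the `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical names chosen to mirror the idea.

```python
class LazyPipeline:
    """Toy model of lazy evaluation (hypothetical, not Spark's API)."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: only record the step and return a new pipeline.
        return LazyPipeline(self.data, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyPipeline(self.data, self.steps + (("filter", pred),))

    def collect(self):
        # Action: run every recorded step now, in order.
        rows = iter(self.data)
        for kind, fn in self.steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records when `double` actually runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: no work has been done
result = pipeline.collect()        # the action triggers the recorded steps
print(calls_before_action, result)  # [] [6, 8]
```

Because nothing runs until `collect()`, the pipeline has the whole chain of steps in hand before doing any work, which is what gives an engine room to plan the execution.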

  • spark

    Spark Performance Pitfalls: The Hidden Cost of Misplaced Unions

    Published by

    Arulraj Gopal

    on

    August 10, 2025

    Using the union operation in the wrong place can be costly in Spark. For example, if you read data from a source, process it, then split it into two DataFrames based on certain filter conditions, perform additional operations on each DataFrame separately, and finally union them back together, Spark will…

  • ETL

    Handling Orphan Data

    Published by

    Arulraj Gopal

    on

    August 2, 2025

    Data integrity and consistency are the foundation of data reliability. Without them, every model, report, or insight becomes a guess at best — and a risk at worst. Orphan records are a commonly encountered issue in data management, and effectively handling them is crucial to maintaining high-quality, usable datasets. Unlike…
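
Assuming the common definition of an orphan record (a child row whose foreign key has no matching parent row), a minimal plain-Python sketch of detecting orphans might look like this. The `customers` and `orders` data and the `find_orphans` helper are hypothetical, introduced only for illustration.

```python
def find_orphans(children, parents, fk, pk):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parents}
    return [row for row in children if row[fk] not in parent_keys]


# Hypothetical parent and child tables.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # no customer 3, so this is an orphan
]

orphans = find_orphans(orders, customers, fk="customer_id", pk="customer_id")
print(orphans)  # [{'order_id': 11, 'customer_id': 3}]
```

In Spark, the same check is usually expressed as a left anti join of the child DataFrame against the parent.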

  • ETL

    Parameterize ADF linked service for multiuse

    Published by

    Arulraj Gopal

    on

    July 26, 2025

    Parameterization is an important best practice in data pipeline design, as in any system design, because it keeps the code clean and easy to maintain. Parameterizing linked services in Azure Data Factory is crucial for reusability and flexibility across different environments or functions. It enables dynamic connections by allowing values…

  • spark

    Unique key in spark DataFrame

    Published by

    Arulraj Gopal

    on

    July 21, 2025

    Creating a unique key within a data pipeline is essential for reliably identifying individual records, especially in scenarios where the source dataset lacks a natural primary key and where record traceability is required in later stages of processing. In distributed processing frameworks like Apache Spark, which operate in-memory and leverage…

  • spark

    Different ways of removing duplicates in spark

    Published by

    Arulraj Gopal

    on

    July 13, 2025

    Removing duplicates is essential in any data processing system, and like other systems, Spark has some good ways to get rid of them. We will look at the different ways of removing duplicates in Spark and where each one applies. Distinct & Drop duplicates. Distinct and drop duplicates are the most common ways and…

  • spark

    Union vs UnionAll in spark

    Published by

    Arulraj Gopal

    on

    July 5, 2025

    Unlike in traditional SQL databases, the difference between union and unionAll in Spark is unusual and not very intuitive. In the exercise below, two DataFrames are created with some duplicate values. In any traditional database, union removes the duplicates across both datasets (i.e., tables) and returns only unique…
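
To make the contrast concrete, here is a plain-Python sketch of the two SQL semantics, using lists of tuples in place of tables. The helper names `sql_union_all` and `sql_union` and the sample rows are hypothetical. In Spark, both `DataFrame.union` and `DataFrame.unionAll` behave like SQL `UNION ALL`, so a SQL-style `UNION` needs an explicit `distinct()` afterwards.

```python
def sql_union_all(left, right):
    """UNION ALL semantics: keep every row, duplicates included."""
    return left + right


def sql_union(left, right):
    """UNION semantics: keep each distinct row once, in first-seen order."""
    return list(dict.fromkeys(left + right))


# Two small "tables" with overlapping rows (hypothetical data).
df1 = [(1, "a"), (2, "b"), (2, "b")]
df2 = [(2, "b"), (3, "c")]

all_rows = sql_union_all(df1, df2)  # what Spark's union/unionAll would keep
unique_rows = sql_union(df1, df2)   # what union followed by distinct() keeps

print(len(all_rows))  # 5
print(unique_rows)    # [(1, 'a'), (2, 'b'), (3, 'c')]
```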

  • Data Projects

    Stock Price Streaming using Apache Kafka

    Published by

    Arulraj Gopal

    on

    July 17, 2024

    In today’s fast-paced and highly volatile financial markets, having access to real-time stock quotes is crucial for making informed and precise decisions. Traditional methods of obtaining stock quotes often involve delays, which can lead to missed opportunities. This project aims to develop a robust and scalable pipeline that ingests live…

  • Basics of Computers

    What is Machine Language?

    Published by

    Arulraj Gopal

    on

    July 12, 2024

    Is it something that machines speak? Are C, C++, and Java machine languages? The language that a machine understands is machine language. So, what does a machine understand? Obviously, 0 and 1. Machines understand only digital values. So, if we need to interact with a machine, the only way to communicate is…
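
As a small illustration that everything a machine stores ultimately reduces to 0s and 1s, this Python sketch prints the bit pattern of each byte in a short text. It shows a data encoding (UTF-8), not actual processor instructions, and the `to_bits` helper is a hypothetical name.

```python
def to_bits(text):
    """Return the UTF-8 bytes of `text` as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


bits = to_bits("Hi")
print(bits)  # 01001000 01101001
```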


