KaniniPro

  • spark

    Different ways of controlling spark partitions

    Published by Arulraj Gopal on September 28, 2025

    In Apache Spark, partitions are the fundamental units of parallelism that determine how data is split and processed across executors. Each partition is handled by a single task, meaning more partitions generally lead to better parallelism and workload distribution. However, too few partitions can underutilize cluster resources, while too many…

    Continue reading →: Different ways of controlling spark partitions
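The partition count caps how many tasks can run at once. A plain-Python sketch (illustration only, not Spark internals) of how rows land in shuffle partitions via hash-mod, showing why the number of partitions bounds parallelism:

```python
# Plain-Python sketch of hash partitioning (illustration only, not Spark code):
# each key is hashed and taken modulo the partition count, so the number of
# partitions caps how many tasks can process the data in parallel.

def assign_partitions(keys, num_partitions):
    """Group keys into partitions the way a hash shuffle would."""
    partitions = {i: [] for i in range(num_partitions)}
    for key in keys:
        partitions[hash(key) % num_partitions].append(key)
    return partitions

keys = list(range(100))
few = assign_partitions(keys, 2)    # only 2 tasks can run at once
many = assign_partitions(keys, 8)   # up to 8 tasks can run at once
print(len(few), len(many))
```

In real Spark, the analogous knobs are `repartition`, `coalesce`, and settings such as `spark.sql.shuffle.partitions`.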
  • spark

    Stop Confusing Spark Partitions with Hive Partitions

    Published by Arulraj Gopal on September 21, 2025

    A common confusion arises between Spark partitions and Hive partitions—these are completely different concepts. Spark partitions are in-memory objects controlled by methods like repartition, coalesce, or configuration settings. Balancing partitions in a DataFrame is essential to ensure the right level of parallelism. Too few partitions can leave the cluster underutilized,…

    Continue reading →: Stop Confusing Spark Partitions with Hive Partitions
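The distinction can be pictured in plain Python (illustration only, not Spark or Hive code): Spark partitions are in-memory chunks of a dataset, while Hive-style partitions are a directory layout on storage keyed by a column value:

```python
# Illustration only: Spark partitions are in-memory chunks of a dataset
# (like repartition(2)), while Hive-style partitions are a directory layout
# on storage keyed by a column value (like write.partitionBy("country")).

rows = [{"id": i, "country": c} for i, c in enumerate(["IN", "US", "IN", "UK"])]

# "Spark partition": split rows into N in-memory chunks
def split_chunks(data, n):
    return [data[i::n] for i in range(n)]

spark_partitions = split_chunks(rows, 2)

# "Hive partition": group rows by a column value into directories
hive_layout = {}
for row in rows:
    hive_layout.setdefault(f"country={row['country']}/", []).append(row)

print(len(spark_partitions))   # number of in-memory chunks
print(sorted(hive_layout))     # directory-per-value layout on disk
```

The two are independent: changing one has no effect on the other.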
  • Parquet File Format Demystified

    Published by Arulraj Gopal on September 13, 2025

    Many people think Parquet is a purely columnar format. That is not entirely true: Parquet is actually a hybrid format that blends the strengths of both row-based and columnar storage. It has become the go-to file format for data engineering. So, what makes it special? Why is it so widely adopted in modern…

    Continue reading →: Parquet File Format Demystified
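The hybrid layout can be sketched in plain Python (illustration only, not the actual Parquet binary format): rows are sliced horizontally into row groups, and inside each row group the data is stored column by column:

```python
# Illustration only: Parquet slices rows horizontally into row groups, and
# inside each row group stores one chunk per column. Readers can skip whole
# row groups and read only the columns they need.

rows = [{"id": i, "name": f"u{i}", "age": 20 + i} for i in range(6)]

def to_parquet_like(rows, row_group_size):
    """Slice rows into row groups, each holding one chunk per column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        groups.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return groups

row_groups = to_parquet_like(rows, row_group_size=3)
# Reading one column touches one list per row group, not every row:
ages = [v for g in row_groups for v in g["age"]]
print(len(row_groups), ages)
```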
  • delta-lake

    Delta Table: Under the Hood

    Published by Arulraj Gopal on September 6, 2025

    In the big data world, data lakes brought several advantages: they are scalable, cost-efficient, and flexible enough to store data in any format, including unstructured and semi-structured. However, data lakes lack the warehouse-like capabilities needed for ACID transactions, performance, and structured analytics. This is where Delta Lake (the lakehouse architecture) comes in.…

    Continue reading →: Delta Table: Under the Hood
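The core mechanism can be sketched in plain Python (illustration only, a simplified model of the Delta transaction log): a Delta table is Parquet data files plus a `_delta_log` of JSON commits, and replaying the add/remove actions in order yields the files in the current snapshot:

```python
# Illustration only: a simplified model of the Delta _delta_log. Replaying
# "add" and "remove" actions in commit order yields the set of data files
# that make up the current table snapshot.

log = [
    {"add": "part-000.parquet"},
    {"add": "part-001.parquet"},
    {"remove": "part-000.parquet"},  # e.g. the file was rewritten by an update
    {"add": "part-002.parquet"},
]

def current_snapshot(log):
    files = set()
    for action in log:
        if "add" in action:
            files.add(action["add"])
        if "remove" in action:
            files.discard(action["remove"])
    return sorted(files)

print(current_snapshot(log))
```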
  • Databricks

    Databricks System Tables

    Published by Arulraj Gopal on August 31, 2025

    Most data engineering professionals have worked with Databricks at some level—whether exploring it casually or using it deeply in production. One of the most essential aspects of managing Databricks is understanding and monitoring costs. At a high level, the total cost of a Databricks workspace can be tracked using cloud…

    Continue reading →: Databricks System Tables
  • spark

    Demystifying Apache Spark: Jobs, Stages, and Tasks

    Published by Arulraj Gopal on August 24, 2025

    The Spark driver is a JVM process that runs the application's main program and coordinates jobs, stages, and tasks. A Spark executor is a JVM process launched on a worker node to execute tasks. A job can contain multiple stages, and each stage can further contain multiple tasks. But how does…

    Continue reading →: Demystifying Apache Spark: Jobs, Stages, and Tasks
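The hierarchy can be sketched in plain Python (illustration only, a simplified model of Spark's scheduling): each action triggers a job, the job's lineage is broken into stages at shuffle (wide) boundaries, and each stage runs one task per partition:

```python
# Illustration only: a simplified model of Spark scheduling. Stages are split
# at wide (shuffle) transformations, and each stage runs one task per
# partition of the data.

WIDE = {"groupBy", "join", "repartition"}  # transformations that shuffle

def count_stages(lineage):
    """A new stage starts after every wide transformation."""
    return 1 + sum(1 for op in lineage if op in WIDE)

lineage = ["read", "filter", "groupBy", "map", "join", "select"]
num_partitions = 4

stages = count_stages(lineage)
tasks = stages * num_partitions  # one task per partition per stage
print(stages, tasks)
```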
  • spark

    How Spark Saves Time & Cost by Being Lazy

    Published by Arulraj Gopal on August 16, 2025

    Understanding Spark internals is important because it directly impacts how effectively you can utilize Spark for performance, scalability, and cost efficiency. One key aspect to note in Spark is the concept of lazy evaluation. Before diving into the main topic, let’s first take a quick look at what actions and…

    Continue reading →: How Spark Saves Time & Cost by Being Lazy
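Lazy evaluation has the same shape as this plain-Python sketch (illustration only, not the Spark API): transformations just record a plan, and nothing executes until an action forces it:

```python
# Illustration only: transformations record a plan; nothing runs until an
# action forces execution -- the same shape as Spark's lazy evaluation.

class LazyDataset:
    def __init__(self, data):
        self.data = data
        self.plan = []          # recorded transformations, not yet executed

    def map(self, fn):          # transformation: lazy, returns self
        self.plan.append(("map", fn))
        return self

    def filter(self, pred):     # transformation: lazy, returns self
        self.plan.append(("filter", pred))
        return self

    def collect(self):          # action: runs the whole recorded plan now
        out = self.data
        for op, fn in self.plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
print(len(ds.plan), "steps recorded, nothing executed yet")
print(ds.collect())   # only now does the work happen
```

Because the full plan is known before execution, an engine built this way can skip or combine work, which is where the time and cost savings come from.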
  • spark

    Spark Performance Pitfalls: The Hidden Cost of Misplaced Unions

    Published by Arulraj Gopal on August 10, 2025

    Using the union operation in the wrong place can be costly in Spark. For example, if you read data from a source, process it, then split it into two DataFrames based on certain filter conditions, perform additional operations on each DataFrame separately, and finally union them back together, Spark will…

    Continue reading →: Spark Performance Pitfalls: The Hidden Cost of Misplaced Unions
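The recomputation cost can be demonstrated in plain Python (illustration only, not Spark code): splitting one source into two branches and unioning them back makes the source run once per branch unless it is materialized first:

```python
# Illustration only: without caching, each branch of a split re-runs the
# source lineage; materializing once before branching avoids the re-scan.

calls = {"source": 0}

def read_source():
    calls["source"] += 1          # count how often the "scan" happens
    return list(range(10))

# Without caching: each branch re-reads the source
evens = [x for x in read_source() if x % 2 == 0]
odds  = [x for x in read_source() if x % 2 == 1]
union = evens + odds
print("scans without caching:", calls["source"])   # scanned twice

# "Cached" version: materialize once, then branch
calls["source"] = 0
cached = read_source()
union2 = [x for x in cached if x % 2 == 0] + [x for x in cached if x % 2 == 1]
print("scans with caching:", calls["source"])      # scanned once
```

In Spark terms, the fix is typically to `cache()`/`persist()` the shared upstream DataFrame, or to restructure the plan so the split happens after the expensive work.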
  • ETL

    Handling Orphan Data

    Published by Arulraj Gopal on August 2, 2025

    Data integrity and consistency are the foundation of data reliability. Without them, every model, report, or insight becomes a guess at best — and a risk at worst. Orphan records are a commonly encountered issue in data management, and effectively handling them is crucial to maintaining high-quality, usable datasets. Unlike…

    Continue reading →: Handling Orphan Data
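The core check is simple to sketch (illustration only, with made-up table names): an orphan record is a child row whose foreign key has no matching parent row, found here with a set lookup against the parent keys:

```python
# Illustration only: find orphan child rows -- rows whose foreign key has no
# matching parent. Table and column names are made up for the example.

customers = [{"id": 1}, {"id": 2}]                  # parent table
orders = [                                          # child table
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 99},            # no such customer: orphan
]

parent_keys = {c["id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in parent_keys]
print(orphans)
```

At warehouse scale the same check is usually an anti-join of the child table against the parent table on the foreign key.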
  • ETL

    Parameterize ADF linked service for multiuse

    Published by Arulraj Gopal on July 26, 2025

    Parameterization is an important best practice in data pipeline design, as in any system design, to keep code clean and easily maintainable. Parameterizing linked services in Azure Data Factory is crucial for reusability and flexibility across different environments and functions. It enables dynamic connections by allowing values…

    Continue reading →: Parameterize ADF linked service for multiuse
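As a sketch, a parameterized Azure SQL linked service might look like the JSON below. The names (`LS_AzureSqlDb`, `ServerName`, `DBName`) are illustrative; the `@{linkedService().…}` expression is ADF's syntax for referencing a linked-service parameter at connection time:

```json
{
  "name": "LS_AzureSqlDb",
  "properties": {
    "type": "AzureSqlDatabase",
    "parameters": {
      "ServerName": { "type": "String" },
      "DBName": { "type": "String" }
    },
    "typeProperties": {
      "connectionString": "Server=@{linkedService().ServerName};Database=@{linkedService().DBName};"
    }
  }
}
```

One such linked service can then back many datasets and pipelines, each passing its own server and database values instead of duplicating a connection per environment.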

Let’s connect

  • LinkedIn
  • Mail

Recent posts

  • Databricks Serverless Compute

  • Processing ADLS delta-table using DuckDB

  • DeltaLake change tracking with CDF & Row Tracking

  • Introducing Lakeflow Spark Declarative Pipelines

  • SQL Queries that make the code simple

  • Databricks data quality with declarative pipeline
