KaniniPro

  • spark

    Spark Bucketing Demystified

    Published by Arulraj Gopal on November 2, 2025

    When working with massive data, shuffling is one of the costliest operations, and engineers go to great lengths to minimize it. Two data-skipping pillars help here: partitioning and clustering. Clustering is preferred for searches on high-cardinality fields, and lakehouse architectures like Delta Lake, Iceberg, and Hudi…
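
    Below is a minimal PySpark sketch of the idea; the table name, column, and bucket count are illustrative, not taken from the post:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

    # Hash-distribute rows into 16 buckets by user_id; a later join or
    # aggregation on user_id can then reuse this layout and skip the shuffle.
    (df.write
       .bucketBy(16, "user_id")
       .sortBy("user_id")
       .mode("overwrite")
       .saveAsTable("users_bucketed"))  # bucketing requires saveAsTable, not save()
    ```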

    Continue reading →: Spark Bucketing Demystified
  • Databricks, delta-lake, spark

    Clustering options in delta-lake

    Published by Arulraj Gopal on October 25, 2025

    When working with large-scale data in a lakehouse architecture, performance matters — a lot. One of the most effective ways to boost query performance is through a concept called data skipping. In simple terms, data skipping helps the query engine avoid scanning unnecessary data. Instead of reading every record, it…
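
    As a rough sketch of what this looks like in Delta Lake (table and column names are hypothetical, and the session is assumed to have Delta enabled):

    ```python
    # Liquid clustering: declare clustering keys at table creation; the engine
    # co-locates rows with similar key values so file statistics can skip files.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events (
            event_id BIGINT, user_id BIGINT, event_date DATE
        ) USING DELTA
        CLUSTER BY (user_id)
    """)

    # The older alternative for tables without liquid clustering: Z-ordering,
    # applied after the fact via OPTIMIZE (events_legacy is a made-up table).
    spark.sql("OPTIMIZE events_legacy ZORDER BY (user_id)")
    ```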

    Continue reading →: Clustering options in delta-lake
  • OLAP, spark

    3 Proven OLAP Query Concepts That Boost Efficiency

    Published by Arulraj Gopal on October 19, 2025

    Not every performance gain comes from fancy techniques like broadcasting, partitioning, or caching. Sometimes, the right way of querying makes all the difference — and that’s where strong data skills and a deep understanding of technology fundamentals come in. Here are some sample scenarios where queries can perform efficiently regardless…
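
    The post's specific scenarios are behind the link, but here is one generic example of a query-level win: fetching the latest row per key with a window function instead of a self-join, which avoids a second scan and shuffle of the same table (the DataFrame and column names are made up):

    ```python
    from pyspark.sql import functions as F, Window

    # Latest event per user without joining the table back to itself.
    w = Window.partitionBy("user_id").orderBy(F.col("event_ts").desc())

    latest = (events_df
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))
    ```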

    Continue reading →: 3 Proven OLAP Query Concepts That Boost Efficiency
  • spark

    Hive Partitioning Unlocked

    Published by Arulraj Gopal on October 12, 2025

    Data skipping is a crucial performance optimization technique, especially in OLAP (Online Analytical Processing) environments. One of the most effective ways to enable data skipping is through partitioning — a technique widely used in lakehouse architectures and other storage systems. Key principle: instead of scanning the entire dataset in…
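
    A small PySpark sketch (paths and column names are illustrative): writing with partitionBy lays files out as sale_date=.../ directories, and a filter on that column prunes whole directories at read time:

    ```python
    # Write one directory per sale_date value (Hive-style layout on storage).
    (sales_df.write
        .partitionBy("sale_date")
        .mode("overwrite")
        .parquet("/data/sales"))

    # Partition pruning: only the sale_date=2025-01-01 directory is scanned.
    jan1 = spark.read.parquet("/data/sales").filter("sale_date = '2025-01-01'")
    ```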

    Continue reading →: Hive Partitioning Unlocked
  • spark

    Repartition and coalesce in spark

    Published by Arulraj Gopal on October 4, 2025

    In Spark, repartition and coalesce are two operations for rebalancing DataFrame partitions for better performance and data management. At first glance, coalesce seems more efficient and is often preferred. However, in certain situations repartition can be much more effective. When to Prefer…
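
    A quick sketch of the difference (the partition counts are arbitrary):

    ```python
    df = spark.range(0, 10_000_000, numPartitions=200)

    # coalesce merges existing partitions without a shuffle -- cheap, but the
    # surviving partitions can end up uneven (skewed).
    small = df.coalesce(10)

    # repartition triggers a full shuffle and rebalances rows evenly --
    # costlier, but better when a heavy parallel stage follows.
    balanced = df.repartition(10)

    print(small.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())  # 10 10
    ```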

    Continue reading →: Repartition and coalesce in spark
  • spark

    Different ways of controlling spark partitions

    Published by Arulraj Gopal on September 28, 2025

    In Apache Spark, partitions are the fundamental units of parallelism that determine how data is split and processed across executors. Each partition is handled by a single task, meaning more partitions generally lead to better parallelism and workload distribution. However, too few partitions can underutilize cluster resources, while too many…
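
    A few of the common knobs, sketched in PySpark (the values and path are illustrative):

    ```python
    # Partitions produced by wide operations (joins, aggregations):
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # Upper bound on bytes packed into one input partition when reading files:
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

    df = spark.read.parquet("/data/events")  # hypothetical input
    df = df.repartition(32, "user_id")       # explicit count plus a distribution key
    print(df.rdd.getNumPartitions())         # 32
    ```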

    Continue reading →: Different ways of controlling spark partitions
  • spark

    Stop Confusing Spark Partitions with Hive Partitions

    Published by Arulraj Gopal on September 21, 2025

    A common confusion arises between Spark partitions and Hive partitions—these are completely different concepts. Spark partitions are in-memory objects controlled by methods like repartition, coalesce, or configuration settings. Balancing partitions in a DataFrame is essential to ensure the right level of parallelism. Too few partitions can leave the cluster underutilized,…
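
    The contrast in a few lines of PySpark (paths and columns are hypothetical):

    ```python
    df = spark.read.parquet("/data/events")

    # Spark partitions: in-memory slices of the DataFrame, one task per slice.
    print(df.rdd.getNumPartitions())

    # Hive-style partitions: event_date=.../ directories on storage, used for
    # data skipping at read time -- unrelated to the in-memory count above.
    df.write.partitionBy("event_date").mode("overwrite").parquet("/data/by_date")
    ```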

    Continue reading →: Stop Confusing Spark Partitions with Hive Partitions
  • Parquet File Format Demystified

    Published by Arulraj Gopal on September 13, 2025

    Many people think Parquet is a columnar format. That is not entirely true. Parquet is actually a hybrid format—it blends the strengths of both row-based and columnar storage. It has become the go-to file format for data engineering. So, what makes it special? Why is it so widely adopted in modern…
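
    The hybrid layout is easy to see with pyarrow (the file name is hypothetical): rows are sliced horizontally into row groups, and each row group stores one column chunk per column:

    ```python
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")
    meta = pf.metadata

    print(meta.num_rows, meta.num_row_groups)  # total rows, horizontal slices

    rg = meta.row_group(0)                     # first row group
    for i in range(rg.num_columns):
        col = rg.column(i)                     # one column chunk per column
        print(col.path_in_schema, col.total_compressed_size)
    ```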

    Continue reading →: Parquet File Format Demystified
  • delta-lake

    Delta Table: Under the Hood

    Published by Arulraj Gopal on September 6, 2025

    In the big data world, data lakes brought several advantages—they are scalable, cost-efficient, and flexible enough to store data in any format, including unstructured and semi-structured data. However, data lakes lack the warehouse-like capabilities needed for ACID transactions, performance, and structured analytics. This is where Delta Lake (the lakehouse architecture) comes in…
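
    A minimal way to peek under the hood, assuming a session with Delta Lake configured (the path is made up):

    ```python
    import os

    spark.range(100).write.format("delta").mode("overwrite").save("/tmp/demo_delta")

    # Every commit is a numbered JSON file in the transaction log; this log is
    # what layers ACID semantics on top of plain Parquet files in the lake.
    print(sorted(os.listdir("/tmp/demo_delta/_delta_log")))

    spark.sql("DESCRIBE HISTORY delta.`/tmp/demo_delta`").show(truncate=False)
    ```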

    Continue reading →: Delta Table: Under the Hood
  • Databricks

    Databricks System Tables

    Published by Arulraj Gopal on August 31, 2025

    Most data engineering professionals have worked with Databricks at some level—whether exploring it casually or using it deeply in production. One of the most essential aspects of managing Databricks is understanding and monitoring costs. At a high level, the total cost of a Databricks workspace can be tracked using cloud…
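
    For example, on workspaces with system tables enabled, DBU consumption can be summarized straight from system.billing.usage (the query shape is a sketch, not taken from the post):

    ```python
    spark.sql("""
        SELECT usage_date,
               sku_name,
               SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        WHERE usage_date >= date_sub(current_date(), 30)
        GROUP BY usage_date, sku_name
        ORDER BY usage_date
    """).show()
    ```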

    Continue reading →: Databricks System Tables

Let’s connect

  • LinkedIn
  • Mail

Recent posts

  • Databricks data quality with declarative pipeline

  • Schema Drift Made Easy with Spark Declarative Pipelines

  • Incremental load (SCD 1 & 2) with Spark declarative pipelines

  • Introducing Lakeflow Spark Declarative Pipelines

  • Tracking Table and Column Lineage in Databricks Unity Catalog

  • Azure Databricks setup with unitycatalog
