KaniniPro

  • Databricks

    Azure Databricks setup with Unity Catalog

    Published by

    Arulraj Gopal

    on

    December 8, 2025

    Once an organization decides to adopt Databricks, the next critical responsibility is setting it up correctly and maintaining it effectively. Databricks is not a static platform — it offers multiple features, deployment models, and constantly evolving capabilities. Because of this, teams must understand both Databricks best practices and the specific…

    Continue reading →: Azure Databricks setup with Unity Catalog
  • Databricks

    Deploying Lakeflow Jobs with Databricks Asset Bundles

    Published by

    Arulraj Gopal

    on

    November 30, 2025

    Databricks Lakeflow Jobs provide a powerful way to orchestrate notebooks and data processes directly inside Databricks without relying on external orchestration tools like Azure Data Factory, Airflow, or Dagster. A key requirement for modern data engineering is keeping job definitions as code and deploying them consistently across environments. This is…

    Continue reading →: Deploying Lakeflow Jobs with Databricks Asset Bundles
  • Databricks

    Databricks CLI Explained: The Power of Automation Beyond the UI

    Published by

    Arulraj Gopal

    on

    November 24, 2025

    Databricks provides a rich user interface that makes it easy to interact with notebooks, jobs, clusters, and data objects. But as your platform grows, teams mature, and automation becomes a requirement, the Databricks Command Line Interface (CLI) becomes an indispensable tool. In this blog, we’ll explore what the Databricks CLI…

    Continue reading →: Databricks CLI Explained: The Power of Automation Beyond the UI
  • Databricks

    Key Practices That Make Databricks DE Life Easy

    Published by

    Arulraj Gopal

    on

    November 16, 2025

    Focusing on performance is important, but that doesn’t mean a data team comes cheap. As requirements grow more complex, you need skilled data engineers, and that naturally increases cost. One of the most effective ways to reduce that cost is to keep your code simple. Databricks gives us several built-in features…

    Continue reading →: Key Practices That Make Databricks DE Life Easy
  • delta-lake, spark

    Clustering by Z-order demystified

    Published by

    Arulraj Gopal

    on

    November 9, 2025

    Clustering is one of the best-known techniques in big data systems, especially in lakehouse architectures. It is a data layout optimization that arranges data on disk so that, when querying, only a limited number of files are read using file metadata stats instead of every file in the lakehouse…

    Continue reading →: Clustering by Z-order demystified
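
    As a quick illustration of the layout idea in this excerpt, here is a minimal PySpark sketch of Z-ordering a Delta table. The table name sales.orders and its columns are hypothetical, and the OPTIMIZE … ZORDER BY syntax assumes a Delta table on a Databricks runtime (or another engine that supports the command).

    from pyspark.sql import SparkSession

    # Assumes a Spark session with Delta Lake support (e.g. a Databricks cluster).
    spark = SparkSession.builder.getOrCreate()

    # Rewrite the files of a (hypothetical) Delta table so that rows with similar
    # values of the chosen columns land in the same files. Filters on those columns
    # can then skip most files using per-file min/max statistics.
    spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id, order_date)")

    # A selective filter on a Z-ordered column now reads only a handful of files.
    spark.sql("SELECT * FROM sales.orders WHERE customer_id = 42").show()
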
  • spark

    Spark Bucketing Demystified

    Published by

    Arulraj Gopal

    on

    November 2, 2025

    When working with massive data, shuffling is one of the costliest operations, and engineers make every effort to reduce it as much as possible. This is achieved through two data-skipping pillars: partitioning and clustering. Clustering is preferred for high-cardinality field searches, and lakehouse architectures like Delta Lake, Iceberg, and Hudi…

    Continue reading →: Spark Bucketing Demystified
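
    To make the shuffle-avoidance point concrete, a small hypothetical PySpark sketch of bucketing two tables on a join key follows. The table names and bucket count are illustrative, and bucketBy requires writing through saveAsTable.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source DataFrames; in practice these come from your lake.
    orders = spark.range(1_000_000).withColumnRenamed("id", "order_id")
    events = spark.range(1_000_000).withColumnRenamed("id", "order_id")

    # Hash-bucket both tables on the join key at write time. A later join on
    # order_id can then avoid a full shuffle, because matching keys already
    # sit in matching buckets.
    (orders.write.mode("overwrite")
        .bucketBy(16, "order_id").sortBy("order_id")
        .saveAsTable("orders_bucketed"))
    (events.write.mode("overwrite")
        .bucketBy(16, "order_id").sortBy("order_id")
        .saveAsTable("events_bucketed"))

    joined = spark.table("orders_bucketed").join(
        spark.table("events_bucketed"), "order_id")
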
  • Databricks, delta-lake, spark

    Clustering options in delta-lake

    Published by

    Arulraj Gopal

    on

    October 25, 2025

    When working with large-scale data in a lakehouse architecture, performance matters — a lot. One of the most effective ways to boost query performance is through a concept called data skipping. In simple terms, data skipping helps the query engine avoid scanning unnecessary data. Instead of reading every record, it…

    Continue reading →: Clustering options in delta-lake
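
    As a rough sketch of two common clustering options in Delta Lake, the snippet below applies Z-ordering to an existing table and declares liquid clustering on a new one. Table and column names are made up, and the CLUSTER BY syntax assumes a recent Delta Lake / Databricks runtime.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Option 1: Z-ordering, applied to an existing (hypothetical) Delta table.
    spark.sql("OPTIMIZE events ZORDER BY (user_id)")

    # Option 2: liquid clustering, declared when the table is created
    # (available on recent Delta Lake / Databricks runtimes).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events_clustered (
            user_id BIGINT,
            event_time TIMESTAMP,
            payload STRING
        ) CLUSTER BY (user_id)
    """)
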
  • OLAP, spark

    3 Proven OLAP Query Concepts That Boost Efficiency

    Published by

    Arulraj Gopal

    on

    October 19, 2025

    Not every performance gain comes from fancy techniques like broadcasting, partitioning, or caching. Sometimes, the right way of querying makes all the difference — and that’s where strong data skills and a deep understanding of technology fundamentals come in. Here are some sample scenarios where queries can perform efficiently regardless…

    Continue reading →: 3 Proven OLAP Query Concepts That Boost Efficiency
  • spark

    Hive Partitioning Unlocked

    Published by

    Arulraj Gopal

    on

    October 12, 2025

    Data skipping is a crucial performance optimization technique, especially in OLAP (Online Analytical Processing) environments. One of the most effective ways to enable data skipping is through partitioning, a technique widely used in lakehouse architectures and other storage layers. Key principle: instead of scanning the entire dataset in…

    Continue reading →: Hive Partitioning Unlocked
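
    A minimal PySpark sketch of that key principle, using a made-up events dataset and output path: Hive-style partitioning writes one directory per partition value, so a filter on the partition column prunes whole directories instead of scanning everything.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical events DataFrame with an event_date column.
    events = spark.createDataFrame(
        [(1, "2025-10-01"), (2, "2025-10-02")],
        ["event_id", "event_date"],
    )

    # Hive-style layout: one sub-directory per event_date value
    # (e.g. .../event_date=2025-10-01/).
    events.write.mode("overwrite") \
        .partitionBy("event_date") \
        .parquet("/tmp/events_partitioned")

    # Partition pruning: only the 2025-10-01 directory is scanned.
    spark.read.parquet("/tmp/events_partitioned") \
        .filter(F.col("event_date") == "2025-10-01") \
        .show()
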
  • spark

    Repartition and coalesce in Spark

    Published by

    Arulraj Gopal

    on

    October 4, 2025

    In Spark, repartition and coalesce are two options used to rebalance DataFrame partitions for better performance and data management. The key technical differences are shown below. At first glance, coalesce seems more efficient and is often preferred. However, in certain situations repartition can be much more effective. When to Prefer…

    Continue reading →: Repartition and coalesce in Spark
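
    A small, hypothetical sketch of the difference: repartition triggers a full shuffle and can scale the partition count up or down, while coalesce only merges existing partitions without shuffling, so it can only reduce the count.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(10_000_000)  # hypothetical large DataFrame

    # repartition: full shuffle, evenly redistributes rows across 200 partitions,
    # optionally hash-partitioned by a column (useful before a wide join).
    balanced = df.repartition(200, "id")

    # coalesce: merges existing partitions without a shuffle (useful right
    # before writing, to avoid producing many tiny files).
    compacted = df.coalesce(10)

    print(balanced.rdd.getNumPartitions())   # 200
    print(compacted.rdd.getNumPartitions())  # 10
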

Let’s connect

  • LinkedIn
  • Mail

Recent posts

  • Databricks Serverless Compute

  • Processing ADLS delta-table using DuckDB

  • DeltaLake change tracking with CDF & Row Tracking

  • Introducing Lakeflow Spark Declarative Pipelines

  • SQL Queries that make the code simple

  • Databricks data quality with declarative pipeline
