Optimizing Null Checks and Multi-Granularity Deduplication in Scalable SQL Pipelines

Abstract
Large-scale data pipelines, especially those built on Delta Lake and Spark SQL, require robust mechanisms for null-checking, deduplication, and handling diverse data granularities across multiple ingestion passes. This article presents a novel approach to unify passes with varying granularities, consolidate them efficiently, and apply EXCEPT ALL logic for deduplication before final merging. Our strategy significantly reduces shuffle overhead and ensures high data integrity and transformation fidelity.

1. Introduction
Data pipelines processing upwards of 750 million records often suffer performance hits due to redundant transformations, excessive joins, and multi-pass logic inconsistencies. This becomes more complex when working with datasets like PurchaseInvoiceDetail and LogisticsShipment, where various passes contain the same data in different granularities.
To address this, we propose:

  • Granularity Alignment via Union
  • Staging using Temporary Delta Tables
  • Pre-Merge Deduplication using EXCEPT ALL

1.1 Understanding “Pass” in Data Pipelines
In this context, a “pass” refers to an individual execution or ingestion of data through a pipeline. Each pass represents a distinct layer of data derived from different business logic or upstream sources.
There are typically multiple passes in production pipelines, each differing by:

  • Granularity: One may contain raw transaction-level data, another aggregated metrics.
  • Source: From ERP systems, forecasting tools, or manual uploads.
  • Transformation Scope: Ranging from simple filtering to deep aggregation.
    To maintain data consistency while improving performance, these passes must be aligned, unioned, and deduplicated appropriately before merging into the main Delta table.

2. Problem Statement
Common challenges include:

  • Inconsistent granularity across passes
  • Multiple aggregation layers causing shuffle bottlenecks
  • Traditional deduplication (ROW_NUMBER, DISTINCT) is compute-heavy
  • Final merge stages often introduce delay and complexity

3. Solution Overview
Our approach involves:

  1. Granularity Alignment: Union same-level passes before merging.
  2. Temporary Staging: Store each unioned/aggregated pass in its own Delta table.
  3. Pre-Merge Deduplication: Use EXCEPT ALL before final merge to filter out redundancies.

Figure 1: Optimzation null checks and multi-granularity deduplication in scalabal Spark-SQL piplines

4. Detailed Flow

4.1 Multi-Pass Handling
We categorize data passes into:

  • Pass A: Raw records with base-level granularity
  • Pass B: Aggregated data (e.g., PO, Vendor, Material)
  • Pass C: Derived views with partial aggregations

Instead of merging directly, same-granularity passes are first aligned via UNION.
Example:

4.2 Staging and Surrogate Keys

Apply surrogate key and null check:

4.3 Pre-Merge Deduplication

Optionally, union with aggregated data:

5. Key Benefits

Optimization LeverBeforeAfter
Runtime~90 min~40 min
Shuffle Tasks2000+~500
Data Loss RiskHighNone
Debugging ComplexityHighLow

6. Best Practices

  • Union only same-granularity passes
  • Use xxhash64 for performant surrogate keys
  • Always deduplicate using EXCEPT ALL before final merge
  • Avoid using window functions in large pipelines
  • Separate aggregated and raw data flows

7. Conclusion
This modular architecture provides a scalable solution to null checks and deduplication in Spark SQL pipelines. By aligning granular passes, staging intermediate logic, and using EXCEPT ALL for deduplication, we reduce runtime, avoid data loss, and simplify pipeline management. This approach has proven effective in real-world production workloads involving billions of records.

Abhishek Tyagi, Data & AI Engineer | Kritrim Intelligence

Categories: Data Analytics, Articles

Leave a Reply