Abstract
Large-scale data pipelines, especially those built on Delta Lake and Spark SQL, require robust mechanisms for null checking, deduplication, and handling diverse data granularities across multiple ingestion passes. This article presents a novel approach to align passes of varying granularity, consolidate them efficiently, and apply EXCEPT ALL logic for deduplication before the final merge. The strategy significantly reduces shuffle overhead while preserving data integrity and transformation fidelity.
1. Introduction
Data pipelines processing upwards of 750 million records often suffer performance hits due to redundant transformations, excessive joins, and multi-pass logic inconsistencies. This becomes more complex when working with datasets like PurchaseInvoiceDetail and LogisticsShipment, where various passes contain the same data at different granularities.
To address this, we propose:
- Granularity Alignment via Union
- Staging using Temporary Delta Tables
- Pre-Merge Deduplication using EXCEPT ALL
1.1 Understanding “Pass” in Data Pipelines
In this context, a “pass” refers to an individual execution or ingestion of data through a pipeline. Each pass represents a distinct layer of data derived from different business logic or upstream sources.
There are typically multiple passes in production pipelines, each differing by:
- Granularity: One may contain raw transaction-level data, another aggregated metrics.
- Source: From ERP systems, forecasting tools, or manual uploads.
- Transformation Scope: Ranging from simple filtering to deep aggregation.
To maintain data consistency while improving performance, these passes must be aligned, unioned, and deduplicated appropriately before merging into the main Delta table.
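To make the granularity distinction concrete, the sketch below shows two hypothetical passes over a PurchaseInvoiceDetail-style source; every table and column name here is an illustrative assumption, not the production schema.
-- Illustrative only: table and column names are assumed.
-- Pass A: raw, line-item granularity (one row per invoice line).
CREATE OR REPLACE TEMP VIEW pass_A_1 AS
SELECT invoice_id, line_no, po_id, vendor_id, material_id, quantity, amount
FROM purchase_invoice_detail_raw;

-- Pass B: aggregated granularity (one row per PO / vendor / material).
CREATE OR REPLACE TEMP VIEW pass_B AS
SELECT po_id, vendor_id, material_id,
       SUM(quantity) AS total_quantity,
       SUM(amount) AS total_amount
FROM purchase_invoice_detail_raw
GROUP BY po_id, vendor_id, material_id;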
2. Problem Statement
Common challenges include:
- Inconsistent granularity across passes
- Multiple aggregation layers causing shuffle bottlenecks
- Compute-heavy traditional deduplication (ROW_NUMBER, DISTINCT); see the sketch after this list
- Final merge stages often introduce delay and complexity
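For reference, the compute-heavy window-based deduplication mentioned above looks roughly like the sketch below; the key and ordering columns are illustrative. ROW_NUMBER has to partition and sort the entire input, which on 750M+ rows means a large shuffle.
-- Traditional dedup baseline (illustrative column names).
-- The window forces a full shuffle and sort of the input.
CREATE OR REPLACE TEMP VIEW deduped_traditional AS
SELECT invoice_id, line_no, material_id, quantity, amount
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY invoice_id, line_no
           ORDER BY load_timestamp DESC
         ) AS rn
  FROM incoming_pass
) ranked
WHERE rn = 1;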
3. Solution Overview
Our approach involves:
- Granularity Alignment: Union same-level passes before merging.
- Temporary Staging: Store each unioned/aggregated pass in its own Delta table.
- Pre-Merge Deduplication: Use EXCEPT ALL before the final merge to filter out redundancies.

Figure 1: Optimizing null checks and multi-granularity deduplication in scalable Spark SQL pipelines
4. Detailed Flow
4.1 Multi-Pass Handling
We categorize data passes into:
- Pass A: Raw records with base-level granularity
- Pass B: Aggregated data (e.g., rolled up by PO, Vendor, and Material)
- Pass C: Derived views with partial aggregations
Instead of merging directly, same-granularity passes are first aligned via UNION.
Example:
CREATE OR REPLACE TEMP VIEW unified_pass_A AS
SELECT * FROM pass_A_1
UNION
SELECT * FROM pass_A_2;
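A caveat worth noting: UNION resolves columns by position and also removes exact duplicates, which itself costs a shuffle. A possible variant, assuming hypothetical column names, lists the columns explicitly and uses UNION ALL, leaving duplicate removal to the later EXCEPT ALL step:
-- Variant sketch (column names are illustrative).
-- UNION ALL skips the distinct step; duplicates are handled later by EXCEPT ALL.
CREATE OR REPLACE TEMP VIEW unified_pass_A AS
SELECT invoice_id, line_no, po_id, vendor_id, material_id, quantity, amount
FROM pass_A_1
UNION ALL
SELECT invoice_id, line_no, po_id, vendor_id, material_id, quantity, amount
FROM pass_A_2;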
4.2 Staging and Surrogate Keys
CREATE TABLE temp_pass_A USING DELTA AS
SELECT * FROM unified_pass_A;
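Where the Delta OPTIMIZE command is available (Databricks, or open-source Delta Lake 2.0+), the staged table can optionally be partitioned and compacted before the heavier downstream steps; load_date and invoice_id below are assumed columns:
-- Optional staging refinements (assumed columns: load_date, invoice_id).
CREATE TABLE temp_pass_A
USING DELTA
PARTITIONED BY (load_date)
AS SELECT * FROM unified_pass_A;

-- Compact small files and co-locate rows on a frequently joined key.
OPTIMIZE temp_pass_A ZORDER BY (invoice_id);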
Apply a surrogate key and null-check flags:
CREATE OR REPLACE TEMP VIEW temp_pass_A_keys AS
SELECT *,
xxhash64(concat_ws('|', col1, col2, col3)) AS surrogate_key,
CASE WHEN col1 IS NULL THEN 1 ELSE 0 END AS flag_col1_null
FROM temp_pass_A;
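One caveat: concat_ws silently skips NULL inputs, so rows that differ only in which column is NULL can hash to the same surrogate key. A NULL-safe variant coalesces each column to an explicit sentinel before hashing; the '<NULL>' marker below is an arbitrary choice:
-- NULL-safe surrogate key: ('a', NULL, 'b') and ('a', 'b', NULL) no longer collide.
CREATE OR REPLACE TEMP VIEW temp_pass_A_keys AS
SELECT *,
       xxhash64(concat_ws('|',
         coalesce(cast(col1 AS STRING), '<NULL>'),
         coalesce(cast(col2 AS STRING), '<NULL>'),
         coalesce(cast(col3 AS STRING), '<NULL>'))) AS surrogate_key,
       CASE WHEN col1 IS NULL THEN 1 ELSE 0 END AS flag_col1_null
FROM temp_pass_A;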
4.3 Pre-Merge Deduplication
CREATE OR REPLACE TEMP VIEW final_to_merge AS
SELECT * FROM temp_pass_A_keys
EXCEPT ALL
SELECT * FROM already_existing_main_table;
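EXCEPT ALL compares rows positionally, so both sides must project the same columns in the same order and with compatible types. If the main table does not physically store the surrogate key or null flags, they can be derived on the right-hand side; the column list below is illustrative:
-- Both sides of EXCEPT ALL must project identical column lists.
CREATE OR REPLACE TEMP VIEW final_to_merge AS
SELECT col1, col2, col3, surrogate_key, flag_col1_null
FROM temp_pass_A_keys
EXCEPT ALL
SELECT col1, col2, col3,
       xxhash64(concat_ws('|', col1, col2, col3)) AS surrogate_key,
       CASE WHEN col1 IS NULL THEN 1 ELSE 0 END AS flag_col1_null
FROM already_existing_main_table;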
Optionally, union with aggregated data:
CREATE OR REPLACE TEMP VIEW unified_merge_input AS
SELECT * FROM final_to_merge
UNION
SELECT * FROM temp_pass_B;
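The final merge itself is referenced throughout but not shown; a minimal MERGE INTO sketch keyed on the surrogate key might look like the following, assuming the target table name and that every input (including temp_pass_B) exposes the same column set with a surrogate_key:
-- Final merge into the main Delta table (table name and schema are assumed).
MERGE INTO already_existing_main_table AS tgt
USING unified_merge_input AS src
ON tgt.surrogate_key = src.surrogate_key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;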
5. Key Benefits
| Optimization Lever | Before | After |
|---|---|---|
| Runtime | ~90 min | ~40 min |
| Shuffle Tasks | 2000+ | ~500 |
| Data Loss Risk | High | None |
| Debugging Complexity | High | Low |
6. Best Practices
- Union only same-granularity passes
- Use xxhash64 for performant surrogate keys
- Always deduplicate using EXCEPT ALL before final merge
- Avoid window-function-based deduplication (e.g., ROW_NUMBER) in large pipelines; prefer hash keys plus EXCEPT ALL
- Separate aggregated and raw data flows
7. Conclusion
This modular architecture provides a scalable solution to null checks and deduplication in Spark SQL pipelines. By aligning granular passes, staging intermediate logic, and using EXCEPT ALL for deduplication, we reduce runtime, avoid data loss, and simplify pipeline management. This approach has proven effective in real-world production workloads involving billions of records.
— Abhishek Tyagi, Data & AI Engineer | Kritrim Intelligence