Thursday, 25 June 2026

databricks interview questions and answer

Databricks interviews focus on your ability to use the Data Lakehouse architecture, optimize distributed computing (Apache Spark), and manage data governance (Unity Catalog). Preparation should span across architecture, Delta Lake features, and practical performance tuning. [1, 2, 3, 4, 5]
1. Core Architecture & Concepts
Q: What is the Databricks Lakehouse architecture?
A: It is a unified platform that combines the reliability, governance, and ACID transactions of a data warehouse with the scalability, flexibility, and low-cost storage of a data lake. It unifies batch and streaming data into a single system. [1, 2, 3, 4]
Q: What is Unity Catalog?
A: It is a fine-grained centralized governance solution that manages data access and permissions across all Databricks workspaces. It provides centralized auditing, data lineage, and row/column-level security. [1, 2, 3]
Q: Difference between an All-Purpose Cluster and a Job Cluster?
A:
  • All-Purpose: Used for interactive exploration, collaborative notebook sharing, and ad-hoc analysis. Takes longer to start and incurs higher costs. [1, 2, 3, 4, 5]
  • Job Cluster: Dedicated, automated clusters explicitly created to run a specific automated workflow or task. They terminate immediately upon job completion, saving computing costs. [1, 2, 3, 4, 5]

2. Delta Lake Features
Q: What are the main benefits of Delta Lake over standard Parquet?
A: Delta Lake provides ACID transactions (preventing corrupted reads/writes), enforces data schemas (schema enforcement), and allows you to query historical data states via "Time Travel". [1]
Q: Explain the difference between VACUUM and OPTIMIZE.
A: [1]
  • OPTIMIZE: Compacts small files in a Delta table into larger, optimized files, which speeds up read queries.
  • VACUUM: Permanently deletes physical data files that are no longer referenced by the Delta transaction log (e.g., historical data past the retention threshold). It is used to reclaim cloud storage and reduce costs. [1, 2]
Q: What is Schema Evolution in Delta Lake?
A: It allows a pipeline to automatically adapt to changes in data structure without breaking the target table. You typically enable it by using the .option("mergeSchema", "true") setting. [1, 2]

3. Spark Optimization & Debugging
Q: How do you handle Data Skew in Databricks?
A: Data skew occurs when a few specific keys hold disproportionately more data, slowing down the executors processing them. You can handle this by: [1, 2]
  • Salting the skewed keys (adding random prefixes to distribute data across executors).
  • Increasing the spark.sql.shuffle.partitions configuration.
  • Utilizing Adaptive Query Execution (AQE), which dynamically coalesces skewed partition groups. [1, 2, 3, 4, 5]

Q: When would you use a Broadcast Join?
A: When you are joining a very large table with a significantly smaller table. Spark sends (broadcasts) the small table to the memory of every executor, preventing expensive network shuffle operations. Risk: Running out of executor memory if the small table is too large. [1, 2]


 Databricks data engineering interviews comprehensively test architectural concepts, real-time optimization strategies, and robust programmatic problem-solving. [1, 2]

The following selection delivers 4 major Databricks interview questions spanning Architecture, Optimization, Data Ingestion, and a Coding Scenario. Each question includes detailed explanations, technical context, comprehensive examples, and specific test cases with sample PySpark/SQL implementations. [1, 2, 3, 4, 5]

1. Handling Heavy Data Skew with Salting
Question
What is Data Skew in Apache Spark, how does it negatively affect Databricks clusters, and how do you resolve it using a technical approach like Salting? [1, 2]
Technical Context
Data skew occurs when an asymmetric distribution causes a single or small subset of keys to hold a massive portion of the total records. During wide transformations (like a JOIN or groupBy), Spark hashes data keys to assign them to specific partitions. A skewed key forces a solitary executor node to process vastly more data than its peers. This creates a "long tail" problem where the cluster remains underutilized while one node runs out of memory (OOM) or bottlenecks the entire pipeline. [1, 2]
Salting breaks up this bottleneck by appending a randomized integer (a "salt") to the join keys of the skewed dataset, while exploding the lookup table to match those generated salt variations. [1, 2, 3]
Skewed Key ("A") ----> Appended Salt ----> Distributed Keys: "A_0", "A_1", "A_2"
Detailed Example & Implementation
Imagine joining a multi-terabyte Transactions Table (highly skewed on a popular product_id, such as 101) with a small Products Dimension Table. [1]
python
from pyspark.sql import functions as F

# 1. Define salt range
num_salts = 3

# 2. Add salt to the skewed large table
transactions_salted = transactions_df.withColumn(
    "salt", 
    F.concat(F.col("product_id"), F.lit("_"), F.randint(0, num_salts - 1))
)

# 3. Explode the small dimension table to match every possible salt value
salts_df = spark.range(0, num_salts).withColumnRenamed("id", "salt_val")

products_exploded = products_df.join(salts_df) \
    .withColumn("salt", F.concat(F.col("product_id"), F.lit("_"), F.col("salt_val"))) \
    .drop("salt_val")

# 4. Perform the optimized join on the salted key
optimized_joined_df = transactions_salted.join(
    products_exploded,
    on="salt",
    how="inner"
).drop("salt")
Use code with caution.
Test Cases
  • Test Case 1: Skewed Key Distribution
    • Input Data: transactions_df contains 1,000,000 records where product_id = 101 and 10 records where product_id = 202.
    • Expected Behavior: Without salting, Spark creates a single massive partition for 101. With a salt range of 3, the records for 101 split roughly evenly into 3 distinct processing buckets (101_0, 101_1, 101_2), balancing the cluster load.
  • Test Case 2: Join Cardinality and Correctness Verification
    • Input Data: Single transaction record [product_id: 101, amount: 50]. Dimension record [product_id: 101, name: 'Widget'].
    • Expected Output: The output must yield exactly 1 joined record: [product_id: 101, amount: 50, name: 'Widget']. The test guarantees that exploding the dimension table does not introduce artificial row duplication.

2. Implementing SCD Type 2 tracking in Delta Lake
Question
How do you efficiently implement a Slowly Changing Dimension Type 2 (SCD Type 2) pipeline within a Databricks Delta Lake table using native commands? [1, 2]
Technical Context
SCD Type 2 tracks historical data by creating multiple records for a given natural key, differentiating them with state tracking fields like start_date, end_date, and a boolean is_current flag. Delta Lake handles this pattern natively through the MERGE INTO statement. To perform a historical update, existing active records must be marked as expired (is_current = false), and completely new records must be appended as the new ground truth. [1, 2, 3, 4, 5]
Detailed Example & Implementation
We need to track changes to an internal Employees Delta Table. If an employee changes their department, the old record must terminate, and a new record must begin.
sql
-- Target table initialization configuration
CREATE TABLE delta.employees (
    emp_id INT,
    name STRING,
    department STRING,
    start_date DATE,
    end_date DATE,
    is_current BOOLEAN
) USING DELTA;

-- Merge logic processing staging data
MERGE INTO delta.employees AS target
USING (
    -- Combine new updates with an explicit null-key row to force insertions of changed values
    SELECT src.emp_id AS merge_key, src.* FROM staging_updates src
    UNION ALL
    SELECT NULL AS merge_key, src.* FROM staging_updates src
    JOIN delta.employees tgt ON src.emp_id = tgt.emp_id
    WHERE tgt.is_current = true AND src.department <> tgt.department
) AS source
ON target.emp_id = source.merge_key AND target.is_current = true
WHEN MATCHED AND target.department <> source.department THEN
    UPDATE SET target.end_date = source.start_date, target.is_current = false
WHEN NOT MATCHED THEN
    INSERT (emp_id, name, department, start_date, end_date, is_current)
    VALUES (source.emp_id, source.name, source.department, source.start_date, NULL, true);
Use code with caution.
Test Cases
  • Test Case 1: Processing a Brand New Record
    • Input Data: Staging record [emp_id: 5, name: 'Alice', department: 'HR', start_date: '2026-06-01'] targeting an empty target table.
    • Expected Output: One row added: [emp_id: 5, ..., department: 'HR', start_date: '2026-06-01', end_date: NULL, is_current: true].
  • Test Case 2: Historical Value Mutation (The Update)
    • Input Data: Target table contains active record for Alice in 'HR'. Staging receives: [emp_id: 5, name: 'Alice', department: 'Engineering', start_date: '2026-06-25'].
    • Expected Output: The target table must contain exactly two rows for emp_id: 5:
      1. The legacy row updated to: [department: 'HR', end_date: '2026-06-25', is_current: false].
      2. The newly created row added as: [department: 'Engineering', start_date: '2026-06-25', end_date: NULL, is_current: true].

3. Scaling File Ingestion with Auto Loader
Question
What is Databricks Auto Loader, how does it work under the hood, and what advantages does it offer over standard Spark Structured Streaming file readers? [1, 2, 3]
Technical Context
Traditional Spark directory streaming routinely scans cloud object storage containers to find newly added objects. As file quantities scale into the millions, directory listing operations become painfully slow and prohibitively expensive. [1, 2]
Databricks Auto Loader (cloudFiles) solves this using two operational mechanisms: [1, 2]
  1. Directory Listing Mode: It efficiently enumerates files and retains state directly inside RocksDB checkpoints, completely removing the need to re-scan previously ingested assets. [1]
  2. File Notification Mode: It automatically configures and listens to cloud notification systems (like AWS SNS/SQS or Azure Event Grid/Queue), directly queuing incoming files without running manual system storage scans. [1, 2, 3, 4]
Detailed Example & Implementation
Streaming JSON file extraction from a landing storage container with explicit automatic schema inference and evolution tracking. [1, 2]
python
# Streaming data ingestion via Auto Loader
checkpoint_path = "dbfs:/mnt/checkpoints/json_ingestion"
schema_location = "dbfs:/mnt/metadata/schemas/json_ingestion"

streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") # Automatic schema capture
    .load("dbfs:/mnt/landing_zone/raw_events/"))

query = (streaming_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .start("dbfs:/mnt/silver_zone/events_table"))
Use code with caution.
Test Cases
  • Test Case 1: Incremental Execution and Processing Latency
    • Input Data: 50,000 files exist in /raw_events/. The pipeline processes them completely. 10 new files are dropped into the bucket.
    • Expected Behavior: Auto Loader identifies and streams only the 10 incoming files instantly using the notification queue, skipping expensive directory scanning. [1, 2]
  • Test Case 2: Schema Evolution Verification
    • Input Data: System reads initial records with fields [id, timestamp]. A new batch introduces a modified structure containing a third column: [id, timestamp, geo_location].
    • Expected Output: The streaming target Delta table alters dynamically to include geo_location without failing the workflow, populating previous records with a NULL fallback. [1, 2]

4. Algorithmic Coding Challenge — IP Range to CIDR Merging
Question
Write a Python/PySpark solution that takes a series of network log IP addresses and collapses contiguous ranges into minimal CIDR blocks. [1]
Technical Context
This programmatic scenario evaluates your algorithmic fluency, object optimization strategies, and network architecture logic. Databricks profiles rely heavily on transforming IP data string representations into numerical 32-bit unsigned integer values to cleanly execute mathematical ranges, interval scans, and sliding optimization windows. [1, 2]
Detailed Example & Implementation
python
import ipaddress

def merge_ips_to_cidr(ip_list):
    """
    Parses a string list of individual IPv4 strings, collapses contiguity,
    and returns the optimized, minimum possible list of representative CIDR blocks.
    """
    if not ip_list:
        return []
        
    # Convert text strings into structured IP objects, sorting ascending
    ips = sorted([ipaddress.IPv4Address(ip) for ip in ip_list])
    
    merged_cidrs = []
    current_range = []
    
    for ip in ips:
        if not current_range:
            current_range.append(ip)
        # Check if the current IP address is strictly contiguous with the previous element
        elif int(ip) == int(current_range[-1]) + 1:
            current_range.append(ip)
        else:
            # Collapse completed range into native CIDR representations
            start_ip = current_range[0]
            end_ip = current_range[-1]
            merged_cidrs.extend(ipaddress.summarize_address_range(start_ip, end_ip))
            current_range = [ip]
            
    if current_range:
        merged_cidrs.extend(ipaddress.summarize_address_range(current_range[0], current_range[-1]))
        
    return [str(cidr) for cidr in merged_cidrs]

# Example execution verification
input_ips = ["192.168.1.0", "192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.5"]
print(merge_ips_to_cidr(input_ips))
# Output: ['192.168.1.0/30', '192.168.1.5/32']
Use code with caution.
Test Cases
  • Test Case 1: Perfect Contiguous Block Optimization
    • Input Data: ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3"]
    • Expected Output: ['10.0.0.0/30'] (collapses four 32-bit addresses into a single subnet).
  • Test Case 2: Complete Discontinuity Handling
    • Input Data: ["192.168.0.1", "10.0.0.1"]
    • Expected Output: ['10.0.0.1/32', '192.168.0.1/32'] (order sorted, zero

No comments:

Post a Comment