Thursday, 25 June 2026

databricks interview questions and answer

Question : spark architecture databricks

Apache Spark on Databricks operates on a distributed master-slave architecture optimized for cloud environments. It separates the platform management (control plane) from the actual parallel data computation (compute plane). 

Platform Level: The Split-Plane Architecture
Databricks high-level architecture splits its operations into two distinct areas to balance security, governance, and scale: 
  • Control Plane: Databricks hosts this backend layer in its own cloud account. It manages the notebook UI, workspace configurations, job scheduling, security, and the Databricks Unity Catalog for central data governance. 
  • Compute Plane: This is where your actual data processing happens. It can run in your personal cloud account (Classic Compute) or safely in an isolated network inside Databricks' account (Serverless Compute). 

Infrastructure Level: Core Spark Cluster Nodes
The compute plane provisions clusters of Virtual Machines (VMs) structured around two main roles: 
  • Driver Node (Master): This node runs the SparkSession, which serves as the entry point for your application. It converts user code (Python, SQL, Scala, R) into an execution plan called a Directed Acyclic Graph (DAG). It then coordinates tasks and schedules them out to workers. 
  • Worker Nodes (Slaves): These machines receive the code tasks from the driver. They fetch data fragments, process calculations entirely in-memory, and stream results back to the driver. 
┌────────────────────────────────────────────────────────┐
│                      CONTROL PLANE                     │
│    (UI, Web App, Workflows, Unity Catalog, Security)   │
└───────────────────────────┬────────────────────────────┘
                            │ (Manages & Coordinates)
┌───────────────────────────▼────────────────────────────┐
│                      COMPUTE PLANE                     │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │               DRIVER NODE (Master)               │  │
│  │   [SparkSession] ──► [DAG/Plan] ──► [Scheduler]  │  │
│  └────────────────────────┬─────────────────────────┘  │
│                           │ (Sends Tasks)              │
│            ┌──────────────┴──────────────┐             │
│            ▼                             ▼             │
│  ┌───────────────────┐         ┌───────────────────┐   │
│  │ WORKER NODE       │         │ WORKER NODE       │   │
│  │ ┌───────────────┐ │         │ ┌───────────────┐ │   │
│  │ │ Executor JVM  │ │         │ │ Executor JVM  │ │   │
│  │ │  (Core Tasks) │ │         │ │  (Core Tasks) │ │   │
│  │ └───────────────┘ │         │ └───────────────┘ │   │
│  └───────────────────┘         └───────────────────┘   │
└────────────────────────────────────────────────────────┘

Internal Execution Level: JVM & Tasks
Within each worker node, the workload is further subdivided to parallelize operations: 
  • Executors: Independent Java Virtual Machine (JVM) processes launched on the worker nodes specifically to run your tasks. Each executor handles local memory storage and tracks data lineages. 
  • Slots/Cores: Every executor is configured with execution slots that mirror available CPU cores. A single slot handles exactly one data partition task at a time. 
  • Tasks: The smallest unit of execution. Tasks execute transformations (like filtering or calculations) on a single data block and run completely in parallel across your available cluster slots. 

Question : spark sql in databricks

Spark SQL is an Apache Spark module used for structured data processing within Databricks. It allows you to query large datasets using standard ANSI SQL syntax while leveraging Spark's distributed computing power. 
Execution Methods
You can run Spark SQL queries in Databricks notebooks using two primary methods: 
  • Using the %sql Magic Command:
    Turn any notebook cell into a native SQL editor by placing %sql at the very top.
    sql
    %sql
    SELECT user_id, COUNT(order_id) 
    FROM samples.tpch.orders 
    GROUP BY user_id;
    
    Using the spark.sql() API:
  • Embed SQL queries directly inside Python, Scala, or R cells. This returns a DataFrame for further programmatic manipulation.
    python
    # In a Python cell
    df = spark.sql("SELECT * FROM samples.tpch.customer WHERE market_segment = 'AUTOMOBILE'")
    display(df)
    

Interoperability with DataFrames
Spark SQL shares the same underlying optimization engine as DataFrames. This allows you to seamlessly switch between Python/Scala and pure SQL. 
  • Querying Files Directly: You can run SQL queries directly on files stored in cloud object storage without importing them first.
    sql
    %sql
    SELECT * FROM json.`/volumes/my_catalog/my_schema/my_volume/data.json`;
    
    Creating Temporary Views: Turn a PySpark DataFrame into a queryable SQL relation using temporary views.
  • python
    # Register DataFrame as a temporary view
    df.createOrReplaceTempView("my_temp_view")
    
    sql
    %sql
    -- Query it immediately using SQL
    SELECT * FROM my_temp_view WHERE status = 'active';
    

Advanced Features in Databricks
Databricks extends traditional Spark SQL capability with enterprise-grade features found in the Databricks SQL Language Reference
  • Delta Lake Optimization: Built-in support for MERGE INTO, UPDATE, and DELETE syntax.
  • SQL Scripting: Procedural control flow like loops, variables, and IF-ELSE statements directly inside your SQL code.
  • Unity Catalog Integration: Secure data governance mapping 3-tier namespaces (catalog.schema.table).
  • COPY INTO: An idempotent SQL command to bulk load files from object storage directly into Delta table

For more Details

Question : Azure Databricks transformation and action


In Azure Databricks, transformations and actions represent the two core types of operations used to manipulate data within Apache Spark DataFrames. The fundamental distinction between them relies on lazy evaluation, where transformations construct a logical execution plan, and actions trigger the physical computation of that plan to return or save results. 

Key Concepts
  • Lazy Evaluation: Azure Databricks does not execute operations immediately. It stores the processing logic as a sequence of steps.
  • Directed Acyclic Graph (DAG): Spark builds a mathematical graph (DAG) to map out the most efficient way to compute the final requested output.
  • Catalyst Optimizer: An internal engine that optimizes the DAG to reduce memory usage and minimize disk read/write operations. 

1. Transformations
Transformations are operations that take a DataFrame as input and return a new DataFrame as output. Because DataFrames are immutable, transformations do not alter the source data. They are lazily evaluated and merely define the instructions for processing. 
Types of Transformations
  • Narrow Transformations: Operations where each input partition contributes to exactly one output partition. No data movement across the network (shuffling) is required. 
  • Wide Transformations: Operations where data from multiple input partitions must be combined into a single output partition. This requires a costly cluster operation called a shuffle. 
Common Transformation Operations
  • filter() or where(): Selects specific rows based on a condition (Narrow).
  • select(): Isolates specific columns from the dataset (Narrow).
  • withColumn(): Creates a new column or modifies an existing one (Narrow).
  • join(): Merges two datasets based on a common key (Wide).
  • groupBy(): Aggregates records based on matching column criteria (Wide).
  • distinct(): Removes duplicate rows from the dataset (Wide). 

2. Actions
Actions are operations that instruct the Spark engine to compute a concrete result from the accumulated transformations. When an action is executed, Spark evaluates the optimization plan, reads the data, applies the pipeline transformations, and triggers a Spark Job. 
Common Action Operations
  • display(): Renders the data visually in a tabular format within Databricks notebooks.
  • count(): Evaluates the total number of rows in the DataFrame.
  • collect(): Returns all rows of the DataFrame back to the Driver node as an array (use with caution on huge datasets to avoid out-of-memory errors).
  • show(): Prints the first 20 rows of a DataFrame in a console-like layout.
  • take(n): Retrieves the first n rows of the DataFrame.
  • write(): Permanently exports the processed data into cloud storage formats like Delta Lake tables, CSV, Parquet, or JSON. 

Summary Comparison
Metric TransformationsActions
OutputA new DataFrameA concrete value, file, or console print
ExecutionLazy (Deferred evaluation)Eager (Immediate execution)
Cluster BehaviorUpdates the internal execution plan (DAG)Triggers a Spark Job across cluster nodes
Primary UseData cleaning, filtering, and structuringOutputting, inspecting, or storing results

Python Code Example
The following code illustrates how transformations construct a query plan without executing it, until an action is explicitly invoked: 
python
# 1. READ OPERATION (Transformation)
df = spark.read.table("hive_metastore.default.raw_sales")

# 2. TRANSFORMATIONS (No data processing happens yet; Spark just builds the plan)
filtered_df = df.filter(df["amount"] > 100)                    # Narrow transformation
transformed_df = filtered_df.groupBy("country").sum("amount")   # Wide transformation

# 3. ACTION (This triggers the actual computation across the cluster)
final_result = transformed_df.collect()

# 4. ACTION (Saves the output to a permanent Delta Lake table)
transformed_df.write.format("delta").mode("overwrite").saveAsTable("gold_sales_summary")

Question : azure databricks window functions
Azure Databricks window functions operate on a group of rows (called a window) and calculate a return value for each individual row based on that group. Unlike standard GROUP BY operations which collapse your dataset, window functions retain the full granularity of your original rows while appending calculated columns like running totals or rankings. 

Key Components of a Window Function
Every window function relies on an OVER clause which defines the Window Specification using three core blocks: 
  • PARTITION BY: Groups rows into subsets (e.g., separating data by department or category). If omitted, the window spans the entire dataset.
  • ORDER BY: Sorts rows within each partition. This is required for ranking and boundary-sensitive analytic functions.
  • ROWS or RANGE (Window Frame): Defines a sliding boundary of rows relative to the current row (e.g., calculating a 7-day moving average). 

Core Types of Window Functions
1. Ranking Functions
These functions assign numerical orders or ranks to rows within your partition. 
  • row_number(): Assigns a unique sequentially ascending integer starting at 1.
  • rank(): Assigns ranks but leaves gaps if there are duplicate values (e.g., 1, 2, 2, 4).
  • dense_rank(): Assigns ranks without leaving gaps for duplicate values (e.g., 1, 2, 2, 3). 
2. Analytic & Value Functions
These fetch data points from other rows relative to your current row position. 
  • lag(expr, offset): Grabs a value from a previous row.
  • lead(expr, offset): Grabs a value from a subsequent row.
  • first_value(expr) / last_value(expr): Returns the first or last value in the window frame. 
3. Aggregate Functions
Standard mathematical operations executed dynamically over the sliding window. 
  • sum(), avg(), min(), max(), count() [1]

Implementation Comparison
You can implement these functions using either Databricks SQL or the PySpark DataFrame API. 
Databricks SQL Approach
Below is a pure SQL example demonstrating a running total and employee ranking.
sql
SELECT 
    employee_id,
    department,
    salary,
    -- Running total of salary inside each department
    SUM(salary) OVER(
        PARTITION BY department 
        ORDER BY salary 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_salary_total,
    -- Rank employees by salary within their department
    DENSE_RANK() OVER(
        PARTITION BY department 
        ORDER BY salary DESC
    ) AS salary_rank
FROM employees;
PySpark DataFrame API Approach
To build the equivalent pipeline using PySpark, leverage the pyspark.sql.Window class. 
python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Define the window specifications
running_total_spec = (Window
                      .partitionBy("department")
                      .orderBy("salary")
                      .rowsBetween(Window.unboundedPreceding, Window.currentRow))

rank_spec = (Window
             .partitionBy("department")
             .orderBy(F.col("salary").desc()))

# 2. Apply specifications to the DataFrame
df_result = df.withColumn("running_salary_total", F.sum("salary").over(running_total_spec)) \
              .withColumn("salary_rank", F.dense_rank().over(rank_spec))

display(df_result)
Databricks Pro-Tip: The QUALIFY Clause
When working in Databricks SQL, filtering results based on a window function (like grabbing only salary_rank = 1) traditionally required subqueries or CTEs. Databricks simplifies this with the QUALIFY clause, allowing you to filter window metrics natively: 
sql
SELECT employee_id, department, salary
FROM employees
QUALIFY DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) = 1;

Question: azure databricks dataframe and dataset

In Azure Databricks, DataFrames and Datasets are the primary distributed data abstractions provided by the Apache Spark API for structured and semi-structured big data processing. While they share the same underlying optimization engines, a DataFrame is an untyped, tabular collection of data available across Python, Scala, R, and SQL, whereas a Dataset is a strongly-typed, object-oriented collection exclusive to Scala and Java. 
Quick Comparison
The following table contrasts the key structural differences between the two abstractions:
Feature DataFrameDataset
Type SafetyUntyped (Run-time errors)Strongly-typed (Compile-time safety)
Language SupportPython (PySpark), Scala, Java, R, SQLScala and Java only
Data RepresentationDistributed collection of generic Row objectsDistributed collection of JVM objects (e.g., custom Case Classes)
OptimizationFully optimized via Catalyst & Tungsten enginesFully optimized, but lamda functions can bypass some optimizations

Key Concepts
1. What is a DataFrame?
  • Tabular Structure: Organizes data into a two-dimensional table of named rows and columns, similar to a relational database table. 
  • Lazy Evaluation: Transformations (e.g., filter(), select()) build an execution plan but do not process data until an action (e.g., show(), count(), display()) is called. 
  • Unified Performance: Leverages Spark's Catalyst Optimizer to achieve identical high-speed execution regardless of whether you write code in Python, Scala, or SQL. 
  • Schema Inference: Automatically detects column names and data types when reading structured files like Parquet, Delta Lake, or CSV with headers. 
2. What is a Dataset?
  • An Extension of DataFrames: Technically, a DataFrame is just an alias for an untyped Dataset of Rows (Dataset[Row]).
  • Compile-Time Safety: Catches syntax and type mismatches during compilation rather than failing mid-way through a long production cluster job.
  • Object-Oriented: Maps row items directly to domain-specific Java or Scala objects. 

When to Use Which?
Use DataFrames when:
  • You primarily use Python (PySpark) or SQL inside your Azure Databricks notebooks.
  • You require rapid data manipulation, aggregations, visualization, and out-of-the-box performance optimizations.
  • You want to seamlessly convert data to or from the Pandas API on Spark for data science tasks. [
Use Datasets when:
  • You are building complex production pipelines using Scala or Java.
  • Your pipeline requires strict type-safety guarantees to prevent unexpected schema errors downstream.
  • You want to map rows to a strongly-typed domain model using Scala Case Classes. 
Code Examples
Creating and Using a DataFrame (Python / PySpark)
python
# Read an example CSV file into a DataFrame
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

# Filter and select columns (Transformation)
filtered_df = df.filter(df["2015 Median Price"] > 500000).select("State", "2015 Median Price")

# Display the results using Databricks built-in tool (Action)
display(filtered_df)
Creating and Using a Dataset (Scala)
scala
// Define a strict schema using a Case Class
case class HomePrice(State: String, Population: Long, MedianPrice: Double)

// Read data and map it directly to the Case Class object to create a Dataset
val dataset = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
  .as[HomePrice] // Enforces strong typing

// Compile-time safe filtering
val highPriceDS = dataset.filter(row => row.MedianPrice > 500000)

// Display results
display(highPriceDS)




Question : azure databricks join


Azure Databricks supports standard SQL and PySpark syntax to join multiple datasets using a common key. You can execute these joins via Databricks SQL or Spark DataFrames. [1, 2, 3, 4, 5]
PySpark Syntax
In Python notebooks, use the df.join() method.
python
# Syntax
df_result = df_left.join(df_right, df_left.key == df_right.key, how="inner")
Common PySpark Join Types:
  • inner: Default type. Returns rows with matching keys in both DataFrames.
  • left / leftouter: Keeps all rows from the left DataFrame and appends matching right side rows. Missing matches become null.
  • right / rightouter: Keeps all rows from the right DataFrame and appends matching left side rows.
  • full / fullouter: Returns all records from both sides, filling un-matched values with null.
  • semi / leftsemi: Returns columns only from the left DataFrame where a matching key exists on the right.
  • anti / leftanti: Returns columns only from the left DataFrame where no matching key exists on the right. [1, 2, 3]

SQL Syntax
When writing standard SQL or running cells with %sql, use the standard ANSI-SQL JOIN syntax: [1, 2]
sql
SELECT * 
FROM table_a AS a
INNER JOIN table_b AS b
  ON a.id = b.id;
Optimization Strategies
Because large-scale joins can trigger expensive network shuffling, Azure Databricks provides tools to optimize performance:
  • Broadcast Joins: Databricks automatically broadcasts small tables to all worker nodes to avoid full shuffles.
  • You can explicitly force this using a Join Hint: SELECT /*+ BROADCAST(b) */ * FROM table_a a JOIN table_b b ON a.id = b.id. [1, 2, 3]
  • Photon Engine: Keep the Photon Engine active on your cluster configuration,
as it natively accelerates join evaluation and chooses optimal physical execution plans.
  • Analyze Tables: Keep table statistics fresh so the Databricks query optimizer can order
joins accurately. Run ANALYZE TABLE table_name COMPUTE STATISTICS before processing.




Databricks interviews focus on your ability to use the Data Lakehouse architecture, optimize distributed computing (Apache Spark), and manage data governance (Unity Catalog). Preparation should span across architecture, Delta Lake features, and practical performance tuning. 
1. Core Architecture & Concepts
Q: What is the Databricks Lakehouse architecture?
A: It is a unified platform that combines the reliability, governance, and ACID transactions of a data warehouse with the scalability, flexibility, and low-cost storage of a data lake. It unifies batch and streaming data into a single system. 
Q: What is Unity Catalog?
A: It is a fine-grained centralized governance solution that manages data access and permissions across all Databricks workspaces. It provides centralized auditing, data lineage, and row/column-level security. 
Q: Difference between an All-Purpose Cluster and a Job Cluster?
A:
  • All-Purpose: Used for interactive exploration, collaborative notebook sharing, and ad-hoc analysis. Takes longer to start and incurs higher costs. 
  • Job Cluster: Dedicated, automated clusters explicitly created to run a specific automated workflow or task. They terminate immediately upon job completion, saving computing costs. 

2. Delta Lake Features
Q: What are the main benefits of Delta Lake over standard Parquet?
A: Delta Lake provides ACID transactions (preventing corrupted reads/writes), enforces data schemas (schema enforcement), and allows you to query historical data states via "Time Travel". 
Q: Explain the difference between VACUUM and OPTIMIZE.
A: [1]
  • OPTIMIZE: Compacts small files in a Delta table into larger, optimized files, which speeds up read queries.
  • VACUUM: Permanently deletes physical data files that are no longer referenced by the Delta transaction log (e.g., historical data past the retention threshold). It is used to reclaim cloud storage and reduce costs. 
Q: What is Schema Evolution in Delta Lake?
A: It allows a pipeline to automatically adapt to changes in data structure without breaking the target table. You typically enable it by using the .option("mergeSchema", "true") setting. 

3. Spark Optimization & Debugging
Q: How do you handle Data Skew in Databricks?
A: Data skew occurs when a few specific keys hold disproportionately more data, slowing down the executors processing them. You can handle this by: [1, 2]
  • Salting the skewed keys (adding random prefixes to distribute data across executors).
  • Increasing the spark.sql.shuffle.partitions configuration.
  • Utilizing Adaptive Query Execution (AQE), which dynamically coalesces skewed partition groups. [1, 2, 3, 4, 5]

Q: When would you use a Broadcast Join?
A: When you are joining a very large table with a significantly smaller table. Spark sends (broadcasts) the small table to the memory of every executor, preventing expensive network shuffle operations. Risk: Running out of executor memory if the small table is too large. [1, 2]


 Databricks data engineering interviews comprehensively test architectural concepts, real-time optimization strategies, and robust programmatic problem-solving. [1, 2]

The following selection delivers 4 major Databricks interview questions spanning Architecture, Optimization, Data Ingestion, and a Coding Scenario. Each question includes detailed explanations, technical context, comprehensive examples, and specific test cases with sample PySpark/SQL implementations. [1, 2, 3, 4, 5]

1. Handling Heavy Data Skew with Salting
Question
What is Data Skew in Apache Spark, how does it negatively affect Databricks clusters, and how do you resolve it using a technical approach like Salting? [1, 2]
Technical Context
Data skew occurs when an asymmetric distribution causes a single or small subset of keys to hold a massive portion of the total records. During wide transformations (like a JOIN or groupBy), Spark hashes data keys to assign them to specific partitions. A skewed key forces a solitary executor node to process vastly more data than its peers. This creates a "long tail" problem where the cluster remains underutilized while one node runs out of memory (OOM) or bottlenecks the entire pipeline. [1, 2]
Salting breaks up this bottleneck by appending a randomized integer (a "salt") to the join keys of the skewed dataset, while exploding the lookup table to match those generated salt variations. [1, 2, 3]
Skewed Key ("A") ----> Appended Salt ----> Distributed Keys: "A_0", "A_1", "A_2"
Detailed Example & Implementation
Imagine joining a multi-terabyte Transactions Table (highly skewed on a popular product_id, such as 101) with a small Products Dimension Table. [1]
python
from pyspark.sql import functions as F

# 1. Define salt range
num_salts = 3

# 2. Add salt to the skewed large table
transactions_salted = transactions_df.withColumn(
    "salt", 
    F.concat(F.col("product_id"), F.lit("_"), F.randint(0, num_salts - 1))
)

# 3. Explode the small dimension table to match every possible salt value
salts_df = spark.range(0, num_salts).withColumnRenamed("id", "salt_val")

products_exploded = products_df.join(salts_df) \
    .withColumn("salt", F.concat(F.col("product_id"), F.lit("_"), F.col("salt_val"))) \
    .drop("salt_val")

# 4. Perform the optimized join on the salted key
optimized_joined_df = transactions_salted.join(
    products_exploded,
    on="salt",
    how="inner"
).drop("salt")
Use code with caution.
Test Cases
  • Test Case 1: Skewed Key Distribution
    • Input Data: transactions_df contains 1,000,000 records where product_id = 101 and 10 records where product_id = 202.
    • Expected Behavior: Without salting, Spark creates a single massive partition for 101. With a salt range of 3, the records for 101 split roughly evenly into 3 distinct processing buckets (101_0, 101_1, 101_2), balancing the cluster load.
  • Test Case 2: Join Cardinality and Correctness Verification
    • Input Data: Single transaction record [product_id: 101, amount: 50]. Dimension record [product_id: 101, name: 'Widget'].
    • Expected Output: The output must yield exactly 1 joined record: [product_id: 101, amount: 50, name: 'Widget']. The test guarantees that exploding the dimension table does not introduce artificial row duplication.

2. Implementing SCD Type 2 tracking in Delta Lake
Question
How do you efficiently implement a Slowly Changing Dimension Type 2 (SCD Type 2) pipeline within a Databricks Delta Lake table using native commands? [1, 2]
Technical Context
SCD Type 2 tracks historical data by creating multiple records for a given natural key, differentiating them with state tracking fields like start_date, end_date, and a boolean is_current flag. Delta Lake handles this pattern natively through the MERGE INTO statement. To perform a historical update, existing active records must be marked as expired (is_current = false), and completely new records must be appended as the new ground truth. [1, 2, 3, 4, 5]
Detailed Example & Implementation
We need to track changes to an internal Employees Delta Table. If an employee changes their department, the old record must terminate, and a new record must begin.
sql
-- Target table initialization configuration
CREATE TABLE delta.employees (
    emp_id INT,
    name STRING,
    department STRING,
    start_date DATE,
    end_date DATE,
    is_current BOOLEAN
) USING DELTA;

-- Merge logic processing staging data
MERGE INTO delta.employees AS target
USING (
    -- Combine new updates with an explicit null-key row to force insertions of changed values
    SELECT src.emp_id AS merge_key, src.* FROM staging_updates src
    UNION ALL
    SELECT NULL AS merge_key, src.* FROM staging_updates src
    JOIN delta.employees tgt ON src.emp_id = tgt.emp_id
    WHERE tgt.is_current = true AND src.department <> tgt.department
) AS source
ON target.emp_id = source.merge_key AND target.is_current = true
WHEN MATCHED AND target.department <> source.department THEN
    UPDATE SET target.end_date = source.start_date, target.is_current = false
WHEN NOT MATCHED THEN
    INSERT (emp_id, name, department, start_date, end_date, is_current)
    VALUES (source.emp_id, source.name, source.department, source.start_date, NULL, true);
Use code with caution.
Test Cases
  • Test Case 1: Processing a Brand New Record
    • Input Data: Staging record [emp_id: 5, name: 'Alice', department: 'HR', start_date: '2026-06-01'] targeting an empty target table.
    • Expected Output: One row added: [emp_id: 5, ..., department: 'HR', start_date: '2026-06-01', end_date: NULL, is_current: true].
  • Test Case 2: Historical Value Mutation (The Update)
    • Input Data: Target table contains active record for Alice in 'HR'. Staging receives: [emp_id: 5, name: 'Alice', department: 'Engineering', start_date: '2026-06-25'].
    • Expected Output: The target table must contain exactly two rows for emp_id: 5:
      1. The legacy row updated to: [department: 'HR', end_date: '2026-06-25', is_current: false].
      2. The newly created row added as: [department: 'Engineering', start_date: '2026-06-25', end_date: NULL, is_current: true].

3. Scaling File Ingestion with Auto Loader
Question
What is Databricks Auto Loader, how does it work under the hood, and what advantages does it offer over standard Spark Structured Streaming file readers? [1, 2, 3]
Technical Context
Traditional Spark directory streaming routinely scans cloud object storage containers to find newly added objects. As file quantities scale into the millions, directory listing operations become painfully slow and prohibitively expensive. [1, 2]
Databricks Auto Loader (cloudFiles) solves this using two operational mechanisms: [1, 2]
  1. Directory Listing Mode: It efficiently enumerates files and retains state directly inside RocksDB checkpoints, completely removing the need to re-scan previously ingested assets. [1]
  2. File Notification Mode: It automatically configures and listens to cloud notification systems (like AWS SNS/SQS or Azure Event Grid/Queue), directly queuing incoming files without running manual system storage scans. [1, 2, 3, 4]
Detailed Example & Implementation
Streaming JSON file extraction from a landing storage container with explicit automatic schema inference and evolution tracking. [1, 2]
python
# Streaming data ingestion via Auto Loader
checkpoint_path = "dbfs:/mnt/checkpoints/json_ingestion"
schema_location = "dbfs:/mnt/metadata/schemas/json_ingestion"

streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") # Automatic schema capture
    .load("dbfs:/mnt/landing_zone/raw_events/"))

query = (streaming_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .start("dbfs:/mnt/silver_zone/events_table"))
Use code with caution.
Test Cases
  • Test Case 1: Incremental Execution and Processing Latency
    • Input Data: 50,000 files exist in /raw_events/. The pipeline processes them completely. 10 new files are dropped into the bucket.
    • Expected Behavior: Auto Loader identifies and streams only the 10 incoming files instantly using the notification queue, skipping expensive directory scanning. [1, 2]
  • Test Case 2: Schema Evolution Verification
    • Input Data: System reads initial records with fields [id, timestamp]. A new batch introduces a modified structure containing a third column: [id, timestamp, geo_location].
    • Expected Output: The streaming target Delta table alters dynamically to include geo_location without failing the workflow, populating previous records with a NULL fallback. [1, 2]

4. Algorithmic Coding Challenge — IP Range to CIDR Merging
Question
Write a Python/PySpark solution that takes a series of network log IP addresses and collapses contiguous ranges into minimal CIDR blocks. [1]
Technical Context
This programmatic scenario evaluates your algorithmic fluency, object optimization strategies, and network architecture logic. Databricks profiles rely heavily on transforming IP data string representations into numerical 32-bit unsigned integer values to cleanly execute mathematical ranges, interval scans, and sliding optimization windows. [1, 2]
Detailed Example & Implementation
python
import ipaddress

def merge_ips_to_cidr(ip_list):
    """
    Parses a string list of individual IPv4 strings, collapses contiguity,
    and returns the optimized, minimum possible list of representative CIDR blocks.
    """
    if not ip_list:
        return []
        
    # Convert text strings into structured IP objects, sorting ascending
    ips = sorted([ipaddress.IPv4Address(ip) for ip in ip_list])
    
    merged_cidrs = []
    current_range = []
    
    for ip in ips:
        if not current_range:
            current_range.append(ip)
        # Check if the current IP address is strictly contiguous with the previous element
        elif int(ip) == int(current_range[-1]) + 1:
            current_range.append(ip)
        else:
            # Collapse completed range into native CIDR representations
            start_ip = current_range[0]
            end_ip = current_range[-1]
            merged_cidrs.extend(ipaddress.summarize_address_range(start_ip, end_ip))
            current_range = [ip]
            
    if current_range:
        merged_cidrs.extend(ipaddress.summarize_address_range(current_range[0], current_range[-1]))
        
    return [str(cidr) for cidr in merged_cidrs]

# Example execution verification
input_ips = ["192.168.1.0", "192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.5"]
print(merge_ips_to_cidr(input_ips))
# Output: ['192.168.1.0/30', '192.168.1.5/32']
Use code with caution.
Test Cases
  • Test Case 1: Perfect Contiguous Block Optimization
    • Input Data: ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3"]
    • Expected Output: ['10.0.0.0/30'] (collapses four 32-bit addresses into a single subnet).
  • Test Case 2: Complete Discontinuity Handling
    • Input Data: ["192.168.0.1", "10.0.0.1"]
    • Expected Output: ['10.0.0.1/32', '192.168.0.1/32'] (order sorted, zero