Thursday, 25 June 2026

databricks interview questions and answer

Question :  Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns


Architecture & Project Context
Q: Can you describe the end-to-end architecture of this Spark application in Databricks?
  • Answer: We ingested data from multiple file formats stored in cloud storage into Databricks.
  • We used Delta Lake's medallion architecture (Bronze, Silver, Gold).
  • Raw data was loaded into Bronze tables.
  • Cleaned and transformed data moved to Silver tables.
  • Aggregated data for customer usage insights went to Gold tables.
  • Final insights were queried by BI tools like Tableau. [1, 2, 3, 4, 5]
🛠️ Data Extraction & Formats
Q: What file formats did you work with, and how did Spark handle them?
  • Answer: We processed Parquet, JSON, and CSV files.
  • Parquet was used for optimal read performance due to columnar storage.
  • JSON handled semi-structured, nested customer interaction data.
  • CSVs were mostly used for legacy master data lookups.
  • We used spark.read.format() to load them. [1, 2, 3, 4, 5]
Q: How did you handle schema evolution when customer usage data formats changed?
  • Answer: We leveraged Delta Lake’s native schema evolution capabilities.
  • We enabled .option("mergeSchema", "true") during writes.
  • This allowed new columns to be added automatically.
  • It prevented job failures when source schemas changed. [1, 2, 3, 4, 5]
🔄 Transformation & Aggregation
Q: What kind of Spark-SQL functions did you use for customer usage analysis?
  • Answer: We heavily used window functions like ROW_NUMBER() and LAG().
  • These helped analyze session durations and usage trends.
  • We used built-in functions like date_trunc() for time-series aggregation.
  • Complex nested JSON fields were parsed using explode() and outer_join. [1, 2, 3, 4, 5]
Q: How did you handle data aggregation from multiple mismatched sources?
  • Answer: We standardized data types during the extraction phase.
  • We broadcast-joined smaller lookup tables to large usage logs.
  • We used COALESCE() to handle missing customer identifiers across sources. [1]
⚡ Performance Tuning & Optimization
Q: How did you optimize your Spark-SQL queries in Databricks for large datasets?
  • Answer: We utilized Delta Lake's OPTIMIZE and Z-ORDER BY commands.
  • We Z-ordered tables by customer_id and timestamp.
  • This drastically improved query speed via data skipping.
  • We also eliminated data skew using salting techniques. [1, 2, 3, 4, 5]
Q: Did you face any out-of-memory (OOM) errors, and how did you resolve them?
  • Answer: Yes, we encountered OOM errors due to large shuffles.
  • We resolved this by increasing spark.sql.shuffle.partitions.
  • We also enabled Databricks Adaptive Query Execution (AQE).
  • AQE automatically coalesced shuffle partitions at runtime. [1, 2, 3, 4]



Question : Participate in Design, Development & Deployment activities. Build complex queries and mappings which would need constant interaction with the business users to understand their requirements. And it also includes providing appropriate suggestions to the business team on improving the performance and accuracy of their existing systems.

Here is how you can answer this interview question to highlight your technical expertise, business acumen, and consultative approach.
The Strategy
Structure your answer using the STAR method (Situation, Task, Action, Result). Frame yourself not just as a developer who takes orders, but as a strategic partner who translates business needs into scalable solutions and proactively optimizes performance. [1, 2, 3]

Suggested Answer
"In my previous roles, I’ve always viewed the gap between technical teams and business stakeholders as a bridge I actively help build.
Situation & Task:
For instance, in my last project, the business team was struggling with slow reporting times and inaccurate data reconciliation. My task was to design and develop a new data mapping and query pipeline, while simultaneously redesigning their existing legacy system to be faster and more reliable. [1, 2, 3, 4, 5]
Action:
First, I conducted iterative requirement-gathering sessions with the business stakeholders to understand their exact pain points and end goals. Instead of just building what they asked for, I took a consultative approach. I mapped out their complex data requirements and, during the process, identified bottlenecks in their current logic. I proactively suggested and implemented optimized SQL queries, indexing, and streamlined data mappings that reduced data processing time significantly. [1, 2, 3, 4, 5]
Result:
By maintaining constant communication, I ensured the business team understood how the new system would operate. The result was a \(40\%\) reduction in report generation time, a massive improvement in data accuracy, and the business team was empowered with a system that directly answered their strategic needs." [1, 2]

Key Highlights to Emphasize (Pro-Tips)
To stand out, weave these phrases and concepts into your interview:
  • Active Listening: Mention that you focus on understanding the why behind a business request rather than just the what. [1]
  • Translational Skills: Highlight your ability to translate complex technical jargon (like ETL mappings, query execution plans, or database schemas) into plain, actionable business language. [1, 2]
  • Proactive Optimization: Use keywords like indexing, partitioning, caching, or normalization when discussing performance suggestions. Show that you care about system health, not just making it work. [1]
  • Iterative Feedback: Stress that you don't just deliver at the end. You provide mock-ups, prototypes, or regular demos to keep business users involved and ensure no time is wasted on building the wrong thing.


Part 1: The Behavioral Interview Question & Answer
Interviewer Question:
"I see on your resume that you optimized large-scale data processing pipelines in Azure Databricks, reducing processing time by 40%, while using Azure Data Factory for orchestration. Can you walk me through this project, the challenges you faced, and exactly how you achieved that 40% reduction?" [1, 2]
Your Answer (STAR Method):
Situation
"In my previous role, our enterprise data platform was struggling to keep up with daily data volume growth, which had scaled to multiple terabytes per day. Our data processing pipelines were regularly missing their SLAs, which delayed daily business intelligence reporting. We were pulling data from several disparate sources—including on-premises SQL databases, SaaS applications, and raw third-party cloud storage." [1, 2, 3, 4]
Task
"My responsibility was twofold: first, design a seamless, automated ingestion architecture to integrate these multi-source datasets; second, overhaul and optimize our core transformation pipelines within Azure Databricks to drastically lower execution times and stabilize costs." [1, 2, 3, 4]
Action
"I approached this by splitting the solution into orchestration and computation phases:
  • Orchestration & Integration: I used Azure Data Factory (ADF) to build a metadata-driven ingestion framework. I implemented incremental data loading using watermark tables to avoid full table scans. To manage the multiple sources efficiently, I configured dynamic linked services and utilized parameterized ForEach activities running in parallel with optimized Data Integration Units (DIUs). [1, 2, 3, 4, 5, 6, 7, 8]
  • Databricks Optimization: Once ADF staged the raw data into an Azure Data Lake (ADLS Gen2), I triggered Databricks notebooks via ADF. To achieve the 40% performance gain, I targeted three core Spark bottlenecks:
    1. Data Skew: I discovered heavy data skew on our primary join keys, causing specific cluster nodes to bottleneck. I fixed this by implementing salting techniques (adding random keys) to distribute the data evenly across partitions.
    2. Shuffle Reduction: I replaced expensive wide transformations with broadcast joins for our smaller lookup tables, eliminating massive network overhead.
    3. Storage Layer: I converted legacy formats into Delta Lake tables, leveraging Z-Order clustering on frequently filtered columns and setting up automated VACCUM and OPTIMIZE commands to eliminate the 'small file problem'." [1, 2, 3, 4, 5, 6]
Result
"As a result of these changes, overall end-to-end data processing time dropped by 40%, safely restoring our daily reporting SLA. Furthermore, because the Databricks clusters were active for a much shorter duration and scaled down efficiently using auto-scaling, we reduced our monthly Azure compute costs significantly." [1, 2, 3, 4, 5]

Part 2: Hard Technical Follow-Up Questions to Prepare For
If you give the answer above, a sharp interviewer will test your technical depth with these follow-up questions:
Q1: How exactly did ADF hand off control to Databricks, and how did you pass parameters between them?
  • Answer: I used the native Azure Databricks Notebook Activity inside ADF. To keep the pipeline dynamic, I passed parameters (like window_start_date or source_path) from ADF by defining base parameters in the ADF activity UI. In the Databricks PySpark notebook, I retrieved those values using dbutils.widgets.get("parameter_name"). [1, 2, 3, 4, 5]
Q2: What type of Integration Runtime (IR) did you use in ADF for your multi-source setup?
  • Answer: I used a combination. For cloud-to-cloud data movement (SaaS and cloud storage), I relied on the default Azure Integration Runtime. For pulling data securely from on-premises SQL databases without exposing them to the public internet, I deployed a Self-Hosted Integration Runtime on a virtual machine within the client’s private network. [1, 2, 3, 4, 5]
Q3: What is the difference between Spark's repartition and coalesce, and which one did you use during optimization?
  • Answer:
    • repartition() increases or decreases partitions but always forces a full data shuffle across the network.
    • coalesce() can only decrease the number of partitions and minimizes shuffling because it shifts data locally within existing nodes.
    • Application: I used coalesce() right before writing final outputs to the Data Lake to merge small partitions into a few large files efficiently without triggering a massive shuffle. [1, 2, 3]
Q4: How did you handle pipeline failures or data issues mid-execution?
  • Answer: Inside ADF, I implemented a robust try-catch pattern using activity dependencies. If the Databricks notebook failed, an 'Upon Failure' path triggered an ADF Web Activity that sent an alert webhook to our Microsoft Teams/Slack channel with the exact run ID. On the data layer, because we used Delta Lake, any failed notebook run rolled back automatically due to Delta’s ACID transaction capabilities, avoiding partial or corrupted data load

Q1: How do you identify a slow-performing SQL query or data mapping? [1, 2]
  • Analyze execution plans. Look for full table scans, high-cost joins, or sorting operations.
  • Monitor database statistics. Check CPU usage, I/O bottlenecks, and long-running query logs.
  • Review mapping bottlenecks. Identify whether the delay happens at the source, transformation, or target layer.
  • Use profiling tools. Utilize database-specific profilers (like Oracle TKPROF or SQL Server Profiler) to track execution time. [1, 2, 3, 4, 5]
Q2: What strategies do you use to optimize a heavy SQL query?
  • Index columns smartly. Add composite or clustered indexes on frequently filtered (WHERE) or joined (ON) columns.
  • Avoid SELECT *. Fetch only the specific columns needed to reduce I/O and memory overhead.
  • Replace subqueries. Convert correlated subqueries into JOIN statements or Common Table Expressions (CTEs) where appropriate.
  • Use partition pruning. Filter queries using partitioning keys to scan only relevant data slices.
  • Replace LIKE '%text%'. Avoid leading wildcards because they prevent the database from using indexes. [1, 2, 3, 4, 5]
Q3: How do you tune data integration (ETL/ELT) mappings for maximum efficiency?
  • Filter data early. Apply source qualifiers or filters immediately to minimize the volume of data moving through the pipeline.
  • Reduce cache size. Optimize memory allocation for Joiner, Aggregator, or Lookup transformations to avoid disk swapping.
  • Use bulk loading. Switch from normal insert mode to bulk loading to bypass database transaction logging during high-volume inserts.
  • Leverage pushdown optimization. Push transformation logic back to the source or target database to utilize native database engines. [1, 2, 3, 4, 5]

Data Validation & Unit Testing
Q4: What is your step-by-step approach to unit testing a new data mapping?
  • Row count validation. Verify that the total number of records loaded matches the expected source count after applying filters.
  • Column mapping verification. Ensure every source column correctly populates its designated target column.
  • Data type checking. Confirm data types did not truncate or change unexpectedly during transformation.
  • Boundary value testing. Test mappings with extreme data values, nulls, and empty strings to ensure stability. [1, 2, 3, 4, 5]
Q5: How do you validate data integrity between the source and target systems?
  • Minus/Except queries. Run a MINUS (or EXCEPT) SQL query between source and target datasets to find mismatched records.
  • Aggregate checksums. Compare sums, averages, and counts of critical numeric columns across both systems.
  • Sample data audits. Spot-check random data rows using unique identifiers to verify exact data replication. [1, 2, 3, 4, 5]

Scenario-Based Questions
Q6: Describe a time a query or mapping failed performance expectations. How did you fix it?
  • Situation: A daily data load mapping started taking four hours to complete instead of twenty minutes.
  • Task: Identify the bottleneck and restore the original performance baseline.
  • Action: I analyzed the execution plan and found a nested loop join causing a full table scan on a millions-row table. I created a composite index on the join keys and increased the integration tool's lookup cache size.
  • Result: The execution time dropped down to fifteen minutes, beating the original baseline. [1, 2, 3, 4, 5]
Q7: How do you handle and validate duplicate records during the mapping process?
  • Identify duplicates. Use GROUP BY and HAVING COUNT(*) > 1 queries to flag duplicate keys in the staging area.
  • Use deduplication logic. Implement standard transformations (like Row_Number or Unique functions) to keep only the latest record based on a timestamp.
  • Reject invalid data. Route duplicate records to an error logging table for investigation without stopping the main data pipeline. [1, 2, 3, 4, 5]

Q1: Walk me through the architecture of your DLT pipeline from ingestion to final delivery.
  • Answer: I implemented a standard Medallion Architecture using declarative Python files within Azure Databricks.
    • Bronze Layer: Acts as the raw landing zone. I used Auto Loader (cloudFiles) to ingest streaming JSON/CSV files incrementally from Azure Data Lake Storage (ADLS) Gen2.
    • Silver Layer: Cleanses, refines, and flattens the data. I applied DLT Expectations to quarantine corrupt records and performed necessary stream-to-static joins.
    • Gold Layer: Aggregates the business logic. I built business-level materialized views optimized with Z-ORDER to power downstream reporting. [1, 2, 3, 4, 5]
Q2: Why did you choose Delta Live Tables (DLT) over standard Structured Streaming notebooks?
  • Answer: DLT significantly reduces the operational overhead of managing state, schema evolution, and dependencies:
    • Declarative DAGs: DLT automatically builds and manages the data lineage and task dependencies between Bronze, Silver, and Gold tables.
    • Automated Infrastructure: It handles cluster autoscaling (Enhanced Autoscaling) and optimizes the underlying Delta tables without manual VACUUM or OPTIMIZE tuning.
    • Built-in Quality Control: It natively supports "Expectations" to prevent bad data from corrupting downstream production environments. [1, 2, 3, 4, 5]

2. Deep Technical & Implementation Questions
Q3: How did you configure Auto Loader within DLT to ingest data from Azure Storage?
  • Answer: I defined a streaming table using Python and passed the cloudFiles format to detect files incrementally. [1]
python
import dlt.pipelines as dp # Recommended 2026 syntax

@dp.table(name="bronze_events")
def bronze_events():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "abfss://container@storage.dfs.core.windows.net/schemas/bronze")
        .load("abfss://container@storage.dfs.core.windows.net/raw_incoming/")
    )
Use code with caution.
Q4: How does DLT handle Schema Evolution when a raw file structure changes in Azure Storage?
  • Answer: Auto Loader handles schema evolution gracefully based on the configuration:
    • By default, it uses Schema Inference to detect new columns on the fly.
    • I configured cloudFiles.schemaEvolutionMode to addNewColumns so the pipeline automatically catches new fields without crashing, appending them to the target Delta table. [1, 2, 3, 4, 5]
Q5: What is the difference between a Streaming Table and a Materialized View in DLT?
  • Answer: They handle data processing and historical tracking completely differently:
    • Streaming Table: Processes each incoming record exactly once. It only appends new data and preserves the full historical stream (ideal for Bronze and Silver layers).
    • Materialized View: Computes updates based on changes to the underlying source. If data changes or queries require complex aggregations, DLT recalculates the view to reflect the correct state (ideal for Gold layers). [1, 2, 3, 4, 5]

3. Scenario & Error Handling Questions (The "Real World" Tests)
Q6: How did you implement Data Quality checks, and what happens when an asset fails a rule?
  • Answer: I used DLT Expectations to enforce strict data constraints. I typically defined three levels of rules:
    • Track (expect): Logs data quality metrics to the event log but allows records to pass through.
    • Drop (expect_or_drop): Silently discards invalid records so they never hit the Silver layer.
    • Fail (expect_or_fail): Halts the entire pipeline immediately if critical fields (like transaction_id) are missing. [1, 2, 3, 4, 5]
Q7: Your pipeline is in "Continuous Mode", but upstream Azure Storage sends late-arriving data for the previous day. How do you handle this? [1]
  • Answer: DLT tables process streaming logs sequentially, so it automatically appends the late data. However, to ensure downstream Gold aggregates don't become inaccurate, I used Watermarking combined with Window Aggregations to define an acceptable threshold (e.g., 2 hours) for late-arriving events before closing the time window. [1, 2, 3, 4, 5]
Q8: How did you monitor the health and performance of your production DLT pipelines?
  • Answer: I relied on two primary vectors:
    • DLT UI Graph View: Provided real-time visualization of data flow volume, processing latencies, and data quality check failures per table.
    • Delta Live Tables Event Log: DLT writes telemetry data into a hidden Delta table. I wrote custom Databricks SQL queries on this event log to track ingestion trends, calculate data drift, and set up automated alerts. [1, 2, 3, 4]


Q1: Why did you choose to use both Python (PySpark) and Scala in the same project instead of sticking to just one? [1]
  • Answer: We used Scala for our core, heavy-duty data transformation pipelines because of its native performance advantages, type safety, and the fact that Apache Spark is written in Scala. We utilized Python (PySpark) primarily for downstream data analysis, quick prototyping, and integration with data science libraries like Pandas or Scikit-Learn to uncover the customer usage patterns. This hybrid approach allowed the data engineering team to build robust pipelines while letting analysts query the data easily. [1, 2, 3, 4, 5]
Q2: How did you decide when to use Spark SQL versus Spark DataFrames/Datasets in your application?
  • Answer: I preferred Spark SQL for complex aggregations and when collaborating with business analysts who were fluent in SQL, as it made the logic easy to review and audit. For complex programmatic transformations, iterative loops, or when I needed compile-time type safety (in Scala), I used the DataFrame or Dataset API. Under the hood, both use the Catalyst Optimizer, so the performance was nearly identical. [1, 2, 3]
Data Extraction & Ingestion Questions
Q3: You mentioned extracting data from multiple file formats. Which formats did you work with, and how did you handle schema evolution?
  • Answer: We extracted data from CSV, JSON, and Parquet formats. CSV and JSON were typically used for landing raw customer activity logs, while Parquet was used for our optimized data lake storage. To handle schema evolution—such as when new columns were added to customer usage logs—we enabled schema merging (.option("mergeSchema", "true")) when reading Parquet files, or we enforced a strict target schema using a defined struct type to safely handle missing or mismatched fields. [1, 2, 3]
Q4: Customer usage data can often have nested structures (like JSON logs). How do you handle nested arrays or maps in Spark?
  • Answer: To transform and analyze nested JSON data, I used Spark SQL functions like explode() to flatten arrays into individual rows, and dot notation (e.g., dataframe.select("user.profile.location")) to extract nested objects. If the nested structure was highly complex, I used from_json() along with an explicitly defined schema to parse the stringified JSON columns into structured columns efficiently. [1, 2, 3, 4]
Transformation & Aggregation Questions
Q5: What are some of the key aggregations you ran to "uncover insights into customer usage patterns"?
  • Answer: We focused on metrics like Daily Active Users (DAU), Monthly Active Users (MAU), average session duration, and feature utilization rates. I implemented these using Spark window functions (Window.partitionBy("user_id").orderBy("timestamp")) to calculate time-differences between consecutive customer actions, and group-by aggregations to bucket usage patterns by demographic or subscription tier. [1]
Q6: Data aggregation often triggers a shuffle. How did you optimize your Spark jobs to handle large-scale shuffles during aggregation?
  • Answer: Shuffles are expensive, so I optimized them in a few ways. First, I ensured that we used reduceByKey or DataFrame aggregations which perform map-side combinations to reduce data transfer. Second, I tuned spark.sql.shuffle.partitions based on our data size so that the tasks weren't too small or too large. Lastly, if we were joining a massive usage log table with a small customer metadata table, I used a broadcast join to completely eliminate the shuffle phase for that operation. [1, 2, 3, 4, 5]
Performance & Troubleshooting Questions
Q7: How did you handle data skewness in your customer usage data? For example, what if 1% of your users generated 50% of the total activity logs?
  • Answer: Data skew is a common issue with usage patterns. When aggregating by user_id, a few heavy users would slow down specific tasks. To solve this, I used "salting." I added a random isolated integer to the join/aggregation key to distribute the heavy users' data across multiple partitions. After performing the initial salted aggregation, I did a second aggregation without the salt to get the final total. [1, 2]
Q8: How did you debug a failing or slow Spark application in this project?
  • Answer: I started by checking the Spark UI. I would look at the "Stages" tab to identify which stage was taking the longest or failing. If I saw a high "Shuffle Read Time" or "Garbage Collection (GC) Time," I knew it was a memory or partitioning issue. I also checked the event timelines to see if a single task was running significantly longer than others, which indicated data skew. From there, I would adjust partition sizes, optimize the code, or allocate more executor memory accordingly. [1, 2, 3, 4, 5]
Q: How do you handle reading and optimizing multiple file formats (like Parquet, JSON, and CSV) in Databricks? [1]
  • Answer: PySpark uses the spark.read API to handle diverse formats. I use schema inference for Parquet since it stores metadata natively. For CSV and JSON, I explicitly define schemas using StructType to avoid performance-draining inference passes. I optimize storage by converting raw CSV/JSON into Delta Lake (Parquet) format to leverage ACID transactions and data skipping. [1, 2, 3, 4, 5]
Q: Why would you choose Delta Lake over standard Parquet formats in Databricks for customer usage analysis?
  • Answer: Delta Lake provides features crucial for usage analytics:
  • Time travel to audit historical usage data.
  • ACID transactions to prevent partial writes during failures.
  • MERGE INTO capabilities for easy upserts of customer profiles.
  • File compaction via OPTIMIZE and Z-ORDER to speed up queries. [1, 2, 3, 4, 5]

Data Transformation & Aggregation
Q: How do you implement data transformation and aggregation logic to uncover customer usage patterns?
  • Answer: I use a combination of PySpark DataFrame transformations and Spark-SQL.
  • Filtering & Cleaning: Remove duplicate sessions using .dropDuplicates().
  • Transformation: Use built-in functions like withColumn(), when(), and coalesce().
  • Aggregation: Use .groupBy("customer_id") combined with agg() to calculate metrics like total data consumed, active days, or peak usage hours. [1, 2, 3, 4, 5]
Q: Can you write a quick PySpark snippet to find the top 5 customers with the highest monthly data usage?
  • Answer:
python
from pyspark.sql import functions as F

top_customers = (df.withColumn("month", F.trunc("usage_date", "MM"))
                   .groupBy("customer_id", "month")
                   .agg(F.sum("data_consumed").alias("total_data"))
                   .orderBy(F.desc("total_data"))
                   .limit(5))
Use code with caution.

Optimization & Performance Tuning
Q: What strategy do you use when joining a massive customer usage dataset with a small customer demographics table?
  • Answer: I use a Broadcast Hash Join. By wrapping the small dataframe in broadcast(), Spark sends the small table to all worker nodes. This completely eliminates data shuffling across the network, which is the biggest bottleneck in Spark transformations. [1, 2, 3, 4, 5]
Q: How do you deal with data skewness when aggregating usage data by customer_id?
  • Answer: Data skew happens if a few power users generate millions of logs while average users generate few. To fix this:
  • I use salting, adding a random key prefix to distribute the data evenly.
  • I leverage Databricks Adaptive Query Execution (AQE), which automatically handles skew joins. [1, 2, 3]

Databricks Ecosystem & Production
Q: How do you balance using PySpark versus Spark-SQL in your Databricks notebooks?
  • Answer: I use PySpark for building dynamic, reusable data pipelines and handling complex file configurations. I switch to Spark-SQL using spark.sql() or %sql blocks for complex analytical aggregations and reporting, because SQL is highly readable for business analysts and performs identically due to the Catalyst Optimizer. [1, 2, 3, 4, 5]
Q: How do you productionize and schedule these customer usage insights pipelines in Databricks?
  • Answer: I use Databricks Workflows (Delta Live Tables or Job Clusters) to orchestrate the notebooks. The pipeline runs on an automated schedule, extracts daily raw files, transforms them using Delta lakes architecture (Bronze to Silver to Gold), and loads the aggregated insights into a gold table for BI tools. [1, 2, 3, 4, 5]

Q1: How do you optimize a PySpark DataFrame join if one table is very large and the other is small?
  • Answer: Use a broadcast join to copy the small DataFrame to all worker nodes. This eliminates the expensive data shuffling phase across the cluster.
  • Code Example: from pyspark.sql.functions import broadcast; large_df.join(broadcast(small_df), "id") [1, 2, 3, 4]
Q2: What is the difference between cache() and persist() in PySpark?
  • Answer: cache() stores the DataFrame in memory by default (MEMORY_AND_DISK). persist() allows custom storage levels, such as disk-only or serialized memory formats. [1, 2, 3, 4, 5]
Q3: How does Azure Databricks handle "Data Skew," and how do you fix it?
  • Answer: Data skew happens when one partition holds significantly more data than others, causing slow worker nodes. Fix it by using "salting" (adding a random key to distribute data) or enabling Databricks Adaptive Query Execution (AQE) skew join optimization. [1, 2, 3, 4, 5]
Q4: Explain the architecture of Delta Lake on Azure Databricks.
  • Answer: Delta Lake is an open-source storage layer on top of cloud storage. It provides ACID transactions, data versioning via a transaction log (Time Travel), and unified batch and streaming processing. [1, 2, 3, 4, 5]

🗄️ SQL Server
Q1: What is the difference between a Clustered and a Non-Clustered index in SQL Server?
  • Answer: A clustered index physically reorders the rows in the table based on the key (one per table). A non-clustered index stores data in one place and the index in another, acting like a pointer to the storage location. [1, 2, 3, 4, 5]
Q2: How do you find and resolve a deadlock in SQL Server?
  • Answer: Use SQL Server Profiler, Extended Events, or Trace Flag 1222 to capture deadlock graphs. Resolve it by ensuring transactions access tables in the same order, keeping transactions short, or using the NOLOCK hint. [1]

📝 Oracle Advanced PL/SQL & Tools
Q1: What is a Pipelined Function, and when should you use it?
  • Answer: A pipelined function returns data iteratively as a collection before the entire query finishes processing. Use it to improve performance in ETL processes by reducing memory consumption and allowing parallel processing.
  • Code Snippet: FUNCTION get_data RETURN test_type PIPELINED IS ... PIPE ROW(rec); RETURN; [1]
Q2: Explain the difference between a Strong Ref Cursor and a Weak Ref Cursor.
  • Answer: A strong Ref Cursor has a fixed, explicitly defined return type. A weak Ref Cursor does not have a defined return type and can associate with any query structure. [1, 2]
Q3: What are PL/SQL Collections, and which one allows negative indexes?
  • Answer: Collections are internal data structures like Associative Arrays, Nested Tables, and VARRAYs. Associative Arrays (Index-by tables) support negative integer indexes or string keys. [1, 2, 3, 4, 5]
Q4: Why would you use a Materialized View instead of a regular View, and what are its refresh modes?
  • Answer: Materialized Views physically store query results to boost performance on heavy analytical queries. They refresh via FAST (incremental using logs), COMPLETE (recompute all), or FORCE (automatic choice). [1, 2, 3]
Q5: What is the purpose of an External Table, and how does it differ from SQL*Loader?
  • Answer: External Tables allow querying flat files directly inside the database using standard SELECT statements without loading them. SQL*Loader is a command-line utility that permanently loads external data into physical database tables. [1]
Q6: What is a Hierarchical Query, and which clauses are required to build one?
  • Answer: It is a query used to display parent-child relationships (like an organizational chart). It requires the START WITH clause to define the root node and the CONNECT BY PRIOR clause to define the relationship link. [1, 2]

💼 Scenario-Based Behavioral Questions
Q1: "You need to migrate a legacy Oracle PL/SQL batch process into Azure Databricks. What is your approach?"
  • Answer:
    • Analyze the PL/SQL code to separate business logic from data storage.
    • Convert complex PL/SQL Cursors and Loops into set-based PySpark DataFrame operations.
    • Map Oracle tables to Azure Databricks Delta tables.
    • Schedule and orchestrate the new PySpark notebooks using Azure Data Factory (ADF) or Databricks Workflows. [1, 2]
Q2: "How do you ensure data quality when loading millions of rows from SQL Server into Databricks?"
  • Answer:
    • Use Delta Live Tables (DLT) expectations to flag or drop bad records automatically.
    • Implement row count validations and schema enforcement during the ingestion phase.
    • Log quarantine records to an isolated error table for auditing. [1, 2, 3]

Question : Data Engineer responsibility 

The primary responsibility of a Data Engineer specializing in Azure Databricks, PySpark, and SQL Server is to design, build, and optimize scalable end-to-end data pipelines (ETL/ELT) that ingest raw data from transactional databases like SQL Server, transform it using cluster computing via PySpark, and govern it using Azure Databricks lakehouse features. [1, 2, 3]

📋 Key Job Responsibilities
  • Pipeline Development: Build ETL/ELT pipelines using PySpark within Azure Databricks notebooks to ingest, process, and clean vast amounts of structured and unstructured data. [1, 2]
  • Architecture Implementation: Design and execute multi-layered storage architectures like the Medallion Architecture (Bronze, Silver, Gold layers) using Delta Lake. [1, 2]
  • Database Management: Connect to, query, and extract data from on-premises or cloud-based SQL Server databases, handling transactional records and history tables. [1]
  • Performance Optimization: Tune Spark clusters, manage partition strategies, optimize shuffle operations, and apply advanced data skipping like Z-ORDER indexing. [1, 2, 3]
  • Data Governance: Implement security, data lineage, and column/row-level permissions through Unity Catalog. [1, 2]
  • Orchestration & Target Loading: Automate and schedule production workflows using Databricks Workflows or Azure Data Factory, pushing aggregated business data back to SQL Server or Azure Synapse for reporting. [1, 2, 3]

💬 Common Interview Questions & Answers
Q1: How do you handle incremental data loading from SQL Server into Azure Databricks using PySpark?
Answer:
"I implement a hybrid strategy using watermarking and Databricks Auto Loader or PySpark JDBC connectors. First, I query SQL Server using a JDBC connection to pull data modified since the last timestamp record (stored in a watermark table). I read this incremental batch into a PySpark DataFrame. Then, I write the data into a target Delta Lake table using the .merge() command to effortlessly upsert (update/insert) changed or new records based on a unique primary key." [1, 2]
python
# Conceptual PySpark code snippet for reading from SQL Server
sql_server_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<server_name>.database.windows.net:1433;database=<db_name>") \
    .option("dbtable", f"(SELECT * FROM sales_table WHERE last_modified > '{last_watermark}') AS temp") \
    .option("user", username) \
    .option("password", password) \
    .load()
Use code with caution.
Q2: What is the Medallion Architecture, and why is it preferred over standard Parquet storage?
Answer:
"The Medallion Architecture organizes data into three logical layers: Bronze (raw, append-only ingestion), Silver (cleaned, deduplicated, and validated data), and Gold (aggregated, business-ready metrics for BI). [1]
We prefer Delta Lake over raw Parquet because Delta Lake supports ACID transactions, schema enforcement/evolution, data versioning via 'Time Travel', and optimal performance through the MERGE INTO statement." [1, 2, 3]
Q3: How do you optimize a PySpark job in Azure Databricks that suffers from severe Data Skew or Shuffle degradation?
Answer:
"To address data skew and heavy shuffles, I apply several techniques: [1, 2]
  • Adaptive Query Execution (AQE): I ensure spark.sql.adaptive.enabled is set to true, allowing Spark to balance partitions at runtime dynamically.
  • Broadcast Joins: If I am joining a large fact table from SQL Server with a small dimension table, I use broadcast(dim_df) to eliminate shuffles entirely. [1]
  • Salting: For extreme skew on join keys, I append a random integer ('salt') to distribute the concentrated data uniformly across cluster nodes.
  • Z-Ordering: I apply OPTIMIZE table Z-ORDER BY (column) on fields frequently used in filters to increase data skipping and decrease disk I/O." [1]
Q4: What is the difference between an Interactive Cluster and a Job Cluster in Azure Databricks? Which do you use for production?
Answer:
"An Interactive Cluster is manually created and stays active (until auto-termination limits hit) to allow multiple data engineers to perform ad-hoc development, testing, and exploration via shared notebooks. 
A Job Cluster is spun up programmatically by an orchestrator (like Azure Data Factory or Databricks Workflows) explicitly to execute a specific automated pipeline and terminates immediately after execution. For production environments, I strictly use Job Clusters because they are significantly cheaper and offer total workload isolation." 
Q5: How do you handle secure connectivity and sensitive credentials between Databricks and SQL Server?
Answer:
"I never hardcode credentials inside my PySpark notebooks. Instead, I store the SQL Server database password and user configurations inside Azure Key Vault. I then create an Azure Key Vault-backed Secret Scope in Azure Databricks. This allows me to securely fetch database connection strings at runtime directly into my PySpark context using the utility module: dbutils.secrets.get(scope = 'my-scope', key = 'sql-pass').


Question : spark architecture databricks

Apache Spark on Databricks operates on a distributed master-slave architecture optimized for cloud environments. It separates the platform management (control plane) from the actual parallel data computation (compute plane). 

Platform Level: The Split-Plane Architecture
Databricks high-level architecture splits its operations into two distinct areas to balance security, governance, and scale: 
  • Control Plane: Databricks hosts this backend layer in its own cloud account. It manages the notebook UI, workspace configurations, job scheduling, security, and the Databricks Unity Catalog for central data governance. 
  • Compute Plane: This is where your actual data processing happens. It can run in your personal cloud account (Classic Compute) or safely in an isolated network inside Databricks' account (Serverless Compute). 

Infrastructure Level: Core Spark Cluster Nodes
The compute plane provisions clusters of Virtual Machines (VMs) structured around two main roles: 
  • Driver Node (Master): This node runs the SparkSession, which serves as the entry point for your application. It converts user code (Python, SQL, Scala, R) into an execution plan called a Directed Acyclic Graph (DAG). It then coordinates tasks and schedules them out to workers. 
  • Worker Nodes (Slaves): These machines receive the code tasks from the driver. They fetch data fragments, process calculations entirely in-memory, and stream results back to the driver. 
┌────────────────────────────────────────────────────────┐
│                      CONTROL PLANE                     │
│    (UI, Web App, Workflows, Unity Catalog, Security)   │
└───────────────────────────┬────────────────────────────┘
                            │ (Manages & Coordinates)
┌───────────────────────────▼────────────────────────────┐
│                      COMPUTE PLANE                     │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │               DRIVER NODE (Master)               │  │
│  │   [SparkSession] ──► [DAG/Plan] ──► [Scheduler]  │  │
│  └────────────────────────┬─────────────────────────┘  │
│                           │ (Sends Tasks)              │
│            ┌──────────────┴──────────────┐             │
│            ▼                             ▼             │
│  ┌───────────────────┐         ┌───────────────────┐   │
│  │ WORKER NODE       │         │ WORKER NODE       │   │
│  │ ┌───────────────┐ │         │ ┌───────────────┐ │   │
│  │ │ Executor JVM  │ │         │ │ Executor JVM  │ │   │
│  │ │  (Core Tasks) │ │         │ │  (Core Tasks) │ │   │
│  │ └───────────────┘ │         │ └───────────────┘ │   │
│  └───────────────────┘         └───────────────────┘   │
└────────────────────────────────────────────────────────┘

Internal Execution Level: JVM & Tasks
Within each worker node, the workload is further subdivided to parallelize operations: 
  • Executors: Independent Java Virtual Machine (JVM) processes launched on the worker nodes specifically to run your tasks. Each executor handles local memory storage and tracks data lineages. 
  • Slots/Cores: Every executor is configured with execution slots that mirror available CPU cores. A single slot handles exactly one data partition task at a time. 
  • Tasks: The smallest unit of execution. Tasks execute transformations (like filtering or calculations) on a single data block and run completely in parallel across your available cluster slots. 

Question : spark sql in databricks

Spark SQL is an Apache Spark module used for structured data processing within Databricks. It allows you to query large datasets using standard ANSI SQL syntax while leveraging Spark's distributed computing power. 
Execution Methods
You can run Spark SQL queries in Databricks notebooks using two primary methods: 
  • Using the %sql Magic Command:
    Turn any notebook cell into a native SQL editor by placing %sql at the very top.
    sql
    %sql
    SELECT user_id, COUNT(order_id) 
    FROM samples.tpch.orders 
    GROUP BY user_id;
    
    Using the spark.sql() API:
  • Embed SQL queries directly inside Python, Scala, or R cells. This returns a DataFrame for further programmatic manipulation.
    python
    # In a Python cell
    df = spark.sql("SELECT * FROM samples.tpch.customer WHERE market_segment = 'AUTOMOBILE'")
    display(df)
    

Interoperability with DataFrames
Spark SQL shares the same underlying optimization engine as DataFrames. This allows you to seamlessly switch between Python/Scala and pure SQL. 
  • Querying Files Directly: You can run SQL queries directly on files stored in cloud object storage without importing them first.
    sql
    %sql
    SELECT * FROM json.`/volumes/my_catalog/my_schema/my_volume/data.json`;
    
    Creating Temporary Views: Turn a PySpark DataFrame into a queryable SQL relation using temporary views.
  • python
    # Register DataFrame as a temporary view
    df.createOrReplaceTempView("my_temp_view")
    
    sql
    %sql
    -- Query it immediately using SQL
    SELECT * FROM my_temp_view WHERE status = 'active';
    

Advanced Features in Databricks
Databricks extends traditional Spark SQL capability with enterprise-grade features found in the Databricks SQL Language Reference
  • Delta Lake Optimization: Built-in support for MERGE INTO, UPDATE, and DELETE syntax.
  • SQL Scripting: Procedural control flow like loops, variables, and IF-ELSE statements directly inside your SQL code.
  • Unity Catalog Integration: Secure data governance mapping 3-tier namespaces (catalog.schema.table).
  • COPY INTO: An idempotent SQL command to bulk load files from object storage directly into Delta table

For more Details

Question : Azure Databricks transformation and action


In Azure Databricks, transformations and actions represent the two core types of operations used to manipulate data within Apache Spark DataFrames. The fundamental distinction between them relies on lazy evaluation, where transformations construct a logical execution plan, and actions trigger the physical computation of that plan to return or save results. 

Key Concepts
  • Lazy Evaluation: Azure Databricks does not execute operations immediately. It stores the processing logic as a sequence of steps.
  • Directed Acyclic Graph (DAG): Spark builds a mathematical graph (DAG) to map out the most efficient way to compute the final requested output.
  • Catalyst Optimizer: An internal engine that optimizes the DAG to reduce memory usage and minimize disk read/write operations. 

1. Transformations
Transformations are operations that take a DataFrame as input and return a new DataFrame as output. Because DataFrames are immutable, transformations do not alter the source data. They are lazily evaluated and merely define the instructions for processing. 
Types of Transformations
  • Narrow Transformations: Operations where each input partition contributes to exactly one output partition. No data movement across the network (shuffling) is required. 
  • Wide Transformations: Operations where data from multiple input partitions must be combined into a single output partition. This requires a costly cluster operation called a shuffle. 
Common Transformation Operations
  • filter() or where(): Selects specific rows based on a condition (Narrow).
  • select(): Isolates specific columns from the dataset (Narrow).
  • withColumn(): Creates a new column or modifies an existing one (Narrow).
  • join(): Merges two datasets based on a common key (Wide).
  • groupBy(): Aggregates records based on matching column criteria (Wide).
  • distinct(): Removes duplicate rows from the dataset (Wide). 

2. Actions
Actions are operations that instruct the Spark engine to compute a concrete result from the accumulated transformations. When an action is executed, Spark evaluates the optimization plan, reads the data, applies the pipeline transformations, and triggers a Spark Job. 
Common Action Operations
  • display(): Renders the data visually in a tabular format within Databricks notebooks.
  • count(): Evaluates the total number of rows in the DataFrame.
  • collect(): Returns all rows of the DataFrame back to the Driver node as an array (use with caution on huge datasets to avoid out-of-memory errors).
  • show(): Prints the first 20 rows of a DataFrame in a console-like layout.
  • take(n): Retrieves the first n rows of the DataFrame.
  • write(): Permanently exports the processed data into cloud storage formats like Delta Lake tables, CSV, Parquet, or JSON. 

Summary Comparison
Metric TransformationsActions
OutputA new DataFrameA concrete value, file, or console print
ExecutionLazy (Deferred evaluation)Eager (Immediate execution)
Cluster BehaviorUpdates the internal execution plan (DAG)Triggers a Spark Job across cluster nodes
Primary UseData cleaning, filtering, and structuringOutputting, inspecting, or storing results

Python Code Example
The following code illustrates how transformations construct a query plan without executing it, until an action is explicitly invoked: 
python
# 1. READ OPERATION (Transformation)
df = spark.read.table("hive_metastore.default.raw_sales")

# 2. TRANSFORMATIONS (No data processing happens yet; Spark just builds the plan)
filtered_df = df.filter(df["amount"] > 100)                    # Narrow transformation
transformed_df = filtered_df.groupBy("country").sum("amount")   # Wide transformation

# 3. ACTION (This triggers the actual computation across the cluster)
final_result = transformed_df.collect()

# 4. ACTION (Saves the output to a permanent Delta Lake table)
transformed_df.write.format("delta").mode("overwrite").saveAsTable("gold_sales_summary")

Question : azure databricks window functions
Azure Databricks window functions operate on a group of rows (called a window) and calculate a return value for each individual row based on that group. Unlike standard GROUP BY operations which collapse your dataset, window functions retain the full granularity of your original rows while appending calculated columns like running totals or rankings. 

Key Components of a Window Function
Every window function relies on an OVER clause which defines the Window Specification using three core blocks: 
  • PARTITION BY: Groups rows into subsets (e.g., separating data by department or category). If omitted, the window spans the entire dataset.
  • ORDER BY: Sorts rows within each partition. This is required for ranking and boundary-sensitive analytic functions.
  • ROWS or RANGE (Window Frame): Defines a sliding boundary of rows relative to the current row (e.g., calculating a 7-day moving average). 

Core Types of Window Functions
1. Ranking Functions
These functions assign numerical orders or ranks to rows within your partition. 
  • row_number(): Assigns a unique sequentially ascending integer starting at 1.
  • rank(): Assigns ranks but leaves gaps if there are duplicate values (e.g., 1, 2, 2, 4).
  • dense_rank(): Assigns ranks without leaving gaps for duplicate values (e.g., 1, 2, 2, 3). 
2. Analytic & Value Functions
These fetch data points from other rows relative to your current row position. 
  • lag(expr, offset): Grabs a value from a previous row.
  • lead(expr, offset): Grabs a value from a subsequent row.
  • first_value(expr) / last_value(expr): Returns the first or last value in the window frame. 
3. Aggregate Functions
Standard mathematical operations executed dynamically over the sliding window. 
  • sum(), avg(), min(), max(), count() [1]

Implementation Comparison
You can implement these functions using either Databricks SQL or the PySpark DataFrame API. 
Databricks SQL Approach
Below is a pure SQL example demonstrating a running total and employee ranking.
sql
SELECT 
    employee_id,
    department,
    salary,
    -- Running total of salary inside each department
    SUM(salary) OVER(
        PARTITION BY department 
        ORDER BY salary 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_salary_total,
    -- Rank employees by salary within their department
    DENSE_RANK() OVER(
        PARTITION BY department 
        ORDER BY salary DESC
    ) AS salary_rank
FROM employees;
PySpark DataFrame API Approach
To build the equivalent pipeline using PySpark, leverage the pyspark.sql.Window class. 
python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Define the window specifications
running_total_spec = (Window
                      .partitionBy("department")
                      .orderBy("salary")
                      .rowsBetween(Window.unboundedPreceding, Window.currentRow))

rank_spec = (Window
             .partitionBy("department")
             .orderBy(F.col("salary").desc()))

# 2. Apply specifications to the DataFrame
df_result = df.withColumn("running_salary_total", F.sum("salary").over(running_total_spec)) \
              .withColumn("salary_rank", F.dense_rank().over(rank_spec))

display(df_result)
Databricks Pro-Tip: The QUALIFY Clause
When working in Databricks SQL, filtering results based on a window function (like grabbing only salary_rank = 1) traditionally required subqueries or CTEs. Databricks simplifies this with the QUALIFY clause, allowing you to filter window metrics natively: 
sql
SELECT employee_id, department, salary
FROM employees
QUALIFY DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) = 1;

Question: azure databricks dataframe and dataset

In Azure Databricks, DataFrames and Datasets are the primary distributed data abstractions provided by the Apache Spark API for structured and semi-structured big data processing. While they share the same underlying optimization engines, a DataFrame is an untyped, tabular collection of data available across Python, Scala, R, and SQL, whereas a Dataset is a strongly-typed, object-oriented collection exclusive to Scala and Java. 
Quick Comparison
The following table contrasts the key structural differences between the two abstractions:
Feature DataFrameDataset
Type SafetyUntyped (Run-time errors)Strongly-typed (Compile-time safety)
Language SupportPython (PySpark), Scala, Java, R, SQLScala and Java only
Data RepresentationDistributed collection of generic Row objectsDistributed collection of JVM objects (e.g., custom Case Classes)
OptimizationFully optimized via Catalyst & Tungsten enginesFully optimized, but lamda functions can bypass some optimizations

Key Concepts
1. What is a DataFrame?
  • Tabular Structure: Organizes data into a two-dimensional table of named rows and columns, similar to a relational database table. 
  • Lazy Evaluation: Transformations (e.g., filter(), select()) build an execution plan but do not process data until an action (e.g., show(), count(), display()) is called. 
  • Unified Performance: Leverages Spark's Catalyst Optimizer to achieve identical high-speed execution regardless of whether you write code in Python, Scala, or SQL. 
  • Schema Inference: Automatically detects column names and data types when reading structured files like Parquet, Delta Lake, or CSV with headers. 
2. What is a Dataset?
  • An Extension of DataFrames: Technically, a DataFrame is just an alias for an untyped Dataset of Rows (Dataset[Row]).
  • Compile-Time Safety: Catches syntax and type mismatches during compilation rather than failing mid-way through a long production cluster job.
  • Object-Oriented: Maps row items directly to domain-specific Java or Scala objects. 

When to Use Which?
Use DataFrames when:
  • You primarily use Python (PySpark) or SQL inside your Azure Databricks notebooks.
  • You require rapid data manipulation, aggregations, visualization, and out-of-the-box performance optimizations.
  • You want to seamlessly convert data to or from the Pandas API on Spark for data science tasks. [
Use Datasets when:
  • You are building complex production pipelines using Scala or Java.
  • Your pipeline requires strict type-safety guarantees to prevent unexpected schema errors downstream.
  • You want to map rows to a strongly-typed domain model using Scala Case Classes. 
Code Examples
Creating and Using a DataFrame (Python / PySpark)
python
# Read an example CSV file into a DataFrame
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

# Filter and select columns (Transformation)
filtered_df = df.filter(df["2015 Median Price"] > 500000).select("State", "2015 Median Price")

# Display the results using Databricks built-in tool (Action)
display(filtered_df)
Creating and Using a Dataset (Scala)
scala
// Define a strict schema using a Case Class
case class HomePrice(State: String, Population: Long, MedianPrice: Double)

// Read data and map it directly to the Case Class object to create a Dataset
val dataset = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
  .as[HomePrice] // Enforces strong typing

// Compile-time safe filtering
val highPriceDS = dataset.filter(row => row.MedianPrice > 500000)

// Display results
display(highPriceDS)




Question : azure databricks join


Azure Databricks supports standard SQL and PySpark syntax to join multiple datasets using a common key. You can execute these joins via Databricks SQL or Spark DataFrames. [1, 2, 3, 4, 5]
PySpark Syntax
In Python notebooks, use the df.join() method.
python
# Syntax
df_result = df_left.join(df_right, df_left.key == df_right.key, how="inner")
Common PySpark Join Types:
  • inner: Default type. Returns rows with matching keys in both DataFrames.
  • left / leftouter: Keeps all rows from the left DataFrame and appends matching right side rows. Missing matches become null.
  • right / rightouter: Keeps all rows from the right DataFrame and appends matching left side rows.
  • full / fullouter: Returns all records from both sides, filling un-matched values with null.
  • semi / leftsemi: Returns columns only from the left DataFrame where a matching key exists on the right.
  • anti / leftanti: Returns columns only from the left DataFrame where no matching key exists on the right. [1, 2, 3]

SQL Syntax
When writing standard SQL or running cells with %sql, use the standard ANSI-SQL JOIN syntax: [1, 2]
sql
SELECT * 
FROM table_a AS a
INNER JOIN table_b AS b
  ON a.id = b.id;
Optimization Strategies
Because large-scale joins can trigger expensive network shuffling, Azure Databricks provides tools to optimize performance:
  • Broadcast Joins: Databricks automatically broadcasts small tables to all worker nodes to avoid full shuffles.
  • You can explicitly force this using a Join Hint: SELECT /*+ BROADCAST(b) */ * FROM table_a a JOIN table_b b ON a.id = b.id. [1, 2, 3]
  • Photon Engine: Keep the Photon Engine active on your cluster configuration,
as it natively accelerates join evaluation and chooses optimal physical execution plans.
  • Analyze Tables: Keep table statistics fresh so the Databricks query optimizer can order
joins accurately. Run ANALYZE TABLE table_name COMPUTE STATISTICS before processing.




Databricks interviews focus on your ability to use the Data Lakehouse architecture, optimize distributed computing (Apache Spark), and manage data governance (Unity Catalog). Preparation should span across architecture, Delta Lake features, and practical performance tuning. 
1. Core Architecture & Concepts
Q: What is the Databricks Lakehouse architecture?
A: It is a unified platform that combines the reliability, governance, and ACID transactions of a data warehouse with the scalability, flexibility, and low-cost storage of a data lake. It unifies batch and streaming data into a single system. 
Q: What is Unity Catalog?
A: It is a fine-grained centralized governance solution that manages data access and permissions across all Databricks workspaces. It provides centralized auditing, data lineage, and row/column-level security. 
Q: Difference between an All-Purpose Cluster and a Job Cluster?
A:
  • All-Purpose: Used for interactive exploration, collaborative notebook sharing, and ad-hoc analysis. Takes longer to start and incurs higher costs. 
  • Job Cluster: Dedicated, automated clusters explicitly created to run a specific automated workflow or task. They terminate immediately upon job completion, saving computing costs. 

2. Delta Lake Features
Q: What are the main benefits of Delta Lake over standard Parquet?
A: Delta Lake provides ACID transactions (preventing corrupted reads/writes), enforces data schemas (schema enforcement), and allows you to query historical data states via "Time Travel". 
Q: Explain the difference between VACUUM and OPTIMIZE.
A: [1]
  • OPTIMIZE: Compacts small files in a Delta table into larger, optimized files, which speeds up read queries.
  • VACUUM: Permanently deletes physical data files that are no longer referenced by the Delta transaction log (e.g., historical data past the retention threshold). It is used to reclaim cloud storage and reduce costs. 
Q: What is Schema Evolution in Delta Lake?
A: It allows a pipeline to automatically adapt to changes in data structure without breaking the target table. You typically enable it by using the .option("mergeSchema", "true") setting. 

3. Spark Optimization & Debugging
Q: How do you handle Data Skew in Databricks?
A: Data skew occurs when a few specific keys hold disproportionately more data, slowing down the executors processing them. You can handle this by: [1, 2]
  • Salting the skewed keys (adding random prefixes to distribute data across executors).
  • Increasing the spark.sql.shuffle.partitions configuration.
  • Utilizing Adaptive Query Execution (AQE), which dynamically coalesces skewed partition groups. [1, 2, 3, 4, 5]

Q: When would you use a Broadcast Join?
A: When you are joining a very large table with a significantly smaller table. Spark sends (broadcasts) the small table to the memory of every executor, preventing expensive network shuffle operations. Risk: Running out of executor memory if the small table is too large. [1, 2]


 Databricks data engineering interviews comprehensively test architectural concepts, real-time optimization strategies, and robust programmatic problem-solving. [1, 2]

The following selection delivers 4 major Databricks interview questions spanning Architecture, Optimization, Data Ingestion, and a Coding Scenario. Each question includes detailed explanations, technical context, comprehensive examples, and specific test cases with sample PySpark/SQL implementations. [1, 2, 3, 4, 5]

1. Handling Heavy Data Skew with Salting
Question
What is Data Skew in Apache Spark, how does it negatively affect Databricks clusters, and how do you resolve it using a technical approach like Salting? [1, 2]
Technical Context
Data skew occurs when an asymmetric distribution causes a single or small subset of keys to hold a massive portion of the total records. During wide transformations (like a JOIN or groupBy), Spark hashes data keys to assign them to specific partitions. A skewed key forces a solitary executor node to process vastly more data than its peers. This creates a "long tail" problem where the cluster remains underutilized while one node runs out of memory (OOM) or bottlenecks the entire pipeline. [1, 2]
Salting breaks up this bottleneck by appending a randomized integer (a "salt") to the join keys of the skewed dataset, while exploding the lookup table to match those generated salt variations. [1, 2, 3]
Skewed Key ("A") ----> Appended Salt ----> Distributed Keys: "A_0", "A_1", "A_2"
Detailed Example & Implementation
Imagine joining a multi-terabyte Transactions Table (highly skewed on a popular product_id, such as 101) with a small Products Dimension Table. [1]
python
from pyspark.sql import functions as F

# 1. Define salt range
num_salts = 3

# 2. Add salt to the skewed large table
transactions_salted = transactions_df.withColumn(
    "salt", 
    F.concat(F.col("product_id"), F.lit("_"), F.randint(0, num_salts - 1))
)

# 3. Explode the small dimension table to match every possible salt value
salts_df = spark.range(0, num_salts).withColumnRenamed("id", "salt_val")

products_exploded = products_df.join(salts_df) \
    .withColumn("salt", F.concat(F.col("product_id"), F.lit("_"), F.col("salt_val"))) \
    .drop("salt_val")

# 4. Perform the optimized join on the salted key
optimized_joined_df = transactions_salted.join(
    products_exploded,
    on="salt",
    how="inner"
).drop("salt")
Use code with caution.
Test Cases
  • Test Case 1: Skewed Key Distribution
    • Input Data: transactions_df contains 1,000,000 records where product_id = 101 and 10 records where product_id = 202.
    • Expected Behavior: Without salting, Spark creates a single massive partition for 101. With a salt range of 3, the records for 101 split roughly evenly into 3 distinct processing buckets (101_0, 101_1, 101_2), balancing the cluster load.
  • Test Case 2: Join Cardinality and Correctness Verification
    • Input Data: Single transaction record [product_id: 101, amount: 50]. Dimension record [product_id: 101, name: 'Widget'].
    • Expected Output: The output must yield exactly 1 joined record: [product_id: 101, amount: 50, name: 'Widget']. The test guarantees that exploding the dimension table does not introduce artificial row duplication.

2. Implementing SCD Type 2 tracking in Delta Lake
Question
How do you efficiently implement a Slowly Changing Dimension Type 2 (SCD Type 2) pipeline within a Databricks Delta Lake table using native commands? [1, 2]
Technical Context
SCD Type 2 tracks historical data by creating multiple records for a given natural key, differentiating them with state tracking fields like start_date, end_date, and a boolean is_current flag. Delta Lake handles this pattern natively through the MERGE INTO statement. To perform a historical update, existing active records must be marked as expired (is_current = false), and completely new records must be appended as the new ground truth. [1, 2, 3, 4, 5]
Detailed Example & Implementation
We need to track changes to an internal Employees Delta Table. If an employee changes their department, the old record must terminate, and a new record must begin.
sql
-- Target table initialization configuration
CREATE TABLE delta.employees (
    emp_id INT,
    name STRING,
    department STRING,
    start_date DATE,
    end_date DATE,
    is_current BOOLEAN
) USING DELTA;

-- Merge logic processing staging data
MERGE INTO delta.employees AS target
USING (
    -- Combine new updates with an explicit null-key row to force insertions of changed values
    SELECT src.emp_id AS merge_key, src.* FROM staging_updates src
    UNION ALL
    SELECT NULL AS merge_key, src.* FROM staging_updates src
    JOIN delta.employees tgt ON src.emp_id = tgt.emp_id
    WHERE tgt.is_current = true AND src.department <> tgt.department
) AS source
ON target.emp_id = source.merge_key AND target.is_current = true
WHEN MATCHED AND target.department <> source.department THEN
    UPDATE SET target.end_date = source.start_date, target.is_current = false
WHEN NOT MATCHED THEN
    INSERT (emp_id, name, department, start_date, end_date, is_current)
    VALUES (source.emp_id, source.name, source.department, source.start_date, NULL, true);
Use code with caution.
Test Cases
  • Test Case 1: Processing a Brand New Record
    • Input Data: Staging record [emp_id: 5, name: 'Alice', department: 'HR', start_date: '2026-06-01'] targeting an empty target table.
    • Expected Output: One row added: [emp_id: 5, ..., department: 'HR', start_date: '2026-06-01', end_date: NULL, is_current: true].
  • Test Case 2: Historical Value Mutation (The Update)
    • Input Data: Target table contains active record for Alice in 'HR'. Staging receives: [emp_id: 5, name: 'Alice', department: 'Engineering', start_date: '2026-06-25'].
    • Expected Output: The target table must contain exactly two rows for emp_id: 5:
      1. The legacy row updated to: [department: 'HR', end_date: '2026-06-25', is_current: false].
      2. The newly created row added as: [department: 'Engineering', start_date: '2026-06-25', end_date: NULL, is_current: true].

3. Scaling File Ingestion with Auto Loader
Question
What is Databricks Auto Loader, how does it work under the hood, and what advantages does it offer over standard Spark Structured Streaming file readers? [1, 2, 3]
Technical Context
Traditional Spark directory streaming routinely scans cloud object storage containers to find newly added objects. As file quantities scale into the millions, directory listing operations become painfully slow and prohibitively expensive. [1, 2]
Databricks Auto Loader (cloudFiles) solves this using two operational mechanisms: [1, 2]
  1. Directory Listing Mode: It efficiently enumerates files and retains state directly inside RocksDB checkpoints, completely removing the need to re-scan previously ingested assets. [1]
  2. File Notification Mode: It automatically configures and listens to cloud notification systems (like AWS SNS/SQS or Azure Event Grid/Queue), directly queuing incoming files without running manual system storage scans. [1, 2, 3, 4]
Detailed Example & Implementation
Streaming JSON file extraction from a landing storage container with explicit automatic schema inference and evolution tracking. [1, 2]
python
# Streaming data ingestion via Auto Loader
checkpoint_path = "dbfs:/mnt/checkpoints/json_ingestion"
schema_location = "dbfs:/mnt/metadata/schemas/json_ingestion"

streaming_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") # Automatic schema capture
    .load("dbfs:/mnt/landing_zone/raw_events/"))

query = (streaming_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .start("dbfs:/mnt/silver_zone/events_table"))
Use code with caution.
Test Cases
  • Test Case 1: Incremental Execution and Processing Latency
    • Input Data: 50,000 files exist in /raw_events/. The pipeline processes them completely. 10 new files are dropped into the bucket.
    • Expected Behavior: Auto Loader identifies and streams only the 10 incoming files instantly using the notification queue, skipping expensive directory scanning. [1, 2]
  • Test Case 2: Schema Evolution Verification
    • Input Data: System reads initial records with fields [id, timestamp]. A new batch introduces a modified structure containing a third column: [id, timestamp, geo_location].
    • Expected Output: The streaming target Delta table alters dynamically to include geo_location without failing the workflow, populating previous records with a NULL fallback. [1, 2]

4. Algorithmic Coding Challenge — IP Range to CIDR Merging
Question
Write a Python/PySpark solution that takes a series of network log IP addresses and collapses contiguous ranges into minimal CIDR blocks. [1]
Technical Context
This programmatic scenario evaluates your algorithmic fluency, object optimization strategies, and network architecture logic. Databricks profiles rely heavily on transforming IP data string representations into numerical 32-bit unsigned integer values to cleanly execute mathematical ranges, interval scans, and sliding optimization windows. [1, 2]
Detailed Example & Implementation
python
import ipaddress

def merge_ips_to_cidr(ip_list):
    """
    Parses a string list of individual IPv4 strings, collapses contiguity,
    and returns the optimized, minimum possible list of representative CIDR blocks.
    """
    if not ip_list:
        return []
        
    # Convert text strings into structured IP objects, sorting ascending
    ips = sorted([ipaddress.IPv4Address(ip) for ip in ip_list])
    
    merged_cidrs = []
    current_range = []
    
    for ip in ips:
        if not current_range:
            current_range.append(ip)
        # Check if the current IP address is strictly contiguous with the previous element
        elif int(ip) == int(current_range[-1]) + 1:
            current_range.append(ip)
        else:
            # Collapse completed range into native CIDR representations
            start_ip = current_range[0]
            end_ip = current_range[-1]
            merged_cidrs.extend(ipaddress.summarize_address_range(start_ip, end_ip))
            current_range = [ip]
            
    if current_range:
        merged_cidrs.extend(ipaddress.summarize_address_range(current_range[0], current_range[-1]))
        
    return [str(cidr) for cidr in merged_cidrs]

# Example execution verification
input_ips = ["192.168.1.0", "192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.5"]
print(merge_ips_to_cidr(input_ips))
# Output: ['192.168.1.0/30', '192.168.1.5/32']
Use code with caution.
Test Cases
  • Test Case 1: Perfect Contiguous Block Optimization
    • Input Data: ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3"]
    • Expected Output: ['10.0.0.0/30'] (collapses four 32-bit addresses into a single subnet).
  • Test Case 2: Complete Discontinuity Handling
    • Input Data: ["192.168.0.1", "10.0.0.1"]
    • Expected Output: ['10.0.0.1/32', '192.168.0.1/32'] (order sorted, zero