Tuesday 18 April 2023

Additional Spark Interview Questions

* How to add a column to a Spark DataFrame (see the sketch below)

    df.withColumn("copiedfromcolumn", col("salary") - 1)

    df.selectExpr("orderid", "salary - 1 as new_salary")

    A column can also be added by using map over the rows and rebuilding the columns
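
A minimal Scala sketch of all three approaches (the SparkSession setup and the sample data are assumed for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("AddColumnExample").getOrCreate()
    import spark.implicits._

    // Assumed sample data with the columns used above
    val df = Seq((1, 1000), (2, 2000)).toDF("orderid", "salary")

    // 1. withColumn adds (or replaces) a column derived from an existing one
    val withCol = df.withColumn("copiedfromcolumn", col("salary") - 1)

    // 2. select with a SQL expression
    val selected = df.selectExpr("orderid", "salary - 1 as new_salary")

    // 3. map over the rows and rebuild the columns
    val mapped = df.map(row => (row.getInt(0), row.getInt(1) - 1)).toDF("orderid", "new_salary")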

* Difference between map and flatMap

    - https://sparkbyexamples.com/spark/spark-map-vs-flatmap-with-examples/

    - map returns exactly one output element per input element, so the row count stays the same; flatMap can return zero, one, or many elements per input, so the output can have more (or fewer) rows than the input (see the sketch below)
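
A short Scala sketch of the difference (the sample sentences are assumed for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MapVsFlatMap").getOrCreate()
    val lines = spark.sparkContext.parallelize(Seq("hello world", "spark is fast"))

    // map: exactly one output element per input element -> 2 elements (each an Array[String])
    val mapped = lines.map(line => line.split(" "))

    // flatMap: each input element can produce 0..n output elements -> 5 elements (words)
    val words = lines.flatMap(line => line.split(" "))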

* Reduce by and Group by 

    * https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

- reduceByKey first combines values locally within each partition (map-side combine) and only then shuffles the partial results, thereby reducing the amount of data that needs to be shuffled; groupByKey shuffles every value (see the sketch below)
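
A minimal Scala sketch contrasting the two (the word-count data is assumed for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ReduceByVsGroupBy").getOrCreate()
    val pairs = spark.sparkContext
      .parallelize(Seq("a", "b", "a", "c", "b", "a"))
      .map(word => (word, 1))

    // Preferred: partial sums are computed inside each partition before the shuffle
    val reduced = pairs.reduceByKey(_ + _)

    // Works, but every single (word, 1) pair is shuffled before the values are summed
    val grouped = pairs.groupByKey().mapValues(_.sum)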

* How to perform basic Spark operations

* What is cost-based optimization (CBO) in Spark

* How to optimize Athena Queries 

    - https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

* Redshift Spectrum adds support for querying open-source Apache Hudi and Delta Lake data lakes

    * https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/

* Redshift Spectrum vs. Athena

  1. https://www.upsolver.com/blog/aws-serverless-redshift-spectrum-athena
  2. Redshift Spectrum requires a running Redshift cluster; Athena does not
  3. Redshift Spectrum performance depends on your cluster size, so it can be faster than Athena
  4. Redshift Spectrum lets you join S3 data with the tables already in your cluster

    

* How to use Spark SQL inside a Scala-based application (a fuller sketch follows below)

sparkDf.createOrReplaceTempView("abc_table")

val agg_df = spark.sql("select `order`, count(*) from abc_table group by `order`")
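
A self-contained Scala sketch of the same idea (the object name, sample data, and column names are assumed for illustration):

    import org.apache.spark.sql.SparkSession

    object SparkSqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SparkSqlExample").getOrCreate()
        import spark.implicits._

        // Assumed sample data standing in for the real DataFrame
        val sparkDf = Seq(("o1", 10), ("o2", 20), ("o1", 30)).toDF("order", "amount")

        // Register the DataFrame as a temporary view and query it with SQL
        sparkDf.createOrReplaceTempView("abc_table")
        val agg_df = spark.sql("select `order`, count(*) as cnt from abc_table group by `order`")
        agg_df.show()

        spark.stop()
      }
    }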


Steps to work with Spark (a complete sketch follows below)

- Create a SparkSession object

val spark = SparkSession.builder().config("spark.app.name", "example").getOrCreate()

- Read the data, then transform it with select / where / groupBy

val df = spark.read.option("header", "true").csv("path/to/file.csv")

df.select("colname").where("colname is not null").groupBy("colname").count()

- Join with another DataFrame

df.join(df2, df("colname") === df2("colname"), "inner")
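
Putting the steps together, a minimal runnable Scala sketch (the file path, column names, and the second DataFrame are assumed for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object BasicSparkFlow {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("BasicSparkFlow").getOrCreate()
        import spark.implicits._

        // Assumed input file and schema
        val orders = spark.read.option("header", "true").option("inferSchema", "true").csv("s3://bucket/orders.csv")

        // select / filter / aggregate
        val bigOrders = orders
          .select("order_id", "customer_id", "amount")
          .where(col("amount") > 100)
          .groupBy("customer_id")
          .count()

        // join with a second (assumed) DataFrame
        val customers = Seq(("c1", "Alice"), ("c2", "Bob")).toDF("customer_id", "name")
        val joined = bigOrders.join(customers, bigOrders("customer_id") === customers("customer_id"), "inner")

        joined.show()
        spark.stop()
      }
    }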


Spark Optimizations

* https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

* Out-of-memory issues happen when Spark executor memory or parallelism is not set properly

* Use R-family (memory-optimized) instances for memory-intensive applications and C-family (compute-optimized) instances for compute-intensive applications

* Of the memory allocated to each executor container, roughly 90% is available to the executor itself; the rest is needed for system processes, memory overhead, and reserved memory

  • spark.executor.memory – Size of memory to use for each executor that runs the task.
  • spark.executor.cores – Number of virtual cores.

Example: suppose we want to process 200 TB of data in S3 files

- r5.12xlarge -- 48 vCPUs, 384 GB RAM per instance; 20 instances in total (1 reserved for the driver)
- Start with spark.executor.cores = 5
- Executors per instance = (48 - 1) / 5 = 9, leaving one core for the OS and Hadoop daemons
- Spark executor memory: roughly 383 GB / 9 executors ≈ 42 GB per executor; about 90% of that (≈ 37 GB) goes to spark.executor.memory, the rest to memory overhead

spark.executor.instances = (9 executors * 19 instances) - 1 for the driver = 170

spark.default.parallelism = spark.executor.instances * spark.executor.cores * 2 = 170 * 5 * 2 = 1700

(A configuration sketch with these values follows below.)
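
A Scala sketch setting the derived values (the app and class names are assumed for illustration; these settings are usually passed via spark-submit --conf or spark-defaults.conf instead):

    import org.apache.spark.sql.SparkSession

    // Executor settings only take effect if set before the SparkContext is created
    val spark = SparkSession.builder()
      .appName("ProcessS3Data")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "37g")
      .config("spark.executor.memoryOverhead", "5g")
      .config("spark.executor.instances", "170")
      .config("spark.default.parallelism", "1700")
      .getOrCreate()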

* Coalesce vs Repartition (see the sketch below)
- coalesce is used to reduce the number of partitions and avoids a full shuffle by merging existing partitions; repartition can increase or decrease the number of partitions and always triggers a full shuffle
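
A short Scala sketch (the DataFrame and partition counts are assumed for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("CoalesceVsRepartition").getOrCreate()
    val df = spark.range(0, 1000000)                 // assumed example data

    val narrowed   = df.coalesce(10)                 // merges existing partitions, no full shuffle
    val rebalanced = df.repartition(200)             // full shuffle, evenly sized partitions
    val byColumn   = df.repartition(200, col("id"))  // full shuffle, partitioned by a column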

Additional Spark Optimization Techniques
- Enabling Spark Adaptive Query Execution (AQE)
- Coalescing shuffle partitions: handles situations where partition sizes are not uniform and the work ends up on only a small number of executors. Example: grouping by state, where the number of distinct states is small, leaves the work to a few executors. AQE estimates a better number of shuffle partitions at runtime.

- Dynamically switching join strategies: for example, when data is filtered while joining, the join strategy is initially chosen from the estimated dataset size, but AQE changes the plan at runtime based on the actual data size (e.g. switching to a broadcast join)

- Skew handling: when one partition is much bigger than the rest, one task takes much longer than the others; AQE splits the skewed partition into smaller sub-partitions (a configuration sketch follows below)
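
A minimal Scala configuration sketch for AQE (available in Spark 3.x; the values shown are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("AqeExample")
      .config("spark.sql.adaptive.enabled", "true")                     // turn AQE on (default since Spark 3.2)
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  // merge small shuffle partitions
      .config("spark.sql.adaptive.skewJoin.enabled", "true")            // split skewed partitions at runtime
      .getOrCreate()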






