* How to add a column to a Spark DataFrame
df.withColumn("copiedfromcolumn", col("salary") - 1)
df.selectExpr("orderid", "salary - 1 as new_salary")
- A map transformation can also add a column, as shown in the sketch below
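A minimal Scala sketch of all three approaches (the orderid/salary columns and the sample values are placeholders, not from a real dataset):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("add-column").getOrCreate()
import spark.implicits._

val df = Seq((1, 5000), (2, 6000)).toDF("orderid", "salary")

// 1. withColumn adds (or replaces) a derived column
val df1 = df.withColumn("copiedfromcolumn", col("salary") - 1)

// 2. select / selectExpr projects existing columns plus derived ones
val df2 = df.selectExpr("orderid", "salary - 1 as new_salary")

// 3. map transforms each row into a new tuple (needs an encoder, hence spark.implicits._)
val df3 = df.map(row => (row.getInt(0), row.getInt(1) - 1)).toDF("orderid", "new_salary")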
* Difference between map and flatMap
- https://sparkbyexamples.com/spark/spark-map-vs-flatmap-with-examples/
- flatMap can emit zero or more output rows per input row, so the result can have more (or fewer) rows than the input; map emits exactly one output row per input row (see the sketch below)
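A quick sketch of the one-to-one vs one-to-many behaviour (the sample strings are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-vs-flatmap").getOrCreate()
val lines = spark.sparkContext.parallelize(Seq("a b c", "d e"))

// map: exactly one output element per input element -> 2 rows in, 2 rows out
val mapped = lines.map(_.split(" "))     // RDD[Array[String]], still 2 elements

// flatMap: zero or more output elements per input element -> 2 rows in, 5 rows out
val words = lines.flatMap(_.split(" "))  // RDD[String]: "a", "b", "c", "d", "e"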
* reduceByKey vs groupByKey
- reduceByKey first combines values within each partition and only then shuffles, thereby reducing the amount of data that needs to be shuffled; groupByKey shuffles every (key, value) pair before any aggregation (see the sketch below)
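A sketch of the difference on a word-count-style RDD (keys and values are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reduce-vs-group").getOrCreate()
val pairs = spark.sparkContext.parallelize(
  Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

// reduceByKey combines values inside each partition first (map-side combine),
// so only one partial sum per key per partition crosses the network
val reduced = pairs.reduceByKey(_ + _)

// groupByKey ships every (key, value) pair across the network before aggregating
val grouped = pairs.groupByKey().mapValues(_.sum)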
* How to perform basic Spark operations (see "Steps to work with Spark" below)
* What is cost-based optimization (CBO) in Spark
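A sketch of turning CBO on; the config keys and ANALYZE syntax are real Spark settings, while abc_table/orderid are placeholders (and ANALYZE TABLE requires a catalog table, not a temp view):

// enable the cost-based optimizer (Spark 2.2+) and stats-based join reordering
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// CBO relies on table and column statistics collected up front
spark.sql("ANALYZE TABLE abc_table COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE abc_table COMPUTE STATISTICS FOR COLUMNS orderid")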
* How to optimize Athena queries
- https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
* Redshift Spectrum extends support to open-source Apache Hudi data lakes
* Redshift Spectrum vs Athena
- https://www.upsolver.com/blog/aws-serverless-redshift-spectrum-athena
- Redshift Spectrum requires you to have a Redshift cluster; Athena does not
- Redshift Spectrum performance depends on the size of your cluster, so it can be faster than Athena
- Redshift Spectrum allows you to join S3 data with the tables in your cluster
* How to use Spark SQL inside a Scala-based application
sparkDf.createOrReplaceTempView("abc_table")
val agg_df = spark.sql("select `order`, count(*) from abc_table group by `order`")
Steps to work with Spark
- Create a SparkSession object
val spark = SparkSession.builder().appName("app").getOrCreate()
val df = spark.read.option("header", "true").csv("path/to/file.csv")
df.select("colname").where(col("colname") > 0).groupBy("colname").count()
df.join(df2, df("colname") === df2("colname"), "inner")
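Putting the steps together in one runnable Scala sketch (the S3 paths and the custid/amount columns are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// 1. create a SparkSession
val spark = SparkSession.builder()
  .appName("basic-operations")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

// 2. read CSV files with a header row, letting Spark infer column types
val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3://my-bucket/orders.csv")

val customers = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3://my-bucket/customers.csv")

// 3. select, filter, then aggregate
val summary = orders
  .select("custid", "amount")
  .where(col("amount") > 0)
  .groupBy("custid")
  .count()

// 4. inner join on the shared key column
val joined = summary.join(customers, summary("custid") === customers("custid"), "inner")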
Spark Optimizations
* Out-of-memory issues happen when Spark executor memory or parallelism is not set properly
* Use R-family (memory-optimized) instance types for memory-intensive applications and C-family (compute-optimized) for compute-intensive applications
* Understand that of the total executor container memory, only around 90% is available to the executor JVM heap; the rest goes to memory overhead for system processes and off-heap allocations, plus reserved memory
* spark.executor.memory – size of the memory to use for each executor that runs the tasks
* spark.executor.cores – number of virtual cores per executor
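An illustrative sizing sketch; the config keys are real Spark settings, the values are placeholders to tune per workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sized-app")
  .config("spark.executor.instances", "4")       // number of executors
  .config("spark.executor.cores", "5")           // virtual cores per executor
  .config("spark.executor.memory", "8g")         // JVM heap per executor
  .config("spark.executor.memoryOverhead", "1g") // off-heap/system headroom (default: max(384m, 10% of heap))
  .getOrCreate()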