Questions on Hudi
What are Copy on Write and Merge on Read
- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html
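- Copy on Write (COW) rewrites the data file that contains the affected records, creating a new version of that file on every update. Writes therefore take longer, but reads are fast because nothing has to be merged at query time, so COW is more suitable for read-heavy workloads. COW is the default table type.
- Merge on Read (MOR) writes updates to delta log files and merges them with the base files at read time or during compaction. It is suitable for write-heavy workloads. Our workload is not write-heavy, so we do not use it.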
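As a rough sketch of where this choice is made, the table type is normally passed as a write option. The helper below only builds the option dictionary; the table name `orders` and the fields `order_id` and `updated_at` are made-up examples, and the option keys should be checked against the Hudi version in use.

```python
def hudi_write_options(table_name, record_key, precombine_field,
                       table_type="COPY_ON_WRITE"):
    """Build a dict of Hudi write options (illustrative helper, not a Hudi API)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown Hudi table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # COPY_ON_WRITE rewrites files on update; MERGE_ON_READ appends delta logs.
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to choose between records that share the same key.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }


# Hypothetical table and field names, for illustration only.
cow_options = hudi_write_options("orders", "order_id", "updated_at")
mor_options = hudi_write_options("orders", "order_id", "updated_at",
                                 table_type="MERGE_ON_READ")
```

With a Spark DataFrame `df` and a target `path` (both assumed here), the options would typically be applied as `df.write.format("hudi").options(**cow_options).mode("append").save(path)`.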
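What is the SerDe (serialization) in an Athena table
- SerDe stands for Serializer/Deserializer. It tells Athena or Hive the format of the input files and how to parse them into rows and columns. If we have a JSON file and load it into a Hive table, the SerDe tells the engine that the underlying file is in JSON format.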
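To make the idea concrete, here is a toy SerDe in plain Python. It is only an analogy for what a JSON SerDe does, not how Athena or Hive implement it, and the column list is a made-up example.

```python
import json

# Hypothetical table columns, in table order.
COLUMNS = ["id", "name", "city"]


def deserialize(line, columns=COLUMNS):
    """Turn one JSON text line into a row tuple ordered by the table columns."""
    obj = json.loads(line)
    # A key missing from the file simply comes back as None (NULL).
    return tuple(obj.get(col) for col in columns)


def serialize(row, columns=COLUMNS):
    """Turn a row tuple back into one JSON text line."""
    return json.dumps(dict(zip(columns, row)))


row = deserialize('{"id": 1, "name": "Asha"}')
line = serialize(row)
```

The table definition stays the same whatever the file format is; only the SerDe changes how the bytes are read and written.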
What happens if we remove or add a column in Glue
- The Glue catalog is Hive-compatible, and Hive has no DROP COLUMN, so you cannot remove a column directly - https://stackoverflow.com/questions/34198114/alter-hive-table-add-or-drop-column
- Instead, you can alter the table with REPLACE COLUMNS to keep only the required columns
- You can add a column to the Hive catalog, and Hive returns NULL for that column in previously written data. You cannot set a default value for existing data; if you need a default value, you have to drop and recreate the table
- Materialized views and joining tables
-- How Hudi knows which record is the latest
-- Hudi Snapshot and Incremental API
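The add and remove column behaviour noted above follows from schema-on-read: the catalog only describes how to read the files and does not rewrite them. A small simulation in plain Python (not Glue or Hive code; the rows and column names are invented) shows the effect.

```python
def read_with_schema(stored_rows, table_columns):
    """Project stored records onto the catalog's current column list.

    Columns missing from a stored record come back as None (NULL);
    columns not in the catalog are simply not returned.
    """
    return [{col: rec.get(col) for col in table_columns} for rec in stored_rows]


# Records written before and after a hypothetical "country" column was added.
old_files = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
new_files = [{"id": 3, "name": "c", "country": "IN"}]
stored = old_files + new_files

# After adding the column: old rows show NULL, there is no default value.
after_add = read_with_schema(stored, ["id", "name", "country"])

# After replacing the column list with a subset: data files are untouched,
# the extra columns are just no longer visible.
after_replace = read_with_schema(stored, ["id"])
```

Because the files themselves never change, giving old rows a real value means rewriting the data, which is why the notes above fall back to dropping and recreating the table.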
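How data skew is handled in Spark
- Salting technique in Spark for handling skew: add a new field (a random salt) to the join key so that rows for a hot key are spread across several partitions. The other side of the join is replicated once per salt value so that every salted key still finds its match
-- How do we partition a table in S3 or Hudi, and what partition keys do we use? What are indexes in Hudi?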
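For the salting technique mentioned above, here is a minimal sketch in plain Python rather than Spark, so it can run anywhere. The data, the number of salts and the helper names are all invented for illustration; in Spark the same idea is usually expressed with a random salt column on the large side and an exploded range of salts on the small side.

```python
import random
from collections import Counter

NUM_SALTS = 4  # illustrative value; tune to the degree of skew


def salt_large_side(rows, num_salts=NUM_SALTS, seed=0):
    """Attach a random salt to each key on the skewed (large) side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts=NUM_SALTS):
    """Replicate each small-side row once per salt value."""
    return [((key, salt), value) for key, value in rows
            for salt in range(num_salts)]


def join(left, right):
    """Simple hash join on the (key, salt) pair; returns (key, left, right)."""
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    return [(k[0], lv, rv) for k, lv in left for rv in lookup.get(k, [])]


# One hot key with 1000 rows and one cold key with a single row.
large = [("hot", i) for i in range(1000)] + [("cold", 1000)]
small = [("hot", "H"), ("cold", "C")]

salted_large = salt_large_side(large)
joined = join(salted_large, explode_small_side(small))

# Rows per salted key: the hot key is now split across several groups.
partition_sizes = Counter(k for k, _ in salted_large)
```

The join result is the same as an unsalted join, but no single salted key carries all 1000 hot rows. The cost is that the small side grows by a factor of `NUM_SALTS`.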
How Parquet stores data
- https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala#:~:text=Each%20block%20in%20the%20parquet,in%20the%20form%20of%20pages.
- Parquet does not store each column in a separate file
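- At a high level there is a body (row groups) and a footer (metadata)
- All the columns are contained in a single file as row groups. Each row group holds one column chunk per column, and each column chunk is stored as pages. The footer holds the schema and the metadata for every row group and column chunk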
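The body and footer layout can be seen directly in the bytes: a Parquet file starts and ends with the magic bytes `PAR1`, and the 4 bytes before the trailing magic give the footer length as a little-endian integer. The sketch below only locates the footer and does not decode the metadata inside it. The sample bytes are synthetic stand-ins, not a real Parquet file.

```python
import struct

MAGIC = b"PAR1"


def footer_location(data: bytes):
    """Return (footer_start, footer_length) for the bytes of a Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_length
    if footer_start < 4:
        raise ValueError("corrupt footer length")
    return footer_start, footer_length


# Synthetic stand-in bytes, used only to show the layout.
body = b"row-group-bytes"
footer = b"file-metadata"
fake_file = MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC

start, length = footer_location(fake_file)
```

A reader reads the footer first and then uses the row group and column chunk offsets stored there to fetch only the columns it needs.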