Cognossimplified: HUDI

Friday, 28 April 2023

Questions on Hudi

What are Copy on Write and Merge on Read

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html
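Copy on Write (COW) stores data only in columnar base files (Parquet). Every update rewrites the file that contains the affected records as a new version of that file, so writes take longer but reads need no merging. This makes COW more suitable for read-heavy workloads. COW is the default table type.
Merge on Read (MOR) writes updates to row-based delta log files and merges them with the base files at read time or during compaction. It is suitable for write-heavy workloads, so we do not use it.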
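
The difference can be sketched with a small toy model. This is only an illustration in plain Python, not Hudi code: the class names CopyOnWriteFile and MergeOnReadFile are made up for this example, and dictionaries stand in for Parquet base files and log files.

```python
# Toy model only: real Hudi writes Parquet base files and separate log files.
class CopyOnWriteFile:
    def __init__(self, records):
        # Every write produces a complete new version of the file.
        self.versions = [dict(records)]

    def upsert(self, key, value):
        new_version = dict(self.versions[-1])  # copy the whole file
        new_version[key] = value
        self.versions.append(new_version)
        return len(new_version)  # number of records rewritten

    def read(self):
        return self.versions[-1]  # latest version, no merge needed


class MergeOnReadFile:
    def __init__(self, records):
        self.base = dict(records)  # stands in for the base file
        self.log = []              # stands in for the delta log

    def upsert(self, key, value):
        self.log.append((key, value))  # cheap append, base untouched
        return 1  # number of records written

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:  # merge cost is paid at read time
            merged[key] = value
        return merged

    def compact(self):
        # Fold the log into a new base so later reads are cheap again.
        self.base = self.read()
        self.log = []
```

Updating one record in a three-record file rewrites three records in the COW model and one in the MOR model, while both return the same data on read.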

What is a SerDe (Serializer/Deserializer) in an Athena table

The SerDe tells Athena or Hive the format of the underlying files and how to parse them into rows and columns. For example, if we load a JSON file into a Hive table, the SerDe tells Hive that the underlying file is in JSON format.
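
As a rough illustration of the idea, the sketch below parses the same logical row from two different file formats. It is plain Python, not an actual Hive or Athena SerDe; the COLUMNS schema and both function names are assumptions made up for this example.

```python
import json

# Hypothetical table schema: column name and the type to cast text values to.
COLUMNS = [("id", int), ("name", str)]


def json_deserialize(line):
    """Parse one JSON line into a row tuple; missing keys become None."""
    obj = json.loads(line)
    return tuple(obj.get(name) for name, _ in COLUMNS)


def csv_deserialize(line):
    """Parse one comma-separated line into a row tuple, casting each field."""
    fields = line.strip().split(",")
    return tuple(cast(raw) for (_, cast), raw in zip(COLUMNS, fields))
```

The table definition stays the same in both cases. Only the deserializer changes with the file format, which is the role the SerDe plays for a table.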

What happens if we remove or add a column in Glue

The Glue catalog is a Hive-compatible catalog, and Hive has no direct command to drop a single column - https://stackoverflow.com/questions/34198114/alter-hive-table-add-or-drop-column
Instead, you can alter the table (REPLACE COLUMNS) to keep only the required columns.
You can add a column to the Hive catalog, and Hive will return null for that column in previously written data. You cannot set a default value for previous data; if you need a default value, you need to drop and recreate the table.
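
The behaviour follows from the catalog being schema-on-read: the stored files do not change, only the column list used to read them. A minimal sketch in plain Python, where read_with_schema is a made-up helper and dictionaries stand in for stored records:

```python
def read_with_schema(stored_rows, table_columns):
    """Project stored rows onto the catalog's current column list.

    Columns missing from an old row come back as None (null);
    columns no longer in the catalog are simply not returned.
    """
    return [{col: row.get(col) for col in table_columns} for row in stored_rows]


# Rows written before any schema change (illustrative data only).
old_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Adding a column: old data shows null for it.
after_add = read_with_schema(old_rows, ["id", "name", "country"])

# Keeping only required columns: "name" is hidden, but still in the files.
after_replace = read_with_schema(old_rows, ["id"])
```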

- Materialized views and joining tables

-- How Hudi knows which record is the latest

-- Hudi Snapshot and Incremental API

How data skews are handled in Spark

-- How do we partition a table in S3 or Hudi, and what partition keys do we use? What are indexes in Hudi?

How Parquet stores data -

https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala#:~:text=Each%20block%20in%20the%20parquet,in%20the%20form%20of%20pages.
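Parquet does not store each column in a separate file.
At a high level, a file has a body (row groups) and a footer (metadata).
All the columns are contained in a single file. Each row group holds a slice of the rows, stored as one column chunk per column, and each column chunk is divided into pages. The footer holds the schema and the metadata for every row group and column chunk.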
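
The layout can be sketched with a toy writer. This is plain Python and does not produce a real Parquet file; write_toy_parquet and the dictionary layout are assumptions for illustration, and the min/max values stand in for the per-column statistics kept in the metadata.

```python
def write_toy_parquet(rows, columns, row_group_size):
    """Split rows into row groups, store each group column by column,
    and collect per-group metadata in a footer."""
    row_groups = []
    footer = {"schema": list(columns), "num_rows": len(rows), "row_groups": []}
    for start in range(0, len(rows), row_group_size):
        group_rows = rows[start:start + row_group_size]
        # One column chunk per column: values of that column stored together.
        column_chunks = {col: [r[col] for r in group_rows] for col in columns}
        row_groups.append(column_chunks)
        footer["row_groups"].append({
            "num_rows": len(group_rows),
            "columns": {
                col: {"min": min(vals), "max": max(vals)}
                for col, vals in column_chunks.items()
            },
        })
    return {"row_groups": row_groups, "footer": footer}


# Illustrative data only.
rows = [{"id": i, "price": i * 10} for i in range(1, 6)]
toy_file = write_toy_parquet(rows, ["id", "price"], row_group_size=2)
```

Five rows with a row group size of two give three row groups in one file. A reader can check the footer first and skip any row group whose min/max range cannot match a filter.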
