Monday, 24 April 2023

Data Streaming

There are various data streaming platforms:

- Kafka

- Spark Streaming

- Flink 

- Glue for Streaming 


Amazon Kinesis Data Streams 


Understanding AVRO 

- https://www.confluent.io/blog/avro-kafka-data/

- Avro keeps its metadata (the schema) in JSON while the data is stored in binary. Each file has a human-readable header; the rest is binary.
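A minimal sketch of that split, using only the standard library: the schema below is a hypothetical example, and the encoding functions illustrate Avro's binary wire format (zigzag varints for longs, length-prefixed UTF-8 for strings). A real Avro file header also carries a codec and sync marker, which this skips.

```python
import json

# Hypothetical Avro record schema - this part is human-readable JSON.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

def encode_long(n: int) -> bytes:
    """Avro longs: zigzag-encode, then emit as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zigzag maps small magnitudes to few bytes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro strings: long byte-length prefix, then raw UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw

# The schema travels as readable JSON; the record body is pure binary.
header_json = json.dumps(schema)
body = encode_long(42) + encode_string("alice")
```

The readable JSON header is what lets tools inspect an Avro file's schema without decoding the data blocks.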

-- Parquet and ORC are both suited to write-once, read-heavy workloads. ORC is optimized for Hive on Hadoop. Both are columnar formats

-- Parquet works best with Spark 

How Parquet stores data - good article: https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala/

- Both ORC and Parquet compress ("squeeze") the data - ORC stands for Optimized Row Columnar

https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data#:~:text=The%20ORC%20file%20format%20stores,reduce%20read%20and%20decompression%20loads.
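A toy illustration (not Parquet's or ORC's actual file layout) of why columnar storage suits read-heavy analytics: a query touching one column scans only that column's contiguous values instead of every full row, and runs of similar values in a column compress well. The sample rows are made up.

```python
# Hypothetical sample records.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 80},
    {"id": 3, "city": "Pune", "amount": 200},
]

# Row-oriented storage keeps whole records together (good for OLTP writes).
row_store = [tuple(r.values()) for r in rows]

# Column-oriented storage keeps each column's values together (good for
# scans, and similar values stored adjacently compress well).
col_store = {key: [r[key] for r in rows] for key in rows[0]}

# Summing one column only touches that column's values.
total = sum(col_store["amount"])
```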

How to create a data streaming job

https://aws.amazon.com/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/

Blog on Stream Processing

https://www.upsolver.com/wp/stream-processing-ebook?submissionGuid=d36d39a3-91ac-4f92-9a9b-f41cfd7eb305

Message Broker / Stream Processor 

- Apache Kafka 

- Kinesis Data streams 

Stream Processing Tools 

- AWS Kinesis 

- Apache Spark Streaming 

- Apache Flink 

- Kafka Streams API

- Amazon Kinesis Data Analytics allows you to process data streams and run analytics over them. It uses Apache Flink under the hood
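The core operation these stream processors provide is windowed aggregation. A hedged sketch of a tumbling-window count, the kind of aggregation a Kinesis Data Analytics / Flink job would express in SQL or the DataStream API; the event timestamps and 10-second window size are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical stream of events with second-granularity timestamps.
events = [
    {"ts": 0, "user": "a"},
    {"ts": 4, "user": "b"},
    {"ts": 11, "user": "a"},
    {"ts": 19, "user": "a"},
]

WINDOW = 10  # seconds per tumbling (non-overlapping) window

counts = defaultdict(int)
for e in events:
    window_start = (e["ts"] // WINDOW) * WINDOW  # bucket event into its window
    counts[(window_start, e["user"])] += 1
```

Each event lands in exactly one window, so counts can be emitted as soon as a window closes rather than waiting for the whole stream.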

--------------------------------------------

For Loading data 

- Kinesis Data Firehose - Firehose also has some transformation capabilities. It supports dynamic partitioning and can partition data based on keys extracted from records
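Conceptually, dynamic partitioning extracts keys from each record and routes it to a keyed destination prefix. A sketch of that idea, assuming hypothetical field names and prefix pattern (this is not Firehose's actual configuration syntax, which uses JQ expressions and prefix templates):

```python
import json

def s3_prefix(record: bytes) -> str:
    """Build a partitioned S3-style prefix from fields inside the record."""
    data = json.loads(record)
    # Hypothetical Hive-style partition layout.
    return f"events/region={data['region']}/year={data['year']}/"

record = json.dumps({"region": "us-east-1", "year": 2023, "value": 7}).encode()
prefix = s3_prefix(record)
```

Partitioning on keys like these means downstream query engines can prune whole prefixes instead of scanning all delivered data.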


Kafka Consumer API 

-- this is not a streaming API

-- it allows you to read messages from Kafka topics and validate them or apply logic to them
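A sketch of that consume-and-validate pattern. A real consumer would come from a Kafka client (e.g. kafka-python's `KafkaConsumer` polling a topic); here an in-memory list of raw messages stands in so the loop is runnable, and the business rule is an illustrative assumption.

```python
import json

# Stand-in for messages polled from a Kafka topic.
messages = [
    b'{"order_id": 1, "amount": 50}',
    b'not-json',                       # malformed record we expect to reject
    b'{"order_id": 2, "amount": -5}',  # fails the business rule below
]

def validate(raw: bytes):
    """Return the parsed record if valid, else None (a real pipeline
    might send rejects to a dead-letter topic instead)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if record.get("amount", 0) <= 0:   # illustrative business rule
        return None
    return record

processed = [r for m in messages if (r := validate(m)) is not None]
```

Unlike Kafka Streams, this plain consumer loop owns its offsets, errors, and state explicitly, which is exactly why it suits simple validate-and-forward logic.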

