Cognossimplified: Data Streaming

Monday, 24 April 2023

Data Streaming

There are various data streaming platforms:

- Kafka

- Spark Streaming

- Flink

- Glue for Streaming

- Amazon Kinesis Data Streams

Understanding Avro

- https://www.confluent.io/blog/avro-kafka-data/

- Avro keeps its metadata (the schema) as JSON and stores the data in binary. Each file has a header, which holds the human-readable schema, and the rest of the file is binary.
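To make the "JSON schema plus binary data" point concrete, here is a minimal sketch of how one record body can be encoded. It assumes a hypothetical `User` schema with one `string` and one `long` field, and it only covers the record body (zig-zag varints for numbers, length-prefixed UTF-8 for strings), not the full container file with its magic bytes and sync markers.

```python
import json

# Hypothetical schema, used only for illustration.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}


def zigzag_varint(n: int) -> bytes:
    """Encode a 64-bit integer as zig-zag, then as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small positives
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(record: dict, schema: dict) -> bytes:
    """A record is its fields concatenated in schema order, with no field names."""
    encoders = {"string": encode_string, "long": zigzag_varint}
    return b"".join(
        encoders[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


schema_json = json.dumps(SCHEMA)  # the human-readable part
body = encode_record({"name": "Ana", "age": 30}, SCHEMA)  # the binary part
```

The binary body carries no field names at all, which is why a reader needs the JSON schema from the header to make sense of it.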

-- Parquet and ORC are both suited to write-once, read-heavy workloads. Both are columnar formats. ORC is optimized for Hive on Hadoop.

-- Parquet works best with Spark

How Parquet stores data. Good article - https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala/

- Both ORC and Parquet compress the data. ORC stands for Optimized Row Columnar.

https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data#:~:text=The%20ORC%20file%20format%20stores,reduce%20read%20and%20decompression%20loads.
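A toy illustration of why columnar formats suit read-heavy analytics and compress well. This is plain Python, not the Parquet or ORC on-disk layout; the sample rows and the `run_length_encode` helper are made up for the sketch.

```python
from itertools import groupby

# Made-up sample data in row layout: one dict per record.
rows = [
    {"user_id": 1, "country": "US", "amount": 20},
    {"user_id": 2, "country": "US", "amount": 35},
    {"user_id": 3, "country": "US", "amount": 15},
    {"user_id": 4, "country": "IN", "amount": 10},
    {"user_id": 5, "country": "IN", "amount": 20},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field only needs that one column...
total_from_columns = sum(columns["amount"])
# ...while the row layout has to walk every full record.
total_from_rows = sum(row["amount"] for row in rows)

# Similar values sit next to each other in a column, so they squeeze well.
encoded_country = run_length_encode(columns["country"])
```

Real columnar files combine ideas like this with dictionary encoding and general-purpose compression, so treat the sketch as intuition only.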

How to create a data streaming job

https://aws.amazon.com/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/

Blog on stream processing

https://www.upsolver.com/wp/stream-processing-ebook?submissionGuid=d36d39a3-91ac-4f92-9a9b-f41cfd7eb305

Message Broker / Stream Processor

- Apache Kafka

- Kinesis Data Streams

Stream Processing Tools

- AWS Kinesis

- Apache Spark Streaming

- Apache Flink

- Kafka Streams API

- Amazon Kinesis Data Analytics lets you process data in streams and run analytics on it. It uses Apache Flink.

--------------------------------------------

For Loading data

- Kinesis Firehose - Firehose also has some transformation capabilities. It supports dynamic partitioning and can partition data based on keys.
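The idea of partitioning by keys can be sketched in plain Python. This is not the Firehose API, only an illustration of grouping records under a prefix built from their key values; the field names (`region`, `event_type`) and the `key=value/` prefix layout are assumptions for the example.

```python
from collections import defaultdict


def partition_prefix(record, keys):
    """Build a prefix such as 'region=eu/event_type=click/' from the record."""
    return "/".join(f"{key}={record[key]}" for key in keys) + "/"


def group_by_partition(records, keys):
    """Group records into batches, one batch per partition prefix."""
    batches = defaultdict(list)
    for record in records:
        batches[partition_prefix(record, keys)].append(record)
    return dict(batches)


# Made-up events for the sketch.
events = [
    {"region": "eu", "event_type": "click", "id": 1},
    {"region": "us", "event_type": "click", "id": 2},
    {"region": "eu", "event_type": "click", "id": 3},
]

batches = group_by_partition(events, ["region", "event_type"])
```

Each batch would end up under its own prefix in the destination, so a later query that filters on `region` can skip the other prefixes entirely.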

Kafka Consumer API

-- This is not a streaming API

-- It lets you read Kafka streams and validate them or apply logic to them
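A minimal sketch of "read and validate" with a plain consumer, assuming the third-party `confluent-kafka` package, JSON-encoded messages, and hypothetical names throughout (topic `orders`, broker `localhost:9092`, group `validator`, required fields `order_id` and `amount`). The validation rule is a pure function, so it can be tried without a broker.

```python
import json

# Hypothetical rule: every event must carry these fields.
REQUIRED_FIELDS = {"order_id", "amount"}


def validate(raw):
    """Return (True, event) for a good message, or (False, reason) otherwise."""
    try:
        event = json.loads(raw)
    except (TypeError, ValueError):
        return False, "not valid JSON"
    if not isinstance(event, dict):
        return False, "not a JSON object"
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return False, "missing fields: " + ", ".join(sorted(missing))
    if not isinstance(event["amount"], (int, float)) or event["amount"] < 0:
        return False, "amount must be a non-negative number"
    return True, event


def consume(topic="orders", servers="localhost:9092"):
    """Poll a topic forever and print the validation result of each message."""
    from confluent_kafka import Consumer  # third-party, imported lazily

    consumer = Consumer({
        "bootstrap.servers": servers,
        "group.id": "validator",  # hypothetical consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(1.0)  # wait up to one second for a message
            if msg is None:
                continue
            if msg.error():
                print("consumer error:", msg.error())
                continue
            ok, result = validate(msg.value())
            print("valid" if ok else "rejected", result)
    finally:
        consumer.close()  # leave the group cleanly
```

Calling `consume()` needs a reachable broker; `validate(b'{"order_id": 1, "amount": 5}')` can be run on its own to check the rule.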
