There are various data streaming platforms:
- Apache Kafka
- Spark Streaming
- Apache Flink
- AWS Glue for Streaming
- Amazon Kinesis Data Streams
Understanding Avro
- https://www.confluent.io/blog/avro-kafka-data/
- Avro keeps its metadata (the schema) in JSON while the data itself is stored in binary. Each file has a human-readable header containing the schema; the rest of the file is binary, as in the sketch below.
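A minimal write/read sketch using the fastavro library; the schema, records, and file name are made up for illustration, not taken from the post:

```python
# Minimal Avro write/read sketch using the fastavro library.
# The schema, records, and file name are illustrative assumptions.
from fastavro import writer, reader

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# The JSON schema is embedded in the file header; the records are binary.
with open("users.avro", "wb") as f:
    writer(f, schema, records)

# A reader recovers the schema from the header, then decodes the binary rows.
with open("users.avro", "rb") as f:
    for record in reader(f):
        print(record)
```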
-- Parquet and ORC are both suited to write-once, read-heavy workloads. ORC is optimized for Hive on Hadoop. Both are columnar formats.
-- Parquet works best with Spark.
How Parquet stores data. Good article: https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala/
- Both ORC (Optimized Row Columnar) and Parquet compress the data; see the sketch below.
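A small pyarrow sketch of the columnar idea; the table contents and the snappy compression choice are illustrative assumptions:

```python
# Columnar write/read sketch with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "DE"],
})

# Parquet compresses column chunks; snappy is a common default codec.
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: a reader can pull just one column without scanning the
# rest, which is why the format suits write-once / read-heavy workloads.
print(pq.read_table("users.parquet", columns=["country"]))
```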
How to create a data streaming job
https://aws.amazon.com/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/
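The linked post builds the job with AWS Glue; below is a minimal sketch of the same streaming-ETL shape in plain Spark Structured Streaming (the engine Glue streaming jobs run on). The Kafka broker, topic, and output paths are placeholder assumptions, and running it needs the spark-sql-kafka connector package:

```python
# Streaming ETL sketch in Spark Structured Streaming. Broker, topic, and
# paths are assumptions; a real Glue job would target S3 instead of /tmp.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

# Read a stream of raw messages from a Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "events")                        # assumed topic
    .load()
    .selectExpr("CAST(value AS STRING) AS payload")
)

# Continuously land the stream as Parquet files.
query = (
    events.writeStream.format("parquet")
    .option("path", "/tmp/events")                        # assumed sink
    .option("checkpointLocation", "/tmp/checkpoints")
    .start()
)
query.awaitTermination()
```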
Blog on Stream Processing
Message Broker / Stream Processor
- Apache Kafka
- Kinesis Data Streams (see the producer sketch after the tools list below)
Stream Processing Tools
- Amazon Kinesis
- Apache Spark Streaming
- Apache Flink
- Kafka Streams API
- Amazon Kinesis Data Analytics lets you process streaming data and run analytics over it; it uses Apache Flink under the hood.
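To make the broker side of the split above concrete, here is a minimal boto3 producer sketch for Kinesis Data Streams; the stream name, region, and payload are assumptions, not values from the post:

```python
# Minimal producer sketch for Kinesis Data Streams using boto3.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

event = {"user_id": 42, "action": "click"}

# PartitionKey decides which shard receives the record; records with the
# same key land on the same shard, preserving their relative order.
kinesis.put_record(
    StreamName="my-stream",  # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),
)
```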
--------------------------------------------
For loading data
- Kinesis Data Firehose: Firehose also has some transformation capabilities. It supports dynamic partitioning and can partition data based on keys in the records, as in the sketch below.
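A minimal boto3 sketch for loading a record into Firehose; the delivery stream name, region, and record shape are illustrative assumptions:

```python
# Loading one record into Kinesis Data Firehose with boto3.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # assumed region

# With dynamic partitioning enabled on the delivery stream, a key such as
# "country" can be extracted from the record (e.g. via a JQ expression)
# and used to build the S3 prefix it is written under.
record = {"country": "IN", "amount": 120}

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # assumed name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```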
Kafka Consumer API
-- This is not a streaming API.
-- It lets you read messages from Kafka topics and validate them or apply custom logic, as in the sketch below.
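A minimal consume loop using the kafka-python package; the broker address, topic, and validation rule are assumptions:

```python
# Plain Kafka Consumer API sketch (kafka-python package): read each
# message, validate it, apply some logic. Unlike Kafka Streams, there is
# no topology or state store here.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                            # assumed topic
    bootstrap_servers="localhost:9092",  # assumed broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    order = message.value
    if "order_id" not in order:          # simple validation rule
        continue
    print(f"processing order {order['order_id']}")
```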