Cognossimplified: April 2023

Friday, 28 April 2023

HUDI

Question on Hudi

What is Copy on write and Merge on Read

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html
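Copy on Write (COW) rewrites the whole data file as a new version whenever records in it change. Writes take longer, but reads are fast, so COW is more suitable for read-heavy workloads. COW is the default table type.
Merge on Read (MOR) writes changes to delta log files and merges them with the base files at read time or during compaction. It is suitable for write-heavy workloads, which is not our case, so we do not use it.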
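
As a rough sketch of where the table type is chosen, the helper below builds the option dictionary that is usually passed to the Hudi Spark writer. The option keys are standard Hudi write configs; the helper name and the column names `order_id` and `updated_at` are made up for illustration.

```python
def hudi_write_options(table_name, table_type="COPY_ON_WRITE"):
    """Build Hudi Spark datasource write options for the given table type."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown Hudi table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Hypothetical columns: the record key identifies a row, and the
        # precombine field decides which version wins when keys collide.
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }


cow_options = hudi_write_options("orders")
mor_options = hudi_write_options("orders", "MERGE_ON_READ")
print(cow_options["hoodie.datasource.write.table.type"])
print(mor_options["hoodie.datasource.write.table.type"])
```

In a Spark job these options would typically be applied with df.write.format("hudi").options(**cow_options).mode("append").save(path).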

What is Serialization in Athena table Serde

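SerDe stands for Serializer/Deserializer. It tells Athena (or Hive) how to read and write rows in the underlying file format. For example, if we load a JSON file into a Hive table, the SerDe tells the engine that the underlying file is in JSON format and how to parse it.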

What happens if we remove or add a column in Glue

The Glue Data Catalog is a Hive-compatible catalog, and Hive has no direct way to drop a column - https://stackoverflow.com/questions/34198114/alter-hive-table-add-or-drop-column
To remove a column, you alter the table with REPLACE COLUMNS and list only the columns you want to keep.
You can add a column to the Hive catalog, and Hive returns null for that column in previously written data. You cannot have a default value for previous data; if you need a default value, you need to drop and recreate the table.

- Materialized views and joining tables

-- how hudi knows something is latest record

-- Hudi Snapshot and Incremental API

How data skews are handled in Spark

Salting technique in Spark for handling skew - add a random salt value as a new field in the join key so that rows for a hot key are spread across several partitions
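
The idea can be illustrated without Spark. The sketch below is plain Python, not Spark code: it salts the skewed side of a join with a random number, replicates the small side once per salt value, and joins on the (key, salt) pair. The data, NUM_SALTS, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt buckets


def salt_large_side(rows, rng):
    """Append a random salt to the join key of every row on the skewed side."""
    return [((key, rng.randrange(NUM_SALTS)), value) for key, value in rows]


def explode_small_side(rows):
    """Replicate every row of the small side once per salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(NUM_SALTS)]


def hash_join(left, right):
    """Plain hash join on the (key, salt) pair; the salt is dropped from the output."""
    lookup = {}
    for join_key, value in right:
        lookup.setdefault(join_key, []).append(value)
    return [(join_key[0], lv, rv) for join_key, lv in left for rv in lookup.get(join_key, [])]


rng = random.Random(42)
orders = [("CA", i) for i in range(1000)] + [("NY", i) for i in range(10)]  # "CA" is the hot key
states = [("CA", "California"), ("NY", "New York")]

salted_orders = salt_large_side(orders, rng)
joined = hash_join(salted_orders, explode_small_side(states))

# Without salting, one join key would carry 1000 rows; with salting the
# largest (key, salt) group is only a fraction of that.
rows_per_join_key = Counter(join_key for join_key, _ in salted_orders)
print(len(joined), max(rows_per_join_key.values()))
```

The cost of the technique is that the small side grows by a factor of NUM_SALTS, so the salt range is usually kept small.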

-- How do we partition table in S3 or HUDI , what partition keys do we use. What are indexes in HUDI

How Parquet stores data -

https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala#:~:text=Each%20block%20in%20the%20parquet,in%20the%20form%20of%20pages.
Parquet does not store each column in a separate file
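At a high level there is a body (row groups) and a footer (metadata)
All the columns are contained in a single file. The file is split into row groups, each row group holds one column chunk per column, and each column chunk is stored as pages. The footer holds the schema and the metadata for each column chunk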

Monday, 24 April 2023

Data Streaming

There are various data streaming platforms

- KAFKA

- Spark Streaming

- Flink

- Glue for Streaming

- Amazon Kinesis Data Streams

Understanding AVRO

- https://www.confluent.io/blog/avro-kafka-data/

- Avro keeps its metadata (the schema) in JSON and stores the data in binary. Each file has a header that contains the human-readable schema, and the rest is binary
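
To make the "metadata in JSON" point concrete, here is a small sketch that parses an Avro-style schema with the standard json module. The record and field names are hypothetical, and no Avro library is involved; it only shows that the schema itself is ordinary JSON.

```python
import json

# Hypothetical Avro schema for an order record; the names are made up.
schema_text = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer", "type": "string"},
    {"name": "amount", "type": ["null", "double"], "default": null}
  ]
}
"""

schema = json.loads(schema_text)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```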

-- Parquet and ORC - Both are suitable for write-once, read-heavy workloads. ORC is optimized for Hive with Hadoop. Both are columnar formats

-- Parquet works best with Spark

How Parquet stores data. Good article - https://www.linkedin.com/pulse/all-you-need-know-parquet-file-structure-depth-rohan-karanjawala/

- Both ORC and Parquet compress the data - ORC (Optimized Row Columnar)

https://www.upsolver.com/blog/the-file-format-fundamentals-of-big-data#:~:text=The%20ORC%20file%20format%20stores,reduce%20read%20and%20decompression%20loads.

How to Create Data streaming job

https://aws.amazon.com/blogs/big-data/crafting-serverless-streaming-etl-jobs-with-aws-glue/

Blog on Stream Processing

https://www.upsolver.com/wp/stream-processing-ebook?submissionGuid=d36d39a3-91ac-4f92-9a9b-f41cfd7eb305

Message Broker / Stream Processor

- Apache Kafka

- Kinesis Data streams

Stream Processing Tools

- AWS Kinesis

- Apache Spark Streaming

- Apache Flink

- Kafka streaming API

- Amazon Kinesis Data Analytics allows you to process data in streams and run analytics over it. It uses Apache Flink

--------------------------------------------

For Loading data

- Kinesis Firehose - Firehose also has some transformation capabilities. It supports dynamic partitioning and can partition data based on keys

Kafka Consumer API

-- this is not the Kafka Streams API

-- it allows you to read records from Kafka topics and validate them or apply custom logic

Wednesday, 19 April 2023

Python Notes

Quick python revision notes

# formatting
name = 'bhagvant'
f = f"this is {name}"
print(f)

# Notice we are not adding f
name = 'bhagvant'
greeting = 'hello {}'
k = greeting.format(name)

################ LIST ###########################
l = ["Bob", "Rolf", "Anne"]

l[0] = "Smith"
l.append("Jen")

# extend adds the items of any iterable (set, tuple) to the list; l + s raises TypeError because + works only between two lists
s = {"Bob", "Rolf", "Anne"}
l.extend(s)
print(l)

# a list element can itself be a set
l[0] = s

#gets the length of list
len(l)

# Get number of times element appears in list
print(l.count("Bob"))

# pop removes the element at the given index
l.pop(2)
l.pop() # removes last element

# reverse a list
l.reverse()

# sorts a list
prime_numbers = [11, 3, 7, 5, 2]
prime_numbers.sort()

############ Tuple ##############################

# you can concatenate two tuples, but you can't change an element of a tuple
tuple1 = (0, 1, 2, 3)
tuple2 = ('python', 'geek')
 
# Concatenating above two
print(tuple1 + tuple2)



################# SET ##########################
# sets are unordered, Cannot contain duplicates and efficient for searching elements


friends = {"Bob", "Rolf", "Anne"}
abroad = {"Bob", "Anne"}
# to create an empty set you can't use {}; that creates an empty dictionary
a = set()

s.add("Jen")
# a set can't hold the same element twice; its elements are distinct
s.add("Bob")


print(friends.difference(abroad))
# returns empty set
print(abroad.difference(friends))

print(friends.intersection(abroad))


friends = {"Bob", "Rolf", "Anne"}
abroad = {"Bob", "Anne"}

# superset
if friends > abroad :
    print('superset')
   
# subset
if abroad < friends:
    print('subset')


# membership checks with `in` are fast on sets because they are backed by a hash table
for name in friends:
    if name in abroad:
        print(f' {name} gone abroad')



############# Is operator #######################
# `is` checks whether two variables point to the same object
x = 5
y = 5

print(x is y)

############ If condition #####################

dayofweek = 'Monday'

if dayofweek == 'Monday':
    print(1)

######### in keyword #################
# The `in` keyword works in most sequences like lists, tuples, and sets.

friends = ["Rolf", "Bob", "Jen"]
print("Jen" in friends)

# --

movies_watched = {"The Matrix", "Green Book", "Her"}
user_movie = input("Enter something you've watched recently: ")

print(user_movie in movies_watched)


############ LOOPS #######################

n = 0
while n < 5:
    print(n)
    n+=1
   
while True:
    break
   
for i in range(10):
    print(f'for {i}')

# How to have 2 nested loops and loop through every pair (i, j) with j > i
_size = 3
for i in range(_size):
    k = i + 1
    for j in range(k, _size):
        print(i, j)

# remember when range start and end are equal it will not print
for i in range(1,1):
    print('wont print')

# incrementing range by 2 every time
for i in range(0,6,2):
    print(i)

########### List comprehension ###################

print([i for i in range(10) if i in [2,4,6]])


######## Dictionaries ##################

friend_ages = {"Rolf": 24, "Adam": 30, "Anne": 27}
print(friend_ages.keys())
print(friend_ages.values())
# adding to dictionary (do this before the lookup, otherwise it raises KeyError)
friend_ages['bhagvant'] = 54

print(friend_ages['bhagvant'])
   
# Simple
student_attendance = {"Rolf": 96, "Bob": 80, "Anne": 100}
for student in student_attendance:
    print(f"{student}: {student_attendance[student]}")

# better
for i,k in friend_ages.items():
    print(i,k)


############## Functions #####################

def abc(a=1,b=2):
    print('this is function',a,b)
   
abc(1,3)

def abc(a=1,b=2):
    print('this is function',a,b)
    c = a+b
    return c
   
total = abc(1,3)



########## Lambda ############################
# map applies a function (here a lambda) to every element of a sequence
l = [1,2,3,4,5]
sum1 = lambda x:x+1

map_object = map(sum1,l)
   
print(list(map_object))



# * packs the remaining values into a single list
a,*b = 1,2,3,4
print('first',a,b)

# *a packs the extra positional arguments into a tuple
def abc(k,*a):
    print(a)
    print(k)
   
abc(1,2,3,4)


# it packs the values into dictionary
def abc(**kwargs):
    print(kwargs)
   
abc(a='kk',b='jj')

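# ** also unpacks a dictionary into keyword arguments (helper defined here for illustration)
def anotherfunctionwithKeyValue(a, b):
    print(a, b)

def abc(**kwargs):
    print(kwargs)
    # unpack the dictionary and pass it on as keyword arguments
    anotherfunctionwithKeyValue(**kwargs)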
   
abc(a='kk',b='jj')



########## Object oriented programming ###################

class abc:
    # notice init has to have self as argument
    def __init__(self,a=0,b=0):
        self.c = a+b
        # all instance variables need self
   
    # self has to be passed as argument
    def multi(self):
        print(self.c)
     
    # used to print info about the object; should return a string
    def __str__(self):
        return f"value of c is {self.c}"
       
   
k = abc(1,1)
k.multi()
# __str__ gets called when you print the object
print(k)

Apache Airflow Notes

Notes on apache airflow

with DAG( dag_id = ) as dag :

- Operators in Airflow - the Python and Bash operators are the default ones which come with it

- https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html

- When interacting with any third-party service, we install the provider package for it

- Example AWS , Snowflake , Databricks

- Sensors -

- Task group - airflow.utils.task_group has TaskGroup, which groups tasks under a group id

Interview Use cases to talk about

Engineering Challenges solved

Building Data lake

- A system built over time was moved to Datalake

- The initial system had multiple Redshift clusters, compaction issues, and multiple S3 paths from which customers consumed data

- Ownership was divided among teams but not logically

- Access was not controlled, and there was redundant access

STAR format - How was the impact measured

- SLA improvement -- The team was able to redesign the pipeline during the migration to the data lake and take out redundant steps, improving the SLA from 6 hours to 3 hours

- Cost saving of 200k by moving processing

People Challenges solved

Hiring and Recruiting Issues solved

Big Projects Handled

Appraisal Ratings

Cost Saving

- What is cost of Redshift

- What is cost of EMR

- Cost of Athena

- Number of nodes

RA3.16xlarge - reserved instance - 75,000 - 48 vCPU, 384 GB RAM, 128 TB space, scales up to 16 petabytes

DS2.8xlarge - 16 TB HDD, 244 GB memory -- we had a 50-node cluster - DS2 is deprecated - 30 thousand

DC2.8xlarge - 32 vCPU, 244 GB memory, 2.5 TB SSD

EMR type used - R5d - 48 vCPU, 512 memory - supports up to EMR 6.3

Tuesday, 18 April 2023

Dependency Management with Other teams

- For teams in initial stages - Documenting user feedback on how data teams have helped the business indirectly helps greatly. Example: product owners were able to show a new widget which boosted sales. Understanding user behaviour helped them come up with widgets

- For teams in advanced stages - Integrating data reports with business operations. Infrastructure savings

- Productivity

- Prioritizing work, unblocking team members.

- Building trust within team so that they can voice their concerns, Running retros. Running team bonding activities

- Monitoring daily scrum calls

- Making sure work is aligned with career goals of people on the team.

- Building work life balance by setting up processes around work intake, completion timelines.

- Setting ownership expectations

- System Health

- Making sure alarms and disaster recovery are in place. Setting up ownership for projects

- planning for version updates , migrations

- Team Health

- Stakeholder Happiness

- separate post on stakeholder happiness

Delivering Results

Team Building

Collaboration

Vision

Professional Goals

Stakeholder Management

Earn the trust of stakeholders and make sure you are not losing it

Clearly defining ways of working of the team with SLA for resolving a particular type of issue

* Data Quality issues

* New data onboarding requests, How will new request come in

* Adhoc data requests

* Office hours and how to get expert to help stakeholders

* Clearly defining Data SLA

* What the team will help with (example: the team won't be writing queries for user questions). Maybe the team can help only during office hours

* Clearly defining ways to onboard customers

* How will stakeholder receive communication in case of issues/delays

* A clear onboarding process for stakeholders with a clear definition of what to expect.

* When a stakeholder creates indirect dependencies that feed critical business functions, making sure the downstream customers are aware of the SLA and processes of the upstream team

* Making sure access does not spiral out of control and only those who need access have it

I have seen a lot of stakeholder issues arise because it was not clear how a stakeholder was consuming data. The consumption patterns were developed ad hoc by stakeholders without consulting the data team.

Having Dashboards to clearly show and document

* SLA

* Data Quality

* Incident requests

* Updates and Backfill delays

* Centre of excellence - In case of bigger issues, what was done in the past

How to Track Sprint progress as Manager

Task Priorities

Velocity

Running Retrospectives

How to Prioritize tasks

* We have organization goals for a year, we align our highest priority work to match those

High Priority

Example: moving to cloud / moving to in-house reporting tools to reduce cost / moving servers to the cloud / aligning with org initiatives like moving to a data lake
Bug fixes - Is this bug immediately impacting the business, or can it go to the backlog?

Priority needs to be discussed with the business, and work is prioritized based on business impact

- New dataset requests

- Report /Dashboard - creation request

Tech Debts

Velocity of Work

- Define how to measure work

- Measure velocity of work

- Fibonacci series numbers for story points - 1, 2, 3, 5, 8, 13

- We measure story point even for stories in backlog

- We investigate a sudden drop in sprint velocity

- for capacity we track - 6 hours * 10 days = 60 hours

- example 5 members - 6 points each - 30 points for each sprint
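
The capacity arithmetic, plus a simple way to flag a sudden velocity drop, can be sketched as follows. The 25% threshold and the sample velocity history are assumptions for illustration, not numbers from the team.

```python
FOCUS_HOURS_PER_DAY = 6
SPRINT_DAYS = 10
TEAM_SIZE = 5
POINTS_PER_MEMBER = 6

capacity_hours_per_member = FOCUS_HOURS_PER_DAY * SPRINT_DAYS  # 6 * 10
team_points_per_sprint = TEAM_SIZE * POINTS_PER_MEMBER         # 5 * 6


def velocity_dropped(history, threshold=0.25):
    """Flag a sudden drop: the latest sprint is more than `threshold`
    below the average of the earlier sprints (threshold is an assumption)."""
    *earlier, latest = history
    baseline = sum(earlier) / len(earlier)
    return latest < baseline * (1 - threshold)


print(capacity_hours_per_member, team_points_per_sprint)
print(velocity_dropped([30, 28, 31, 20]))
```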

How to measure Impact in Data engineering Projects

How to measure impact in Data engineering Project

* Depending on the stage your data engineering project is in, the metric to measure impact will vary

* First we need to define a goal and measure how we are doing against the goal. High level goals are

* Have the Data - Stage 1

* Know how to use it - Stage 2

* Trust the Data - Stage 3

Have the Data - Stage 1

When we are just starting a data team, we are in this stage.

Accuracy

At this stage we want to measure the accuracy of data. For this we generally set up alarms comparing source data with the target. Example: if the sum of sales of orders in the source by 10 am is 20, the same should reflect in the data warehouse

Completeness

We want to make sure we have captured all the orders by 10am . Example 5 orders in source , same 5 should reflect in target
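
A minimal sketch of such a source-versus-target check is shown below. The Snapshot type, the reconcile function, and the tolerance parameter are hypothetical; in practice the two snapshots would come from queries against the source system and the warehouse at the same cut-off time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Aggregates taken from one system at a cut-off time (e.g. 10 am)."""
    order_count: int
    sales_total: float


def reconcile(source: Snapshot, target: Snapshot, tolerance: float = 0.0) -> List[str]:
    """Return alarm messages; an empty list means both checks passed."""
    alarms = []
    if source.order_count != target.order_count:
        alarms.append(
            f"completeness: source has {source.order_count} orders, "
            f"target has {target.order_count}"
        )
    if abs(source.sales_total - target.sales_total) > tolerance:
        alarms.append(
            f"accuracy: source sales {source.sales_total}, "
            f"target sales {target.sales_total}"
        )
    return alarms


print(reconcile(Snapshot(5, 20.0), Snapshot(5, 20.0)))
print(reconcile(Snapshot(5, 20.0), Snapshot(4, 16.0)))
```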

Consistency

Are we consistent in getting the data, i.e. is the SLA met?

Usability

Number of reports , Dashboards published in Subject area

User survey results

Reliability

Number of tickets for data quality by users

Know how to use this data - STAGE 2

Here we mean business users should know how to use it . We use below metrics to track it

User Training / BI Office hours conducted /Number of people attended weekly
Number of users onboarded.
Scheduled reports running
Adhoc queries run by users on Daily basis

Trust the Data - STAGE 3

Metrics we track at this stage

* Is the DW used for operational reporting? We want to track direct business impact

* Infrastructure savings - At this stage, we want to track how we can reduce our infrastructure costs

* Reduction in the number of ad hoc requests for data via tickets

* Turn Around Time - This is at very high level. How long does it take to fulfill a new data requests/ Project time

Additional Spark Interview Questions

* How to add column to spark dataframe

df.withColumn("copiedfromcolumn", col("salary") - 1)

df.selectExpr("orderid", "salary - 1 as new_salary")

using map to add column in spark

* Difference between map and flat map

- https://sparkbyexamples.com/spark/spark-map-vs-flatmap-with-examples/

- flatMap can return zero or more output rows for each input row, so the output can have more (or fewer) rows than the input; map returns exactly one output row per input row
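
The difference can be shown in plain Python, used here only as an analogy for the Spark operations: map keeps one output element per input element, while the flatMap equivalent applies the same function and then flattens the results by one level.

```python
from itertools import chain

lines = ["hello world", "spark", ""]

# map: exactly one output element (a list of words) per input line
mapped = list(map(lambda line: line.split(), lines))

# flatMap equivalent: apply the function, then flatten one level,
# so an input line can contribute zero, one, or many output elements
flat_mapped = list(chain.from_iterable(map(lambda line: line.split(), lines)))

print(mapped)       # [['hello', 'world'], ['spark'], []]
print(flat_mapped)  # ['hello', 'world', 'spark']
```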

* reduceByKey and groupByKey

* https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

-- reduceByKey first combines the data within each partition and then shuffles, thereby reducing the amount of data which needs to be shuffled
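
A toy simulation in plain Python (not Spark code) makes the saving visible. The two partitions and the function names below are made up; the point is only to count how many records would cross the shuffle in each style.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def sum_by_key(records):
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def group_by_key_style(parts):
    """Every record crosses the shuffle; aggregation happens afterwards."""
    shuffled = [record for part in parts for record in part]
    return sum_by_key(shuffled), len(shuffled)


def reduce_by_key_style(parts):
    """Combine inside each partition first, then shuffle only the partial sums."""
    shuffled = []
    for part in parts:
        shuffled.extend(sum_by_key(part).items())
    return sum_by_key(shuffled), len(shuffled)


print(group_by_key_style(partitions))   # ({'a': 4, 'b': 3}, 7)
print(reduce_by_key_style(partitions))  # ({'a': 4, 'b': 3}, 4)
```

Both styles give the same totals, but the second one moves 4 records instead of 7 in this example.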

* How to perform basic spark operations

* What is cost-based optimization in Spark

* How to optimize Athena Queries

- https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

* Redshift spectrum extends support for Open source Hudi Datalakes

* https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/

* Redshift spectrum and Athena

https://www.upsolver.com/blog/aws-serverless-redshift-spectrum-athena
Redshift Spectrum needs you to have a Redshift cluster; Athena does not
Redshift Spectrum performance depends on your cluster size, so it can be faster than Athena
Redshift Spectrum allows you to join S3 data with the tables in your cluster

* How to use Spark Sql inside scala based application

sparkDf.createOrReplaceTempView("abc_table")

val agg_df = spark.sql("select order,count(*) from abc_table group by order")

Steps to work with Spark

- create a Spark session object

val spark = SparkSession.builder().config(...).getOrCreate()

val df = spark.read.option(...).csv(...)

df.select(...).where(...).groupBy(...)

df.join(df2, df("colname") === df2("colname"), "inner")

Spark Optimizations

* https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

* Out of memory issues happen when Spark executor memory or parallelism is not set properly

* R-family instances for memory intensive and C-family instances for compute intensive applications

* Understand that out of the total executor memory around 90% is available to the executor; the rest goes to system processes, function memory, and reserved memory

spark.executor.memory – Size of memory to use for each executor that runs the task.
spark.executor.cores – Number of virtual cores.

Take example - Suppose we want to process 200 TB of data in S3 files

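- r5.12xlarge -- 48 cores, 384 GB RAM - 20 instances in total (1 for driver)

- Let's start with spark.executor.cores - 5 cores, which gives (48 - 1) / 5 = 9 executors per instance

- spark.executor.memory - 383 GB / 9 executors = 42 GB in total per executor, of which 90% = 37 GB

spark.executor.instances - (9 * 19) - 1 = 170 (one is left for the driver)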

spark.default.parallelism - 170 * 5 * 2 = 1700
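
The same arithmetic as a small Python helper, as a sketch. The function name and the returned dictionary layout are made up; the rules of thumb (reserve 1 vCPU and 1 GB per instance, use 90% of the per-executor memory, leave one executor for the driver, use 2 tasks per core for parallelism) follow the worked example above.

```python
import math


def size_executors(vcpus_per_instance, ram_gb_per_instance, core_instances,
                   executor_cores=5, memory_fraction=0.90):
    """Rule-of-thumb Spark executor sizing for a cluster of identical instances."""
    # Reserve one vCPU per instance for the OS and Hadoop daemons
    executors_per_instance = (vcpus_per_instance - 1) // executor_cores
    # Reserve 1 GB of RAM per instance, then split the rest between executors
    total_memory_per_executor = (ram_gb_per_instance - 1) // executors_per_instance
    # Around 90% goes to spark.executor.memory; the rest is overhead
    executor_memory_gb = math.floor(total_memory_per_executor * memory_fraction)
    # Leave one executor slot for the driver
    executor_instances = executors_per_instance * core_instances - 1
    default_parallelism = executor_instances * executor_cores * 2
    return {
        "spark.executor.cores": executor_cores,
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.executor.instances": executor_instances,
        "spark.default.parallelism": default_parallelism,
    }


# r5.12xlarge: 48 vCPU, 384 GB RAM, 19 instances running executors
sizing = size_executors(48, 384, 19)
print(sizing)
```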

* Coalesce vs Repartition

- https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce

- coalesce is used for reducing the number of partitions and avoids a full data shuffle; repartition can increase or decrease the number of partitions but always shuffles the data

Additional Spark Optimization Techniques

- Enabling Spark Adaptive Query Execution (AQE)

- Handles situations where partition sizes are not uniform, which leads to work being done by only a small number of executors. Example: grouping by state, where the number of states is limited, leaves the work to a small number of executors. AQE helps in choosing the number of partitions at runtime

- Dynamically switching join strategies - Example: filtering data while joining. The join strategy is initially based on the estimated size of the dataset, but AQE makes sure the plan is changed based on the actual data size at runtime

- Handling cases where one partition is much bigger - one task takes much longer. AQE can split such a skewed partition into smaller ones
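
The three behaviours above map to standard Spark SQL settings. The sketch below only renders them as spark-submit arguments; the helper name is made up, and whether each setting is already on by default depends on the Spark version in use.

```python
# Spark SQL settings related to Adaptive Query Execution
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # turn AQE on
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def to_spark_submit_args(conf):
    """Render a conf dict as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_spark_submit_args(AQE_CONF)))
```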

Noting down my Spark Understanding - How Plan is generated

Understanding Spark Jobs , Stages , Task.

* Purpose here is to understand how spark execution plan is generated

* Spark has Actions and Transformations. Spark uses lazy evaluation, and only an Action generates a Job.

* Transformation and Action in Spark

* A Spark DataFrame is a distributed data structure and it is immutable

* SQL-like operations are transformations - select, filter, group by, union, intersection, distinct, repartition

- https://spark.apache.org/docs/latest/rdd-programming-guide.html

Transformation	Meaning
map(func)	Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func)	Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)	Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset)	Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions])	Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using `reduceByKey` or `aggregateByKey` will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional `numPartitions` argument to set a different number of tasks.
reduceByKey(func, [numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in `groupByKey`, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in `groupByKey`, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions])	When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean `ascending` argument.
join(otherDataset, [numPartitions])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`.
cogroup(otherDataset, [numPartitions])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called `groupWith`.
cartesian(otherDataset)	When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars])	Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)	Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)	Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery.

Actions

Action	Meaning
reduce(func)	Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()	Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()	Return the number of elements in the dataset.
first()	Return the first element of the dataset (similar to take(1)).
take(n)	Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)	Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala)	Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala)	Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using `SparkContext.objectFile()`.
countByKey()	Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)	Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the `foreach()` may result in undefined behavior. See Understanding closures for more details.

* Each Job has its own DAG of stages. The DAG is generated by Spark, which compiles our code down to low-level API calls, so it is difficult for us to know the exact details

* Each Action triggers a Job, in our case reading data from CSV

* Stages within a Job are separated by Shuffle operations

* Generally wide dependency transformations such as group by and repartition will have their own stages

* Narrow dependency transformations such as where and select run together as tasks within one stage

* Whenever we need to shuffle or sort the results, the work is generally broken into stages
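
A toy model of this splitting, in plain Python: it walks a list of transformation names and starts a new stage at every wide (shuffle) transformation. This is a simplification for intuition only, not how Spark's scheduler is implemented, and the WIDE set and the sample plan are assumptions.

```python
# Transformations that need a shuffle (wide dependencies) in this toy model
WIDE = {"groupBy", "repartition", "join", "distinct", "orderBy"}


def split_into_stages(transformations):
    """Start a new stage at every wide transformation; narrow ones stay in the current stage."""
    stages = [[]]
    for name in transformations:
        if name in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


plan = ["read", "where", "select", "groupBy", "select", "orderBy"]
for number, stage in enumerate(split_into_stages(plan)):
    print(number, stage)
```

For the sample plan this gives three stages: the narrow read/where/select steps, then groupBy with the following select, then orderBy.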

