Spark

Spark vs MapReduce (in-memory computation)
What is an RDD, and what does it consist of?
Actions and transformations
Wide and narrow transformations
What is a shuffle, and why is it bad?
Deploy modes (local, client, and cluster)
Run modes (local[1] vs local[*] vs remote [with YARN only or not?])
Application workflow (jobs, stages, tasks)
Why is using collect a bad practice?
reduceByKey vs groupByKey vs combineByKey (RDD)
Coalesce vs repartition
RDD vs DataFrame vs Dataset
Catalyst, CBO, Tungsten
Spark optimization (spark.serializer, spark.driver.maxResultSize, spark.sql.shuffle.partitions vs spark.default.parallelism)
Cache vs persist methods in Spark
Window operations: what they are in Spark
Metastore
Speculative execution
DStream: nature and anatomy
Window calculations in Spark Streaming
Checkpointing in Spark Streaming; write-ahead log (WAL)
Spark Streaming receivers
Spark Streaming application workflow
foreachPartition vs foreachRDD
Monitoring streaming queries (reporting metrics, Dropwizard)
Spark Structured Streaming: basic concepts; watermarking
Spark Structured Streaming output modes (Append, Complete, Update)
Spark Kafka integration

Last updated 5 years ago
