# Spark

* Spark vs MapReduce (in-memory computation).
* What is an RDD, and what does it consist of?
* Actions and transformations
* Wide and narrow transformations
* What is a shuffle, and why is it bad?
* Deploy modes (local, client, and cluster)
* Run modes (local\[1] vs local\[\*] vs remote \[with YARN only or not?])
* Application workflow (jobs, stages, tasks)
* Why is using collect a bad practice?
* reduceByKey vs groupByKey vs combineByKey (RDD)
* Coalesce vs repartition
* RDD vs DataFrame vs Dataset
* Catalyst, CBO, Tungsten
* Spark optimization (spark.serializer, spark.driver.maxResultSize, spark.sql.shuffle.partitions vs spark.default.parallelism)
* Cache vs persist methods in Spark
* Window operations. What are they in Spark?
* Metastore
* Speculative execution
* DStream. Nature and anatomy.
* Window calculations in Streaming.
* Checkpointing in Spark Streaming. WAL.
* Spark Streaming receivers
* Spark Streaming application workflow
* foreachPartition vs foreachRDD
* Monitoring streaming queries (reporting metrics, Dropwizard)
* Spark Structured Streaming: basic concepts; watermarking
* Spark Structured Streaming output modes (Append, Complete, Update)
* Spark Kafka integration
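
For the reduceByKey vs groupByKey item above, a small sketch may help show why map-side combining reduces shuffle volume. This is plain Python, not the Spark API: partitions are simulated as lists, and the function names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Simulate groupByKey followed by a sum: every record crosses the shuffle."""
    # No map-side combine: all (key, value) records are sent as-is.
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Simulate reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # map-side combine
        shuffled.extend(local.items())  # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value  # reduce-side merge of partial sums
    return dict(totals), len(shuffled)


# Two simulated partitions of (word, 1) pairs (made-up sample data).
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)
```

Both approaches produce the same totals, but the simulated reduceByKey sends fewer records across the shuffle boundary because each partition pre-aggregates its own keys. The gap grows with the number of repeated keys per partition.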
