
  • Spark vs. MapReduce (in-memory computation).

  • What is an RDD, and what does it consist of?

  • Actions and transformations

  • Wide and narrow transformations

  • What is a shuffle, and why is it bad?

  • Deploy modes (local, client, and cluster).

  • Run modes (local[1] vs. local[*] vs. remote [with YARN only or not?])

  • Application workflow (jobs, stages, tasks)

  • Why is using collect a bad practice?

  • reduceByKey vs. groupByKey vs. combineByKey (RDD)

  • Coalesce vs. repartition

  • RDD vs. DataFrame vs. Dataset

  • Catalyst, CBO, Tungsten

  • Spark optimization (spark.serializer, spark.driver.maxResultSize, spark.sql.shuffle.partitions vs. spark.default.parallelism)

  • Cache vs. persist methods in Spark

  • Window operations: what are they in Spark?

  • Metastore

  • Speculative execution

  • DStream: nature and anatomy.

  • Window calculations in Streaming.

  • Checkpointing in Spark Streaming; write-ahead log (WAL).

  • Spark Streaming receivers

  • Spark Streaming application workflow

  • foreachPartition vs. foreachRDD

  • Monitoring streaming queries (reporting metrics, Dropwizard)

  • Spark Structured Streaming: basic concepts; watermarking

  • Spark Structured Streaming output modes (Append, Complete, Update)

  • Spark Kafka integration

