Spark Core interview questions Set 1

Debashree Gorai
3 min read · Oct 4, 2020

1. What is Spark good for?

Spark is a data processing engine used across a wide range of workloads, including ETL batch jobs, machine learning pipelines, and streaming applications where data comes from IoT devices, sensors, etc.

2. What is lazy evaluation in Spark?

It means that execution of a Spark application does not happen until an action is triggered. When transformations are called on an RDD, they are not executed immediately; instead, Spark maintains the list of operations in a DAG.

This helps Spark process large amounts of data efficiently, because it can plan the whole chain of operations and avoid computing more than the result actually requires.

val no_of_occurrences_spark = wikipedia.filter(_.contains("spark")).take(5)

Here the intermediate RDD is not computed right away; Spark evaluates only enough elements of the filtered RDD to return 5 results and then stops. This is how Spark processing benefits from lazy evaluation.
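
A slightly expanded sketch of the same idea (the file name is a placeholder): the transformation only records work in the DAG, and nothing runs until the action is called.

val wikipedia = sc.textFile("wikipedia.txt")
val sparkLines = wikipedia.filter(_.contains("spark"))   // transformation: no job runs yet
val firstFive = sparkLines.take(5)                       // action: triggers execution, stops after 5 matches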

3. What is the use of the Spark engine?

The Spark engine is responsible for scheduling, distributing, and monitoring data applications across a cluster.

4. What is RDD Lineage?

Spark does not replicate data while processing; instead, it maintains the history of operations in a DAG, called the RDD lineage, which helps reconstruct a lost RDD. We can obtain and study the lineage graph by using RDD.toDebugString.
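
A minimal sketch of inspecting the lineage built through the DAG (the file name is a placeholder):

val rdd = sc.textFile("input.txt")
  .filter(_.contains("spark"))
  .map(line => (line, 1))

println(rdd.toDebugString)   // prints the chain of parent RDDs that make up the lineage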

5. What is SparkContext?

It was introduced in Spark 1.x and was the main entry point of a Spark program before SparkSession arrived in Spark 2.0. It is typically placed at the start of the program and creates the connection to the Spark cluster. It is used to create broadcast variables, accumulators, and RDDs, to run jobs, and to access Spark services.

A default SparkContext object is available as sc in the Spark shell:

val rdd = sc.textFile("Fromfile.txt")

Creating a SparkContext programmatically:

import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("Spark Application")

val sc = new SparkContext(conf)

Constructing a SparkContext directly is discouraged since Spark 2.0; it is recommended to use the static method getOrCreate() to obtain the SparkContext:

val sc = SparkContext.getOrCreate(conf)

Only one SparkContext instance can be active per JVM.

6. What is a Spark Executor?

When a Spark application is submitted to a cluster, Spark launches executors on the worker nodes. Executors are the Spark processes that run tasks and send the results back to the driver.
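
As an illustrative sketch (the application name and values are made up, and spark.executor.instances applies when running on a resource manager such as YARN or Kubernetes), executor resources can be requested through SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Executor Demo")
  .set("spark.executor.instances", "4")   // number of executors to launch
  .set("spark.executor.cores", "2")       // cores per executor
  .set("spark.executor.memory", "4g")     // heap memory per executor
val sc = SparkContext.getOrCreate(conf)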

7. Define broadcast variables.

Broadcast variables act as read-only variables cached on all executors. They reduce network I/O overhead because each executor keeps a local copy of the variable.

In a streaming application, when a lookup dataset has to be used across multiple executors for every event, broadcast variables come into the picture and reduce the network I/O of these otherwise costly operations.

A broadcast variable is a wrapper around the variable, and its value can be accessed through its value method. The code below depicts this:

scala> val broadcast_Var = sc.broadcast(Array(1, 2, 3, 4, 5))

broadcast_Var: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcast_Var.value

res0: Array[Int] = Array(1, 2, 3, 4, 5)
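
As an illustrative sketch (the lookup map and event RDD are made up), a broadcast variable can serve as a lookup table inside a transformation, so the map is shipped to each executor once instead of with every task:

val countryLookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val events = sc.parallelize(Seq("IN", "US", "IN"))
val enriched = events.map(code => countryLookup.value.getOrElse(code, "Unknown"))
enriched.collect()   // Array(India, United States, India)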

8. Define accumulators.

Accumulators are variables that can only be “added” to through an associative operation, which allows them to be supported efficiently in parallel. They are used to implement counters or sums.

scala> val accum_var = sc.longAccumulator(“My Accumulator variable”)

accum_var: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator variable), value: 0)

scala> sc.parallelize(Array(1, 2)).foreach(x => accum_var.add(x))

…INFO SparkContext: Tasks finished in 0.111106 s

scala> accum_var.value

res2: Long = 3

9. Define RDD Persistence.

Spark provides the ability to persist an RDD in memory across operations. It does so when the persist() or cache() method is called on the RDD.

There are different storage levels available for storing a persisted RDD, listed below; the cache() method, however, stores the RDD in memory only. A short persistence example follows the list.

MEMORY_ONLY — This is the default storage level. The RDD is stored as deserialized Java objects in the JVM. If the RDD doesn't fit in memory, some partitions will not be cached and will be recomputed each time they're needed.

MEMORY_AND_DISK — Similar to MEMORY_ONLY, but if the RDD doesn't fit in memory, the partitions that don't fit are stored on disk and read from there when they are needed.

MEMORY_ONLY_SER — It stores the RDD as serialized Java objects (i.e. one byte array per partition). This is more space-efficient than deserialized objects. It is available only in Java and Scala.

MEMORY_AND_DISK_SER — Similar to MEMORY_ONLY_SER, but partitions that don't fit in memory are spilled to disk instead of being recomputed.

DISK_ONLY — It stores the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. — Same as the levels above, but each partition is replicated on two cluster nodes.

OFF_HEAP — Similar to MEMORY_ONLY_SER, but the data is stored in off-heap memory. Off-heap memory must be enabled.
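
A minimal sketch of persisting with an explicit storage level (the file name is a placeholder); cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("input.txt")
val sparkLines = lines.filter(_.contains("spark"))

sparkLines.persist(StorageLevel.MEMORY_AND_DISK)
sparkLines.count()     // first action materializes and persists the partitions
sparkLines.count()     // later actions reuse the persisted partitions

sparkLines.unpersist() // release the storage when no longer needed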

These are questions I have come across in interviews, and I shall post more to help you.

Happy learning!!

See you in the next blog!

Originally published at https://itechshree.blogspot.com.
