1. What is the difference between groupByKey() and reduceByKey() in Spark?
groupByKey() works on a dataset of key-value pairs (K, V) and groups the data based on the key. A lot of shuffling occurs while grouping if the dataset is not partitioned.
val dataset = sc.parallelize(Array(('a', 5), ('b', 3), ('b', 4), ('c', 7)), 3)
val groupedDataset = dataset.groupByKey().collect()
reduceByKey() is equivalent to grouping plus aggregation. It combines dataset pairs with the same key on each machine before shuffling the data.
val words = Array("a", "b", "c", "d")
val combinedData = sc.parallelize(words).map(w => (w, 1)).reduceByKey((v, w) => v + w)
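Setting distribution aside, the logical difference between the two operations can be modeled with plain Scala collections (a sketch for illustration, not Spark's API):

```scala
// Plain-Scala model of the two operations' logical results (not Spark code).
object GroupVsReduceDemo extends App {
  val pairs = Seq(('a', 5), ('b', 3), ('b', 4), ('c', 7))

  // groupByKey: keeps every value per key, so in Spark all pairs
  // must cross the network before any aggregation can happen.
  val grouped = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  println(grouped)  // e.g. Map(a -> List(5), b -> List(3, 4), c -> List(7))

  // reduceByKey: values with the same key are combined, so in Spark
  // only one partial result per key needs to be shuffled.
  val reduced = pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
  println(reduced)  // e.g. Map(a -> 5, b -> 7, c -> 7)
}
```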
2. Define lineage graph and DAG in Spark?
Every RDD created in Spark depends on one or more parent RDDs; a new RDD holds pointers to its parents. The graph of these dependencies between RDDs, rather than the actual data, is known as the lineage graph.
A DAG is a combination of vertices and edges, where the vertices represent RDDs and the edges represent the operations applied to them.
3. What is the benefit of lineage graph?
Lineage graph information is used to recompute an RDD whenever needed. If a part of an RDD is lost for any reason, Spark uses the lineage graph to recompute it and continue processing the application.
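The recovery idea can be sketched with a toy model in plain Scala (the Lineage class and names below are hypothetical, purely for illustration, not Spark's implementation): each dataset remembers its parent and the transformation that produced it, so a lost result can always be rebuilt from the root.

```scala
// Toy model of lineage-based recovery (not Spark's actual API).
// Each node stores the function that derives it from its parent,
// so data can be recomputed instead of being checkpointed.
class Lineage[A](compute: () => Seq[A]) {
  def map[B](f: A => B): Lineage[B] =
    new Lineage(() => compute().map(f)) // child keeps a pointer to this computation
  def collect(): Seq[A] = compute()     // always re-derivable from the root
}

object LineageDemo extends App {
  val root    = new Lineage(() => Seq(1, 2, 3, 4))
  val doubled = root.map(_ * 2)         // nothing materialized yet
  // Even if an intermediate result were "lost", collect() replays the chain:
  println(doubled.collect())            // List(2, 4, 6, 8)
}
```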
4. What is the Catalyst optimizer?
It is a new addition to the Spark SQL framework. It allows Spark to automatically transform SQL queries so that they execute more efficiently, applying optimization techniques such as filtering, indexing, and ordering data source joins in the most efficient way.
5. Why is the Dataset API faster than the RDD API?
Spark Datasets do not use standard serializers; instead they use Encoders, which efficiently transform objects into Spark's internal binary format.
The RDD API, on the other hand, uses Java or Kryo serialization, so it is slower at simple grouping and aggregation operations.
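The size difference can be illustrated in plain Scala (a rough analogue only, not Spark's encoder code): generic Java serialization stores class metadata alongside every object, while a fixed binary layout, in the spirit of Encoders, stores only the field values.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Rough illustration (not Spark code): compare generic Java serialization
// of a small record with a hand-rolled compact binary layout.
case class Point(x: Int, y: Int)

object EncodingDemo extends App {
  val p = Point(3, 7)

  // Generic Java serialization: includes class descriptors, field names, etc.
  val baos = new ByteArrayOutputStream()
  val oos  = new ObjectOutputStream(baos)
  oos.writeObject(p); oos.close()
  val javaBytes = baos.toByteArray

  // Compact layout: just two 4-byte ints, like a fixed schema would allow.
  val compactBytes = ByteBuffer.allocate(8).putInt(p.x).putInt(p.y).array()

  println(s"Java serialization: ${javaBytes.length} bytes")
  println(s"Compact layout:     ${compactBytes.length} bytes") // 8 bytes
}
```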
6. What are the cluster managers available in Spark?
· Standalone Mode: By default, Spark provides a simple cluster manager called Standalone. It is easy to set up within the Spark distribution and is resilient in nature.
· Apache Mesos: Apache Mesos is an open-source, distributed cluster manager. It supports two-level scheduling. Its main advantages as a cluster manager are that it supports dynamic partitioning between Spark and other frameworks, as well as scalable partitioning between multiple instances of Spark.
· Hadoop YARN: It is the cluster resource manager of Hadoop 2, and it is compatible with Spark as well.
· Kubernetes: Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications, and a newer cluster manager option for Spark.
7. What is the benefit of using broadcast variables in Spark?
Broadcast variables are read-only variables cached on each machine. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. This eliminates the need to ship a copy of the variable with each task, so data can be processed faster.
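The idea can be sketched without a cluster (plain Scala, hypothetical names, for illustration only): instead of handing every task its own copy of a lookup table, all tasks read one shared, immutable copy, analogous to calling sc.broadcast(lookup) once and reading bc.value inside tasks in Spark.

```scala
// Toy illustration of the broadcast idea (not Spark's implementation):
// many tasks share one read-only lookup table instead of each task
// shipping its own copy over the network.
object BroadcastDemo extends App {
  val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3) // "broadcast" once

  val tasks = Seq("a", "b", "c")

  // Every simulated task references the same single, immutable copy.
  val refs = tasks.map(_ => lookup)
  assert(refs.forall(_ eq lookup))               // no per-task copies

  val results = tasks.map(key => lookup(key))
  println(results)                               // List(1, 2, 3)
}
```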
8. What is the use of accumulators in Spark?
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel processing. They are used to implement counters (as in MapReduce) or sums across the cluster a Spark program runs on.
val acc = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => acc += x)
9. Is there any benefit of MapReduce in comparison with Spark?
Yes. MapReduce is a programming framework used by many big data tools as data grows bigger and bigger. Most big data tools, like Pig and Hive, convert their queries into Map and Reduce phases to optimize them better.
10. What is the use of Spark SQL over HQL and SQL?
Spark SQL is a part of the Spark Core engine that supports both SQL and Hive Query Language using their existing syntax. We can also join SQL tables and HQL tables using Spark SQL.
Hope you enjoy my blog!!
Originally published at https://itechshree.blogspot.com.