First, create an Azure free account and try this simple pipeline.

To create a pipeline that moves a file from one blob container to another, we need two resources in Microsoft Azure.

Azure storage account creation, step by step:

Provide the necessary details as per your requirements and keep clicking the Next button, leaving all default settings in place.


Spark Out Of Memory Error

OOM error at the Spark driver level:

1. The Spark driver is the main controller of a Spark application. If it is configured with too little memory to collect all the data from the files, it throws an OOM error.

2. If the table to be broadcast is very large, the driver also runs into an OOM error.

OOM error at the Spark executor level:

1. A Spark job is executed through one or more stages, and each stage consists of multiple tasks. The number of tasks that run concurrently on an executor depends on the spark.executor.cores property. If it is set to a high value without considering the memory each task requires, the Spark job fails with an OOM error.
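The trade-off between spark.executor.cores and memory can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not Spark code: it assumes, as a simplification, that executor memory is shared equally among concurrently running tasks, and the 8 GB executor size and the object and method names are hypothetical.

```scala
// Illustrative estimate only; real Spark memory management is more involved.
object ExecutorMemorySketch {
  // Assumption: each concurrently running task gets an equal share of executor memory.
  def memoryPerTaskMb(executorMemoryMb: Long, executorCores: Int): Long = {
    require(executorCores > 0, "executorCores must be positive")
    executorMemoryMb / executorCores
  }

  def main(args: Array[String]): Unit = {
    val executorMemoryMb = 8192L // hypothetical spark.executor.memory of 8g
    for (cores <- Seq(2, 4, 8)) {
      val share = memoryPerTaskMb(executorMemoryMb, cores)
      println(s"spark.executor.cores=$cores gives roughly $share MB per task")
    }
  }
}
```

Under this assumption, doubling the core count halves the memory available to each task, which is why raising spark.executor.cores without raising executor memory can push tasks into OOM errors.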


1. Difference between coalesce and repartition?

Repartition is used to increase or decrease the number of partitions. It produces roughly equal-sized partitions and causes a lot of shuffling.

Coalesce is used to decrease the number of partitions. It reuses existing partitions, minimizing the amount of data that is shuffled.

2. Advantages of the Parquet file format in Spark?

Parquet is natively supported by Spark, and Parquet with Snappy compression is the best-optimized format for Spark applications. It …
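A minimal sketch of the difference, assuming Spark is on the classpath and a local two-thread session is acceptable for illustration. The object name, app name, and partition counts are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  // Returns partition counts: original, repartition down, repartition up,
  // coalesce down, and coalesce "up" (which does not increase without a shuffle).
  def partitionCounts(spark: SparkSession): Seq[Int] = {
    val rdd = spark.sparkContext.parallelize(1 to 100, 8)
    Seq(
      rdd.getNumPartitions,
      rdd.repartition(4).getNumPartitions,  // full shuffle into fewer partitions
      rdd.repartition(16).getNumPartitions, // full shuffle into more partitions
      rdd.coalesce(2).getNumPartitions,     // merges existing partitions, no full shuffle
      rdd.coalesce(16).getNumPartitions     // stays at the original count
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[2]")
      .getOrCreate()
    partitionCounts(spark).foreach(println)
    spark.stop()
  }
}
```

By default, coalesce cannot raise the partition count because it avoids a shuffle; repartition can go in either direction at the cost of moving data across the cluster.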


1. Difference between groupByKey() and reduceByKey() in Spark?

groupByKey() works on a dataset of key-value pairs (K, V) and groups the data by key. A lot of shuffling occurs during grouping if the dataset is not already partitioned by key.

val dataset = sc.parallelize(Array(('a', 5), ('b', 3), ('b', 4), ('c', 7)), 3)

val groupedDataset = dataset.groupByKey().collect()

groupedDataset.foreach(println)

reduceByKey() is equivalent to grouping plus aggregation. It combines the values for each key on the same machine before shuffling, so less data moves across the network.

val words = Array("a", "b", "c", "d")

val combinedData = sc.parallelize(words).map(w => (w, 1)).reduceByKey((v, w) => v + w)

combinedData.collect().foreach(println)

2. Define lineage graph and DAG in Spark?

Every RDD created in Spark depends on one or more other RDDs; the new RDD contains a pointer to its parent…


Are you worried about getting older?

Here are some tips you must follow.
1. Don't lie straight while sleeping. Always lie on your back; otherwise it will obstruct blood circulation and cause skin sagging.

2. Always eat plenty of green leafy vegetables and lots of fruit. Avoid fried chips and oily food as much as possible. Drink fruit juice, almond milk, turmeric milk, etc. Turmeric penetrates the skin and makes it glow faster. Almonds and walnuts contain ingredients that give glowing skin.

3. Try to spend as little time as possible in air conditioning. If one has to stay in…


Lip balm moisturizes our lips and improves their skin tone by enhancing blood circulation. Because of the preservatives added to the lip balms available in the market, they are not beneficial to the skin tone of our lips.

So let's prepare a lip balm at home that you can use all the time.

Ingredients:

1. One beetroot

2. 1 tablespoon glycerin

3. 1/2 tablespoon coconut oil

Method:


HBase is a distributed NoSQL database built on top of HDFS (Hadoop Distributed File System).

It is modeled on Google's Bigtable and stores huge volumes of structured or unstructured data by column rather than by row, providing consistent read and write access. This makes HBase suitable for high-speed requirements.

Data representation in an HBase table:

An HBase table is divided into rows, column families, columns, and cells. …
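The row, column family, column, and cell hierarchy can be pictured as nested maps. The sketch below is only a mental model in plain Scala, not the HBase client API; the object name and sample data are hypothetical, and it leaves out the timestamp versions that real HBase cells carry.

```scala
object HBaseModelSketch {
  // column family -> (column qualifier -> cell value)
  type Row = Map[String, Map[String, String]]
  // row key -> row
  type Table = Map[String, Row]

  // Look up a single cell by row key, column family, and column qualifier.
  def getCell(table: Table, rowKey: String, family: String, qualifier: String): Option[String] =
    for {
      row <- table.get(rowKey)
      columns <- row.get(family)
      value <- columns.get(qualifier)
    } yield value

  // Hypothetical sample data for illustration.
  val users: Table = Map(
    "user1" -> Map(
      "info" -> Map("name" -> "Asha", "city" -> "Pune"),
      "stats" -> Map("logins" -> "42")
    ),
    "user2" -> Map(
      "info" -> Map("name" -> "Ravi") // rows need not share the same columns
    )
  )

  def main(args: Array[String]): Unit = {
    println(getCell(users, "user1", "info", "name"))
    println(getCell(users, "user2", "info", "city"))
  }
}
```

A missing column simply has no entry for that row, which is why this kind of layout suits sparse data.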


A NoSQL database, also called a non-SQL or non-relational database, provides a way to store and retrieve data modeled in a non-tabular format.

Why NoSQL?

A traditional database system prefers predictable, structured data and has dominated the database industry for years. Nowadays, as businesses grow and social media dominates, there is a need to:

1. Support a large number of concurrent users

2. Handle huge amounts of semi-structured as well as unstructured data

3. Provide a highly available system without any downtime

4. Handle huge volumes of data insertion and population

Hence, relational databases are unable to meet…


Vijayadashami, also known as Dussehra, is one of the major Hindu festivals and is celebrated every year at the end of Navratri. It marks the end of Durga Puja and Ramlila. The Navratri festival is associated with the famous battle between Maa Durga and the buffalo demon Mahishasura, which lasted for nine days. The war ended on the tenth day with the elimination of Mahishasura. Each day of Navratri is dedicated to one of the nine avatars of Maa Durga.

Debashree Gorai

Information and Technology Analyst | Big Data Developer | Spark | Scala
