Effects of partitionBy(HashPartitioner) on a cached RDD when few values are modified - scala

I would like to know what happens when I cache an RDD and then get a new RDD by modifying a limited number of values.
rdd.cache
val rdd2 = rdd.map(x=>if(cond) partitionValue else x)
Is the part of the RDD that hasn't been touched still in the cache when I use rdd2?
Moreover, I need to update the partition that holds the modified values, so I do:
val rdd2bis = rdd2.partitionBy(new HashPartitioner(nbPart))
And I would like to iterate this process for each data point:
Find in which partition should go one value.
Modify my value and put it in the right partition using partitionBy.
So my main question is: does partitionBy keep the output RDD in memory if only a few members have been modified?
I know that partitionBy returns a new RDD, but is there any chance that some of the unmodified cached values are still in the cache for the generated RDD?

I would like to know what happens when I cache an RDD and then get a new RDD by modifying a limited number of values.
If you literally modify mutable objects in place, you'll end up with a program that is incorrect and nondeterministic.
Is the part of the RDD that hasn't been touched still in the cache when I use rdd2?
If you map without modifying existing objects it won't affect cached data at all. rdd should be cached as it was (unless evicted due to memory issues), rdd2 won't be. Nevertheless data is not copied so "unchanged" records in rdd2 reference the same objects as rdd.
Does partitionBy keep the output RDD in memory if only a few members have been modified?
No. partitionBy requires the standard shuffle mechanism, so its output is a newly computed RDD that is not cached unless you persist it yourself. Once again, it doesn't affect the cached state of rdd.
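To see why changing a key can send a record to a different partition, here is a minimal plain-Scala sketch of the rule that Spark's HashPartitioner applies: the key's hashCode modulo the number of partitions, made non-negative, with null keys going to partition 0. The object and method names are hypothetical and only for illustration; this is not Spark's own code.

```scala
object HashPartitionSketch {
  // Illustrative helper (hypothetical name): mirrors the hashCode-modulo rule
  // used by HashPartitioner. Null keys are assigned to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k =>
      val raw = k.hashCode % numPartitions
      // The % operator can return a negative value for negative hash codes
      if (raw < 0) raw + numPartitions else raw
  }
}
```

With 4 partitions, HashPartitionSketch.partitionFor(5, 4) is 1, while changing the key to 6 gives 2. The record has to move, which is why partitionBy goes through a shuffle. If the repartitioned RDD is reused across iterations, it is the result of partitionBy that would need its own persist call.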

Related

Importance of the caching [Spark]

Could you please explain why the following happens?
I have a .csv file with some data (about 25 million rows).
I'm doing the following:
val RDD1 = sc.textFile(...).map(...).aggregateByKey(...).
mapValues(...).persist(StorageLevel.MEMORY_AND_DISK)
Then I'm doing with RDD1 some more transformations:
val RDD2 = RDD1.zipWithIndex(...).cartesian(...).filter(...).map(...)
At this point there are about 14 million elements in RDD2, and for each of them I'm doing some calculations.
Finally I'm writing the result into files:
RDD2.map(...).saveAsTextFile(...)
It seems to be working. But if I don't use the persist() method, I get various errors such as GC errors, heartbeat timeout errors and so on.
I thought that caching is essential only if I use RDD1 many times, so that I don't have to evaluate it a second time. But I'm using RDD1 only once, to create RDD2. Why is this happening?
Thank you in advance!
PS: I'm attaching my code in case what I described above is not enough to detect the problem.
I thought that caching is essential only if I use RDD1 many times, so that I don't have to evaluate it a second time
It is recommended to use persist or cache any time we want to reuse an RDD (no matter how many times).
In the example from your question:
val RDD2 = RDD1.zipWithIndex(...).cartesian(...).filter(...).map(...)
Transformations are lazy, so when an action finally runs on RDD2, RDD1 is evaluated as part of that job; without persist, it is re-evaluated every time its partitions are requested.
From the docs:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

Spark Dataframe performance for overwrite

Is there any performance difference or considerations between the following two pyspark statements:
df5 = df5.drop("Ratings")
and
df6 = df5.drop("Ratings")
I'm not specifically asking about the drop function, but about any operation. I was wondering what happens under the hood when you overwrite a variable compared to creating a new one.
Also, are the behavior and performance considerations the same if this were an RDD and not a DataFrame?
No, there won't be any difference in the operation.
In NumPy, an array has a flags attribute that shows, among other things, whether it owns its data:
variable_name.flags
In PySpark, a DataFrame is immutable and every transformation creates a new DataFrame. How can that be cheap? A DataFrame is a lazily evaluated description of a distributed computation, not a copy of the data, so creating a new one does not move any data around. Whether you bind the result to the old variable name or to a new one only changes which Python name refers to it.
The same holds for RDDs, although DataFrames usually perform better than RDDs. Here is a good blog post:
Dataframe RDD and dataset

Scala - Write data to file with row limit

I have an RDD with 30 million rows of data. Is there a way to save it into files of 1 million rows each?
I think there is no direct way of doing it. One thing you can do is collect() your RDD, get an iterator from the result, and save it with the normal file-writing facilities that Scala provides. Something like this:
val arrayValue = yourRdd.collect();
//Iterate the array and put it in file if it reaches the limit .
Note: This approach is not recommended if your data size is huge, because collect() will bring all the records of the RDD to the driver (master).
You can do rdd.repartition(30). This will ensure that your data is about equally distributed over 30 partitions, which should give you partitions of roughly 1 million rows each.
Then you simply call rdd.saveAsTextFile(<path>) and Spark will create as many files as there are partitions under <path>. Or, if you want more control over how and where your data is saved, you can call rdd.foreachPartition(f: Iterator[T] => Unit) and handle the logic of dealing with rows and saving them as you see fit within the function f passed to foreachPartition. (Note that foreachPartition runs on each of your executor nodes and does not bring the data back to the driver, which of course is desirable.)
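If an exact row limit per file matters more than "roughly 1 million", the function passed to foreachPartition could chunk its iterator itself. Below is a minimal plain-Scala sketch of such a helper; the names ChunkedWriter and writeInChunks and the file naming scheme are assumptions made for illustration, and it writes to the local file system of whichever machine runs it.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Path}

object ChunkedWriter {
  // Hypothetical helper: writes `rows` into files named <prefix>-00000.txt,
  // <prefix>-00001.txt, ... with at most `maxRows` lines each.
  // Returns the number of files written.
  def writeInChunks(rows: Iterator[String], dir: Path, prefix: String, maxRows: Int): Int = {
    require(maxRows > 0, "maxRows must be positive")
    Files.createDirectories(dir)
    var fileCount = 0
    // grouped() consumes the iterator lazily, so only one chunk is held in memory at a time
    rows.grouped(maxRows).foreach { chunk =>
      val target = dir.resolve(f"$prefix-$fileCount%05d.txt")
      val out = new PrintWriter(target.toFile, "UTF-8")
      try chunk.foreach(line => out.println(line)) finally out.close()
      fileCount += 1
    }
    fileCount
  }
}
```

Inside foreachPartition the prefix would need to be unique per partition (for example by including the partition id) so that executors do not overwrite each other's files, and the last file of each partition may hold fewer than maxRows rows.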

How to access broadcasted DataFrame in Spark

I have created two dataframes from Hive tables (PC_ITM and ITEM_SELL). They are big, and I use them frequently in SQL queries by registering them as tables. Because they are big, it takes a long time to get the query result. So I saved them as parquet files, read them back and registered them as temporary tables. I still did not get good performance, so I broadcast those dataframes and then registered them as tables, as below.
PC_ITM_DF=sqlContext.parquetFile("path")
val PC_ITM_BC=sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1=PC_ITM_BC
PC_ITM_DF1.registerAsTempTable("PC_ITM")
ITM_SELL_DF=sqlContext.parquetFile("path")
val ITM_SELL_BC=sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1=ITM_SELL_BC.value
ITM_SELL_DF1.registerAsTempTable(ITM_SELL)
sqlContext.sql("JOIN Query").show
But I still can't get better performance; it takes the same time as when the dataframes are not broadcast.
Can anyone tell me if this is the right approach to broadcasting and using them?
You don't really need to 'access' the broadcast dataframe - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense than the sc.broadcast approach.
It can be hard to understand where the time is being spent if you evaluate everything at once.
You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.
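// load your dataframe
val PC_ITM_DF = sqlContext.parquetFile("path")
// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()
// mark this dataframe to be broadcast; broadcast (from org.apache.spark.sql.functions)
// returns a new, hinted dataframe, so keep the result and use it from here on
val PC_ITM_BC = broadcast(PC_ITM_DF)
// register the hinted dataframe as a temp table for the SQL query
PC_ITM_BC.registerTempTable("PC_ITM")
// perform an action to force the evaluation
PC_ITM_DF.count()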
Doing this will ensure that the dataframe is
loaded in memory (persist)
registered as temp table for use in your SQL query
marked as broadcast, so will be shipped to all executors
When you now run sqlContext.sql("JOIN Query").show you should now see a 'broadcast hash join' in the SQL tab of your Spark UI.
I would cache the rdds in memory. The next time they are needed, spark will read the RDD from memory rather than generating the RDD from scratch each time. Here is a link to the quick start docs.
val PC_ITM_DF = sqlContext.parquetFile("path")
PC_ITM_DF.cache()
PC_ITM_DF.registerTempTable("PC_ITM")
val ITM_SELL_DF = sqlContext.parquetFile("path")
ITM_SELL_DF.cache()
ITM_SELL_DF.registerTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). There are a few levels of persistence you can choose from in case your data is too big for memory-only persistence. Here is a list of persistence options. If you want to manually remove the RDD from the cache, you can call rdd.unpersist().
If you prefer to broadcast the data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
At the moment you cannot access a broadcast data frame in a SQL query. You can use a broadcast data frame only through the data frame API.
Refer: https://issues.apache.org/jira/browse/SPARK-16475

(Why) do we need to call cache or persist on a RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is a RDD and is available in all/some of the node's memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
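The behavior described above can be mimicked with plain Scala collections. This is only an analogy, not Spark code, and all names in it are made up for illustration: a lazy view plays the role of an uncached RDD (every traversal redoes the work), and a lazy val built from it plays the role of a cached RDD (nothing happens when it is declared, the first use materializes it, later uses reuse the result).

```scala
object LazyCacheAnalogy {
  // Counts how many lines have been "read from the file"
  var linesRead = 0

  // Stand-in for the contents of the file (illustrative data)
  private val source = Vector("a b", "c d e", "f")

  // Like an uncached RDD: a description of work, redone on every traversal
  val uncached = source.view.map { line => linesRead += 1; line }

  // Like textFile.cache: declaring it computes nothing; the first use
  // materializes the data once, and later uses reuse it
  lazy val cached = uncached.toVector
}
```

Traversing LazyCacheAnalogy.uncached twice increases linesRead by the number of lines each time, whereas using LazyCacheAnalogy.cached repeatedly reads the lines only once.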
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To quickly answer the question: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD, wordsRDD, that contains a reference to textFile and a function to be applied when needed.
Only when an action such as wordsRDD.count is called on an RDD will the RDD chain, called the lineage, be executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into a count for positive and negative words. You could do this like that:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
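For that reason, cache is sometimes loosely said to 'break the lineage', as it creates a reusable snapshot for further processing. Strictly speaking the lineage is kept, so lost cached partitions can still be recomputed; only checkpoint actually truncates it.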
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, but only if needed.
Is the RDD data stored in a distributed way in the memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations in which you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Here is another reason to add (or temporarily add) a cache call:
debugging memory issues
With cache, Spark reports debugging information about the size of the RDD, so the Spark web UI will show the RDD's memory consumption. This has proved very helpful for diagnosing memory issues.