I need to use localCheckpoint() to resize shuffle partitions, but I am unsure how to use it properly. I searched GitHub and Stack Overflow and found nearly everyone repeating this block of code (with no further context, situation, or reasoning), which comes from this Databricks presentation: https://www.youtube.com/watch?v=daXEp4HmS-E
How do I use this properly? For instance, do I:
df = df.localCheckpoint()
Under what conditions would I repartition? Is specifying the write necessary?
Let's say I have specified my number of shuffle partitions like so:
spark.conf.set("spark.sql.shuffle.partitions", 1200);
# bunch of work here
# local checkpoint here
I have more work to do, but at a different shuffle scale. Can I then set my shuffle partitions again, like so?
spark.conf.set("spark.sql.shuffle.partitions", 400);
# more work here
# local checkpoint again
Finally, write to disk:
How should localCheckpoint() be used in work that we have to chain together at different shuffle scales?
Also, how can I use two localCheckpoint() operations? If I run one localCheckpoint() on df and then another on df2, and later return to yet another localCheckpoint() on df, I get:
Checkpoint block rdd_322_17 not found! Either the executor that originally checkpointed this partition is no longer alive, or the original RDD is unpersisted


I am reading 64 compressed CSV files (probably 70-80 GB) into one Dask dataframe and then running a groupby with aggregations.
The job never completes because apparently the groupby creates a dataframe with only one partition.
This post and this post already addressed this issue, but they focus on the computational graph and not on the memory problem you run into when the resulting dataframe is too large.
I tried a workaround with repartitioning, but the job still won't complete.
What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expected Dask to take care of partitioning everything, even after aggregation operations.
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB',diagnostics_port=5000)
dB3 = dd.read_csv("boden/expansion*.csv", # read in parallel
blocksize=None, # 64 files
aggs = {
'boden': ['count','min']
with ProgressBar(dt=30): dBSelect.compute().to_parquet('boden/final/boden_final.parq',compression=None)
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.
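To make the idea behind split_out= more concrete, here is a minimal plain-Python sketch (not Dask's actual implementation; the function names and the sample data are made up for illustration). Each input partition computes partial results per key, and each key is then routed by hash to one of split_out output partitions, so no single output partition has to hold every group:

```python
from collections import defaultdict

def partial_sums(rows):
    """Per-input-partition step: accumulate (sum, count) for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc

def grouped_mean(partitions, split_out):
    """Combine partial results into `split_out` output partitions."""
    outputs = [defaultdict(lambda: [0.0, 0]) for _ in range(split_out)]
    for rows in partitions:
        for key, (total, n) in partial_sums(rows).items():
            # Route each key to exactly one output partition by its hash.
            bucket = outputs[hash(key) % split_out]
            bucket[key][0] += total
            bucket[key][1] += n
    return [{k: total / n for k, (total, n) in out.items()} for out in outputs]

# Hypothetical data: two input partitions of (key, value) rows.
input_partitions = [
    [(1, 10.0), (2, 20.0), (3, 30.0)],
    [(1, 30.0), (2, 40.0), (4, 50.0)],
]
result = grouped_mean(input_partitions, split_out=2)
```

Every key ends up in exactly one of the output partitions, which is why the extra routing step (and the larger task graph) is needed.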

I have a small Scala program that runs fine on a single node. However, I am scaling it out so that it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based on theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count, your sequence of operations now has to be executed, because the result of count is a number. What each worker core (executor thread) will then do in parallel is read a file (or a piece of a file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
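The same laziness can be imitated with plain Python generators. This is only an analogy under my own assumptions (the read_lines helper and the in-memory "file" are hypothetical), not Spark code, but it shows that defining the pipeline reads nothing and that counting pulls one line at a time:

```python
import io

def read_lines(stream):
    """Yield lines lazily, one at a time."""
    for line in stream:
        yield line.rstrip("\n")

# Stand-in for a file on disk (hypothetical content).
source = io.StringIO("error: disk full\ninfo: ok\nerror: timeout\n")

lines = read_lines(source)                                    # nothing is read yet
errors = (line for line in lines if line.startswith("error"))  # still nothing is read
position_before = source.tell()                               # 0: the source is untouched

# The "action": only now are the lines read, one at a time, and counted.
error_count = sum(1 for _ in errors)
```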
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work. The code will try to send the rdd variable to all workers and will fail with a NotSerializableException. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works because count is a Long and can be serialized.
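As a rough plain-Python picture of why the count is 100 and not 10 (lists stand in for partitions here; this is an illustration, not Spark code): the count is computed once across all partitions, and the resulting plain number is what gets used inside the map.

```python
# 10 "nodes" holding 10 elements each (hypothetical data).
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(10)]

# The "action": each partition reports its own count, the driver adds them up.
per_partition_counts = [len(p) for p in partitions]
count = sum(per_partition_counts)

# The plain number is captured by the function applied to every partition.
normalized = [[x / count for x in p] for p in partitions]
```

Each partition divides by the global count, never by its local size.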
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). The intention is not to split every file across all available nodes.
However, for you (i.e. the client), working with a file through Spark should be transparent: you should not see any difference in rdd.size, no matter how many nodes it is split across and/or replicated to. There are methods (at least in Hadoop) to find out on which nodes (parts of) the file are located at the moment. However, in simple cases you most probably won't need this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).
I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0, so that it would split a partition such that the resulting partitions' locations were all on the same node (so little network IO).
This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but this is such a simple and common thing to do that surely there must be a straightforward way of doing it?
Things tried:
.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND is not really what I want: I want the partition number to change across all types of jobs, not just shuffles.
Watch this space
This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed to 10? Because having 10 but still using 5 does not make much sense… The process of sending data to new partitions has to happen sometime.
When doing coalesce, you can get rid of unused partitions. For example, if you initially had 100, but after reduceByKey you got 10 (as there were only 10 keys), you can call coalesce.
If you want the process to go the other way, you could just force some kind of partitioning:
[RDD].partitionBy(new HashPartitioner(100))
I'm not sure that's what you're looking for, but hope so.
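For illustration, this is roughly what hash partitioning does to key-value pairs, written as a plain-Python sketch (the hash_partition helper is hypothetical, and Python's hash() stands in for the JVM hashCode): each pair goes to partition hash(key) mod numPartitions, which means every record has to be routed, i.e. shuffled.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to partition hash(key) mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Hypothetical data: 20 integer keys spread over 4 partitions.
pairs = [(k, k * k) for k in range(20)]
parts = hash_partition(pairs, 4)
```

With evenly spread keys the partitions come out balanced; with skewed keys they do not, since all records with the same key land in the same partition.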
As you know, PySpark evaluates lazily: it only does the computation when there is an action to perform (for example df.count() or df.show()). So what you can do is set the number of shuffle partitions between those actions.
You can write:
from pyspark.sql import functions as F

sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count() # your action
df = df.filter(df.B < 10)
df.count() # another action; do not reassign df here, count() returns an int
sparkSession.sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count() # your action

I have a text file consisting of a large number of random floating values separated by spaces.
I am loading this file into a RDD in scala.
How does this RDD get partitioned?
Also, is there any method to generate custom partitions such that all partitions have equal number of elements along with an index for each partition?
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val keyval = dRDD.map(x => process(x.trim().split(' ').map(_.toDouble), query_norm, m, r))
Here I am loading multiple text files from HDFS and process is a function I am calling.
Can I have a solution with mapPartitionsWithIndex, and how can I access that index inside the process function? Map shuffles the partitions.
How does an RDD get partitioned?
By default a partition is created for each HDFS block, which by default is 64 MB (128 MB in Hadoop 2 and later). Read more here.
How to balance my data across partitions?
First, take a look at the three ways one can repartition data:
1) Pass a second parameter, the desired minimum number of partitions for your RDD, into textFile(), but be careful:
In [14]: lines = sc.textFile("data")
In [15]: lines.getNumPartitions()
Out[15]: 1000
In [16]: lines = sc.textFile("data", 500)
In [17]: lines.getNumPartitions()
Out[17]: 1434
In [18]: lines = sc.textFile("data", 5000)
In [19]: lines.getNumPartitions()
Out[19]: 5926
As you can see, [16] doesn't do what one would expect, since the number of partitions the RDD has is already greater than the minimum number of partitions we request.
2) Use repartition(), like this:
In [22]: lines = lines.repartition(10)
In [23]: lines.getNumPartitions()
Out[23]: 10
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs:
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
3) Use coalesce(), like this:
In [25]: lines = lines.coalesce(2)
In [26]: lines.getNumPartitions()
Out[26]: 2
Here, Spark knows that you will shrink the RDD and takes advantage of it. Read more about repartition() vs coalesce().
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I experienced in How to balance my data across the partitions?
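A simplified plain-Python picture of the difference (lists stand in for partitions; both helpers are hypothetical and much cruder than what Spark really does): coalesce glues whole neighbouring partitions together without looking at individual records, while repartition redistributes every record.

```python
def coalesce(partitions, n):
    """Merge whole neighbouring partitions into n groups; records are not re-routed."""
    total = len(partitions)
    return [
        [rec for p in partitions[i * total // n:(i + 1) * total // n] for rec in p]
        for i in range(n)
    ]

def repartition(partitions, n):
    """Full shuffle: every record is sent round-robin to one of n new partitions."""
    out = [[] for _ in range(n)]
    all_records = (rec for p in partitions for rec in p)
    for i, rec in enumerate(all_records):
        out[i % n].append(rec)
    return out

# Hypothetical, unevenly filled partitions.
parts = [[1, 2], [3], [4, 5, 6], [7]]
merged = coalesce(parts, 2)       # [[1, 2, 3], [4, 5, 6, 7]]
shuffled = repartition(parts, 2)  # [[1, 3, 5, 7], [2, 4, 6]]
```

Because coalesce only merges existing partitions, any skew in the input partitions carries over to the result, which is one reason the data is not guaranteed to end up balanced.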
The loaded RDD is partitioned according to its input splits; hash partitioning is only the default for key-value RDDs that go through a shuffle. To specify a custom partitioner, you can use rdd.partitionBy() on a key-value RDD, provided with your own partitioner.
I don't think it's OK to use coalesce() here: according to the API docs, coalesce() (without a shuffle) can only be used when we reduce the number of partitions, and we can't specify a custom partitioner with coalesce() either.
You can generate custom partitions using the coalesce function:
coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

In Spark, the groupByKey function transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.
Yet, is this function stable? I.e., is the order in the iterable preserved from the original order?
For example, if I originally read a file of the form:
Could my iterable for K1 be (V12, V11) (thus not preserving the original order), or can it only be (V11, V12) (thus preserving the original order)?
No, the order is not preserved. Example in spark-shell:
scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))
The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)
What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end up in the same partition. Then it performs the aggregation locally on each partition.
The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list in groupByKey.
(Data pulled from the worker itself is of course fastest. Since there is no network transfer involved here, this data is pulled synchronously, and thus arrives in order. (It seems to, at least.) So to replicate my experiment you need at least 2 Spark workers.)
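The effect can be mimicked in plain Python (a toy model with hypothetical names, not Spark's shuffle code): if the grouped values are simply appended in the order in which the shuffle blocks arrive, then two runs with different arrival orders give differently ordered iterables for the same key.

```python
def group_by_key(blocks_in_arrival_order):
    """Build key -> values by appending shuffle blocks in the order they arrive."""
    grouped = {}
    for block in blocks_in_arrival_order:
        for key, value in block:
            grouped.setdefault(key, []).append(value)
    return grouped

# Two map-side blocks for the same key, as in the spark-shell example above.
block_a = [(0, 1)]
block_b = [(0, 2)]

run_1 = group_by_key([block_a, block_b])  # block_a arrived first -> {0: [1, 2]}
run_2 = group_by_key([block_b, block_a])  # block_b arrived first -> {0: [2, 1]}
```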
Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html
Spark (and other map reduce frameworks) sort data by partitioning, and then merging. Since a merge sort is a stable operation I would guess that the result is stable. After looking more into the source I found that if spark.shuffle.spill is true it uses an external sort, merge sort in this case, which is stable. I'm not 100% sure what it does if it's allowed to spill to disk.
From source:
private val externalSorting = SparkEnv.get.conf.getBoolean("spark.shuffle.spill", true)
Partitioning is also a stable operation because it does no reordering.