Scala - sort RDD partitions

Assume I have an RDD of Integers from 1 to 1,000,000,000 and I want to print them in order using foreachPartition. There might be a situation where the partition containing 5-6-7-8 is printed before the one containing 1-2-3-4. How can I prevent this?
Thanks,
Maya

I think the only way to do this would be to ensure there is only one partition, and then you could print your data. You can call repartition(1) or coalesce(1) on your RDD to reduce the number of partitions. For your use case I think coalesce is better as it avoids a shuffle.
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
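A minimal sketch of that suggestion, using a much smaller range for illustration (the app name and master are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sorted-print").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000)   // already ordered within the RDD

// Collapse to a single partition without a shuffle, then print.
// In practice this preserves the order, but as the next question
// discusses, coalesce makes no contractual guarantee about it.
rdd.coalesce(1).foreachPartition(_.foreach(println))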

Related

PySpark - Does Coalesce(1) Retain the Order of Range Partitioning?

Looking at the Spark UI and the physical plan, I found that orderBy is accomplished by Exchange rangepartitioning(col#0000 ASC NULLS FIRST, 200) and then Sort [col#0000 ASC NULLS FIRST], true, 0.
From what I understand, rangepartitioning defines minimum and maximum boundaries for each partition and routes rows whose column values fall within those boundaries into that partition, so that sorting within each partition achieves a global ordering.
But now I have 200 partitions and I want to output a single CSV file. If I do a repartition(1), Spark triggers a shuffle and the ordering is gone. However, I tried coalesce(1) and it retained the global ordering. Yet I don't know whether that was pure luck, since coalesce's contract does not promise to preserve the ordering of the partitions it merges. Does anyone know how to get down to one partition while keeping the ordering after rangepartitioning? Thanks a lot.
As you state yourself, maintaining order is not part of the coalesce API contract, so you have to choose:
collect the ordered dataframe as a list of Row instances and write the CSV outside Spark (a sketch of this option follows below), or
write the partitions to individual CSV files with Spark and concatenate them with another tool, e.g. hdfs dfs -getmerge on the command line.
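A minimal sketch of the first option, assuming the ordered dataframe (df and the column name col are illustrative) fits in driver memory:

import java.io.PrintWriter

// collect() preserves the global order of the sorted dataframe.
val rows = df.orderBy("col").collect()

// Write the CSV outside Spark, so no repartitioning can disturb the order.
val writer = new PrintWriter("output.csv")
try rows.foreach(r => writer.println(r.mkString(",")))
finally writer.close()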

RDD persist mechanism (what happens when I persist an RDD and then use take(10) instead of count())

What happens when I persist an RDD and then use take(10) instead of count()?
I have read some comments saying that if I use take() instead of count(), only some of the partitions might be persisted, not all of them.
But if my dataset is big enough, using count() is very time-consuming.
Is there any other action operator I can use to trigger persisting all of the partitions?
foreachPartition is an action operator and it needs data from all partitions; can I use it after persist? (See the sketch after the example below.)
Need your help ~
Ex:
val rdd1 = sc.textFile("src/main/resources/").persist()
rdd1.foreachPartition(partition => partition.take(1))
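A hedged sketch of the idea in the question: foreachPartition launches a task on every partition, and under caching, computing a partition should materialize it into the cache, so a no-op pass ought to persist everything. (Note that take(1) on the partition iterator is lazy and consumes nothing, so an empty body does just as well.)

val rdd1 = sc.textFile("src/main/resources/").persist()

// One task per partition; each partition is computed and cached as a side effect.
rdd1.foreachPartition(_ => ())

// Later actions should now read from the cache instead of re-reading the files.
println(rdd1.count())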

Does Spark handle data shuffling?

I have an input A which I convert into an RDD X spread across the cluster.
I perform certain operations on it.
Then I do .repartition(1) on the output RDD.
Will my output RDD be in the same order as input A?
Does Spark handle this automatically? If yes, then how?
The documentation doesn't guarantee that order will be kept, so you can assume it won't be. If you look at the implementation, you'll see it certainly won't be (unless your original RDD already has one partition for some reason): repartition calls coalesce(shuffle = true), which:
Distributes elements evenly across output partitions, starting from a random partition.
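If the order of A matters, one hedged workaround (rddX is illustrative) is to tag each element with its position before any shuffle and sort by that tag at the end:

// Tag each element with its original index: (element, index).
val indexed = rddX.zipWithIndex()

// ... transformations that carry the index along ...

// Restore the original order explicitly, then collapse to one partition.
val ordered = indexed.sortBy(_._2).map(_._1).coalesce(1)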

Split an RDD into multiple RDDs

I have a pair RDD[String, String] where the key is a string and the value is HTML. I want to split this RDD into n RDDs based on the n keys and store them in HDFS.
htmlRDD = [(key1, html),
           (key2, html),
           (key3, html),
           (key4, html),
           ........]
I want to split this RDD based on the keys and store the HTML from each resulting RDD individually on HDFS. Why do I want to do that? When I try to store the HTML from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator.
I'm doing this in Scala:
htmlRDD.saveAsHadoopFile("hdfs:///Path/", classOf[String], classOf[String], classOf[Formatter])
You can also try this instead of splitting the RDD:
htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/")
I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object just fine.
Spark saves each RDD partition as one HDFS file, so to achieve good parallelism your source RDD should have many partitions (how many actually depends on the size of the whole dataset). So I think you want not to split your RDD into several RDDs, but rather to have one RDD with many partitions.
You can do that with repartition() or coalesce().
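If you really do need one output location per key, a hedged alternative to splitting into n RDDs is Hadoop's MultipleTextOutputFormat, which routes each record to a path derived from its key in a single pass (the class name KeyBasedOutput is mine, not from the question):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each (key, html) pair to a file under a directory named after the key.
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name
  // Returning null drops the key from the file contents, keeping only the HTML.
  override def generateActualKey(key: String, value: String): String = null
}

htmlRDD.saveAsHadoopFile(
  "hdfs:///Path/",
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutput]
)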

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same keys to the same machine?

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce the data shuffle, I wish I could add a pre-distribution phase so that partitions with the same keys are placed on the same machine. Hopefully this would reduce some of the shuffle time.
I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself?
I know that when I join two RDDs and one is preprocessed with partitionBy, Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what happens if I use partitionBy on both RDDs and then do the join.
If you use the same partitioner for both RDDs, you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.
Nevertheless, performance should still be better than if the two RDDs had different partitioners.
I have found "Speeding Up Joins by Assigning a Known Partitioner" helpful for understanding the effect of using the same partitioner for both RDDs:
Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
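A minimal sketch of that advice, with two illustrative pair RDDs rddA and rddB:

import org.apache.spark.HashPartitioner

// Partition both sides with the same partitioner and persist them,
// so the join can reuse the existing layout instead of re-shuffling.
val partitioner = new HashPartitioner(200)
val left  = rddA.partitionBy(partitioner).persist()
val right = rddB.partitionBy(partitioner).persist()

// Co-partitioned join: matching keys already live in matching partitions.
val joined = left.join(right)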