mapPartitionsWithIndex for Spark Datasets - apache-spark-dataset

Does anyone know why mapPartitionsWithIndex is only available for RDDs but not for Datasets?
Thanks in advance!

Related

Spark Repartition() and Coalesce() internals

How does Repartition or Coalesce work internally?
For Repartition() is the data being collected on Drive node and then shuffled across the executors?
Is Coalesce a Narrow/wide transformation?
Hope this answer is helpful - Spark - repartition() vs coalesce()
Do read the answer by Powers and Justin

What is materialization of RDDS in Spark?

I have been searching for what materialization means, and I keep getting links to persist() functions. But more fundamentally and conceptually, what does materialization of Rdds help in and what is it?

Split an RDD into multiple RDDS

I have a pair RDD[String,String] where key is a string and value is html. I want to split this rdd into n RDDS based on n keys and store them in HDFS.
htmlRDD = [key1,html
key2,html
key3,html
key4,html
........]
Split this RDD based on keys and store html from each RDD individually on HDFS. Why I want to do that? When, I'm trying to store the html from the main RDD to HDFS,it takes a lot of time as some tasks are denied committing by output coordinator.
I'm doing this in Scala.
htmlRDD.saveAsHadoopFile("hdfs:///Path/",classOf[String],classOf[String], classOf[Formatter])
You can also try this in place of breaking RDD:
htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/");
I tried this and it worked for me. I had RDD[JSONObject] and it wrote toString() of JSON Object very well.
Spark saves each RDD partition into 1 hdfs file partition. So to achieve good parallelism your source RDD should have many partitions(actually depends on size of whole data). So I think you want to split your RDD not into several RDDs, but rather to have RDD with many partitions.
You you can do it with repartition() or coallesce()

Apache spark + RDD + persist() doubts

I am new in apache spark and using scala API. I have 2 questions regarding RDD.
How to persist some partitions of a rdd, instead of entire rdd in apache spark? (core rdd implementation provides rdd.persist() and rdd.cache() methods but i do not want to persist entire rdd. I am interested only some partitions to persist.)
How to create one empty partition while creating each rdd? (I am using repartition and textFile transformations.In these cases i can get expected number of partitions but i also want one empty partition for each rdd.)
Any help is appreciated.
Thanks in advance

When create two different Spark Pair RDD with same key set, will Spark distribute partition with same key to the same machine?

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDD comes from the same set. To reduce data shuffle, I wish I could add a pre-distribute phase so that partitions with the same key will be distributed on the same machine. Hopefully this could reduce some shuffle time.
I want to know is spark smart enough to do that for me or I have to implement this logic myself?
I know when I join two RDD, one preprocess with partitionBy. Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what will happen if I use partitionBy on two RDD at the same time and then do the join.
If you use the same partitioner for both RDDs you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.
Nevertheless, the performance should be better as if both RDDs would have different partitioner.
I have seen this, Speeding Up Joins by Assigning a Known Partitioner that would be helpful to understand the effect of using the same partitioner for both RDDs;
Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a
shuffle, such as aggregateByKey or reduceByKey, you can prevent the
shuffle by adding a hash partitioner with the same number of
partitions as an explicit argument to the first operation and
persisting the RDD before the join.