How to create accumulator for list of DataFrame? - scala

How do I create an accumulator of List[DataFrame] in Scala Spark?
Any suggestions would be appreciated.
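There is no built-in accumulator for this, but a minimal sketch of one possible approach, assuming Spark 2.x's AccumulatorV2 API, is shown below (the class and variable names are illustrative; note that a DataFrame is a driver-side handle, so adding to such an accumulator is only meaningful from driver-side code, not from inside executor-side closures):
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.AccumulatorV2

// Sketch: an accumulator that collects DataFrames into a List.
// Caveat: DataFrames are driver-side handles, so add() only makes sense
// when called from driver code, not from tasks running on executors.
class DataFrameListAccumulator extends AccumulatorV2[DataFrame, List[DataFrame]] {
  private var dfs: List[DataFrame] = Nil

  override def isZero: Boolean = dfs.isEmpty

  override def copy(): DataFrameListAccumulator = {
    val newAcc = new DataFrameListAccumulator
    newAcc.dfs = dfs
    newAcc
  }

  override def reset(): Unit = { dfs = Nil }

  override def add(df: DataFrame): Unit = { dfs = df :: dfs }

  override def merge(other: AccumulatorV2[DataFrame, List[DataFrame]]): Unit = {
    dfs = dfs ++ other.value
  }

  override def value: List[DataFrame] = dfs
}

// Registration and use (illustrative names):
// val acc = new DataFrameListAccumulator
// spark.sparkContext.register(acc, "dfListAcc")
// acc.add(someDf)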

Related

How can I merge RDDs after repartitionAndSortWithinPartitions to generate final sorted RDD

I have a key-value RDD of type (Array[Byte], Array[Byte]) to which I apply repartitionAndSortWithinPartitions, which results in sorted partitions. Calling .collect() only fetches the data to the driver, whereas I want to merge the partitions into a single sorted RDD.
Can I use zipPartitions to zip RDDs in a sorted fashion? If so, how?
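As a hedged sketch (not from the original thread, and the names are illustrative): instead of merging partitions by hand, sortByKey range-partitions the data and sorts within each partition, so its output is already globally ordered by partition index; with an Ordering[Array[Byte]] in scope you can sort the byte-array keys directly and then read the partitions in order, e.g. with toLocalIterator:
import org.apache.spark.rdd.RDD

// sortByKey needs an Ordering for the key type; byte arrays have no implicit one,
// so define an unsigned lexicographic comparison here.
implicit val byteArrayOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    val len = math.min(a.length, b.length)
    var i = 0
    while (i < len) {
      val cmp = (a(i) & 0xff) - (b(i) & 0xff) // compare bytes as unsigned values
      if (cmp != 0) return cmp
      i += 1
    }
    a.length - b.length
  }
}

// 'rdd' stands in for the question's (Array[Byte], Array[Byte]) pair RDD.
def globallySorted(rdd: RDD[(Array[Byte], Array[Byte])]): RDD[(Array[Byte], Array[Byte])] =
  rdd.sortByKey() // range-partitioned and sorted within each partition => globally ordered

// To consume the data in order on the driver without collecting everything at once:
// globallySorted(rdd).toLocalIterator.foreach { case (k, v) => /* ... */ }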

Spark - save sparse vectors to MongoDB

I'm trying to vectorise my data using Spark's TF-IDF-style functions, so as output I get very long sparse feature vectors in which only a few indices have values.
I was thinking of saving such vectors as MongoDB objects, using the populated indices as keys. So, for example, a SparseVector like
(23,[0,15],[1.0,1.0])
would be converted into a MongoDB object as follows:
{"0": 1.0, "15": 1.0}
How can I do this using Spark Scala and the MongoDB connector?
I should probably implement some kind of UDF, but I'm not sure what type MongoDB would accept as input.
OK, I found the solution.
Here's the UDF I defined to convert SparseVectors to a BSON-convertible Map:
import org.apache.spark.ml.linalg.SparseVector // or org.apache.spark.mllib.linalg.SparseVector, depending on which API produced the vectors
import org.apache.spark.sql.functions.udf

// Map each populated index of the sparse vector to an (index-as-string, value) pair.
val makeSparseMapUdf = udf {
  (vec: SparseVector) => vec.indices
    .map(index => (index.toString, vec.toArray(index)))
    .toMap
}
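For context, a hedged usage sketch (the DataFrame, column names, and write configuration are illustrative, and the data source name depends on the connector version):
import org.apache.spark.sql.functions.col

// Hypothetical usage: df has a sparse-vector column called "features".
val withMap = df.withColumn("featuresMap", makeSparseMapUdf(col("features")))

// Assuming the MongoDB Spark connector is on the classpath and the output URI is configured.
withMap.write
  .format("mongo") // "mongo" for the 2.x/3.x connector; "mongodb" for the 10.x connector
  .mode("append")
  .save()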

Reduction of a RDD to the collection of its values [duplicate]

This question already has an answer here:
ways to replace groupByKey in apache Spark
(1 answer)
Closed 6 years ago.
I have an RDD of (key, value) pairs where the value is a case class, and I need to reduce this RDD to (key, ArrayBuffer(values)). Based on the comments below, the typical way is to use the groupByKey method; however, I wanted to know whether I can do this with reduceByKey instead, since that is a more efficient approach according to this article.
// Assuming pairRdd is the RDD that contains the (key, value) pairs:
val groupedPairRDD = pairRdd.groupByKey
The resulting groupedPairRDD is your expected output: it contains the collection of values for each key.
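On the reduceByKey part of the question: when the goal is to keep every value per key, a map-side combiner cannot shrink the shuffle the way it does for a true reduction, so groupByKey is usually just as good here. For completeness, a hedged sketch of the aggregateByKey variant (the key and value types are illustrative, and an existing SparkContext sc is assumed):
import scala.collection.mutable.ArrayBuffer

// Illustrative value type standing in for the question's case class.
case class Score(name: String, points: Int)

val samplePairRdd = sc.parallelize(Seq(
  (1, Score("a", 10)),
  (1, Score("b", 20)),
  (2, Score("c", 30))
))

// aggregateByKey builds a partition-local ArrayBuffer per key (seqOp)
// and then merges the buffers across partitions (combOp).
val collected = samplePairRdd.aggregateByKey(ArrayBuffer.empty[Score])(
  (buf, v) => buf += v,  // append one value to the partition-local buffer
  (b1, b2) => b1 ++= b2  // merge buffers coming from different partitions
)
// collected: RDD[(Int, ArrayBuffer[Score])]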

Spark, mapPartitions, network connection is closed before map operation is finished

I am running a Spark job, and at some point I want to connect to an Elasticsearch server to fetch some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType", record.id.toString)
      .execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that client.close() is called before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would create a huge number of connections. Is it possible to make sure that client.close() is called only after the flatMap operation has finished?
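A hedged sketch of the direct fix (not from the answers below): records inside mapPartitions is a lazy iterator, so the flatMap is not evaluated until the partition's output is consumed, which is why client.close() runs first; materializing the results before closing the client forces the evaluation while the connection is still open, at the cost of buffering one partition's results in memory:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  // toList forces every request to run before the client is closed;
  // this buffers the whole partition's results in memory.
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType", record.id.toString)
      .execute().actionGet()
    processRec(record, response)
  }).toList
  client.close()
  newRecs.iterator
}) // end of mapPartitions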
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch-for-Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
import org.elasticsearch.spark._
val esRdd = sc.esRDD("index/indexType") // returns a pair RDD of (_id, Map of all key-value pairs for all fields)
// Convert your RDD of records to a pair RDD of (id, record) so we can join on the id;
// the _id keys returned by esRDD are strings, hence the toString.
val inputPairRdd = input.map(record => (record.id.toString, record))
inputPairRdd.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2)) // result of the join: (id, (inputRecord, esRecord))
Hope this helps.
PS: This still does not address the lack of co-location (i.e., each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents when the ES index is created.

How to sort an RDD of tuples with 5 elements in Spark Scala?

If I have an RDD of tuples with 5 elements, e.g.,
RDD(Double, String, Int, Double, Double)
How can I sort this RDD efficiently using the fifth element?
I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow; it is slower than collecting the RDD and using sortWith on the collected array. Why is that?
Thank you very much.
You can do this with sortBy acting directly on the RDD:
myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple
There are extra optional parameters to define the sort order (ascending) and the number of partitions.
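As a hedged one-liner using both optional parameters (the variable name comes from the snippet above and the partition count is illustrative):
// Sort by the 5th field in descending order, repartitioning the result into 8 partitions.
val sortedDesc = myRdd.sortBy(_._5, ascending = false, numPartitions = 8)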
If you want to sort in descending order and the corresponding element is numeric, you can negate it with a "-" sign to sort the RDD in descending order.
For example:
I have an RDD of (String, Int) tuples. To sort this RDD by its 2nd element in descending order:
rdd.sortBy(x => -x._2).collect().foreach(println)
I have an RDD of (String, String) tuples. To sort this RDD by its 2nd element in descending order:
rdd.sortBy(x => x._2, false).collect().foreach(println)
sortByKey is the only distributed sorting API as of Spark 1.0.
How much data are you trying to sort? A small amount will be sorted faster locally on the driver. Spark shines when you need to sort gigabytes upon gigabytes of data that may not even fit on a single node.