Reduction of an RDD to the collection of its values [duplicate] - scala

This question already has an answer here:
ways to replace groupByKey in apache Spark
(1 answer)
Closed 6 years ago.
I have an RDD of (key, value) pairs where the value is a case class, and I need to reduce this RDD to (key, ArrayBuffer(values)). Based on the comments below, the typical way is to use the groupByKey method; however, I wanted to know whether I can do this with reduceByKey instead, since that is the more efficient approach according to this article.

// Consider pairRdd is the RDD that contains the (key, value) then
val groupedPairRDD = pairRdd.groupByKey
The output groupedPairRDD is what you expect: for each key, it contains the collection of values associated with that key.
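If you specifically want an ArrayBuffer of values per key while avoiding groupByKey, a map-side-combining alternative is aggregateByKey. Here is a minimal sketch, assuming pairRdd is the pair RDD from above; MyValue is a hypothetical stand-in for your case class:
import scala.collection.mutable.ArrayBuffer
// Hypothetical value type standing in for your case class
case class MyValue(id: Int, name: String)
// Build an ArrayBuffer of values per key; partial aggregation runs before the shuffle
val grouped = pairRdd.aggregateByKey(ArrayBuffer.empty[MyValue])(
  (buf, v) => buf += v,           // fold a value into the partition-local buffer
  (buf1, buf2) => buf1 ++= buf2   // merge buffers from different partitions
)
Keep in mind that when the per-key result is the full collection of values, roughly the same amount of data crosses the shuffle as with groupByKey, so the gain described in the linked article mainly applies when the combine step actually shrinks the data.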

how to find length of RDD in Spark [duplicate]

This question already has answers here:
How to find spark RDD/Dataframe size?
(3 answers)
Closed 5 years ago.
How can I find the length of the RDD below?
var mark = sc.parallelize(List(1,2,3,4,5,6))
scala> mark.map(l => l.length).collect
<console>:27: error: value length is not a member of Int
mark.map(l => l.length).collect
First you should clarify what you want exactly. In your example you are running a map function, so it looks like you are trying to get the length of each of the fields of the RDD, not the RDD size.
sc.textFile loads everything as Strings, so you can call the length method on each of the fields. parallelize creates an RDD of Ints because your list is made of integers.
If you want the size of an RDD, you should run count on the RDD, not on each field:
mark.count()
This will return 6
If you want the size of each element, you can convert each one to a String if needed, though that seems like an odd requirement. It would look something like this:
mark.map(l => l.toString.length).collect
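To make the distinction concrete, here is a small sketch contrasting the two cases described above; the file path is a placeholder:
// RDD[Int]: Int has no length method, but count() gives the number of elements
val nums = sc.parallelize(List(1, 2, 3, 4, 5, 6))
println(nums.count())                          // 6
// RDD[String]: sc.textFile yields one String per line, so length works per element
val lines = sc.textFile("/path/to/file.txt")   // hypothetical path
println(lines.count())                         // number of lines
println(lines.map(_.length).collect().toList)  // length of each line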

How to loop through tuple in scala [duplicate]

This question already has answers here:
Scala: How to convert tuple elements to lists
(5 answers)
Closed 5 years ago.
I have a tuple in scala
val captainStuff = ("One", "Two", "Three", "Four", "Five")
How can I iterate over it with a for loop? It's easy to loop through a List or a Map, but how do I loop through a tuple?
Thanks!!
You can convert it to an iterator like this:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
captainStuff.productIterator.foreach { x =>
  println(x)
}
This question is a duplicate btw:
Scala: How to convert tuple elements to lists
How can I iterate over it with a for loop? It's easy to loop through a List or a Map, but how do I loop through a tuple?
Lists and maps are collections. Tuples are not. Iterating (aka "looping through") really only makes sense for collections, which tuples aren't.
Tuples are product types. They are a way of grouping multiple values of different types together into a single structure. Considering that the fields of a tuple may have different types, how exactly would you iterate over it? What would be the type of your element variable?
If you are familiar with other languages, you may be familiar with the concept of records (e.g. RECORD in Pascal or struct in C). Tuples are kind of like them, except the fields don't have names. How do you iterate over a record in Pascal or a struct in C? You don't, it makes no sense.
In fact, you can think of an object as a record. Again, how do you iterate over the fields of an object? You don't, it makes no sense.
Note #1: Yes, sometimes it does make sense to iterate over the fields of an object, but only if you are doing reflective metaprogramming.
Note #2: In Scala, tuples inherit from Product, which has a non-typesafe productIterator method that gives you an Iterator[Any] which allows you to iterate over a tuple, but without type-safety. Just don't do it.
tl;dr: Tuples are not collections. You simply don't iterate over them. Period. If you think you have to, you're doing something wrong: you probably shouldn't have a tuple in the first place, but an array or a list instead.
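To make the type-safety point concrete, here is a small sketch reusing the captainStuff tuple from the question:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
// productIterator erases the element types: all you get back is an Iterator[Any]
val asList: List[Any] = captainStuff.productIterator.toList
asList.foreach(println)
// If every element has the same type anyway, a collection is the better model:
val betterList: List[String] = List("One", "Two", "Three", "Four", "Five")
betterList.foreach(println)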

how to order my tuple of spark results descending order using value

I am new to Spark and Scala. I need to order my result count tuples, which look like (course, count), in descending order. I wrote the following:
val results = ratings.countByValue()
val sortedResults = results.toSeq.sortBy(_._2)
But it still isn't working the way I need: the above sorts the results by count in ascending order, but I need them in descending order. Can anybody please help me?
The results currently look like this:
(History, 12100),
(Music, 13200),
(Drama, 143000)
But I need to display them like this:
(Drama, 143000),
(Music, 13200),
(History, 12100)
thanks
You have almost done it! You just need to pass an additional parameter for descending order, since the RDD sortBy() method arranges elements in ascending order by default. Note that countByValue() is an action that returns a driver-local Map rather than an RDD, so to use the RDD sortBy you can compute the counts with reduceByKey instead:
val results = ratings.map(course => (course, 1L)).reduceByKey(_ + _)
val sortedRdd = results.sortBy(_._2, ascending = false)
// Just to display the results from the RDD
println(sortedRdd.collect().toList)
You can use
.sortWith(_._2 > _._2)
Most of the time, calling toSeq is not a good idea because the driver needs to hold everything in memory, and you might run out of memory on larger data sets. I guess this is OK for an intro to Spark, though.
For example, if someRDD is a pair RDD and the value is comparable, you can do it like this:
someRDD.sortBy(item => item._2, ascending = false)
Note: the second argument is the ascending flag; passing false sorts in descending order.
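For completeness, if you keep the original countByValue() call (which returns a driver-local Map rather than an RDD), sorting its entries in descending order happens on the driver. This is a small sketch assuming ratings is an RDD of course names and the counts fit comfortably in driver memory:
// countByValue is an action: the result lives on the driver
val results: scala.collection.Map[String, Long] = ratings.countByValue()
// Sort locally, largest count first (negating the count reverses the order)
val sortedResults = results.toSeq.sortBy(-_._2)
// equivalently: results.toSeq.sortWith(_._2 > _._2)
sortedResults.foreach { case (course, count) => println(s"($course, $count)") }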

Spark, mapPartitions, network connection is closed before map operation is finished

I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType", record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that the client.close() call runs before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would open a huge number of connections. Is it possible to make sure that client.close() is called only after the flatMap operation has finished?
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is what is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch-for-Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the elasticsearch-hadoop dependency.
import org.elasticsearch.spark._
// Returns a pair RDD of (_id, Map of all field key/value pairs)
val esRdd = sc.esRDD("index/indexType")
// Convert your RDD of records to a pair RDD of (id, record), since we want to join on the id
val inputPairRdd = input.map(record => (record.id, record))
// Join the two RDDs on the id; the result is a pair RDD of (id, (inputRecord, esRecord))
inputPairRdd.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2))
Hope this helps.
PS: This still does not alleviate the problem of lack of co-location (i.e., each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time the ES index is created.
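If you do want to keep the per-partition connection from the original question, one common workaround is to wrap the lazy iterator so the client is closed only once the partition has been fully consumed. This is a minimal sketch reusing the question's ElasticSearchConnection, TransportClient and processRec (whose exact behavior is assumed here); note that if the iterator is never fully consumed, the client is not closed:
// Wrap an iterator so a cleanup action runs once it is exhausted
def closeWhenDone[T](underlying: Iterator[T])(close: () => Unit): Iterator[T] =
  new Iterator[T] {
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more) close()
      more
    }
    def next(): T = underlying.next()
  }
input.mapPartitions { records =>
  val client: TransportClient = new ElasticSearchConnection().openConnection()
  // flatMap on an iterator is lazy, so nothing is fetched here yet
  val newRecs = records.flatMap { record =>
    val response = client.prepareGet("index", "indexType", record.id.toString).execute().actionGet()
    processRec(record, response)
  }
  // Close the client only after the partition's iterator has been fully consumed
  closeWhenDone(newRecs)(() => client.close())
}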

How to sort an RDD of tuples with 5 elements in Spark Scala?

If I have an RDD of tuples with 5 elements, e.g.,
RDD(Double, String, Int, Double, Double)
How can I sort this RDD efficiently using the fifth element?
I tried to map this RDD into key-value pairs and use sortByKey, but sortByKey seems quite slow; it is slower than collecting the RDD and using sortWith on the collected array. Why is that?
Thank you very much.
You can do this with sortBy acting directly on the RDD:
myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple
There are extra optional parameters to define sort order ("ascending") and number of partitions.
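For example (a small sketch reusing myRdd from above), both optional parameters can be supplied explicitly:
// Sort by the 5th field, descending, into 8 output partitions
val sorted = myRdd.sortBy(_._5, ascending = false, numPartitions = 8)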
If you want to sort in descending order and the corresponding element is numeric (e.g. an Int), you can use the "-" sign to negate it and sort the RDD in descending order.
For example:
If I have an RDD of (String, Int) tuples, to sort it by its 2nd element in descending order:
rdd.sortBy(x => -x._2).collect().foreach(println);
If I have an RDD of (String, String) tuples, to sort it by its 2nd element in descending order:
rdd.sortBy(x => x._2, false).collect().foreach(println);
sortByKey is the only distributed sorting API as of Spark 1.0.
How much data are you trying to sort? A small amount will be faster with local/centralized sorting. If you are trying to sort GBs and GBs of data that may not even fit on a single node, that's where Spark shines.