Sorting a list of tuples in PySpark

I have a set of tuple key,value pairs which look like this:
X=[(('cat','mouse'),1),(('dog','rat'),20),(('hamster','skittles'),67)]
which I want to sort in order of the second item in the tuple. Pythonically I would have used:
sorted(X, key=lambda tup:tup[1])
I also want to get the key,value pair with the highest value, again, pythonically this would be simple:
max_X=max(x[1] for x in X)
max_tuple=[x for x in X if x[1]==max_X]
However, I do not know how to translate this into a Spark job.

X.max(lambda x: x[1])
You could also do it another way, which is probably faster if you need to sort your RDD anyway. But this is slower if you don't need the RDD to be sorted, because sorting takes longer than simply asking for the max (so, all else being equal, use the max function).
X.sortBy(lambda x: x[1], False).first()
This sorts as you did before, but passing False sorts in descending order. Then you take the first element, which is the largest.

Figured it out in the 2 minutes since posting!
X.sortBy(lambda x:x[1]).collect()

Related

Preserving index in a RDD in Spark

I would like to create an RDD which contains String elements. Alongside these elements I would like a number indicating the index of each element. However, I do not want this number to change if I remove elements, as I want the number to remain the original index (thus preserving it). It is also important that the order is preserved in this RDD.
If I use zipWithIndex and thereafter remove some elements, will the indexes change? Which function/structure can I use to have unchanged indexes? I was thinking of creating a Pair RDD, however my input data does not contain indexes.
Answering rather than deleting: my problem was easily solved by zipWithIndex, which fulfilled all my requirements.
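For illustration, a minimal sketch (assuming sc is an active SparkContext) of how zipWithIndex attaches the original positions, which then survive a later filter:
// Minimal sketch, assuming `sc` is an active SparkContext
val data = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
// zipWithIndex pairs every element with its original position
val indexed = data.zipWithIndex()            // RDD[(String, Long)]
// Removing elements afterwards does not change the stored indexes
val filtered = indexed.filter { case (s, _) => s != "b" }
filtered.collect().foreach(println)          // (a,0), (c,2), (d,3)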

how to order my tuple of spark results descending order using value

I am new to Spark and Scala. I need to order my result count tuples, which look like (course, count), in descending order. I wrote the following:
val results = ratings.countByValue()
val sortedResults = results.toSeq.sortBy(_._2)
But it still isn't working the way I want: this sorts the results by count in ascending order, but I need them in descending order. Can anybody please help me?
The results currently come out like this:
(History, 12100),
(Music, 13200),
(Drama, 143000)
But I need to display them like this:
(Drama, 143000),
(Music, 13200),
(History, 12100)
thanks
You have almost done it! Sorting is ascending by default, so you need to reverse the ordering. Note that countByValue() returns a plain Scala Map on the driver (not an RDD), so it is the collection sortBy that applies here:
val results = ratings.countByValue()
val sortedResults = results.toSeq.sortBy(-_._2)
// Just to display the results
println(sortedResults.toList)
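If you would rather keep the counting distributed instead of collecting it with countByValue, a rough sketch (assuming ratings is an RDD of course names) would build the counts with reduceByKey and then use the RDD sortBy, which does take an ascending flag:
// Sketch only: assumes `ratings` is an RDD[String] of course names
val countsRdd = ratings.map(course => (course, 1L)).reduceByKey(_ + _)
// RDD.sortBy is ascending by default; pass ascending = false for descending
val sortedDesc = countsRdd.sortBy(_._2, ascending = false)
sortedDesc.collect().foreach(println)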
You can use
.sortWith(_._2 > _._2)
Most of the time, calling toSeq is not a good idea because the driver needs to hold all of it in memory and you might run out of memory on larger data sets. I guess this is OK for an intro to Spark, though.
For example, if someRDD is a pair RDD and the value is comparable, you can do it like this:
someRDD.sortBy(item => item._2, ascending = false)
Note: the second argument is the ascending flag; setting it to false gives descending order.

Scala Count Occurrences in Large List

In Scala I have a list of tuples List[(String, String)]. So now from this list I want to find how many times each unique tuple appears in the list.
One way to do this would be to apply groupBy { x => x } and then find the length of each group. But my data set is quite large and it's taking a lot of time.
So is there any better way of doing this?
I would do the counting manually, using a Map. Iterate over your collection/list and build a count map as you go. Keys in the count map are the unique items from the original collection/list, and values are the number of occurrences of each key. If the item being processed during the iteration is already in the count map, increase its value by 1; if not, add it to the map with value 1. You can use getOrElse:
count(current_item) = count.getOrElse(current_item, 0) + 1
This should work faster than groupBy followed by a length check, and it will also require less memory.
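A minimal sketch of that approach with a mutable Map (the pairs list here is just hypothetical sample data):
import scala.collection.mutable
// Hypothetical sample data standing in for your List[(String, String)]
val pairs: List[(String, String)] = List(("a", "b"), ("c", "d"), ("a", "b"))
val counts = mutable.Map.empty[(String, String), Int]
for (item <- pairs) {
  // getOrElse returns 0 the first time a tuple is seen,
  // so this either inserts a new entry or bumps the existing count
  counts(item) = counts.getOrElse(item, 0) + 1
}
println(counts)   // e.g. Map((a,b) -> 2, (c,d) -> 1)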
For other suggestions, check also this discussion.

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already used quick-sort logic and it's working, but it involves too many for loops and so on.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it - once you collect it, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example - ascending order by column "col1":
val sorted = dataframe.sort(asc("col1"))
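If the sort key really is given as a numeric column index rather than a name, you could resolve the name via columns before sorting, or fall back to sorting the collected array on the driver; a rough sketch, assuming colIndex is a hypothetical 0-based column position and that column holds Int values:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.asc
// Sketch only: `colIndex` is a hypothetical 0-based column position
val colName = dataframe.columns(colIndex)
val sortedDf = dataframe.sort(asc(colName))
// If you already hold the collected Array[Row], plain Scala sortBy also works (on the driver)
val rows: Array[Row] = dataframe.collect()
val sortedRows = rows.sortBy(_.getInt(colIndex))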

Spark need an RDD.take with a big argument. Result should be an RDD

Is there an RDD method like take but which does not pull all the elements into memory? For example, I may need to take 10^9 elements of my RDD and keep the result as an RDD. What is the best way to do that?
EDIT: A solution could be to zipWithIndex and filter with index < aBigValue, but I am pretty sure there is a better solution.
EDIT 2: The code will be like
sc.parallelize(1 to 100, 2).zipWithIndex().filter(_._2 < 10).map(_._1)
It is a lot of operations just to reduce the size of an RDD :-(
I actually quite liked the zipWithIndex + filter mechanism, but if you are looking for an alternative that is sometimes much faster, I would suggest the sample function as described here: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html
data.count
...
res1: Long = 1000
val result = data.sample(false, 0.1, System.currentTimeMillis().toInt)
result.count
...
res2: Long = 100
sample takes the whole RDD and subsets it by a fraction, returning the result as another RDD. The problem is that if you are looking for exactly 150 samples out of 127112310274 data rows, well, good luck writing that fraction parameter (you could try 150.0 / data.count). But if you are roughly looking for, say, a tenth of your data, this function works much faster than your take/drop or zip-and-filter approach.
A solution:
yourRDD.zipWithIndex().filter(_._2 < ExactNumberOfElements).map(_._1)
If you want an approximation, take GameOfThrows' solution.