Count number of elements in each pyspark RDD partition - pyspark

I'm looking for the PySpark equivalent of this question: "How to get the number of elements in partition?"
Specifically, I want to programmatically count the number of elements in each partition of a pyspark RDD or dataframe (I know this information is available in the Spark Web UI).
This attempt:
df.foreachPartition(lambda iter: sum(1 for _ in iter))
results in:
AttributeError: 'NoneType' object has no attribute '_jvm'
I do not want to collect the contents of the iterator into memory.

If you are asking: can we get the number of elements in an iterator without iterating through it? The answer is No.
But we don't have to store it in memory, as in the post you mentioned:
def count_in_a_partition(idx, iterator):
    count = 0
    for _ in iterator:
        count += 1
    # yield (rather than return) so each partition contributes one (index, count) pair
    yield idx, count
data = sc.parallelize([1, 2, 3, 4], 4)
data.mapPartitionsWithIndex(count_in_a_partition).collect()
EDIT
Note that your code is very close to the solution. foreachPartition returns nothing, so use mapPartitions instead; the function passed to it needs to return an iterator (here, a generator):
def count_in_a_partition(iterator):
    yield sum(1 for _ in iterator)
data.mapPartitions(count_in_a_partition).collect()
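Because the counting function only consumes an iterator, it can be exercised locally before it is handed to Spark. The following is a minimal sketch under that assumption: `fake_partitions` is a hypothetical stand-in for real partitions, and no Spark session is involved.

```python
def count_in_a_partition(idx, iterator):
    """Yield one (partition index, element count) pair without buffering rows."""
    count = 0
    for _ in iterator:
        count += 1
    yield idx, count


# Hypothetical stand-in for four partitions. Generators are used so that
# only the element currently being consumed is ever held in memory.
fake_partitions = [
    (x for x in range(3)),
    (x for x in range(0)),
    (x for x in range(5)),
    (x for x in range(1)),
]

# Apply the function to every "partition" and flatten the yielded pairs,
# which is roughly what mapPartitionsWithIndex(...).collect() would return.
counts = [
    pair
    for idx, part in enumerate(fake_partitions)
    for pair in count_in_a_partition(idx, part)
]
print(counts)
```

On a cluster, the same function would be passed as data.mapPartitionsWithIndex(count_in_a_partition).collect(). For a DataFrame, going through df.rdd first should work the same way.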

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
  .map(_.sortBy { x =>
    val index = ordering_br.value.indexOf(x)
    if (index == -1) Int.MaxValue else index
  })
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all elements that are not found would end up at the beginning. I understand that you want them at the end, so I replace -1 with a large number (Int.MaxValue).
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
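The same idea can be sketched in plain Python (no Spark needed to follow the logic). This variant swaps the repeated indexOf scan for a dictionary lookup; the names `rank` and `sort_by_ordering` are made up for illustration. In PySpark, the dictionary could be shared with sc.broadcast and read through .value inside the map function.

```python
ordering = ["A", "B", "C", "D"]

# Position of every known element; built once, looked up in O(1)
rank = {value: position for position, value in enumerate(ordering)}


def sort_by_ordering(items, rank):
    """Sort items by their position in the ordering; unknown items go last."""
    # len(rank) is larger than any real position, so it plays the role
    # that Int.MaxValue plays in the Scala version
    return sorted(items, key=lambda x: rank.get(x, len(rank)))


records = [
    ["C", "B", "F", "K"],
    ["B", "A", "Z", "M"],
    ["X", "T", "D", "C"],
]

result = [sort_by_ordering(record, rank) for record in records]
for row in result:
    print(row)
```

Python's sorted is stable, so elements missing from the ordering keep their original relative order at the end of each list.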

Count operation in reduceByKey in spark

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?
Something like this is what I have in mind:
reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))
But this does not work. Any suggestions, please?
Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform.map { temp =>
((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
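To see why summing 1s gives a count, here is a small local sketch in Python. `reduce_by_key` is a hypothetical stand-in for what reduceByKey does per key, and the sample records are invented for illustration.

```python
def reduce_by_key(pairs, combine):
    """Hypothetical local stand-in for reduceByKey: merge values per key."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Invented sample records: (short id, name, unused double, double to sum)
records = [
    (1, "a", 0.5, 2.0),
    (1, "a", 1.5, 3.0),
    (2, "b", 9.0, 4.0),
]

# Map every record to (key, (1, amount)); the constant 1 is what gets counted
pairs = [((r[0], r[1]), (1, r[3])) for r in records]

# Summing both tuple positions yields (count, sum) for each key
totals = reduce_by_key(pairs, lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(totals)
```

The first element of each value is the number of records for that key, and the second is the running sum, so an average is one division away.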
No, you can't do that. RDDs provide an iterator model for lazy computation, so every element is visited only once.
If you really want to compute the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that the elements of an RDD are not ordered).

How to replace RDD type of [String] with values of RDD type [String, Int]

Sorry for the confusion in the initial question. Here is the question with a reproducible example:
I have an RDD[String] and an RDD[(String, Long)]. I would like to get an RDD[Long] by matching the strings of the first against the String keys of the second. Example:
//Create RDD
val textFile = sc.parallelize(Array("Spark can also be used for compute intensive tasks",
"This code estimates pi by throwing darts at a circle"))
// tokenize, result: RDD[(String)]
val words = textFile.flatMap(line => line.split(" "))
// create index of distinct words, result: RDD[(String,Long)]
val indexWords = words.distinct().zipWithIndex()
As a result, I would like to have an RDD with indexes of words instead of words in "Spark can also be used for compute intensive tasks".
Sorry again and thanks
If I understand you correctly, you're interested in the indices of the words that also appear in "Spark can also be used for compute intensive tasks".
If so - here are two versions with identical outputs but different performance characteristics:
val lookupWords: Seq[String] = "Spark can also be used for compute intensive tasks".split(" ")
// option 1 - use join:
val lookupWordsRdd: RDD[(String, String)] = sc.parallelize(lookupWords).keyBy(w => w)
val result1: RDD[Long] = indexWords.join(lookupWordsRdd).map { case (key, (index, _)) => index }
// option 2 - assuming list of lookup words is short, you can use a non-distributed version of it
val result2: RDD[Long] = indexWords.collect { case (key, index) if lookupWords.contains(key) => index }
The first option creates a second RDD with the words whose indices we're interested in, uses keyBy to transform it into a PairRDD (with key == value!), joins it with your indexWords RDD and then maps to get the index only.
The second option should only be used if the list of "interesting words" is known not to be too large - so we can keep it as a list (and not RDD), and let Spark serialize it and send to workers for each task to use. We then use collect(f: PartialFunction[T, U]) which applies this partial function to get a "filter" and a "map" at once - we only return a value if the words exists in the list, and if so - we return the index.
I was getting the SPARK-5063 error and, given this answer, I found the solution to my problem:
//broadcast `indexWords`
val bcIndexWords = sc.broadcast(indexWords.collectAsMap)
// select `value` of `indexWords` given `key`
val result = textFile.map{arr => arr.split(" ").map(elem => bcIndexWords.value(elem))}
result.first()
res373: Array[Long] = Array(3, 7, 14, 6, 17, 15, 0, 12)

Why printing inside foreach doesn't reflect an order of elements

Maybe I am missing something, but I expected the data to be sorted by key:
scala> val x=sc.parallelize(Array( "cat", "ant", "1"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at <console>:22
scala> val xxx=x.map(v=> (v,v.length))
xxx: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[161] at map at <console>:26
scala> xxx.sortByKey().foreach(println)
(1,1)
(cat,3)
(ant,3)
scala> xxx.sortByKey().foreach(println)
(cat,3)
(1,1)
(ant,3)
It works if I tell Spark to use only 1 partition, as below, but how can I make this work on a cluster or with more than one worker?
scala> xxx.sortByKey(numPartitions=1).foreach(println)
(1,1)
(ant,3)
(cat,3)
UPDATE:
I think I got the answer. It is being sorted correctly as it works when I use the collect
scala> xxx.sortByKey().collect
res170: Array[(String, Int)] = Array((1,1), (ant,3), (cat,3))
Keeping the question open to validate my understanding.
That makes sense. foreach runs in parallel across the partitions which creates non-deterministic ordering. The order may be mixed. collect gives you an array of the partitions concatenated in their sorted order.
Have a look at the Spark documentation to see why the collect() method fixed the issue for you.
e.g.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Calling collect() on the resulting RDD will return or output an ordered list of records
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Remember doing a collect() action operation on a very large distributed RDD can cause your driver program to run out of memory and crash. So, do not use collect() except for when you are prototyping your Spark program on a small dataset.
Have a look at this article for more details
EDIT:
sortByKey(): Sorts the RDD by key, so that each partition contains a sorted range of the elements. Since the partitions may not all reside on the same executor node, you will not get an ordered result unless you call collect().
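The difference between foreach and collect can be mimicked locally. The sketch below is a simplified model, not Spark itself: `partitions` is a hypothetical result of range partitioning, `collect` concatenates partitions in index order, and `foreach_order` pretends the partitions are processed in an arbitrary order.

```python
import random

pairs = [("cat", 3), ("ant", 3), ("1", 1)]

# Hypothetical outcome of sortByKey(): every partition holds a sorted,
# non-overlapping key range (here simply one pair per partition)
partitions = [[p] for p in sorted(pairs)]


def collect(partitions):
    """Concatenate partitions in partition-index order, as collect() does."""
    return [p for part in partitions for p in part]


def foreach_order(partitions, seed):
    """Visit partitions in an arbitrary order, imitating parallel tasks."""
    order = list(range(len(partitions)))
    random.Random(seed).shuffle(order)
    return [p for i in order for p in partitions[i]]


collected = collect(partitions)
print(collected)
for seed in range(3):
    # Same elements every time, but not necessarily in key order
    print(foreach_order(partitions, seed))
```

In this model the data is always sorted; only the order in which the partitions are visited varies, which is what shows up when printing from foreach.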

Spark processing columns in parallel

I've been playing with Spark, and I managed to get it to crunch my data. My data is a flat delimited text file with 50 columns and about 20 million rows. I have Scala scripts that process each column.
In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, its rows are processed in parallel, but the columns themselves are processed one after another.
A simple example: if my data is 5 column text delimited file and each column contain text, and I want to do word count for each column. I would do:
for (i <- 0 until 4) {
  // reduceByKey sums the 1s per word; collect() triggers the job
  data.map(_.split("\t", -1)(i)).map((_, 1)).reduceByKey(_ + _).collect()
}
Although each column's operation runs in parallel, the columns themselves are processed sequentially (bad wording, I know. Sorry!). In other words, column 2 is processed after column 1 is done, column 3 is processed after columns 1 and 2 are done, and so on.
My question is: is there any way to process multiple columns at a time? If you know a way, or a tutorial, would you mind sharing it with me?
thank you!!
Suppose each input row is a Seq. The following can be done to process the columns concurrently. The basic idea is to use the pair (column index, value) as the key.
scala> val rdd = sc.parallelize((1 to 4).map(x=>Seq("x_0", "x_1", "x_2", "x_3")))
rdd: org.apache.spark.rdd.RDD[Seq[String]] = ParallelCollectionRDD[26] at parallelize at <console>:12
scala> val rdd1 = rdd.flatMap{x=>{(0 to x.size - 1).map(idx=>(idx, x(idx)))}}
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = FlatMappedRDD[27] at flatMap at <console>:14
scala> val rdd2 = rdd1.map(x=>(x, 1))
rdd2: org.apache.spark.rdd.RDD[((Int, String), Int)] = MappedRDD[28] at map at <console>:16
scala> val rdd3 = rdd2.reduceByKey(_+_)
rdd3: org.apache.spark.rdd.RDD[((Int, String), Int)] = ShuffledRDD[29] at reduceByKey at <console>:18
scala> rdd3.take(4)
res22: Array[((Int, String), Int)] = Array(((0,x_0),4), ((3,x_3),4), ((2,x_2),4), ((1,x_1),4))
The example output: ((0, x_0), 4) means the first column, key is x_0, and value is 4. You can start from here to process further.
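The (column index, value) keying can be sketched locally in Python with a Counter, which plays the role of map plus reduceByKey here. This is only an illustration on the same toy rows, not the Spark job itself.

```python
from collections import Counter

# Same toy data as above: four identical rows with four columns each
rows = [["x_0", "x_1", "x_2", "x_3"] for _ in range(4)]

# Key every cell by (column index, value) and count the occurrences,
# so all columns are handled in a single pass over the rows
counts = Counter(
    (idx, value)
    for row in rows
    for idx, value in enumerate(row)
)

for key in sorted(counts):
    print(key, counts[key])
```

In PySpark, a comparable pipeline would likely be a flatMap that emits ((idx, value), 1) for every cell, followed by reduceByKey with addition.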
You can try the following code, which uses Scala's parallel collections feature:
(0 until 4).map(index => (index, data)).par.map { x =>
  // one word-count job per column; collect() triggers each job
  x._2.map(_.split("\t", -1)(x._1)).map((_, 1)).reduceByKey(_ + _).collect()
}
data is only a reference, so duplicating it does not cost much. RDDs are read-only, so processing them in parallel is safe. The par method uses the parallel collections feature. You can check the parallel jobs in the Spark web UI.
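In Python, the closest analogue to par is submitting one task per column from a thread pool. The sketch below runs on a plain list rather than an RDD, and `count_column` and the sample `data` are invented for illustration. With Spark, each call would instead trigger one job, and jobs submitted from separate driver threads can run concurrently.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Invented tab-delimited sample with three columns
data = ["a\tb\tc", "a\tx\tc", "d\tb\tc"]


def count_column(index):
    """Word count for a single column of the shared, read-only data."""
    return Counter(line.split("\t")[index] for line in data)


# One task per column; pool.map returns results in column order
with ThreadPoolExecutor(max_workers=3) as pool:
    per_column = list(pool.map(count_column, range(3)))

for index, counts in enumerate(per_column):
    print(index, dict(counts))
```

Because every task only reads the shared data, no locking is needed, which matches the read-only argument made above for RDDs.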