Printing reduceByKey output - scala

How to print the output after reduceByKey
I tried things like
totalsByAge.foreach{ i =>println("Value = " + i )}
I have a couple of lines of code
val totalsByAgeEntry = rdd.mapValues(x => (x, 1))
val totalsByAge = totalsByAgeEntry.reduceByKey( (x,y) => (x._1 + y._1, x._2 + y._2))
I want to print the tuple that gets when reduceByKey is called. I dont print the output after (x._1 + y._1, x._2 + y._2) is computed.
I know that the data created after reduceByKey is something like:
(x,((x1,y1),(x2,y2))
But how can I print that

That is because reduceByKey is performed by the executors, and println prints the output in the executor's standard output. The executor's stdout is usually available at master.application.ip.address:8080.
If you want to print/view your data you can do that in several ways. For instance: 1) by applying totalByAge.take(numberOfLines).foreach(println); 2) by collecting (.collect()) the RDD to the driver; and 3) by converting the RDD to a Dataframe and then applying .show().
val rdd: RDD[(Int, Int)] =
sparkContext
.parallelize(Vector(1, 2, 3))
.map(i => (i, 1))
.reduceByKey(_ + _)
rdd.take(10).foreach(println) // take the first 10 lines and print them
rdd.collect().foreach(println) // centralize the entire RDD and print it
import spark.implicits._
rdd.toDF().show(10) // conver to dataframe and show the first 10 lines

Related

How to Sorting results after run code in Spark

I created some code lines of scala to count number of words in a text file (in Spark). The result such like this:
(further,,1)
(Hai,,2)
(excluded,1)
(V.,5)
I wonder that can I sort the result as follow:
(V.,5)
(Hai,,2)
(excluded,1)
(further,,1)
The code as showed bellow, thank you for your help!
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
wordCounts.saveAsTextFile("./WordCountTest")
If you want to sort your first dataset by the second field, you can use the following code:
val wordCounts = Seq(
("V.",5),
("Hai",2),
("excluded",1),
("further",1)
)
val wcOrdered = wordCounts.sortBy(_._2).reverse
which yields the following result
wcOrdered: Seq[(String, Int)] = List((V.,5), (Hai,2), (further,1), (excluded,1))
You can just call wordCounts.sortBy(_._2, false). Method sortBy from RDD takes boolean as the second argument, which tells if the result should be sorted ascending (true - default) or descending (false).
textFile
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, false)

Spark print result of Arrays or saveAsTextFile

I have been bogged down by this for some hours now... tried collect and mkString(") and still i am not able to print in console or save as text file.
scala> val au1 = sc.parallelize(List(("a",Array(1,2)),("b",Array(1,2))))
scala> val au2 = sc.parallelize(List(("a",Array(3)),("b",Array(2))))
scala> val au3 = au1.union(au2)
Result of the union is
Array[(String,Array[int])] = Array((a,Array(1,2)),(b,Array(1,2)),(a,Array(3)),(b,Array(2)))
All the print attempts are resulting in following when i do x(0) and x(1)
Array[Int]) does not take parameters
Last attempt, performed following and it is resulting in index error
scala> val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
<console>:33: error: value _1 is not a member of Array[Int]
val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
._1 or ._2 can be done in tuples and not in arrays
("a",Array(1,2)) is a tuple so ._1 is a and ._2 is Array(1,2)
so if you want to get elements of an array you need to use () as x._2(0)
but au2 arrays has only one element so x._2(1) will work for au1 and not for au2. You can use Option or Try as below
val au4 = au3.map(x => (x._1, x._2(0), Try(x._2(1)) getOrElse(0)))
The result of au3 is not Array[(String,Array[int])] , it is RDD[(String,Array[int])]
so this how you could do to write output in a file
au3.map( r => (r._1, r._2.map(_.toString).mkString(",")))
.saveAsTextFile("data/result")
You need to map through the array and create a string from it so that it could be written in file as
(a,1:2)
(b,1:2)
(a,3)
(b,2)
You could write to file without brackets as below
au3.map( r => Row(r._1, r._2.map(_.toString).mkString(":")).mkString(","))
.saveAsTextFile("data/result")
Output:
a,1:2
b,1:2
a,3
b,2
The value is comma "," separated and array value are separated as ":"
Hope this helps!

How can I deal with each adjoin two element difference greater than threshold from Spark RDD

I have a problem with Spark Scala which get the value of each adjoin two element difference greater than threshold,I create a new RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each two adjoin element like this:
a.apply(1)-a.apply(0) ,a.apply(2)-a.apply(1),…… a.apply(a.lenght)-a.apply(a.lenght-1)
If the result greater than the threshold of 10,than output the collection,like this:
[19,89]
How can I do this with scala from RDD?
If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create rdd as
val rdd = sc.parallelize(data)
What you desire can be achieved by doing the following
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
.sliding(2)
.map(x => (x(1), x(1)-x(0)))
.filter(y => y._2 > 10)
.map(z => z._1)
Doing
finalrdd.foreach(println)
should print
19
89
You can create another RDD from the original dataframe and zip those two RDD which creates a tuple like (2,3)(3,5)(5,8) and filter the subtracted result if it is greater than 10
val rdd = spark.sparkContext.parallelize(Seq(2,3,5,8,19,3,5,89,20,17))
val first = rdd.first()
rdd.zip(rdd.filter(r => r != first))
.map( k => ((k._2 - k._1), k._2))
.filter(k => k._1 > 10 )
.map(t => t._2).foreach(println)
Hope this helps!

Spark Split RDD into chunks and concatenate

I have a relatively simple problem.
I have an large Spark RDD[String] (containing JSON). In my use case I want to group (concatenate) N strings together into a new RDD[String], so that it will have the size of oldRDD.size/N.
pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldnt find anything.
Thanks!
Here you have to use zipWithIndex function and then calculate group.
For example, index = 3 and n (number of groups) = 2 gives you 2nd group. 3 / 2 = 1 (integer divide), so 0-based 2nd group
val n = 3;
val newRDD1 = oldRDD.zipWithIndex() // creates tuples (element, index)
// map to tuple (group, content)
.map(x => (x._2 / n, x._1))
// merge
.reduceByKey(_ + ", " + _)
// remove key
.map(x => x._2)
One note: order of "zipWithIndex" is internal order. It can make no sense in business logic, you must check if order is ok in your case. If not, sort RDD and then use zipWithIndex

word count(frequency) spark rdd scala

if I have an rdd accross cluster and I want to do the word count
not only count the appear times,
I want to get the frequency, which is defined as count/total count
What is the best and efficient way to do so in scala?
How can I do reduction job and calculate total number at the same time within one workflow?
BTW I know purely word count can be done in this way.
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
but what is the difference if I use aggregate? in terms of spark job workflow
val result = pairs
.aggregate(Map[String, Int]())((acc, pair) =>
if(acc.contains(pair._1))
acc ++ Map[String, Int]((pair._1, acc(pair._1)+1))
else
acc ++ Map[String, Int]((pair._1, pair._2))
,
(a, b) =>
(a.toSeq ++ b.toSeq)
.groupBy(_._1)
.mapValues(_.map(_._2).reduce(_ + _))
)
You can use this
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
There exists also the concept of Accumulator which is a write-only variable and you could use it to avoid using the sum() action, but your code would need a lot of change.