Spark print result of Arrays or saveAsTextFile - scala

I have been bogged down by this for some hours now... I tried collect and mkString(",") and still I am not able to print to the console or save as a text file.
scala> val au1 = sc.parallelize(List(("a",Array(1,2)),("b",Array(1,2))))
scala> val au2 = sc.parallelize(List(("a",Array(3)),("b",Array(2))))
scala> val au3 = au1.union(au2)
The result of the union is
Array[(String, Array[Int])] = Array((a,Array(1,2)), (b,Array(1,2)), (a,Array(3)), (b,Array(2)))
All the print attempts result in the following when I do x(0) and x(1):
(String, Array[Int]) does not take parameters
As a last attempt I performed the following, and it results in the error below:
scala> val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))
<console>:33: error: value _1 is not a member of Array[Int]
val au4 = au3.map(x => (x._1, x._2._1._1, x._2._1._2))

._1 or ._2 can be used on tuples, not on arrays.
("a",Array(1,2)) is a tuple, so ._1 is a and ._2 is Array(1,2).
So if you want to get an element of an array you need to use (), as in x._2(0).
But the arrays in au2 have only one element, so x._2(1) will work for au1 and not for au2. You can use Option or Try as below:
import scala.util.Try

val au4 = au3.map(x => (x._1, x._2(0), Try(x._2(1)).getOrElse(0)))
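For reference, a minimal sketch of the Option-based alternative mentioned above (lift returns Some(element) when the index exists and None otherwise; the val name is my own):
// Falls back to 0 when an array has no second element
val au4Opt = au3.map(x => (x._1, x._2(0), x._2.lift(1).getOrElse(0)))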

The result of au3 is not Array[(String, Array[Int])], it is RDD[(String, Array[Int])], so this is how you could write the output to a file:
au3.map( r => (r._1, r._2.map(_.toString).mkString(":")))
.saveAsTextFile("data/result")
You need to map over the array and build a string from it so that it can be written to the file as
(a,1:2)
(b,1:2)
(a,3)
(b,2)
You could write to the file without the brackets as below (Row is used here just for its convenient mkString):
import org.apache.spark.sql.Row

au3.map( r => Row(r._1, r._2.map(_.toString).mkString(":")).mkString(","))
.saveAsTextFile("data/result")
Output:
a,1:2
b,1:2
a,3
b,2
The values are comma (",") separated and the array elements are separated by ":".
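If you just want to print to the console instead of writing a file, collecting to the driver first also works here (fine for data this small):
// collect brings the results to the driver; format each array the same way before printing
au3.collect.foreach { case (k, arr) => println((k, arr.mkString(":"))) }
This prints one line per pair, e.g. (a,1:2).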
Hope this helps!

Related

How to sort results after running code in Spark

I created some lines of Scala code to count the number of words in a text file (in Spark). The result looks like this:
(further,,1)
(Hai,,2)
(excluded,1)
(V.,5)
I wonder whether I can sort the result as follows:
(V.,5)
(Hai,,2)
(excluded,1)
(further,,1)
The code is shown below; thank you for your help!
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
wordCounts.saveAsTextFile("./WordCountTest")
If you want to sort your first dataset by the second field, you can use the following code:
val wordCounts = Seq(
("V.",5),
("Hai",2),
("excluded",1),
("further",1)
)
val wcOrdered = wordCounts.sortBy(_._2).reverse
which yields the following result
wcOrdered: Seq[(String, Int)] = List((V.,5), (Hai,2), (further,1), (excluded,1))
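Equivalently, sorting by the negated count avoids the extra reverse pass (a small variation, not part of the original answer; it works here because the count is numeric, and the val name is mine):
val wcOrderedDesc = wordCounts.sortBy(x => -x._2)   // descending by count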
You can just call wordCounts.sortBy(_._2, false). The sortBy method on RDD takes a boolean as its second argument, which tells whether the result should be sorted ascending (true, the default) or descending (false).
textFile
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.sortBy(_._2, false)
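To persist the descending result, you can reuse the wordCounts RDD from the question (the val name and output path below are placeholders of my own):
val wordCountsSorted = wordCounts.sortBy(_._2, false)    // descending by count
wordCountsSorted.take(10).foreach(println)               // preview the top counts on the driver
wordCountsSorted.saveAsTextFile("./WordCountSorted")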

Spark Sql data from org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]

I have org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] data.
How do I print or get the data?
I have code like:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()
val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The above code is returning org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]
In your first step, you have an extra .toDF(). The correct version is below:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")
In your second step, you missed .rdd (and a closing parenthesis), so the actual second step is
val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))
which has the data type you mentioned in the question:
scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25
To view the groupByData RDD you can simply use foreach:
groupByData.foreach(println)
which would give you
((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
Now your third step filters the rows whose day column is day1 and takes only the values of the grouped RDD:
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The returned data type for this step is
scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27
You can use foreach as above to view the data:
filterData.foreach(println)
which would give you
CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])
You can see that the returned data type is RDD[Iterable[org.apache.spark.sql.Row]], so you can print each value using a map:
filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect
which would give you
(day1,user1,session1,100.0)
(day1,user1,session2,200.0)
If you do only
filterData.map(x => x.map(y => println(y(0), y(3)))).collect
you would get
(day1,100.0)
(day1,200.0)
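If you would rather work with a flat RDD[Row] than with nested iterables, a small sketch (reusing filterData from above) flattens it first:
val flatRows = filterData.flatMap(identity)   // RDD[org.apache.spark.sql.Row]
flatRows.collect.foreach(row => println((row.getString(0), row.getDouble(3))))
which prints the same (day, purchaseTotal) pairs as above.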
I hope the answer is helpful

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line, and rdd2 with the index and the third element of each row. I am new to Scala; can you please help me with how to do this?
This does not work since y is not a Scala Vector but an org.apache.spark.mllib.linalg.Vector, which has no take method:
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern-matching, and then call take/drop on the Array, re-wrapping it with a Vector:
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
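Note that this pattern match assumes the MLlib vector imports are in scope, i.e. something like:
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}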
As you can see, this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as an RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2, otherwise the file will be loaded and parsed twice.
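For completeness, a self-contained sketch of that alternative with the imports it needs and the cache tip applied (same input path as in the question):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
indexedData.cache()   // avoid reading and parsing the file twice for the two RDDs below

val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))   // index + first two values
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))   // index + remaining value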
You can also achieve the above output with the steps below:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RDD1 data:
The index along with the first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RDD2 data:
The index along with the third element of each row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (string values) and the full original row (array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to split the record you need to get it from this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort. This way you do not need to bother with tuples and such.
If you want to sort the rdd using the 7th string in the array, you can just do it directly:
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
just do this:
val rdd4 = rdd3.map(_._2)
I thought you might not be familiar with Scala, so the below should help you understand more:
rdd3.foreach(kv => {
  println(kv._1) // This is the String (the sort key)
  println(kv._2) // This is the Array[String] (the full row)
})
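Putting the suggestions together, a small sketch of the direct route (keeping the question's own c(7) as the sort column; the val name and output path are placeholders of my own):
// Sort directly on the array element, then join each row back into a CSV line and save
val sortedRdd = rdd.sortBy(_(7))
sortedRdd.map(_.mkString(",")).saveAsTextFile("sorted_output")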

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in Spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD, and the nifty underscore notation in reduceByKey(_ + _) for conciseness.
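Put together with the snippet from the question (same sparkContext and data.txt), the fixed pipeline looks like this:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))   // RDD[Map[String, Int]], as in the question
val counts = pairs.flatMap(identity)      // each Map entry becomes a (String, Int) tuple
                  .reduceByKey(_ + _)
counts.collect().foreach(println)         // print the results on the driver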