How to save a two-dimensional array into HDFS in spark? - scala

Something like:
val arr : Array[Array[Double]] = new Array(featureSize)
sc.parallelize(arr, 100).saveAsTextFile(args(1))
Will Spark then store this data type into HDFS correctly?

Array in Scala corresponds exactly to Java arrays - in particular, it's a mutable type, and its toString method returns only the JVM's default representation (element type plus hash code), not the contents. When you save this RDD as a text file, Spark invokes the toString method on each element of the RDD and therefore gives you gibberish. If you want to output the actual elements of each array, you first have to stringify it, for example by applying the mkString(",") method to each array. Example from the Spark shell:
scala> Array(1,2,3).toString
res11: String = [I@31cba915
scala> Array(1,2,3).mkString(",")
res12: String = 1,2,3
For double arrays:
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) )).collect.mkString("\n")
res15: String =
[I@41ff41b0
[I@5d31aba9
[I@67fd140b
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) ).map(_.mkString(","))).collect.mkString("\n")
res16: String =
1,2,3
4,5,6
7,8,9
So, your code should be:
sc.parallelize(arr.map(_.mkString(",")), 100).saveAsTextFile(args(1))
or
sc.parallelize(arr, 100).map(_.mkString(",")).saveAsTextFile(args(1))
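For completeness, a minimal end-to-end sketch (assuming sc is an existing SparkContext and args(1) is a writable HDFS path; the sample values are made up):
val arr: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0))
// Stringify each inner array before saving, so each output line reads "1.0,2.0"
sc.parallelize(arr.map(_.mkString(",")), 100).saveAsTextFile(args(1))
// Without the mkString step the part files would contain lines like [D@...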

Related

How to print array values in Scala? I am getting different values

Code:
object Permutations extends App
{
val ar=Array(1,2,3).combinations(2).foreach(println(_))
}
Output:
[I@378fd1ac
[I@49097b5d
[I@6e2c634b
I am trying to execute this, but I am getting these other values instead.
How do I print the array values in Scala? Can anyone help?
Use mkString
object Permutations extends App {
Array(1,2,3).combinations(2).foreach(x => println(x.mkString(", ")))
}
Scala REPL
scala> Array(1,2,3).combinations(2).foreach(x => println(x.mkString(", ")))
1, 2
1, 3
2, 3
When an array instance is passed directly to println, the array's toString method gets called and results in output like [I@49097b5d. So use mkString to convert the array instance to a string.
Scala REPL
scala> println(Array(1, 2))
[I@2aadeb31
scala> Array(1, 2).mkString
res12: String = 12
scala> Array(1, 2).mkString(" ")
res13: String = 1 2
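mkString also has a three-argument form, mkString(start, sep, end), if you want brackets around the output:
scala> Array(1, 2, 3).mkString("[", ", ", "]")
res14: String = [1, 2, 3]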
You can't print an array directly; if you try, it will print the reference of that array.
You are almost there. Just iterate over the array of arrays and then over each individual array, and display the elements like below:
Array(1,2,3).combinations(2).foreach(_.foreach(println))
Or Just convert each array to string and display like below
Array(1,2,3).combinations(2).foreach(x=>println(x.mkString(" ")))
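For the record, the same output can also be built as a single string without an explicit loop, e.g.:
println(Array(1,2,3).combinations(2).map(_.mkString(", ")).mkString("\n"))
// prints:
// 1, 2
// 1, 3
// 2, 3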
I hope this will help you

Flattening Array[Array[String]] to Array[String]

I would like to flatten my data structure of type Array[Array[String]] to Array[String], where there are some empty Array()s too.
For example:
val test=Array(Array("foo"), Array("bar"), Array(),...)
To be converted to:
Array(foo,bar,"")
I tried:
test.flatMap(x => x.toString())
But this gets broken down into a char array:
Array([f, o, o,..])
What am I doing wrong?
You can do this using
test.flatten
The reason your initial approach didn't work is that x in x => x.toString() is an Array[String], so x.toString() produces the array's default string representation, and flatMap then flattens that resulting String into its individual characters.
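A quick REPL sketch of the difference (the sample values are made up):
scala> val test = Array(Array("foo"), Array("bar"), Array[String]())
test: Array[Array[String]] = Array(Array(foo), Array(bar), Array())
scala> test.flatten   // concatenates the inner arrays; the empty one simply disappears
res0: Array[String] = Array(foo, bar)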

Can Scala Array add new element

I created a Scala Array and added one element, but the array length is still 0 and I cannot get the added element, even though I can see it in the value returned by the append.
scala> val arr = Array[String]()
arr: Array[String] = Array()
scala> arr:+"adf"
res9: Array[String] = Array(adf)
scala> println(arr(0))
java.lang.ArrayIndexOutOfBoundsException: 0
... 33 elided
You declared an Array of size 0, so it cannot hold any elements; Array[String]() is constructor syntax for an empty array.
:+ creates a new Array with the given element appended, so the old array is still empty even after the :+ operation.
You can use the ofDim function to declare an array of a certain size first, and then put elements inside using the arr(index) = value syntax.
Once declared, an array's size does not grow dynamically the way a list's does; appending values creates a new array instance.
Or you can initialize the array during creation itself, using the Array("apple", "ball") syntax.
val size = 1
val arr = Array.ofDim[String](size)
arr(0) = "apple"
Scala REPL
scala> val size = 1
size: Int = 1
scala> val arr = Array.ofDim[String](size)
arr: Array[String] = Array(null)
scala> arr(0) = "apple"
scala> arr(0)
res12: String = apple
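If the collection really needs to grow, the usual alternatives are reassigning a var with :+ or using a mutable ArrayBuffer; a small sketch:
import scala.collection.mutable.ArrayBuffer

var arr = Array[String]()
arr = arr :+ "adf"            // :+ returns a new array; the reassignment keeps it
println(arr(0))               // adf

val buf = ArrayBuffer[String]()
buf += "adf"                  // grows in place
println(buf(0))               // adf
println(buf.toArray.length)   // 1 - convert back to an Array when needed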

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an RDD with about 10 columns/variables. I want to sort the whole RDD on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (the string values) and the full original row (the array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to get the record back, you need to take it out of this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest using rdd.sortBy(_(7)) or something of that sort. That way you do not need to bother with tuples at all.
if you want to sort the rdd using the 7th string in the array, you can just do it directly by
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that this is not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
just do this:
val rdd4 = rdd3.map(_._2)
In case you are not familiar with Scala, the snippet below should help you understand more:
rdd3.map(kv => {
  println(kv._1) // this is the String key
  println(kv._2) // this is the Array[String] value
})
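Putting it together, a minimal sketch of the full flow (the 0-based column index and the output paths are placeholders):
// sort directly on the desired column, stringify each row, then save
rdd.sortBy(_(7))
   .map(_.mkString(","))
   .saveAsTextFile("hdfs:///tmp/sorted-output")    // hypothetical path

// or, with the explicit key/value detour from the question:
rdd.map(c => (c(7), c))      // RDD[(String, Array[String])]
   .sortByKey()
   .map(_._2)                // drop the key again
   .map(_.mkString(","))
   .saveAsTextFile("hdfs:///tmp/sorted-output2")   // hypothetical path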

Making string of 2-d array in scala

So I have the following in Scala:
scala> val example = "hello \tmy \nname \tis \nmaria \tlee".split("\n").map(_.split("\\s+"))
example: Array[Array[String]] = Array(Array(hello, my), Array(name, is), Array(maria, lee))
I want to take each 1-d array and make it into a string, and make an array of these strings (strings should be comma separated). How do I do this?
scala> example.map(_.mkString)
res0: Array[String] = Array(hellomy, nameis, marialee)
To make the strings comma separated:
scala> example.map(_.mkString(","))
res0: Array[String] = Array(hello,my, name,is, maria,lee)
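And if the end goal is one single String with the rows joined by newlines, the two mkString calls can be chained:
scala> example.map(_.mkString(",")).mkString("\n")
res1: String =
hello,my
name,is
maria,lee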