Flattening Array[Array[String]] to Array[String] - scala

I would like to flatten my data structure of type Array[Array[String]] to Array[String], where some of the inner arrays are empty (Array()).
For example:
val test = Array(Array("foo"), Array("bar"), Array(), ...)
To be converted to:
Array(foo,bar,"")
I tried :
test.flatMap(x => x.toString())
But this breaks each element down into a character array:
Array([f, o, o,..])
What am I doing wrong?

You can do this using
test.flatten
The reason your initial approach didn't work is that x in x => x.toString() is an Array[String], so each array becomes its string representation; flatMap then flattens that string into its individual characters.
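A quick REPL sketch of the fix (note that flatten simply concatenates, so an empty Array() contributes nothing rather than an empty string):
scala> val test = Array(Array("foo"), Array("bar"), Array[String]())
test: Array[Array[String]] = Array(Array(foo), Array(bar), Array())
scala> test.flatten
res0: Array[String] = Array(foo, bar)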

Related

Spark: format of an rdd to convert to dataframe

Assuming I am having the following rdd:
val rdd = sc.parallelize(Seq(('a'.toString,1.1,Array(1.1,2.2),0),
('b'.toString,1.5,Array(1.4,4.2),3),
('d'.toString,2.1,Array(3.3,7.4),4)))
rdd: org.apache.spark.rdd.RDD[(String, Double, Array[Double], Int)]
And I want to write the output in csv format using .write.format("com.databricks.spark.csv"), which takes a DataFrame.
So first I need to convert the current schema to RDD[(String, String, String, String, String)] and then convert it to a DataFrame. I tried the following:
rdd.map { case (a, b, c, d) => (a, b, c.mkString(","), d) }
but this outputs:
RDD[(String, Double, String, Int)]
Any idea how to do it?
UPDATE
To work with tuples, you have to know how many elements they will hold and define the use case yourself. Hence, to work with a variable number of elements, you'll probably need to use a collection instead.
For your use case, something like this can work:
rdd.map { case (a, b, c, d) => a +: (b +: c) :+ d }.map(_.mkString(","))
This results in an RDD[String], one entry per line of the csv file.
You're prepending and appending the other elements to the array c so that everything ends up in a single collection.
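If you specifically need a DataFrame for spark-csv, a minimal sketch along the same lines (assuming a Spark 1.x SQLContext named sqlContext; the column names and output path below are hypothetical):
import sqlContext.implicits._

val df = rdd
  .map { case (a, b, c, d) => (a, b.toString, c.mkString(","), d.toString) }
  .toDF("key", "score", "features", "label") // hypothetical column names

df.write.format("com.databricks.spark.csv").save("/tmp/output") // hypothetical path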

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather, the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (string values) and the full original row (an array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each row to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now if you want to split the record, you need to get it out of this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2).
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort; that way you don't need to bother with tuples at all.
If you want to sort the rdd using the 7th string in the array, you can just do it directly by
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that that's not how you access an element of an Array: to access the 7th element of an array arr, you write arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd"))
just do this:
val rdd4 = rdd3.map(_._2)
In case you're not familiar with Scala, the following should help you understand more:
rdd3.foreach { kv =>
  println(kv._1) // kv._1 is the String sort key (column 7)
  println(kv._2.mkString(",")) // kv._2 is the original row, an Array[String]
}
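To finish the original task of splitting off and saving the sorted rows, a minimal sketch (the output path is hypothetical):
val rdd4 = rdd3.map(_._2) // drop the sort key, keep the original rows
rdd4.map(_.mkString(",")).saveAsTextFile("/tmp/sorted") // hypothetical output path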

How to save a two-dimensional array into HDFS in spark?

Something like:
val arr : Array[Array[Double]] = new Array(featureSize)
sc.parallelize(arr, 100).saveAsTextFile(args(1))
Then Spark stores the data type (rather than the values) into HDFS.
Array in Scala corresponds exactly to a Java array; in particular, it's a mutable type, and its toString method returns the JVM's default representation (a type tag plus a hash code) rather than the contents. When you save this RDD as a text file, Spark invokes toString on each element of the RDD, which is why you get gibberish. If you want to output the actual elements of the array, you first have to stringify it, for example by applying mkString(",") to each array. Example from the Spark shell:
scala> Array(1,2,3).toString
res11: String = [I@31cba915
scala> Array(1,2,3).mkString(",")
res12: String = 1,2,3
For nested (two-dimensional) arrays:
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) )).collect.mkString("\n")
res15: String =
[I@41ff41b0
[I@5d31aba9
[I@67fd140b
scala> sc.parallelize(Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) ).map(_.mkString(","))).collect.mkString("\n")
res16: String =
1,2,3
4,5,6
7,8,9
So, your code should be:
sc.parallelize(arr.map(_.mkString(",")), 100).saveAsTextFile(args(1))
or
sc.parallelize(arr, 100).map(_.mkString(",")).saveAsTextFile(args(1))

scala - one line convert string split to vals

I saw the following answer: Scala split string to tuple, but there the OP asks for a string to a List. I would like to take a string, split it on some character, and convert it to a tuple so the parts can be saved as vals:
val (a,b,c) = "A.B.C".split(".").<toTupleMagic>
Is this possible? This would be a conversion from an Array[String] to a Tuple3 of (String,String,String)
It is unnecessary:
val Array(a, b, c) = "A.B.C".split('.')
Note that I converted the parameter to split from a String to a Char: if you pass a String, it is treated as a regex pattern, and . matches any character, so you'd get an empty array back (every token between matches is empty, and trailing empty strings are dropped).
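A quick REPL check of both points:
scala> "A.B.C".split(".")
res0: Array[String] = Array()

scala> val Array(a, b, c) = "A.B.C".split('.')
a: String = A
b: String = B
c: String = C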
If you truly want to convert it to a tuple, you can use Shapeless.

Making string of 2-d array in scala

So I have the following in Scala:
scala> val example = "hello \tmy \nname \tis \nmaria \tlee".split("\n").map(_.split("\\s+"))
example: Array[Array[String]] = Array(Array(hello, my), Array(name, is), Array(maria, lee))
I want to take each 1-d array and make it into a string, and make an array of these strings (strings should be comma separated). How do I do this?
scala> example.map(_.mkString)
res0: Array[String] = Array(hellomy, nameis, marialee)
To make the strings comma separated:
scala> example.map(_.mkString(","))
res1: Array[String] = Array(hello,my, name,is, maria,lee)
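And if you then want a single newline-separated string rather than an array:
scala> example.map(_.mkString(",")).mkString("\n")
res2: String =
hello,my
name,is
maria,lee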