How to separate array or vector column into multiple columns? - scala

Suppose I have a Spark Dataframe generated as:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
It's possible to extract elements at the first index in "Col1" with something like:
val extractFirstInt = udf { (x: Seq[Int], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirstInt($"Col1", lit(1)))
And similarly for the second column "Col2" with e.g.
val extractFirstString = udf { (x: Seq[String], i: Int) => x(i) }
df.withColumn("Col2_1", extractFirstString($"Col2", lit(1)))
But the code duplication is a little ugly -- I need a separate UDF for each underlying element type.
Is there a way to write a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset? E.g. I'd like to be able to write something like (pseudocode; with generic T)
val extractFirst = udf { (x: Seq[T], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirst($"Col1", lit(1)))
Where somehow the type T would just be automagically inferred by Spark / the Scala compiler (perhaps using reflection if appropriate).
Bonus points if you're aware of a solution that works both with array-columns and Spark's own DenseVector / SparseVector types. The main thing I'd like to avoid (if at all possible) is the requirement of defining a separate UDF for each underlying array-element type I want to handle.

Perhaps frameless could be a solution?
Since manipulating datasets requires an Encoder for a given type, you have to define the type upfront so Spark SQL can create one for you. I think a Scala macro to generate all sorts of Encoder-supported types would make sense here.
As of now, I'd define a generic method and a UDF per type (which is against your wish to find a way to have "a generic UDF, that automatically infers the type of the underlying Array in the column of the Spark Dataset").
def myExtract[T](x: Seq[T], i: Int) = x(i)
// define UDF for extracting strings
val extractString = udf(myExtract[String] _)
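For the Int column you would then declare a second UDF against the same generic method; a minimal sketch (the name extractInt is just illustrative):
// hypothetical counterpart for Int arrays, reusing the same generic method
val extractInt = udf(myExtract[Int] _)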
Use as follows:
val df = Seq(
  (Array(1, 2, 3), Array("a", "b", "c")),
  (Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
scala> df.withColumn("Col1_1", extractString($"Col2", lit(1))).show
+---------+---------+------+
| Col1| Col2|Col1_1|
+---------+---------+------+
|[1, 2, 3]|[a, b, c]| b|
|[1, 2, 3]|[a, b, c]| b|
+---------+---------+------+
You could explore Dataset (not DataFrame, i.e. Dataset[Row]) instead. That would give you all the type machinery (and perhaps you could avoid any macro development).
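A rough, untested sketch of that typed-Dataset route (Cols is an assumed case class and spark the usual SparkSession handle), where plain Scala indexing replaces the UDFs entirely:
// hypothetical typed-Dataset sketch; element types are known at compile time
case class Cols(Col1: Seq[Int], Col2: Seq[String])
import spark.implicits._
val ds = df.as[Cols]
val extracted = ds.map(c => (c.Col1(1), c.Col2(1))).toDF("Col1_1", "Col2_1")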

As per advice from @zero323, I centered on an implementation of the following form:
def extractFirst(df: DataFrame, column: String, into: String) = {
  // extract column of interest
  val col = df.apply(column)
  // figure out the type name for this column
  val schema = df.schema
  val typeName = schema.apply(schema.fieldIndex(column)).dataType.typeName
  // delegate based on column type
  typeName match {
    case "array" => df.withColumn(into, col.getItem(0))
    case "vector" => {
      // construct a udf to extract first element
      // (could almost certainly do better here,
      // but this demonstrates the strategy regardless)
      val extractor = udf {
        (x: Any) => {
          val el = x.getClass.getDeclaredMethod("toArray").invoke(x)
          val array = el.asInstanceOf[Array[Double]]
          array(0)
        }
      }
      df.withColumn(into, extractor(col))
    }
    case _ => throw new IllegalArgumentException("unexpected type '" + typeName + "'")
  }
}
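A quick sanity check against the question's sample frame might look like the following (assembled and features are just hypothetical names for a frame with a vector column):
// array column from the question's df: takes the getItem branch
extractFirst(df, "Col1", "Col1_first").show()
// a vector column (e.g. one produced by VectorAssembler) would take the reflection branch:
// extractFirst(assembled, "features", "features_first").show()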

Related

Scala MaxBy's Tuple

I have a Seq of Tuples, which represents a word count: (count, word)
For Example:
(5, "Hello")
(3, "World")
My goal is to find the word with the highest count. In the case of a tie between two words, I'll pick the word that appears first in the dictionary (i.e. alphabetical order).
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
val commonWord = wordCounts.maxBy(_._1)
print(commonWord)
Now, this code segment will return (10, "World"), because this is the first tuple that has the maximum count.
I could use .sortBy and then .head, but I want to be more efficient.
My question is: is there any way to change the Ordering used by maxBy in order to achieve the desired outcome?
Note: I prefer not to use .sortBy, because it's O(n*log(n)) rather than O(n). I know that I can use .reduce, but I want to check whether I can adjust .maxBy.
Scala Version 2.13
Functions like max, min, maxBy and minBy use an implicit Ordering that defines the comparison between two items. There is a default Ordering for Tuple2, but the problem is that it applies the same direction of comparison to both elements, while in your case you need greater-than for _._1 and less-than for _._2. You can easily solve this by negating the first element, so this does the trick:
wordCounts.minBy(x => (-x._1, x._2))
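For the sample data above, the inverted key should break the tie in favour of the alphabetically smaller word:
wordCounts.minBy(x => (-x._1, x._2))  // expected: (10,Hello)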
You can create your own Ordering by using orElse() to combine two Orderings together:
// can't use .orElseBy() because of the .reverse, so this is a bit verbose
val countThenAlphaOrdering =
  Ordering.by[(Int, String), Int](_._1)
    .orElse(Ordering.by[(Int, String), String](_._2).reverse)
Or you can use Ordering.Tuple2 in this case:
val countThenAlphaOrdering = Ordering.Tuple2(Ordering[Int], Ordering[String].reverse)
Then
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello"),
)
wordCounts.max(countThenAlphaOrdering) // (10,Hello): (Int, String)
implicit val WordSorter: Ordering[(Int, String)] = new Ordering[(Int, String)] {
  override def compare(x: (Int, String), y: (Int, String)) = {
    val iComp = implicitly[Ordering[Int]].compare(x._1, y._1)
    if (iComp == 0)
      -implicitly[Ordering[String]].compare(x._2, y._2)
    else
      iComp
  }
}
val seq = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
def main(args: Array[String]): Unit = println(seq.max)
You can create your own Ordering[(Int, String)] implicit whose compare method returns the comparison of the numbers in the tuples if it's not zero, and the negated comparison of the strings if the int comparison is zero. It uses the implicitly defined Ordering[Int] and Ordering[String] for modularity, in case you want to change the behaviour later on. If you don't want to use those, you can just replace
implicitly[Ordering[Int]].compare(x._1, y._1) with x._1.compareTo(y._1) and so on.
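With that implicit in scope, the max call above should break the tie the same way for the sample data:
seq.max  // expected: (10,Hello)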

Concatenate String to each element of a List in a Spark dataframe with Scala

I have two columns in a Spark dataframe: one is a String, and the other is a List of Strings. How do I create a new column that is the concatenation of the String in column 1 with each element of the list in column 2, resulting in another list in column 3?
For example, if column 1 is "a", and column 2 is ["A","B"], I'd like the output in column 3 of the dataframe to be ["aA","aB"].
So far, I have:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
val multiplierUDF = udf(multiplier)
val df2 = df1
  .withColumn("col3", multiplierUDF(df1("col1"), df1("col2")))
which gives aWrappedArray(A,B)
I suggest you try your udf functions outside of Spark and get them working on local variables first. If you do:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
multiplier("a", Seq("A", "B"))
// output
res1: String = aList(A, B)
You'll see multiplier doesn't do what you want.
I think you're looking for:
val multiplier = (x1: String, x2: Seq[String]) => x2.map(x1+_)
multiplier("a", Seq("A", "B"))
//output
res2: Seq[String] = List(aA, aB)
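Wired back into Spark, the corrected function should produce the list you are after; a sketch, reusing the question's df1:
val multiplier = (x1: String, x2: Seq[String]) => x2.map(x1 + _)
val multiplierUDF = udf(multiplier)
val df2 = df1.withColumn("col3", multiplierUDF(df1("col1"), df1("col2")))
// col3 should now hold e.g. [aA, aB] rather than a single concatenated string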
I think you should redefine your UDF to something similar to my function append below:
val a = Seq("A", "B")
val p = "a"
def append(init: String, tails: Seq[String]) = tails.map(x => init + x)
append(p, a)
//res1: Seq[String] = List(aA, aB)

Spark RDD tuple transformation

I'm trying to transform an RDD of tuples of Strings of this format:
(("abc","xyz","123","2016-02-26T18:31:56"),"15") TO
(("abc","xyz","123"),"2016-02-26T18:31:56","15")
Basically, separating out the timestamp string as a separate tuple element. I tried the following, but it's still not clean and correct.
val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))
However, it results in
(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")
The expected output I'm looking for is
(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")
This way I can access the elements using r._1, r._2 (the timestamp string) and r._3 in a separate map operation.
Any hints/pointers will be greatly appreciated.
Vector.toString will include the String 'Vector' in its result. Instead, use Vector.mkString(",").
Example:
scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> xs.toString
res25: String = Vector(1, 2, 3)
scala> xs.mkString
res26: String = 123
scala> xs.mkString(",")
res27: String = 1,2,3
However, if you want to be able to access (abc,xyz,123) as a Tuple and not as a string, you could also do the following:
val res = rdd.map {
  case ((a: String, b: String, c: String, ts: String), d: String) => ((a, b, c), ts, d)
}
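The result is then a plain three-element tuple per record, so a later map can address the pieces positionally, roughly:
// r._1 is the ("abc","xyz","123") tuple, r._2 the timestamp, r._3 the count
res.map(r => (r._2, r._3))  // e.g. ("2016-02-26T18:31:56","15")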

On Spark's rdd.map(_.swap)

I'm new to both Scala and Spark. Could anyone explain the meaning of
rdd.map(_.swap)
? If I look in the Scala/Spark API, I cannot find swap as a method in the RDD class.
swap is a method on Scala Tuples. It swaps the first and second elements of a Tuple2 (or pair) with each other. For example:
scala> val pair = ("a","b")
pair: (String, String) = (a,b)
scala> val swapped = pair.swap
swapped: (String, String) = (b,a)
RDD's map function applies a given function to each element of the RDD. In this case, the function to be applied to each element is simply
_.swap
The underscore in this case is shorthand in Scala when writing anonymous functions, and it pertains to the parameter passed in to your function without naming it. So the above snippet can be rewritten into something like:
rdd.map{ pair => pair.swap }
So the code snippet you posted swaps the first and second elements of the tuple/pair in each row of the RDD.
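As a small illustration (assuming a SparkContext named sc):
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
rdd.map(_.swap).collect()  // Array((1,a), (2,b))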
This would only be available if rdd is of type RDD[Tuple2[T1, T2]]; swap is a method on the Tuple2, not on the RDD itself.
In Python it works as follows:
rdd.map(lambda x: (x[1], x[0]))
This will swap (a, b) to (b, a) in the key, value pair.
For tuples which have been created using Spark, use this lambda:
RDD map1 : ("a", 1), ("b", 2), ("c", 3)...
val map2 = map1.map(a => (a._2, a._1))
This will return the RDD
RDD map2 : (1, "a"), (2, "b"), (3, "c")...

How can I get a sum of arrays of tuples in scala

I have a simple array of tuples
val arr = Array((1,2), (3,4),(5,6),(7,8),(9,10))
I wish to get (1+3+5+7+9, 2+4+6+8+10) tuple as the answer
What is the best way to get the sum as a tuple, similar to regular arrays? I tried
val res = arr.foldLeft(0,0)(_ + _)
This does not work.
Sorry about not giving the context: I was using it in Scalding with Algebird. Algebird allows sums of tuples and I assumed this would work; that was my mistake.
There is no such thing as tuple addition, so that can't work. You would have to operate on each element of the tuple:
val res = arr.foldLeft((0, 0)) { case (sum, next) => (sum._1 + next._1, sum._2 + next._2) }
res: (Int, Int) = (25,30)
This should work nicely:
arr.foldLeft((0,0)){ case ((a0,b0),(a1,b1)) => (a0+a1,b0+b1) }
Addition isn't defined for tuples.
Use Scalaz, which provides a Semigroup instance for tuples, allowing you to use the append operator |+|:
import scalaz._
import Scalaz._
arr.fold((0,0))(_ |+| _)
Yet another alternative
val (a, b) = arr.unzip
//> a : Array[Int] = Array(1, 3, 5, 7, 9)
//| b : Array[Int] = Array(2, 4, 6, 8, 10)
(a.sum, b.sum)
//> res0: (Int, Int) = (25,30)