Use Map to replace column values in Spark - scala

I have to map a list of columns to another column in a Spark dataset; think of something like this:
val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
  col("mov"),
  translationMap(col("mov"))
)
but this piece of code throws the following error:
key not found: movs
java.util.NoSuchElementException: key not found: movs
Is there a way to perform such a translation without chaining hundreds of whens? Keep in mind that translationMap could have many key-value pairs.

Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
  col("mov"),
  translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo| bar|
|baz| bab|
+---+---------------------------------------+
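If you are on Spark 2.4 or later, the same lookup can also be written with the element_at function. This is just an equivalent sketch (the "translated" alias is illustrative); the apply syntax above works the same way:

import org.apache.spark.sql.functions.{col, element_at}

df.select(
  col("mov"),
  // element_at(map, key) returns the value for the given key, or null if it is absent
  element_at(translationMap, col("mov")).as("translated")
).show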

You cannot refer to a Scala collection declared on the driver like this inside a distributed DataFrame. An alternative would be to use a UDF, which will not be efficient if you have a large dataset, since UDFs are not optimized by Spark.
import org.apache.spark.sql.functions.{col, udf}

val translationMap = Map("foo" -> "bar", "baz" -> "bab")
val getTranslationValue = udf((x: String) => translationMap.getOrElse(x, null.asInstanceOf[String]))

df.select(col("mov"), getTranslationValue($"mov").as("value")).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets using mov as the key.
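A minimal sketch of that join approach, assuming the same mov column and an illustrative value column for the translated result; broadcasting the small lookup DataFrame keeps the join shuffle-free:

import org.apache.spark.sql.functions.broadcast

// The lookup table as a small DataFrame
val translationDF = Seq("foo" -> "bar", "baz" -> "bab").toDF("mov", "value")

// Left join so rows without a translation are kept with a null value
df.join(broadcast(translationDF), Seq("mov"), "left").show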

Related

Expand expression in Spark Scala aggregation

I'm trying to convert simple aggregation code from PySpark to Scala.
The dataframes:
# PySpark
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([10, 100],),
     ([20, 200],)],
    ['vals'])
// Scala
val df = Seq(
  (Seq(10, 100)),
  (Seq(20, 200)),
).toDF("vals")
Aggregation expansion - OK in PySpark:
df2 = df.agg(
    *[F.sum(F.col("vals")[i]).alias(f"col{i}") for i in range(2)]
)
df2.show()
# +----+----+
# |col0|col1|
# +----+----+
# | 30| 300|
# +----+----+
But in Scala...
val df2 = df.agg(
  (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)
(0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
^
On line 2: error: no `: _*` annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
The syntax seems almost the same as this select, which works well:
val df2 = df.select(
  (0 until 2).map(i => $"vals"(i).alias(s"col$i")): _*
)
Does expression expansion work in Scala Spark aggregations? How?
I don't fully understand why the compiler behaves this way, but it seems it is not unpacking your Seq[Column] into vararg parameters.
As @RvdV mentioned in his post, the signature of the method is
def agg(expr: Column, exprs: Column*): DataFrame
so a temporary solution is to unpack it manually, like:
val seq = Seq(0, 1).map(i => sum($"vals"(i)).alias(s"col$i"))
val df2 = df.agg(seq(0), seq(1))
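If the number of columns is not fixed, a common workaround (a sketch, not part of the original answer) is to split the sequence into its head and tail so that it matches the agg(expr: Column, exprs: Column*) signature:

// Build all aggregation columns, then pass head + tail to match (Column, Column*)
val aggs = (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i"))
val df2 = df.agg(aggs.head, aggs.tail: _*)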
If you look at the documentation of Dataset.agg, you see that it first has a fixed parameter and then a list of unspecified length:
def agg(expr: Column, exprs: Column*): DataFrame
So you first need some other aggregation, and then for the second argument you can do the list expansion. Something like
val df2 = df.agg(
  first($"vals"), (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)
or any other single aggregation in front of the list should work.
I don't know why it is like this; maybe it's a Scala limitation so that you can't pass an empty list and end up with no aggregation at all?

How to split column in Spark Dataframe to multiple columns

In my case, how do I split a column containing a StringType with the format '1-1235.0 2-1248.0 3-7895.2' into another column of ArrayType containing ['1','2','3']?
This is relatively simple with a UDF:
import org.apache.spark.sql.functions.udf

val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
.show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy solution with Spark built-ins (other than using split in combination with explode etc. and then re-aggregating).
You can split the string into an array using the split function and then transform that array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is a native approach; no UDF is applied.
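On Spark 3.0 and later the same idea can be expressed with the Scala transform function instead of a SQL expression; this is a sketch of that variant, not part of the original answer:

import org.apache.spark.sql.functions.{split, substring_index, transform}

df.withColumn("array", split($"stringCol", " "))
  // keep only the part of each element before the first '-'
  .withColumn("result", transform($"array", x => substring_index(x, "-", 1)))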

How to convert spark dataframe array to tuple

How can I convert a Spark dataframe array column to tuples of two in Scala?
I tried to explode the array and create a new column with the help of the lead function, so that I could use two columns to create the tuple.
In order to use the lead function, I need a column to sort by, and I don't have any.
Please suggest the best way to solve this.
Note: I need to retain the same order in the array.
For example, the input looks something like this:
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
Expected output (tuples of length 2):
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]
You can create a UDF that builds the list of tuples using Scala's sliding function:
import org.apache.spark.sql.functions.udf

val df = Seq(
  ("id1", List("text1", "text2", "text3", "text4")),
  ("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")

val sliding = udf((value: Seq[String]) => {
  value.toList.sliding(2).map { case List(a, b) => (a, b) }.toList
})

val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
Hope this helps!
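For Spark 2.4 and later, a UDF-free alternative (a sketch using the same column names, not part of the original answer) is to zip the array with itself shifted by one element; note that this yields an array of structs rather than Scala tuples:

import org.apache.spark.sql.functions.{arrays_zip, expr}

val noUdf = df.withColumn(
  "text",
  arrays_zip(
    expr("slice(text, 1, size(text) - 1)"),  // all elements except the last
    expr("slice(text, 2, size(text) - 1)")   // all elements except the first
  )
)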

access scala map from dataframe without using UDFs

I have a Spark (version 1.6) DataFrame, and I would like to add a column with a value contained in a Scala Map. This is my simplified code:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val df2 = df.withColumn("newVal", map(col("key")))
This code doesn't work, and I obviously receive the following error, because the map expects a String key but receives a Column:
found : org.apache.spark.sql.Column
required: String
The only way I could do that is by using a UDF:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val myUdf = udf{ value:String => map(value)}
val df2 = df.withColumn("newVal", myUdf($"key"))
I want to avoid the use of UDFs if possible.
Are there any other solutions available using just the DataFrame API (I would also like to avoid converting it to an RDD)?
TL;DR Just use udf.
With the version you use (Spark 1.6 according to your comment) there is no solution which doesn't require udf or map over RDD / Dataset.
In later versions you can:
use the map function (2.0 or later) to create a literal MapType column:
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.lit

val map = functions.map(
  Map("VAL1" -> 1, "VAL2" -> 2)
    .flatMap { case (k, v) => Seq(k, v) }
    .map(lit)
    .toSeq: _*
)
map($"key")
use typedLit (2.2 or later) to create a literal MapType column:
val map = functions.typedLit(Map("VAL1" -> 1, "VAL2" -> 2))
map($"key")
and use these directly.
Reference How to add a constant column in a Spark DataFrame?
You could convert the Map to a DataFrame and use a join between it and your existing DataFrame. Since the Map DataFrame would be very small, it should be a broadcast join, avoiding the need for a shuffle phase.
Letting Spark know to use a broadcast join is described in this answer: DataFrame join optimization - Broadcast Hash Join
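A rough sketch of that approach, reusing the key column and newVal name from the question (the rest is illustrative):

import org.apache.spark.sql.functions.broadcast

// Turn the lookup Map into a tiny DataFrame
val mapDF = Seq(("VAL1", 1), ("VAL2", 2)).toDF("key", "newVal")

// The broadcast hint tells Spark to replicate the small side instead of shuffling
val df2 = df.join(broadcast(mapDF), Seq("key"), "left")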

apache spark groupBy pivot function

I am new to Spark and am using Spark 1.6.1. I am using the pivot function to create a new column based on an integer value. Say I have a CSV file like this:
year,winds
1990,50
1990,55
1990,58
1991,45
1991,42
1991,58
I am loading the csv file like this:
var df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/sample.csv")
I want to aggregate the winds column, filtering for winds greater than 55, so that I get an output file like this:
year, majorwinds
1990,2
1991,1
I am using the code below:
val df2=df.groupBy("major").pivot("winds").agg(>55)->"count")
But I get this error
error: expected but integer literal found
What is the correct syntax here? Thanks in advance
In your case, if you just want output like:
+----+----------+
|year|majorwinds|
+----+----------+
|1990| 2|
|1991| 1|
+----+----------+
It's not necessary to use pivot.
You can achieve this using filter, groupBy and count:
df.filter($"winds" >= 55)
.groupBy($"year")
.count()
.withColumnRenamed("count", "majorwinds")
.show()
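If you prefer a single aggregation expression instead of renaming the count column, a conditional count works too; this is a sketch, not part of the original answer, and relies on count ignoring the nulls produced by a when without an otherwise:

import org.apache.spark.sql.functions.{count, when}

df.groupBy($"year")
  .agg(count(when($"winds" >= 55, true)).as("majorwinds"))
  .show()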
Use this generic function to do the pivot (note that dynamicRow and getSchema are helper functions not shown in this answer):
def transpose(sqlCxt: SQLContext, df: DataFrame, compositeId: Vector[String], pair: (String, String), distinctCols: Array[Any]): DataFrame = {
  val rdd = df.map { row =>
    (compositeId.collect { case id => row.getAs(id).asInstanceOf[Any] },
      scala.collection.mutable.Map(row.getAs(pair._1).asInstanceOf[Any] -> row.getAs(pair._2).asInstanceOf[Any]))
  }
  val pairRdd = rdd.reduceByKey(_ ++ _)
  val rowRdd = pairRdd.map(r => dynamicRow(r, distinctCols))
  sqlCxt.createDataFrame(rowRdd, getSchema(compositeId ++ distinctCols))
}