Access Scala Map from DataFrame without using UDFs - Scala

I have a Spark (version 1.6) DataFrame, and I would like to add a column with a value contained in a Scala Map. This is my simplified code:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val df2 = df.withColumn("newVal", map(col("key")))
This code doesn't work, and I obviously receive the following error, because the map expects a String key while it receives a Column:
found : org.apache.spark.sql.Column
required: String
The only way I could do that is by using a UDF:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val myUdf = udf { value: String => map(value) }
val df2 = df.withColumn("newVal", myUdf($"key"))
I want to avoid the use of UDFs if possible.
Are there any other solutions using just the DataFrame API (I would also like to avoid converting it to an RDD)?

TL;DR Just use udf.
With the version you use (Spark 1.6, according to your comment) there is no solution which doesn't require a udf or a map over the RDD / Dataset.
In later versions you can:
use the map function (2.0 or later) to create a literal MapType column:
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.lit

val map = functions.map(
  Map("VAL1" -> 1, "VAL2" -> 2)
    .flatMap { case (k, v) => Seq(k, v) }
    .map(lit)
    .toSeq: _*
)

map($"key")
use typedLit (2.2 or later) to create a literal MapType column:
val map = functions.typedLit(Map("VAL1" -> 1, "VAL2" -> 2))
map($"key")
and use these directly.
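For context, here is a minimal end-to-end sketch of the typedLit variant on a made-up DataFrame with a key column (the sample data and SparkSession setup are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical input with a "key" column, as in the question
val df = Seq("VAL1", "VAL2").toDF("key")

// a literal map column; applying it to another column performs the lookup
val mapCol = typedLit(Map("VAL1" -> 1, "VAL2" -> 2))

df.withColumn("newVal", mapCol($"key")).show()
// expected: newVal is 1 for VAL1 and 2 for VAL2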
Reference How to add a constant column in a Spark DataFrame?

You could convert the Map to a DataFrame and use a join between it and your existing DataFrame. Since the Map DataFrame would be very small, it should be a broadcast join, avoiding the need for a shuffle phase.
Letting Spark know to use a broadcast join is described in this answer: DataFrame join optimization - Broadcast Hash Join
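A rough sketch of that join-based approach in Spark 1.6 style (the column names and the sqlContext in scope are assumptions for illustration):

// assuming a SQLContext is already in scope (e.g. in spark-shell)
import sqlContext.implicits._
import org.apache.spark.sql.functions.broadcast

// turn the Scala Map into a small lookup DataFrame
val lookupDF = Seq(("VAL1", 1), ("VAL2", 2)).toDF("key", "newVal")

// broadcast the small side so the join avoids a shuffle
val df2 = df.join(broadcast(lookupDF), Seq("key"))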

Related

Add MapType column to existing DataFrame

I have a probably easy and quick question regarding DataFrames in Spark with Scala.
I have an existing Spark DataFrame (using Scala 2.10.5 and Spark 1.6.3) and would like to add a new column with an ArrayType or MapType, but I don't know how to achieve that. I would rather not create multiple columns with the 'single' values, but store them in one column. It would shorten my code and make it more robust to changes.
import org.apache.spark.sql.types.MapType
...
// DataFrame initial creation
val df = ...
// adding new columns
val df_new = df
  .withColumn("new_col1", lit("something_to_add")) // add a literal
  .withColumn("new_col2", MapType("key1" -> "val1", "key2" -> "val2")) // ???
You could try something like
import org.apache.spark.sql.functions.{lit, typedLit}

val df_new = df
  .withColumn("new_col1", lit("something_to_add")) // add a literal
  .withColumn("new_col2", typedLit[Map[String, String]](Map("key1" -> "val1", "key2" -> "val2"))) // add a map literal

Use Map to replace column values in Spark

I have to map a list of columns to another column in a Spark dataset: think something like this
val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
  col("mov"),
  translationMap(col("mov"))
)
but this piece of code spits out the following error:
key not found: movs
java.util.NoSuchElementException: key not found: movs
Is there a way to perform such a translation without concatenating hundreds of whens? Consider that translationMap could have lots of key-value pairs.
Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
  col("mov"),
  translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo|                                    bar|
|baz|                                    bab|
+---+---------------------------------------+
You cannot refer to a Scala collection declared on the driver like this inside a distributed DataFrame. An alternative would be to use a UDF, which will not be efficient if you have a large dataset, since UDFs are not optimized by Spark.
val translationMap = Map("foo" -> "bar", "baz" -> "bab")
val getTranslationValue = udf((x: String) => translationMap.getOrElse(x, null.asInstanceOf[String]))
df.select(col("mov"), getTranslationValue($"mov").as("value")).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets taking mov as the key.
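A minimal sketch of that join-based alternative (assuming a SparkSession named spark is in scope; the column names are assumptions for illustration):

import spark.implicits._

// build a small lookup DataFrame from the Map
val translationDF = Map("foo" -> "bar", "baz" -> "bab").toSeq.toDF("mov", "value")

df.join(translationDF, Seq("mov"), "left_outer").show()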

Can I recursively apply transformations to a Spark dataframe in scala?

Noodling around with Spark, using union to build up a suitably large test dataset. This works OK:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
df.union(df).union(df).count()
But I'd like to do something like this:
val df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10) {
  df = df.union(df)
}
that barfs with error
<console>:27: error: reassignment to val
df = df.union(df)
^
I know this technique would work using Python, but this is my first time using Scala so I'm unsure of the syntax.
How can I recursively union a dataframe with itself n times?
If you use val on the dataset it becomes an immutable reference. That means you can't do any reassignment. If you change your definition to var df, your code should work.
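For illustration, a minimal sketch of the var-based version (using the JSON path from the question):

var df = spark.read.json("/opt/spark/examples/src/main/resources/people.json")
for (a <- 1 until 10) {
  df = df.union(df) // reassignment compiles because df is a var
}
df.count()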
A functional approach without mutable data is:
val df = List(1,2,3,4,5).toDF
val bigDf = ( for (a <- 1 until 10) yield df ) reduce (_ union _)
The for comprehension will create an IndexedSeq of the specified length containing your DataFrame, and the reduce function will take the first DataFrame, union it with the second, and continue using the result.
Even shorter without the for loop:
val df = List(1,2,3,4,5).toDF
val bigDf = 1 until 10 map (_ => df) reduce (_ union _)
You could also do this with tail recursion using an arbitrary range:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

@tailrec
def bigUnion(rng: Range, df: DataFrame): DataFrame = {
  if (rng.isEmpty) df
  else bigUnion(rng.tail, df.union(df))
}

val resultingBigDF = bigUnion(1.to(10), myDataFrame)
Please note this is untested code based on similar things I had done.

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column, then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) //trying to get
Trying to find a good way of doing this, I know that if it were a string I could do something like
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0, it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how, internally, Spark converts your head & tail to a list of Columns to call select again.
So, in that case, if you want clear code I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
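A short, self-contained sketch tying this back to the explode use case from the question (the sample data, column names, and getColumns helper are assumptions for illustration):

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical input: an id plus an array column to explode
val tempDf = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "items")

// hypothetical helper returning the columns of interest plus the exploded column
def getColumns(df: DataFrame): List[Column] =
  List(col("id"), explode(col("items")).as("item"))

val columns = getColumns(tempDf)
val result = tempDf.select(columns: _*) // varargs expansion does the trick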

Access joined RDD fields in a readable way

I joined 2 RDDs, and now when I try to access the new RDD's fields I need to treat them as tuples. This leads to code that is not so readable. I tried to use type in order to create some aliases; however, it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL-style queries.
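A rough sketch of that DataFrame-based option (the column names and sample data are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// give the tuple fields meaningful column names up front
val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "payment")

people.join(payments, Seq("name"))
  .groupBy("name")
  .sum("payment")
  .filter($"sum(payment)" > 0)
  .show()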