Select column based on Map - Scala

I have a dataframe df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map:
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!

You can iterate over the map typemap and use the when function to build case/when expressions depending on the value of the type column:
val recordCol = typemap.map{case (k,v) => when(col("type") === k, col(v))}.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
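For completeness, a minimal self-contained sketch of this approach (the SparkSession setup and the sample rows are assumptions, not taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, when}

val spark = SparkSession.builder().master("local[*]").appName("record-by-type").getOrCreate()
import spark.implicits._

// Sample rows matching the schema in the question (the values are made up).
val df = Seq(
  ("sales", Seq("s1", "s2"), Seq.empty[String]),
  ("marketing", Seq.empty[String], Seq("m1"))
).toDF("type", "record_sales", "record_marketing")

val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")

// One when(...) per map entry; a when without otherwise yields null when the
// condition is false, so coalesce picks the branch whose type matched.
val recordCol = typemap.map { case (k, v) => when(col("type") === k, col(v)) }.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
df2.show(false)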

Related

select column of a dataframe using udf

I use spark-shell and want to create a dataframe (df2) from another dataframe (df1) using select and a udf. But there is an error when I try to show df2 with df2.show(1).
var df1 = sql(s"select * from table_1")
val slice = udf((items: Array[String]) =>
  if (items == null) items
  else {
    if (items.size <= 20)
      items
    else
      items.slice(0, 20)
  })
var df2 = df1.select($"col1", slice($"col2"))
and the df1 schema is:
scala> df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.printSchema
root
|-- col1: string (nullable = true)
|-- UDF(col2): array (nullable = true)
| |-- element: string (containsNull = true)
error:
Failed to execute user defined function($anonfun$1: (array<string>) => array<string>)
Using Seq[String] instead of Array[String] in the udf resolved the issue: Spark passes array columns to Scala udfs as a Seq (a WrappedArray), which cannot be cast to Array[String].
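A sketch of the corrected udf, assuming the same df1 as above:

import org.apache.spark.sql.functions.udf

// Spark hands array columns to a Scala udf as a Seq (a WrappedArray),
// so the parameter must be declared as Seq[String], not Array[String].
val slice = udf((items: Seq[String]) =>
  if (items == null) items
  else if (items.size <= 20) items
  else items.slice(0, 20)
)

val df2 = df1.select($"col1", slice($"col2"))
df2.show(1)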

How to create a column reference dynamically?

I have a DataFrame df with the following structure:
root
|-- author: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- client: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- outbound_link: array (nullable = true)
| |-- element: string (containsNull = true)
|-- url: string (nullable = true)
I run this code:
val sourceField = "outbound_link" // set automatically
val targetField = "url"           // set automatically
val nodeId = "client"             // set automatically

val result = df.as("df1").join(df.as("df2"),
    $"df1."+sourceField === $"df2."+targetField
  ).groupBy(
    ($"df1."+nodeId).as("nodeId_1"),
    ($"df2."+nodeId).as("nodeId_2")
  )
  .agg(
    count("*") as "value", max($"df1."+timestampField) as "timestamp"
  )
  .toDF("source", "target", "value", "timestamp")
But I get the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;
For some reason, the variables sourceField and targetField are not visible inside the join operation. These variables are not empty and contain the names of fields. I must use variables because I define them automatically in a previous step of the code.
An interesting case indeed. Look at $"df1."+sourceField and think about the order of evaluation: $"df1." is converted to a Column before sourceField is ever appended, and "df1." on its own is not a valid attribute name.
scala> val sourceField = "id"
sourceField: String = id
scala> $"df1."+sourceField
org.apache.spark.sql.AnalysisException: syntax error in attribute name: df1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:151)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:170)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:142)
at org.apache.spark.sql.Column.<init>(Column.scala:137)
at org.apache.spark.sql.ColumnName.<init>(Column.scala:1203)
at org.apache.spark.sql.SQLImplicits$StringToColumn.$(SQLImplicits.scala:45)
... 55 elided
Replace $"df1."+sourceField with the col or column function and you should be fine.
scala> col(s"df1.$sourceField")
res7: org.apache.spark.sql.Column = df1.id
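Applied to the query in the question, a sketch of the fix looks like this (timestampField is assumed to be set in the same automated step as the other field names, as in the original code):

import org.apache.spark.sql.functions.{col, count, max}

val result = df.as("df1").join(df.as("df2"),
    col(s"df1.$sourceField") === col(s"df2.$targetField")
  )
  .groupBy(
    col(s"df1.$nodeId").as("nodeId_1"),
    col(s"df2.$nodeId").as("nodeId_2")
  )
  .agg(count("*") as "value", max(col(s"df1.$timestampField")) as "timestamp")
  .toDF("source", "target", "value", "timestamp")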

Scala: enclose a map in a struct?

I have the following schema and called a udf on the column referenceTypes:
|-- referenceTypes: struct (nullable = true)
| |-- map: map (nullable = true)
| | |-- key: string
| | |-- value: long (valueContainsNull = true)
The udf:
val mapfilter = udf[Map[String, Long], Map[String, Long]](map => {
  map.keySet.exists(_ != "Family")
  val newMap = map.updated("Family", 1L)
  newMap
})
Now, after the udf is applied, my schema becomes:
|-- referenceTypes: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = false)
What do I do to get referenceTypes back as a struct, with map one level below it? In other words, how do I convert it back to the original schema at the top, with the struct as the root and the map nested under it? The bottom has to look like the top again, but I don't know what changes to make to the udf.
I tried toArray (thinking it could become a struct) and toMap as well.
Basically I need to bring back the [] wrapper:
Actual: Map(Family -> 1)
Expected: [Map(Family -> 1)]
You have to wrap the result back in a struct:
import org.apache.spark.sql.functions.struct

df.withColumn(
  "referenceTypes",
  struct(mapfilter($"referenceTypes.map").alias("map")))
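Put together with the udf from the question, a minimal sketch (the null guard is an assumption, in case the map column can be null):

import org.apache.spark.sql.functions.{struct, udf}

val mapfilter = udf((map: Map[String, Long]) =>
  if (map == null) null else map.updated("Family", 1L)
)

// Re-wrapping the map in a struct restores the original shape:
// referenceTypes: struct with a single nested field "map".
val result = df.withColumn(
  "referenceTypes",
  struct(mapfilter($"referenceTypes.map").alias("map")))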

How to rename elements of an array of structs in Spark DataFrame API

I have a UDF which returns an array of tuples:
val df = spark.range(1).toDF("i")
val myUDF = udf((l: Long) => {
  Seq((1, 2))
})
df.withColumn("udf_result", myUDF($"i"))
  .printSchema
gives
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
I want to rename the elements of the struct to something meaningful instead of _1 and _2; how can this be achieved? Note that I'm aware that returning a Seq of case classes would allow me to give proper field names, but using Spark-Notebook (REPL) with Yarn we have many issues with case classes, so I'm looking for a solution without case classes.
I'm using Spark 2, but with untyped DataFrames the solution should also be applicable to Spark 1.6.
It is possible to cast the output of the udf. E.g. to rename the struct fields to x and y, you can do:
type-safe:
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructField, StructType}

val schema = ArrayType(
  StructType(
    Array(
      StructField("x", IntegerType),
      StructField("y", IntegerType)
    )
  )
)
df.withColumn("udf_result",myUDF($"i").cast(schema))
or, less safe but shorter, using the string argument to cast:
df.withColumn("udf_result",myUDF($"i").cast("array<struct<x:int,y:int>>"))
Both will give the schema:
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: integer (nullable = true)
| | |-- y: integer (nullable = true)
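As a quick check (a sketch, not part of the original answer), the renamed fields can then be addressed by name:

df.withColumn("udf_result", myUDF($"i").cast("array<struct<x:int,y:int>>"))
  .select($"udf_result".getItem(0).getField("x").as("x"))
  .show()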

How to get an element in WrappedArray: result of Dataset.select("x").collect()?

I am a beginner with Spark/Scala. I would like to extract a value (Double) from the array selected from a Dataset. The simplified major steps are shown below. How can I extract each value (Double) from the last val wpA? Something like val p1 = wpA(1). I failed to convert it to a normal array with wpA.toArray.
Thank you in advance for your help.
case class Event(eventId: Int, n_track: Int, px: ArrayBuffer[Double], py: ArrayBuffer[Double], pz: ArrayBuffer[Double], ch: ArrayBuffer[Int], en: ArrayBuffer[Double])
---
val rawRdd = sc.textFile("expdata/rawdata.bel").map(_.split("\n"))
val eventRdd = rawRdd.map(x => buildEvent(x(0).toString))
val dataset = sqlContext.createDataset[Event](eventRdd)
dataset.printSchema()
root
|-- eventId: integer (nullable = false)
|-- n_track: integer (nullable = false)
|-- px: array (nullable = true)
| |-- element: double (containsNull = false)
|-- py: array (nullable = true)
| |-- element: double (containsNull = false)
|-- pz: array (nullable = true)
| |-- element: double (containsNull = false)
|-- ch: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- en: array (nullable = true)
| |-- element: double (containsNull = false)
val dataFrame = dataset.select("px")
val dataRow = dataFrame.collect()
val wpA = dataRow(1)(0)
println(wpA)
WrappedArray(-0.99205, 0.379417, 0.448819,.....)
When you write:
val wpA = dataRow(1)(0)
You get a variable of type Any, because org.apache.spark.sql.Row.apply(Int) (which is the method called here on the result of dataRow(1)) returns Any.
Since you know the expected type of the first item (index = 0) of this row, you should use Row.getAs[T](Int) and indicate that you expect a WrappedArray. Then the compiler will know that wpA is an array and you'll be able to use any of its methods (including the apply method, which takes an Int and can be called using parentheses only):
import scala.collection.mutable
val wpA = dataRow(1).getAs[mutable.WrappedArray[Double]](0)
println(wpA) // WrappedArray(-0.99205, 0.379417, 0.448819,.....)
println(wpA(0)) // -0.99205
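Two hedged alternatives (sketches, not from the original answer) that avoid spelling out WrappedArray:

// Row.getSeq is a shorter equivalent of getAs[mutable.WrappedArray[Double]]:
val px: Seq[Double] = dataRow(1).getSeq[Double](0)
println(px(0)) // -0.99205

// Or skip Row entirely by selecting a typed column
// (needs spark.implicits._ / sqlContext.implicits._ in scope, as in the shell):
val allPx: Array[Seq[Double]] = dataset.select($"px".as[Seq[Double]]).collect()
println(allPx(1)(0)) // same value as wpA(0) above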