I want to do some calculations and add the results to an existing DataFrame.
I have the following function to calculate the H3 address based on the latitude and longitude:
def getH3Address(x: Double, y: Double): String = {
  h3.get.geoToH3Address(x, y)
}
I created a Dataframe with the following schema:
root
|-- lat: double (nullable = true)
|-- lon: double (nullable = true)
|-- elevation: integer (nullable = true)
I want to add a new column called H3Address to this DataFrame, where the H3 address is calculated based on the lat and lon of that row.
Here is a short excerpt of the DataFrame I want to achieve:
+----+------------------+---------+---------+
| lat| lon|elevation|H3Address|
+----+------------------+---------+---------+
|51.0| 3.0| 13| a3af83|
|51.0| 3.000277777777778| 13| a3zf83|
|51.0|3.0005555555555556| 12| a1qf82|
|51.0|3.0008333333333335| 12| l3xf83|
I tried something like:
df.withColumn("H3Address", geoToH3Address(df.select(df("lat")), df.select(df("lon")))
But this didn't work.
Can someone help me out?
Edit:
After applying the suggestion from @Garib, I have the following lines:
val getH3Address = udf(
(lat: Double, lon: Double, res: Int) => {
h3.get.geoToH3Address(lat,lon,res).toString
})
var res : Int = 10
val DF_edit = df.withColumn("H3Address",
getH3Address(col("lat"), col("lon"), 10))
This time, I get the error:
[error] type mismatch;
found : Int
required: org.apache.spark.sql.Column
How can I solve this error? I have tried a lot of things, for example using the lit() function.
Edit2:
After using lit() correctly, the proposed solution worked.
Solution:
df.withColumn("H3Address", getH3Address(col("lat"), col("lon"), lit(10)))
You should create a UDF out of your function.
User-Defined Functions (UDFs) are user-programmable routines that act on one row.
For example:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // assuming a SparkSession named `spark` is in scope (as in spark-shell), needed for toDF

val getH3Address = udf(
  // write here the logic of your function. I used a dummy logic (x+y) just for this example.
  (x: Double, y: Double) => {
    (x + y).toString
  })

val df = Seq((1, 2, "aa"), (2, 3, "bb"), (3, 4, "cc")).toDF("lat", "lon", "value")
df.withColumn("H3Address", getH3Address(col("lat"), col("lon"))).show()
You can read more about UDFs here:
https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html
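Putting this together for the H3 case in the question, here is a minimal sketch. It assumes `h3.get` and the geoToH3Address call are exactly as they appear in the question, and simply passes the resolution as a literal column:

import org.apache.spark.sql.functions.{col, lit, udf}

// Wrap the H3 call from the question in a UDF; `h3.get` is assumed to be the
// same H3 instance the question already uses.
val getH3Address = udf((lat: Double, lon: Double, res: Int) =>
  h3.get.geoToH3Address(lat, lon, res).toString)

// Constant arguments must be wrapped in lit() so they become Columns.
val dfWithH3 = df.withColumn("H3Address", getH3Address(col("lat"), col("lon"), lit(10)))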
Related
I have the following DataFrame, let's say DF1, as:
root
|-- VARIANTS: string (nullable = true)
|-- VARIANT_ID: long (nullable = false)
|-- CASE_ID: string (nullable = true)
|-- APP_ID: integer (nullable = false)
Where VARIANTS (string) looks like:
Activity_1,Activity_2,Activity_2,Activity_3,Activity_5...
I am trying to get a new column, Variants_stats, which per row looks like:
Activity_1:1, Activity_2:2, Activity_3:1, Activity_5:1
The approach I have taken so far is:
1) Create a UDF:
val countActivityFrequences = udf((value: String) =>
  value.split(",").map(_.trim)
    .groupBy(identity).mapValues(_.length)
    .map { case (k, v) => k + ":" + v }
    .mkString(","))
val dfNew = df1.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
It seems to be OK (at least Spark doesn't complain), until I try to run any SQL or a dfNew.show(false) call, which always gives me back:
java.lang.StringIndexOutOfBoundsException: String index out of range: -84
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1055)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1054)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1006)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I can't figure out what I am doing wrong here.
I am using Spark 2.1+.
To reproduce :
val items = List(
"A_001,A_002,A_010,A_0200,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_074",
"A_001,A_002,A_010,A_0201,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_073")
val df = sc.parallelize(items).toDF("VARIANTS")
df.show(false)
df.printSchema
// create UDF function
val countActivityFrequences = udf((value: String) =>
  value.split(",").map(_.trim)
    .groupBy(identity).mapValues(_.length)
    .map { case (k, v) => k + ":" + v }
    .mkString(","))
// Apply UDF against our little DF
var dfNew = df.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
dfNew.printSchema
// Error thrown: either "Malformed class name" or java.lang.StringIndexOutOfBoundsException
dfNew.show(false)
Update:
The issue was only appearing in our AWS EMR environment, under Zeppelin. Restarting the interpreter made it work. (Judging from the stack trace, ScalaUDF calls Class.getSimpleName on the UDF's class when building its error message, and getSimpleName can fail on the deeply nested anonymous classes generated by REPL/notebook environments.)
I want to transpose multiple columns in a Spark SQL table.
I found this solution for only two columns; I want to know how to make the zip function work with three columns: varA, varB and varC.
import org.apache.spark.sql.functions.{udf, explode}
val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
$"userId", $"someString",
$"vars._1".alias("varA"), $"vars._2".alias("varB")).show
This is my DataFrame schema:
root
|-- owningcustomerid: string (nullable = true)
|-- event_stoptime: string (nullable = true)
|-- balancename: string (nullable = false)
|-- chargedvalue: string (nullable = false)
|-- newbalance: string (nullable = false)
I tried this code:
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)
df.printSchema
val df4=df.withColumn("vars", explode(zip($"balancename", $"chargedvalue",$"newbalance"))).select(
$"owningcustomerid", $"event_stoptime",
$"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"),$"vars._2".alias("newbalance"))
I got this error:
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array<string> type, however, '`balancename`' is of string type. argument 2 requires array<string> type, however, '`chargedvalue`' is of string type. argument 3 requires array<string> type, however, '`newbalance`' is of string type.;;
'Project [owningcustomerid#1085, event_stoptime#1086, balancename#1159, chargedvalue#1160, newbalance#1161, explode(UDF(balancename#1159, chargedvalue#1160, newbalance#1161)) AS vars#1167]
In Scala in general you can use Tuple3.zipped:
val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
(xs, ys, zs).zipped.toSeq)
zip($"varA", $"varB", $"varC")
Specifically, in Spark SQL (>= 2.4) you can use the arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip
arrays_zip($"varA", $"varB", $"varC")
However, note that your data doesn't contain array<string> but plain strings, so arrays_zip or explode cannot be applied directly; you should parse your data first, as in the sketch below.
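For example, here is a hedged sketch for the DataFrame in this question, assuming each of the three string columns holds a comma-separated list and Spark >= 2.4:

import org.apache.spark.sql.functions.{arrays_zip, explode, split}
import spark.implicits._ // for the $ syntax

// Parse the comma-separated strings into arrays first, then zip and explode.
// The zipped struct fields are named after the input columns.
val parsed = df
  .withColumn("balancename_arr", split($"balancename", ","))
  .withColumn("chargedvalue_arr", split($"chargedvalue", ","))
  .withColumn("newbalance_arr", split($"newbalance", ","))

val df4 = parsed
  .withColumn("vars", explode(arrays_zip($"balancename_arr", $"chargedvalue_arr", $"newbalance_arr")))
  .select(
    $"owningcustomerid", $"event_stoptime",
    $"vars.balancename_arr".alias("balancename"),
    $"vars.chargedvalue_arr".alias("chargedvalue"),
    $"vars.newbalance_arr".alias("newbalance"))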
And if you need a UDF-based zip for more columns (four in this example), the same pattern extends naturally:
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i))))
I have dataframe with two level nested fields
root
|-- request: struct (nullable = true)
| |-- dummyID: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- fooID: string (nullable = true)
| | |-- barID: string (nullable = true)
I want to update the value of the fooID column here. I was able to update a value at the first level (for example the dummyID column here) using this question as a reference: How to add a nested column to a DataFrame
Input data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "abc",
"barID": "1485351"
}
}
}
Output data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "def",
"barID": "1485351"
}
}
}
How can I do it using Scala?
Here is a generic solution to this problem that makes it possible to update any number of nested values, at any level, based on an arbitrary function applied in a recursive traversal:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

def mutate(df: DataFrame, fn: Column => Column): DataFrame = {
// Get a projection with fields mutated by `fn` and select it
// out of the original frame with the schema reassigned to the original
// frame (explained later)
df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, df.schema)
}
def traverse(schema: StructType, fn: Column => Column, path: String = ""): Array[Column] = {
schema.fields.map(f => {
f.dataType match {
case s: StructType => struct(traverse(s, fn, path + f.name + "."): _*)
case _ => fn(col(path + f.name))
}
})
}
This is effectively equivalent to the usual "just redefine the whole struct as a projection" solutions, but it automates re-nesting fields with the original structure AND preserves nullability/metadata (which are lost when you redefine the structs manually). Annoyingly, preserving those properties isn't possible while creating the projection (afaict) so the code above redefines the schema manually.
An example application:
case class Organ(name: String, count: Int)
case class Disease(id: Int, name: String, organ: Organ)
case class Drug(id: Int, name: String, alt: Array[String])
val df = Seq(
(1, Drug(1, "drug1", Array("x", "y")), Disease(1, "disease1", Organ("heart", 2))),
(2, Drug(2, "drug2", Array("a")), Disease(2, "disease2", Organ("eye", 3)))
).toDF("id", "drug", "disease")
df.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 2]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 3]] |
+---+------------------+-------------------------+
// Update the integer field ("count") at the lowest level:
val df2 = mutate(df, c => if (c.toString == "disease.organ.count") c - 1 else c)
df2.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 1]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 2]] |
+---+------------------+-------------------------+
// This will NOT necessarily be equal unless the metadata and nullability
// of all fields is preserved (as the code above does)
assertResult(df.schema.toString)(df2.schema.toString)
A limitation of this is that it cannot add new fields, only update existing ones (though the map can be changed into a flatMap and the function to return Array[Column] for that, if you don't care about preserving nullability/metadata).
Additionally, here is a more generic version for Dataset[T]:
case class Record(id: Int, drug: Drug, disease: Disease)
import org.apache.spark.sql.{Dataset, Encoder}

def mutateDS[T](df: Dataset[T], fn: Column => Column)(implicit enc: Encoder[T]): Dataset[T] = {
df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, enc.schema).as[T]
}
// To call as typed dataset:
val fn: Column => Column = c => if (c.toString == "disease.organ.count") c - 1 else c
mutateDS(df.as[Record], fn).show(false)
// To call as untyped dataset:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) // This is necessary regardless of sparkSession.implicits._ imports
mutateDS(df, fn).show(false)
One way, although cumbersome, is to fully unpack and recreate the column by explicitly referencing each element of the original struct.
dataFrame.withColumn("person",
struct(
col("person.age").alias("age),
struct(
col("person.name.first").alias("first"),
lit("some new value").alias("last")).alias("name")))
I have a dataframe containing float and double values.
scala> val df = List((Float.NaN, Double.NaN), (1f, 0d)).toDF("x", "y")
df: org.apache.spark.sql.DataFrame = [x: float, y: double]
scala> df.show
+---+---+
| x| y|
+---+---+
|NaN|NaN|
|1.0|0.0|
+---+---+
scala> df.printSchema
root
|-- x: float (nullable = false)
|-- y: double (nullable = false)
When I replace the NaN values with null, I give "null" as a String to the Map in the fill operation.
scala> val map = df.columns.map((_, "null")).toMap
map: scala.collection.immutable.Map[String,String] = Map(x -> null, y -> null)
scala> df.na.fill(map).printSchema
root
|-- x: float (nullable = true)
|-- y: double (nullable = true)
scala> df.na.fill(map).show
+----+----+
| x| y|
+----+----+
|null|null|
| 1.0| 0.0|
+----+----+
And I got the correct result. But I was not able to understand how/why Spark SQL translates "null" as a String into a null object.
If you look into the fill function in Dataset, it checks the datatype and tries to convert the value to the datatype of the column's schema. If it can be converted, it converts; otherwise it returns null.
It does not convert "null" to the object null; it returns null because an exception occurs while converting.
val map = df.columns.map((_, "WHATEVER")).toMap
gives null
and val map = df.columns.map((_, "9999.99")).toMap
gives 9999.99
If you update the NaN with a value of the same datatype, you get the result as expected.
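For instance, here is a minimal sketch that fills the NaN values with a Double directly, so no String cast is involved:

// Replace NaN (and null) in the numeric columns with 0.0 directly.
df.na.fill(Map("x" -> 0.0, "y" -> 0.0)).show()

// Or, for all applicable numeric columns at once:
df.na.fill(0.0).show()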
Hope this helps you to understand!
It's not that "null" as a String translates to a null object. You can try using the transformation with any String and still get null (with the exception of strings that can directly be cast to double/float, see below). For example, using
val map = df.columns.map((_, "abc")).toMap
would give the same result. My guess is that, since the columns are of type float and double, casting the replacement String to those types yields null. Using a number instead would work as expected, e.g.
val map = df.columns.map((_, 1)).toMap
As some strings can be cast directly to double or float, those can also be used in this case:
val map = df.columns.map((_, "1")).toMap
I've looked into the source code; in fill your string is cast to a double/float:
private def fillCol[T](col: StructField, replacement: T): Column = {
col.dataType match {
case DoubleType | FloatType =>
coalesce(nanvl(df.col("`" + col.name + "`"), lit(null)),
lit(replacement).cast(col.dataType)).as(col.name)
case _ =>
coalesce(df.col("`" + col.name + "`"), lit(replacement).cast(col.dataType)).as(col.name)
}
}
The relevant source code for casting is here (similar code for Floats):
Cast.scala (taken from Spark 1.6.3) :
// DoubleConverter
private[this] def castToDouble(from: DataType): Any => Any = from match {
case StringType =>
buildCast[UTF8String](_, s => try s.toString.toDouble catch {
case _: NumberFormatException => null
})
case BooleanType =>
buildCast[Boolean](_, b => if (b) 1d else 0d)
case DateType =>
buildCast[Int](_, d => null)
case TimestampType =>
buildCast[Long](_, t => timestampToDouble(t))
case x: NumericType =>
b => x.numeric.asInstanceOf[Numeric[Any]].toDouble(b)
}
So Spark tries to convert the String to a Double (s.toString.toDouble); if that's not possible (i.e. you get a NumberFormatException) you get a null. So instead of "null" you could also use "foo" with the same outcome. But if you use "1.0" in your map, then NaNs and nulls will be replaced with 1.0, because the String "1.0" is indeed parsable to a Double.
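You can see this cast behaviour in isolation with a small sketch, independent of na.fill:

import spark.implicits._ // assuming a SparkSession named `spark` is in scope

// Casting a non-numeric String to double yields null; a numeric String parses.
Seq("null", "foo", "1.0").toDF("s")
  .select($"s".cast("double").alias("as_double"))
  .show()
// +---------+
// |as_double|
// +---------+
// |     null|
// |     null|
// |      1.0|
// +---------+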
I have a DataFrame with the following schema :
root
|-- journal: string (nullable = true)
|-- topicDistribution: vector (nullable = true)
The topicDistribution field is a vector of doubles: [0.1, 0.2, 0.15, ...]
What I want is to explode each row into several rows to obtain the following schema:
root
|-- journal: string
|-- topic-prob: double // this is the value from the vector
|-- topic-id : integer // this is the index of the value from the vector
To clarify, I've created a case class:
case class JournalDis(journal: String, topic_id: Integer, prob: Double)
I've managed to achieve this using dataset.explode in a very awkward way:
val df1 = df.explode("topicDistribution", "topic") {
topics: DenseVector => topics.toArray.zipWithIndex
}.select("journal", "topic")
val df2 = df1
  .withColumn("topic_id", df1("topic").getItem("_2"))
  .withColumn("topic_prob", df1("topic").getItem("_1"))
  .drop(df1("topic"))
But dataset.explode is deprecated. I wonder how to achieve this using the flatMap method?
Not tested but should work:
import spark.implicits._
import org.apache.spark.ml.linalg.Vector
df.as[(String, Vector)].flatMap {
case (j, ps) => ps.toArray.zipWithIndex.map {
case (p, ti) => JournalDis(j, ti, p)
}
}
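The result is a Dataset[JournalDis]; if you want the exact column names from the target schema (topic-id, topic-prob), a hedged follow-up sketch:

// Rename the case-class fields to the column names used in the target schema.
val exploded = df.as[(String, Vector)].flatMap {
  case (j, ps) => ps.toArray.zipWithIndex.map {
    case (p, ti) => JournalDis(j, ti, p)
  }
}.toDF()
  .withColumnRenamed("topic_id", "topic-id")
  .withColumnRenamed("prob", "topic-prob")

exploded.printSchema()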