Spark - recursive function as udf generates an Exception - scala

I am working with DataFrames whose elements have a schema similar to:
root
|-- NPAData: struct (nullable = true)
| |-- NPADetails: struct (nullable = true)
| | |-- location: string (nullable = true)
| | |-- manager: string (nullable = true)
| |-- service: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- serviceName: string (nullable = true)
| | | |-- serviceCode: string (nullable = true)
|-- NPAHeader: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- date: string (nullable = true)
In my DataFrame I want to group all elements which have the same NPAHeader.code. To do that I am using the following line:
val groupedNpa = orderedNpa.groupBy($"NPAHeader.code" ).agg(collect_list(struct($"NPAData",$"NPAHeader")).as("npa"))
After this I have a dataframe with the following schema:
StructType(StructField(npaNumber,StringType,true), StructField(npa,ArrayType(StructType(StructField(NPAData...)))))
An example of each Row would be something similar to:
[1234,WrappedArray([npaNew,npaOlder,...npaOldest])]
Now I want to generate another DataFrame which picks just one of the elements in the WrappedArray, so I want an output similar to:
[1234,npaNew]
Note: The chosen element from the WrappedArray is the one that matches a complex logic after iterating over the whole WrappedArray. To simplify the question, I will always pick the last element of the WrappedArray (after iterating over all of it).
To do so, I want to define a recursive udf:
import org.apache.spark.sql.functions.udf
def returnRow(elementList : Row)(index:Int): Row = {
val dif = elementList.size - index
val row :Row = dif match{
case 0 => elementList.getAs[Row](index)
case _ => returnRow(elementList)(index + 1)
}
row
}
val returnRow_udf = udf(returnRow _)
groupedNpa.map{row => (row.getAs[String]("npaNumber"),returnRow_udf(groupedNpa("npa")(0)))}
But I am getting the following error in the map:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type Int => Unit is not supported
What am I doing wrong?
As an aside, I am not sure whether I am passing the npa column correctly with groupedNpa("npa"). I am accessing the WrappedArray as a Row, because I don't know how to iterate over Array[Row] (the get(index) method is not present in Array[Row]).

TL;DR Just use one of the methods described in How to select the first row of each group?
If you want to use complex logic, and return Row you can skip SQL API and use groupByKey:
val f: (String, Iterator[org.apache.spark.sql.Row]) => Row
val encoder: Encoder
df.groupByKey(_.getAs[String]("NPAHeader.code")).mapGroups(f)(encoder)
or better:
val g: (Row, Row) => Row
df.groupByKey(_.getAs[String]("NPAHeader.code")).reduceGroups(g)
where encoder is a valid RowEncoder (Encoder error while trying to map dataframe row to updated row).
Your code is faulty in multiple ways:
groupBy doesn't guarantee the order of values. So:
orderBy(...).groupBy(....).agg(collect_list(...))
can have non-deterministic output. If you really decide to go this route you should skip orderBy and sort collected array explicitly.
You cannot pass curried function to udf. You'd have to uncurry it first, but it would require different order of arguments (see example below).
If you could, this might be the correct way to call it (Note that you omit the second argument):
returnRow_udf(groupedNpa("npa")(0))
To make it worse, you call it inside map, where udfs are not applicable at all.
udf cannot return Row. It has to return external Scala type.
External representation for array<struct> is Seq[Row]. You cannot just substitute it with Row.
SQL arrays can be accessed by index with apply:
df.select($"array"(size($"array") - 1))
but it is not a correct method due to non-determinism. You could apply sort_array, but as pointed out at the beginning, there are more efficient solutions.
Surprisingly recursion is not so relevant. You could design function like this:
def size(i: Int=0)(xs: Seq[Any]): Int = xs match {
case Seq() => i
case null => i
case Seq(h, t @ _*) => size(i + 1)(t)
}
val size_ = udf(size() _)
and it would work just fine:
Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
.select(size_($"array"))
although recursion is an overkill, if you can just iterate over Seq.
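Since recursion is not required, the selection can be written as a single pass over the collected sequence. Below is a minimal sketch in plain Scala (no Spark involved). The `Npa` case class and the "greater date wins" rule are hypothetical stand-ins for the real rows and the real selection logic, which the question does not show:

```scala
object PickOne {
  // Hypothetical stand-in for one collected struct; not the asker's real schema.
  final case class Npa(npaNumber: String, date: String)

  // Hypothetical rule: keep the element with the greater date string.
  def newer(a: Npa, b: Npa): Npa = if (b.date > a.date) b else a

  // Iterate once over the sequence; None for a null or empty input.
  def pick(xs: Seq[Npa]): Option[Npa] =
    Option(xs).flatMap(_.reduceOption(newer))
}
```

A binary function shaped like `newer` is also the kind of argument `reduceGroups` takes, so the same rule could probably be reused there without collecting a list first.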

Related

Scala/Spark - Convert Word2vec output to Dataset[_]

I believe the case class type should match with DataFrame. However, I'm confused what should be my case class type for text column?
My code below:
case class vectorData(value: Array[String], vectors: Array[Float])
def main(args: Array[String]) {
val word2vec = new Word2Vec()
.setInputCol("value").setOutputCol("vectors")
.setVectorSize(5).setMinCount(0).setWindowSize(5)
val dataset = spark.createDataset(data)
val model = word2vec.fit(dataset)
val encoder = org.apache.spark.sql.Encoders.product[vectorData]
val result = model.transform(dataset)
result.foreach(row => println(row.get(0)))
println("###################################")
result.foreach(row => println(row.get(1)))
val output = result.as(encoder)
}
As shown, when I print the first column, I get this:
WrappedArray(#marykatherine_q, know!, I, heard, afternoon, wondered, thing., Moscow, times)
WrappedArray(laying, bed, voice..)
WrappedArray(I'm, sooo, sad!!!, killed, Kutner, House, whyyyyyyyy)
when I print the second column, I get this:
[-0.0495405454809467,0.03403271486361821,0.011959535030958552,-0.008446224654714266,0.0014322120696306229]
[-0.06924172700382769,0.02562551060691476,0.01857258938252926,-0.0269106051127892,-0.011274430900812149]
[-0.06266747579416808,0.007715661790879334,0.047578315007472956,-0.02747830021989477,-0.015755867421188775]
The error I'm getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [result, value];
It seems apparent that my case class has type mismatch with actual result. What should be the correct one? I want val output to be DataSet[_].
Thank you
EDIT:
I've modified the case class column names to be same as the word2vec output. Now I'm getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: need an array field but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;
From what I see, it is just a problem of attribute naming. What spark is telling you is that it cannot find the attribute text in the dataframe result.
You do not say how you create the data object but it must have an attribute value since Word2vec manages to find it. model.transform simply adds a result column to that dataset, and turns it into a dataframe of the following type:
root
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vector: array (nullable = true)
| |-- element: float (containsNull = false)
|-- result: vector (nullable = true)
So when you try to turn it into a dataset, spark tries to find a text column and throws that exception. Just rename the value column and it will work:
val output = result.withColumnRenamed("value", "text").as(encoder)
After checking the source code of word2vec, I managed to realise that the output of transform is actually not Array[Float], it is actually Vector (from o.a.s.ml.linalg).
It worked by changing case class as below:
case class vectorData(value: Array[String], vectors: Vector)

Dump array of map column of a spark dataframe into csv file

I have the following spark dataframe and its corresponding schema
+----+--------------------+
|name| subject_list|
+----+--------------------+
| Tom|[[Math -> 99], [P...|
| Amy| [[Physics -> 77]]|
+----+--------------------+
root
|-- name: string (nullable = true)
|-- subject_list: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: integer (valueContainsNull = false)
How can I dump this dataframe into a csv file separated by "\t", as follows:
Tom [(Math, 99), (Physics, 88)]
Amy [(Physics, 77)]
Here's link to a similar post to this question but it is for dumping an array of string, not an array of map.
I appreciate any help, thanks.
The reason why it throws error and other details are listed in same link that you have shared. Here is the modified version of stringify for array of map:
def stringify = udf((vs: Seq[Map[String, Int]]) => vs match {
case null => null
case x => "[" + x.flatMap(_.toList).mkString(",") + "]"
})
credits: link
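The udf above joins the pairs with "," and no space, while the output requested in the question has ", " between the pairs and inside each pair. A minimal sketch of the formatting logic as a plain Scala function, assuming that exact layout is wanted; the object and method names are only illustrative, and wrapping the method in `udf` is assumed to work the same way as the lambda above:

```scala
object Stringify {
  // Formats e.g. Seq(Map("Math" -> 99), Map("Physics" -> 88))
  // as "[(Math, 99), (Physics, 88)]"; a null input stays null.
  def stringify(vs: Seq[Map[String, Int]]): String =
    if (vs == null) null
    else vs.flatMap(_.toList)
      .map { case (k, v) => s"($k, $v)" }
      .mkString("[", ", ", "]")
}
```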
You can write an udf to convert Map to string as you want like
val mapToString = udf((marks: Map[String, String]) => {
marks.map{case (k, v) => (s"(${k},${v})")}.mkString("[",",", "]")
})
dff.withColumn("marks", mapToString($"marks"))
.write.option("delimiter", "\t")
.csv("csvoutput")
Output:
Tom [(Math,99),(Physics,88)]
Amy [(Physics,77)]
But I don't recommend doing this: you will have problems when reading the file back and will have to parse it manually.
It's better to flatten the maps:
dff.select($"name", explode($"marks")).write.csv("csvNewoutput")
Which will store as
Tom,Math,99
Tom,Physics,88
Amy,Physics,77

Apache Spark Null Value when casting incompatible DecimalType vs ClassCastException

Casting DecimalType(10,5) e.g. 99999.99999 to DecimalType(5,4) in Apache Spark silently returns null
Is it possible to change this behavior so that Spark throws an exception (for example some CastException) in this case and fails the job instead of silently returning null?
As per the source code on GitHub, https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L499
/**
 * Change the precision / scale in a given decimal to those set in decimalType (if any),
 * returning null if it overflows or modifying value in-place and returning it if successful.
 *
 * NOTE: this modifies value in-place, so don't call it on external data.
 */
There is also another thread, suggesting there may not be a direct method to fail the code if not able to cast. Spark: cast decimal without changing nullable property of column.
So you could probably check for null values in the cast column and add logic that fails the job if any are found.
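One way to fail instead of getting a null is to do the range check yourself. Below is a minimal sketch in plain Scala using `scala.math.BigDecimal`. The object and function names are hypothetical, and HALF_UP rounding is an assumption about the desired behaviour, not something taken from the Spark source quoted above:

```scala
import scala.math.BigDecimal.RoundingMode

object StrictDecimal {
  // Rescale `value` to `scale` digits after the point and fail loudly
  // if the result needs more than `precision` digits in total.
  // HALF_UP rounding is an assumption about the desired behaviour.
  def castStrict(value: BigDecimal, precision: Int, scale: Int): BigDecimal = {
    val rescaled = value.setScale(scale, RoundingMode.HALF_UP)
    if (rescaled.precision > precision)
      throw new ArithmeticException(
        s"$value does not fit into decimal($precision,$scale)")
    rescaled
  }
}
```

With this sketch, 99999.99999 rescaled to four decimals needs ten digits, so a target of (5,4) raises the exception rather than yielding null.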
As I mentioned in my comment above, you can try to achieve what you want using a UserDefinedFunction. I'm currently facing the same problem and managed to solve it with a UDF. My problem was that I wanted to cast a column to DoubleType without knowing its type up front, and I want my application to fail when the parsing fails, rather than silently produce a null like the one you describe.
In the code below I've written a udf that takes a struct as its parameter. It tries to parse the only value in this struct to a double. If this fails, it throws an exception, which causes the job to fail.
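import org.apache.spark.SparkException
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
import spark.implicits._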
val cast_to_double = udf((number: Row) => {
try {
number.get(0) match {
case s: String => s.toDouble
case d: Double => d
case l: Long => l.toDouble
case i: Int => i.toDouble
case _ => throw new NumberFormatException
}
} catch {
case _: NumberFormatException => throw new IllegalArgumentException("Can't parse this so called number of yours.")
}
})
try {
val intDF = List(1).toDF("something")
val secondIntDF = intDF.withColumn("something_else", cast_to_double(struct(col("something"))))
secondIntDF.printSchema()
secondIntDF.show()
val stringIntDF = List("1").toDF("something")
val secondStringIntDF = stringIntDF.withColumn("something_else", cast_to_double(struct(col("something"))))
secondStringIntDF.printSchema()
secondStringIntDF.show()
val stringDF = List("string").toDF("something")
val secondStringDF = stringDF.withColumn("something_else", cast_to_double(struct(col("something"))))
secondStringDF.printSchema()
secondStringDF.show()
} catch {
case se: SparkException => println(se.getCause.getMessage)
}
OUTPUT:
root
|-- something: integer (nullable = false)
|-- something_else: double (nullable = false)
+---------+--------------+
|something|something_else|
+---------+--------------+
| 1| 1.0|
+---------+--------------+
root
|-- something: string (nullable = true)
|-- something_else: double (nullable = false)
+---------+--------------+
|something|something_else|
+---------+--------------+
| 1| 1.0|
+---------+--------------+
root
|-- something: string (nullable = true)
|-- something_else: double (nullable = false)
Can't parse this so called number of yours.

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in the filter_list, similar to how array_contains would function, but array_contains does not work with a provided list of strings, only with a single value.
so the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I probably mix types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
(recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
thanks!
Your code is almost right. All you have to do is replace List with Seq in the udf's argument type:
def filterItems(flist: List[String]) = udf {
(recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
Reference SQL Programming Guide - Data Types.
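The filtering logic itself can be checked without Spark. Below is a minimal sketch in plain Scala; the `keepAllowed` name, the use of a Set for membership checks, and the null pass-through are illustrative choices that are not part of the answer above:

```scala
object FilterItems {
  // Keep only the items that appear in `allowed`, preserving their order.
  // A Set makes each membership check cheap; a null input is passed through.
  def keepAllowed(allowed: Set[String])(recs: Seq[String]): Seq[String] =
    if (recs == null) null else recs.filter(allowed.contains)
}
```

The argument is typed as Seq[String] for the same reason as in the corrected udf: that is the type the answer says Spark uses for an array column.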

Selecting a row from array<struct> based on given condition

I've a dataframe with following schema -
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]]
[["PQR","g"]]
[["TUV","f"],["ABC","e"]]
I've to select a single struct from this array based on the value of _v1. There is a hierarchy in these values like -
"ABC" -> "XYZ" -> "PQR" -> "TUV"
Now, if "TUV" is present, we will select the row with "TUV" in its _v1. Else we will check for "PQR". If "PQR" is present, take its row. Else check for "XYZ" and so on.
The result df should look like - (which will be StructType now, not Array[Struct])
["TUV","d"]
["PQR","g"]
["TUV","f"]
Can someone please guide me how can I approach this problem by creating a udf ?
Thanks in advance.
you can do something like below
import org.apache.spark.sql.functions._
import scala.collection.mutable
def getIndex = udf((array : mutable.WrappedArray[String]) => {
if(array.contains("TUV")) array.indexOf("TUV")
else if(array.contains("PQR")) array.indexOf("PQR")
else if(array.contains("XYZ")) array.indexOf("XYZ")
else if(array.contains("ABC")) array.indexOf("ABC")
else 0
})
df.select($"VALUES"(getIndex($"VALUES._v1")).as("selected"))
You should have following output
+--------+
|selected|
+--------+
|[TUV,d] |
|[PQR,g] |
|[TUV,f] |
+--------+
I hope the answer is helpful
Updated
You can select the elements of struct column by using . notation. Here $"VALUES._v1" is selecting all the _v1 of struct and passing them to udf function as Array in the same order.
for example : for [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], $"VALUES._v1" would return ["ABC","PQR","XYZ","TUV"] which is passed to udf function
Inside udf function, index of array where the strings matched is returned. for example : for ["ABC","PQR","XYZ","TUV"], "TUV" matches so it would return 3.
For the first row, getIndex($"VALUES._v1") would return 3, so $"VALUES"(getIndex($"VALUES._v1")) is equivalent to $"VALUES"(3), which is the fourth element of [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], i.e. ["TUV","d"].
I hope the explanation is clear.
This should work as long as each row contains each _v1 value at most once. The UDF returns the index of the best value in the hierarchy list. The struct containing this value in _v1 is then selected and put into the "select" column.
val hierarchy = List("TUV", "PQR", "XYZ", "ABC")
val findIndex = udf((rows: Seq[String]) => {
val s = rows.toSet
val best = hierarchy.filter(h => s contains h).head
rows.indexOf(best)
})
df.withColumn("select", $"VALUES"(findIndex($"VALUES._v1")))
A list is used for the order to make it easy to extend to more than 4 values.
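The index lookup can be tried out in plain Scala before wrapping it in a udf. Below is a minimal sketch; the `bestIndex` name and the Option return type are assumptions made here so that a row with no matching value does not throw, which `.head` on an empty list would:

```scala
object Hierarchy {
  // Highest priority first, as in the answer above.
  val hierarchy: List[String] = List("TUV", "PQR", "XYZ", "ABC")

  // Index of the best-ranked value present in `v1s`, or None if none match.
  def bestIndex(v1s: Seq[String]): Option[Int] =
    hierarchy.iterator.map(h => v1s.indexOf(h)).find(_ >= 0)
}
```

The caller then decides what a missing match means, for example by falling back to a default index with `getOrElse`.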