How to fix: java.lang.StringIndexOutOfBoundsException from Spark UDF function - Scala

I have the following dataframe, let's say DF1:
root
|-- VARIANTS: string (nullable = true)
|-- VARIANT_ID: long (nullable = false)
|-- CASE_ID: string (nullable = true)
|-- APP_ID: integer (nullable = false)
where VARIANTS (string) looks like:
Activity_1,Activity_2,Activity_2,Activity_3,Activity_5...
I am trying to get a new column,
Variants_stats, which per row would be:
Activity_1:1, Activity_2:2, Activity_3:1, Activity_5:1
The approach I have taken so far is:
1) Create an UDF :
val countActivityFrequences = udf((value: String) =>
  value.split(",").map(_.trim)
    .groupBy(identity)
    .mapValues(_.length)
    .map { case (k, v) => k + ":" + v }
    .mkString(","))
val dfNew = df1.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
It seems to be OK (at least Spark doesn't complain), until I try to run any SQL or call dfNew.show(false), which always gives me back:
java.lang.StringIndexOutOfBoundsException: String index out of range: -84
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1055)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1054)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1006)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I can't figure out what I am doing wrong here.
I am using Spark 2.1+.
To reproduce:
val items = List(
"A_001,A_002,A_010,A_0200,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_074",
"A_001,A_002,A_010,A_0201,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_073")
val df = sc.parallelize(items).toDF("VARIANTS")
df.show(false)
df.printSchema
// create UDF function
val countActivityFrequences = udf((value: String) =>
  value.split(",").map(_.trim)
    .groupBy(identity)
    .mapValues(_.length)
    .map { case (k, v) => k + ":" + v }
    .mkString(","))
// Apply UDF against our little DF
var dfNew = df.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
dfNew.printSchema
// Error thrown: (either Malformed class name, or java.lang.StringIndexOutOfBoundsException)
dfNew.show(false)
Update:
The issue only appeared in our AWS EMR environment, under Zeppelin.
Restarting the interpreter made it work.
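For reference, the Class.getSimpleName call in the stack trace is the usual symptom of the "Malformed class name" problem that affects UDF lambdas defined inside deeply nested anonymous classes (e.g. a notebook cell). Besides restarting the interpreter, a commonly suggested workaround is to move the UDF body into a top-level object; the sketch below was not verified in the EMR/Zeppelin setup above, and the object name is purely illustrative:
import org.apache.spark.sql.functions.udf

// Define the logic as a plain method on a top-level object so the generated
// class name stays simple, then wrap it in udf() at the call site.
object ActivityStats extends Serializable {
  def countFrequencies(value: String): String =
    value.split(",").map(_.trim)
      .groupBy(identity)
      .mapValues(_.length)
      .map { case (k, v) => k + ":" + v }
      .mkString(",")
}

val countActivityFrequences = udf(ActivityStats.countFrequencies _)
val dfNew = df.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
dfNew.show(false)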

Related

convert string value to map using scala

I have a CSV file where one of the fields contains a map, as shown below:
"Map(12345 -> 45678, 23465 -> 9876)"
When I try to load the CSV into a dataframe, this field is treated as a string.
So I have written a UDF to convert the string to a map, as below:
val convertToMap = udf((pMap: String) => {
  val mpp = pMap
  // "Map(12345 -> 45678, 23465 -> 9876)"
  val stg = mpp.substr(4, mpp.length() - 1)
  val stg1 = stg.split(regex=",").toList
  val mp = stg1.map(_.split(regex=" ").toList)
  val mp1 = mp.map(mp =>
    (mp(0), mp(2))).toMap
})
Now I need help applying the UDF to the column that is being read as a string, and returning the DF with the converted column.
You are pretty close, but it looks like your UDF has some mix of Scala and Python, and the parsing logic needs a little work. There may be a better way to parse a map literal string, but this works with the provided example:
val convertToMap = udf { (pMap: String) =>
  val stg = pMap.substring(4, pMap.length() - 1)
  val stg1 = stg.split(",").toList.map(_.trim)
  val mp = stg1.map(_.split(" ").toList)
  mp.map(mp => (mp(0), mp(2))).toMap
}
val df = spark.createDataset(Seq("Map(12345 -> 45678, 23465 -> 9876)")).toDF("strMap")
With the corrected UDF, you simply invoke it with a .select() or a .withColumn():
df.select(convertToMap($"strMap").as("map")).show(false)
Which gives:
+----------------------------------+
|map |
+----------------------------------+
|Map(12345 -> 45678, 23465 -> 9876)|
+----------------------------------+
With the schema:
root
|-- map: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
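If you prefer to keep the original string column next to the parsed one, the .withColumn() form mentioned above is a one-liner (a minimal sketch on the same sample data; the new column name is just for illustration):
// adds a map<string,string> column "parsedMap" alongside the original "strMap"
df.withColumn("parsedMap", convertToMap($"strMap")).show(false)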

zip function with 3 parameters

I want to transpose multiple columns in a Spark SQL table.
I found this solution for only two columns; I want to know how to make the zip function work with three columns: varA, varB and varC.
import org.apache.spark.sql.functions.{udf, explode}
val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
$"userId", $"someString",
$"vars._1".alias("varA"), $"vars._2".alias("varB")).show
This is my dataframe schema:
root
|-- owningcustomerid: string (nullable = true)
|-- event_stoptime: string (nullable = true)
|-- balancename: string (nullable = false)
|-- chargedvalue: string (nullable = false)
|-- newbalance: string (nullable = false)
I tried this code:
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)
df.printSchema
val df4 = df.withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance"))).select(
  $"owningcustomerid", $"event_stoptime",
  $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._3".alias("newbalance"))
I got this error:
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array<string> type, however, '`balancename`' is of string type. argument 2 requires array<string> type, however, '`chargedvalue`' is of string type. argument 3 requires array<string> type, however, '`newbalance`' is of string type.;;
'Project [owningcustomerid#1085, event_stoptime#1086, balancename#1159, chargedvalue#1160, newbalance#1161, explode(UDF(balancename#1159, chargedvalue#1160, newbalance#1161)) AS vars#1167]
In Scala, in general, you can use Tuple3.zipped:
val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
  (xs, ys, zs).zipped.toSeq)
zip($"varA", $"varB", $"varC")
Specifically in Spark SQL (>= 2.4) you can use the arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip
arrays_zip($"varA", $"varB", $"varC")
However, note that your data doesn't contain array<string> but plain strings; hence arrays_zip or explode cannot be applied directly and you should parse your data first.
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) => {a.indices.map(i=> (a(i), b(i), c(i), d(i)))})
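For completeness, here is a sketch of the "parse first" step, assuming the three string columns are comma-delimited lists (the delimiter is an assumption): split them into array<string>, then apply the three-argument Seq[String] version of the zip UDF from the question (or arrays_zip on Spark >= 2.4):
import org.apache.spark.sql.functions.{split, explode}

// turn the plain string columns into array<string> columns
val parsed = df
  .withColumn("balancename", split($"balancename", ","))
  .withColumn("chargedvalue", split($"chargedvalue", ","))
  .withColumn("newbalance", split($"newbalance", ","))

// now explode the zipped arrays, one output row per position
val df4 = parsed
  .withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance")))
  .select($"owningcustomerid", $"event_stoptime",
    $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._3".alias("newbalance"))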

Apply a function to each row of a Spark Dataframe

I have a Spark Dataframe (df) with 2 columns (Report_id and Cluster_number).
I want to apply a function (getClusterInfo) to df which will return the name for each cluster, i.e. if the cluster number is '3' then, for a specific report_id, the 3 rows mentioned below will be written:
{"cluster_id":"1","influencers":[{"screenName":"A"},{"screenName":"B"},{"screenName":"C"},...]}
{"cluster_id":"2","influencers":[{"screenName":"D"},{"screenName":"E"},{"screenName":"F"},...]}
{"cluster_id":"3","influencers":[{"screenName":"G"},{"screenName":"H"},{"screenName":"E"},...]}
I am using foreach on df to apply the getClusterInfo function, but can't figure out how to convert the output to a Dataframe (Report_id, Array[cluster_info]).
Here is the code snippet:
df.foreach(row => {
  val report_id = row(0)
  val cluster_no = row(1).toString
  val cluster_numbers = new Range(0, cluster_no.toInt - 1, 1)
  for (cluster <- cluster_numbers.by(1)) {
    val cluster_id = report_id + "_" + cluster
    // get cluster influencers
    val result = getClusterInfo(cluster_id)
    println(result.get)
    val res: String = result.get.toString()
    // TODO ?
  }
  .. // TODO ?
})
Generally speaking, you shouldn't use foreach when you want to map something into something else; foreach is good for applying functions that only have side effects and return nothing.
In this case, if I got the details right (probably not), you can use a User-Defined Function (UDF) and explode the result:
import org.apache.spark.sql.functions._
import spark.implicits._
// I'm assuming we have these case classes (or similar)
case class Influencer(screenName: String)
case class ClusterInfo(cluster_id: String, influencers: Array[Influencer])
// I'm assuming this method is supplied - with your own implementation
def getClusterInfo(clusterId: String): ClusterInfo =
  ClusterInfo(clusterId, Array(Influencer(clusterId)))
// some sample data - assuming both columns are integers:
val df = Seq((222, 3), (333, 4)).toDF("Report_id", "Cluster_number")
// actual solution:
// UDF that returns an array of ClusterInfo;
// Array size is 'clusterNo', creates cluster id for each element and maps it to info
val clusterInfoUdf = udf { (clusterNo: Int, reportId: Int) =>
  (1 to clusterNo).map(v => s"${reportId}_$v").map(getClusterInfo)
}
// apply UDF to each record and explode - to create one record per array item
val result = df.select(explode(clusterInfoUdf($"Cluster_number", $"Report_id")))
result.printSchema()
// root
// |-- col: struct (nullable = true)
// | |-- cluster_id: string (nullable = true)
// | |-- influencers: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- screenName: string (nullable = true)
result.show(truncate = false)
// +-----------------------------+
// |col |
// +-----------------------------+
// |[222_1,WrappedArray([222_1])]|
// |[222_2,WrappedArray([222_2])]|
// |[222_3,WrappedArray([222_3])]|
// |[333_1,WrappedArray([333_1])]|
// |[333_2,WrappedArray([333_2])]|
// |[333_3,WrappedArray([333_3])]|
// |[333_4,WrappedArray([333_4])]|
// +-----------------------------+
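If you then want ordinary top-level columns instead of the single struct column, a short follow-up select flattens it (a sketch on the result above; "col" is the column name produced by explode, as shown in the printSchema output):
// flatten the struct produced by explode into two columns:
// cluster_id (string) and influencers (array of structs)
result
  .select($"col.cluster_id", $"col.influencers")
  .show(truncate = false)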

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in the filter_list, similar to how array_contains would function, but it's not working on a provided list of strings, only on a single value.
So the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I probably mix types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
  (recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
thanks!
Your code is almost right. All you have to do is replace List with Seq:
def filterItems(flist: List[String]) = udf {
  (recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
Reference: SQL Programming Guide - Data Types.
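Applied to the sample data above (a sketch, assuming the columns are named id and items as in the schema, and filter_list is the list defined in the question):
val filtered = data.withColumn("items", filterItems(filter_list)($"items"))
filtered.show(false)
// expected: [item1, item2] in the items column for both id1 and id2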

explode a row of spark dataset into several rows with added column using flatmap

I have a DataFrame with the following schema :
root
|-- journal: string (nullable = true)
|-- topicDistribution: vector (nullable = true)
The topicDistribution field is a vector of doubles: [0.1, 0.2, 0.15, ...]
What I want is to explode each row into several rows to obtain the following schema:
root
|-- journal: string
|-- topic-prob: double // this is the value from the vector
|-- topic-id : integer // this is the index of the value from the vector
To clarify, I've created a case class:
case class JournalDis(journal: String, topic_id: Integer, prob: Double)
I've managed to achieve this using dataset.explode in a very awkward way:
val df1 = df.explode("topicDistribution", "topic") {
  topics: DenseVector => topics.toArray.zipWithIndex
}.select("journal", "topic")
val df2 = df1
  .withColumn("topic_id", df1("topic").getItem("_2"))
  .withColumn("topic_prob", df1("topic").getItem("_1"))
  .drop(df1("topic"))
But dataset.explode is deprecated. I wonder how to achieve this using the flatMap method?
Not tested but should work:
import spark.implicits._
import org.apache.spark.ml.linalg.Vector

df.as[(String, Vector)].flatMap {
  case (j, ps) => ps.toArray.zipWithIndex.map {
    case (p, ti) => JournalDis(j, ti, p)
  }
}
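As a quick sanity check (a sketch, assuming the JournalDis case class above is in scope), the resulting Dataset carries the case-class field names, so no renaming is needed:
val exploded = df.as[(String, Vector)].flatMap {
  case (j, ps) => ps.toArray.zipWithIndex.map {
    case (p, ti) => JournalDis(j, ti, p)
  }
}
exploded.printSchema()
// expected, roughly:
// root
//  |-- journal: string (nullable = true)
//  |-- topic_id: integer (nullable = true)
//  |-- prob: double (nullable = false)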