Convert string value to map using Scala

I have a CSV file where one of the fields contains a map, as mentioned below:
"Map(12345 -> 45678, 23465 -> 9876)"
When I try to load the CSV into a dataframe, this field is read as a string.
So I have written a UDF to convert the string to a map, as below:
val convertToMap = udf((pMap: String) => {
  val mpp = pMap
  // "Map(12345 -> 45678, 23465 -> 9876)"
  val stg = mpp.substr(4, mpp.length() - 1)
  val stg1 = stg.split(regex = ",").toList
  val mp = stg1.map(_.split(regex = " ").toList)
  val mp1 = mp.map(mp => (mp(0), mp(2))).toMap
})
Now I need help applying the UDF to the column that is being read as a string, and returning the DF with the converted column.

You are pretty close, but it looks like your UDF mixes Scala and Python syntax, and the parsing logic needs a little work. There may be a better way to parse a map literal string, but this works with the provided example:
val convertToMap = udf { (pMap: String) =>
  val stg = pMap.substring(4, pMap.length() - 1)
  val stg1 = stg.split(",").toList.map(_.trim)
  val mp = stg1.map(_.split(" ").toList)
  mp.map(mp => (mp(0), mp(2))).toMap
}
val df = spark.createDataset(Seq("Map(12345 -> 45678, 23465 -> 9876)")).toDF("strMap")
With the corrected UDF, you simply invoke it with a .select() or a .withColumn():
df.select(convertToMap($"strMap").as("map")).show(false)
Which gives:
+----------------------------------+
|map |
+----------------------------------+
|Map(12345 -> 45678, 23465 -> 9876)|
+----------------------------------+
With the schema:
root
|-- map: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
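For completeness, the .withColumn() variant mentioned above works the same way (a minimal sketch that replaces the string column in place with the parsed map):
val parsed = df.withColumn("strMap", convertToMap($"strMap"))
parsed.printSchema()
// root
//  |-- strMap: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)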

Related

Scala transform and split String Column to MapType Column in Dataframe

I have a DataFrame with a column, containing tracking request urls with fields inside, that looks like this
df.show(truncate = false)
+----------------------------------------------------------------------------
| request_uri
+----------------------------------------------------------------------------
| /i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 ...
| /i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00 ...
| ...
I need to transform this column into something that looks like this:
df.show(truncate = false)
+----------------------------------------------------------------------------
| request_uri
+----------------------------------------------------------------------------
| (aid -> fptplay, ast -> 1582163970763, tz -> [timezone datatype], nt -> wifi, ...)
| (p -> fplay-ottbox-2019, av -> 2.0.18, ov -> 9, tv -> 1.0.0, ...)
| ...
Basically I have to split the field names (delimiter = "&" ) and their values into a MapType of some sort, and add that to the column.
Can someone give me pointers how to write a custom function to split the string column into a MapType column?
I'm told to use withColumn() and mapPartition but I don't know how to implement it in a way that will split the strings and cast them to MapType.
Any help, even minimal, would be heartily appreciated. I'm completely new to Scala and have been stuck on this for a week.
The solution is to use UserDefinedFunctions.
Let's take the problem one step at a time.
// We need a function which converts strings into maps
// based on the format of request uris
def requestUriToMap(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")
    (pair(0), pair(1)) // evaluate each element to a tuple
  }).toMap
}
// Now we convert this function into a UserDefinedFunction.
import org.apache.spark.sql.functions.{col, udf}
// Given a request uri string, convert it to a map; the correct format is assumed.
val requestUriToMapUdf = udf((requestUri: String) => requestUriToMap(requestUri))
Now, we test.
// Test data
val df = Seq(
  "/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349",
  "/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00"
).toDF("request_uri")
df.show(false)
//+-----------------------------------------------------------------------+
//|request_uri |
//+-----------------------------------------------------------------------+
//|/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 |
//|/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00|
//+-----------------------------------------------------------------------+
// Now we execute our UDF to create a column; using the same name replaces the original column.
val mappedDf = df.withColumn("request_uri", requestUriToMapUdf(col("request_uri")))
mappedDf.show(false)
//+---------------------------------------------------------------------------------------------+
//|request_uri |
//+---------------------------------------------------------------------------------------------+
//|[aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349] |
//|[av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi]|
//+---------------------------------------------------------------------------------------------+
mappedDf.printSchema
//root
// |-- request_uri: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)
mappedDf.schema
//org.apache.spark.sql.types.StructType = StructType(StructField(request_uri,MapType(StringType,StringType,true),true))
And that's what you wanted.
Alternative: If you're not sure whether your string conforms, you can try a variation of the function that succeeds even if the string doesn't conform to the assumed format (it doesn't contain = or the input is an empty String).
def requestUriToMapImproved(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "")           // the split produced no elements, e.g. "=".split("=") == Array()
      case 1 => (pair(0), "")      // the element contains no "=", e.g. "potato".split("=") == Array("potato")
      case _ => (pair(0), pair(1)) // normal case, e.g. "potato=masher".split("=") == Array("potato", "masher")
    }
  }).toMap
}
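To actually use this safer variant, wrap it in a udf the same way as before; a short sketch following the pattern above (the val names here are just placeholders):
val requestUriToMapImprovedUdf = udf((requestUri: String) => requestUriToMapImproved(requestUri))
val mappedSafeDf = df.withColumn("request_uri", requestUriToMapImprovedUdf(col("request_uri")))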
The following code performs a two-phase split. Because the uris don't have a specific structure, you can do it with a UDF like this:
import scala.collection.mutable

val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")

def convertToMap(keys: List[String]) = udf { (in: mutable.WrappedArray[String]) =>
  in.foldLeft[Map[String, String]](Map()) { (a, str) =>
    keys.flatMap { key =>
      val regex = s"""${key}="""
      val arr = str.split(regex)
      val value = if (arr.length == 2) arr(1) else ""
      if (!value.isEmpty) a + (key -> value) else a
    }.toMap
  }
}
df.withColumn("_tmp",
split($"request_uri","""((&)|(\?))"""))
.withColumn("map_result", convertToMap(keys)($"_tmp"))
.select($"map_result")
.show(false)
it gives a MapType column:
+------------------------------------------------------------------------------------------------+
|map_result |
+------------------------------------------------------------------------------------------------+
|Map(aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349) |
|Map(av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi)|
+------------------------------------------------------------------------------------------------+
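If your Spark version ships the built-in str_to_map SQL function, the UDF may not be needed at all. A sketch under that assumption, mirroring the column and prefix handling used above (viaStrToMap and clean are just assumed names):
import org.apache.spark.sql.functions.{expr, regexp_replace}
val viaStrToMap = df
  .withColumn("clean", regexp_replace($"request_uri", "^/i\\?", ""))
  .withColumn("request_map", expr("str_to_map(clean, '&', '=')"))
  .drop("clean")
// request_map should come back as map<string,string>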

How to fix: java.lang.StringIndexOutOfBoundsException from Spark UDF function

I have the following dataframe, let's call it DF1:
root
|-- VARIANTS: string (nullable = true)
|-- VARIANT_ID: long (nullable = false)
|-- CASE_ID: string (nullable = true)
|-- APP_ID: integer (nullable = false)
Where VARIANTS (string) looks like:
Activity_1,Activity_2,Activity_2,Activity_3,Activity_5...
I am trying to get a new column, Variants_stats, which per row looks like:
Activity_1:1, Activity_2:2, Activity_3:1, Activity_5:1
The approach I have taken so far is:
1) Create a UDF:
val countActivityFrequences = udf((value: String) => value.split(",").map(_.trim).groupBy(identity).mapValues(_.length).map{case (k, v) => k + ":" + v}.mkString(","))
val dfNew = df1.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
It seems to be OK (at least Spark doesn't complain), until I try any SQL or a dfNew.show(false) call, which always gives me back:
java.lang.StringIndexOutOfBoundsException: String index out of range: -84
at java.lang.String.substring(String.java:1931)
at java.lang.Class.getSimpleBinaryName(Class.java:1448)
at java.lang.Class.getSimpleName(Class.java:1309)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1055)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1054)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1006)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I can't figure out what I am doing wrong here.
I am using Spark 2.1+.
To reproduce :
val items = List(
"A_001,A_002,A_010,A_0200,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_074",
"A_001,A_002,A_010,A_0201,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_073")
val df = sc.parallelize(items).toDF("VARIANTS")
df.show(false)
df.printSchema
// create UDF function
val countActivityFrequences = udf((value: String) => value.split(",").map(_.trim).groupBy(identity).mapValues(_.length).map{case (k, v) => k + ":" + v}.mkString(","))
// Apply UDF against our little DF
var dfNew = df.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
dfNew.printSchema
// Error thrown: (either Malformed class name, or java.lang.StringIndexOutOfBoundsException)
dfNew.show(false)
Update:
The issue was only appearing in our AWS EMR environment, under Zeppelin.
Restarting the interpreter made it work.
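As a side note, the counting logic itself can be sanity-checked outside Spark with plain Scala (a quick sketch on a made-up sample string), which is consistent with the conclusion that the exception came from the environment rather than from the UDF body:
val sample = "A_001,A_002,A_002,A_010"
val counted = sample.split(",").map(_.trim)
  .groupBy(identity).mapValues(_.length)
  .map { case (k, v) => s"$k:$v" }.mkString(",")
// e.g. "A_010:1,A_001:1,A_002:2" (groupBy does not guarantee ordering)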

Apply a function to each row of a Spark DataFrame

I have a Spark DataFrame (df) with 2 columns (Report_id and Cluster_number).
I want to apply a function (getClusterInfo) to df which will return the name for each cluster, i.e. if the cluster number is '3' then, for a specific report_id, the 3 rows mentioned below will be written:
{"cluster_id":"1","influencers":[{"screenName":"A"},{"screenName":"B"},{"screenName":"C"},...]}
{"cluster_id":"2","influencers":[{"screenName":"D"},{"screenName":"E"},{"screenName":"F"},...]}
{"cluster_id":"3","influencers":[{"screenName":"G"},{"screenName":"H"},{"screenName":"E"},...]}
I am using foreach on df to apply the getClusterInfo function, but can't figure out how to convert the output to a DataFrame (Report_id, Array[cluster_info]).
Here is the code snippet:
df.foreach(row => {
  val report_id = row(0)
  val cluster_no = row(1).toString
  val cluster_numbers = new Range(0, cluster_no.toInt - 1, 1)
  for (cluster <- cluster_numbers.by(1)) {
    val cluster_id = report_id + "_" + cluster
    // get cluster influencers
    val result = getClusterInfo(cluster_id)
    println(result.get)
    val res: String = result.get.toString()
    // TODO ?
  }
  .. // TODO ?
})
Generally speaking, you shouldn't use foreach when you want to map something into something else; foreach is good for applying functions that only have side-effects and return nothing.
In this case, if I got the details right (probably not), you can use a User-Defined Function (UDF) and explode the result:
import org.apache.spark.sql.functions._
import spark.implicits._
// I'm assuming we have these case classes (or similar)
case class Influencer(screenName: String)
case class ClusterInfo(cluster_id: String, influencers: Array[Influencer])
// I'm assuming this method is supplied - with your own implementation
def getClusterInfo(clusterId: String): ClusterInfo =
ClusterInfo(clusterId, Array(Influencer(clusterId)))
// some sample data - assuming both columns are integers:
val df = Seq((222, 3), (333, 4)).toDF("Report_id", "Cluster_number")
// actual solution:
// UDF that returns an array of ClusterInfo;
// Array size is 'clusterNo', creates cluster id for each element and maps it to info
val clusterInfoUdf = udf { (clusterNo: Int, reportId: Int) =>
  (1 to clusterNo).map(v => s"${reportId}_$v").map(getClusterInfo)
}
// apply UDF to each record and explode - to create one record per array item
val result = df.select(explode(clusterInfoUdf($"Cluster_number", $"Report_id")))
result.printSchema()
// root
// |-- col: struct (nullable = true)
// | |-- cluster_id: string (nullable = true)
// | |-- influencers: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- screenName: string (nullable = true)
result.show(truncate = false)
// +-----------------------------+
// |col |
// +-----------------------------+
// |[222_1,WrappedArray([222_1])]|
// |[222_2,WrappedArray([222_2])]|
// |[222_3,WrappedArray([222_3])]|
// |[333_1,WrappedArray([333_1])]|
// |[333_2,WrappedArray([333_2])]|
// |[333_3,WrappedArray([333_3])]|
// |[333_4,WrappedArray([333_4])]|
// +-----------------------------+
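If you also want to keep Report_id next to each exploded cluster (the (Report_id, Array[cluster_info]) shape mentioned in the question), a small variation of the select above does it, reusing the same UDF:
// one row per cluster, with its report id kept alongside
val withId = df.select($"Report_id", explode(clusterInfoUdf($"Cluster_number", $"Report_id")).as("cluster_info"))
// or keep the whole array per report instead of exploding
val asArray = df.withColumn("cluster_info", clusterInfoUdf($"Cluster_number", $"Report_id"))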

How to create a Row from a given case class?

Imagine that you have the following case classes:
case class B(key: String, value: Int)
case class A(name: String, data: B)
Given an instance of A, how do I create a Spark Row? e.g.
val a = A("a", B("b", 0))
val row = ???
NOTE: Given row, I need to be able to get the data with:
val name: String = row.getAs[String]("name")
val b: Row = row.getAs[Row]("data")
The following seems to match what you're looking for.
scala> spark.version
res0: String = 2.3.0
scala> val a = A("a", B("b", 0))
a: A = A(a,B(b,0))
import org.apache.spark.sql.Encoders
val schema = Encoders.product[A].schema
scala> schema.printTreeString
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: integer (nullable = false)
val values = a.productIterator.toSeq.toArray
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row: Row = new GenericRowWithSchema(values, schema)
scala> val name: String = row.getAs[String]("name")
name: String = a
// the following won't work since B =!= Row
scala> val b: Row = row.getAs[Row]("data")
java.lang.ClassCastException: B cannot be cast to org.apache.spark.sql.Row
... 55 elided
Very short, but probably not the fastest, as it first creates a DataFrame and then collects it again:
import session.implicits._
val row = Seq(a).toDF().first()
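Because the collected Row carries the DataFrame's schema, the name-based access from the question works on it (a quick check):
val name: String = row.getAs[String]("name") // "a"
val b: Row = row.getAs[Row]("data")          // the nested struct comes back as a Row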
@Jacek Laskowski's answer is great!
To complete it, here is some syntactic sugar:
val row = Row(a.productIterator.toSeq: _*)
And a recursive method if you happen to have nested case classes:
def productToRow(product: Product): Row = {
  val sequence = product.productIterator.toSeq.map {
    case product: Product => productToRow(product)
    case e => e
  }
  Row(sequence: _*)
}
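One usage note on this sugar (and on Row(a.productIterator.toSeq: _*) above): the resulting Row carries no schema, so row.getAs[String]("name") from the question will throw; only positional access works. A small sketch:
val a = A("a", B("b", 0))
val row = productToRow(a)      // Row("a", Row("b", 0))
val name = row.getString(0)    // "a"
val b: Row = row.getAs[Row](1) // nested Row by position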
I don't think there exists a public API that can do it directly. Internally, Spark uses the Encoder.toRow method to convert objects to org.apache.spark.sql.catalyst.expressions.UnsafeRow, but this method is private. You could try to:
Obtain Encoder for the class:
val enc: Encoder[A] = ExpressionEncoder()
Use reflection to access toRow method and set it to accessible.
Call it to convert object to UnsafeRow.
Obtain RowEncoder for the expected schema (enc.schema).
Convert UnsafeRow to Row.
I haven't tried this, so I cannot guarantee whether it will work.

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in the filter_list, similar to how array_contains would function, but it's not working on a provided list of strings, only on a single value.
So the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I probably mix types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
(recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2
thanks!
Your code is almost right. All you have to do is replace List with Seq:
def filterItems(flist: List[String]) = udf {
(recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
Reference SQL Programming Guide - Data Types.
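For reference, applying the corrected UDF looks like this (a minimal sketch; the id/items column names are assumed, since createDataFrame on the raw tuples would otherwise produce _1/_2):
val data = spark.createDataFrame(rawData).toDF("id", "items")
val filtered = data.withColumn("items", filterItems(filter_list)($"items"))
filtered.show(false)
// +---+--------------+
// |id |items         |
// +---+--------------+
// |id1|[item1, item2]|
// |id2|[item1, item2]|
// +---+--------------+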